Global-Affine and Local-Specific Generative Adversarial Network for semantic-guided image generation

Abstract

The recent progress in learning image feature representations has opened the way for tasks such as label-to-image or text-to-image synthesis. However, one challenge widely observed in existing methods is the difficulty of synthesizing fine-grained textures and small-scale instances. In this paper, we propose a novel Global-Affine and Local-Specific Generative Adversarial Network (GALS-GAN) that explicitly constructs global semantic layouts and learns distinct instance-level features. To achieve this, we adopt a graph convolutional network to infer instance locations and spatial relationships from scene graphs, which allows our model to obtain high-fidelity semantic layouts. In addition, a local-specific generator, in which we introduce a feature filtering mechanism to learn separate semantic maps for different categories, is used to disentangle and generate category-specific visual features. Moreover, we apply a weight map predictor to better combine the global and local pathways, since these two generation sub-networks are highly complementary. Extensive experiments on the COCO-Stuff and Visual Genome datasets demonstrate the superior generation performance of our model over previous methods; our approach captures more photo-realistic local characteristics and renders small-sized entities in greater detail.
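
As a concrete illustration of the layout-construction step described in the abstract, the following is a minimal sketch of a single scene-graph convolution layer in the spirit of Figure 2: object and predicate embeddings are updated along (subject, predicate, object) triples, and candidate vectors pooled at the same object node are averaged. The module name, dimensions, and tensor layout here are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of one scene-graph convolution layer (cf. Figure 2).
# Names, dimensions, and pooling choices are assumptions for illustration.
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.dim = dim
        # Maps a concatenated (subject, predicate, object) triple to new
        # candidate vectors for the subject, the predicate, and the object.
        self.g = nn.Sequential(nn.Linear(3 * dim, 3 * dim), nn.ReLU())

    def forward(self, obj_vecs, pred_vecs, edges):
        # obj_vecs:  (O, dim) object embeddings
        # pred_vecs: (T, dim) predicate embeddings
        # edges:     (T, 2)   (subject index, object index) for each triple
        s_idx, o_idx = edges[:, 0], edges[:, 1]
        triples = torch.cat([obj_vecs[s_idx], pred_vecs, obj_vecs[o_idx]], dim=1)
        new_s, new_p, new_o = self.g(triples).split(self.dim, dim=1)

        # Average all candidate vectors that refer to the same object node.
        pooled = torch.zeros_like(obj_vecs)
        counts = torch.zeros(obj_vecs.size(0), 1, device=obj_vecs.device)
        pooled.index_add_(0, s_idx, new_s)
        pooled.index_add_(0, o_idx, new_o)
        ones = torch.ones(s_idx.size(0), 1, device=obj_vecs.device)
        counts.index_add_(0, s_idx, ones)
        counts.index_add_(0, o_idx, ones)
        return pooled / counts.clamp(min=1), new_p
```

Stacking several such layers and decoding each refined object embedding into a bounding box and mask would yield the semantic layout that conditions the global and local generation pathways.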


Figure 1.  Overview of the proposed GALS-GAN

    Figure 2.  Illustration of a single graph convolution layer

    Figure 3.  Architecture of the MLP

Figure 4.  Inference process of the mask predictor

    Figure 5.  Architecture of the local-specific generator

    Figure 6.  Architecture of the multi-scale discriminators

    Figure 7.  Images generated by different level generators

    Figure 8.  Qualitative examples generated by our GALS-GAN based on the COCO-Stuff dataset

    Figure 9.  Qualitative examples generated by our GALS-GAN based on the Visual Genome dataset

    Figure 10.  Qualitative comparison of different models

    Figure 11.  An example of manipulating the synthesized image

    Figure 12.  Example results of different image manipulation types

    Figure 13.  Ablation study of the global-affine generator

    Figure 14.  Ablation study of the local-specific generator

    Table 1.  Statistics of COCO-Stuff and Visual Genome datasets

Dataset | Train | Val | Test | Categories | Max objects per image | Min objects per image
COCO-Stuff | 74121 | 1024 | 2048 | 171 | 8 | 3
Visual Genome | 62565 | 5506 | 5088 | 178 | 30 | 3

    Table 2.  Quantitative comparison of images generated by different methods on the COCO-Stuff dataset

Methods | IS $\uparrow$ (64$\times$64) | IS $\uparrow$ (128$\times$128) | FID $\downarrow$ (64$\times$64) | FID $\downarrow$ (128$\times$128)
sg2im [10] | 6.7 $\pm$ 0.1 | 5.99 $\pm$ 0.27 | 67.99 | 95.18
stacking-GANs [36] | 9.1 $\pm$ 0.20 | 12.01 $\pm$ 0.40 | 50.94 | 39.78
PasteGAN [19] | 9.2 $\pm$ 0.32 | - | 42.30 | -
PasteGAN (GT layout) [19] | 10.20 $\pm$ 0.20 | - | 34.30 | -
ours | 9.85 $\pm$ 0.15 | 13.82 $\pm$ 0.30 | 38.29 | 29.62

Table 3.  Quantitative comparison of images generated by different methods on the Visual Genome dataset

Methods | IS $\uparrow$ (64$\times$64) | IS $\uparrow$ (128$\times$128) | FID $\downarrow$ (64$\times$64) | FID $\downarrow$ (128$\times$128)
sg2im [10] | 5.5 $\pm$ 0.10 | 4.78 $\pm$ 0.15 | 73.79 | 70.40
stacking-GANs [36] | 6.90 $\pm$ 0.20 | 9.24 $\pm$ 0.41 | 59.53 | 50.19
PasteGAN [19] | 7.97 $\pm$ 0.30 | - | 58.37 | -
PasteGAN (GT layout) [19] | 9.15 $\pm$ 0.20 | - | 34.91 | -
ours | 8.87 $\pm$ 0.15 | 11.20 $\pm$ 0.55 | 39.25 | 29.94
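
For context, the IS and FID columns in Tables 2 and 3 are the standard Inception Score (higher is better) and Fréchet Inception Distance (lower is better), both computed with a pretrained Inception network over generated images; FID compares the means and covariances of Inception features extracted from real and generated images. A minimal sketch of the Inception Score from softmax class probabilities follows; the number of splits and the averaging convention are common practice and may differ from the exact evaluation protocol used for these tables.

```python
# Hypothetical sketch of the Inception Score from pre-computed softmax
# class probabilities p(y|x); the 10-split convention is an assumption.
import numpy as np

def inception_score(probs, splits=10):
    # probs: (N, num_classes) softmax outputs of an Inception classifier
    #        evaluated on N generated images.
    scores = []
    for chunk in np.array_split(probs, splits):
        p_y = chunk.mean(axis=0, keepdims=True)                      # marginal p(y)
        kl = chunk * (np.log(chunk + 1e-12) - np.log(p_y + 1e-12))   # KL(p(y|x) || p(y))
        scores.append(np.exp(kl.sum(axis=1).mean()))                 # exp(E_x[KL])
    return float(np.mean(scores)), float(np.std(scores))
```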

    Table 4.  Comparison of classification accuracy

Methods | COCO-Stuff (64$\times$64) | COCO-Stuff (128$\times$128) | Visual Genome (64$\times$64) | Visual Genome (128$\times$128)
sg2im [10] | 28.8 | 24.1 | 26.7 | 23.4
stacking-GANs [36] | 33.9 | 31.2 | 32.7 | 30.3
PasteGAN [19] | 40.3 | - | 38.7 | -
ours | 46.1 | 44.6 | 45.4 | 43.5
All entries are the Classification Accuracy Score.

    Table 5.  Quantitative comparison of predicted semantic layouts

Methods | R@0.3 COCO-Stuff | R@0.3 Visual Genome | R@0.5 COCO-Stuff | R@0.5 Visual Genome
sg2im [10] | 52.4 | 21.9 | 32.2 | 10.6
stacking-GANs [36] | 65.3 | 35.0 | 49.1 | 23.2
PasteGAN [19] | 71.2 | 45.2 | 62.4 | 33.8
ours | 80.7 | 48.4 | 66.2 | 36.5
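
The R@0.3 and R@0.5 columns in Table 5 are read here as layout recall: the fraction of ground-truth boxes whose predicted box overlaps them with an IoU of at least 0.3 or 0.5. The sketch below computes this under the simplifying assumption that predicted and ground-truth boxes are already paired per object instance; the matching protocol of the compared methods may differ.

```python
# Hypothetical sketch of recall at an IoU threshold (R@t) for predicted
# layouts; boxes are (x0, y0, x1, y1) and are assumed paired per object.
def box_iou(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def recall_at(pred_boxes, gt_boxes, threshold=0.5):
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / max(len(gt_boxes), 1)
```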

Table 6.  Ablation study of different GALS-GAN architectures

Architectures | IS $\uparrow$ | FID $\downarrow$
w/o $G_{g-a}$ | 7.52 $\pm$ 0.40 | 78.94
w/o $G_{l-s}$ | 11.30 $\pm$ 0.12 | 46.83
full model | 13.82 $\pm$ 0.30 | 29.62
  • [1] H. Caesar, J. Uijlings and V. Ferrari, COCO-Stuff: Thing and stuff classes in context, IEEE Conference on Computer Vision and Pattern Recognition, (2018), 1209–1218. doi: 10.1109/CVPR.2018.00132.
    [2] W. L. Chen and J. Hays, Sketchygan: Towards diverse and realistic sketch to image synthesis, IEEE Conference on Computer Vision and Pattern Recognition, (2018), 9416–9425. doi: 10.1109/CVPR.2018.00981.
    [3] B. Chen, T. Liu, K. Liu, H. Liu and S. Pei, Image Super-Resolution Using Complex Dense Block on Generative Adversarial Networks, IEEE International Conference on Image Processing, (2019), 2866–2870. doi: 10.1109/ICIP.2019.8803711.
    [4] Y. Choi, M. Choi, M. Kim, J. M. Ha, S. H. Kim and J. Choo, Stargan: Unified generative adversarial networks for multi-domain image-to-image translation, IEEE Conference on Computer Vision and Pattern Recognition, (2018), 8789–8797. doi: 10.1109/CVPR.2018.00916.
    [5] Y. Choi, Y. Uh, J. Yoo and J. W. Ha, StarGAN v2: Diverse image synthesis for multiple domains, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 8185–8194. doi: 10.1109/CVPR42600.2020.00821.
    [6] H. Dhamo, A. Farshad, I. Laina, N. Navab, G. D. Hager, F. Tombari and C. Rupprecht, Semantic image manipulation using scene graphs, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 5212–5221. doi: 10.1109/CVPR42600.2020.00526.
    [7] C. Gao, Q. Liu, Q. Xu, L. Wang, J. Liu and C. Zou, SketchyCOCO: Image generation from freehand scene sketches, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 5173–5182. doi: 10.1109/CVPR42600.2020.00522.
    [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems, (2014), 2672–2680.
    [9] S. Hong, D. Yang, J. Choi and H. Lee, Inferring semantic layout for hierarchical text-to-image synthesis, IEEE Conference on Computer Vision and Pattern Recognition, (2018), 7986–7994. doi: 10.1109/CVPR.2018.00833.
    [10] J. Johnson, A. Gupta and F. F. Li, Image generation from scene graphs, IEEE Conference on Computer Vision and Pattern Recognition, (2018), 1219–1228. doi: 10.1109/CVPR.2018.00133.
    [11] T. Kaneko, Y. Ushiku and T. Harada, Label-noise robust generative adversarial networks, IEEE Conference on Computer Vision and Pattern Recognition, (2019), 2462–2471. doi: 10.1109/CVPR.2019.00257.
    [12] S. W. Kim, Y. Zhou, J. Philion, A. Torralba and S. Fidler, Learning to Simulate Dynamic Environments With GameGAN, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 1228–1237. doi: 10.1109/CVPR42600.2020.00131.
[13] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015.
    [14] T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, preprint, arXiv: 1609.02907.
    [15] R. Krishna, et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International Journal of Computer Vision, 123 (2017), 32-73.  doi: 10.1007/s11263-016-0981-7.
[16] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar and C. L. Zitnick, Microsoft COCO: Common objects in context, European Conference on Computer Vision, 8693 (2014), 740-755. doi: 10.1007/978-3-319-10602-1_48.
    [17] M. Li, H. Huang, L. Ma, W. Liu, T. Zhang and Y. Jiang, Unsupervised image-to-image translation with stacked cycle-consistent adversarial networks, European Conference on Computer Vision, (2018), 186–201. doi: 10.1007/978-3-030-01240-3_12.
    [18] W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu and J. Gao, Object-driven text-to-image synthesis via adversarial training, IEEE Conference on Computer Vision and Pattern Recognition, (2019), 12166–12174. doi: 10.1109/CVPR.2019.01245.
    [19] Y. Li, T. Ma, Y. Bai, N. Duan, S. Wei, and X. Wang, Pastegan: A semi-parametric method to generate image from scene graph, Advances in Neural Information Processing Systems, 2019.
    [20] B. Li, B. Zhuang, M. Li and J. Gu, Seq-SG2SL: Inferring semantic layout from scene graph through sequence to sequence learning, IEEE International Conference on Computer Vision, (2019), 7434–7442. doi: 10.1109/ICCV.2019.00753.
    [21] S. Liu, T. Wang, D. Bau, J. Y. Zhu and A. Torralba, Diverse Image Generation via Self-Conditioned GANs, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 14274–14283. doi: 10.1109/CVPR42600.2020.01429.
    [22] S. Nam, Y. Kim and S. J. Kim, Text-adaptive generative adversarial networks: Manipulating images with natural language, Advances in Neural Information Processing Systems, (2018), 42–51.
[23] J. C. Ni, S. S. Zhang, Z. L. Zhou, J. Hou and F. Gao, Instance Mask Embedding and Attribute-Adaptive Generative Adversarial Network for Text-to-Image Synthesis, IEEE Access, 8 (2020), 37697-37711. doi: 10.1109/ACCESS.2020.2975841.
    [24] T. Park, M. Y. Liu, T. C. Wang and J. Y. Zhu, Semantic image synthesis with spatially-adaptive normalization, IEEE Conference on Computer Vision and Pattern Recognition, (2019), 2332–2341. doi: 10.1109/CVPR.2019.00244.
    [25] T. Qiao, J. Zhang, D. Xu, and D. Tao, Mirrorgan: Learning text-to-image generation by redescription, IEEE Conference on Computer Vision and Pattern Recognition, (2019), 1505–1514.
    [26] S. Ravuri and O. Vinyals, Classification accuracy score for conditional generative models, preprint, arXiv: 1905.10887.
[27] S. Ren, K. He, R. Girshick and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39 (2016), 1137-1149. doi: 10.1109/TPAMI.2016.2577031.
    [28] S. Sah, D. Peri, A. Shringi, C. Zhang, M. Dominguez, A. Savakis and R. Ptucha, Semantically invariant text-to-image generation, IEEE International Conference on Image Processing, (2018), 3783–3787. doi: 10.1109/ICIP.2018.8451656.
    [29] Y. Shen, J. Gu, X. Tang and B. Zhou, Interpreting the Latent space of GANs for semantic face editing, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 9240–9249. doi: 10.1109/CVPR42600.2020.00926.
    [30] T. R. Shaham, T. Dekel and T. Michaeli, SinGAN: Learning a generative model from a single natural image, IEEE International Conference on Computer Vision, (2019), 4569–4579. doi: 10.1109/ICCV.2019.00467.
    [31] W. Sun and T. F. Wu, Learning Layout and Style Reconfigurable GANs for Controllable Image Synthesis, preprint, arXiv: 2003.11571.
    [32] T. Sylvain, P. C. Zhang, Y. Bengio, R. D. Hjelm and S. Sharma, Object-centric image generation from layouts, preprint, arXiv: 2003.07449.
    [33] C. Szegedy, et al., Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition, (2015), 1–9. doi: 10.1109/CVPR.2015.7298594.
[34] H. Tang, H. Liu and N. Sebe, Unified generative adversarial networks for controllable image-to-image translation, IEEE Transactions on Image Processing, 29 (2020), 8916-8929. doi: 10.1109/TIP.2020.3021789.
    [35] N. N. Vo and J. Hays, Localizing and orienting street views using overhead imagery, European Conference on Computer Vision, (2016), 494–509. doi: 10.1007/978-3-319-46448-0_30.
    [36] D. M. Vo and A. Sugimoto, Visual-relation conscious image generation from structured-text, preprint, arXiv: 1908.01741.
    [37] H. Yu, Y. Huang, L. Pi and L. Wang, Recurrent deconvolutional generative adversarial networks with application to video generation, Pattern Recognition and Computer Vision, (2019), 18–28. doi: 10.1007/978-3-030-31723-2_2.
    [38] L. Z. Zhang, J. C. Wang, Y. S. Xu, J. Min, T. Wen, J. C. Gee and J. B. Shi, Nested Scale-Editing for Conditional Image Synthesis, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 5476–5486. doi: 10.1109/CVPR42600.2020.00552.