Recent progress in learning image feature representations has opened the way for tasks such as label-to-image and text-to-image synthesis. However, one challenge widely observed in existing methods is the difficulty of synthesizing fine-grained textures and small-scale instances. In this paper, we propose a novel Global-Affine and Local-Specific Generative Adversarial Network (GALS-GAN) that explicitly constructs global semantic layouts and learns distinct instance-level features. To achieve this, we adopt a graph convolutional network to infer instance locations and spatial relationships from scene graphs, which allows our model to obtain high-fidelity semantic layouts. In addition, a local-specific generator, in which we introduce a feature filtering mechanism to separately learn semantic maps for different categories, is used to disentangle and generate instance-specific visual features. Moreover, we apply a weight map predictor to better combine the global and local pathways, since these two generation sub-networks are highly complementary. Extensive experiments on the COCO-Stuff and Visual Genome datasets demonstrate that our model outperforms previous methods and is better able to capture photo-realistic local characteristics and render small-sized entities in greater detail.
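To make the two pathways concrete, the sketch below illustrates, in PyTorch, a single graph convolution step over (subject, predicate, object) triples and a weight-map fusion of the global and local feature maps. This is a minimal sketch under our own assumptions: the module names (`TripleGCNLayer`, `WeightMapFusion`), layer sizes, and pooling choices are hypothetical and do not reproduce the actual GALS-GAN implementation.

```python
import torch
import torch.nn as nn


class TripleGCNLayer(nn.Module):
    """A single graph convolution step over (subject, predicate, object)
    triples, in the spirit of scene-graph-to-layout models. Hypothetical
    sketch, not the authors' implementation."""

    def __init__(self, dim):
        super().__init__()
        # Map the concatenated (subject, predicate, object) embeddings to
        # updated embeddings for each of the three roles.
        self.msg = nn.Sequential(nn.Linear(3 * dim, 3 * dim), nn.ReLU())

    def forward(self, obj_vecs, pred_vecs, edges):
        # obj_vecs: (num_objects, dim); pred_vecs: (num_triples, dim)
        # edges: (num_triples, 2) long tensor of (subject, object) indices
        s, o = edges[:, 0], edges[:, 1]
        triples = torch.cat([obj_vecs[s], pred_vecs, obj_vecs[o]], dim=1)
        new_s, new_p, new_o = self.msg(triples).chunk(3, dim=1)

        # Average all messages arriving at each object node.
        pooled = torch.zeros_like(obj_vecs).index_add(0, s, new_s).index_add(0, o, new_o)
        counts = torch.zeros(obj_vecs.size(0), 1, device=obj_vecs.device)
        counts = counts.index_add(0, s, torch.ones_like(counts[s]))
        counts = counts.index_add(0, o, torch.ones_like(counts[o]))
        return pooled / counts.clamp(min=1.0), new_p


class WeightMapFusion(nn.Module):
    """Predicts a per-pixel weight map and blends the global-affine and
    local-specific feature maps; a guess at how the two complementary
    pathways could be combined."""

    def __init__(self, channels):
        super().__init__()
        self.predict = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, global_feat, local_feat):
        # global_feat, local_feat: (batch, channels, H, W)
        w = self.predict(torch.cat([global_feat, local_feat], dim=1))
        return w * global_feat + (1.0 - w) * local_feat
```

In the full model, several such graph convolution layers would be stacked to refine instance embeddings before predicting the semantic layout, and the fused feature map would be decoded into the final image; the snippet only fixes the ideas.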
Table 1. Statistics of the COCO-Stuff and Visual Genome datasets
Dataset | Train | Val | Test | Categories | Max objects per image | Min objects per image
COCO-Stuff | 74121 | 1024 | 2048 | 171 | 8 | 3
Visual Genome | 62565 | 5506 | 5088 | 178 | 30 | 3
Table 2. Quantitative comparison of images generated by different methods on the COCO-Stuff dataset
Methods | IS (64×64) | IS (128×128) | FID (64×64) | FID (128×128)
sg2im [10] | 6.7 | 5.99 | 67.99 | 95.18
stacking-GANs [36] | 9.1 | 12.01 | 50.94 | 39.78
PasteGAN [19] | 9.2 | - | 42.30 | -
PasteGAN (GT layout) [19] | 10.20 | - | 34.30 | -
ours | 9.85 | 13.82 | 38.29 | 29.62
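For reference, the IS column reports the Inception Score (higher is better), which is typically computed from the class posteriors of a pretrained Inception-v3 classifier on the generated images. A minimal sketch of the standard formula, assuming a `(num_images, num_classes)` array of softmax predictions (this is not the authors' evaluation code), is:

```python
import numpy as np

def inception_score(probs, num_splits=10, eps=1e-16):
    """Standard Inception Score: exp(E_x[KL(p(y|x) || p(y))]), averaged
    over splits. `probs` holds softmax outputs of shape
    (num_images, num_classes) from a pretrained classifier."""
    scores = []
    for split in np.array_split(probs, num_splits):
        marginal = split.mean(axis=0, keepdims=True)               # p(y)
        kl = split * (np.log(split + eps) - np.log(marginal + eps))
        scores.append(np.exp(kl.sum(axis=1).mean()))
    return float(np.mean(scores)), float(np.std(scores))
```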
Table 3. Quantitative comparison of images generated by different methods on the Visual Genome dataset
Methods | IS (64×64) | IS (128×128) | FID (64×64) | FID (128×128)
sg2im [10] | 5.5 | 4.78 | 73.79 | 70.40
stacking-GANs [36] | 6.90 | 9.24 | 59.53 | 50.19
PasteGAN [19] | 7.97 | - | 58.37 | -
PasteGAN (GT layout) [19] | 9.15 | - | 34.91 | -
ours | 8.87 | 11.20 | 39.25 | 29.94
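Similarly, the FID column in Tables 2 and 3 is the Fréchet Inception Distance between the feature statistics of real and generated images (lower is better). A minimal sketch, assuming pre-extracted Inception activations of shape `(num_images, feature_dim)` and again not the authors' evaluation code:

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(act_real, act_fake):
    """FID between two sets of Inception activations (e.g. 2048-d pooling
    features), each of shape (num_images, feature_dim)."""
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    cov_r = np.cov(act_real, rowvar=False)
    cov_f = np.cov(act_fake, rowvar=False)
    # Matrix square root of the covariance product; drop the tiny imaginary
    # part that can appear from numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```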
Table 4. Comparison of classification accuracy
Table 5. Quantitative comparison of predicted semantic layouts
Table 6. Ablation study of different GALS-GAN architectures
Architectures | IS | FID
w/o | 7.52 | 78.94
w/o | 11.30 | 46.83
full model | 13.82 | 29.62
[1] H. Caesar, J. Uijlings and V. Ferrari, COCO-Stuff: Thing and stuff classes in context, IEEE Conference on Computer Vision and Pattern Recognition, (2018), 1209–1218. doi: 10.1109/CVPR.2018.00132.
[2] W. L. Chen and J. Hays, SketchyGAN: Towards diverse and realistic sketch to image synthesis, IEEE Conference on Computer Vision and Pattern Recognition, (2018), 9416–9425. doi: 10.1109/CVPR.2018.00981.
[3] B. Chen, T. Liu, K. Liu, H. Liu and S. Pei, Image super-resolution using complex dense block on generative adversarial networks, IEEE International Conference on Image Processing, (2019), 2866–2870. doi: 10.1109/ICIP.2019.8803711.
[4] Y. Choi, M. Choi, M. Kim, J. M. Ha, S. H. Kim and J. Choo, StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation, IEEE Conference on Computer Vision and Pattern Recognition, (2018), 8789–8797. doi: 10.1109/CVPR.2018.00916.
[5] Y. Choi, Y. Uh, J. Yoo and J. W. Ha, StarGAN v2: Diverse image synthesis for multiple domains, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 8185–8194. doi: 10.1109/CVPR42600.2020.00821.
[6] H. Dhamo, A. Farshad, I. Laina, N. Navab, G. D. Hager, F. Tombari and C. Rupprecht, Semantic image manipulation using scene graphs, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 5212–5221. doi: 10.1109/CVPR42600.2020.00526.
[7] C. Gao, Q. Liu, Q. Xu, L. Wang, J. Liu and C. Zou, SketchyCOCO: Image generation from freehand scene sketches, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 5173–5182. doi: 10.1109/CVPR42600.2020.00522.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems, (2014), 2672–2680.
[9] S. Hong, D. Yang, J. Choi and H. Lee, Inferring semantic layout for hierarchical text-to-image synthesis, IEEE Conference on Computer Vision and Pattern Recognition, (2018), 7986–7994. doi: 10.1109/CVPR.2018.00833.
[10] J. Johnson, A. Gupta and F. F. Li, Image generation from scene graphs, IEEE Conference on Computer Vision and Pattern Recognition, (2018), 1219–1228. doi: 10.1109/CVPR.2018.00133.
[11] T. Kaneko, Y. Ushiku and T. Harada, Label-noise robust generative adversarial networks, IEEE Conference on Computer Vision and Pattern Recognition, (2019), 2462–2471. doi: 10.1109/CVPR.2019.00257.
[12] S. W. Kim, Y. Zhou, J. Philion, A. Torralba and S. Fidler, Learning to simulate dynamic environments with GameGAN, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 1228–1237. doi: 10.1109/CVPR42600.2020.00131.
[13] D. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015.
[14] T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, preprint, arXiv: 1609.02907.
[15] R. Krishna, et al., Visual Genome: Connecting language and vision using crowdsourced dense image annotations, International Journal of Computer Vision, 123 (2017), 32–73. doi: 10.1007/s11263-016-0981-7.
[16] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick, Microsoft COCO: Common objects in context, European Conference on Computer Vision, 8693 (2014), 740–755. doi: 10.1007/978-3-319-10602-1_48.
[17] M. Li, H. Huang, L. Ma, W. Liu, T. Zhang and Y. Jiang, Unsupervised image-to-image translation with stacked cycle-consistent adversarial networks, European Conference on Computer Vision, (2018), 186–201. doi: 10.1007/978-3-030-01240-3_12.
[18] W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu and J. Gao, Object-driven text-to-image synthesis via adversarial training, IEEE Conference on Computer Vision and Pattern Recognition, (2019), 12166–12174. doi: 10.1109/CVPR.2019.01245.
[19] Y. Li, T. Ma, Y. Bai, N. Duan, S. Wei and X. Wang, PasteGAN: A semi-parametric method to generate image from scene graph, Advances in Neural Information Processing Systems, 2019.
[20] B. Li, B. Zhuang, M. Li and J. Gu, Seq-SG2SL: Inferring semantic layout from scene graph through sequence to sequence learning, IEEE International Conference on Computer Vision, (2019), 7434–7442. doi: 10.1109/ICCV.2019.00753.
[21] S. Liu, T. Wang, D. Bau, J. Y. Zhu and A. Torralba, Diverse image generation via self-conditioned GANs, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 14274–14283. doi: 10.1109/CVPR42600.2020.01429.
[22] S. Nam, Y. Kim and S. J. Kim, Text-adaptive generative adversarial networks: Manipulating images with natural language, Advances in Neural Information Processing Systems, (2018), 42–51.
[23] J. C. Ni, S. S. Zhang, Z. L. Zhou, J. Hou and F. Gao, Instance mask embedding and attribute-adaptive generative adversarial network for text-to-image synthesis, IEEE Access, 8 (2020), 37697–37711. doi: 10.1109/ACCESS.2020.2975841.
[24] T. Park, M. Y. Liu, T. C. Wang and J. Y. Zhu, Semantic image synthesis with spatially-adaptive normalization, IEEE Conference on Computer Vision and Pattern Recognition, (2019), 2332–2341. doi: 10.1109/CVPR.2019.00244.
[25] T. Qiao, J. Zhang, D. Xu and D. Tao, MirrorGAN: Learning text-to-image generation by redescription, IEEE Conference on Computer Vision and Pattern Recognition, (2019), 1505–1514.
[26] S. Ravuri and O. Vinyals, Classification accuracy score for conditional generative models, preprint, arXiv: 1905.10887.
[27] S. Ren, K. He, R. Girshick and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39 (2016), 1137–1149. doi: 10.1109/TPAMI.2016.2577031.
[28] S. Sah, D. Peri, A. Shringi, C. Zhang, M. Dominguez, A. Savakis and R. Ptucha, Semantically invariant text-to-image generation, IEEE International Conference on Image Processing, (2018), 3783–3787. doi: 10.1109/ICIP.2018.8451656.
[29] Y. Shen, J. Gu, X. Tang and B. Zhou, Interpreting the latent space of GANs for semantic face editing, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 9240–9249. doi: 10.1109/CVPR42600.2020.00926.
[30] T. R. Shaham, T. Dekel and T. Michaeli, SinGAN: Learning a generative model from a single natural image, IEEE International Conference on Computer Vision, (2019), 4569–4579. doi: 10.1109/ICCV.2019.00467.
[31] W. Sun and T. F. Wu, Learning layout and style reconfigurable GANs for controllable image synthesis, preprint, arXiv: 2003.11571.
[32] T. Sylvain, P. C. Zhang, Y. Bengio, R. D. Hjelm and S. Sharma, Object-centric image generation from layouts, preprint, arXiv: 2003.07449.
[33] C. Szegedy, et al., Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition, (2015), 1–9. doi: 10.1109/CVPR.2015.7298594.
[34] H. Tang, H. Liu and N. Sebe, Unified generative adversarial networks for controllable image-to-image translation, IEEE Transactions on Image Processing, 29 (2020), 8916–8929. doi: 10.1109/TIP.2020.3021789.
[35] N. N. Vo and J. Hays, Localizing and orienting street views using overhead imagery, European Conference on Computer Vision, (2016), 494–509. doi: 10.1007/978-3-319-46448-0_30.
[36] D. M. Vo and A. Sugimoto, Visual-relation conscious image generation from structured-text, preprint, arXiv: 1908.01741.
[37] H. Yu, Y. Huang, L. Pi and L. Wang, Recurrent deconvolutional generative adversarial networks with application to video generation, Pattern Recognition and Computer Vision, (2019), 18–28. doi: 10.1007/978-3-030-31723-2_2.
[38] L. Z. Zhang, J. C. Wang, Y. S. Xu, J. Min, T. Wen, J. C. Gee and J. B. Shi, Nested scale-editing for conditional image synthesis, IEEE Conference on Computer Vision and Pattern Recognition, (2020), 5476–5486. doi: 10.1109/CVPR42600.2020.00552.
Overview of the proposed GALS-GAN
Illustration of a single graph convolution layer
Architecture of the MLP
Inference process of the mask predictor
Architecture of the local-specific generator
Architecture of the multi-scale discriminators
Images generated by generators at different levels
Qualitative examples generated by our GALS-GAN based on the COCO-Stuff dataset
Qualitative examples generated by our GALS-GAN based on the Visual Genome dataset
Qualitative comparison of different models
An example of manipulating the synthesized image
Example results of different image manipulation types
Ablation study of the global-affine generator
Ablation study of the local-specific generator