# Global-Affine and Local-Specific Generative Adversarial Network for semantic-guided image generation

Recent progress in learning image feature representations has opened the way for tasks such as label-to-image and text-to-image synthesis. However, existing methods widely struggle to synthesize fine-grained textures and small-scale instances. In this paper, we propose a novel Global-Affine and Local-Specific Generative Adversarial Network (GALS-GAN) that explicitly constructs global semantic layouts and learns distinct instance-level features. To this end, we adopt a graph convolutional network to compute instance locations and spatial relationships from scene graphs, which allows our model to obtain high-fidelity semantic layouts. In addition, a local-specific generator, in which a feature filtering mechanism separately learns semantic maps for different categories, is used to disentangle and generate category-specific visual features. Moreover, because the two generation sub-networks are highly complementary, we apply a weight map predictor to better combine the global and local pathways. Extensive experiments on the COCO-Stuff and Visual Genome datasets demonstrate the superior generation performance of our model over previous methods: our approach is better at capturing photo-realistic local characteristics and rendering small-sized entities in more detail.
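
The role of the weight map predictor can be illustrated with a minimal sketch. The abstract does not state how the two pathways are combined, so the per-pixel convex blend below, the sigmoid squashing, and the names `fuse`, `global_img`, `local_img`, and `weight_logits` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Map a raw score to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(global_img, local_img, weight_logits):
    """Blend two H x W maps pixel by pixel (hypothetical fusion rule).

    Each output pixel is w * global + (1 - w) * local, where
    w = sigmoid(logit) plays the role of the predicted weight map.
    """
    fused = []
    for g_row, l_row, z_row in zip(global_img, local_img, weight_logits):
        row = []
        for g, l, z in zip(g_row, l_row, z_row):
            w = sigmoid(z)
            row.append(w * g + (1.0 - w) * l)
        fused.append(row)
    return fused


# Toy 2 x 2 single-channel example: global pathway is all ones,
# local pathway is all zeros, so the output equals the weight map.
global_img = [[1.0, 1.0], [1.0, 1.0]]
local_img = [[0.0, 0.0], [0.0, 0.0]]
weight_logits = [[0.0, 10.0], [-10.0, 0.0]]
fused = fuse(global_img, local_img, weight_logits)
```

With a logit of 0 the two pathways contribute equally, while large positive or negative logits hand the pixel almost entirely to the global or the local pathway.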

Figure 1.  Overview of the proposed GALS-GAN

Figure 2.  Illustration of a single graph convolution layer

Figure 3.  Architecture of the MLP

Figure 4.  Inference process of the mask predictor

Figure 5.  Architecture of the local-specific generator

Figure 6.  Architecture of the multi-scale discriminators

Figure 7.  Images generated by different level generators

Figure 8.  Qualitative examples generated by our GALS-GAN based on the COCO-Stuff dataset

Figure 9.  Qualitative examples generated by our GALS-GAN based on the Visual Genome dataset

Figure 10.  Qualitative comparison of different models

Figure 11.  An example of manipulating the synthesized image

Figure 12.  Example results of different image manipulation types

Figure 13.  Ablation study of the global-affine generator

Figure 14.  Ablation study of the local-specific generator

Table 1.  Statistics of COCO-Stuff and Visual Genome datasets

| Dataset | Train | Val | Test | Categories | Max | Min |
|---|---|---|---|---|---|---|
| COCO-Stuff | 74121 | 1024 | 2048 | 171 | 8 | 3 |
| Visual Genome | 62565 | 5506 | 5088 | 178 | 30 | 3 |

Table 2.  Quantitative comparison of images generated by different methods on the COCO-Stuff dataset

| Methods | IS $\uparrow$ (64 $\times$ 64) | IS $\uparrow$ (128 $\times$ 128) | FID $\downarrow$ (64 $\times$ 64) | FID $\downarrow$ (128 $\times$ 128) |
|---|---|---|---|---|
| sg2im [10] | 6.7$\pm$0.1 | 5.99$\pm$0.27 | 67.99 | 95.18 |
| stacking-GANs [36] | 9.1$\pm$0.20 | 12.01$\pm$0.40 | 50.94 | 39.78 |
| PasteGAN [19] | 9.2$\pm$0.32 | - | 42.30 | - |
| PasteGAN (GT layout) [19] | 10.20$\pm$0.20 | - | 34.30 | - |
| ours | 9.85$\pm$0.15 | 13.82$\pm$0.30 | 38.29 | 29.62 |

Table 3.  Quantitative comparison of images generated by different methods on the Visual Genome dataset

| Methods | IS $\uparrow$ (64 $\times$ 64) | IS $\uparrow$ (128 $\times$ 128) | FID $\downarrow$ (64 $\times$ 64) | FID $\downarrow$ (128 $\times$ 128) |
|---|---|---|---|---|
| sg2im [10] | 5.5$\pm$0.10 | 4.78$\pm$0.15 | 73.79 | 70.40 |
| stacking-GANs [36] | 6.90$\pm$0.20 | 9.24$\pm$0.41 | 59.53 | 50.19 |
| PasteGAN [19] | 7.97$\pm$0.30 | - | 58.37 | - |
| PasteGAN (GT layout) [19] | 9.15$\pm$0.20 | - | 34.91 | - |
| ours | 8.87$\pm$0.15 | 11.20$\pm$0.55 | 39.25 | 29.94 |

Table 4.  Comparison of classification accuracy

Classification Accuracy Score:

| Methods | COCO-Stuff (64 $\times$ 64) | COCO-Stuff (128 $\times$ 128) | Visual Genome (64 $\times$ 64) | Visual Genome (128 $\times$ 128) |
|---|---|---|---|---|
| sg2im [10] | 28.8 | 24.1 | 26.7 | 23.4 |
| stacking-GANs [36] | 33.9 | 31.2 | 32.7 | 30.3 |
| PasteGAN [19] | 40.3 | - | 38.7 | - |
| ours | 46.1 | 44.6 | 45.4 | 43.5 |

Table 5.  Quantitative comparison of predicted semantic layouts

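| Methods | R@0.3 (COCO-Stuff) | R@0.3 (Visual Genome) | R@0.5 (COCO-Stuff) | R@0.5 (Visual Genome) |
|---|---|---|---|---|
| sg2im [10] | 52.4 | 21.9 | 32.2 | 10.6 |
| stacking-GANs [36] | 65.3 | 35.0 | 49.1 | 23.2 |
| PasteGAN [19] | 71.2 | 45.2 | 62.4 | 33.8 |
| ours | 80.7 | 48.4 | 66.2 | 36.5 |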
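
The R@0.3 and R@0.5 columns are not defined in this excerpt. Assuming they denote recall at intersection-over-union (IoU) thresholds of 0.3 and 0.5 between predicted and ground-truth boxes, a common convention for layout evaluation, the metric could be computed roughly as follows. The `(x0, y0, x1, y1)` box format, the one-to-one pairing of predictions with ground truth, and the helper names `iou` and `recall_at` are assumptions for illustration only.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def recall_at(pred_boxes, gt_boxes, threshold):
    """Fraction of ground-truth boxes whose paired prediction reaches the IoU threshold.

    Assumes pred_boxes[i] is the prediction for gt_boxes[i].
    """
    if not gt_boxes:
        return 0.0
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes) if iou(p, g) >= threshold)
    return hits / len(gt_boxes)


# Toy example: one exact match and one box shifted by half its width.
gt_boxes = [(0, 0, 10, 10), (0, 0, 10, 10)]
pred_boxes = [(0, 0, 10, 10), (5, 0, 15, 10)]
r03 = recall_at(pred_boxes, gt_boxes, 0.3)
r05 = recall_at(pred_boxes, gt_boxes, 0.5)
```

In the toy example the shifted box has an IoU of 1/3, so it counts as a hit at the 0.3 threshold but not at 0.5, which mirrors why R@0.5 is lower than R@0.3 for every method in the table.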

Table 6.  Ablation study of different GALS-GAN architectures

| Architectures | IS $\uparrow$ | FID $\downarrow$ |
|---|---|---|
| w/o $G_{g-a}$ | 7.52$\pm$0.40 | 78.94 |
| w/o $G_{l-s}$ | 11.30$\pm$0.12 | 46.83 |
| full model | 13.82$\pm$0.30 | 29.62 |

