We introduce CAFLOW, a new diverse image-to-image translation model that simultaneously leverages the power of autoregressive modeling and the efficiency of conditional normalizing flows. We transform the conditioning image into a sequence of latent encodings using a multi-scale normalizing flow, and repeat the process for the conditioned image. We model the conditional distribution of the latent encodings autoregressively, realizing each autoregressive component with an efficient multi-scale conditional normalizing flow in which each conditioning factor affects image synthesis at its respective resolution scale. Our proposed framework performs well on a range of image-to-image translation tasks and outperforms former designs of conditional flows thanks to its expressive autoregressive structure.
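The factorization described above can be written schematically as follows (a sketch only; the exact conditioning sets and index conventions are those defined in the paper). With the conditioning image $Y$ encoded as latents $(L_1,\dots,L_n)$ and the conditioned image $W$ as $(Z_1,\dots,Z_n)$, the conditional likelihood is modeled autoregressively:

```latex
p_\theta\!\left(W \mid Y\right)
  = p_\theta\!\left(Z_1,\dots,Z_n \mid L_1,\dots,L_n\right)
  = \prod_{i=1}^{n} p_\theta\!\left(Z_i \mid Z_{j<i},\, L_i,\dots,L_n\right)
```

Each factor is realized by an invertible conditional transformation $G_i^\theta$, so that, unlike the Dual-Glow assumption [23], the $i^{th}$ component retains dependencies on latents beyond the one of matching dimension.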
Figure 1. From left to right: ideal dependencies in the $ i^{th} $ autoregressive component; the Dual-Glow [23] modeling assumption, where information is exchanged only between latent spaces of the same dimension; our modeling assumption, which retains the dependencies between $ L_i $ and the latent spaces of lower dimension
Figure 2. Left: unconditional normalizing flow architecture used to encode conditioning and conditioned images, denoted by $ Y_n = Y $ and $ W_n = W $ respectively, into a sequence of hierarchical latent variables. Right: design of the conditional transformation $ G_{i}^\theta $ that models the $ i^{th} $ autoregressive component. The index of the flow $ i $ is omitted in both the transformed latent variable $ Z_j $ and the intermediate latent variables $ Z_j^{\prime} $ for simplicity
Figure 6. Qualitative evaluation: four colorizations proposed by CAFLOW, CINN and ColorGAN for three test images. ColorGAN generates unrealistically diverse colorizations with significant color artifacts (for example, a yellow region on a white wall). CINN generates more realistic, less diverse colorizations with less pronounced color artifacts than ColorGAN, which is reflected in its improved FID score. Finally, CAFLOW generates even more realistic and less diverse colorizations than CINN, with even rarer color artifacts, which is more representative of the data distribution according to the FID score
Figure 8. Image super-resolution on the FFHQ dataset. Left: LR image, bicubically upsampled. Right: HR image. Middle: 10 super-resolved versions in decreasing conditional log-likelihood order from left to right. We sampled 20 super-resolved images for each LR image and present the 10 with the highest conditional log-likelihood. We used sampling temperature $ \tau = 0.5 $
Figure 9. Image super-resolution on the FFHQ dataset. Left: LR image, bicubically upsampled. Right: HR image. Middle: 10 super-resolved versions in decreasing conditional log-likelihood order from left to right. We sampled 20 super-resolved images for each LR image and present the 10 with the highest conditional log-likelihood. We used sampling temperature $ \tau = 0.55 $
Figure 10. Image inpainting on the CelebA dataset. Left: Masked image. Right: Ground truth. Middle: 10 inpainted versions in decreasing conditional log-likelihood order from left to right. We sampled 30 inpainted images for each masked image and we present the 10 images with the highest conditional log-likelihood. We used sampling temperature $ \tau = 0.5 $
Figure 11. Image inpainting on the CelebA dataset. Left: Masked image. Right: Ground truth. Middle: 10 inpainted versions in decreasing conditional log-likelihood order from left to right. We sampled 30 inpainted images for each masked image and we present the 10 images with the highest conditional log-likelihood. We used sampling temperature $ \tau = 0.5 $
Figure 12. Image colorization on the LSUN BEDROOM dataset. Left: Grayscale image. Right: Ground truth. Middle: 10 colorized versions in decreasing conditional log-likelihood order from left to right. We sampled 25 colorized images for each grayscale image and we present the 10 images with the highest conditional log-likelihood. We used sampling temperature $ \tau = 0.85 $
Figure 13. Image colorization on the LSUN BEDROOM dataset. Left: Grayscale image. Right: Ground truth. Middle: 10 colorized versions in decreasing conditional log-likelihood order from left to right. We sampled 25 colorized images for each grayscale image and we present the 10 images with the highest conditional log-likelihood. We used sampling temperature $ \tau = 0.85 $
Figure 14. Image colorization on the FFHQ dataset. Left: Grayscale image. Right: Ground truth. Middle: 10 colorized versions in decreasing conditional log-likelihood order from left to right. We sampled 25 colorized images for each grayscale image and we present the 10 images with the highest conditional log-likelihood. We used sampling temperature $ \tau = 0.7 $
Figure 15. Image colorization on the FFHQ dataset. Left: Grayscale image. Right: Ground truth. Middle: 10 colorized versions in decreasing conditional log-likelihood order from left to right. We sampled 25 colorized images for each grayscale image and we present the 10 images with the highest conditional log-likelihood. We used sampling temperature $ \tau = 0.7 $
Figure 16. Sketch-to-image synthesis on the edges2shoes dataset [10]. Left: Sketch. Right: Ground truth. Middle: 6 samples taken with sampling temperature $ \tau = 0.8 $
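The sampling protocol shared by the figures above (draw $N$ candidates at temperature $\tau$, keep the $k$ with the highest conditional log-likelihood, displayed in decreasing order) can be sketched as follows. Note that `sample_fn` and `log_likelihood_fn` are hypothetical stand-ins for the trained model's sampling and density-evaluation routines, not part of the paper's code:

```python
import heapq

def sample_and_rank(sample_fn, log_likelihood_fn, condition, n_samples, n_keep, tau):
    """Draw n_samples candidates at temperature tau and return the n_keep
    candidates with the highest conditional log-likelihood, best first."""
    scored = []
    for _ in range(n_samples):
        candidate = sample_fn(condition, tau)
        scored.append((log_likelihood_fn(candidate, condition), candidate))
    # heapq.nlargest sorts by the key (the log-likelihood), descending,
    # which matches the left-to-right ordering used in the figures.
    best = heapq.nlargest(n_keep, scored, key=lambda pair: pair[0])
    return [candidate for _, candidate in best]
```

For example, Figure 8 corresponds to `n_samples=20`, `n_keep=10`, `tau=0.5`.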
Table 1. Quantitative evaluation of (x4) super-resolution on FFHQ
Table 2. Quantitative evaluation of colorization on LSUN BEDROOM
Table 3. Quantitative evaluation of inpainting on the CelebA dataset. We report PSNR and LPIPS scores for each method
| Method | PSNR$ \uparrow $ | LPIPS$ \downarrow $ |
| --- | --- | --- |
| CAFLOW | 26.08 | 0.06 |
| [16] | 24.88 | - |
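For context, the PSNR values reported in Table 3 follow the standard peak signal-to-noise ratio definition; a minimal NumPy sketch, assuming 8-bit pixel range (higher is better, matching the $\uparrow$ in the table header):

```python
import numpy as np

def psnr(reference, reconstruction, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((np.asarray(reference, dtype=np.float64)
                   - np.asarray(reconstruction, dtype=np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```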
[1] L. Ardizzone, C. Lüth, J. Kruse, C. Rother and U. Köthe, Guided image generation with conditional invertible neural networks, arXiv preprint, arXiv:1907.02392.
[2] J. Behrmann, P. Vicol, K.-C. Wang, R. Grosse and J.-H. Jacobsen, Understanding and mitigating exploding inverses in invertible neural networks, in International Conference on Artificial Intelligence and Statistics, PMLR, (2021), 1792-1800.
[3] M. G. Blanch, M. Mrak, A. F. Smeaton and N. E. O'Connor, End-to-end conditional GAN-based architectures for image colourisation, in 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), (2019), 1-6.
[4] R. T. Q. Chen, Y. Rubanova, J. Bettencourt and D. K. Duvenaud, Neural ordinary differential equations, in Advances in Neural Information Processing Systems, 31 (2018).
[5] L. Dinh, D. Krueger and Y. Bengio, NICE: Non-linear independent components estimation, in 3rd International Conference on Learning Representations, ICLR 2015 (eds. Y. Bengio and Y. LeCun), 2015.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems, 27 (2014).
[7] A. Grover, C. Chute, R. Shu, Z. Cao and S. Ermon, AlignFlow: Cycle consistent learning from multiple domains via normalizing flows, Proceedings of the AAAI Conference on Artificial Intelligence, 34 (2020), 4028-4035. doi: 10.1609/aaai.v34i04.5820.
[8] J. Ho, X. Chen, A. Srinivas, Y. Duan and P. Abbeel, Flow++: Improving flow-based generative models with variational dequantization and architecture design, in International Conference on Machine Learning, PMLR, (2019), 2722-2730.
[9] C.-W. Huang, D. Krueger, A. Lacoste and A. Courville, Neural autoregressive flows, Proceedings of the 35th International Conference on Machine Learning, 80 (2018), 2078-2087.
[10] P. Isola, J.-Y. Zhu, T. Zhou and A. A. Efros, Image-to-image translation with conditional adversarial networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), 1125-1134.
[11] T. Karras, S. Laine and T. Aila, A style-based generator architecture for generative adversarial networks, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), 4401-4410.
[12] D. P. Kingma and P. Dhariwal, Glow: Generative flow with invertible 1x1 convolutions, Advances in Neural Information Processing Systems, 31 (2018), 10215-10224.
[13] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon and W. Wu, Feedback network for image super-resolution, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), 3867-3876.
[14] J. Liang, A. Lugmayr, K. Zhang, M. Danelljan, L. Van Gool and R. Timofte, Hierarchical conditional flow: A unified framework for image super-resolution and image rescaling, in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), 4076-4085.
[15] Z. Liu, P. Luo, X. Wang and X. Tang, Deep learning face attributes in the wild, in Proceedings of the IEEE International Conference on Computer Vision, (2015), 3730-3738.
[16] Y. Lu and B. Huang, Structured output learning with conditional generative flows, AAAI.
[17] A. Lugmayr, M. Danelljan, L. Van Gool and R. Timofte, SRFlow: Learning the super-resolution space with normalizing flow, in Computer Vision - ECCV 2020, 2020.
[18] R. V. Marinescu, D. Moyer and P. Golland, Bayesian image reconstruction using deep generative models, CoRR, abs/2012.04567, https://arxiv.org/abs/2012.04567.
[19] D. Onken, S. W. Fung, X. Li and L. Ruthotto, OT-Flow: Fast and accurate continuous normalizing flows via optimal transport, Proceedings of the AAAI Conference on Artificial Intelligence, 35 (2021), 9223-9232. doi: 10.1609/aaai.v35i10.17113.
[20] A. Pumarola, S. Popov, F. Moreno-Noguer and V. Ferrari, C-Flow: Conditional generative flow models for images and 3d point clouds, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), 7946-7955.
[21] D. Rezende and S. Mohamed, Variational inference with normalizing flows, in International Conference on Machine Learning, PMLR, (2015), 1530-1538.
[22] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon and B. Poole, Score-based generative modeling through stochastic differential equations, arXiv preprint, arXiv:2011.13456.
[23] H. Sun, R. Mehta, H. H. Zhou, Z. Huang, S. C. Johnson, V. Prabhakaran and V. Singh, Dual-Glow: Conditional flow-based generative model for modality transfer, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[24] A. Verine, B. Negrevergne, Y. Chevaleyre and F. Rossi, On the expressivity of Bi-Lipschitz normalizing flows, in Asian Conference on Machine Learning, PMLR, (2023), 1054-1069.
[25] Y. Viazovetskyi, V. Ivashkin and E. Kashin, StyleGAN2 distillation for feed-forward image manipulation, in European Conference on Computer Vision, Springer, (2020), 170-186.
[26] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao and C. C. Loy, ESRGAN: Enhanced super-resolution generative adversarial networks, in Computer Vision - ECCV 2018 Workshops (eds. L. Leal-Taixé and S. Roth), (2019), 63-79.
[27] H. Wu, J. Köhler and F. Noé, Stochastic normalizing flows, arXiv preprint, arXiv:2002.06707.
[28] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser and J. Xiao, LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop, arXiv preprint, arXiv:1506.03365.
[29] J. J. Yu, K. Derpanis and M. A. Brubaker, Wavelet flow: Fast training of high resolution normalizing flows, in NeurIPS, 2020.
Figure 3. 10 super-resolved versions of the LR image in decreasing conditional log-likelihood order
Figure 4. Qualitative comparison of Dual-Glow+ and CAFLOW
Figure 5. Qualitative evaluation on FFHQ 4x super-resolution of 16x16 resolution images
Figure 7. Different inpaintings proposed by CAFLOW