Generative Image Layer Decomposition with Visual Effects

Jinrui Yang1,2,*, Qing Liu2, Yijun Li2, Soo Ye Kim2, Daniil Pakhomov2,
Mengwei Ren2, Jianming Zhang2, Zhe Lin2, Cihang Xie1, Yuyin Zhou1.
1UC Santa Cruz, 2Adobe Research

* This work was done when Jinrui Yang was a research intern at Adobe Research.


Overview of LayerDecomp


(a) Given an input image and a binary object mask, our model decomposes the image into a clean background layer and a transparent foreground layer with preserved visual effects such as shadows and reflections. (b) This decomposition then enables complex and controllable layer-wise editing, such as spatial, color, and/or style edits.
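
To make the layer-wise editing concrete, below is a minimal sketch of a spatial edit performed on the decomposed layers. It assumes the foreground is exported as an RGBA image whose alpha channel carries the object together with its soft visual effects; the file names and helper functions are illustrative, not part of a released API.

# Minimal sketch of layer-wise spatial editing after decomposition (illustrative, not the released API).
from PIL import Image

def shift_layer(fg_rgba: Image.Image, dx: int, dy: int) -> Image.Image:
    """Translate the transparent foreground layer by (dx, dy) pixels."""
    shifted = Image.new("RGBA", fg_rgba.size, (0, 0, 0, 0))
    shifted.paste(fg_rgba, (dx, dy), mask=fg_rgba)
    return shifted

def recomposite(bg_rgb: Image.Image, fg_rgba: Image.Image) -> Image.Image:
    """Alpha-composite the (possibly edited) foreground over the clean background."""
    out = bg_rgb.convert("RGBA")
    out.alpha_composite(fg_rgba)
    return out.convert("RGB")

# Example: move the object, together with its shadow/reflection, 80 px to the right.
bg = Image.open("background.png").convert("RGB")        # clean background layer (illustrative path)
fg = Image.open("foreground_rgba.png").convert("RGBA")  # transparent foreground layer (illustrative path)
recomposite(bg, shift_layer(fg, dx=80, dy=0)).save("edited_composite.png")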


Abstract

Recent advancements in large generative models, particularly diffusion-based methods, have significantly enhanced the capabilities of image editing. However, achieving precise control over image composition tasks remains a challenge. Layered representations, which allow for independent editing of image components, are essential for user-driven content creation, yet existing approaches often struggle to decompose an image into plausible layers with accurately retained transparent visual effects such as shadows and reflections. We propose LayerDecomp, a generative framework for image layer decomposition which outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. To enable effective training, we first introduce a dataset preparation pipeline that automatically scales up simulated multi-layer data with synthesized visual effects. To further enhance real-world applicability, we supplement this simulated dataset with camera-captured images containing natural visual effects. Additionally, we propose a consistency loss that forces the model to learn accurate representations for the transparent foreground layer when ground-truth annotations are not available. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks across several benchmarks and multiple user studies, unlocking various creative possibilities for layer-wise image editing. The project page is https://rayjryang.github.io/LayerDecomp/.


The framework of LayerDecomp


The framework of LayerDecomp. The model takes four inputs: two conditional inputs (a composite image and an object mask) and two noisy latent representations for the background and foreground layers. During training, we use simulated image triplets alongside camera-captured background-composite image pairs. We also introduce a pixel-space consistency loss to ensure that natural visual effects such as shadows and reflections are faithfully preserved in the transparent foreground layer.
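
As one possible reading of the caption above (not the authors' exact formulation), the pixel-space consistency loss can be sketched as follows: the decoded transparent foreground, alpha-composited over the decoded clean background, should reproduce the input composite image even when no ground-truth foreground layer is available. The tensor layout and the L1 penalty are assumptions.

# Sketch of a pixel-space consistency loss: re-compositing the predicted layers
# should reconstruct the input composite image.
import torch
import torch.nn.functional as F

def consistency_loss(fg_rgba: torch.Tensor,       # (B, 4, H, W): RGB + alpha, in [0, 1]
                     bg_rgb: torch.Tensor,        # (B, 3, H, W): decoded clean background
                     composite_rgb: torch.Tensor  # (B, 3, H, W): input composite image
                     ) -> torch.Tensor:
    rgb, alpha = fg_rgba[:, :3], fg_rgba[:, 3:4]
    # Standard "over" compositing of the transparent foreground onto the background.
    recomposited = alpha * rgb + (1.0 - alpha) * bg_rgb
    return F.l1_loss(recomposited, composite_rgb)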




Object removal - comparison with mask-based methods


Object removal - comparison with mask-based methods. Our model, using tight input masks, generates more visually plausible results with fewer artifacts than ControlNet Inpainting[1], SD-XL Inpainting[2], and PowerPaint[3], all of which require loose input masks. In addition, our model delivers coherent foreground layers and supports more advanced downstream editing tasks.




Object removal - comparison with instruction-driven methods


Object removal - comparison with instruction-driven methods. Combined with a text-based grounding method, our model effectively removes target objects while preserving background integrity, whereas existing instruction-based editing methods, such as Emu-Edit[4], MGIE[5], and OmniGen[6], may struggle to fully remove the target or maintain background consistency.
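
An illustrative text-driven removal pipeline, with hypothetical handles grounding_model and layer_decomp_model standing in for the text-based grounding method and our decomposition model:

# Hypothetical pipeline: text prompt -> object mask -> layer decomposition -> clean background.
def remove_object_by_text(image, prompt, grounding_model, layer_decomp_model):
    mask = grounding_model.segment(image, prompt)              # text-based grounding to a binary mask
    background, foreground = layer_decomp_model(image, mask)   # decompose into the two layers
    return background                                          # object and its visual effects removed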




Object spatial editing


Object spatial editing. Our model enables precise object moving and resizing with seamless handling of visual effects, producing highly realistic edits that preserve content identity. When applied to examples released by prior works such as DiffusionHandle[7] and DesignEdit[8], our model also achieves satisfactory results.




Multi-layer Decomposition and Creative Layer Editing


Multi-layer Decomposition and Creative Layer Editing. By applying our model sequentially, we can decompose multiple foreground layers with distinct visual effects, which can then be used for further creative editing tasks.
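
A minimal sketch of this sequential decomposition, using a hypothetical layer_decomp_model callable that returns a clean background and a transparent foreground for a given composite and object mask:

# Sequential multi-layer decomposition: each pass peels off one foreground layer
# (with its visual effects) and feeds the clean background back as the next composite.
def decompose_layers(composite, object_masks, layer_decomp_model):
    layers = []
    current = composite
    for mask in object_masks:
        background, foreground = layer_decomp_model(current, mask)
        layers.append(foreground)   # transparent RGBA layer incl. shadows/reflections
        current = background        # keep decomposing the remaining scene
    return current, layers          # final background plus ordered foreground layers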

BibTeX

@article{yang2024generative,
      title={Generative Image Layer Decomposition with Visual Effects},
      author={Yang, Jinrui and Liu, Qing and Li, Yijun and Kim, Soo Ye and Pakhomov, Daniil and Ren, Mengwei and Zhang, Jianming and Lin, Zhe and Xie, Cihang and Zhou, Yuyin},
      journal={arXiv preprint arXiv:2411.17864},
      year={2024}
    }

References

  [1] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836-3847, 2023.
  [2] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684-10695, 2022.
  [3] Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. arXiv preprint arXiv:2312.03594, 2023.
  [4] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871-8879, 2024.
  [5] Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. In International Conference on Learning Representations (ICLR), 2024.
  [6] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024.
  [7] Karran Pandey, Paul Guerrero, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, and Niloy J. Mitra. Diffusion Handles: Enabling 3D edits for diffusion models by lifting activations to 3D. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7695-7704, 2024.
  [8] Yueru Jia, Yuhui Yuan, Aosong Cheng, Chuke Wang, Ji Li, Huizhu Jia, and Shanghang Zhang. DesignEdit: Multi-layered latent decomposition and fusion for unified & accurate image editing. arXiv preprint arXiv:2403.14487, 2024.

Acknowledgment

We thank the owners of the images on this site (link) for sharing their valuable assets.