LASAGNA Framework
We formulate the joint generation of composite images, backgrounds, and foregrounds as a flexible, layer-conditional denoising task. This single framework supports multiple workflows, including FG_Gen, BG_Gen, and Text2All. We use a unified input representation with learnable embeddings that distinguish the roles of visual latents (noise, BG, FG, and mask) across tasks, enabling the model to adapt its behavior to each generation setting. As a result, a single attention-based model can process varied combinations of conditioning inputs and generation targets simultaneously.
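The section does not spell out how the role embeddings are applied, so the following is only a minimal sketch of one plausible reading: a learnable embedding per latent role (noise, BG, FG, mask) is added to the corresponding token sequence, and the role-tagged sequences are concatenated into one input for the attention model. All names here (`RoleEmbedding`, `build_sequence`) and the exact tagging scheme are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class RoleEmbedding(nn.Module):
    """Learnable embeddings tagging each visual latent token with its role
    (noise, background, foreground, or mask). Hypothetical sketch, not the
    paper's implementation."""
    ROLES = {"noise": 0, "bg": 1, "fg": 2, "mask": 3}

    def __init__(self, dim: int):
        super().__init__()
        self.table = nn.Embedding(len(self.ROLES), dim)

    def forward(self, tokens: torch.Tensor, role: str) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); add the same role embedding to every token.
        idx = torch.full(tokens.shape[:2], self.ROLES[role],
                         dtype=torch.long, device=tokens.device)
        return tokens + self.table(idx)

def build_sequence(role_emb: RoleEmbedding, latents: dict) -> torch.Tensor:
    # latents maps a role name to its (batch, seq_len, dim) token tensor;
    # which roles are present depends on the workflow (FG_Gen, BG_Gen, Text2All).
    return torch.cat([role_emb(t, role) for role, t in latents.items()], dim=1)

if __name__ == "__main__":
    dim, b, n = 64, 2, 16
    role_emb = RoleEmbedding(dim)
    # A BG_Gen-style input: noisy target tokens conditioned on FG and mask latents.
    seq = build_sequence(role_emb, {
        "noise": torch.randn(b, n, dim),
        "fg": torch.randn(b, n, dim),
        "mask": torch.randn(b, n, dim),
    })
    print(seq.shape)  # torch.Size([2, 48, 64])
```

Under this reading, switching workflows amounts to changing which role-tagged sequences appear in the concatenated input, so the same attention backbone serves all tasks without architectural changes.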