Abstract
LatentMorph integrates implicit latent reasoning into text-to-image generation through four lightweight components that enable adaptive self-refinement and improve both efficiency and cognitive alignment.
Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation, a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiency, information loss, and cognitive mismatch. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser that summarizes intermediate generation states into compact visual memory, (ii) a translator that converts latent thoughts into actionable guidance, (iii) a shaper that dynamically steers next-image-token predictions, and (iv) an RL-trained invoker that adaptively determines when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) improves the base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by 15% and 11% on abstract reasoning tasks such as WISE and IPV-Txt; (III) reduces inference time by 44% and token consumption by 51%; and (IV) exhibits 71% cognitive alignment with human intuition on reasoning invocation.
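The abstract names the four components but not their interfaces. The sketch below is a hypothetical wiring, not the authors' implementation: the module names follow the abstract, while all dimensions, the pooling choice, the additive logit bias, and the invocation threshold are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): how a condenser, translator,
# shaper, and invoker could plug into an autoregressive T2I decode step, with all
# reasoning kept in continuous latents (no text decoding of intermediate thoughts).
import torch
import torch.nn as nn

D_HID, D_LATENT, VOCAB = 512, 256, 1024  # hypothetical sizes


class Condenser(nn.Module):
    """Summarize hidden states of the tokens generated so far into a compact visual memory."""
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.proj = nn.Linear(D_HID, D_LATENT)

    def forward(self, hidden):                     # hidden: (B, T, D_HID)
        pooled = self.pool(hidden.transpose(1, 2)).squeeze(-1)   # (B, D_HID)
        return self.proj(pooled)                   # (B, D_LATENT)


class Translator(nn.Module):
    """Convert the latent thought (visual memory) into actionable guidance."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(D_LATENT, D_LATENT), nn.GELU(),
                                 nn.Linear(D_LATENT, D_HID))

    def forward(self, memory):                     # (B, D_LATENT) -> (B, D_HID)
        return self.mlp(memory)


class Shaper(nn.Module):
    """Steer the next-image-token distribution using the translated guidance."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(D_HID, VOCAB)

    def forward(self, logits, guidance):           # logits: (B, VOCAB)
        return logits + self.gate(guidance)        # additive bias on next-token logits


class Invoker(nn.Module):
    """Decide whether to invoke reasoning at this step (trained with RL in the paper)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(D_HID, 1)

    def forward(self, last_hidden):                # (B, D_HID)
        return torch.sigmoid(self.head(last_hidden)) > 0.5   # boolean decision


# Toy decode step: the base model's logits are refined only when the invoker fires.
condenser, translator, shaper, invoker = Condenser(), Translator(), Shaper(), Invoker()
hidden = torch.randn(2, 10, D_HID)                 # stand-in for backbone hidden states
base_logits = torch.randn(2, VOCAB)                # stand-in for backbone next-token logits
if invoker(hidden[:, -1]).any():
    guidance = translator(condenser(hidden))
    base_logits = shaper(base_logits, guidance)
next_token = base_logits.argmax(dim=-1)
```

Under these assumptions, the condenser and translator run only on steps the invoker selects, which is what lets latent reasoning stay cheaper than decoding intermediate thoughts into text and re-encoding images at fixed intervals.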
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings (2026)
- LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning (2026)
- Unified Thinker: A General Reasoning Modular Core for Image Generation (2026)
- Forest Before Trees: Latent Superposition for Efficient Visual Reasoning (2026)
- Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning (2026)
- Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models (2026)
- Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs (2025)