My favourites
Test-Time Scaling with Reflective Generative Model
Paper
• 2507.01951
• Published • 108
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Paper
• 2502.05171
• Published • 154
Autoregressive Diffusion Models
Paper
• 2110.02037
• Published
EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
Paper
• 2502.09509
• Published • 9
Improving the Diffusability of Autoencoders
Paper
• 2502.14831
• Published • 2
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
Paper
• 2410.10733
• Published • 9
DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space
Paper
• 2508.00413
• Published • 5
REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
Paper
• 2504.10483
• Published • 22
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
Paper
• 2508.14444
• Published • 47
MetaCLIP 2: A Worldwide Scaling Recipe
Paper
• 2507.22062
• Published • 37
Waver: Wave Your Way to Lifelike Video Generation
Paper
• 2508.15761
• Published • 38
Qwen-Image Technical Report
Paper
• 2508.02324
• Published • 274
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
Paper
• 2508.18756
• Published • 36
InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
Paper
• 2509.10441
• Published • 31
Why Language Models Hallucinate
Paper
• 2509.04664
• Published • 199
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Paper
• 2509.08519
• Published • 130
Step1X-Edit: A Practical Framework for General Image Editing
Paper
• 2504.17761
• Published • 92
Transition Matching: Scalable and Flexible Generative Modeling
Paper
• 2506.23589
• Published • 1
MMaDA: Multimodal Large Diffusion Language Models
Paper
• 2505.15809
• Published • 98
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Paper
• 2506.23918
• Published • 90
Diffusion Beats Autoregressive in Data-Constrained Settings
Paper
• 2507.15857
• Published • 1
Hierarchical Reasoning Model
Paper
• 2506.21734
• Published • 50
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Paper
• 2509.06818
• Published • 29
Wan-Animate: Unified Character Animation and Replacement with Holistic Replication
Paper
• 2509.14055
• Published • 17
Inpainting-Guided Policy Optimization for Diffusion Large Language Models
Paper
• 2509.10396
• Published • 16
Lynx: Towards High-Fidelity Personalized Video Generation
Paper
• 2509.15496
• Published • 13
Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
Paper
• 2509.19296
• Published • 28
Video models are zero-shot learners and reasoners
Paper
• 2509.20328
• Published • 100
What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
Paper
• 2509.19284
• Published • 23
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Paper
• 2509.20427
• Published • 84
Paper
• 2509.22358
• Published • 2
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
Paper
• 2509.24900
• Published • 53
Diffusion Transformers with Representation Autoencoders
Paper
• 2510.11690
• Published • 170
WithAnyone: Towards Controllable and ID Consistent Image Generation
Paper
• 2510.14975
• Published • 86
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Paper
• 2404.02905
• Published • 74
DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
Paper
• 2510.20766
• Published • 37
Continuous Autoregressive Language Models
Paper
• 2510.27688
• Published • 74
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Paper
• 2511.09611
• Published • 71
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Paper
• 2512.04677
• Published • 177
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Paper
• 2511.22699
• Published • 245
Vision Bridge Transformer at Scale
Paper
• 2511.23199
• Published • 46
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Paper
• 2512.07829
• Published • 24
Towards Scalable Pre-training of Visual Tokenizers for Generation
Paper
• 2512.13687
• Published • 106
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Paper
• 2512.12967
• Published • 111
What matters for Representation Alignment: Global Information or Spatial Structure?
Paper
• 2512.10794
• Published • 9
KlingAvatar 2.0 Technical Report
Paper
• 2512.13313
• Published • 44
Back to Basics: Let Denoising Generative Models Denoise
Paper
• 2511.13720
• Published • 70
PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss
Paper
• 2602.02493
• Published • 46
BitDance: Scaling Autoregressive Generative Models with Binary Tokens
Paper
• 2602.14041
• Published • 53
DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
Paper
• 2602.12205
• Published • 81
Autoregressive Image Generation with Masked Bit Modeling
Paper
• 2602.09024
• Published • 7
ViT-5: Vision Transformers for The Mid-2020s
Paper
• 2602.08071
• Published • 1
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
Paper
• 2602.12705
• Published • 68
Unified Latents (UL): How to train your latents
Paper
• 2602.17270
• Published • 60
dLLM: Simple Diffusion Language Modeling
Paper
• 2602.22661
• Published • 152
Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
Paper
• 2603.02175
• Published • 24
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Paper
• 2505.02567
• Published • 82
Helios: Real Real-Time Long Video Generation Model
Paper
• 2603.04379
• Published • 186
LTX-2: Efficient Joint Audio-Visual Foundation Model
Paper
• 2601.03233
• Published • 176
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Paper
• 2603.06569
• Published • 119
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
Paper
• 2603.09877
• Published • 48
MAISI-v2: Accelerated 3D High-Resolution Medical Image Synthesis with Rectified Flow and Region-specific Contrastive Loss
Paper
• 2508.05772
• Published • 3
Agentic Reasoning for Large Language Models
Paper
• 2601.12538
• Published • 204
Mixture-of-Depths Attention
Paper
• 2603.15619
• Published • 80
Repurposing Geometric Foundation Models for Multi-view Diffusion
Paper
• 2603.22275
• Published • 47
PixelSmile: Toward Fine-Grained Facial Expression Editing
Paper
• 2603.25728
• Published • 117
MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
Paper
• 2603.17187
• Published • 138
MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines
Paper
• 2603.06679
• Published • 5