---
license: mit
tags:
- any-to-any
---

## Introduction
Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate the data imbalance between the two tasks. Equipped with these techniques, OmniMamba achieves performance competitive with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out for its inference efficiency, achieving up to a 119.2× speedup and a 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts.
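
The two ideas above can be pictured with a minimal, generic PyTorch sketch. This is not the OmniMamba implementation (see the GitHub repository for that); the class names, rank, and vocabulary sizes below are illustrative assumptions only, meant to show what decoupled output vocabularies and a task-specific LoRA adapter look like in principle.

```python
import torch
import torch.nn as nn

class TaskSpecificLoRA(nn.Module):
    """Illustrative low-rank adapter wrapped around a frozen linear projection."""
    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)        # shared backbone weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapter starts as a zero update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x))

class DecoupledHeads(nn.Module):
    """Illustrative decoupled vocabularies: one output head per modality."""
    def __init__(self, hidden_dim: int, text_vocab: int, image_vocab: int):
        super().__init__()
        self.text_head = nn.Linear(hidden_dim, text_vocab)
        self.image_head = nn.Linear(hidden_dim, image_vocab)

    def forward(self, hidden: torch.Tensor, task: str) -> torch.Tensor:
        # Next-token prediction is routed to the vocabulary of the modality being generated.
        return self.text_head(hidden) if task == "text" else self.image_head(hidden)

# Usage with made-up sizes: route a hidden state to the image vocabulary.
heads = DecoupledHeads(hidden_dim=2048, text_vocab=32000, image_vocab=8192)
logits = heads(torch.randn(1, 16, 2048), task="image")  # -> shape (1, 16, 8192)
```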

Paper: https://arxiv.org/abs/2503.08686

Code: https://github.com/hustvl/OmniMamba

## Citation
If you find OmniMamba useful in your research or applications, please consider giving us a star 🌟 and citing it using the following BibTeX entry.

```bibtex
@misc{zou2025omnimambaefficientunifiedmultimodal,
      title={OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models},
      author={Jialv Zou and Bencheng Liao and Qian Zhang and Wenyu Liu and Xinggang Wang},
      year={2025},
      eprint={2503.08686},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.08686},
}
```