---
license: apache-2.0
datasets:
- FreedomIntelligence/ALLaVA-4V
- Vision-Flan/vision-flan_191-task_1k
language:
- en
base_model:
- Lin-Chen/open-llava-next-llama3-8b
---
# Adapting Multimodal Large Language Models to Domains via Post-Training (EMNLP 2025)
|
|
This repo contains the **visual-instruction synthesizer** from our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930).
|
|
The main project page is: [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains)
|
|
### 1. Basic Usage: Synthesize a task triplet based on a given image-caption pair
The following example synthesizes an "instruction-informative response-precise response" triplet from the image-caption pair below.
|
|
<p align='left'>
    <img src="/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F650801ced5578ef7e20b33d4%2FmgI_Ayj12_Q_kviWvfAVb.jpeg%26quot%3B width="200">
</p>

<details>
<summary> Click to expand </summary>
|
|
```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

# Define your input image-caption pair here:
## image
url = "/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F650801ced5578ef7e20b33d4%2FmgI_Ayj12_Q_kviWvfAVb.jpeg%26quot%3B
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

## Caption
caption = "Dish: Strawberry Waffles\n\nSteps to prepare:\na). Preheat and grease a waffle iron according to manufacturer's instructions.\nb). Sift flour, baking powder, and salt together in a bowl. Whisk buttermilk, yogurt, butter, eggs, and sugar together in a separate bowl; stir into flour mixture until batter is smooth. Fold strawberries into batter.\nc). Pour about 1/3 cup batter into preheated waffle iron; cook until lightly browned, 5 to 7 minutes. Repeat with remaining batter.\n\nIngredients you'll need:\n(a). 2 1/2 cups all-purpose flour\n(b). 4 teaspoons baking powder\n(c). 3/4 teaspoon salt\n(d). 2 cups buttermilk\n(e). 1/2 cup vanilla Greek-style yogurt\n(f). 1/2 cup butter, melted\n(g). 2 eggs, beaten\n(h). 1 1/2 tablespoons white sugar\n(i). 3/4 cup chopped strawberries, or more to taste"

# =========================== Do NOT need to modify the following ===============================

# Path to synthesizer
model_path = "AdaptLLM/visual-instruction-synthesizer"

# Prompt Hints
caption_hint = "Describe the image."
precise_hint = "Answer with a precise response.\n"
informative_hint = "Answer with an informative response.\n"

# Function to parse predictions
def parse_pred(pred):
    if not pred.endswith("<|end_of_text|>"):
        return []

    pred = pred[:-len("<|end_of_text|>")]

    QA_str_list = pred.split("<|start_header_id|>user<|end_header_id|>\n\n")
    if not pred.endswith("<|eot_id|>"):
        QA_str_list = QA_str_list[:-1]

    QA_list = []
    for QA_str in QA_str_list:
        try:
            assert "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" in QA_str
            Q_str, A_str = QA_str.split("<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")
            Q_str, A_str = Q_str.strip(), A_str[:-len("<|eot_id|>")].strip()
            assert Q_str and A_str
            QA_list.append({"Q": Q_str, "A": A_str})
        except AssertionError:
            pass  # Skip invalid entries

    conversations = []
    for qa_entry in QA_list:
        conversations.append({"from": "human", "value": qa_entry["Q"]})
        conversations.append({"from": "gpt", "value": qa_entry["A"]})
    return conversations

# Function to extract task triplets
def get_task_triplet(pred):
    pred_QAs = parse_pred(pred)
    precise_QAs = {}
    informative_QAs = {}
    collected_QA = None

    for idx in range(0, len(pred_QAs), 2):  # Iterate over question-answer pairs
        question = pred_QAs[idx]["value"]
        answer = pred_QAs[idx + 1]["value"]
        if question.startswith(precise_hint):
            precise_q = question[len(precise_hint):]
            if precise_q in informative_QAs:
                collected_QA = {
                    "Q": precise_q,
                    "precise_A": answer,
                    "informative_A": informative_QAs[precise_q],
                }
                break
            else:
                precise_QAs[precise_q] = answer
        elif question.startswith(informative_hint):
            informative_q = question[len(informative_hint):]
            if informative_q in precise_QAs:
                collected_QA = {
                    "Q": informative_q,
                    "precise_A": precise_QAs[informative_q],
                    "informative_A": answer,
                }
                break
            else:
                informative_QAs[informative_q] = answer

    return collected_QA

# Load the processor
processor = LlavaNextProcessor.from_pretrained(model_path)

# Define image token
image_token = "<|reserved_special_token_4|>"

# Format the prompt
prompt = (
    f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
    f"<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{image_token}\n{caption_hint}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    f"{caption}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
)

# Load the model
model = LlavaNextForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Prepare inputs and generate output
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
answer_start = int(inputs["input_ids"].shape[-1])
output = model.generate(**inputs, max_new_tokens=512)

# Decode predictions
pred = processor.decode(output[0][answer_start:], skip_special_tokens=False)
print(f"## Synthesizer predictions:\n{pred}")

# Extract task triplets
task_triplet = get_task_triplet(pred)
print(f"## Synthesized Task triplet:\n{task_triplet}")
```
</details>
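For reference, `get_task_triplet` returns `None` when no matching question pair is found; otherwise it returns a dict of the following shape. The values below are illustrative placeholders, not actual model output:

```python
# Illustrative shape of the synthesized triplet (placeholder values only):
task_triplet = {
    "Q": "How long should each waffle be cooked?",            # synthesized instruction
    "precise_A": "A short, direct answer, e.g. '5 to 7 minutes'.",
    "informative_A": "A longer, explanatory answer grounded in the caption.",
}
```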
|
|
### 2. Advanced Usage: Convert Image-Caption Pairs into Visual Instructions at Scale
The following steps show how to convert your own data into visual instructions for post-training MLLMs.
|
|
We leverage vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 12.5 hours to convert 100K image-caption pairs.
|
|
<details>
<summary> Click to expand </summary>
|
|
### 1) Setup
Install vLLM using `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source).
```bash
pip install vllm
```
|
|
Clone our code repository and navigate to the inference directory:
```bash
git clone https://github.com/bigai-ai/QA-Synthesizer.git
cd QA-Synthesizer/vllm_inference
SYNTHESIZER=AdaptLLM/visual-instruction-synthesizer
CONSISTENCY_CHECKER=meta-llama/Meta-Llama-3-8B  # Language model for consistency checks
```
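Optionally, you can sanity-check that vLLM can load the synthesizer before launching the full pipeline. The sketch below uses vLLM's offline-inference API with the chat format from Section 1; it is only an assumption-laden example (the image path and prompt are placeholders, and image-placeholder handling may depend on your vLLM version), so prefer the provided `run_synthesis.sh` for actual synthesis.

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Minimal sanity-check sketch (not the official pipeline): load the synthesizer with vLLM
# and run one image-caption pair through it. Paths and prompt below are placeholders.
llm = LLM(model="AdaptLLM/visual-instruction-synthesizer", dtype="float16")

image = Image.open("../data_samples/images/image_xxx.jpg").convert("RGB")
prompt = "..."  # build the same chat-formatted prompt as in Section 1

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```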
|
|
### 2) Prepare Your Image-Caption Pairs
Format your `image_caption_pairs` file to match the following structure (similar to ShareGPT), or use our [data_samples/image_caption_pairs.json](https://github.com/bigai-ai/QA-Synthesizer/blob/main/data_samples/image_caption_pairs.json) for a quick start.

```json
[
  {
    "images": ["image_xxx.jpg"],
    "messages": [
      {
        "content": "<image>instruction",
        "role": "user"
      },
      {
        "content": "response",
        "role": "assistant"
      }
    ]
  },
  ...
]
```
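If your raw data is not already in this format, a minimal conversion sketch is shown below. The `pairs` list, the output file name, and the "Describe the image." instruction (mirroring the caption hint from Section 1) are placeholders for your own data:

```python
import json

# Placeholder: replace with your own (image path, caption) pairs.
pairs = [
    ("images/0001.jpg", "Dish: Strawberry Waffles\n\nSteps to prepare: ..."),
    ("images/0002.jpg", "..."),
]

records = []
for image_path, caption in pairs:
    records.append({
        "images": [image_path],
        "messages": [
            {"content": "<image>Describe the image.", "role": "user"},
            {"content": caption, "role": "assistant"},
        ],
    })

# Write the ShareGPT-like file expected by the synthesis script.
with open("image_caption_pairs.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```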
|
|
### 3) Run Synthesis
|
|
The following command generates task triplets with the synthesizer and applies consistency-based filtering to improve data quality:
|
|
```bash
IMAGE_CAPTION='../data_samples/image_caption_pairs.json'  # Path to image-caption pairs
IMAGE_FOLDER='../data_samples/images'  # Path to the image folder
OUTPUT_DIR='../data_samples/'  # Output directory for synthesized data

# Run synthesis with data parallelism; adjust CUDA devices as needed:
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}
```
|
|
The synthesized output will be saved at:
```bash
${OUTPUT_DIR}/image_caption_and_synthetic_task.json
```
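To take a quick look at the synthesized data before post-training, you can load the file and print the first record; the path below assumes the sample `OUTPUT_DIR` used above:

```python
import json

# Quick inspection of the synthesized file; adjust the path to match your OUTPUT_DIR.
with open("../data_samples/image_caption_and_synthetic_task.json", encoding="utf-8") as f:
    data = json.load(f)

print(f"Number of records: {len(data)}")
print(json.dumps(data[0], ensure_ascii=False, indent=2))  # first record: images + messages
```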
|
|
This output can be used directly for single-stage post-training with frameworks such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
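As an illustration, the sketch below registers the synthesized file as a ShareGPT-style multimodal dataset in LLaMA-Factory's `dataset_info.json`. The entry name is arbitrary and the schema keys follow LLaMA-Factory's documented ShareGPT examples, so verify them against the version you use:

```python
import json

# Hypothetical registration of the synthesized data for LLaMA-Factory-style training.
# Schema keys follow LLaMA-Factory's ShareGPT examples; check your version's docs.
entry = {
    "adaptmllm_synthetic": {
        "file_name": "image_caption_and_synthetic_task.json",
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
}

with open("data/dataset_info.json", "r+", encoding="utf-8") as f:
    info = json.load(f)
    info.update(entry)
    f.seek(0)
    json.dump(info, f, ensure_ascii=False, indent=2)
    f.truncate()
```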
|
|
</details>
|
|
|
|
## Citation
If you find our work helpful, please cite us.
|
|
[Adapt MLLM to Domains](https://huggingface.co/papers/2411.19930) (EMNLP 2025 Findings)
```bibtex
@article{adamllm,
  title={On Domain-Adaptive Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
```
|
|
[Adapt LLM to Domains](https://huggingface.co/papers/2309.09530) (ICLR 2024)
```bibtex
@inproceedings{
  cheng2024adapting,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}
```