PodGPT

An audio-augmented large language model for research and education

🚩 Special Notice

These checkpoints (LoRA adapters) were obtained through continual pre-training of mistralai/Mixtral-8x7B-Instruct-v0.1.

📚 Table of contents

PodGPT
Installation
Quick Start
Performance Evaluation
Dataset Description
- Continual Pre-training Dataset
- Retrieval-augmented Generation (RAG) Database
Benchmarks and Results
Real-world Deployment
Automatic Speech Recognition
Dataset Builder
Upload and Download Models
Structure of the Code
Citation
Contact
Contributions
Acknowledgements

PodGPT

Our proposed PodGPT computational framework for research and education

💻 Installation

pip install -r requirements.txt

To set the cache paths for Transformers and Triton, run the following commands in your terminal. For example, if your path is /projectnb/vkolagrp/:

export TRANSFORMERS_CACHE=/projectnb/vkolagrp/.cache
export HF_HOME=/projectnb/vkolagrp/.cache
export HF_DATASETS_CACHE=/projectnb/vkolagrp/.cache
export TRITON_CACHE_DIR=/projectnb/vkolagrp/triton/.triton

Please change /projectnb/vkolagrp/ to your own path. After running the commands, all your Transformers models and datasets will be saved in the paths you defined.
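The same cache locations can also be set from Python, as long as this happens before any Hugging Face library is imported. The following is a minimal sketch: the helper name `set_cache_dirs` is hypothetical and not part of the PodGPT codebase, and the directory layout simply mirrors the export commands above.

```python
import os


def set_cache_dirs(root):
    """Point the Hugging Face and Triton caches at sub-folders of `root`."""
    cache_dir = os.path.join(root, ".cache")
    paths = {
        "TRANSFORMERS_CACHE": cache_dir,
        "HF_HOME": cache_dir,
        "HF_DATASETS_CACHE": cache_dir,
        "TRITON_CACHE_DIR": os.path.join(root, "triton", ".triton"),
    }
    # Export the variables for the current process and its children.
    os.environ.update(paths)
    return paths


if __name__ == "__main__":
    # Call this before importing transformers or datasets so that the
    # libraries pick up the new locations.
    for name, value in set_cache_dirs("/projectnb/vkolagrp/").items():
        print(f"{name}={value}")
```

Unlike the export commands, this only affects the running Python process, so it would need to sit at the top of each entry-point script.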

🚀 Quick start

🐣 Train lightweight models

For lightweight models (2B, 7B, 8B, and 9B), we optimize the entire model. Please check and set up the hyperparameters and Hugging Face READ/WRITE tokens in config_small.yml.

python main_small.py

🐥 Train heavy models

For larger, heavy models (>9B), we optimize a Low-Rank Adaptation (LoRA) adapter. Please check and set up the hyperparameters and Hugging Face READ/WRITE tokens in config_large.yml.

python main_large.py

After training completes, multiple LoRA adapter checkpoints will be saved. By default, model_max_length is set to train_max_len in the code, as seen here. To ensure proper inference with vLLM, open the tokenizer_config.json file in the checkpoint folder and reset model_max_length to the original value of your base model.

This step is crucial because the vLLM engine uses the adapter's tokenizer instead of the base model's tokenizer. If the value is not reset, the vLLM engine may truncate inputs to the model_max_length used during training, which can limit the model's performance on longer inputs.
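The manual edit of tokenizer_config.json can be scripted. The following is a minimal sketch, not part of the PodGPT codebase: the function name `reset_model_max_length` is hypothetical, and the correct `base_max_length` must be looked up in the base model's own tokenizer_config.json.

```python
import json
from pathlib import Path


def reset_model_max_length(checkpoint_dir, base_max_length):
    """Overwrite model_max_length in a checkpoint's tokenizer_config.json.

    Returns the value that was stored before the change (None if absent).
    """
    config_path = Path(checkpoint_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Remember the training-time value, then restore the base model's value.
    previous = config.get("model_max_length")
    config["model_max_length"] = base_max_length

    # All other tokenizer settings are written back unchanged.
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return previous
```

Running this once per saved checkpoint folder, before pointing vLLM at the adapter, has the same effect as editing the file by hand.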

🐤 Train quantized large models

We also provide support for quantizing larger models, e.g., the LLaMA 3.3 70B model, using the GPTQ algorithm and then optimizing the LoRA. The large models can be deployed on consumer GPUs after quantization.

Due to the suspended development of the AutoGPTQ package, we strongly recommend conducting quantization using the GPTQModel package!

First, install GPTQModel:

pip install -v gptqmodel --no-build-isolation

Then,

python quantization_GPTQModel.py "meta-llama/Llama-3.3-70B-Instruct" "./gptq_model" --bits 4 --group_size 128 --seqlen 2048 --damp 0.01 --desc_act 1 --dtype bfloat16

Alternatively, we can use the Hugging Face transformers package to do the quantization.

python quantization_HF.py --repo "meta-llama/Meta-Llama-3.1-70B-Instruct" --bits 4 --group_size 128

Lastly, we provide a quantization script based on the Python AutoGPTQ package.
Please install AutoGPTQ with pip install auto-gptq==0.6.0 --no-build-isolation.

python quantization.py "meta-llama/Meta-Llama-3.1-70B-Instruct" "./gptq_model" --bits 4 --group_size 128 --desc_act 1 --dtype bfloat16 --seqlen 2048 --damp 0.01

After the quantization process, you can upload the quantized model to your Hugging Face account, for example:

python upload_quantized_model.py --repo "shuyuej/Llama-3.3-70B-Instruct-GPTQ" --folder_path "./gptq_model"

Finally, we optimize the LoRA adapter,

python main_quantization.py

Quantized Model Training Special Notice:

Stable training of the quantized model with a LoRA adapter is tricky. We found the fine-tuned model tends to repeat the answer during the generation process. Here, we believe we have fundamentally solved this problem. For details, please check this GitHub solution. Also, please check our configurations and model loader.

Here is the training loss of our quantized LLaMA 3.3 70B model
Check this solution if you cannot successfully start the model training.
Check this solution if your adapters cannot be saved due to PEFT.
Model quantization, model training, checkpoint saving, and vLLM inference can all raise unexpected issues. Please submit a GitHub issue if you cannot solve one. We have likely run into most of these problems before, in both single-GPU and distributed settings (e.g., 4 A100 80G GPUs).

📊 Performance evaluation

All inferences are conducted using the vLLM engine. We use inference.py to sequentially evaluate the performance of multiple checkpoints (models). Please check here for more information.

To enable vLLM distributed inference, run the following command in your terminal:

export VLLM_WORKER_MULTIPROC_METHOD=spawn

Please note that this command is unnecessary if you are using a single GPU for inference. It is only required for distributed inference across multiple GPUs.
For multi-GPU inference and CUDA memory release, please check this solution for detailed guidance.
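If the inference script is launched from Python rather than from a prepared shell, the variable can be set conditionally before the vLLM engine is created. This is a minimal sketch under that assumption: `configure_vllm_multiproc` is a hypothetical helper, and the GPU count is passed in by the caller rather than detected.

```python
import os


def configure_vllm_multiproc(num_gpus):
    """Set VLLM_WORKER_MULTIPROC_METHOD=spawn only for multi-GPU inference.

    Returns True if the variable was set, False otherwise.
    """
    if num_gpus > 1:
        os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
        return True
    # Single-GPU inference does not need the variable.
    return False
```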

📜 Prompt format

We simply use Directly answer the best option: instead of Answer: to better guide LLMs to generate the best option and to make it easier to extract the best option from the responses.
Please modify these lines if you want to try other prompts.

english_prompt = "Directly answer the best option:"
english_prompt_pubmedqa = "Directly answer yes/no/maybe:"
hindi_prompt = "सीधे सबसे अच्छे विकल्प के साथ जवाब दें:"
french_prompt = "Répondez directement avec la meilleure option:"
spanish_prompt = "Responde directamente con la mejor opción:"
chinese_prompt = "直接回答最优选项:"

🛠 Model inference

Sequentially evaluate the performance of multiple checkpoints (models).
Please note that we use --eval_pretrain to indicate whether to evaluate the original pre-trained model.

python inference.py --mode small --eval_pretrain True --id 35166 52749 70332 87915

🤖 OpenAI ChatGPT support

We also offer support for running OpenAI ChatGPT inference using API. Please enter your OpenAI API Key here.

Please note that OpenAI ChatGPT API is extremely expensive.
Please only use it if you have a budget for it!

python inference.py --mode chatgpt

📚 Dataset description

Please follow our instructions to transcribe your own podcasts and build your own dataset.

Continual pre-training dataset

The podcasts data used for the continual pre-training of PodGPT:

Retrieval-augmented generation database

Scientific literature used for retrieval-augmented generation (RAG):

🏆 Benchmarks and results

Multilingual benchmarks

We used a comprehensive set of medical benchmarks in some of the most widely spoken languages in the world: English, Mandarin, French, Spanish, and Hindi.

Language	Dataset	# of test examples	# of choices	Link	Ref
English	MedExpQA	125	5	Link	Paper
	MedQA	1273	4	Link	Paper
	MedMCQA	4183	4	Link	Paper
	PubMedQA	500	3	Link	Paper
	MMLU - Anatomy	135	4	Link	Paper
	MMLU - Clinical Knowledge	265	4	Link	Paper
	MMLU - College Biology	144	4	Link	Paper
	MMLU - College Medicine	173	4	Link	Paper
	MMLU - Medical Genetics	100	4	Link	Paper
	MMLU - Professional Medicine	272	4	Link	Paper
French	MedExpQA	125	5	Link	Paper
	MedMCQA	622	5	Link	Paper
	MMLU - Anatomy	135	4	Link	Paper
	MMLU - Clinical Knowledge	265	4	Link	Paper
	MMLU - College Biology	144	4	Link	Paper
	MMLU - College Medicine	173	4	Link	Paper
	MMLU - Medical Genetics	100	4	Link	Paper
	MMLU - Professional Medicine	272	4	Link	Paper
Spanish	HEAD-QA	2742	4	Link	Paper
	MedExpQA	125	5	Link	Paper
	MMLU - Anatomy	135	4	Link	Paper
	MMLU - Clinical Knowledge	265	4	Link	Paper
	MMLU - College Biology	144	4	Link	Paper
	MMLU - College Medicine	173	4	Link	Paper
	MMLU - Medical Genetics	100	4	Link	Paper
	MMLU - Professional Medicine	272	4	Link	Paper
Chinese	MedQA-MCMLE	3426	4	Link	Paper
	CMMLU - Anatomy	148	4	Link	Paper
	CMMLU - Clinical Knowledge	237	4	Link	Paper
	CMMLU - College Medicine	273	4	Link	Paper
	CMMLU - Medical Genetics	176	4	Link	Paper
	CMMLU - Traditional Chinese Medicine	185	4	Link	Paper
	CMMLU - Virology	169	4	Link	Paper
Hindi	MMLU - Anatomy	135	4	Link	Paper
	MMLU - Clinical Knowledge	265	4	Link	Paper
	MMLU - College Biology	144	4	Link	Paper
	MMLU - College Medicine	173	4	Link	Paper
	MMLU - Medical Genetics	100	4	Link	Paper
	MMLU - Professional Medicine	272	4	Link	Paper

Performance on in-domain benchmarks

Performance of retrieval-augmented generation

Zero-shot cross-lingual performance

🔥 Real-world deployment

For real-world deployment, please refer to the vLLM Distributed Inference and Serving and OpenAI Compatible Server. We provide a deployment script here.

The vLLM version we are using is 0.6.2. Please check this version.

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API. By default, it starts the server at http://localhost:8000.

vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
    --quantization gptq \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --distributed-executor-backend mp \
    --pipeline-parallel-size 4 \
    --api-key token-abc123

Please check here if you want to change the Engine Arguments.
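Before sending chat requests, it can help to confirm that the server is up and that the API key is accepted. The sketch below queries the OpenAI-compatible /v1/models endpoint using only the standard library. The helper names are hypothetical, and the base URL and key are the defaults from the serving command above.

```python
import json
import urllib.request


def build_models_request(base_url="http://localhost:8000", api_key="token-abc123"):
    """Build a GET request for the server's /v1/models endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def list_served_models(base_url="http://localhost:8000", api_key="token-abc123"):
    """Return the model IDs reported by a running server."""
    request = build_models_request(base_url, api_key)
    # Raises urllib.error.URLError if the server is not reachable.
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [model["id"] for model in payload["data"]]
```

With the server running, calling `list_served_models()` should return a list that includes the name passed to vllm serve.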

If you would like to deploy your LoRA adapter, please refer to the vLLM documentation for a detailed guide.
It provides step-by-step instructions on how to serve LoRA adapters effectively in a vLLM environment.
We have also shared our trained LoRA adapter here. Please download it manually if needed.

git lfs install
git clone https://huggingface.co/shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ

To download the safetensors using git clone, ensure you initialize Git LFS with git lfs install. If you encounter the error "git: 'lfs' is not a git command," refer to this StackOverflow issue for troubleshooting. Alternatively, you can manually download the git-lfs:

$ wget https://github.com/git-lfs/git-lfs/releases/download/v3.2.0/git-lfs-linux-amd64-v3.2.0.tar.gz
$ tar -xzf git-lfs-linux-amd64-v3.2.0.tar.gz

Download the LoRA model from Hugging Face to your folder, using the extracted binary in the current directory:

$ ./git-lfs-3.2.0/git-lfs install
$ ./git-lfs-3.2.0/git-lfs clone https://huggingface.co/shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ
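If Git LFS is not available at all, the adapter files can also be fetched with the huggingface_hub Python package. This is a minimal sketch, assuming huggingface_hub is installed. The helper names are hypothetical, and the default folder name is chosen to match what git clone would create so that the serving command below still finds the adapter.

```python
def default_local_dir(repo_id):
    """Mirror `git clone`, which names the folder after the repository."""
    return repo_id.rstrip("/").split("/")[-1]


def download_adapter(repo_id, local_dir=None):
    """Download every file in a Hugging Face repo and return the local path."""
    # Imported lazily so the helper above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir or default_local_dir(repo_id),
    )
```

For example, `download_adapter("shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ")` would place the files in a folder named Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ under the current directory.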

Then, use the vLLM to serve the base model with the LoRA adapter by including the --enable-lora flag and specifying --lora-modules:

vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
    --quantization gptq \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --distributed-executor-backend mp \
    --pipeline-parallel-size 4 \
    --api-key token-abc123 \
    --enable-lora \
    --lora-modules adapter=Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ

Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the openai python package:

#!/usr/bin/env python
# coding=utf-8

import time
import asyncio

from openai import AsyncOpenAI

# Our system prompt
SYSTEM_PROMPT = (
    "I am PodGPT, a large language model developed by the Kolachalama Lab in Boston, "
    "specializing in science, technology, engineering, mathematics, and medicine "
    "(STEMM)-related research and education, powered by podcast audio.\n"
    "I provide information based on established scientific knowledge but must not offer "
    "personal medical advice or present myself as a licensed medical professional.\n"
    "I will maintain a consistently professional and informative tone, avoiding humor, "
    "sarcasm, and pop culture references.\n"
    "I will prioritize factual accuracy and clarity while ensuring my responses are "
    "educational and non-harmful, adhering to the principle of 'do no harm'.\n"
    "My responses are for informational purposes only and should not be considered a "
    "substitute for professional consultation."
)

# Initialize the AsyncOpenAI client
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)


async def main(message):
    """
    Streaming responses with async usage and "await" with each API call:
    Reference: https://github.com/openai/openai-python?tab=readme-ov-file#streaming-responses
    :param message: The user query
    """
    start_time = time.time()
    stream = await client.chat.completions.create(
        model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": message,
            }
        ],
        max_tokens=2048,
        temperature=0.2,
        top_p=1,
        stream=True,
        extra_body={
            "ignore_eos": False,
            # https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ/raw/main/config.json#L10-L14
            "stop_token_ids": [128001, 128008, 128009],
        },
    )

    print(f"The user's query is\n {message}\n  ")
    print("The model's response is\n")
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")
    print(f"\nInference time: {time.time() - start_time:.2f} seconds\n")
    print("=" * 100)


if __name__ == "__main__":
    # Some random user queries
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
        "Can you tell me more about Bruce Lee?",
        "What are the differences between DNA and RNA?",
        "What is dementia and Alzheimer's disease?",
        "Tell me the differences between Alzheimer's disease and dementia"
    ]
    
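    # Conduct model inference inside a single event loop, so the shared
    # AsyncOpenAI client is never reused across closed event loops
    async def run_all():
        for message in prompts:
            await main(message=message)
            print("\n\n")

    asyncio.run(run_all())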

Here is a demo of the real-world model inference and deployment
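When the server is started with --enable-lora and --lora-modules, vLLM exposes each adapter under the name given on the left-hand side of the flag, and a request reaches the adapter by passing that name as the model. The following is a minimal sketch, assuming the adapter was registered as `adapter` as in the serving command above. `build_chat_request` is a hypothetical helper, and the sampling values are copied from the client script.

```python
BASE_MODEL = "shuyuej/Llama-3.3-70B-Instruct-GPTQ"
LORA_NAME = "adapter"  # left-hand side of --lora-modules adapter=<path>


def build_chat_request(message, use_lora=True, system_prompt=None):
    """Return keyword arguments for client.chat.completions.create()."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": message})
    return {
        # The model name selects the LoRA adapter or the plain base model.
        "model": LORA_NAME if use_lora else BASE_MODEL,
        "messages": messages,
        "max_tokens": 2048,
        "temperature": 0.2,
    }
```

With the client from the script above, a call would look like `await client.chat.completions.create(**build_chat_request("What is dementia?"), stream=True)`.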

🎯 Automatic speech recognition

In this file, we provide an Automatic Speech Recognition (ASR) service.

python audio2text.py

⚒️ Dataset builder

We used the following code to pre-process our transcripts and generate the training dataset.

python database_builder.py

🛠️ Upload and download models

In the scripts folder, we offer support for both uploading and downloading models.

To upload your checkpoints to a Hugging Face model repo:

python upload_model.py --repo "shuyuej/DrGemma2B" --id 35166 52749 70332 87915

To download your model or files from a Hugging Face repo:

python download_model.py --repo "shuyuej/DrGemma2B" --repo_type "model" --save_dir "./save_folder"

🖼️ Structure of the code

At the root of the project, you will see:

├── config_benchmark.yml
├── config_chatgpt.yml
├── config_large.yml
├── config_quantization.yml
├── config_small.yml
├── main_large.py
├── main_quantization.py
├── main_small.py
├── lib
│   ├── data_manager.py
│   ├── evaluation.py
│   ├── model_loader_large.py
│   ├── model_loader_quantization.py
│   └── model_loader_small.py
├── inference
│   └── inference.py
├── quantization
│   ├── model_split.py
│   ├── quantization.py
│   ├── quantization_HF.py
│   ├── quantization_GPTQModel.py
│   └── upload_quantized_model.py
├── download_files
│   ├── download_model_from_hf.py
│   └── download_model_to_local.py
├── requirements.txt
├── benchmark
├── results
├── save_folder
├── scripts
│   ├── audio2text.py
│   ├── database_builder.py
│   ├── download_model.py
│   ├── deployment.py
│   └── upload_model.py
└── utils
    ├── answer_utils.py
    ├── benchmark_utils.py
    ├── eval_utils.py
    └── utils.py

🙏 Citation

If you find our work useful in your research, please consider citing it in your publications. We provide a BibTeX entry below.

@article{Jia2024podgpt,
    author = {Jia, S and Bit, S and Searls, E and Lauber, MV and Claus, LA and Fan, P and Jasodanand, VH and Veerapaneni, D and Wang, WM and Au, R and Kolachalama, VB},
    title = {{PodGPT}: An audio-augmented large language model for research and education},
    year = {2025},
    doi = {10.1038/s44385-025-00022-0},
    journal = {npj Biomedical Innovations},
    volume = {2},
    number = {26},
}

📧 Contact

Core Contributor and Maintainer (Equal Contributions):

Database Contributor and Maintainer:

If you have any questions, please drop us an email at brucejia@bu.edu, sbit@bu.edu, and nsearls@bu.edu.

🔨 Contributions

We always welcome contributions to help make PodGPT better. If you would like to contribute, please submit a pull request.

🙌 Acknowledgements

This repository is maintained by members of the Kolachalama Laboratory.
