Unsloth issue - BUG
When loading a gpt-oss-20b model exported to MXFP4 GGUF via Unsloth, Ollama produces completely unrelated output regardless of the prompt. The model ignores the user's input entirely and generates random content from its pretraining distribution.
GitHub PR for this issue:
https://github.com/unslothai/unsloth/pull/4087
Document Intelligence: Fine-tuned gpt-oss-20b
This project fine-tunes OpenAI's gpt-oss-20b (an open-weight GPT model) for structured document extraction, using Unsloth QLoRA on a single NVIDIA RTX A4000 (16GB).
The model extracts structured JSON data from three document types: invoices/receipts, legal contracts, and general document Q&A.
What It Does
| Document Type | Input | Output |
|---|---|---|
| Invoice / Receipt | Receipt or invoice text | Vendor, date, line items, subtotal, tax, total |
| Legal Contract | Contract excerpt | Parties, effective date, key terms, obligations, termination |
| General Document | Document + question | Direct answer with supporting context |
Model Details
| Property | Value |
|---|---|
| Base model | unsloth/gpt-oss-20b (OpenAI open-source release) |
| Fine-tuning method | QLoRA (4-bit quantization + LoRA adapters) |
| Training framework | Unsloth + TRL SFT |
| Hardware | NVIDIA RTX A4000 16GB |
| Training date | February 18, 2026 |
Training Configuration
| Parameter | Value |
|---|---|
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 2 |
| Batch size | 1 |
| Gradient accumulation | 16 (effective batch = 16) |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Optimizer | AdamW 8-bit |
| Max sequence length | 1536 |
| Precision | bfloat16 |
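The derived quantities in this table can be checked with a few lines of arithmetic. This is a minimal sketch: the dictionary keys are hypothetical names (not the variables used in train_moe.py), a single GPU is assumed, and the LoRA scaling factor assumes the standard alpha / r convention.

```python
# Hypothetical key names; values are copied from the Training Configuration table.
config = {
    "lora_r": 32,
    "lora_alpha": 64,
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 16,
}

# Number of sequences contributing to each optimizer update (single GPU assumed).
effective_batch = (
    config["per_device_batch_size"] * config["gradient_accumulation_steps"]
)

# Standard LoRA convention: the adapter update is scaled by alpha / r.
lora_scale = config["lora_alpha"] / config["lora_r"]

print(f"effective batch: {effective_batch}, LoRA scale: {lora_scale}")
```

With rank 32 and alpha 64, the adapter update is scaled by a factor of 2 under this convention.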
Training Data
| Dataset | Source | Examples | Document Type |
|---|---|---|---|
| CORD v2 | Naver Clova | 1,000 | Receipts / Invoices |
| CUAD | Atticus Project | 5,000 | Legal Contracts |
| DocVQA | HuggingFace M4 | 5,000 | General Document Q&A |
| Total | | 11,000 | |
Train/eval split: 90/10 (9,900 train / 1,100 eval)
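The split sizes and the resulting number of optimizer steps can be worked out from the figures above. The sketch below assumes that a final partial gradient-accumulation window still counts as one step and that warmup steps are rounded up; neither detail is stated in this README.

```python
import math

total_examples = 11_000
eval_percent = 10      # 90/10 train/eval split
effective_batch = 16   # batch size 1 x gradient accumulation 16
epochs = 2
warmup_ratio = 0.05

# Integer arithmetic avoids floating-point rounding in the split.
eval_examples = total_examples * eval_percent // 100
train_examples = total_examples - eval_examples

# Assumption: a trailing partial accumulation window counts as a full step.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Assumption: warmup steps are rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(train_examples, eval_examples, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the total comes to 1,238 steps, which agrees with the final checkpoint name (checkpoint-1238) listed under Output Files.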
All examples are formatted as ChatML conversations with task-specific system prompts. Images are not included (text-only training).
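As an illustration of what a chat-formatted, text-only training record might look like, the sketch below builds one JSONL line per example. The system prompt wording, the `messages` field name, and the `make_example` helper are all assumptions for illustration; the actual prompts and schema produced by prepare_training_data.py are not shown in this README.

```python
import json

# Hypothetical system prompts; the real task-specific prompts are not given here.
SYSTEM_PROMPTS = {
    "invoice": "Extract vendor, date, line items, subtotal, tax, and total as JSON.",
    "contract": "Extract parties, effective date, key terms, obligations, and termination as JSON.",
    "qa": "Answer the question using only the provided document text.",
}


def make_example(task, document_text, target):
    """Build one chat-style training record and serialize it as a JSONL line."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPTS[task]},
            {"role": "user", "content": document_text},
            {"role": "assistant", "content": target},
        ]
    }
    return json.dumps(record, ensure_ascii=False)


line = make_example(
    "invoice",
    "COFFEE HOUSE Date: 2024-02-15 Cappuccino $4.50",
    json.dumps({"vendor": "Coffee House", "date": "2024-02-15"}),
)
print(line)
```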
Framework Versions
| Library | Version |
|---|---|
| Unsloth | 2026.2.1 |
| TRL | 0.24.0 |
| Transformers | 4.57.6 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 |
Output Files
output/
├── lora_adapters/                     # LoRA adapter weights (88MB); use with base model
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   └── tokenizer files
├── checkpoint-1100/                   # Training checkpoint
├── checkpoint-1200/                   # Training checkpoint
├── checkpoint-1238/                   # Final checkpoint
└── gguf/
    └── doc-intel_gguf/
        ├── gpt-oss-20b.MXFP4.gguf     # Quantized model (13GB, MXFP4)
        └── Modelfile                  # Ollama deployment config
Project Scripts
| Script | Purpose |
|---|---|
| prepare_training_data.py | Converts CORD, CUAD, DocVQA datasets to JSONL training format |
| train_moe.py | QLoRA fine-tuning with Unsloth |
| test_moe.py | Runs test cases + interactive mode against the trained model |
| export_gguf.py | Exports trained model to GGUF format for Ollama / llama.cpp |
| add_sroie.py | Appends additional CORD (val/test) examples to training data |
Quickstart
Run with Python (LoRA adapters)
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="output/lora_adapters",
max_seq_length=1536,
load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
Run with Ollama (GGUF)
ollama create doc-intel -f output/gguf/doc-intel_gguf/Modelfile
ollama run doc-intel "Extract all information from this receipt: ..."
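Once the model has been created in Ollama, it can also be queried programmatically. The following is a minimal sketch that assumes a local Ollama server on its default port (11434) and the `/api/generate` endpoint; the `build_request` and `extract` helper names are hypothetical and not part of this project.

```python
import json
import urllib.request

# Assumed default local Ollama endpoint; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(document_text, model="doc-intel"):
    """Build a non-streaming generate request for the locally created model."""
    payload = {
        "model": model,
        "prompt": f"Extract all information from this receipt: {document_text}",
        "stream": False,  # ask for a single JSON response instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract(document_text):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(document_text), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Calling `extract("...")` with receipt text returns the raw model output as a string, which still needs to be parsed as JSON by the caller.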
Run test suite
python3 test_moe.py
Example Output
Input (Invoice):
COFFEE HOUSE - Date: 2024-02-15
Cappuccino $4.50 | Croissant $3.25 | Latte $5.00
Subtotal: $12.75 | Tax: $1.02 | Total: $13.77
Output:
{
  "vendor": "Coffee House",
  "date": "2024-02-15",
  "items": [
    {"name": "Cappuccino", "count": 1, "price": "4.50"},
    {"name": "Croissant", "count": 1, "price": "3.25"},
    {"name": "Latte", "count": 1, "price": "5.00"}
  ],
  "subtotal": "12.75",
  "tax": "1.02",
  "total": "13.77"
}
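Because the model emits amounts as strings, downstream code may want to sanity-check the arithmetic before trusting an extraction. The sketch below parses the example output above and verifies that the line items add up to the subtotal and that subtotal plus tax equals the total. The `check_totals` helper is a hypothetical illustration, not part of the project scripts.

```python
import json
from decimal import Decimal

raw_output = """
{
  "vendor": "Coffee House",
  "date": "2024-02-15",
  "items": [
    {"name": "Cappuccino", "count": 1, "price": "4.50"},
    {"name": "Croissant", "count": 1, "price": "3.25"},
    {"name": "Latte", "count": 1, "price": "5.00"}
  ],
  "subtotal": "12.75",
  "tax": "1.02",
  "total": "13.77"
}
"""


def check_totals(data):
    """Return True if items sum to the subtotal and subtotal + tax equals the total."""
    # Decimal avoids binary floating-point errors when comparing currency amounts.
    items_sum = sum(Decimal(item["price"]) * item["count"] for item in data["items"])
    subtotal = Decimal(data["subtotal"])
    tax = Decimal(data["tax"])
    total = Decimal(data["total"])
    return items_sum == subtotal and subtotal + tax == total


invoice = json.loads(raw_output)
print(check_totals(invoice))
```

A check like this catches extractions where the model misreads a digit, since the amounts would no longer reconcile.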