# aquif-3.5-Nano-1B
aquif-3.5-Nano-1B is a lightweight yet capable language model with 1.72B parameters, delivering a strong performance-to-efficiency ratio for resource-constrained deployments. Built on Qwen3-1.7B with comprehensive instruction tuning, the model achieves competitive results across reasoning, mathematics, and code generation tasks while remaining deployable on consumer-grade hardware.
With a 40K token context window and bfloat16 precision, aquif-3.5-Nano-1B enables practical applications requiring fast inference and minimal memory overhead.
## Model Overview
| Attribute | Value |
|---|---|
| Total Parameters | 1.72B |
| Context Window | 40K tokens |
| Hidden Size | 2048 |
| Attention Heads | 16 |
| Key-Value Heads | 8 |
| Hidden Layers | 28 |
| Activation | SiLU |
| Precision | BF16 |
| Model Type | Causal Language Model (Qwen3) |
| Multilingual | 10 languages |
| License | Apache 2.0 |
## Key Features

### Efficient Architecture
aquif-3.5-Nano-1B achieves remarkable performance density through:
- Optimized Layer Configuration: 28 layers with full attention throughout, balancing capacity and efficiency
- Extended Context: 40K token window supports complex reasoning and long-document processing
- Memory Efficient: Designed for deployment on devices with 8GB+ VRAM (see the estimate after this list); quantization support available for smaller footprints
- Fast Inference: Minimal parameter count enables rapid token generation and batch processing
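As a rough sanity check on the 8GB figure, the weights alone occupy about 3.2 GiB in BF16; the rest is headroom for activations and the KV cache. A minimal back-of-envelope sketch, using the parameter count from the table above and the 2 bytes per parameter implied by BF16:

```python
# Back-of-envelope estimate of weight memory in BF16 (weights only;
# activations and the KV cache add further overhead at inference time).
params = 1.72e9      # total parameters, from the Model Overview table
bytes_per_param = 2  # BF16 stores 2 bytes per parameter

weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB of weights")  # ~3.2 GiB
```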
### Strong Core Capabilities
- Reasoning: Excels at multi-step logical inference and problem-solving
- Mathematics: Robust performance on mathematical reasoning and calculation tasks
- Code Generation: Proficient in generating and understanding code across multiple programming languages
- Instruction Following: Refined through comprehensive instruction-tuning for reliable task execution
### Multilingual Support
Native support for 10 languages: English, German, Italian, Portuguese, French, Hindi, Spanish, Thai, Chinese, and Japanese.
## Evaluation

### Benchmark Performance
| Metric | aquif-3.5-Nano-1B | Ministral 3 3B Instruct | aquif-3.5-3B | Granite 4.0 H 1B | Qwen3-1.7B |
|---|---|---|---|---|---|
| MMLU | 72.9 | 70.7 | 70.2 | 59.7 | 59.1 |
| GPQA Diamond | 42.0 | 35.8 | 35.8 | 29.7 | 27.7 |
| AIME 2025 | 28.7 | 22.0 | 13.4 | 6.3 | 7.3 |
| LiveCodeBench | 24.8 | 24.7 | 23.1 | 11.5 | 12.6 |
| Average | 42.1 | 38.3 | 35.6 | 26.8 | 26.7 |
### Performance Analysis
Despite being a lightweight nano model, aquif-3.5-Nano-1B demonstrates exceptional capability relative to parameter count:
- MMLU: 72.9% accuracy, significantly outperforming base Qwen3-1.7B and Granite 4.0 H 1B
- GPQA Diamond: 42.0% on expert-level questions, indicating robust reasoning capability
- AIME 2025: 28.7% on advanced mathematics, demonstrating substantial improvement over comparable models
- LiveCodeBench: 24.8% on real-world programming tasks, competitive with larger instruction-tuned variants
The model shows particular strength in reasoning and technical tasks, making it suitable for applications requiring intelligence without significant computational overhead.
## Installation

```bash
pip install transformers torch
```

For faster inference with quantization support:

```bash
pip install transformers torch bitsandbytes
```
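A minimal generation sketch using the standard transformers API. The repository id `aquif-ai/aquif-3.5-Nano-1B` comes from this model card; the prompt and `max_new_tokens` value are illustrative, not official defaults:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aquif-ai/aquif-3.5-Nano-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the model's native BF16 precision
    device_map="auto",           # place weights on GPU when available
)

# Qwen3-style models ship a chat template, so format prompts through it.
messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```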
## Technical Specifications
- Architecture: Qwen3 Causal Language Model
- Attention Mechanism: Full attention across all 28 layers
- Position Encoding: RoPE (Rotary Position Embeddings) with theta=1,000,000
- Normalization: RMSNorm with epsilon=1e-6
- Vocabulary Size: 151,936 tokens
- Head Dimension: 128
- Intermediate Size: 6,144
- Training Data Format: Instructions and reasoning tasks in multilingual contexts
- Attention Dropout: 0.0
- KV Caching: Enabled for efficient multi-turn inference
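Most of the values above can be verified directly from the published config. A small sketch, assuming the field names of the standard Hugging Face Qwen3 configuration:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("aquif-ai/aquif-3.5-Nano-1B")

print(config.num_hidden_layers)    # 28
print(config.hidden_size)          # 2048
print(config.num_attention_heads)  # 16
print(config.num_key_value_heads)  # 8
print(config.intermediate_size)    # 6144
print(config.rope_theta)           # 1000000
print(config.rms_norm_eps)         # 1e-06
print(config.vocab_size)           # 151936
```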
## Use Cases
aquif-3.5-Nano-1B excels at:
- Edge Deployment: Real-time inference on resource-limited devices
- API Services: Cost-effective inference at scale with minimal latency
- Research Prototyping: Fast experimentation with instruction-following models
- Educational Applications: Learning model behavior without computational barriers
- Local Processing: Privacy-preserving on-device inference
- Embedded Systems: Integration into IoT and edge computing environments
- Mathematical Reasoning: Problem-solving and technical explanation tasks
- Code Assistance: Programming help and code generation within constrained compute budgets
## Limitations and Considerations
- Parameter Scale: Though efficient, the model's smaller capacity relative to 3B+ models may limit performance on highly complex tasks
- Context Length: The 40K token window supports extended reasoning but falls short of frontier-model context lengths
- Hardware Optimization: Best performance with recent hardware supporting BF16; FP32 inference available but slower
- Specialized Domains: May require domain-specific fine-tuning for niche applications
- Real-Time Requirements: Suitable for most applications; extremely latency-critical scenarios may benefit from optimization or quantization
## Performance Optimization
- Quantization: Use INT8 quantization to reduce the memory footprint from ~4GB to 2-2.5GB (see the sketch after this list)
- Flash Attention: Compatible with flash-attention implementations for faster inference
- KV Caching: Leverages caching for efficient multi-turn conversations
- Batch Inference: Process multiple prompts simultaneously for throughput optimization
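A sketch of the INT8 loading path via bitsandbytes. `BitsAndBytesConfig` is the standard transformers interface for this; the actual memory savings depend on hardware and sequence length:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "aquif-ai/aquif-3.5-Nano-1B"

# Store weights in INT8; compute still runs in mixed precision.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # bitsandbytes requires a CUDA device
)
```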
## Acknowledgements
- Qwen Team: Base architecture and foundational model
- HuggingFace: Model infrastructure and community ecosystem
- aquif AI Research Team: Instruction-tuning optimization and performance refinement
## License
This project is released under the Apache 2.0 License.
Made in 🇧🇷
© 2025 aquif AI. All rights reserved.