Model Card for FontDiffuser
Model Details
Model Type
- Architecture: Diffusion-based Font Generation Model
- Framework: PyTorch + Hugging Face Diffusers
- Scheduler: DPM-Solver++ (configurable: dpmsolver++ / dpmsolver)
- Guidance: Classifier-free guidance
- Base Model: FontDiffuser with Content and Style Encoders
Model Components
- UNet: Main diffusion model for image generation
- Content Encoder: Extracts character structure information
- Style Encoder: Extracts font style features
- DDPM/DPM Scheduler: Noise scheduling for the diffusion process (see the sketch below)
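For orientation, a minimal sketch of constructing a DPM-Solver++ scheduler with Diffusers. This is an illustration only; the scheduler config shipped with the FontDiffuser checkpoints is authoritative and may use different beta/timestep settings.

from diffusers import DPMSolverMultistepScheduler

# Sketch only: the checkpoint's own scheduler config may differ.
scheduler = DPMSolverMultistepScheduler(
    num_train_timesteps=1000,
    algorithm_type="dpmsolver++",  # or "dpmsolver", as listed above
)
scheduler.set_timesteps(num_inference_steps=20)  # the default step count below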
Training Configuration
- Resolution: 96×96 pixels
- Batch Size: configurable
- Inference Steps: 20 (default, configurable)
- Guidance Scale: 7.5 (default, configurable; see the guidance sketch after this list)
- Precision: FP32/FP16 (optional)
- Device: CUDA/GPU recommended
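Classifier-free guidance blends the unconditional and conditional noise predictions at every denoising step. A generic sketch of the combine rule follows; the tensor shapes are placeholders, and this is not FontDiffuser's verbatim code.

import torch

guidance_scale = 7.5
# Stand-ins for the UNet's predictions without / with content-style conditioning.
noise_uncond = torch.randn(1, 3, 96, 96)
noise_cond = torch.randn(1, 3, 96, 96)

# The classifier-free guidance combine step applied at each denoising iteration.
noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)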
Key repo files & entrypoints
- configs/fontdiffuser.py – centralized CLI/config parser (all scripts import this).
- src/ – model implementations and builders (model.py, build_optimized.py).
- inference/sample_optimized.py – recommended single-GPU inference.
- inference/sample_batch.py – batch generation with checkpointing + evaluation hooks.
- inference/sample_distributed.py / run_inference.py – multi-GPU (Accelerate).
- train.py, train_fst.py – training entry points.
- tools/ – filename_utils.py, create_hf_dataset_streaming.py, generate_metadata.py, diagnose_dataset.py, upload_models_hybrid.py.
- ckpt/ – expected checkpoint directory (unet, content_encoder, style_encoder files).
- results_checkpoint.json – single source of truth for generated datasets.
Installation
Installation uses the uv package manager, which is fast thanks to its Rust implementation:
uv pip install diffusers torch torchvision safetensors
uv pip install lpips scikit-image pytorch-fid # Optional: for evaluation
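A quick post-install sanity check, using only standard library attributes:

import torch, diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__)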
Model usage
- Load pipeline (single process)
from argparse import Namespace
from inference.sample_optimized import load_fontdiffuser_pipeline
args = Namespace(ckpt_dir="ckpt", device="cuda", guidance_scale=7.5, num_inference_steps=15)
pipe = load_fontdiffuser_pipeline(args=args)
- Single-image inference (recommended)
python inference/sample_optimized.py \
--ckpt_dir ckpt \
--content_character "A" \
--style_image_path style_images/foo.png \
--save_image --save_image_dir results/
- Large-scale batch with checkpoint/resume
python inference/sample_batch.py \
--characters "chars.txt" \
--style_images "style_images/*.png" \
--ttf_path "fonts/myfont.ttf" \
--ckpt_dir ckpt \
--output_dir my_dataset/train_original \
--batch_size 8 \
--num_inference_steps 15 \
--guidance_scale 7.5 \
--save_interval 10
- Multi-GPU inference via Accelerate
accelerate launch inference/sample_distributed.py --config_path configs/fontdiffuser.py --ckpt_dir ckpt ...
Outputs & metadata
The repo uses hash-based filenames (tools/filename_utils.py) and a central metadata file:
- ContentImage/char.png – character content images
- TargetImage/style+char.png – generated images per style
- results_checkpoint.json – canonical metadata used by dataset tools and HF exporters
Example metadata generation:
python tools/generate_metadata.py --data_root my_dataset/handwritten_original --output my_dataset/handwritten_original/results_checkpoint.json
Model Performance
Supported Tasks
- Single-character font generation
- Multi-character batch generation
- Multi-font support
- Multi-style transfer
- Index-based tracking for large-scale generation
- Checkpoint and resume support
Output Format
output_dir/
├── ContentImage/                # Single set of content (character) images
│   ├── char0.png
│   ├── char1.png
│   └── ...
├── TargetImage/                 # Generated font images organized by style
│   ├── style0/
│   │   ├── style0+char0.png
│   │   ├── style0+char1.png
│   │   └── ...
│   ├── style1/
│   │   └── ...
│   └── ...
└── results_checkpoint.json      # Checkpoint file; doubles as generation metadata
Results Metadata Structure
{
  "generations": [
    {
      "character": "A",
      "char_index": 0,
      "style": "style0",
      "style_index": 0,
      "font": "Arial",
      "style_path": "path/to/style0.png",
      "output_path": "TargetImage/style0/style0+char0.png"
    }
  ],
  "metrics": {
    "lpips": {"mean": 0.25, "std": 0.08, "min": 0.1, "max": 0.5},
    "ssim": {"mean": 0.82, "std": 0.05, "min": 0.7, "max": 0.95},
    "fid": {"mean": 15.3, "std": 2.1},
    "inference_times": [
      {
        "style": "style0",
        "style_index": 0,
        "font": "Arial",
        "total_time": 2.45,
        "num_images": 100,
        "time_per_image": 0.0245
      }
    ]
  },
  "fonts": ["Arial", "Times New Roman"],
  "characters": ["A", "B", "C"],
  "styles": ["style0", "style1"],
  "total_chars": 3,
  "total_styles": 2,
  "total_possible_pairs": 6
}
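Since results_checkpoint.json is plain JSON, it is easy to consume downstream. A minimal reading sketch using only the fields documented above; the dataset path is illustrative:

import json

# Read the canonical metadata written by the batch generator.
with open("my_dataset/train_original/results_checkpoint.json") as f:
    meta = json.load(f)

# Index output paths by (character, style) for quick lookup.
by_pair = {(g["character"], g["style"]): g["output_path"]
           for g in meta["generations"]}

# "metrics" is presumably only populated when generation ran with --evaluate.
lpips_mean = meta.get("metrics", {}).get("lpips", {}).get("mean")
print(len(by_pair), "pairs | LPIPS mean:", lpips_mean)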
Evaluation Metrics
Supported Metrics
- LPIPS: Learned perceptual image patch similarity (lower is better; see the sketch after this list)
- SSIM: Structural similarity index (higher is better)
- FID: Fréchet Inception Distance (lower is better)
- Inference Time: Per-image generation time
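Using the optional evaluation packages from the install step, per-pair metrics can also be computed standalone. A minimal sketch, independent of the repo's own evaluation code; the file paths are illustrative:

import lpips
import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity

def load_gray(path):
    # 96x96 grayscale array scaled to [0, 1].
    return np.asarray(Image.open(path).convert("L").resize((96, 96))) / 255.0

def to_lpips_tensor(a):
    # LPIPS expects NCHW tensors in [-1, 1]; replicate gray to 3 channels.
    t = torch.tensor(a, dtype=torch.float32) * 2 - 1
    return t[None, None].repeat(1, 3, 1, 1)

gen = load_gray("results/out.png")          # illustrative paths
ref = load_gray("ground_truth/ref.png")

print("SSIM:", structural_similarity(gen, ref, data_range=1.0))  # higher is better
loss_fn = lpips.LPIPS(net="alex")
print("LPIPS:", loss_fn(to_lpips_tensor(gen), to_lpips_tensor(ref)).item())  # lower is better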
Generate with Evaluation
python inference/sample_batch.py \
--characters "characters.txt" \
--style_images "styles/" \
--ttf_path "fonts/myfont.ttf" \
--ckpt_dir "checkpoints/" \
--output_dir "my_dataset/train_original" \
--evaluate \
--ground_truth_dir "ground_truth/" \
--compute_fid
Dataset
Dataset Source
- Name: font-diffusion-generated-data
- Link: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data
- Format: ContentImage + TargetImage per style
- Supports: Multi-font, multi-character, multi-style generation
Dataset Structure
FontDiffusion Dataset/
├── train_original/
│   ├── ContentImage/            # Character structure images
│   ├── TargetImage/             # Style-specific font renderings
│   └── results.json
├── val_original/
└── test_original/
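One way to fetch the dataset locally, sketched with huggingface_hub; whether datasets.load_dataset can parse this layout directly is not covered here:

from huggingface_hub import snapshot_download

# Fetch the dataset files locally; repo_type="dataset" targets the dataset hub.
local_dir = snapshot_download(
    repo_id="dzungpham/font-diffusion-generated-data",
    repo_type="dataset",
)
print("Downloaded to:", local_dir)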
Training & Fine-tuning
Fine-tuning from Checkpoint
python my_train.py \
--ckpt_dir "checkpoints/" \
--data_dir "my_dataset/train_original" \
--output_dir "finetuned_ckpt/" \
--num_epochs 5 \
--learning_rate 1e-4 \
--batch_size 4
Convert & Upload Fine-tuned Models
python upload_model.py \
--ckpt_dir "finetuned_ckpt/" \
--hf_token "hf_xxxxx" \
--hf_repo_id "username/font-diffusion-finetuned" \
--num_epochs 5
A hybrid upload script with better speed and performance is also available:
python upload_model_hybrid.py \
--ckpt_dir "finetuned_ckpt/" \
--hf_token "hf_xxxxx" \
--hf_repo_id "username/font-diffusion-finetuned" \
--num_epochs 5
Technical Features
Optimizations
- Batch Processing: Process multiple characters per style
- Memory Efficiency: Attention slicing (optional; see the sketch after this list)
- FP16 Support: Reduced precision for faster inference
- Torch Compile: Optional model compilation
- Channels Last Format: Memory-optimized tensor layout
- XFormers Support: Fast attention implementation
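How these toggles typically look in PyTorch/Diffusers code. This is a sketch only; whether this repo applies them at the pipeline or module level is an assumption:

import torch

def apply_optimizations(unet: torch.nn.Module) -> torch.nn.Module:
    # FP16 weights plus channels-last memory layout on the UNet.
    unet = unet.to(device="cuda", dtype=torch.float16,
                   memory_format=torch.channels_last)
    # Optional compilation (PyTorch 2.x).
    return torch.compile(unet)

# If the pipeline subclasses Diffusers' DiffusionPipeline, the standard toggles
# also apply, e.g. pipe.enable_attention_slicing() and
# pipe.enable_xformers_memory_efficient_attention().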
Robustness
- Checkpoint & Resume: Resume from interruptions (sketched after this list)
- Index-based Tracking: Handle large character sets (100K+)
- Multi-font Support: Process characters across multiple fonts
- Error Recovery: Graceful handling of missing fonts
- Automatic Indexing: Consistent char_index and style_index
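A hypothetical sketch of the resume pattern; the actual logic lives in inference/sample_batch.py and may differ. Load the checkpoint file, skip recorded pairs, and append as new work completes:

import json
import os

CKPT_PATH = "my_dataset/train_original/results_checkpoint.json"  # illustrative

def load_completed(path=CKPT_PATH):
    # Pairs recorded by a previous (possibly interrupted) run.
    if not os.path.exists(path):
        return set(), {"generations": []}
    with open(path) as f:
        meta = json.load(f)
    done = {(g["char_index"], g["style_index"]) for g in meta["generations"]}
    return done, meta

done, meta = load_completed()
# The generation loop would then skip any (char_index, style_index) in `done`,
# append new entries to meta["generations"], and rewrite the file every
# --save_interval styles.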
Monitoring
- Weights & Biases Integration: Real-time tracking
- Progress Bars: Detailed generation progress
- Checkpoint Saving: Periodic intermediate saves
- Quality Metrics: LPIPS, SSIM, FID computation
Known Limitations
- Requires CUDA-capable GPU for practical generation speeds
- Characters must exist in at least one loaded font
- Style images should be normalized (96×96 or resizable)
- Very large character sets (>100K) may require memory optimization
- FID computation requires representative ground truth dataset
Citation
@article{fontdiffuser2023,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Yang, Zhenhua and Peng, Dezhi and Kong, Yuxin and Zhang, Yuyi and Yao, Cong and Jin, Lianwen},
  journal={arXiv preprint arXiv:2312.12142},
  year={2023}
}
License
This model is licensed under the Apache License 2.0. See LICENSE file for details.