Model Card for FontDiffuser
Model Details
Model Type
- Architecture: Diffusion-based Font Generation Model
- Framework: PyTorch + Hugging Face Diffusers
- Scheduler: DPM-Solver++ (configurable: dpmsolver++ / dpmsolver)
- Guidance: Classifier-free guidance
- Base Model: FontDiffuser with Content and Style Encoders
Model Components
- UNet: Main diffusion model for image generation
- Content Encoder: Extracts character structure information
- Style Encoder: Extracts font style features
- DDPM/DPM Scheduler: Noise scheduling for the diffusion process (see the sketch below)
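For orientation, a minimal sketch of constructing a DPM-Solver++ scheduler with Diffusers. This is an illustration only; the scheduler config shipped with the FontDiffuser checkpoints is authoritative and may use different beta/timestep settings.

from diffusers import DPMSolverMultistepScheduler

# Sketch only: the checkpoint's own scheduler config may differ.
scheduler = DPMSolverMultistepScheduler(
    num_train_timesteps=1000,
    algorithm_type="dpmsolver++",  # or "dpmsolver", as listed above
)
scheduler.set_timesteps(num_inference_steps=20)  # the default step count below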
Training Configuration
- Resolution: 96×96 pixels
- Batch Size: configurable
- Inference Steps: 20 (default, configurable)
- Guidance Scale: 7.5 (default, configurable; see the guidance sketch after this list)
- Precision: FP32/FP16 (optional)
- Device: CUDA/GPU recommended
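Classifier-free guidance blends the unconditional and conditional noise predictions at every denoising step. A generic sketch of the combine rule follows; the tensor shapes are placeholders, and this is not FontDiffuser's verbatim code.

import torch

guidance_scale = 7.5
# Stand-ins for the UNet's predictions without / with content-style conditioning.
noise_uncond = torch.randn(1, 3, 96, 96)
noise_cond = torch.randn(1, 3, 96, 96)

# The classifier-free guidance combine step applied at each denoising iteration.
noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)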
Key repo files & entrypoints
- configs/fontdiffuser.py – centralized CLI/config parser (all scripts import this).
- src/ – model implementations and builders (model.py, build_optimized.py).
- inference/sample_optimized.py – recommended single-GPU inference.
- inference/sample_batch.py – batch generation with checkpointing + evaluation hooks.
- inference/sample_distributed.py / run_inference.py – multi-GPU (Accelerate).
- train.py, train_fst.py – training entry points.
- tools/ – filename_utils.py, create_hf_dataset_streaming.py, generate_metadata.py, diagnose_dataset.py, upload_models_hybrid.py.
- ckpt/ – expected checkpoint directory (unet, content_encoder, style_encoder files).
- results_checkpoint.json – single source of truth for generated datasets.
Installation
Installation uses the uv package manager, which is fast thanks to its Rust implementation:
uv pip install diffusers torch torchvision safetensors
uv pip install lpips scikit-image pytorch-fid # Optional: for evaluation
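A quick post-install sanity check, using only standard library attributes:

import torch, diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__)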
Model usage
- Load pipeline (single process)
from argparse import Namespace
from inference.sample_optimized import load_fontdiffuser_pipeline
args = Namespace(ckpt_dir="ckpt", device="cuda", guidance_scale=7.5, num_inference_steps=15)
pipe = load_fontdiffuser_pipeline(args=args)
- Single-image inference (recommended)
python inference/sample_optimized.py \
--ckpt_dir ckpt \
--content_character "A" \
--style_image_path style_images/foo.png \
--save_image --save_image_dir results/
- Large-scale batch with checkpoint/resume
python inference/sample_batch.py \
--characters "chars.txt" \
--style_images "style_images/*.png" \
--ttf_path "fonts/myfont.ttf" \
--ckpt_dir ckpt \
--output_dir my_dataset/train_original \
--batch_size 8 \
--num_inference_steps 15 \
--guidance_scale 7.5 \
--save_interval 10
- Multi-GPU inference via Accelerate
accelerate launch inference/sample_distributed.py --config_path configs/fontdiffuser.py --ckpt_dir ckpt ...
Outputs & metadata
The repo uses hash-based filenames (tools/filename_utils.py) and a central metadata file:
- ContentImage/char.png – character content images
- TargetImage/style+char.png – generated images per style
- results_checkpoint.json – canonical metadata used by dataset tools and HF exporters
Example metadata generation:
python tools/generate_metadata.py --data_root my_dataset/handwritten_original --output my_dataset/handwritten_original/results_checkpoint.json
Model Performance
Supported Tasks
- Single-character font generation
- Multi-character batch generation
- Multi-font support
- Multi-style transfer
- Index-based tracking for large-scale generation
- Checkpoint and resume support
Output Format
output_dir/
├── ContentImage/                # Single set of content (character) images
│   ├── char0.png
│   ├── char1.png
│   └── ...
├── TargetImage/                 # Generated font images organized by style
│   ├── style0/
│   │   ├── style0+char0.png
│   │   ├── style0+char1.png
│   │   └── ...
│   ├── style1/
│   │   └── ...
│   └── ...
└── results_checkpoint.json      # Checkpoint file; doubles as generation metadata
Results Metadata Structure
{
  "generations": [
    {
      "character": "A",
      "char_index": 0,
      "style": "style0",
      "style_index": 0,
      "font": "Arial",
      "style_path": "path/to/style0.png",
      "output_path": "TargetImage/style0/style0+char0.png"
    }
  ],
  "metrics": {
    "lpips": {"mean": 0.25, "std": 0.08, "min": 0.1, "max": 0.5},
    "ssim": {"mean": 0.82, "std": 0.05, "min": 0.7, "max": 0.95},
    "fid": {"mean": 15.3, "std": 2.1},
    "inference_times": [
      {
        "style": "style0",
        "style_index": 0,
        "font": "Arial",
        "total_time": 2.45,
        "num_images": 100,
        "time_per_image": 0.0245
      }
    ]
  },
  "fonts": ["Arial", "Times New Roman"],
  "characters": ["A", "B", "C"],
  "styles": ["style0", "style1"],
  "total_chars": 3,
  "total_styles": 2,
  "total_possible_pairs": 6
}
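Since results_checkpoint.json is plain JSON, it is easy to consume downstream. A minimal reading sketch using only the fields documented above; the dataset path is illustrative:

import json

# Read the canonical metadata written by the batch generator.
with open("my_dataset/train_original/results_checkpoint.json") as f:
    meta = json.load(f)

# Index output paths by (character, style) for quick lookup.
by_pair = {(g["character"], g["style"]): g["output_path"]
           for g in meta["generations"]}

# "metrics" is presumably only populated when generation ran with --evaluate.
lpips_mean = meta.get("metrics", {}).get("lpips", {}).get("mean")
print(len(by_pair), "pairs | LPIPS mean:", lpips_mean)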
Evaluation Metrics
Supported Metrics
- LPIPS: Learned perceptual image patch similarity (lower is better; see the sketch after this list)
- SSIM: Structural similarity index (higher is better)
- FID: Fréchet Inception Distance (lower is better)
- Inference Time: Per-image generation time
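Using the optional evaluation packages from the install step, per-pair metrics can also be computed standalone. A minimal sketch, independent of the repo's own evaluation code; the file paths are illustrative:

import lpips
import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity

def load_gray(path):
    # 96x96 grayscale array scaled to [0, 1].
    return np.asarray(Image.open(path).convert("L").resize((96, 96))) / 255.0

def to_lpips_tensor(a):
    # LPIPS expects NCHW tensors in [-1, 1]; replicate gray to 3 channels.
    t = torch.tensor(a, dtype=torch.float32) * 2 - 1
    return t[None, None].repeat(1, 3, 1, 1)

gen = load_gray("results/out.png")          # illustrative paths
ref = load_gray("ground_truth/ref.png")

print("SSIM:", structural_similarity(gen, ref, data_range=1.0))  # higher is better
loss_fn = lpips.LPIPS(net="alex")
print("LPIPS:", loss_fn(to_lpips_tensor(gen), to_lpips_tensor(ref)).item())  # lower is better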
Generate with Evaluation
python inference/sample_batch.py \
--characters "characters.txt" \
--style_images "styles/" \
--ttf_path "fonts/myfont.ttf" \
--ckpt_dir "checkpoints/" \
--output_dir "my_dataset/train_original" \
--evaluate \
--ground_truth_dir "ground_truth/" \
--compute_fid
Dataset
Dataset Source
- Name: font-diffusion-generated-data
- Link: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data
- Format: ContentImage + TargetImage per style
- Supports: Multi-font, multi-character, multi-style generation
Dataset Structure
FontDiffusion Dataset/
├── train_original/
│   ├── ContentImage/            # Character structure images
│   ├── TargetImage/             # Style-specific font renderings
│   └── results.json
├── val_original/
└── test_original/
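One way to fetch the dataset locally, sketched with huggingface_hub; whether datasets.load_dataset can parse this layout directly is not covered here:

from huggingface_hub import snapshot_download

# Fetch the dataset files locally; repo_type="dataset" targets the dataset hub.
local_dir = snapshot_download(
    repo_id="dzungpham/font-diffusion-generated-data",
    repo_type="dataset",
)
print("Downloaded to:", local_dir)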
Training & Fine-tuning
Fine-tuning from Checkpoint
python my_train.py \
--ckpt_dir "checkpoints/" \
--data_dir "my_dataset/train_original" \
--output_dir "finetuned_ckpt/" \
--num_epochs 5 \
--learning_rate 1e-4 \
--batch_size 4
Convert & Upload Fine-tuned Models
python upload_model.py \
--ckpt_dir "finetuned_ckpt/" \
--hf_token "hf_xxxxx" \
--hf_repo_id "username/font-diffusion-finetuned" \
--num_epochs 5
A hybrid upload script with better speed and performance is also available:
python upload_model_hybrid.py \
--ckpt_dir "finetuned_ckpt/" \
--hf_token "hf_xxxxx" \
--hf_repo_id "username/font-diffusion-finetuned" \
--num_epochs 5
Technical Features
Optimizations
- Batch Processing: Process multiple characters per style
- Memory Efficiency: Attention slicing (optional; see the sketch after this list)
- FP16 Support: Reduced precision for faster inference
- Torch Compile: Optional model compilation
- Channels Last Format: Memory-optimized tensor layout
- XFormers Support: Fast attention implementation
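How these toggles typically look in PyTorch/Diffusers code. This is a sketch only; whether this repo applies them at the pipeline or module level is an assumption:

import torch

def apply_optimizations(unet: torch.nn.Module) -> torch.nn.Module:
    # FP16 weights plus channels-last memory layout on the UNet.
    unet = unet.to(device="cuda", dtype=torch.float16,
                   memory_format=torch.channels_last)
    # Optional compilation (PyTorch 2.x).
    return torch.compile(unet)

# If the pipeline subclasses Diffusers' DiffusionPipeline, the standard toggles
# also apply, e.g. pipe.enable_attention_slicing() and
# pipe.enable_xformers_memory_efficient_attention().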
Robustness
- Checkpoint & Resume: Resume from interruptions (sketched after this list)
- Index-based Tracking: Handle large character sets (100K+)
- Multi-font Support: Process characters across multiple fonts
- Error Recovery: Graceful handling of missing fonts
- Automatic Indexing: Consistent char_index and style_index
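A hypothetical sketch of the resume pattern; the actual logic lives in inference/sample_batch.py and may differ. Load the checkpoint file, skip recorded pairs, and append as new work completes:

import json
import os

CKPT_PATH = "my_dataset/train_original/results_checkpoint.json"  # illustrative

def load_completed(path=CKPT_PATH):
    # Pairs recorded by a previous (possibly interrupted) run.
    if not os.path.exists(path):
        return set(), {"generations": []}
    with open(path) as f:
        meta = json.load(f)
    done = {(g["char_index"], g["style_index"]) for g in meta["generations"]}
    return done, meta

done, meta = load_completed()
# The generation loop would then skip any (char_index, style_index) in `done`,
# append new entries to meta["generations"], and rewrite the file every
# --save_interval styles.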
Monitoring
- Weights & Biases Integration: Real-time tracking
- Progress Bars: Detailed generation progress
- Checkpoint Saving: Periodic intermediate saves
- Quality Metrics: LPIPS, SSIM, FID computation
Known Limitations
- Requires CUDA-capable GPU for practical generation speeds
- Characters must exist in at least one loaded font
- Style images should be normalized (96×96 or resizable)
- Very large character sets (>100K) may require memory optimization
- FID computation requires representative ground truth dataset
Citation
@article{fontdiffuser2023,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Yang, Zhenhua and Peng, Dezhi and Kong, Yuxin and Zhang, Yuyi and Yao, Cong and Jin, Lianwen},
  journal={arXiv preprint arXiv:2312.12142},
  year={2023}
}
License
This model is licensed under the Apache License 2.0. See LICENSE file for details.