process to add Persian (Farsi)

by phamed - opened Oct 20, 2025


Hi,

First, great work on fine-tuning Chatterbox for Norwegian. I found your project inspiring and would love to learn from what you did so I can replicate the process to add Persian (Farsi).

Could you please share a step-by-step guide or checklist of the exact process you followed? The more practical detail the better, for example:

repository or scripts you used (training, preprocessing, tokenization)

dataset format and how you prepared/cleaned the data (examples of input/output pairs)

tokenizer details (did you reuse the original tokenizer, train a new one, or use SentencePiece/BPE?)

exact training command(s), framework versions, Dockerfile or environment specs

hardware used (GPUs, batch size limitations, mixed precision, distributed setup)

hyperparameters (learning rate schedule, batch size, warmup steps, weight decay, epochs)

any prompt/prefix engineering or control tokens you added for emotion/style

how you validated and evaluated quality (metrics, human eval, test prompts)

safety, filtering, and licensing considerations for the data you used

common pitfalls you encountered and how you fixed them

any scripts/tools for inference and deployment (API, quantization, ONNX/Triton tips)

I can provide compute (GPU clusters), data curation help, and a small team to implement and test, and of course full credit and attribution for your method wherever appropriate. If you prefer, I’d be happy to jump on a short call or chat and compensate you for your time.

Thank you. I really appreciate any guidance you can share. I’ll follow your steps closely and report back with results so we can iterate together.

Best regards,
Hamed Jam
