---
language:
- fr
- en
- ary
license: apache-2.0
library_name: transformers
tags:
- nllb
- translation
- lora
- darija
pipeline_tag: translation
datasets:
- MBZUAI-Paris/Darija-SFT-Mixture
- custom-scraped-data
---

# NLLB-Darija-FR/ENG - Fine-Tuned Translation Model

This repository contains a specialized translation model for **Darija (Moroccan Arabic)**, **French**, and **English**, based on the [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M) model.

The model has been fine-tuned using the **LoRA (Low-Rank Adaptation)** technique to enhance its translation capabilities for these specific language pairs, which are often underrepresented in generalist models. The model can translate in both directions (e.g., French to Darija and Darija to French).

This project was developed following a full MLOps approach, including an automated training and deployment pipeline.

## 🚀 Usage with `transformers`

You can use this model directly with a pipeline from the `transformers` library.

### Installation

Make sure you have the necessary libraries installed:

```bash
pip install torch transformers sentencepiece
```

### Example Python Code

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model and tokenizer from the Hub
model_id = "Farid59/nllb-darija-fr_eng"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Create the translation pipeline
translator = pipeline("translation", model=model, tokenizer=tokenizer)

# --- Example 1: French to Darija ---
texte_fr = "Bonjour, je voudrais réserver une table pour deux personnes ce soir."
traduction_darija = translator(
    texte_fr,
    src_lang="fra_Latn",
    tgt_lang="ary_Arab"
)
print("French -> Darija:")
print(f"  Input: {texte_fr}")
print(f"  Output: {traduction_darija[0]['translation_text']}")
# Expected output: أهلا, بغيت نحجز طاولا لشخصين هاد الليلا

# --- Example 2: Darija to English ---
texte_darija = "شحال كايكلف هادشي"
traduction_anglais = translator(
    texte_darija,
    src_lang="ary_Arab",
    tgt_lang="eng_Latn"
)
print("\nDarija -> English:")
print(f"  Input: {texte_darija}")
print(f"  Output: {traduction_anglais[0]['translation_text']}")
# Expected output: How much does that cost?
```

## 📜 Model Details

- **Base model:** [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Fine-tuning technique:** LoRA (Low-Rank Adaptation)
- **Supported languages:**
  - `fra_Latn` (French)
  - `eng_Latn` (English)
  - `ary_Arab` (Darija, Arabic script)

## 📊 Training Data

The model was trained on a composite corpus assembled from several sources:

- The [Darija-SFT-Mixture](https://huggingface.co/datasets/MBZUAI-Paris/Darija-SFT-Mixture) dataset
- Data collected by scraping specialized websites
- Synthetic data generated to cover tourist and conversational scenarios

All data was cleaned, deduplicated, and formatted for bidirectional training.

## ⚙️ Training Process

Training was orchestrated by an automated MLOps pipeline using GitHub Actions. The process includes:

1. **Data collection and preparation**
2. **Fine-tuning with LoRA** on a GPU runner
3. **Merging** LoRA adapter weights into the base model to create a standalone model
4. **Evaluation** of the merged model on a dedicated test set using the **SacreBLEU** metric
5. **Validation Gate:** The new model is deployed only if its BLEU score exceeds that of the production version

### Performance

Fine-tuning resulted in a very significant performance improvement on the dedicated test set.
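The validation gate in step 5 reduces to a strict score comparison. A minimal sketch (the function name is hypothetical; in the pipeline both scores come from SacreBLEU runs on the same held-out test set):

```python
def validation_gate(candidate_bleu: float, production_bleu: float) -> bool:
    """Deploy the candidate model only if it strictly beats production.

    Both scores are assumed to be SacreBLEU scores computed on the
    same dedicated test set.
    """
    return candidate_bleu > production_bleu

# With the scores reported in the table below:
print(validation_gate(18.9, 8.19))  # True -> deploy the new model
print(validation_gate(8.0, 8.19))   # False -> keep the production model
```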
| Model | BLEU Score (on test set) |
| :----------------------------------------------------------- | :----------------------: |
| `facebook/nllb-200-distilled-600M` (Base model) | 8.19 |
| **`Farid59/nllb-darija-fr_eng` (This fine-tuned model)** | **18.9** |

This increase of over 10 BLEU points demonstrates the effectiveness of fine-tuning to specialize the model for the nuances of Darija.

## Author

**Farid Igouti**

> This project is part of a portfolio showcasing skills in MLOps, CI/CD, and AI model deployment.