MarianMT Indonesian-English Translation (Fine-Tuned)

This model is a fine-tuned version of Helsinki-NLP/opus-mt-id-en, specialized for Indonesian-to-English translation of TED Talk-style content.

🎯 Model Highlights

Specialized Context: Fine-tuned on the TED Talks parallel corpus for better performance on formal and presentation-style language.
Optimized Training: Uses layer freezing and a cosine annealing scheduler with warmup for stable fine-tuning.
Production Ready: Can be easily integrated into applications using the transformers library.

🚀 Model Details

Base Model: Helsinki-NLP/opus-mt-id-en
Fine-tuned Dataset: Cleaned and aligned TED Talks parallel corpus (Indonesian-English).
Training Date: 2025-06-16
Languages: Indonesian (id) → English (en)

⚙️ Training Configuration

Hyperparameters

Learning Rate: 5e-6
Weight Decay: 0.001
Gradient Clipping: 0.5
Max Sequence Length: 96-128 tokens
Scheduler: Cosine Annealing with Warmup
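
The shape of a cosine annealing schedule with warmup can be illustrated with a small, dependency-free sketch. Only the peak learning rate (5e-6) comes from the list above; the warmup length and total step count are placeholder assumptions, since this card does not state them, and the function name is hypothetical.

```python
import math

BASE_LR = 5e-6  # learning rate listed in this card


def cosine_lr_with_warmup(step, warmup_steps, total_steps, base_lr=BASE_LR):
    """Return the learning rate at `step`: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    # Half-cosine from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumed run length for illustration: 100 warmup steps out of 1,000 total.
schedule = [cosine_lr_with_warmup(s, 100, 1000) for s in range(1001)]
```

In a real training loop the same curve is available as `get_cosine_schedule_with_warmup` in the transformers library; whether that helper was used for this model is not stated here.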

Architecture Optimizations

Layer Freezing: Early encoder layers were frozen to preserve foundational language knowledge from the base model.
Memory Optimization: Utilized gradient accumulation to simulate a larger batch size.
Early Stopping: Implemented with a patience of 5 epochs to prevent overfitting.
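
The early-stopping rule can be sketched as a small helper. This is a minimal illustration, not the training code used for this model: the class name, the `min_delta` parameter, the choice of validation loss as the monitored metric, and the loss values are all assumptions; only the patience of 5 epochs comes from the list above.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            # New best value: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after the third epoch.
losses = [2.10, 1.95, 1.90, 1.91, 1.92, 1.90, 1.93, 1.94, 1.95]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

With these illustrative numbers the best loss is reached at epoch 3 and the loop stops five epochs later, at epoch 8.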

🛠️ Usage Example

import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "dhintech/marian-tedtalks_clean-id-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def translate(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
indonesian_text = "Selamat pagi, mari kita mulai rapat hari ini."
english_translation = translate(indonesian_text)
print(f"ID: {indonesian_text}")
print(f"EN: {english_translation}")

🎯 Intended Use Cases

Presentation Translation: Translating presentation scripts and materials.
Formal Content: Translating articles, reports, and other formal documents.
Educational Content: Assisting with the translation of academic and educational materials.

⚡ Performance Metrics

Performance metrics such as BLEU score, inference time, and human evaluation will be added here after the model has been fully trained and evaluated.

🚨 Limitations and Considerations

Domain Specificity: Because the model was fine-tuned on TED Talks, it performs best on formal, presentation-style language. It may perform worse on casual slang or regional dialects.
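Long Sequences: Performance may degrade on inputs significantly longer than the maximum length used in training (128 tokens). The usage example above also truncates input at 128 tokens, so longer text should be split before translation.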
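
One way to work within the length limit is to split long text into sentence groups before translating. The sketch below is a rough, dependency-free heuristic rather than part of this model: the function names and the `max_words` budget are assumptions, and word counts only approximate token counts.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text, max_words=60):
    """Group consecutive sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` is kept whole in its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Selamat pagi. Mari kita mulai rapat hari ini. Terima kasih!"
chunks = chunk_sentences(sample, max_words=8)
```

Each chunk can then be passed to the `translate` function from the usage example and the results joined with spaces.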

🤝 Contributing

Feedback and contributions are welcome! Please use the Community tab or open an issue on the repository if you encounter any problems or have suggestions for improvement.

Model Size: 72.2M params
Tensor Type: F32 (Safetensors)
