Marian MT fine-tuned on the Multilingual Corpus of World’s Constitutions (MCWC)

This model is a fine-tuned version of Helsinki-NLP/opus-mt-en-es, adapted using high-quality sentence-aligned constitutional text from the Multilingual Corpus of World’s Constitutions (MCWC).
📄 MCWC paper (OSACT 2024): https://aclanthology.org/2024.osact-1.7/

This variant handles English → Spanish translation.


Overview

The MCWC provides a curated multilingual collection of constitutional texts from countries across the world. The corpus emphasises data cleanliness, high-quality sentence alignment, and detailed metadata (including country and continent mappings). It supports research in:

  • legal and constitutional NLP
  • comparative constitutional studies
  • multilingual machine translation
  • cross-lingual semantic analysis

This model was fine-tuned on the English–Spanish segment of the MCWC, enabling translation that is more attuned to legal and constitutional language than general-purpose MT systems.


Intended use

This model is suitable for tasks such as:

  • translating constitutional or legal documents
  • cross-lingual legal text comparison
  • multilingual information extraction
  • downstream legal NLP tasks requiring domain-specific MT

It is not intended for casual or conversational translation, as it is optimised for formal and legal text.
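A minimal usage sketch with the Transformers MarianMT classes is shown below. The repository id matches this model card; loading is wrapped in a function so the checkpoint is only downloaded when you actually call it.

```python
# Hedged usage sketch: load this fine-tuned checkpoint and translate
# English sentences into Spanish with MarianMT.
def translate(sentences, model_id="drelhaj/marian-finetuned-mcwc-en-to-es"):
    """Translate a list of English sentences into Spanish."""
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_id)
    model = MarianMTModel.from_pretrained(model_id)
    batch = tokenizer(sentences, return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

# Example (downloads the checkpoint on first use):
# translate(["Sovereignty resides in the people."])
```

Batching several sentences into one call is preferable to looping, since the tokenizer pads them into a single tensor for the model.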


Training data

The model was trained on the MCWC’s English–Spanish aligned sentence pairs.
The MCWC dataset includes:

  • cleaned constitutional text
  • high-quality sentence segmentation
  • pairwise alignments
  • country and regional metadata

More details may be found in the accompanying paper:

El-Haj, M. & Ezzini, S. (2024). “The Multilingual Corpus of World’s Constitutions (MCWC).”
OSACT @ LREC-COLING 2024.
https://aclanthology.org/2024.osact-1.7/


Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • optimizer: AdamWeightDecay
      – learning rate schedule: PolynomialDecay (initial_learning_rate = 5e-05, decay_steps = 3666, end_learning_rate = 0.0, power = 1.0, cycle = False)
      – beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-08, amsgrad = False
      – weight_decay_rate = 0.01
  • training_precision: mixed_float16
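Because the PolynomialDecay schedule uses power = 1.0, the learning rate decays linearly from 5e-05 to 0 over the 3666 training steps. A pure-Python sketch of that schedule (no TensorFlow required):

```python
# Linear learning-rate decay matching the PolynomialDecay config above:
# power = 1.0 reduces the polynomial to a straight line from
# INITIAL_LR down to END_LR over DECAY_STEPS steps.
INITIAL_LR = 5e-05
END_LR = 0.0
DECAY_STEPS = 3666
POWER = 1.0

def lr_at(step):
    """Learning rate at a given optimizer step (clamped after decay ends)."""
    frac = min(step, DECAY_STEPS) / DECAY_STEPS
    return (INITIAL_LR - END_LR) * (1.0 - frac) ** POWER + END_LR

# lr_at(0)    -> 5e-05  (start of training)
# lr_at(1833) -> 2.5e-05 (halfway)
# lr_at(3666) -> 0.0    (end of decay)
```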

Training results

  Epoch   Train Loss   Validation Loss
  0       1.1192       1.0032
  1       0.9233       0.9729
  2       0.8292       0.9642

Framework versions

  • Transformers 4.33.3
  • TensorFlow 2.13.0
  • Datasets 2.14.5
  • Tokenizers 0.13.3

Model repository: drelhaj/marian-finetuned-mcwc-en-to-es