---
library_name: transformers
tags:
- berturk
- ottomanturkish
- DAPT
- boun
datasets:
- BUCOLIN/OTC-Corpus
language:
- tr
base_model:
- dbmdz/bert-base-turkish-128k-cased
pipeline_tag: token-classification
---

# BerTurk Ottoman Full DAPT

A domain-adaptive continuation of **dbmdz/bert-base-turkish-128k-cased**, further pre-trained with masked language modeling on 800 K Ottoman Turkish sentences in modern Latin transliteration (≈ 14 M tokens) from the OTC Corpus (Özateş et al., 2025). This checkpoint is intended as a drop-in encoder for downstream NER tasks.

---

## Model Details

| Property              | Value                                |
|-----------------------|--------------------------------------|
| **Base**              | `dbmdz/bert-base-turkish-128k-cased` |
| **Domain data**       | `BUCOLIN/OTC-Corpus`                 |
| **Pre-training task** | Masked Language Modeling (MLM)       |
| **Epochs**            | 4                                    |
| **Sequence length**   | 128 tokens (chunked)                 |
| **Batch size**        | 16 (per device)                      |
| **Learning rate**     | 3 × 10⁻⁵                             |
| **Warmup steps**      | 500                                  |
| **Weight decay**      | 0.01                                 |
| **Mixed precision**   | fp16                                 |
| **Checkpoint**        | full weights, fp16                   |
| **Vocabulary**        | same as base                         |

---

## Training Data

- **Corpus:** `BUCOLIN/OTC-Corpus`
- 800 K modern Latin transliterations of Ottoman Turkish text
- Split 90 %/10 % into train/validation for the DAPT run

---

## Training

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="BerTurk_Ottoman_Full_DAPT",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    learning_rate=3e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,
    logging_steps=100,
    save_steps=500,   # ignored while save_strategy="epoch"
    eval_steps=500,   # ignored while eval_strategy="epoch"
    save_total_limit=2,
    load_best_model_at_end=True,
)
```

## Hardware & Training

- **Hardware**: Google Colab Pro (T4 GPU, high VRAM)
- **Batch size**: 128
- **Final validation loss**: 2.2306
- **Total DAPT time**: ~3 hours for 4 epochs

## Example Use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load model & tokenizer
tokenizer = AutoTokenizer.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")
model = AutoModelForMaskedLM.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")

nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)

res = nlp("Devlet-i Aliyye-i Osmaniyye’nin [MASK] için tedâbîr-i mühimme ittikhāz olunmalıdır.")
print(res)
```
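
## Appendix: MLM Data Pipeline (Sketch)

For reference, a minimal sketch of the 128-token chunking and MLM setup implied by the table above. The `text` column name and the 15 % masking probability are assumptions following the conventional BERT recipe, not confirmed details of the original run:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-cased")

# Assumption: the corpus exposes a plain "text" column.
ds = load_dataset("BUCOLIN/OTC-Corpus", split="train")

tokenized = ds.map(
    lambda batch: tokenizer(batch["text"]),
    batched=True,
    remove_columns=ds.column_names,
)

block_size = 128  # matches the 128-token sequence length above

def group_texts(examples):
    # Concatenate all tokenized sequences, then split into fixed 128-token chunks.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total, block_size)]
        for k, t in concatenated.items()
    }

chunks = tokenized.map(group_texts, batched=True)

# Dynamic BERT-style masking; 15 % is the conventional default, assumed here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```

Passing `chunks` and `collator` to a `Trainer` together with the `args` shown under Training reproduces the general shape of the DAPT run.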
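
## Fine-Tuning for NER (Sketch)

Since the checkpoint is positioned as an encoder for NER, a minimal fine-tuning sketch follows. The BIO label set, output directory, and hyperparameters are illustrative assumptions; supply your own tokenized NER dataset with word-to-token label alignment:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Hypothetical BIO label set -- replace with the tags of your NER dataset.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")
model = AutoModelForTokenClassification.from_pretrained(
    "cihanunlu/BerTurk_Ottoman_Full_DAPT",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

args = TrainingArguments(
    output_dir="berturk-ottoman-ner",  # illustrative name
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=3e-5,
    fp16=True,
)

# With tokenized, label-aligned train/validation splits in hand:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```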