---
library_name: transformers
tags:
- berturk
- ottomanturkish
- DAPT
- boun
datasets:
- BUCOLIN/OTC-Corpus
language:
- tr
base_model:
- dbmdz/bert-base-turkish-128k-cased
pipeline_tag: token-classification
---

# BerTurk Ottoman Full DAPT

A domain-adaptive continuation of **dbmdz/bert-base-turkish-128k-cased**, further pre-trained with masked language modeling on 800 K Ottoman Turkish sentences in modern Latin transliteration (≈ 14 M tokens) from the OTC Corpus (Özateş et al., 2025). This checkpoint is intended as a drop-in encoder for downstream NER tasks.

---

## Model Details

| Property              | Value                                |
|-----------------------|--------------------------------------|
| **Base**              | `dbmdz/bert-base-turkish-128k-cased` |
| **Domain data**       | `BUCOLIN/OTC-Corpus`                 |
| **Pre-training task** | Masked Language Modeling (MLM)       |
| **Epochs**            | 4                                    |
| **Sequence length**   | 128 tokens (chunked)                 |
| **Batch size**        | 16 (per device)                      |
| **Learning rate**     | 3 × 10⁻⁵                             |
| **Warmup steps**      | 500                                  |
| **Weight decay**      | 0.01                                 |
| **Mixed precision**   | fp16                                 |
| **Checkpoint**        | full weights, fp16                   |
| **Vocabulary**        | same as base                         |

---

## Training Data

- **Corpus:** `BUCOLIN/OTC-Corpus`
- 800 K modern Latin transliterations of Ottoman Turkish text
- Split 90 %/10 % into train/validation for the DAPT run

---

## Training

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="BerTurk_Ottoman_Full_DAPT",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    learning_rate=3e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,
    logging_steps=100,
    save_steps=500,   # ignored while save_strategy="epoch"
    eval_steps=500,   # ignored while eval_strategy="epoch"
    save_total_limit=2,
    load_best_model_at_end=True,
)
```

## Hardware & Training

- **Hardware**: Google Colab Pro (T4 GPU, high VRAM)
- **Batch size**: 128
- **Final validation loss**: 2.2306
- **Total DAPT time**: ~3 hours for 4 epochs

## Example Use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load model & tokenizer
tokenizer = AutoTokenizer.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")
model = AutoModelForMaskedLM.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")

nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)

res = nlp("Devlet-i Aliyye-i Osmaniyye’nin [MASK] için tedâbîr-i mühimme ittikhāz olunmalıdır.")
print(res)
```
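
## Appendix: MLM Data Pipeline (Sketch)

For reference, a minimal sketch of the 128-token chunking and MLM setup implied by the table above. The `text` column name and the 15 % masking probability are assumptions following the conventional BERT recipe, not confirmed details of the original run:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-cased")

# Assumption: the corpus exposes a plain "text" column.
ds = load_dataset("BUCOLIN/OTC-Corpus", split="train")

tokenized = ds.map(
    lambda batch: tokenizer(batch["text"]),
    batched=True,
    remove_columns=ds.column_names,
)

block_size = 128  # matches the 128-token sequence length above

def group_texts(examples):
    # Concatenate all tokenized sequences, then split into fixed 128-token chunks.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total, block_size)]
        for k, t in concatenated.items()
    }

chunks = tokenized.map(group_texts, batched=True)

# Dynamic BERT-style masking; 15 % is the conventional default, assumed here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```

Passing `chunks` and `collator` to a `Trainer` together with the `args` shown under Training reproduces the general shape of the DAPT run.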
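
## Fine-Tuning for NER (Sketch)

Since the checkpoint is positioned as an encoder for NER, a minimal fine-tuning sketch follows. The BIO label set, output directory, and hyperparameters are illustrative assumptions; supply your own tokenized NER dataset with word-to-token label alignment:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Hypothetical BIO label set -- replace with the tags of your NER dataset.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")
model = AutoModelForTokenClassification.from_pretrained(
    "cihanunlu/BerTurk_Ottoman_Full_DAPT",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

args = TrainingArguments(
    output_dir="berturk-ottoman-ner",  # illustrative name
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=3e-5,
    fp16=True,
)

# With tokenized, label-aligned train/validation splits in hand:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```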