Commit 1ca4d18
Parent(s): cf7c99d
Update README with evaluation results (#2)
- Update README with evaluation results (01ad791920f071f0c544ea2fb2a4410aeb7ea562)
- Updated evaluation results formatting (261cefca17e80d2a82ecedb9f3d1d3170d7b2281)
Co-authored-by: Gianluca Barmina <giannor@users.noreply.huggingface.co>
README.md CHANGED
@@ -43,6 +43,43 @@ The characteristics of the three pre-training stages are detailed in the followi
| stage3 | 524,288 tok | 18,926 | [subfolder="stage3"](https://huggingface.co/danish-foundation-models/munin-7b-open-pt/tree/main/stage3) | 2/3 [Dynaword](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword/tree/9e230b35e31a510e5ab909112ad5bfc9463b2c23); <br> 1/3 [Common-Pile](https://huggingface.co/common-pile/comma_v0.1_training_dataset/5afc546db324e7f39f297ba757c9a60547151e7c) | Excludes depbank, jvj, nordjyllandnews, synne for Dynaword; <br> uses subsets and weighting from [Comma-v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) cooldown phase for Common-Pile; LR schedule with 500 steps warmup, square root decay from 1e-5 |

## Evaluation

Munin-7B-Open-pt was evaluated using the [EuroEval](https://euroeval.com/) framework, which includes benchmarks across seven task types covering more than 15 European languages.

We report results in both Danish and English for all EuroEval-supported tasks: sentiment classification, named entity recognition, linguistic acceptability, reading comprehension, summarization, knowledge, and common-sense reasoning. In addition, we evaluate the model on DaLA, a Danish linguistic acceptability dataset that focuses on common real-world errors.

We compare Munin-7B-Open-pt at various training stages with its base model [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) and two models from the Pleias family ([Pleias-350M-Preview](https://huggingface.co/PleIAs/Pleias-350m-Preview) and [Pleias-1.2B-Preview](https://huggingface.co/PleIAs/Pleias-1.2b-Preview)). All comparison models were trained exclusively on open data, either in the public domain or under a permissive license.

The following tables show the performance on each dataset, for Danish and English respectively. For each dataset, we report its main EuroEval metric together with the confidence interval.

| Model                    | scala-da (MCC) | dala (MCC)   | angry-tweets (MCC) | dansk (Micro F1 no misc) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) |
| ------------------------ | -------------- | ------------ | ------------------ | ------------------------ | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- |
| **comma-v0.1-2t** | 0.94 ± 0.76 | 0.15 ± 0.55 | 39.77 ± 1.36 | 31.97 ± 2.77 | 3.63 ± 2.31 | 10.72 ± 4.05 | 66.37 ± 0.81 | 3.84 ± 0.96 | 60.20 ± 1.69 |
| **munin-7b-open-stage1** | 13.27 ± 2.92 | 12.70 ± 2.16 | 47.65 ± 1.70 | 40.01 ± 2.39 | 18.06 ± 0.92 | 32.84 ± 1.43 | 76.57 ± 0.55 | 12.85 ± 1.02 | 65.91 ± 0.85 |
| **munin-7b-open-stage2** | 15.78 ± 3.05 | 14.43 ± 2.92 | 47.35 ± 2.30 | 40.42 ± 2.38 | 24.12 ± 1.79 | 36.07 ± 1.80 | 75.18 ± 0.71 | 13.09 ± 1.13 | 66.50 ± 0.69 |
| **munin-7b-open-stage3** | 16.45 ± 1.36 | 15.68 ± 1.74 | 46.33 ± 2.09 | 41.08 ± 2.81 | 24.61 ± 1.98 | 36.22 ± 1.69 | 76.02 ± 0.68 | 13.15 ± 1.21 | 66.55 ± 0.63 |
| **Pleias-350m-Preview** | -0.95 ± 1.46 | -1.84 ± 1.75 | 10.61 ± 2.87 | 12.86 ± 1.78 | 0.66 ± 2.63 | 4.59 ± 2.31 | 11.63 ± 0.88 | -0.26 ± 0.73 | 56.28 ± 1.47 |
| **Pleias-1.2b-Preview** | 0.17 ± 1.13 | 0.66 ± 1.01 | 27.70 ± 2.89 | 27.30 ± 2.18 | -0.61 ± 1.89 | 8.60 ± 3.24 | 35.20 ± 1.25 | -0.04 ± 1.48 | 60.34 ± 0.86 |

| Model                    | scala-en (MCC) | sst5 (MCC)   | conll-en (Micro F1 no misc) | life-in-the-uk (MCC) | squad (F1)   | hellaswag (MCC) | cnn-dailymail (BERTScore) |
| ------------------------ | -------------- | ------------ | --------------------------- | -------------------- | ------------ | --------------- | ------------------------- |
| **comma-v0.1-2t** | 29.74 ± 1.94 | 61.75 ± 2.08 | 57.54 ± 2.76 | 41.60 ± 2.41 | 90.38 ± 0.35 | 16.83 ± 0.63 | 63.33 ± 0.94 |
| **munin-7b-open-stage1** | 27.46 ± 2.13 | 60.01 ± 1.69 | 56.63 ± 2.14 | 40.45 ± 1.74 | 22.10 ± 0.67 | 13.66 ± 0.70 | 59.16 ± 1.40 |
| **munin-7b-open-stage2** | 27.65 ± 2.04 | 59.49 ± 1.59 | 56.61 ± 2.31 | 41.16 ± 1.73 | 22.29 ± 1.49 | 15.95 ± 0.90 | 60.22 ± 1.58 |
| **munin-7b-open-stage3** | 29.00 ± 2.41 | 60.30 ± 1.40 | 56.96 ± 2.49 | 41.71 ± 1.78 | 24.62 ± 2.29 | 13.76 ± 0.87 | 58.98 ± 1.70 |
| **Pleias-350m-Preview** | 0.71 ± 1.75 | 15.41 ± 7.34 | 31.76 ± 3.48 | -0.70 ± 2.11 | 31.07 ± 2.31 | 0.22 ± 1.35 | 53.80 ± 1.04 |
| **Pleias-1.2b-Preview** | 0.99 ± 2.37 | 48.23 ± 2.58 | 40.86 ± 3.28 | 2.55 ± 2.75 | 52.90 ± 2.48 | -0.06 ± 1.50 | 60.15 ± 1.59 |

The following plots show model size (x-axis) against an aggregate performance score (y-axis), for Danish and English respectively. Each metric is min-max normalized to the range [0, 1] across all evaluated models, and a model's aggregate score is the mean of its normalized metrics.

| Danish | English |
|:--------------------------:|:--------------------------:|
| <img src="./images/performance_plot_da.png" width="600"/> | <img src="./images/performance_plot_en.png" width="600"/> |

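
The aggregate score used in the plots can be sketched as follows. This is a minimal illustration, not the script that produced the plots: the function name `aggregate_scores` and the handling of a metric where all models tie are assumptions. The scores are copied from two columns of the Danish table for three of the models, so the resulting numbers will not match the plots, which normalize over every metric and every evaluated model.

```python
from statistics import mean

# Illustrative subset: two Danish metrics for three models, copied from the table.
scores = {
    "comma-v0.1-2t": {"scala-da": 0.94, "angry-tweets": 39.77},
    "munin-7b-open-stage3": {"scala-da": 16.45, "angry-tweets": 46.33},
    "Pleias-350m-Preview": {"scala-da": -0.95, "angry-tweets": 10.61},
}


def aggregate_scores(scores):
    """Min-max normalize each metric across models, then average per model."""
    metrics = sorted({metric for per_model in scores.values() for metric in per_model})
    normalized = {model: [] for model in scores}
    for metric in metrics:
        values = [scores[model][metric] for model in scores]
        lowest, highest = min(values), max(values)
        span = highest - lowest
        for model in scores:
            # Assumption: if every model has the same score, the metric maps to 0.0.
            value = (scores[model][metric] - lowest) / span if span else 0.0
            normalized[model].append(value)
    # The aggregate score is the mean of a model's normalized metrics.
    return {model: mean(values) for model, values in normalized.items()}


aggregate = aggregate_scores(scores)
for model, score in aggregate.items():
    print(f"{model}: {score:.3f}")
```

Because the normalization is relative to the set of models being compared, a model's aggregate score changes whenever models are added to or removed from the comparison; it ranks models within one plot rather than measuring absolute performance.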

## Limitations

Munin-7B-Open-pt was trained only on Danish and English-language data and code from the 15 programming languages covered by the [stack-edu classifiers](https://huggingface.co/collections/HuggingFaceTB/the-ultimate-collection-of-code-classifiers-67b5aa3eb8994a4b71453005).