Update README with evaluation results (#2)

- Update README with evaluation results (01ad791920f071f0c544ea2fb2a4410aeb7ea562)
- Updated evaluation results formatting (261cefca17e80d2a82ecedb9f3d1d3170d7b2281)

Co-authored-by: Gianluca Barmina <giannor@users.noreply.huggingface.co>

README.md +37 -0

@@ -43,6 +43,43 @@ The characteristics of the three pre-training stages are detailed in the followi
 | stage3 | 524,288 tok | 18,926 | [subfolder="stage3"](https://huggingface.co/danish-foundation-models/munin-7b-open-pt/tree/main/stage3)  | 2/3 [Dynaword](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword/tree/9e230b35e31a510e5ab909112ad5bfc9463b2c23); <br> 1/3 [Common-Pile](https://huggingface.co/common-pile/comma_v0.1_training_dataset/5afc546db324e7f39f297ba757c9a60547151e7c) | Excludes depbank, jvj, nordjyllandnews, synne for Dynaword; <br> uses subsets and weighting from [Comma-v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) cooldown phase for Common-Pile; LR schedule with 500 steps warmup, square root decay from 1e-5 |
+## Evaluation
+Munin-7B-Open-pt was evaluated using the [EuroEval](https://euroeval.com/) framework, which includes benchmarks across seven task types covering more than 15 European languages.
+We report results in both Danish and English for all EuroEval-supported tasks: sentiment classification, named entity recognition, linguistic acceptability, reading comprehension, summarization, knowledge, and common-sense reasoning. In addition, we evaluate the model on DaLA, a Danish linguistic acceptability dataset that focuses on common real-world errors.
+We compare Munin-7B-Open-pt at various training stages with its base model [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) and two models from the Pleias family ([Pleias-350M-Preview](https://huggingface.co/PleIAs/Pleias-350m-Preview) and [Pleias-1.2B-Preview](https://huggingface.co/PleIAs/Pleias-1.2b-Preview)). All comparison models were trained exclusively on open data, either in the public domain or under a permissive license.
+The following tables show the performance on each dataset, for Danish and English respectively. For each dataset, we report the main EuroEval metric as score ± confidence interval.
+| Model                    | scala-da (MCC) | dala (MCC)   | angry-tweets (MCC) | dansk (Micro F1 no misc) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) |
+| ------------------------ | -------------- | ------------ | ------------------ | ------------------------ | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- |
+| **comma-v0.1-2t**        | 0.94 ± 0.76    | 0.15 ± 0.55  | 39.77 ± 1.36       | 31.97 ± 2.77             | 3.63 ± 2.31             | 10.72 ± 4.05               | 66.37 ± 0.81          | 3.84 ± 0.96        | 60.20 ± 1.69                 |
+| **munin-7b-open-stage1** | 13.27 ± 2.92   | 12.70 ± 2.16 | 47.65 ± 1.70       | 40.01 ± 2.39             | 18.06 ± 0.92            | 32.84 ± 1.43               | 76.57 ± 0.55          | 12.85 ± 1.02       | 65.91 ± 0.85                 |
+| **munin-7b-open-stage2** | 15.78 ± 3.05   | 14.43 ± 2.92 | 47.35 ± 2.30       | 40.42 ± 2.38             | 24.12 ± 1.79            | 36.07 ± 1.80               | 75.18 ± 0.71          | 13.09 ± 1.13       | 66.50 ± 0.69                 |
+| **munin-7b-open-stage3** | 16.45 ± 1.36   | 15.68 ± 1.74 | 46.33 ± 2.09       | 41.08 ± 2.81             | 24.61 ± 1.98            | 36.22 ± 1.69               | 76.02 ± 0.68          | 13.15 ± 1.21       | 66.55 ± 0.63                 |
+| **Pleias-350m-Preview**  | -0.95 ± 1.46   | -1.84 ± 1.75 | 10.61 ± 2.87       | 12.86 ± 1.78             | 0.66 ± 2.63             | 4.59 ± 2.31                | 11.63 ± 0.88          | -0.26 ± 0.73       | 56.28 ± 1.47                 |
+| **Pleias-1.2b-Preview**  | 0.17 ± 1.13    | 0.66 ± 1.01  | 27.70 ± 2.89       | 27.30 ± 2.18             | -0.61 ± 1.89            | 8.60 ± 3.24                | 35.20 ± 1.25          | -0.04 ± 1.48       | 60.34 ± 0.86                 |
+| Model                    | scala-en (MCC) | sst5 (MCC)   | conll-en (Micro F1 no misc) | life-in-the-uk (MCC) | squad (F1)   | hellaswag (MCC) | cnn-dailymail (BERTScore) |
+| ------------------------ | -------------- | ------------ | --------------------------- | -------------------- | ------------ | --------------- | ------------------------- |
+| **comma-v0.1-2t**        | 29.74 ± 1.94   | 61.75 ± 2.08 | 57.54 ± 2.76                | 41.60 ± 2.41         | 90.38 ± 0.35 | 16.83 ± 0.63    | 63.33 ± 0.94              |
+| **munin-7b-open-stage1** | 27.46 ± 2.13   | 60.01 ± 1.69 | 56.63 ± 2.14                | 40.45 ± 1.74         | 22.10 ± 0.67 | 13.66 ± 0.70    | 59.16 ± 1.40              |
+| **munin-7b-open-stage2** | 27.65 ± 2.04   | 59.49 ± 1.59 | 56.61 ± 2.31                | 41.16 ± 1.73         | 22.29 ± 1.49 | 15.95 ± 0.90    | 60.22 ± 1.58              |
+| **munin-7b-open-stage3** | 29.00 ± 2.41   | 60.30 ± 1.40 | 56.96 ± 2.49                | 41.71 ± 1.78         | 24.62 ± 2.29 | 13.76 ± 0.87    | 58.98 ± 1.70              |
+| **Pleias-350m-Preview**  | 0.71 ± 1.75    | 15.41 ± 7.34 | 31.76 ± 3.48                | -0.70 ± 2.11         | 31.07 ± 2.31 | 0.22 ± 1.35     | 53.80 ± 1.04              |
+| **Pleias-1.2b-Preview**  | 0.99 ± 2.37    | 48.23 ± 2.58 | 40.86 ± 3.28                | 2.55 ± 2.75          | 52.90 ± 2.48 | -0.06 ± 1.50    | 60.15 ± 1.59              |
+The following plots show, for Danish and English respectively, model size on the x-axis and an aggregate performance score on the y-axis. Each metric is min-max normalized across all evaluated models to the range [0, 1], and the aggregate score is the average of the normalized metrics.
+| Danish | English |
+|:--------------------------:|:--------------------------:|
+| <img src="./images/performance_plot_da.png" width="600"/> | <img src="./images/performance_plot_en.png" width="600"/> |
 ## Limitations
 Munin-7B-Open-pt was trained only on Danish and English-language data and code from the 15 programming languages covered by the [stack-edu classifiers](https://huggingface.co/collections/HuggingFaceTB/the-ultimate-collection-of-code-classifiers-67b5aa3eb8994a4b71453005).
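
The aggregate score used in the plots of the Evaluation section (min-max normalization per metric, then an average per model) can be sketched as follows. This is a minimal illustration rather than the script that produced the plots: the function name `aggregate_scores` is hypothetical, only two of the nine Danish columns and four of the six models are included, confidence intervals are ignored, and the handling of a metric on which all models tie is an assumption, since the README does not specify it.

```python
from statistics import mean

# Scores copied from the Danish results table (two columns, four models).
scores = {
    "comma-v0.1-2t": {"scala-da": 0.94, "angry-tweets": 39.77},
    "munin-7b-open-stage1": {"scala-da": 13.27, "angry-tweets": 47.65},
    "munin-7b-open-stage3": {"scala-da": 16.45, "angry-tweets": 46.33},
    "Pleias-350m-Preview": {"scala-da": -0.95, "angry-tweets": 10.61},
}


def aggregate_scores(scores):
    """Min-max normalize each metric across models, then average per model."""
    metrics = sorted({m for per_model in scores.values() for m in per_model})
    normalized = {model: [] for model in scores}
    for metric in metrics:
        values = [scores[model][metric] for model in scores]
        lo, hi = min(values), max(values)
        span = hi - lo
        for model in scores:
            # Assumption: a metric on which all models tie contributes 0.0.
            norm = (scores[model][metric] - lo) / span if span else 0.0
            normalized[model].append(norm)
    return {model: mean(vals) for model, vals in normalized.items()}


aggregate = aggregate_scores(scores)
for model, score in sorted(aggregate.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {score:.3f}")
```

Because the normalization is relative to the set of evaluated models, the aggregate score changes whenever a model is added or removed, so it is only comparable within a single plot.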