Text Generation
Safetensors
Danish
English
llama
KennethEnevoldsen giannor committed on
Commit 1ca4d18 · 1 Parent(s): cf7c99d

Update README with evaluation results (#2)

- Update README with evaluation results (01ad791920f071f0c544ea2fb2a4410aeb7ea562)
- Updated evaluation results formatting (261cefca17e80d2a82ecedb9f3d1d3170d7b2281)


Co-authored-by: Gianluca Barmina <giannor@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +37 -0
README.md CHANGED
@@ -43,6 +43,43 @@ The characteristics of the three pre-training stages are detailed in the followi
  | stage3 | 524,288 tok | 18,926 | [subfolder="stage3"](https://huggingface.co/danish-foundation-models/munin-7b-open-pt/tree/main/stage3) | 2/3 [Dynaword](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword/tree/9e230b35e31a510e5ab909112ad5bfc9463b2c23); <br> 1/3 [Common-Pile](https://huggingface.co/common-pile/comma_v0.1_training_dataset/5afc546db324e7f39f297ba757c9a60547151e7c) | Excludes depbank, jvj, nordjyllandnews, synne for Dynaword; <br> uses subsets and weighting from [Comma-v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) cooldown phase for Common-Pile; LR schedule with 500 steps warmup, square root decay from 1e-5 |


+ ## Evaluation
+
+ Munin-7B-Open-pt was evaluated with the [EuroEval](https://euroeval.com/) framework, which provides benchmarks for seven task types in more than 15 European languages.
+
+ We report results in both Danish and English for all EuroEval-supported tasks: sentiment classification, named entity recognition, linguistic acceptability, reading comprehension, summarization, knowledge, and common-sense reasoning. In addition, we evaluate the model on DaLA, a Danish linguistic acceptability dataset that focuses on common real-world errors.
+
+ We compare Munin-7B-Open-pt at each of its three training stages with its base model [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) and two models from the Pleias family ([Pleias-350M-Preview](https://huggingface.co/PleIAs/Pleias-350m-Preview) and [Pleias-1.2B-Preview](https://huggingface.co/PleIAs/Pleias-1.2b-Preview)). All comparison models were trained exclusively on open data, either in the public domain or under a permissive license.
+
+
+ The following tables show the performance on each dataset, for Danish and English respectively. For each dataset, we report EuroEval's main metric together with its confidence interval (score ± interval).
+
+ | Model | scala-da (MCC) | dala (MCC) | angry-tweets (MCC) | no_misc_dansk (Micro F1) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) |
+ | ------------------------ | -------------- | ------------ | ------------------ | ------------------------ | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- |
+ | **comma-v0.1-2t** | 0.94 ± 0.76 | 0.15 ± 0.55 | 39.77 ± 1.36 | 31.97 ± 2.77 | 3.63 ± 2.31 | 10.72 ± 4.05 | 66.37 ± 0.81 | 3.84 ± 0.96 | 60.20 ± 1.69 |
+ | **munin-7b-open-stage1** | 13.27 ± 2.92 | 12.70 ± 2.16 | 47.65 ± 1.70 | 40.01 ± 2.39 | 18.06 ± 0.92 | 32.84 ± 1.43 | 76.57 ± 0.55 | 12.85 ± 1.02 | 65.91 ± 0.85 |
+ | **munin-7b-open-stage2** | 15.78 ± 3.05 | 14.43 ± 2.92 | 47.35 ± 2.30 | 40.42 ± 2.38 | 24.12 ± 1.79 | 36.07 ± 1.80 | 75.18 ± 0.71 | 13.09 ± 1.13 | 66.50 ± 0.69 |
+ | **munin-7b-open-stage3** | 16.45 ± 1.36 | 15.68 ± 1.74 | 46.33 ± 2.09 | 41.08 ± 2.81 | 24.61 ± 1.98 | 36.22 ± 1.69 | 76.02 ± 0.68 | 13.15 ± 1.21 | 66.55 ± 0.63 |
+ | **Pleias-350m-Preview** | -0.95 ± 1.46 | -1.84 ± 1.75 | 10.61 ± 2.87 | 12.86 ± 1.78 | 0.66 ± 2.63 | 4.59 ± 2.31 | 11.63 ± 0.88 | -0.26 ± 0.73 | 56.28 ± 1.47 |
+ | **Pleias-1.2b-Preview** | 0.17 ± 1.13 | 0.66 ± 1.01 | 27.70 ± 2.89 | 27.30 ± 2.18 | -0.61 ± 1.89 | 8.60 ± 3.24 | 35.20 ± 1.25 | -0.04 ± 1.48 | 60.34 ± 0.86 |
+
+ | Model | scala-en (MCC) | sst5 (MCC) | conll-en (Micro F1 no misc) | life-in-the-uk (MCC) | squad (F1) | hellaswag (MCC) | cnn-dailymail (BERTScore) |
+ | ------------------------ | -------------- | ------------ | --------------------------- | -------------------- | ------------ | --------------- | ------------------------- |
+ | **comma-v0.1-2t** | 29.74 ± 1.94 | 61.75 ± 2.08 | 57.54 ± 2.76 | 41.60 ± 2.41 | 90.38 ± 0.35 | 16.83 ± 0.63 | 63.33 ± 0.94 |
+ | **munin-7b-open-stage1** | 27.46 ± 2.13 | 60.01 ± 1.69 | 56.63 ± 2.14 | 40.45 ± 1.74 | 22.10 ± 0.67 | 13.66 ± 0.70 | 59.16 ± 1.40 |
+ | **munin-7b-open-stage2** | 27.65 ± 2.04 | 59.49 ± 1.59 | 56.61 ± 2.31 | 41.16 ± 1.73 | 22.29 ± 1.49 | 15.95 ± 0.90 | 60.22 ± 1.58 |
+ | **munin-7b-open-stage3** | 29.00 ± 2.41 | 60.30 ± 1.40 | 56.96 ± 2.49 | 41.71 ± 1.78 | 24.62 ± 2.29 | 13.76 ± 0.87 | 58.98 ± 1.70 |
+ | **Pleias-350m-Preview** | 0.71 ± 1.75 | 15.41 ± 7.34 | 31.76 ± 3.48 | -0.70 ± 2.11 | 31.07 ± 2.31 | 0.22 ± 1.35 | 53.80 ± 1.04 |
+ | **Pleias-1.2b-Preview** | 0.99 ± 2.37 | 48.23 ± 2.58 | 40.86 ± 3.28 | 2.55 ± 2.75 | 52.90 ± 2.48 | -0.06 ± 1.50 | 60.15 ± 1.59 |
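
One rough way to read the "score ± interval" cells in the tables above is to check whether two models' intervals overlap. The sketch below is not part of the README or of EuroEval: `parse_interval` and `intervals_overlap` are hypothetical helper names, and the code assumes the value after ± is a symmetric half-width around the score. Overlap is only a heuristic, not a formal significance test.

```python
import re

# Matches cells such as "16.45 ± 1.36" or "-0.95 ± 1.46".
_CELL = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*±\s*(\d+(?:\.\d+)?)\s*")


def parse_interval(cell):
    """Turn a 'score ± half_width' cell into a (lower, upper) tuple.

    Assumption: the number after ± is a symmetric half-width.
    """
    match = _CELL.fullmatch(cell)
    if match is None:
        raise ValueError(f"not a 'score ± interval' cell: {cell!r}")
    score, half_width = float(match.group(1)), float(match.group(2))
    return score - half_width, score + half_width


def intervals_overlap(cell_a, cell_b):
    """Return True if the two intervals share at least one point."""
    lower_a, upper_a = parse_interval(cell_a)
    lower_b, upper_b = parse_interval(cell_b)
    return lower_a <= upper_b and lower_b <= upper_a


# scala-en: comma-v0.1-2t vs. munin-7b-open-stage3 (values from the table)
SCALA_EN_OVERLAP = intervals_overlap("29.74 ± 1.94", "29.00 ± 2.41")
# scala-da: comma-v0.1-2t vs. munin-7b-open-stage3 (values from the table)
SCALA_DA_OVERLAP = intervals_overlap("0.94 ± 0.76", "16.45 ± 1.36")
```

Under that assumption, the two scala-en intervals overlap while the two scala-da intervals do not, which matches a visual reading of the tables.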
+
+
+ The following plots show model size on the x-axis and an aggregate performance score on the y-axis, for Danish and English respectively. Each metric is min-max normalized to the range [0, 1] across all evaluated models, and the aggregate score is the mean of the normalized metrics.
+
+ | Danish | English |
+ |:--------------------------:|:--------------------------:|
+ | <img src="./images/performance_plot_da.png" width="600"/> | <img src="./images/performance_plot_en.png" width="600"/> |
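
The aggregate score on the y-axis of the plots can be sketched as follows. This is a minimal illustration of the described min-max normalization and averaging, not the authors' plotting code. The scores are the Danish means copied from the results table, `aggregate_scores` is a hypothetical name, and it is an assumption that all nine Danish columns (including DaLA) enter the average and that a constant column would map to 0.

```python
# Danish mean scores copied from the results table, in column order:
# scala-da, dala, angry-tweets, no_misc_dansk, danske-talemaader,
# danish-citizen-tests, multi-wiki-qa-da, hellaswag-da, nordjylland-news
DANISH_SCORES = {
    "comma-v0.1-2t": [0.94, 0.15, 39.77, 31.97, 3.63, 10.72, 66.37, 3.84, 60.20],
    "munin-7b-open-stage1": [13.27, 12.70, 47.65, 40.01, 18.06, 32.84, 76.57, 12.85, 65.91],
    "munin-7b-open-stage2": [15.78, 14.43, 47.35, 40.42, 24.12, 36.07, 75.18, 13.09, 66.50],
    "munin-7b-open-stage3": [16.45, 15.68, 46.33, 41.08, 24.61, 36.22, 76.02, 13.15, 66.55],
    "Pleias-350m-Preview": [-0.95, -1.84, 10.61, 12.86, 0.66, 4.59, 11.63, -0.26, 56.28],
    "Pleias-1.2b-Preview": [0.17, 0.66, 27.70, 27.30, -0.61, 8.60, 35.20, -0.04, 60.34],
}


def aggregate_scores(scores):
    """Min-max normalize each metric across models, then average per model."""
    models = list(scores)
    n_metrics = len(scores[models[0]])
    totals = {model: 0.0 for model in models}
    for j in range(n_metrics):
        column = [scores[model][j] for model in models]
        lowest, highest = min(column), max(column)
        span = highest - lowest
        for model in models:
            # Assumption: a metric with identical values for all models
            # contributes 0.0 instead of dividing by zero.
            totals[model] += (scores[model][j] - lowest) / span if span else 0.0
    return {model: totals[model] / n_metrics for model in models}


AGGREGATE_DA = aggregate_scores(DANISH_SCORES)

if __name__ == "__main__":
    for model, score in sorted(AGGREGATE_DA.items(), key=lambda kv: -kv[1]):
        print(f"{model:24s} {score:.3f}")
```

Because the normalization is relative to the set of evaluated models, adding or removing a model changes every aggregate score, so the values are only comparable within one plot.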
+
+
  ## Limitations

  Munin-7B-Open-pt was trained only on Danish and English-language data and code from the 15 programming languages covered by the [stack-edu classifiers](https://huggingface.co/collections/HuggingFaceTB/the-ultimate-collection-of-code-classifiers-67b5aa3eb8994a4b71453005).