hamishivi committed on
Commit
52b403f
·
verified ·
1 Parent(s): b19615b

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
+ assets filter=lfs diff=lfs merge=lfs -text
38
+ assets/deepseek_vs_nvidia102.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,127 @@
1
+ ---
2
+ base_model:
3
+ - deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
4
+ language:
5
+ - en
6
+ license: cc-by-nc-4.0
7
+ pipeline_tag: text-generation
8
+ library_name: transformers
9
+ ---
10
+
11
+ <div align="center">
12
+ <span style="font-family: default; font-size: 1.5em;">Nemotron-Research-Reasoning-Qwen-1.5B</span>
13
+ <div>
14
+ 🚀 The leading generalist reasoning model for cutting-edge research and development 🌟
15
+ </div>
16
+ </div>
17
+
18
+ ![Comparison between DeepSeek-R1-1.5B and Nemotron-Research-Reasoning-Qwen-1.5B](./assets/deepseek_vs_nvidia102.png)
19
+
20
+ ## News
21
+ - [2025-08-11] ProRL V2 blog post is released: [ProRL V2 - Prolonged Training Validates RL Scaling Laws](https://research.nvidia.com/labs/lpr/prorlv2/).
22
+ - [2025-07-23] Nemotron-Research-Reasoning-Qwen-1.5B-v2 is released.
23
+ - [2025-05-29] Nemotron-Research-Reasoning-Qwen-1.5B is released.
24
+
25
+ ## Introduction
26
+ Nemotron-Research-Reasoning-Qwen-1.5B is the world’s leading 1.5B open-weight model for complex reasoning tasks such as mathematical problems, coding challenges, scientific questions, and logic puzzles.
27
+ It is trained using the ProRL algorithm on a diverse and comprehensive set of datasets.
28
+ Our model has achieved impressive results, outperforming DeepSeek’s 1.5B model by a large margin on a broad range of tasks, including math, coding, and GPQA.
29
+
30
+ This model is for research and development only.
31
+
32
+ ## ProRL: Prolonged Reinforcement Learning
33
+ ProRL is designed to enable extended RL training periods that facilitate deeper exploration of reasoning strategies.
34
+ It enables more than 2k training steps and scales the training data across diverse tasks, from traditional math and code tasks to STEM problems, logical puzzles, and instruction following, which, we hypothesize, are crucial for generalization.
35
+ Based on Group Relative Policy Optimization (GRPO), ProRL introduces three key techniques:
36
+ 1. Mitigating entropy collapse
37
+ 2. Decoupled clip and dynamic sampling policy optimization (DAPO)
38
+ 3. KL regularization and reference policy reset
39
+
40
+ Using ProRL, we developed the world's best 1.5B reasoning model that significantly outperforms its base model, DeepSeek-R1-1.5B, and matches or even surpasses the performance of DeepSeek-R1-7B across a diverse range of benchmarks.
41
+ Notably, compared to DeepSeek-R1-1.5B, we achieve average pass@1 improvements of 14.7\% on math benchmarks, 13.9\% on coding, 54.8\% on logic puzzles, 25.1\% on STEM reasoning, and 18.1\% on instruction-following tasks.
42
+
43
+ ## Training Datasets
44
+ | Dataset | Link |
45
+ |---------------------------|-------------------------------------------------------------------------------------------|
46
+ | DeepScaleR-Preview-Dataset | [Link](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) |
47
+ | Eurus-2-RL-Data | [Link](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data) |
48
+ | Reasoning-gym | [Link](https://github.com/open-thought/reasoning-gym) |
49
+ | IFEval | [Link](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) |
50
+ | SCP-116K | [Link](https://huggingface.co/datasets/EricLu/SCP-116K) |
51
+
52
+
53
+ ## Evaluation Results
54
+
55
+ Table 1: Performance (pass@1) comparison across benchmarks in the math domain.
56
+ | Model | AIME24 | AIME25 | AMC | Math | Minerva | Olympiad | Avg |
57
+ |-------------------------------|--------|--------|-------|-------|----------|----------|--------|
58
+ | DeepSeek-R1-Distill-Qwen-1.5B | 28.54 | 22.71 | 62.58 | 82.90 | 26.38 | 43.58 | 44.45 |
59
+ | DeepScaleR-1.5B | 40.21 | 31.46 | 73.04 | 89.36 | 41.57 | 51.63 | 54.54 |
60
+ | *DeepSeek-R1-Distill-Qwen-7B* | 53.54 | 40.83 | 82.83 | 93.68 | 50.60 | 57.66 | 63.19 |
61
+ | **Nemotron-Research-Reasoning-Qwen-1.5B** | 48.13 | 33.33 | 79.29 | 91.89 | 47.98 | 60.22 | 60.14 |
62
+ | **Nemotron-Research-Reasoning-Qwen-1.5B-v2** | **49.58** | **36.04** | **82.53** | **92.49** | **49.03** | **60.44** | **61.69** |
63
+
64
+ Table 2: Performance (pass@1) comparison across code benchmarks. We abbreviate the benchmark names codecontests (cc), codeforces (cf), humanevalplus (human), and livecodebench (LCB).
65
+ | Model | apps | cc | cf | taco | human | LCB | Avg |
66
+ |-------------------------------|--------|--------|--------|--------|--------|--------|--------|
67
+ | DeepSeek-R1-Distill-Qwen-1.5B | 20.95 | 16.79 | 14.13 | 8.03 | 61.77 | 16.80 | 23.08 |
68
+ | DeepCoder-1.5B | 30.37 | 23.76 | 21.70 | 13.76 | 73.40 | 22.76 | 30.96 |
69
+ | *DeepSeek-R1-Distill-Qwen-7B* | 42.08 | 32.76 | 33.08 | 19.08 | 83.32 | 38.04 | 41.39 |
70
+ | **Nemotron-Research-Reasoning-Qwen-1.5B** | 41.99 | 31.80 | 34.50 | 20.81 | 72.05 | 23.81 | 37.49 |
71
+ | **Nemotron-Research-Reasoning-Qwen-1.5B-v2** | **46.39** | **35.59** | **40.75** | **22.89** | 72.89 | **27.69** | **41.03** |
72
+
73
+ Table 3: Performance comparison on STEM reasoning (GPQA Diamond), instruction following (IFEval), and logic puzzles (Reasoning Gym) tasks. We also present results on OOD tasks: acre, boxnet, and game_of_life_halting (game).
74
+ | Model | GPQA | IFEval | Reasoning | acre | boxnet | game |
75
+ |-------------------------------|--------|--------|-----------|--------|--------|--------|
76
+ | DeepSeek-R1-Distill-Qwen-1.5B | 15.86 | 44.05 | 4.24 | 5.99 | 0.00 | 3.49 |
77
+ | *DeepSeek-R1-Distill-Qwen-7B* | 35.44 | 58.01 | 28.55 | 20.21 | 1.71 | 12.94 |
78
+ | **Nemotron-Research-Reasoning-Qwen-1.5B** | **41.78** | 66.02 | 59.06 | **58.57** | **7.91** | **52.29** |
79
+ | **Nemotron-Research-Reasoning-Qwen-1.5B-v2** | 41.32 | **70.85** | **62.49** | - | - | - |
80
+
81
+
82
+ ## Nemotron-Research-Reasoning-Qwen-1.5B-v2
83
+
84
+ Following the release of Nemotron-Research-Reasoning-Qwen-1.5B, we scaled the training steps from 2000 to 3000, resulting in Nemotron-Research-Reasoning-Qwen-1.5B-v2.
85
+ Nemotron-Research-Reasoning-Qwen-1.5B-v2 builds on top of REINFORCE++-baseline with dynamic sampling and clip-higher, and proposes several critical enhancements such as periodically refreshing the reference model with the current best checkpoint and imposing the length penalty only in scheduled cycles.
86
+ Together, these techniques allow model performance to continually improve with more RL training steps and expand LLMs' reasoning boundaries.
87
+ Our latest checkpoint, Nemotron-Research-Reasoning-Qwen-1.5B-v2, trained for 3000 steps, sets a new state-of-the-art (SOTA) among 1.5B reasoning models.
88
+
89
+ To load Nemotron-Research-Reasoning-Qwen-1.5B-v2, use the following code:
90
+ ```
91
+ from transformers import AutoTokenizer, AutoModelForCausalLM
92
+
93
+ tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
94
+ model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
95
+ ```
96
+
97
+ To load the original Nemotron-Research-Reasoning-Qwen-1.5B, use the following code:
98
+ ```
99
+ from transformers import AutoTokenizer, AutoModelForCausalLM
100
+
101
+ tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B", revision="v1")
102
+ model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B", revision="v1")
103
+ ```
104
+
105
+
106
+ ## License/Terms of Use
107
+ cc-by-nc-4.0
108
+
109
+ ## Ethical Considerations
110
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
111
+
112
+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
113
+
114
+ ## Citation
115
+ If you find our model helpful, please cite the following [paper](https://arxiv.org/abs/2505.24864):
116
+
117
+ ```
118
+ @article{liu2025prorl,
119
+ author = {Mingjie Liu and Shizhe Diao and Ximing Lu and Jian Hu and Xin Dong and Yejin Choi and Jan Kautz and Yi Dong},
120
+ title={ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models},
121
+ journal = {arXiv preprint},
122
+ year = {2025},
123
+ archivePrefix = {arXiv},
124
+ primaryClass = {cs.CL},
125
+ url={https://arxiv.org/abs/2505.24864},
126
+ }
127
+ ```
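The README above describes ProRL as building on GRPO with decoupled clipping and dynamic sampling. As a rough illustration only, and not the authors' implementation, the sketch below shows a group-relative advantage, a clipped surrogate with separate lower and upper bounds, and a filter that drops groups with no reward signal. The function names and the `eps_low`/`eps_high` defaults are assumptions chosen for this example; they are not taken from the commit.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped term with decoupled lower and upper clip bounds.

    eps_low and eps_high are illustrative values, not the ones used in training.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def has_learning_signal(rewards):
    """Dynamic-sampling style filter: keep a group only if its rewards differ."""
    return len(set(rewards)) > 1


# Hypothetical binary rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages, has_learning_signal(rewards))
```

A larger upper bound than lower bound lets low-probability tokens with positive advantage grow faster than a symmetric clip would allow, which is one way to counteract entropy collapse.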
added_tokens.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "</tool_call>": 151658,
3
+ "<tool_call>": 151657,
4
+ "<|box_end|>": 151649,
5
+ "<|box_start|>": 151648,
6
+ "<|endoftext|>": 151643,
7
+ "<|file_sep|>": 151664,
8
+ "<|fim_middle|>": 151660,
9
+ "<|fim_pad|>": 151662,
10
+ "<|fim_prefix|>": 151659,
11
+ "<|fim_suffix|>": 151661,
12
+ "<|im_end|>": 151645,
13
+ "<|im_start|>": 151644,
14
+ "<|image_pad|>": 151655,
15
+ "<|object_ref_end|>": 151647,
16
+ "<|object_ref_start|>": 151646,
17
+ "<|quad_end|>": 151651,
18
+ "<|quad_start|>": 151650,
19
+ "<|repo_name|>": 151663,
20
+ "<|video_pad|>": 151656,
21
+ "<|vision_end|>": 151653,
22
+ "<|vision_pad|>": 151654,
23
+ "<|vision_start|>": 151652
24
+ }
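A quick consistency check can make the token table above easier to read. The sketch below copies the entries from `added_tokens.json`, verifies that the ids form one contiguous range, and looks up the special-token ids declared in `config.json` from the same commit. It only inspects this one file; whether `tokenizer.json` assigns different names to these ids is not visible in the diff.

```python
# Entries copied from added_tokens.json in this commit.
added_tokens = {
    "</tool_call>": 151658, "<tool_call>": 151657,
    "<|box_end|>": 151649, "<|box_start|>": 151648,
    "<|endoftext|>": 151643, "<|file_sep|>": 151664,
    "<|fim_middle|>": 151660, "<|fim_pad|>": 151662,
    "<|fim_prefix|>": 151659, "<|fim_suffix|>": 151661,
    "<|im_end|>": 151645, "<|im_start|>": 151644,
    "<|image_pad|>": 151655, "<|object_ref_end|>": 151647,
    "<|object_ref_start|>": 151646, "<|quad_end|>": 151651,
    "<|quad_start|>": 151650, "<|repo_name|>": 151663,
    "<|video_pad|>": 151656, "<|vision_end|>": 151653,
    "<|vision_pad|>": 151654, "<|vision_start|>": 151652,
}

# Special-token ids copied from config.json in this commit.
special_ids = {"bos_token_id": 151646, "eos_token_id": 151643, "pad_token_id": 151643}

ids = sorted(added_tokens.values())
is_contiguous = ids == list(range(ids[0], ids[-1] + 1))
id_to_token = {token_id: token for token, token_id in added_tokens.items()}

for name, token_id in special_ids.items():
    print(name, token_id, id_to_token.get(token_id, "<not in added_tokens.json>"))
```

Going by this file alone, `eos_token_id` and `pad_token_id` both resolve to `<|endoftext|>`, while `bos_token_id` 151646 resolves to `<|object_ref_start|>`, which may be worth confirming against the tokenizer that is actually loaded.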
all_results.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "epoch": 7.0,
3
+ "total_flos": 1.1955108328583987e+17,
4
+ "train_loss": 0.9723419804781719,
5
+ "train_runtime": 594145.3794,
6
+ "train_samples_per_second": 14.138,
7
+ "train_steps_per_second": 0.055
8
+ }
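The throughput figures in `all_results.json` can be cross-checked with a little arithmetic. The sketch below assumes `train_runtime` is reported in seconds and that the two rate fields are averages over the whole run; both rates are rounded in the file, so the derived values are approximate.

```python
# Values copied from all_results.json in this commit.
results = {
    "epoch": 7.0,
    "train_runtime": 594145.3794,  # assumed to be seconds
    "train_samples_per_second": 14.138,
    "train_steps_per_second": 0.055,
}

total_samples = results["train_runtime"] * results["train_samples_per_second"]
samples_per_epoch = total_samples / results["epoch"]
total_steps = results["train_runtime"] * results["train_steps_per_second"]
samples_per_step = results["train_samples_per_second"] / results["train_steps_per_second"]
runtime_hours = results["train_runtime"] / 3600

print(f"total samples    ~ {total_samples:,.0f}")
print(f"samples per epoch ~ {samples_per_epoch:,.0f}")
print(f"optimizer steps  ~ {total_steps:,.0f}")
print(f"samples per step ~ {samples_per_step:.1f}")
print(f"runtime          ~ {runtime_hours:.1f} hours")
```

The samples-per-step estimate comes out at roughly 257, which is consistent with the `global_batch_size: '256'` entry in `configs.yaml` once rounding of the reported rates is taken into account.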
config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen2ForCausalLM"
4
+ ],
5
+ "attention_dropout": 0.0,
6
+ "bos_token_id": 151646,
7
+ "eos_token_id": 151643,
8
+ "hidden_act": "silu",
9
+ "hidden_size": 1536,
10
+ "initializer_range": 0.02,
11
+ "intermediate_size": 8960,
12
+ "max_position_embeddings": 131072,
13
+ "max_window_layers": 21,
14
+ "model_type": "qwen2",
15
+ "num_attention_heads": 12,
16
+ "num_hidden_layers": 28,
17
+ "num_key_value_heads": 2,
18
+ "pad_token_id": 151643,
19
+ "rms_norm_eps": 1e-06,
20
+ "rope_scaling": null,
21
+ "rope_theta": 10000,
22
+ "sliding_window": null,
23
+ "tie_word_embeddings": false,
24
+ "torch_dtype": "bfloat16",
25
+ "transformers_version": "4.51.3",
26
+ "use_cache": true,
27
+ "use_mrope": false,
28
+ "use_sliding_window": false,
29
+ "vocab_size": 151936
30
+ }
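The architecture fields in `config.json` are enough to estimate the parameter count and compare it with the `total_size` recorded in `model.safetensors.index.json` in this same commit. The sketch below assumes the usual Qwen2 layout: bias terms on the q/k/v projections only, no bias on the output projection or MLP, two RMSNorm weights per layer, and one final norm. The helper name `count_parameters` is made up for this example.

```python
# Fields copied from config.json in this commit.
config = {
    "hidden_size": 1536,
    "intermediate_size": 8960,
    "num_attention_heads": 12,
    "num_hidden_layers": 28,
    "num_key_value_heads": 2,
    "tie_word_embeddings": False,
    "vocab_size": 151936,
}


def count_parameters(cfg):
    """Estimate the parameter count under the assumed Qwen2 layout."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]  # grouped-query attention

    q_proj = h * h + h                 # weight + bias
    kv_proj = 2 * (h * kv_dim + kv_dim)  # k and v, weight + bias each
    o_proj = h * h                     # assumed to have no bias
    mlp = 3 * h * cfg["intermediate_size"]  # gate, up, down
    norms = 2 * h                      # input and post-attention RMSNorm
    per_layer = q_proj + kv_proj + o_proj + mlp + norms

    embeddings = cfg["vocab_size"] * h
    lm_head = 0 if cfg["tie_word_embeddings"] else embeddings
    final_norm = h
    return embeddings + lm_head + cfg["num_hidden_layers"] * per_layer + final_norm


total_params = count_parameters(config)
total_bytes = total_params * 2  # bfloat16 stores 2 bytes per parameter
print(total_params, total_bytes)
```

Under these assumptions the byte count equals the `total_size` of 3554176000 listed in the index file, which suggests the checkpoint holds only the weights described by this config in bfloat16.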
configs.yaml ADDED
@@ -0,0 +1,39 @@
1
+ assistant_tag: gpt
2
+ bf16: 'True'
3
+ content_tag: value
4
+ cutoff_len: '16384'
5
+ dataloader_num_workers: '4'
6
+ dataloader_persistent_workers: 'True'
7
+ dataloader_pin_memory: 'True'
8
+ dataset: mlfoundations-dev/openthoughts3
9
+ dataset_dir: ONLINE
10
+ ddp_timeout: '180000000'
11
+ deepspeed: /opt/ml/code/zero3.json
12
+ do_train: 'True'
13
+ enable_liger_kernel: 'True'
14
+ finetuning_type: full
15
+ formatting: sharegpt
16
+ global_batch_size: '256'
17
+ gradient_accumulation_steps: '1'
18
+ hub_model_id: mlfoundations-dev/openthoughts3_full_qwen25_1b
19
+ learning_rate: '0.00016'
20
+ logging_steps: '1'
21
+ lr_scheduler_type: cosine
22
+ messages: conversations
23
+ model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
24
+ num_train_epochs: '7.0'
25
+ output_dir: /opt/ml/model
26
+ overwrite_cache: 'True'
27
+ per_device_train_batch_size: '4'
28
+ plot_loss: 'True'
29
+ preprocessing_num_workers: '16'
30
+ push_to_db: 'True'
31
+ push_to_hub: 'True'
32
+ report_to: wandb
33
+ role_tag: from
34
+ run_name: openthoughts3_full_qwen25_1b
35
+ save_strategy: epoch
36
+ stage: sft
37
+ template: qwen25
38
+ user_tag: human
39
+ warmup_ratio: '0.1'
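The batch-size and scheduler entries in `configs.yaml` imply a device count and a learning-rate curve. The sketch below is an illustration under two assumptions: that the global batch equals per-device batch times gradient accumulation times device count, and that `cosine` with `warmup_ratio` means linear warmup followed by a half-cosine decay to zero. The `total_steps` value in the usage lines is a placeholder, not a figure from the commit.

```python
import math

# Values copied from configs.yaml in this commit (stored there as strings).
learning_rate = 0.00016
warmup_ratio = 0.1
global_batch_size = 256
per_device_train_batch_size = 4
gradient_accumulation_steps = 1

# Assumed relation: global = per_device * accumulation * devices.
implied_devices = global_batch_size // (
    per_device_train_batch_size * gradient_accumulation_steps
)


def lr_at(step, total_steps, peak_lr=learning_rate, ratio=warmup_ratio):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # placeholder for illustration only
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps))
print("implied devices:", implied_devices)
```

With these settings the peak rate of 1.6e-4 is reached after the first 10% of steps and falls to half that value midway through the decay phase.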
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 151646,
4
+ "eos_token_id": 151643,
5
+ "pad_token_id": 151643,
6
+ "transformers_version": "4.51.3"
7
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00000-of-00001.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2491da6f902364394f7936abe91445ed9f6e24975093e96c98c84cbbb6ca7038
3
+ size 3554214720
model.safetensors.index.json ADDED
@@ -0,0 +1,346 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 3554176000
4
+ },
5
+ "weight_map": {
6
+ "model.embed_tokens.weight": "model-00000-of-00001.safetensors",
7
+ "lm_head.weight": "model-00000-of-00001.safetensors",
8
+ "model.layers.0.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
9
+ "model.layers.0.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
10
+ "model.layers.0.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
11
+ "model.layers.1.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
12
+ "model.layers.1.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
13
+ "model.layers.1.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
14
+ "model.layers.2.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
15
+ "model.layers.2.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
16
+ "model.layers.2.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
17
+ "model.layers.3.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
18
+ "model.layers.3.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
19
+ "model.layers.3.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
20
+ "model.layers.4.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
21
+ "model.layers.4.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
22
+ "model.layers.4.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
23
+ "model.layers.5.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
24
+ "model.layers.5.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
25
+ "model.layers.5.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
26
+ "model.layers.6.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
27
+ "model.layers.6.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
28
+ "model.layers.6.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
29
+ "model.layers.7.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
30
+ "model.layers.7.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
31
+ "model.layers.7.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
32
+ "model.layers.8.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
33
+ "model.layers.8.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
34
+ "model.layers.8.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
35
+ "model.layers.9.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
36
+ "model.layers.9.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
37
+ "model.layers.9.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
38
+ "model.layers.10.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
39
+ "model.layers.10.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
40
+ "model.layers.10.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
41
+ "model.layers.11.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
42
+ "model.layers.11.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
43
+ "model.layers.11.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
44
+ "model.layers.12.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
45
+ "model.layers.12.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
46
+ "model.layers.12.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
47
+ "model.layers.13.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
48
+ "model.layers.13.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
49
+ "model.layers.13.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
50
+ "model.layers.14.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
51
+ "model.layers.14.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
52
+ "model.layers.14.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
53
+ "model.layers.15.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
54
+ "model.layers.15.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
55
+ "model.layers.15.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
56
+ "model.layers.16.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
57
+ "model.layers.16.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
58
+ "model.layers.16.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
59
+ "model.layers.17.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
60
+ "model.layers.17.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
61
+ "model.layers.17.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
62
+ "model.layers.18.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
63
+ "model.layers.18.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
64
+ "model.layers.18.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
65
+ "model.layers.19.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
66
+ "model.layers.19.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
67
+ "model.layers.19.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
68
+ "model.layers.20.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
69
+ "model.layers.20.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
70
+ "model.layers.20.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
71
+ "model.layers.21.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
72
+ "model.layers.21.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
73
+ "model.layers.21.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
74
+ "model.layers.22.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
75
+ "model.layers.22.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
76
+ "model.layers.22.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
77
+ "model.layers.23.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
78
+ "model.layers.23.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
79
+ "model.layers.23.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
80
+ "model.layers.24.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
81
+ "model.layers.24.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
82
+ "model.layers.24.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
83
+ "model.layers.25.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
84
+ "model.layers.25.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
85
+ "model.layers.25.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
86
+ "model.layers.26.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
87
+ "model.layers.26.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
88
+ "model.layers.26.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
89
+ "model.layers.27.self_attn.q_proj.weight": "model-00000-of-00001.safetensors",
90
+ "model.layers.27.self_attn.k_proj.weight": "model-00000-of-00001.safetensors",
91
+ "model.layers.27.self_attn.v_proj.weight": "model-00000-of-00001.safetensors",
92
+ "model.layers.0.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
93
+ "model.layers.0.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
94
+ "model.layers.0.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
95
+ "model.layers.1.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
96
+ "model.layers.1.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
97
+ "model.layers.1.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
98
+ "model.layers.2.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
99
+ "model.layers.2.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
100
+ "model.layers.2.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
101
+ "model.layers.3.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
102
+ "model.layers.3.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
103
+ "model.layers.3.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
104
+ "model.layers.4.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
105
+ "model.layers.4.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
106
+ "model.layers.4.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
107
+ "model.layers.5.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
108
+ "model.layers.5.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
109
+ "model.layers.5.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
110
+ "model.layers.6.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
111
+ "model.layers.6.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
112
+ "model.layers.6.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
113
+ "model.layers.7.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
114
+ "model.layers.7.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
115
+ "model.layers.7.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
116
+ "model.layers.8.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
117
+ "model.layers.8.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
118
+ "model.layers.8.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
119
+ "model.layers.9.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
120
+ "model.layers.9.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
121
+ "model.layers.9.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
122
+ "model.layers.10.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
123
+ "model.layers.10.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
124
+ "model.layers.10.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
125
+ "model.layers.11.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
126
+ "model.layers.11.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
127
+ "model.layers.11.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
128
+ "model.layers.12.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
129
+ "model.layers.12.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
130
+ "model.layers.12.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
131
+ "model.layers.13.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
132
+ "model.layers.13.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
133
+ "model.layers.13.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
134
+ "model.layers.14.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
135
+ "model.layers.14.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
136
+ "model.layers.14.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
137
+ "model.layers.15.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
138
+ "model.layers.15.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
139
+ "model.layers.15.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
140
+ "model.layers.16.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
141
+ "model.layers.16.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
142
+ "model.layers.16.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
143
+ "model.layers.17.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
144
+ "model.layers.17.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
145
+ "model.layers.17.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
146
+ "model.layers.18.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
147
+ "model.layers.18.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
148
+ "model.layers.18.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
149
+ "model.layers.19.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
150
+ "model.layers.19.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
151
+ "model.layers.19.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
152
+ "model.layers.20.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
153
+ "model.layers.20.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
154
+ "model.layers.20.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
155
+ "model.layers.21.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
156
+ "model.layers.21.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
157
+ "model.layers.21.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
158
+ "model.layers.22.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
159
+ "model.layers.22.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
160
+ "model.layers.22.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
161
+ "model.layers.23.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
162
+ "model.layers.23.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
163
+ "model.layers.23.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
164
+ "model.layers.24.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
165
+ "model.layers.24.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
166
+ "model.layers.24.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
167
+ "model.layers.25.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
168
+ "model.layers.25.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
169
+ "model.layers.25.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
170
+ "model.layers.26.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
171
+ "model.layers.26.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
172
+ "model.layers.26.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
173
+ "model.layers.27.self_attn.q_proj.bias": "model-00000-of-00001.safetensors",
174
+ "model.layers.27.self_attn.k_proj.bias": "model-00000-of-00001.safetensors",
175
+ "model.layers.27.self_attn.v_proj.bias": "model-00000-of-00001.safetensors",
176
+ "model.layers.0.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
177
+ "model.layers.0.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
178
+ "model.layers.1.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
179
+ "model.layers.1.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
180
+ "model.layers.2.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
181
+ "model.layers.2.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
182
+ "model.layers.3.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
183
+ "model.layers.3.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
184
+ "model.layers.4.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
185
+ "model.layers.4.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
186
+ "model.layers.5.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
187
+ "model.layers.5.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
188
+ "model.layers.6.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
189
+ "model.layers.6.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
190
+ "model.layers.7.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
191
+ "model.layers.7.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
192
+ "model.layers.8.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
193
+ "model.layers.8.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
194
+ "model.layers.9.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
195
+ "model.layers.9.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
196
+ "model.layers.10.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
197
+ "model.layers.10.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
198
+ "model.layers.11.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.11.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.12.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.12.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.13.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.13.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.14.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.14.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.15.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.15.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.16.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.16.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.17.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.17.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.18.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.18.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.19.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.19.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.20.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.20.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.21.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.21.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.22.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.22.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.23.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.23.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.24.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.24.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.25.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.25.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.26.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.26.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.27.mlp.gate_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.27.mlp.up_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.0.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.1.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.2.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.3.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.4.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.5.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.6.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.7.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.8.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.9.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.10.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.11.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.12.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.13.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.14.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.15.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.16.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.17.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.18.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.19.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.20.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.21.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.22.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.23.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.24.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.25.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.26.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.27.input_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.0.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.1.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.2.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.3.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.4.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.5.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.6.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.7.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.8.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.9.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.10.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.11.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.12.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.13.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.14.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.15.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.16.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.17.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.18.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.19.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.20.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.21.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.22.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.23.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.24.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.25.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.26.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.27.post_attention_layernorm.weight": "model-00000-of-00001.safetensors",
+ "model.layers.0.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.1.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.2.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.3.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.4.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.5.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.6.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.7.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.8.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.9.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.10.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.11.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.12.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.13.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.14.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.15.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.16.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.17.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.18.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.19.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.20.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.21.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.22.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.23.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.24.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.25.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.26.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.27.self_attn.o_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.0.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.1.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.2.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.3.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.4.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.5.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.6.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.7.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.8.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.9.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.10.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.11.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.12.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.13.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.14.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.15.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.16.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.17.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.18.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.19.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.20.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.21.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.22.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.23.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.24.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.25.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.26.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.layers.27.mlp.down_proj.weight": "model-00000-of-00001.safetensors",
+ "model.norm.weight": "model-00000-of-00001.safetensors"
+ }
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+ "bos_token": {
+ "content": "<|begin▁of▁sentence|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "<|end▁of▁sentence|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "<|end▁of▁sentence|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e20ddafc659ba90242154b55275402edeca0715e5dbb30f56815a4ce081f4893
+ size 11422778
tokenizer_config.json ADDED
@@ -0,0 +1,195 @@
+ {
+ "add_bos_token": true,
+ "add_eos_token": false,
+ "add_prefix_space": null,
+ "added_tokens_decoder": {
+ "151643": {
+ "content": "<|end▁of▁sentence|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151644": {
+ "content": "<|User|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151645": {
+ "content": "<|Assistant|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151646": {
+ "content": "<|begin▁of▁sentence|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151647": {
+ "content": "<|EOT|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151648": {
+ "content": "<think>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151649": {
+ "content": "</think>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
62
+ "151650": {
63
+ "content": "<|quad_start|>",
64
+ "lstrip": false,
65
+ "normalized": false,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": true
69
+ },
70
+ "151651": {
71
+ "content": "<|quad_end|>",
72
+ "lstrip": false,
73
+ "normalized": false,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": true
77
+ },
78
+ "151652": {
79
+ "content": "<|vision_start|>",
80
+ "lstrip": false,
81
+ "normalized": false,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": true
85
+ },
86
+ "151653": {
87
+ "content": "<|vision_end|>",
88
+ "lstrip": false,
89
+ "normalized": false,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": true
93
+ },
94
+ "151654": {
95
+ "content": "<|vision_pad|>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": true
101
+ },
102
+ "151655": {
103
+ "content": "<|image_pad|>",
104
+ "lstrip": false,
105
+ "normalized": false,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": true
109
+ },
110
+ "151656": {
111
+ "content": "<|video_pad|>",
112
+ "lstrip": false,
113
+ "normalized": false,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": true
117
+ },
118
+ "151657": {
119
+ "content": "<tool_call>",
120
+ "lstrip": false,
121
+ "normalized": false,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": false
125
+ },
126
+ "151658": {
127
+ "content": "</tool_call>",
128
+ "lstrip": false,
129
+ "normalized": false,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": false
133
+ },
134
+ "151659": {
135
+ "content": "<|fim_prefix|>",
136
+ "lstrip": false,
137
+ "normalized": false,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": false
141
+ },
142
+ "151660": {
143
+ "content": "<|fim_middle|>",
144
+ "lstrip": false,
145
+ "normalized": false,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": false
149
+ },
150
+ "151661": {
151
+ "content": "<|fim_suffix|>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": false
157
+ },
158
+ "151662": {
159
+ "content": "<|fim_pad|>",
160
+ "lstrip": false,
161
+ "normalized": false,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": false
165
+ },
166
+ "151663": {
167
+ "content": "<|repo_name|>",
168
+ "lstrip": false,
169
+ "normalized": false,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": false
173
+ },
174
+ "151664": {
175
+ "content": "<|file_sep|>",
176
+ "lstrip": false,
177
+ "normalized": false,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": false
181
+ }
182
+ },
183
+ "bos_token": "<|begin▁of▁sentence|>",
+ "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|><think>\\n'}}{% endif %}",
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|end▁of▁sentence|>",
+ "extra_special_tokens": {},
+ "legacy": true,
+ "model_max_length": 16384,
+ "pad_token": "<|end▁of▁sentence|>",
+ "sp_model_kwargs": {},
+ "tokenizer_class": "LlamaTokenizerFast",
+ "unk_token": null,
+ "use_default_system_prompt": false
+ }
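
To make the `chat_template` in this file easier to follow, here is a minimal pure-Python sketch of what it appears to render for plain system/user/assistant turns. It is a simplified re-implementation, not the Jinja template itself: tool calls and tool outputs are ignored, and `render_prompt` is a hypothetical helper name. The special-token strings are copied as they are displayed in this diff; the exact characters should be read from `tokenizer_config.json` before relying on them.

```python
# Simplified re-implementation of the user/assistant path of the chat
# template shown above. Tool calls and tool outputs are NOT handled.
# Token strings are copied from this diff as displayed (assumption: the
# real file may use different code points for the bar characters).
BOS = "<|begin▁of▁sentence|>"
EOS = "<|end▁of▁sentence|>"
USER = "<|User|>"
ASSISTANT = "<|Assistant|>"
THINK_END = "</think>"


def render_prompt(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts into one prompt string."""
    # The template keeps the content of the last system message, if any.
    system_prompt = ""
    for message in messages:
        if message["role"] == "system":
            system_prompt = message["content"]

    parts = [BOS, system_prompt]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            parts.append(USER + content)
        elif role == "assistant" and content is not None:
            # Earlier reasoning is dropped: only the text after the last
            # </think> is kept for previous assistant turns.
            if THINK_END in content:
                content = content.split(THINK_END)[-1]
            parts.append(ASSISTANT + content + EOS)

    if add_generation_prompt:
        # The template opens a reasoning block for the next assistant turn.
        parts.append(ASSISTANT + "<think>\n")
    return "".join(parts)


if __name__ == "__main__":
    chat = [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "<think>Add them.</think>4"},
        {"role": "user", "content": "And 3 + 3?"},
    ]
    print(render_prompt(chat, add_generation_prompt=True))
```

Two details of the template stand out in this sketch: earlier assistant turns are re-inserted without their reasoning, and the generation prompt ends with `<think>` followed by a newline, so the model starts each reply inside a reasoning block.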
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "epoch": 7.0,
+ "total_flos": 1.1955108328583987e+17,
+ "train_loss": 0.9723419804781719,
+ "train_runtime": 594145.3794,
+ "train_samples_per_second": 14.138,
+ "train_steps_per_second": 0.055
+ }
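
The figures in `train_results.json` can be cross-checked with simple arithmetic. The sketch below assumes that `train_runtime` is in seconds and that the two per-second rates are averages over the whole run; both are assumptions about how these metrics were logged, not something this diff states. `summarize` is a hypothetical helper name.

```python
# Values copied from train_results.json above.
results = {
    "epoch": 7.0,
    "train_runtime": 594145.3794,         # assumed to be seconds
    "train_samples_per_second": 14.138,   # assumed run-wide average
    "train_steps_per_second": 0.055,      # assumed run-wide average
}


def summarize(r):
    """Derive approximate totals from the runtime and the average rates."""
    runtime = r["train_runtime"]
    total_samples = runtime * r["train_samples_per_second"]
    total_steps = runtime * r["train_steps_per_second"]
    return {
        "runtime_days": runtime / 86400,
        "approx_total_samples": total_samples,
        "approx_total_steps": total_steps,
        "approx_samples_per_step": (
            r["train_samples_per_second"] / r["train_steps_per_second"]
        ),
        "approx_samples_per_epoch": total_samples / r["epoch"],
    }


if __name__ == "__main__":
    for key, value in summarize(results).items():
        print(f"{key}: {value:,.1f}")
```

Under those assumptions the run lasted a little under 7 days and processed roughly 8.4 million samples in total. The samples-per-step ratio comes out near 257, but the step rate is rounded to three decimals, so that value is only approximate.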
trainer_log.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
 
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1f6f5a1df95b05fb903b55190e76d56c83e0c6b1fa2fad102ecf8eb98be7e686
+ size 7224
training_loss.png ADDED
vocab.json ADDED
The diff for this file is too large to render. See raw diff