mjbommar committed (verified)
Commit ce335b9 · Parent: c441d03

Upload binary-tokenizer-001-4k tokenizer

Files changed (4):
  1. .gitattributes +2 -35
  2. README.md +189 -0
  3. analysis_results.json +131 -0
  4. tokenizer.json +0 -0
.gitattributes CHANGED
@@ -1,35 +1,2 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.json filter=lfs diff=lfs merge=lfs -text
+ *.txt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,189 @@
---
language:
- code
tags:
- tokenizer
- binary-analysis
- binary-tokenization
- bpe
- byte-pair-encoding
- reverse-engineering
- malware-analysis
- cybersecurity
- executable-analysis
license: mit
pipeline_tag: feature-extraction
library_name: tokenizers
---

# binary-tokenizer-001-4k

**Model Name**: `binary-tokenizer-001-4k`
**HuggingFace**: [`mjbommar/binary-tokenizer-001-4k`](https://huggingface.co/mjbommar/binary-tokenizer-001-4k)
**Vocabulary Size**: 4,096 tokens (2^12)

---

## Training Configuration

**Training Corpus**:
- Source: `/nas4/data/glaurung-data/binaries-small/`
- Size: ~13 GB
- Files: 30,738 binary files
- Platforms: Linux (ELF), Windows (PE), macOS (Mach-O), Android (APK)
- Architectures: x86-64, x86, ARM64, ARM, MIPS, RISC-V

**Training Parameters**:
- Vocabulary size: 4,096 (including 7 special tokens)
- Min frequency: 10
- Chunk size: 8,192 bytes
- Allowed lengths: DEFAULT (1-16 bytes)
- Training duration: ~1h 46min

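For orientation only, the following is a minimal training sketch using the Hugging Face `tokenizers` library with the parameters listed above. It is not the actual `train_tokenizers.sh` / bbpe pipeline; the corpus path and the chunking helper are assumptions for illustration.

```python
# Minimal byte-level BPE training sketch -- NOT the original bbpe pipeline.
# `corpus_dir` is a hypothetical local directory of binary files.
from pathlib import Path
from tokenizers import Tokenizer, models, trainers

corpus_dir = Path("binaries-small")
special_tokens = ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
                  "<|cls|>", "<|sep|>", "<|mask|>"]

def iter_chunks(chunk_size: int = 8192):
    """Yield each binary as latin-1 text in 8,192-byte chunks."""
    for path in corpus_dir.rglob("*"):
        if path.is_file():
            data = path.read_bytes()
            for i in range(0, len(data), chunk_size):
                yield data[i:i + chunk_size].decode("latin-1")

tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
trainer = trainers.BpeTrainer(
    vocab_size=4096,
    min_frequency=10,
    special_tokens=special_tokens,
    initial_alphabet=[chr(b) for b in range(256)],  # keep all 256 base bytes
)
tokenizer.train_from_iterator(iter_chunks(), trainer=trainer)
tokenizer.save("tokenizer-4096.json")
```
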
---

## Vocabulary Statistics

**Composition**:
- Base bytes (0-255): 256 tokens
- Learned merges: 3,833 tokens
- Special tokens: 7 tokens (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
- **Total**: 4,096 tokens

**Quality Metrics**:
- All tokens reachable: ✓ Yes
- Valid merges: 3,833 / 3,833
- Power-of-2 size: ✓ Yes (2^12)

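These counts can be spot-checked against the published tokenizer with the standard `tokenizers` API; the expected values in the comments come from `analysis_results.json`.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-4k")

# Total vocabulary size, including special tokens (expected: 4,096)
print("vocab size:", tokenizer.get_vocab_size())

# Every special token should resolve to an id
for tok in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
            "<|cls|>", "<|sep|>", "<|mask|>"]:
    print(tok, "->", tokenizer.token_to_id(tok))
```
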
---

## Token Length Distribution

| Length | Count | Percentage | Description |
|--------|-------|------------|-------------|
| 1 byte | 256 | 6.3% | Base bytes |
| 2 bytes | 1,974 | 48.3% | Byte pairs |
| 3 bytes | 841 | 20.6% | Complete x86-64 instructions |
| 4 bytes | 649 | 15.9% | Instructions with operands |
| 5 bytes | 95 | 2.3% | Complex patterns |
| 6 bytes | 86 | 2.1% | Complex patterns |
| 7 bytes | 40 | 1.0% | Complex patterns |
| 8 bytes | 59 | 1.4% | Complex patterns |
| 9+ bytes | 89 | 2.2% | Long patterns |

**Average Token Length**: 3.000 bytes

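The distribution above is taken from `analysis_results.json`. It can also be recomputed from the vocabulary itself; a short sketch, assuming (as in the usage examples below) that tokens map to raw bytes via latin-1:

```python
from collections import Counter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-4k")
special = {"<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
           "<|cls|>", "<|sep|>", "<|mask|>"}

# Byte length of every learned token (special tokens excluded)
lengths = Counter(
    len(token.encode("latin-1"))
    for token in tokenizer.get_vocab()
    if token not in special
)

total = sum(lengths.values())
for n in sorted(lengths):
    print(f"{n:>2} bytes: {lengths[n]:>5} ({100 * lengths[n] / total:.1f}%)")
print("average:", sum(n * c for n, c in lengths.items()) / total)
```
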
76
+ ---
77
+
78
+ ## Byte Content Analysis
79
+
80
+ **Content Categories**:
81
+ - Contains NULL byte (0x00): 1,094 tokens (26.7%)
82
+ - ASCII printable (0x20-0x7E): 896 tokens (21.9%)
83
+ - All ASCII (<0x80): 1,879 tokens (45.9%)
84
+ - High bytes (≥0x80): 2,210 tokens (54.0%)
85
+
86
+ **Most Common Bytes in Tokens**:
87
+ - `0x00` (NULL): 2,468 occurrences - Padding and alignment
88
+ - `0xFF`: 404 occurrences - Sentinel values
89
+ - `0x48` (REX.W): 340 occurrences - x86-64 REX prefix
90
+ - `0x8B` (MOV): 233 occurrences - x86-64 MOV opcode
91
+ - `0xCC` (INT3): 170 occurrences - Debug breakpoint padding
92
+
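The same vocabulary walk yields the per-byte counts; a sketch under the same latin-1 assumption:

```python
from collections import Counter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-4k")
special = {"<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
           "<|cls|>", "<|sep|>", "<|mask|>"}

byte_counts = Counter()
for token in tokenizer.get_vocab():
    if token not in special:
        byte_counts.update(token.encode("latin-1"))  # counts individual byte values

# Expect 0x00, 0xff, 0x48 (REX.W), 0x8b (MOV), 0xcc (INT3) near the top
for value, count in byte_counts.most_common(5):
    print(f"0x{value:02x}: {count}")
```
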
---

## Sequence Coverage

**N-byte Sequence Diversity**:

| Length | Learned Tokens | Possible Sequences | Coverage |
|--------|----------------|--------------------|----------|
| 1-byte | 256 | 256 | 100.00% |
| 2-byte | 1,974 | 65,536 | 3.01% |
| 3-byte | 841 | 16,777,216 | 0.005% |
| 4-byte | 649 | 4,294,967,296 | 0.000015% |

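Coverage here is simply the number of learned tokens of a given length divided by the 256^n possible byte sequences of that length, expressed as a percentage. For example:

```python
# Coverage = learned tokens of length n / 256**n possible sequences, as a percentage
learned = {1: 256, 2: 1974, 3: 841, 4: 649}  # counts from the table above
for n, count in learned.items():
    coverage = 100.0 * count / (256 ** n)
    print(f"{n}-byte: {count} / {256 ** n} = {coverage:.6g}%")
```
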
---

## Files

- `tokenizer-4096.json` - Trained tokenizer model (286 KB)
- `analysis_results.json` - Detailed analysis statistics
- `training.log` - Training output log
- `training_stats.txt` - Training summary

---

## Usage

**Load from HuggingFace Hub**:
```python
from tokenizers import Tokenizer

# Load directly from HuggingFace
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-4k")
```

**Load from local file**:
```bash
# With bbpe CLI
bbpe encode --tokenizer tokenizer-4096.json /path/to/binary
bbpe info tokenizer-4096.json
```

**Complete Python Example**:
```python
from tokenizers import Tokenizer

# Load from HuggingFace or local file
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-4k")
# OR: tokenizer = Tokenizer.from_file("tokenizer-4096.json")

# Read binary file and decode as latin-1 (preserves all byte values 0-255)
with open("/usr/bin/ls", "rb") as f:
    data = f.read()
data_str = data.decode("latin-1")

# Encode the binary data
encoding = tokenizer.encode(data_str)
print(f"File size: {len(data)} bytes")
print(f"Total tokens: {len(encoding.ids)}")
print(f"Compression: {len(data) / len(encoding.ids):.3f} bytes/token")

# First 10 tokens
for i, (token_id, token) in enumerate(zip(encoding.ids[:10], encoding.tokens[:10])):
    token_bytes = token.encode("latin-1")
    print(f" Token {i}: ID={token_id:5d} hex={token_bytes.hex():20s} ({len(token_bytes)} bytes)")

# Decode tokens back to bytes
decoded_str = tokenizer.decode(encoding.ids)
decoded_bytes = decoded_str.encode("latin-1")
assert decoded_bytes == data  # Perfect reconstruction
```

**Example output for `/usr/bin/ls` (142,312 bytes)**:
```
File size: 142312 bytes
Total tokens: 71272
Compression: 1.997 bytes/token

First 10 tokens:
 Token 0: ID=  127 hex=7f                   (1 bytes)
 Token 1: ID= 3732 hex=454c                 (2 bytes)
 Token 2: ID=   70 hex=46                   (1 bytes)
 Token 3: ID=    2 hex=02                   (1 bytes)
 Token 4: ID=  392 hex=0101                 (2 bytes)
 Token 5: ID=  662 hex=000000000000000000   (9 bytes)
 Token 6: ID=  265 hex=0300                 (2 bytes)
 Token 7: ID= 1369 hex=3e00                 (2 bytes)
 Token 8: ID=  279 hex=01000000             (4 bytes)
 Token 9: ID=   48 hex=30                   (1 bytes)

Decoded: 7f454c4602010100000000000000000003003e000100000030...
(ELF header: 7f 45 4c 46 = ELF magic bytes)
```

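Because latin-1 maps one character to one byte, the `(start, end)` offsets on the `Encoding` object double as byte offsets into the original file, which makes it easy to tie tokens back to structures such as the ELF header above. A short sketch:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-4k")

with open("/usr/bin/ls", "rb") as f:
    data = f.read()

encoding = tokenizer.encode(data.decode("latin-1"))

# With latin-1, character offsets == byte offsets, so each token maps to an
# exact byte span of the file (the first spans cover the ELF magic 7f 45 4c 46).
for token_id, (start, end) in zip(encoding.ids[:4], encoding.offsets[:4]):
    print(f"bytes [{start}:{end}] = {data[start:end].hex()}  (token id {token_id})")
```
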
---

**Generated**: November 12, 2025
**Training Script**: `train_tokenizers.sh`
**Analysis Script**: `analyze_tokenizer.py`
analysis_results.json ADDED
@@ -0,0 +1,131 @@
{
  "vocab_size": {
    "total": 4089,
    "total_with_special": 4096,
    "base": 256,
    "merges": 3833,
    "special": 7,
    "is_power_of_2": true,
    "power": 12,
    "matches_expected": true
  },
  "reachability": {
    "valid_merges": 3833,
    "invalid_merges": 0,
    "reachable": 4089,
    "unreachable": 0,
    "all_reachable": true
  },
  "length_dist": {
    "distribution": {
      "1": 256,
      "2": 1974,
      "3": 841,
      "4": 649,
      "5": 95,
      "6": 86,
      "7": 40,
      "8": 59,
      "9": 19,
      "10": 11,
      "11": 7,
      "12": 15,
      "13": 3,
      "14": 7,
      "15": 5,
      "16": 11,
      "17": 2,
      "19": 1,
      "21": 1,
      "23": 1,
      "32": 5,
      "20": 1
    },
    "avg_length": 3.0004891171435557,
    "min_length": 1,
    "max_length": 32,
    "length_3_count": 841,
    "length_3_percent": 20.56737588652482
  },
  "byte_content": {
    "null_tokens": 1094,
    "ascii_printable": 896,
    "ascii_only": 1879,
    "high_byte": 2210,
    "mixed": 965,
    "byte_distribution": {
      "0": 2468,
      "255": 404,
      "72": 340,
      "1": 287,
      "32": 251,
      "3": 235,
      "139": 233,
      "204": 170,
      "36": 160,
      "64": 159,
      "2": 155,
      "116": 155,
      "65": 148,
      "249": 144,
      "128": 123,
      "4": 122,
      "101": 122,
      "137": 121,
      "15": 118,
      "145": 103,
      "97": 93,
      "8": 92,
      "68": 91,
      "131": 88,
      "232": 87,
      "114": 87,
      "16": 83,
      "170": 80,
      "110": 79,
      "111": 78,
      "105": 77,
      "84": 75,
      "115": 75,
      "169": 72,
      "192": 71,
      "99": 70,
      "117": 68,
      "141": 68,
      "6": 67,
      "76": 66,
      "69": 66,
      "108": 66,
      "31": 65,
      "5": 61,
      "33": 60,
      "112": 59,
      "100": 58,
      "48": 57,
      "224": 57,
      "95": 57
    }
  },
  "diversity": {
    "1": {
      "learned": 256,
      "possible": 256,
      "coverage": 100.0
    },
    "2": {
      "learned": 1974,
      "possible": 65536,
      "coverage": 3.0120849609375
    },
    "3": {
      "learned": 841,
      "possible": 16777216,
      "coverage": 0.0050127506256103516
    },
    "4": {
      "learned": 649,
      "possible": 4294967296,
      "coverage": 1.5110708773136139e-05
    }
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff