schirrmacher committed on
Commit 1d66f22 · verified · 1 Parent(s): 5b1fb23

Update README.md

Files changed (1)
  1. README.md +11 -248
README.md CHANGED
@@ -1,23 +1,26 @@
  ---
  license: mit
  ---
- # malwi - AI Python Malware Scanner
+ <div align="center">

+ # malwi - AI Python Malware Scanner
  <img src="malwi-logo.png" alt="Logo">

- ## malwi specializes in finding malware
+ <a href='https://huggingface.co/schirrmacher/malwi'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HF-Model-blue'></a>
+
+ </div>

- ### Key Features
+ ## Key Features

- - 🛡️ **AI-Powered Python Malware Detection**: Leverages advanced AI to identify malicious code in Python projects with high accuracy.
+ - 🛡️ **AI-Powered Python Malware Detection**

- - ⚡ **Lightning-Fast Codebase Scanning**: Scans entire repositories in seconds, so you can focus on development—not security worries.
+ - ⚡ **Lightning-Fast Codebase Scanning**

- - 🔒 **100% Offline & Private**: Your code never leaves your machine. Full control, zero data exposure.
+ - 🔒 **100% Offline & Private**

- - 💰 **Free & Open-Source**: No hidden costs. Built on transparent research and openly available data.
+ - 💰 **Free & Open-Source**

- - 🇪🇺 **Developed in the EU**: Committed to open-source principles and European data standards.
+ - 🇪🇺 **Developed in the EU**

  ### 1) Install
  ```
@@ -64,243 +67,3 @@ malwi scan examples/malicious

  => 👹 malicious 0.98
  ```
-
- ## PyPI Package Scanning
-
- malwi can directly scan PyPI packages without executing malicious logic, typically placed in `setup.py` or `__init__.py` files:
-
- ```bash
- malwi pypi requests
- ````
-
- ```
-                 __          __
- .--------.---.-|  .--.--.--|__|
- |        |  _  |  |  |  |  |  |
- |__|__|__|___._|__|________|__|
-    AI Python Malware Scanner
-
-
- - target: downloads/requests-2.32.4.tar
- - seconds: 3.10
- - files: 84
-   ├── scanned: 34
-   └── skipped: 50
-
- => 🟢 good
- ```
-
- ## Python API
-
- malwi provides a comprehensive Python API for integrating malware detection into your applications.
-
- ### Quick Start
-
- ```python
- import malwi
-
- report = malwi.MalwiReport.create(input_path="suspicious_file.py")
-
- for obj in report.malicious_objects:
-     print(f"File: {obj.file_path}")
- ```
-
- ### `MalwiReport`
-
- ```python
- MalwiReport.create(
-     input_path, # str or Path - file/directory to scan
-     accepted_extensions=None, # List[str] - file extensions to scan (e.g., ['py', 'js'])
-     silent=False, # bool - suppress progress messages
-     malicious_threshold=0.7, # float - threshold for malicious classification (0.0-1.0)
-     on_finding=None # callable - callback when malicious objects found
- ) -> MalwiReport # Returns: MalwiReport instance with scan results
- ```
-
- ```python
- import malwi
-
- report = malwi.MalwiReport.create("suspicious_directory/")
-
- # Properties
- report.malicious # bool: True if malicious objects detected
- report.confidence # float: Overall confidence score (0.0-1.0)
- report.duration # float: Scan duration in seconds
- report.all_objects # List[MalwiObject]: All analyzed code objects
- report.malicious_objects # List[MalwiObject]: Objects exceeding threshold
- report.threshold # float: Maliciousness threshold used (0.0-1.0)
- report.all_files # List[Path]: All files found in input path
- report.skipped_files # List[Path]: Files skipped (wrong extension)
- report.processed_files # int: Number of files successfully processed
- report.activities # List[str]: Suspicious activities detected
- report.input_path # str: Original input path scanned
- report.start_time # str: ISO 8601 timestamp when scan started
- report.all_file_types # List[str]: All file extensions found
- report.version # str: Malwi version with model hash
-
- # Methods
- report.to_demo_text() # str: Human-readable tree summary
- report.to_json() # str: JSON formatted report
- report.to_yaml() # str: YAML formatted report
- report.to_markdown() # str: Markdown formatted report
-
- # Pre-load models to avoid delay on first prediction
- malwi.MalwiReport.load_models_into_memory()
- ```
-
- ### `MalwiObject`
- ```python
- obj = report.all_objects[0]
-
- # Core properties
- obj.name # str: Function/class/module name
- obj.file_path # str: Path to source file
- obj.language # str: Programming language ('python'/'javascript')
- obj.maliciousness # float|None: ML confidence score (0.0-1.0)
- obj.warnings # List[str]: Compilation warnings/errors
-
- # Source code and AST compilation
- obj.file_source_code # str: Complete content of source file
- obj.source_code # str|None: Extracted source for this specific object
- obj.byte_code # List[Instruction]|None: Compiled AST bytecode
- obj.location # Tuple[int,int]|None: Start and end line numbers
- obj.embedding_count # int: Number of DistilBERT tokens (cached)
-
- # Analysis methods
- obj.predict() # dict: Run ML prediction and update maliciousness
- obj.to_tokens() # List[str]: Extract tokens for analysis
- obj.to_token_string() # str: Space-separated token string
- obj.to_string() # str: Bytecode as readable string
- obj.to_hash() # str: SHA256 hash of bytecode
- obj.to_dict() # dict: Serializable representation
- obj.to_yaml() # str: YAML formatted output
- obj.to_json() # str: JSON formatted output
-
- # Class methods
- MalwiObject.all_tokens(language="python") # List[str]: All possible tokens
- ```
-
- ## Why malwi?
-
- Malicious actors are increasingly [targeting open-source projects](https://arxiv.org/pdf/2404.04991), introducing packages designed to compromise security.
-
- Common malicious behaviors include:
-
- - **Data exfiltration**: Theft of sensitive information such as credentials, API keys, or user data.
- - **Backdoors**: Unauthorized remote access to systems, enabling attackers to exploit vulnerabilities.
- - **Destructive actions**: Deliberate sabotage, including file deletion, database corruption, or application disruption.
-
- ## How does it work?
-
- malwi is based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
-
- Imagine there is a function like:
-
- ```python
- def runcommand(value):
-     output = subprocess.run(value, shell=True, capture_output=True)
-     return [output.stdout, output.stderr]
- ```
-
- ### 1. Files are compiled to create an Abstract Syntax Tree with [Tree-sitter](https://tree-sitter.github.io/tree-sitter/index.html)
-
- ```
- module [0, 0] - [3, 0]
-   function_definition [0, 0] - [2, 41]
-     name: identifier [0, 4] - [0, 14]
-     parameters: parameters [0, 14] - [0, 21]
-       identifier [0, 15] - [0, 20]
-     ...
- ```
-
- ### 2. The AST is transpiled to dummy bytecode
-
- The bytecode is enhanced with security related instructions.
-
- ```
- TARGETED_FILE PUSH_NULL LOAD_GLOBAL PROCESS_MANAGEMENT LOAD_ATTR run LOAD_PARAM value LOAD_CONST BOOLEAN LOAD_CONST BOOLEAN KW_NAMES shell capture_output CALL STRING_VERSION STORE_GLOBAL output LOAD_GLOBAL output LOAD_ATTR stdout LOAD_GLOBAL output LOAD_ATTR stderr BUILD_LIST STRING_VERSION RETURN_VALUE
- ```
-
- ### 3. The bytecode is fed into a pre-trained [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)
-
- A DistilBERT model trained on [malware-samples](https://github.com/schirrmacher/malwi-samples) is used to identify suspicious code patterns.
-
- ```
- => Maliciousness: 0.98
- ```
-
- ## Benchmarks?
-
- ```
- training_loss: 0.0110
- epochs_completed: 3.0000
- original_train_samples: 598540.0000
- windowed_train_features: 831865.0000
- original_validation_samples: 149636.0000
- windowed_validation_features: 204781.0000
- benign_samples_used: 734930.0000
- malicious_samples_used: 13246.0000
- benign_to_malicious_ratio: 60.0000
- vocab_size: 30522.0000
- max_length: 512.0000
- window_stride: 128.0000
- batch_size: 16.0000
- eval_loss: 0.0107
- eval_accuracy: 0.9980
- eval_f1: 0.9521
- eval_precision: 0.9832
- eval_recall: 0.9229
- eval_runtime: 115.5982
- eval_samples_per_second: 1771.4900
- eval_steps_per_second: 110.7200
- epoch: 3.0000
- ```
-
- ## Contributing & Support
-
- - Found a bug or have a feature request? [Open an issue](https://github.com/schirrmacher/malwi/issues).
- - Do you have access to malicious packages in Rust, Go, or other languages? [Contact via GitHub profile](https://github.com/schirrmacher).
- - Struggling with false-positive findings? [Create a Pull-Request](https://github.com/schirrmacher/malwi-samples/pulls).
-
- ## Research
-
- ### Prerequisites
-
- 1. **Package Manager**: Install [uv](https://docs.astral.sh/uv/) for fast Python dependency management
- 2. **Training Data**: The research CLI will automatically clone [malwi-samples](https://github.com/schirrmacher/malwi-samples) when needed
-
- ### Quick Start
-
- ```bash
- # Install dependencies
- uv sync
-
- # Run tests
- uv run pytest tests
-
- # Train a model from scratch (full pipeline with automatic data download)
- ./research download preprocess train
- ```
-
- #### Individual Pipeline Steps
- ```bash
- # 1. Download training data (clones malwi-samples + downloads repositories)
- ./research download
-
- # 2. Data preprocessing only (parallel processing, ~4 min on 32 cores)
- ./research preprocess --language python
-
- # 3. Model training only (tokenizer + DistilBERT, ~40 minutes on NVIDIA RTX 4090)
- ./research train
- ```
-
- ## Limitations
-
- The malicious dataset includes some boilerplate functions, such as setup functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
-
- ## What's next?
-
- The first iteration focuses on **maliciousness of Python source code**.
-
- Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
-
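
Review note: the `Benchmarks?` block removed by this commit lists `eval_precision`, `eval_recall`, and `eval_f1` as separate figures, along with train/validation and benign/malicious sample counts. The sketch below is not part of malwi. It only re-derives F1 as the harmonic mean of precision and recall and cross-checks the sample totals, using the numbers quoted in the diff. The variable names are chosen here for illustration.

```python
# Figures copied from the removed "Benchmarks?" block in this diff.
precision = 0.9832
recall = 0.9229
train_samples = 598540
validation_samples = 149636
benign_samples = 734930
malicious_samples = 13246

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"recomputed F1: {f1:.4f}")

# The two ways of counting the dataset should agree.
total_by_split = train_samples + validation_samples
total_by_label = benign_samples + malicious_samples
print(f"train + validation: {total_by_split}")
print(f"benign + malicious: {total_by_label}")
print(f"validation share: {validation_samples / total_by_split:.3f}")
```

The recomputed F1 rounds to the reported 0.9521, and both totals come to 748176 samples, which corresponds to a roughly 80/20 train/validation split.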