	```bash
	python3 potency_inference.py
	# The script then prompts interactively for the options listed under "User Selections"
	```

	## Required Inputs

	### 1. Test Dataset (CSV File)

	Required columns:
	- ligand_smiles (or SMILES, smiles, canonical_smiles) - Chemical structure in SMILES format
	- protein_sequence (or PROTEIN_SEQ, protein_seq, sequence) - Amino acid sequence

	Optional:
	- pIC50 (or pic50, PIC50) - Ground truth binding affinity values (enables metric calculation)
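
The accepted header spellings above can be normalized to one canonical name per column before inference. The following is a minimal sketch using only the standard library; `COLUMN_ALIASES`, `resolve_columns`, and `load_rows` are hypothetical names for illustration and are not taken from `potency_inference.py`, whose actual lookup logic may differ.

```python
import csv
import io

# Accepted header spellings, copied from the column lists above.
COLUMN_ALIASES = {
    "ligand_smiles": ["ligand_smiles", "SMILES", "smiles", "canonical_smiles"],
    "protein_sequence": ["protein_sequence", "PROTEIN_SEQ", "protein_seq", "sequence"],
    "pIC50": ["pIC50", "pic50", "PIC50"],
}
REQUIRED = ("ligand_smiles", "protein_sequence")


def resolve_columns(header):
    """Map each canonical column name to the header spelling actually present."""
    resolved = {}
    for canonical, aliases in COLUMN_ALIASES.items():
        match = next((name for name in aliases if name in header), None)
        if match is not None:
            resolved[canonical] = match
    missing = [name for name in REQUIRED if name not in resolved]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    return resolved


def load_rows(handle):
    """Read a CSV and return its rows keyed by canonical column names."""
    reader = csv.DictReader(handle)
    mapping = resolve_columns(reader.fieldnames or [])
    return [{canon: row[actual] for canon, actual in mapping.items()} for row in reader]


# Tiny in-memory example; the values are made up for illustration.
example = io.StringIO("SMILES,sequence,pic50\nCCO,MKTAYIAK,6.2\n")
rows = load_rows(example)
print(rows)
```

Running a check like this before inference surfaces a missing required column as a clear error, and the presence of `pIC50` in the result indicates whether metrics can be computed.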

	### 2. Neural Network Model Files

	- Model checkpoint (.pt) - Trained GNN or GPFT model weights
	- Vocabulary (.pkl) - Amino acid to index mapping
	- Tokenizer (.pkl) - Protein sequence tokenizer

	### 3. XGBoost Model Files

	- XGBoost model (.json or .pkl) - Trained gradient boosting model
	- Feature scaler (.pkl) - StandardScaler for descriptor normalization
	- Descriptor list (.txt) - Names of RDKit molecular descriptors
	- Docking scores CSV (optional) - Pre-computed docking scores
	  - Columns: ligand_smiles, protein_sequence, docking_score
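
Since the docking scores file is keyed by ligand and protein, one way to attach scores to test rows is a dictionary lookup on the `(ligand_smiles, protein_sequence)` pair. This is a sketch under assumptions: the function names are hypothetical, matching is by exact string equality, and unmatched pairs get `None`. How the actual script matches rows or handles missing scores is not stated here.

```python
import csv
import io


def load_docking_scores(handle):
    """Build a lookup from (ligand_smiles, protein_sequence) to docking_score."""
    scores = {}
    for row in csv.DictReader(handle):
        key = (row["ligand_smiles"], row["protein_sequence"])
        scores[key] = float(row["docking_score"])
    return scores


def attach_docking_scores(rows, scores):
    """Return copies of rows with a docking_score field (None when no score exists)."""
    return [
        {**row, "docking_score": scores.get((row["ligand_smiles"], row["protein_sequence"]))}
        for row in rows
    ]


# Made-up example data for illustration only.
docking_csv = io.StringIO(
    "ligand_smiles,protein_sequence,docking_score\n"
    "CCO,MKTAYIAK,-7.3\n"
)
test_rows = [
    {"ligand_smiles": "CCO", "protein_sequence": "MKTAYIAK"},
    {"ligand_smiles": "CCN", "protein_sequence": "MKTAYIAK"},
]
merged = attach_docking_scores(test_rows, load_docking_scores(docking_csv))
print(merged)
```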

	### 4. Stacking Model File

	- Ridge regression model (.pth) - Meta-learner that combines predictions
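
At inference time a ridge regression meta-learner reduces to a weighted sum of the base-model predictions plus an intercept. The sketch below assumes two inputs, the neural network prediction and the XGBoost prediction; `stack_predictions` and the coefficients are hypothetical, and how the `.pth` file stores the trained weights is not described here.

```python
def stack_predictions(nn_pred, xgb_pred, weights, bias):
    """Linear meta-learner: bias + w_nn * nn_pred + w_xgb * xgb_pred, per sample."""
    if len(nn_pred) != len(xgb_pred):
        raise ValueError("base-model prediction lists must have the same length")
    w_nn, w_xgb = weights
    return [bias + w_nn * a + w_xgb * b for a, b in zip(nn_pred, xgb_pred)]


# Hypothetical coefficients for illustration; real values come from the trained model.
combined = stack_predictions([6.0, 7.5], [6.4, 7.1], weights=(0.5, 0.5), bias=0.0)
print(combined)
```

With equal weights and zero intercept, as in this example, the stacked value is simply the mean of the two base predictions; a trained meta-learner would generally weight them unequally.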

	### 5. User Selections (Interactive)

	- Model type: GNN or GPFT
	- Split strategy: Random or Scaffold (must match training)
	- Docking scores: whether the XGBoost model uses docking scores

	## Generated Outputs

	Output Directory Structure:
	```
	predictions/ (or custom name)
	├── test_predictions.csv
	├── metrics.json
	├── config.json
	└── predictions_plot.png
	```