```bash
python3 potency_inference.py
```

## Required Inputs

### 1. Test Dataset (CSV File)

**Required columns:**
- ligand_smiles (or SMILES, smiles, canonical_smiles) - Chemical structure in SMILES format
- protein_sequence (or PROTEIN_SEQ, protein_seq, sequence) - Amino acid sequence

**Optional:**
- pIC50 (or pic50, PIC50) - Ground truth binding affinity values (enables metric calculation)

### 2. Neural Network Model Files

- Model checkpoint (.pt) - Trained GNN or GPFT model weights
- Vocabulary (.pkl) - Amino acid to index mapping
- Tokenizer (.pkl) - Protein sequence tokenizer

### 3. XGBoost Model Files

- XGBoost model (.json or .pkl) - Trained gradient boosting model
- Feature scaler (.pkl) - StandardScaler for descriptor normalization
- Descriptor list (.txt) - Names of RDKit molecular descriptors
- Docking scores CSV (optional) - Pre-computed docking scores
  - Columns: ligand_smiles, protein_sequence, docking_score

### 4. Stacking Model File

- Ridge regression model (.pth) - Meta-learner that combines predictions

### 5. User Selections (Interactive)

- Model type: GNN or GPFT
- Split strategy: Random or Scaffold (must match training)
- Whether the XGBoost model uses docking scores

## Generated Outputs

**Output Directory Structure:**
```
predictions/ (or custom name)
├── test_predictions.csv
├── metrics.json
├── config.json
└── predictions_plot.png
```
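
Before running inference, it can help to confirm that a test CSV uses one of the accepted header spellings listed under Required Inputs. The following is a minimal sketch of such a check using only the Python standard library. It is not taken from `potency_inference.py`: the names `COLUMN_ALIASES`, `resolve_columns`, and `load_rows` are hypothetical, and the script's own handling of column aliases may differ.

```python
import csv
import io

# Accepted header spellings, copied from the "Required Inputs" list.
# The dictionary itself is an illustrative assumption, not the script's code.
COLUMN_ALIASES = {
    "ligand_smiles": ["ligand_smiles", "SMILES", "smiles", "canonical_smiles"],
    "protein_sequence": ["protein_sequence", "PROTEIN_SEQ", "protein_seq", "sequence"],
    "pIC50": ["pIC50", "pic50", "PIC50"],  # optional ground-truth column
}
REQUIRED = ("ligand_smiles", "protein_sequence")


def resolve_columns(header):
    """Map each canonical column name to the alias actually present in `header`."""
    resolved = {}
    for canonical, aliases in COLUMN_ALIASES.items():
        match = next((name for name in aliases if name in header), None)
        if match is not None:
            resolved[canonical] = match
    missing = [name for name in REQUIRED if name not in resolved]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    return resolved


def load_rows(handle):
    """Read a CSV file object and return rows keyed by canonical column names."""
    reader = csv.DictReader(handle)
    mapping = resolve_columns(reader.fieldnames or [])
    return [
        {canonical: row[actual] for canonical, actual in mapping.items()}
        for row in reader
    ]


# Small in-memory example; a real dataset would be opened with open(path, newline="").
sample = io.StringIO("SMILES,sequence,pic50\nCCO,MKTAYIAK,6.2\n")
rows = load_rows(sample)
print(rows[0])
```

Values are returned as strings, so a pIC50 column would still need converting to float before any metric calculation. A file missing either required column raises `ValueError` rather than failing later during featurization.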