arxiv:2606.10747
Federico Torrielli
EvilScript
AI & ML interests
AI Safety & Mechanistic interpretability
Recent Activity
upvoted a paper about 5 hours ago
The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment
authored a paper about 6 hours ago
The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment
authored a paper 4 days ago
PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models