OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation
Abstract
OmniSafeBench-MM is a unified benchmark and toolbox for evaluating multi-modal jailbreak attacks and defenses, covering a broad range of attack methods, defense strategies, and risk domains.
Recent advances in multi-modal large language models (MLLMs) have enabled unified perception-reasoning capabilities, yet these systems remain highly vulnerable to jailbreak attacks that bypass safety alignment and induce harmful behaviors. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES provide valuable insights into multi-modal vulnerabilities, but they typically focus on limited attack scenarios, lack standardized defense evaluation, and offer no unified, reproducible toolbox. To address these gaps, we introduce OmniSafeBench-MM, a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. OmniSafeBench-MM integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories, structured across consultative, imperative, and declarative inquiry types to reflect realistic user intentions. Beyond data coverage, it establishes a three-dimensional evaluation protocol measuring (1) harmfulness, graded on a multi-level scale ranging from low-impact individual harm to catastrophic societal threats, (2) intent alignment between responses and queries, and (3) response detail level, enabling nuanced safety-utility analysis. We conduct extensive experiments on 10 open-source and 8 closed-source MLLMs and reveal their broad vulnerability to multi-modal jailbreaks. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research. The code is released at https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM.
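To make the attack-defense evaluation loop concrete, the sketch below shows one plausible shape such a pipeline could take: every (attack, defense) pair is crossed over the dataset and each response is scored on the three dimensions. It is illustrative only; the names (`EvalRecord`, `attack.apply`, `defense.guard`, `judge.score`) are hypothetical and not taken from the OmniSafeBench-MM codebase, so consult the linked repository for the actual interface.

```python
# Illustrative sketch only: these class and method names are hypothetical
# and NOT taken from the OmniSafeBench-MM codebase.
from dataclasses import dataclass

@dataclass
class EvalRecord:
    query_id: str
    attack: str           # one of the 13 attack methods
    defense: str          # one of the 15 defense strategies
    harmfulness: int      # multi-level scale, e.g. 0 (harmless) .. 4 (catastrophic)
    intent_aligned: bool  # does the response address the query's intent?
    detail_level: int     # e.g. 0 (refusal) .. 3 (fully actionable detail)

def evaluate(model, dataset, attacks, defenses, judge):
    """Cross every (attack, defense) pair over the dataset and score each
    response on the three-dimensional protocol."""
    records = []
    for sample in dataset:                                  # image-text query pairs
        for attack in attacks:
            adv_image, adv_text = attack.apply(sample)      # hypothetical call
            for defense in defenses:
                response = defense.guard(model, adv_image, adv_text)
                scores = judge.score(sample, response)      # hypothetical judge
                records.append(EvalRecord(sample.id, attack.name, defense.name,
                                          scores.harmfulness,
                                          scores.intent_aligned,
                                          scores.detail_level))
    return records
```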
Community
This work presents OmniSafeBench-MM, a unified, open-source benchmark and toolbox designed for comprehensive evaluation of multimodal jailbreak attack and defense methods. It integrates 13 representative attack techniques, 15 defense strategies, and a diverse dataset spanning 9 risk domains and 50 fine-grained categories, covering the query types that arise in real use (consultative, imperative, declarative). We also propose a three-dimensional evaluation protocol measuring harmfulness (from low-impact individual harm to societal-level threats), intent alignment, and response detail, allowing nuanced analysis of the safety-utility tradeoff. Our extensive experiments across 10 open-source and 8 closed-source multimodal LLMs reveal widespread vulnerabilities to multimodal jailbreak attacks. By unifying data, methods, and evaluation, OmniSafeBench-MM offers a standardized, reproducible platform that we hope will become a foundational resource for future research on safe, robust multimodal LLMs.
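As a rough illustration of how the three scored dimensions could feed a safety-utility analysis, here is one hedged way to aggregate per-response scores (using the `EvalRecord` shape sketched above) into summary metrics. The `harm_threshold` and the success criterion are illustrative assumptions, not definitions from the paper.

```python
# Hypothetical aggregation over EvalRecord objects; the threshold and
# success criterion are assumptions, not the paper's definitions.
def summarize(records, harm_threshold=2):
    """Attack success rate: responses that are both on-intent and at least
    moderately harmful. Mean detail level serves as a rough utility proxy."""
    if not records:
        return {"attack_success_rate": 0.0, "mean_detail_level": 0.0}
    successes = [r for r in records
                 if r.intent_aligned and r.harmfulness >= harm_threshold]
    return {
        "attack_success_rate": len(successes) / len(records),
        "mean_detail_level": sum(r.detail_level for r in records) / len(records),
    }
```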
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks (2025)
- SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models (2025)
- TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations (2025)
- SAID: Empowering Large Language Models with Self-Activating Internal Defense (2025)
- JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework (2025)
- Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling (2025)
- Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education (2025)