VHELM: A Holistic Evaluation of Vision Language Models Paper • 2410.07112 • Published Oct 9, 2024 • 3
AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies Paper • 2407.17436 • Published Jul 11, 2024
Image2Struct: Benchmarking Structure Extraction for Vision-Language Models Paper • 2410.22456 • Published Oct 29, 2024
SEA-HELM: Southeast Asian Holistic Evaluation of Language Models Paper • 2502.14301 • Published Feb 20, 2025 • 3
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons Paper • 2503.05731 • Published Feb 19, 2025 • 3
AHELM: A Holistic Evaluation of Audio-Language Models Paper • 2508.21376 • Published Aug 29, 2025 • 9
Structured Prompting Enables More Robust Evaluation of Language Models Paper • 2511.20836 • Published Nov 25, 2025
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation Paper • 2510.11977 • Published Oct 13, 2025
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw Paper • 2604.04759 • Published 29 days ago • 24
Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits Paper • 2603.22339 • Published Mar 21, 2026 • 5
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning Paper • 2505.04601 • Published May 7, 2025 • 29
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge Paper • 2411.19799 • Published Nov 29, 2024 • 17