DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models • arXiv:2512.02556 • Published Dec 2025
Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights • arXiv:2512.01816 • Published Dec 2025
AInstein: Assessing the Feasibility of AI-Generated Approaches to Research Problems • arXiv:2510.05432 • Published Oct 6, 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models • arXiv:2504.10479 • Published Apr 14, 2025
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning • arXiv:2503.07459 • Published Mar 10, 2025
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks • arXiv:2503.06885 • Published Mar 10, 2025
MinorBench: A hand-built benchmark for content-based risks for children • arXiv:2503.10242 • Published Mar 13, 2025
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation • arXiv:2503.06680 • Published Mar 9, 2025
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning • arXiv:2503.10291 • Published Mar 13, 2025
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models • arXiv:2503.09573 • Published Mar 12, 2025
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering • arXiv:2503.06492 • Published Mar 9, 2025