PaperBench: Evaluating AI's Ability to Replicate AI Research Paper • 2504.01848 • Published Apr 2, 2025 • 36
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? Paper • 2502.12115 • Published Feb 17, 2025 • 46
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering Paper • 2410.07095 • Published Oct 9, 2024 • 8