SWE-chat: Coding Agent Interactions From Real Users in the Wild Paper • 2604.20779 • Published 13 days ago • 14 • 5
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors Paper • 2510.17516 • Published Oct 20, 2025 • 2
Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation Paper • 2509.08825 • Published Sep 10, 2025 • 3 • 3
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons Paper • 2503.05731 • Published Feb 19, 2025 • 3