Mirage of Synthesis: DREAM's Agentic Framework Catches What Static Benchmarks Miss | February 2026 | Safety
Even GPT-5 Fails at Discovery: OdysseyArena Exposes the Inductive Bottleneck in LLM Agents | February 2026 | AI Agents
DeepResearchEval: Benchmark Shows Gemini Leads Quality, Manus Wins Factual Accuracy | January 2026 | Safety
VLM Hallucinations Exposed: VIB-Probe Pinpoints and Suppresses Faulty Attention Heads | January 2026 | Safety