AI อะไรเนี่ย

VAKRA Benchmark Reveals AI Agents Still Fail Multi-Step Enterprise Tasks

TL;DR

  • IBM Research and Hugging Face have launched VAKRA, a new, executable benchmark designed to test AI agents.
  • It evaluates an agent's compositional reasoning and tool use in complex enterprise scenarios with real APIs and documents.
  • The benchmark features over 8,000 locally hosted APIs across 62 domains and requires agents to complete reasoning chains of 3 to 7 steps.
  • Early results indicate that current AI models struggle with these multi-step workflows, which points to a gap in end-to-end reliability rather than in single-step skill.

The core value of VAKRA is realism. Instead of isolated tasks, the benchmark simulates enterprise-style environments where agents must combine API calls, retrieval, and planning over multiple steps. That gives teams a clearer signal about whether an agent can actually handle production workflows rather than single-turn demos.

A key design point is executable evaluation. Agents interact with a large tool surface and must keep state across steps to reach correct outcomes, especially in scenarios like business-intelligence API chaining. This exposes a common failure mode: models can perform well on local steps but still fail when the workflow requires consistent decisions across a longer chain.
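To make the idea of executable evaluation concrete, here is a minimal sketch of how a chained tool call might be checked. It is not VAKRA's harness: the tool names (`get_customer_id`, `list_order_totals`, `sum_totals`), the data, and the helpers `run_chain` and `is_task_solved` are all hypothetical, and the chain is simplified so that each tool takes the previous tool's output as its only argument.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-ins for locally hosted APIs (not VAKRA's real tools or data).
TOOLS: Dict[str, Callable[[Any], Any]] = {
    "get_customer_id": lambda name: {"Acme": 17}.get(name),
    "list_order_totals": lambda customer_id: {17: [120.0, 80.5]}.get(customer_id, []),
    "sum_totals": lambda totals: round(sum(totals), 2),
}


def run_chain(tool_names: List[str], initial: Any) -> Optional[Any]:
    """Execute tools in order, feeding each output into the next call."""
    value = initial
    for name in tool_names:
        tool = TOOLS.get(name)
        if tool is None:
            return None  # the agent picked a tool that does not exist
        try:
            value = tool(value)
        except Exception:
            return None  # a call failed, e.g. it received the wrong kind of input
    return value


def is_task_solved(tool_names: List[str], initial: Any, expected: Any) -> bool:
    """Executable check: only the final result of the whole chain counts."""
    return run_chain(tool_names, initial) == expected


good = ["get_customer_id", "list_order_totals", "sum_totals"]
skipped = ["list_order_totals", "sum_totals"]  # valid tools, but one step is missing

print(is_task_solved(good, "Acme", 200.5))     # True
print(is_task_solved(skipped, "Acme", 200.5))  # False
```

In the second chain every individual call is a legitimate tool and runs without error, yet the task fails because the customer name was never resolved to an ID. That is the kind of failure a final-outcome check surfaces and a per-call check can miss.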

Early results from the VAKRA benchmark analysis suggest that current agents still struggle to complete multi-step workflows reliably. For builders, this makes VAKRA a practical baseline for tracking progress in reasoning quality, tool selection, and end-to-end task completion over time.
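As a rough illustration of why step-level quality and end-to-end completion are worth tracking separately, the sketch below scores a few made-up trajectories. The function names, the log format (one boolean per step), and all numbers are assumptions for illustration; they are not VAKRA's metrics or results. The compounding estimate also assumes steps succeed independently, which real agents may not satisfy.

```python
from typing import List


def step_accuracy(trajectories: List[List[bool]]) -> float:
    """Fraction of individual steps that were correct, pooled over all tasks."""
    total = sum(len(t) for t in trajectories)
    correct = sum(sum(t) for t in trajectories)
    return correct / total if total else 0.0


def task_completion_rate(trajectories: List[List[bool]]) -> float:
    """Fraction of tasks in which every step was correct."""
    if not trajectories:
        return 0.0
    return sum(all(t) for t in trajectories) / len(trajectories)


def chain_success_probability(per_step: float, steps: int) -> float:
    """Chance that all steps succeed, assuming independent steps."""
    return per_step ** steps


# Hypothetical per-step outcomes for three tasks (True = step was correct).
logs = [
    [True, True, True],
    [True, False, True],
    [True, True, True, True],
]

print(f"step accuracy:   {step_accuracy(logs):.2f}")         # 0.90
print(f"task completion: {task_completion_rate(logs):.2f}")  # 0.67
print(f"7-step chain at 90% per step: {chain_success_probability(0.9, 7):.2f}")  # 0.48
```

Under these assumptions, an agent that gets 90% of individual steps right would finish a 7-step chain less than half the time, which is why a benchmark built around 3 to 7 step tasks can look much harsher than single-call tests.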

Summary

  • VAKRA, a new benchmark from IBM Research and Hugging Face, evaluates AI agents' compositional reasoning and multi-step tool use in enterprise contexts.
  • It features an executable environment with over 8,000 locally hosted APIs and tasks that require reasoning chains of 3 to 7 steps.
  • Early results indicate that current AI models struggle with these multi-step workflows.
  • VAKRA gives developers a way to test agents and measure improvements in reliability for complex business applications.
