VAKRA Benchmark Reveals AI Agents Still Fail Multi-Step Enterprise Tasks

Drafted with AI; edited and reviewed by a human.

TL;DR

  • IBM Research and Hugging Face have launched VAKRA, a new, executable benchmark designed to test AI agents.
  • It evaluates an agent's compositional reasoning and tool use in complex enterprise scenarios with real APIs and documents.
  • The benchmark features over 8,000 locally hosted APIs across 62 domains and requires agents to carry out reasoning chains of 3 to 7 steps.
  • Early results indicate that current AI models struggle significantly with these multi-step workflows, pointing to a gap between single-step competence and end-to-end reliability.

The core value of VAKRA is realism. Instead of isolated tasks, the benchmark simulates enterprise-style environments where agents must combine API calls, retrieval, and planning over multiple steps. That gives teams a clearer signal about whether an agent can actually handle production workflows rather than single-turn demos.

A key design point is executable evaluation. Agents interact with a large tool surface and must keep state across steps to reach correct outcomes, especially in scenarios like business-intelligence API chaining. This exposes a common failure mode: models can perform well on local steps but still fail when the workflow requires consistent decisions across a longer chain.
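To make the idea of executable evaluation concrete, here is a minimal Python sketch. It is not VAKRA's harness or API: the three tools, the sample order data, the `$prev` placeholder convention, and the `run_chain` and `evaluate` helpers are all hypothetical, invented for illustration. The point is only that a chain is scored on the outcome it produces when actually run, so an early mistake carries through every later step.

```python
from typing import Any, Callable

# Hypothetical local tools standing in for benchmark APIs (not VAKRA's real tool set).
def list_orders(region: str) -> list[dict]:
    orders = [
        {"id": 1, "region": "EU", "amount": 120.0},
        {"id": 2, "region": "EU", "amount": 80.0},
        {"id": 3, "region": "US", "amount": 300.0},
    ]
    return [o for o in orders if o["region"] == region]

def sum_amounts(orders: list[dict]) -> float:
    return sum(o["amount"] for o in orders)

def to_currency(amount: float, rate: float) -> float:
    return round(amount * rate, 2)

TOOLS: dict[str, Callable[..., Any]] = {
    "list_orders": list_orders,
    "sum_amounts": sum_amounts,
    "to_currency": to_currency,
}

def run_chain(steps: list[dict]) -> Any:
    """Execute tool calls in order; an argument equal to "$prev" is
    replaced by the previous step's result (assumed convention)."""
    prev = None
    for step in steps:
        args = {k: (prev if v == "$prev" else v) for k, v in step["args"].items()}
        prev = TOOLS[step["tool"]](**args)
    return prev

def evaluate(steps: list[dict], expected: Any) -> bool:
    """Score a chain by its executed outcome, not by how plausible each step looks."""
    try:
        return run_chain(steps) == expected
    except (KeyError, TypeError):
        # Unknown tool name or malformed arguments count as a failed task.
        return False

# Task: total EU order value converted at a rate of 1.1.
good_chain = [
    {"tool": "list_orders", "args": {"region": "EU"}},
    {"tool": "sum_amounts", "args": {"orders": "$prev"}},
    {"tool": "to_currency", "args": {"amount": "$prev", "rate": 1.1}},
]
# Same tools in the same order, but the first call uses the wrong region.
bad_chain = [
    {"tool": "list_orders", "args": {"region": "US"}},
    {"tool": "sum_amounts", "args": {"orders": "$prev"}},
    {"tool": "to_currency", "args": {"amount": "$prev", "rate": 1.1}},
]

print(evaluate(good_chain, 220.0))
print(evaluate(bad_chain, 220.0))
```

In this toy setup the second chain selects every tool correctly and still fails the task, because the wrong argument in step one propagates through the remaining calls.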

Early results from the VAKRA benchmark analysis suggest current agents still struggle with multi-step workflow reliability. For builders, this turns VAKRA into a practical baseline for measuring progress in reasoning quality, tool selection, and end-to-end task completion over time.
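As a rough illustration of why step-level and end-to-end measurements can diverge, the sketch below computes both from a handful of made-up traces. The trace format and both metric functions are assumptions for this example, not VAKRA's scoring scheme, and the numbers are invented rather than benchmark results.

```python
# Each trace is (per-step tool choices correct?, final answer correct?).
# The data below is invented purely for illustration.
Trace = tuple[list[bool], bool]

def tool_selection_accuracy(traces: list[Trace]) -> float:
    """Fraction of individual steps where the expected tool was chosen."""
    total = sum(len(steps) for steps, _ in traces)
    correct = sum(sum(steps) for steps, _ in traces)
    return correct / total if total else 0.0

def task_completion_rate(traces: list[Trace]) -> float:
    """Fraction of tasks whose final outcome was correct."""
    if not traces:
        return 0.0
    return sum(final for _, final in traces) / len(traces)

traces: list[Trace] = [
    ([True, True, True, True, True], True),     # every step right, task solved
    ([True, False, True, True, True], False),   # one slip in step 2
    ([True, True, True, False, True], False),   # one slip in step 4
    ([True, True, True, True, False], False),   # one slip in the last step
]

print(tool_selection_accuracy(traces))  # 0.85
print(task_completion_rate(traces))     # 0.25
```

With these invented traces, 17 of 20 steps are correct, yet only 1 of 4 tasks completes, which is the kind of gap that tracking both numbers over time would surface.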

Summary

  • VAKRA, a new benchmark from IBM Research and Hugging Face, evaluates AI agents' compositional reasoning and multi-step tool use in enterprise contexts.
  • It features an executable environment with over 8,000 locally hosted APIs and tasks requiring 3-7 steps of complex reasoning.
  • The benchmark reveals that current AI models struggle significantly with these intricate, real-world multi-step workflows.
  • VAKRA gives developers a way to test and improve agents and to track their reliability in complex business applications over time.

Source: Official source
