
VAKRA Benchmark Reveals AI Agents Still Fail Multi-Step Enterprise Tasks

Drafted with AI; edited and reviewed by a human.


TL;DR

  • IBM Research and Hugging Face have launched VAKRA, a new, executable benchmark designed to test AI agents.
  • It evaluates an agent's compositional reasoning and tool use in complex enterprise scenarios with real APIs and documents.
  • The benchmark features over 8,000 locally hosted APIs across 62 domains, requiring agents to perform 3-7 step reasoning chains.
  • Early results indicate that current AI models significantly struggle with these intricate, multi-step workflows, highlighting a critical gap in capabilities.

The core value of VAKRA is realism. Instead of isolated tasks, the benchmark simulates enterprise-style environments where agents must combine API calls, retrieval, and planning over multiple steps. That gives teams a clearer signal about whether an agent can actually handle production workflows rather than single-turn demos.

A key design point is executable evaluation. Agents interact with a large tool surface and must keep state across steps to reach correct outcomes, especially in scenarios like business-intelligence API chaining. This exposes a common failure mode: models can perform well on local steps but still fail when the workflow requires consistent decisions across a longer chain.
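The state-keeping requirement can be illustrated with a minimal sketch. Everything in it is hypothetical: the three tool functions, their data, and the `run_chain` helper are illustrative stand-ins, not VAKRA's actual APIs or evaluation harness. It shows a three-step chain in which each call consumes an earlier result, so one wrong choice early on produces a wrong final answer even though every individual call succeeds.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for locally hosted APIs (not part of VAKRA).
def find_customer(name: str) -> int:
    return {"Acme": 7, "Globex": 9}[name]

def list_orders(customer_id: int) -> List[float]:
    return {7: [120.0, 80.0], 9: [50.0]}[customer_id]

def sum_amounts(amounts: List[float]) -> float:
    return sum(amounts)

TOOLS: Dict[str, Callable[[Any], Any]] = {
    "find_customer": find_customer,
    "list_orders": list_orders,
    "sum_amounts": sum_amounts,
}

# A step is (tool name, argument, key to store the result under).
# An argument starting with "$" refers to a value saved by an earlier step.
Step = Tuple[str, Any, str]

def run_chain(steps: List[Step], tools: Dict[str, Callable[[Any], Any]]) -> Dict[str, Any]:
    """Execute steps in order, carrying results forward in a state dict."""
    state: Dict[str, Any] = {}
    for tool_name, arg, save_as in steps:
        if isinstance(arg, str) and arg.startswith("$"):
            arg = state[arg[1:]]  # resolve a reference to an earlier output
        state[save_as] = tools[tool_name](arg)
    return state

correct_chain: List[Step] = [
    ("find_customer", "Acme", "cid"),
    ("list_orders", "$cid", "orders"),
    ("sum_amounts", "$orders", "answer"),
]
# Same tools in the same order, but the first step picks the wrong entity.
drifted_chain: List[Step] = [("find_customer", "Globex", "cid")] + correct_chain[1:]

expected = 200.0
print(run_chain(correct_chain, TOOLS)["answer"] == expected)  # True
print(run_chain(drifted_chain, TOOLS)["answer"] == expected)  # False
```

Checking only whether each call returned without error would pass both chains; comparing the final answer against the expected outcome is what separates them.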

Early VAKRA results suggest that current agents are still unreliable on multi-step workflows. For builders, that makes VAKRA a practical baseline for tracking progress in reasoning quality, tool selection, and end-to-end task completion over time.
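A team tracking its own agent runs might separate step-level correctness from end-to-end completion, since the two can diverge sharply. The sketch below is one possible scoring scheme, not VAKRA's official metric; the `score_runs` function and the sample data are hypothetical. The closing loop assumes, purely for illustration, that each step succeeds independently with the same probability.

```python
from typing import Dict, List

def score_runs(runs: List[List[bool]]) -> Dict[str, float]:
    """Each run is a list of per-step correctness flags for one task."""
    total_steps = sum(len(run) for run in runs)
    correct_steps = sum(sum(run) for run in runs)
    completed = sum(1 for run in runs if all(run))
    return {
        "step_accuracy": correct_steps / total_steps,
        "task_completion": completed / len(runs),
    }

# Hypothetical results for three tasks of 3, 4, and 5 steps.
runs = [
    [True, True, True],
    [True, True, False, True],
    [True, False, True, True, True],
]
scores = score_runs(runs)
print(scores)  # step accuracy is 10/12, task completion is 1/3

# Simplifying assumption: each step succeeds independently with p = 0.9.
# A chain of n steps then completes with probability p ** n.
for n in (3, 7):
    print(n, round(0.9 ** n, 3))  # 3 -> 0.729, 7 -> 0.478
```

In this toy data, roughly 83% of steps are correct but only one task in three completes. Under the independence assumption, a 90% per-step success rate falls below 50% end-to-end success at seven steps.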

Summary

  • VAKRA, a new benchmark from IBM Research and Hugging Face, evaluates AI agents' compositional reasoning and multi-step tool use in enterprise contexts.
  • It features an executable environment with over 8,000 locally hosted APIs and tasks requiring 3-7 steps of complex reasoning.
  • The benchmark reveals that current AI models struggle significantly with these intricate, real-world multi-step workflows.
  • VAKRA offers a vital resource for developers to test, improve, and build more robust and reliable AI agents for complex business applications.

