Tao Squared Bench
Tao Squared Bench (τ²-Bench) is an open-source benchmark from Sierra AI for evaluating conversational agents in dual-control environments, where the agent coordinates with a simulated user to modify a shared environment state. It focuses on multi-turn customer service scenarios in which the agent must reason, act, and collaborate with the user to reach a shared objective. The benchmark extends the original τ-bench with additional domains, such as telecom troubleshooting, and with collaborative tasks that reflect real-world AI agent roles. Implemented in Python and compatible with Python 3.10+, it provides domain-specific policies, task data, and API documentation to support reproducible evaluations; the API documentation can be browsed locally through a built-in server. The framework supports running specific tasks by ID and includes leaderboards that track the performance of models such as GPT-4o and Claude 3.5 Sonnet across domains such as retail, airline, and telecom. The project is actively maintained, with recent releases and commits, which makes it a practical resource for AI researchers and developers working on conversational agent evaluation in customer service contexts.
Tao Squared Bench is an open-source benchmark for evaluating conversational AI agents in multi-turn, dual-control customer service scenarios across multiple domains.
Evaluating Conversational AI Agents
Researchers and developers can use Tao Squared Bench to test how well AI agents perform in multi-turn customer service interactions requiring coordination with users.
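To make the idea of coordination in a dual-control setting concrete, here is a minimal toy sketch in Python. It is not τ²-Bench code and does not use its API: the SharedState class, the agent_turn and user_turn functions, and the telecom-style fields are all hypothetical names invented for illustration. The sketch only shows the general pattern in which the agent and a simulated user each control a different part of one shared state, so the task succeeds only when both act.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Toy environment state that both parties can change (hypothetical)."""
    airplane_mode: bool = True    # only the simulated user can toggle this
    line_suspended: bool = True   # only the agent can change this
    log: list[str] = field(default_factory=list)

    def service_works(self) -> bool:
        # The shared objective: both halves of the state must be fixed.
        return not self.airplane_mode and not self.line_suspended


def agent_turn(state: SharedState) -> str:
    """Agent fixes what it controls, then asks the user for the rest."""
    if state.line_suspended:
        state.line_suspended = False
        state.log.append("agent: resumed line")
    if state.airplane_mode:
        return "Please turn off airplane mode."
    return "Your service should be working now."


def user_turn(state: SharedState, request: str) -> None:
    """Simulated user follows the agent's instruction on their own device."""
    if "airplane mode" in request and state.airplane_mode:
        state.airplane_mode = False
        state.log.append("user: disabled airplane mode")


def run_episode(max_turns: int = 4) -> SharedState:
    """Alternate agent and user turns until the shared objective is met."""
    state = SharedState()
    for _ in range(max_turns):
        request = agent_turn(state)
        if state.service_works():
            break
        user_turn(state, request)
    return state


if __name__ == "__main__":
    final = run_episode()
    print(final.service_works(), final.log)
```

In this toy episode neither party can finish the task alone, which is the property a dual-control benchmark is meant to exercise.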
Run git clone https://github.com/sierra-research/tau2-bench && cd tau2-bench to download the source code.
Run tau2 env and visit http://127.0.0.1:8004/redoc to access the API documentation.
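Before going further, it can help to confirm the two prerequisites this page mentions: a Python 3.10+ interpreter and a documentation server answering at the local address. The following is a small sketch that uses only the Python standard library. The helper names python_is_supported and docs_are_reachable are hypothetical and are not part of the tau2 command-line tool; the URL is the one given above.

```python
import sys
import urllib.error
import urllib.request

DOCS_URL = "http://127.0.0.1:8004/redoc"  # local docs address from the text
MIN_PYTHON = (3, 10)                      # minimum version stated on this page


def python_is_supported(version=sys.version_info) -> bool:
    """Return True if the (major, minor) version is at least 3.10."""
    return tuple(version[:2]) >= MIN_PYTHON


def docs_are_reachable(url: str = DOCS_URL, timeout: float = 2.0) -> bool:
    """Return True if the docs page answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: server not usable.
        return False


if __name__ == "__main__":
    print("Python 3.10+:", python_is_supported())
    print("Docs server reachable:", docs_are_reachable())
```

If the second check reports False, the documentation server is probably not running yet.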