veAgentBench is a professional evaluation framework for AI Agent ecosystems, featuring built-in evaluation tools and datasets. It provides core capabilities including LLM judge scoring, multi-dimensional metric analysis, and tool call matching, along with a complete end-to-end analytical reporting system to help build trustworthy agent evaluation systems.
📖 Chinese Version | View the Chinese documentation
[2025/11/12] 🔥 veAgentBench Officially Open Sources Tools + Evaluation Datasets - Enterprise-grade AI Agent Evaluation Solution
- Multi-dimensional Evaluation System: Integrates comprehensive metrics including LLM judge scoring, tool matching accuracy, and response quality (a rough sketch of the tool-matching idea follows this list)
- Deep Metric Analysis: Provides fine-grained performance breakdown and intermediate metric visibility
- Visualized Reporting: Automatically generates professional analysis reports with multi-format output support
- High-performance Architecture: Supports concurrent evaluation with optimized assessment efficiency
- Flexible Extension: Modular design supporting custom evaluation metrics and dimensions
- Multiple Evaluation Targets: local development objects, HTTP+SSE, and A2A
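To make the tool matching idea concrete, here is a rough, hypothetical sketch of how such a metric can be computed. It is not veAgentBench's actual implementation; both the case structure and the exact-match rule are assumptions used only for illustration:

```python
# Illustrative sketch only: veAgentBench's real metric classes may differ.
from typing import Any

def tool_call_match_score(expected: list[dict[str, Any]],
                          actual: list[dict[str, Any]]) -> float:
    """Fraction of expected tool calls that the agent actually made.

    Each call is assumed to look like {"name": ..., "arguments": {...}};
    both the shape and the exact-match rule are illustrative assumptions.
    """
    if not expected:
        return 1.0
    matched = 0
    remaining = list(actual)
    for exp in expected:
        for i, act in enumerate(remaining):
            if act.get("name") == exp.get("name") and act.get("arguments") == exp.get("arguments"):
                matched += 1
                del remaining[i]  # each actual call can satisfy at most one expectation
                break
    return matched / len(expected)

print(tool_call_match_score(
    expected=[{"name": "search_law", "arguments": {"query": "labor contract"}}],
    actual=[{"name": "search_law", "arguments": {"query": "labor contract"}}],
))  # -> 1.0
```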
Comes with evaluation data for legal, education, financial analysis, and personal assistant domains, supporting one-click reference evaluation. For detailed dataset information, see: veAgentBench-data
- Python: 3.10+
- Environment Management: Virtual environments recommended
- Dependency Management: Supports mainstream tools like uv/pip
pip install git+https://github.com/volcengine/veAgentBench.git

veagentbench --help

# View available metrics
veagentbench info --metrics
# View available agents
veagentbench info --agents
# View configuration template types
veagentbench info --templates

# Generate an evaluation configuration template
veagentbench config generate --task-name my_test --output my_config.yaml

# Run the evaluation in parallel or sequential mode
veagentbench run --config my_config.yaml --parallel
veagentbench run --config my_config.yaml --sequential

Full Configuration Reference:

tasks:
  - name: legal_assistant  # Evaluation task name
    datasets:
      - name: bytedance-research/veAgentBench  # Dataset name
        description: Legal Aid Assistant  # Dataset description
        property:  # Dataset-related properties
          type: huggingface  # Dataset type, supports csv, huggingface
          huggingface_repo: bytedance-research/veAgentBench  # HuggingFace repository name
          config_name: legal_aid
          split:  # Splits, given as a list
            - test[:1]
          columns_name:
            input_column: "input"  # Input prompt column
            expected_column: "expected_column"  # Expected output column
            expected_tool_call_column: "expected_tool_call_column"  # Expected tool call column
            available_tools_column: "available_tools_column"  # Available tools list column
            id_column: "id_column"  # Case ID column
            multi_turn_input_column: "multi_turn_input_column"  # Multi-turn conversation input column
            case_name_column: "case_name_column"  # Case name column
            extra_fields: ["extra_column_1", "extra_column_2"]  # Extra columns to import
            extra_column_1: ""
            extra_column_2: ""
    metrics:  # Evaluation metrics
      - AnswerCorrectnessMetric
    judge_model:  # Judge model configuration
      model_name: "gpt-4"  # Model name
      base_url: "https://api.openai.com/v1"  # OpenAI-compatible base_url
      api_key: "your_api_key"  # API key (needs replacement)
    agent:  # Agent under test configuration
      type: AdkAgent  # Agent type under test: AdkAgent / LocalAdkAgent / A2AAgent
      property:
        agent_name: "financial_analysis_agent"  # Agent name
        end_point: "http://127.0.0.1:8000/invoke"  # Call endpoint
        api_key: "your_api_key"  # Agent API key (needs replacement)
        generation_kwargs:
          max_tokens: 20480
          extra_body:
            thinking:
              type: "disabled"
    max_concurrent: 5  # Concurrent calls to the agent under test
    measure_concurrent: 100  # Evaluation concurrency: 100 samples
cache_dir: "./cache" # Cache directory path datasets:
HuggingFace Dataset Configuration:

datasets:
  - name: financial_analysis  # Dataset name
    description: Financial Analysis Test Set
    property:
      type: huggingface  # Dataset type
      config_name: financial_analysis  # Subset name
      huggingface_repo: bytedance-research/veAgentBench
      split:
        - test[:1]  # Split; can be left blank. Specify a slice to run only a few cases
      columns_name:
        input_column: "input"  # Input column name
        expected_column: "expect_output"  # Expected response column name
expected_tool_call_column: "expected_tool_calls" # Expected tool call column name datasets:
CSV Dataset Configuration:

datasets:
  - name: legal  # Dataset name
    description: Legal Consultation Customer Service Evaluation Set  # Dataset description
    property:
      type: csv  # Dataset type
      csv_file: "dataset/test1.csv"  # Local dataset file
      columns_name:
        input_column: "input"  # Input column name
        expected_column: "expect_output"  # Expected response column name
expected_tool_call_column: "expected_tool_calls" # Expected tool call column name agent: # Agent under test configuration
Remote Agent (AdkAgent) Configuration:

agent:  # Agent under test configuration
  type: AdkAgent  # Agent type under test: AdkAgent / LocalAdkAgent / A2AAgent
  property:
    agent_name: "financial_analysis_agent"  # Agent name
    end_point: "http://127.0.0.1:8000/invoke"  # Call endpoint
api_key: "your_api_key" # Agent API key (needs replacement) agent:
Local Agent (LocalAdkAgent) Configuration:

agent:
  type: LocalAdkAgent
  property:
    agent_name: local_financial_agent
    agent_dir_path: "agents/legal"  # Local agent object directory

Offline evaluation is suitable for scenarios where evaluation data already exists, making it ideal for pre-launch effectiveness validation.
veAgentBench provides built-in evaluation datasets covering multiple professional domains:
1. Prepare Evaluation Configuration
Prepare the evaluation configuration file test_config.yaml; reference examples are shown below:
Financial Analysis Evaluation Configuration:
tasks:
  - name: financial_analysis_test
    datasets:
      - name: bytedance-research/veAgentBench  # HuggingFace dataset name
        description: Financial Analysis Test Set
        property:
          type: huggingface
          config_name: financial_analysis  # Subset name
          split: "test[:1]"  # Split; can be left blank. Specify a slice to run only a few cases
          input_column: "input"
          expected_output_column: "expect_output"
          expected_tool_call_column: "expected_tool_calls"
    metrics: ["MCPToolMetric"]
    judge_model:  # Judge model configuration
      model_name: "gpt-4"  # Model name
      base_url: "https://api.openai.com/v1"  # OpenAI-compatible base_url
      api_key: "your_api_key"  # API key (needs replacement)
    agent:  # Agent under test configuration
      type: AdkAgent  # Agent type under test: AdkAgent / LocalAdkAgent / A2AAgent
      property:
        agent_name: "financial_analysis_agent"  # Agent name
        end_point: "http://127.0.0.1:8000/invoke"  # Call endpoint
        api_key: "your_api_key"  # Agent API key (needs replacement)
    max_concurrent: 5  # Concurrent calls to the agent under test
    measure_concurrent: 100  # Evaluation concurrency: 100 samples
cache_dir: "./cache" # Cache directory pathLegal Aid Evaluation Configuration:
tasks:
  - name: legal_assistant
    datasets:
      - name: bytedance-research/veAgentBench  # HuggingFace dataset name
        description: Legal Aid Assistant
        property:
          type: huggingface
          config_name: legal_aid  # Subset name
          split: "test[:1]"  # Split; can be left blank. Specify a slice to run only a few cases
          input_column: "input"
          expected_output_column: "expect_output"
    metrics:
      - AnswerCorrectnessMetric
      - AnswerRelevancyMetric
      - ContextualPrecisionMetric
      - ContextualRecallMetric
      - FaithfulnessMetric
      - ContextualRelevancyMetric
    judge_model:  # Judge model configuration
      model_name: "gpt-4"  # Model name
      base_url: "https://api.openai.com/v1"  # OpenAI-compatible base_url
      api_key: "your_api_key"  # API key (needs replacement)
    agent:  # Agent under test configuration
      type: AdkAgent  # Agent type under test: AdkAgent / LocalAdkAgent / A2AAgent
      property:
        agent_name: "financial_analysis_agent"  # Agent name
        end_point: "http://127.0.0.1:8000/invoke"  # Call endpoint
        api_key: "your_api_key"  # Agent API key (needs replacement)
    max_concurrent: 5  # Concurrent calls to the agent under test
    measure_concurrent: 100  # Evaluation concurrency: 100 samples
cache_dir: "./cache" # Cache directory path2. Prepare Agent Under Test
Refer to the corresponding agent files at veAgentBench-agent, then develop the agent locally or deploy it to the Volcano agentkit platform for evaluation.
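For local development, the agent under test only needs to expose the invoke endpoint that the configuration points at (http://127.0.0.1:8000/invoke in the examples above). A minimal FastAPI stub is sketched below; the request and response fields are placeholders, and the real schema should follow the veAgentBench-agent examples:

```python
# Minimal local stub for an agent under test; the request/response fields
# are placeholders - follow the veAgentBench-agent examples for the real schema.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InvokeRequest(BaseModel):
    input: str  # assumed field name

@app.post("/invoke")
def invoke(req: InvokeRequest) -> dict:
    # Replace this echo with your agent's actual reasoning and tool calls.
    return {"output": f"stub answer for: {req.input}"}

# Run with: uvicorn my_agent:app --port 8000
```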
3. Execute Test Command
veagentbench run --config test_config.yaml --parallel

Users can also evaluate with their own datasets, flexibly adapting to various business scenarios:
1. Data Format Requirements
- CSV Format: Supports local CSV files containing input, expected output, expected tool calls, and other columns (see the sketch after this list)
- HuggingFace Format: Supports loading datasets from HuggingFace Hub
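As a concrete example of the CSV layout described above, the snippet below builds a tiny test set whose column names mirror the configuration example in step 2; encoding the expected tool calls as a JSON string is an assumption, not a documented requirement:

```python
# Build a tiny CSV test set whose columns match the custom dataset config below.
# Encoding expected tool calls as a JSON string is an assumption.
import json
import pandas as pd

rows = [
    {
        "question": "How many days of paid annual leave am I entitled to?",
        "expected_answer": "Explain the statutory annual leave rules that apply to the user.",
        "expected_tools": json.dumps(
            [{"name": "search_law", "arguments": {"query": "annual leave"}}]
        ),
    },
]
# Write to the path referenced by csv_file_path in your configuration.
pd.DataFrame(rows).to_csv("dataset.csv", index=False)
```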
2. Configure Custom Dataset
# CSV dataset configuration example; generally requires input_column and expected_output_column
datasets:
  - name: custom_testset
    property:
      type: csv  # Dataset type: csv/huggingface/trace
      csv_file_path: "path/to/your/dataset.csv"  # Data file path
      input_column: "question"  # Input column name
      expected_output_column: "expected_answer"  # Expected output column name
      expected_tool_call_column: "expected_tools"  # Expected tool call column name

3. Execute Test Command
veagentbench run --config test_config.yaml --parallel

Online evaluation functionality is under development and will support real-time Agent calls for dynamic assessment, suitable for online agent performance monitoring scenarios.
Upcoming Features:
- 🔌 Real-time Agent calling and evaluation
- 📊 Dynamic performance monitoring
- ⚡ Development debugging support
- 🔄 Continuous integration (CI) support
- 📈 Real-time metric display
- Expand Agent framework support (LangChain, AutoGPT, etc.)
- Add domain-specific evaluation metrics (finance, medical, legal, etc.)
- Optimize evaluation performance and concurrent processing capabilities
- Improve visualization analysis features
- Support distributed evaluation architecture
- Establish industry standard evaluation system
We welcome community developers to participate in veAgentBench development:
- 📋 Submit Issues for feedback and suggestions
- 🔧 Contribute code and documentation improvements
- 📊 Share use cases and best practices
- 💡 Propose new feature requirements
Open source under the Apache 2.0 license; see LICENSE.
veAgentBench - Professional, Trustworthy, and Efficient AI Agent Evaluation Framework
