MCPMark
MCP Servers are shaping the future of software. MCPMark is a comprehensive, stress-testing benchmark designed to evaluate model and agent capabilities in real-world MCP use.
An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).
MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.
What you can do with MCPMark
- Evaluate real tool usage across multiple MCP services: Notion, GitHub, Filesystem, Postgres, Playwright.
- Use ready-to-run tasks covering practical workflows, each with strict automated verification.
- Reliable and reproducible: isolated environments that do not pollute your accounts/data; failed tasks auto-retry and resume.
- Unified metrics and aggregation: single/multi-run (pass@k, avg@k, etc.) with automated results aggregation.
- Flexible deployment: local or Docker; fully validated on macOS and Linux.
Quickstart (5 minutes)
1) Clone the repository
git clone https://github.com/eval-sys/mcpmark.git
cd mcpmark
2) Set environment variables (create .mcp_env at the repo root)
Only set what you need. Add service credentials when running tasks for that service.
# Example: OpenAI
OPENAI_BASE_URL="https://api.openai.com/v1"
OPENAI_API_KEY="sk-..."
# Optional: Notion (only for Notion tasks)
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
PLAYWRIGHT_BROWSER="chromium" # chromium | firefox
PLAYWRIGHT_HEADLESS="True"
# Optional: GitHub (only for GitHub tasks)
GITHUB_TOKENS="token1,token2" # token pooling for rate limits
GITHUB_EVAL_ORG="your-eval-org"
# Optional: Postgres (only for Postgres tasks)
POSTGRES_HOST="localhost"
POSTGRES_PORT="5432"
POSTGRES_USERNAME="postgres"
POSTGRES_PASSWORD="password"
See docs/introduction.md and the service guides below for more details.
3) Install and run a minimal example
Local (Recommended)
pip install -e .
# If you'll use browser-based tasks, install Playwright browsers first
playwright install
Docker
./build-docker.sh
Run a filesystem task (no external accounts required):
# --k 1 runs each task once for a quick start; swap gpt-5 for any model you have configured
python -m pipeline \
--mcp filesystem \
--k 1 \
--models gpt-5 \
--tasks file_property/size_classification
Results are saved to ./results/{exp_name}/{model}__{mcp}/run-*/... (e.g., ./results/test-run/gpt-5__filesystem/run-1/...).
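To sanity-check a run, you can simply list what was written under ./results/; this is plain Python and uses no MCPMark APIs:
# print every JSON/CSV artifact produced so far under ./results/
from pathlib import Path

for path in sorted(Path("results").rglob("*")):
    if path.suffix in {".json", ".csv"}:
        print(path)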
Run your evaluations
Single run (k=1)
# Run ALL tasks for a service
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 1
# Run a task group
python -m pipeline --exp-name exp --mcp notion --tasks online_resume --models MODEL --k 1
# Run a specific task
python -m pipeline --exp-name exp --mcp notion --tasks online_resume/daily_itinerary_overview --models MODEL --k 1
# Evaluate multiple models
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL1,MODEL2,MODEL3 --k 1
Multiple runs (k>1) for pass@k
# Run k=4 to compute stability metrics (requires --exp-name to aggregate final results)
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 4
# Aggregate results (pass@1 / pass@k / pass^k / avg@k)
python -m src.aggregators.aggregate_results --exp-name exp
Run with Docker
# Run all tasks for a service
./run-task.sh --mcp notion --models MODEL --exp-name exp --tasks all
# Cross-service benchmark
./run-benchmark.sh --models MODEL --exp-name exp --docker
Please visit docs/introduction.md for the available choices of MODEL.
Tip: MCPMark supports auto-resume. When re-running, only unfinished tasks will execute. Failures matching our retryable patterns (see RETRYABLE_PATTERNS) are retried automatically. Models may emit different error strings—if you encounter a new resumable error, please open a PR or issue.
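For illustration only, a retryable-pattern list of this kind is usually a set of regular expressions matched against the error message. The names and patterns below are assumptions for the sketch, not MCPMark's actual RETRYABLE_PATTERNS:
import re

# Hypothetical illustration only -- the real patterns live in the MCPMark source
# under RETRYABLE_PATTERNS and may differ from these.
RETRYABLE_PATTERNS = [
    r"rate.?limit",                  # provider rate limiting
    r"timed? ?out",                  # network / gateway timeouts
    r"connection (reset|refused)",   # transient connection failures
    r"\b(429|502|503)\b",            # transient HTTP status codes
]

def is_retryable(error_message: str) -> bool:
    # Return True if the error text matches any retryable pattern.
    return any(re.search(p, error_message, re.IGNORECASE) for p in RETRYABLE_PATTERNS)

print(is_retryable("openai.RateLimitError: rate limit exceeded"))  # True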
Service setup and authentication
Service | Setup summary | Docs |
---|---|---|
Notion | Environment isolation (Source Hub / Eval Hub), integration creation and grants, browser login verification. | Guide |
GitHub | Multi-account token pooling recommended; import pre-exported repo state if needed. | Guide |
Postgres | Start via Docker and import sample databases. | Setup |
Playwright | Install browsers before first run; defaults to chromium. | Setup |
Filesystem | Zero configuration, run directly. | Config |
You can also follow Quickstart for the shortest end-to-end path.
Results and metrics
- Results are organized under ./results/{exp_name}/{model}__{mcp}/run-*/ (JSON + CSV per task).
- Generate a summary with:
# Basic usage
python -m src.aggregators.aggregate_results --exp-name exp
# For k-run experiments with single-run models
python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
- Only models with complete results across all tasks and runs are included in the final summary.
- Includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1.
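For reference, the multi-run metrics can be read as the standard estimators sketched below. This is the commonly used definition of these quantities, not MCPMark's aggregator code; the exact formulas used by src.aggregators.aggregate_results may differ.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # pass@k: probability that at least one of k runs drawn from n recorded runs (c passing) succeeds
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_power_k(n: int, c: int, k: int) -> float:
    # pass^k: probability that all k drawn runs succeed (a stability measure)
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

def avg_at_k(n: int, c: int) -> float:
    # avg@k: mean pass rate over the recorded runs
    return c / n

# Example: a task that passed 3 out of 4 runs
print(pass_at_k(4, 3, 2), pass_power_k(4, 3, 2), avg_at_k(4, 3))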
Models and Tasks
- Model support: MCPMark calls models via LiteLLM (see the LiteLLM Doc). For Anthropic (Claude) extended thinking mode (enabled via --reasoning-effort), we use Anthropic's native API.
- See docs/introduction.md for details and configuration of the models supported in MCPMark.
- To add a new model, edit src/model_config.py. Before adding, check the models/providers supported by LiteLLM (see the LiteLLM Doc).
- Task design principles are described in docs/datasets/task.md. Each task ships with an automated verify.py for objective, reproducible evaluation; see docs/task.md for details and the sketch below.
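As a rough, purely hypothetical illustration of the shape such a verifier can take (the real interface and conventions are the ones described in docs/task.md):
# verify.py -- hypothetical sketch for a filesystem-style task:
# exit with status 0 when the expected state is present, non-zero otherwise.
import sys
from pathlib import Path

def main() -> int:
    expected = Path("output") / "size_report.txt"  # hypothetical artifact name
    if not expected.exists():
        print("FAIL: expected output file is missing")
        return 1
    if "large" not in expected.read_text():
        print("FAIL: expected classification not found")
        return 1
    print("PASS")
    return 0

if __name__ == "__main__":
    sys.exit(main())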
Contributing
Contributions are welcome:
- Add a new task under tasks/<category_id>/<task_id>/ with meta.json, description.md, and verify.py (see the layout sketch below).
- Ensure local checks pass and open a PR.
- See docs/contributing/make-contribution.md.
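On disk, a new task then looks roughly like this (the category and task names below are placeholders):
tasks/
  my_category/        # <category_id> (placeholder)
    my_task/          # <task_id> (placeholder)
      meta.json       # task metadata
      description.md  # task instructions
      verify.py       # automated verification script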
Citation
If you find our work useful for your research, please consider citing:
@misc{mcpmark_2025,
title = {MCPMark: Stress-Testing Comprehensive MCP Use},
author = {The MCPMark Team},
howpublished = {\url{https://github.com/eval-sys/mcpmark}},
year = {2025}
}
License
This project is licensed under the Apache License 2.0; see LICENSE.