MCPMark
MCP Servers are shaping the future of software. MCPMark is a comprehensive, stress-testing benchmark designed to evaluate model and agent capabilities in real-world MCP use.
An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).
MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.
What you can do with MCPMark
- Evaluate real tool usage across multiple MCP services: Notion, GitHub, Filesystem, Postgres, Playwright.
- Use ready-to-run tasks covering practical workflows, each with strict automated verification.
- Reliable and reproducible: isolated environments that do not pollute your accounts/data; failed tasks auto-retry and resume.
- Unified metrics and aggregation: single/multi-run (pass@k, avg@k, etc.) with automated results aggregation.
- Flexible deployment: local or Docker; fully validated on macOS and Linux.
Quickstart (5 minutes)
1) Clone the repository
git clone https://github.com/eval-sys/mcpmark.git
cd mcpmark
2) Set environment variables (create .mcp_env at the repo root)
Only set what you need. Add service credentials when running tasks for that service.
# Example: OpenAI
OPENAI_BASE_URL="https://api.openai.com/v1"
OPENAI_API_KEY="sk-..."
# Optional: Notion (only for Notion tasks)
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
PLAYWRIGHT_BROWSER="chromium" # chromium | firefox
PLAYWRIGHT_HEADLESS="True"
# Optional: GitHub (only for GitHub tasks)
GITHUB_TOKENS="token1,token2" # token pooling for rate limits
GITHUB_EVAL_ORG="your-eval-org"
# Optional: Postgres (only for Postgres tasks)
POSTGRES_HOST="localhost"
POSTGRES_PORT="5432"
POSTGRES_USERNAME="postgres"
POSTGRES_PASSWORD="password"
See docs/introduction.md and the service guides below for more details.
3) Install and run a minimal example
Local (Recommended)
pip install -e .
# If you'll use browser-based tasks, install Playwright browsers first
playwright install
Docker
./build-docker.sh
Run a filesystem task (no external accounts required):
# --k 1 runs each task once for a quick start; swap gpt-5 for any model you have configured
python -m pipeline \
--mcp filesystem \
--k 1 \
--models gpt-5 \
--tasks file_property/size_classification
Results are saved to ./results/{exp_name}/{model}__{mcp}/run-*/... (e.g., ./results/test-run/gpt-5__filesystem/run-1/...).
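To sanity-check a run, you can simply list what was written under ./results/; this is plain Python and uses no MCPMark APIs:
# print every JSON/CSV artifact produced so far under ./results/
from pathlib import Path

for path in sorted(Path("results").rglob("*")):
    if path.suffix in {".json", ".csv"}:
        print(path)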
Run your evaluations
Single run (k=1)
# Run ALL tasks for a service
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 1
# Run a task group
python -m pipeline --exp-name exp --mcp notion --tasks online_resume --models MODEL --k 1
# Run a specific task
python -m pipeline --exp-name exp --mcp notion --tasks online_resume/daily_itinerary_overview --models MODEL --k 1
# Evaluate multiple models
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL1,MODEL2,MODEL3 --k 1
Multiple runs (k>1) for pass@k
# Run k=4 to compute stability metrics (requires --exp-name to aggregate final results)
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 4
# Aggregate results (pass@1 / pass@k / pass^k / avg@k)
python -m src.aggregators.aggregate_results --exp-name exp
Run with Docker
# Run all tasks for a service
./run-task.sh --mcp notion --models MODEL --exp-name exp --tasks all
# Cross-service benchmark
./run-benchmark.sh --models MODEL --exp-name exp --docker
Please visit docs/introduction.md for the available choices of MODEL.
Tip: MCPMark supports auto-resume. When re-running, only unfinished tasks will execute. Failures matching our retryable patterns (see RETRYABLE_PATTERNS) are retried automatically. Models may emit different error strings—if you encounter a new resumable error, please open a PR or issue.
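For illustration only, a retryable-pattern list of this kind is usually a set of regular expressions matched against the error message. The names and patterns below are assumptions for the sketch, not MCPMark's actual RETRYABLE_PATTERNS:
import re

# Hypothetical illustration only -- the real patterns live in the MCPMark source
# under RETRYABLE_PATTERNS and may differ from these.
RETRYABLE_PATTERNS = [
    r"rate.?limit",                  # provider rate limiting
    r"timed? ?out",                  # network / gateway timeouts
    r"connection (reset|refused)",   # transient connection failures
    r"\b(429|502|503)\b",            # transient HTTP status codes
]

def is_retryable(error_message: str) -> bool:
    # Return True if the error text matches any retryable pattern.
    return any(re.search(p, error_message, re.IGNORECASE) for p in RETRYABLE_PATTERNS)

print(is_retryable("openai.RateLimitError: rate limit exceeded"))  # True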
Service setup and authentication
Service | Setup summary | Docs |
---|---|---|
Notion | Environment isolation (Source Hub / Eval Hub), integration creation and grants, browser login verification. | Guide |
GitHub | Multi-account token pooling recommended; import pre-exported repo state if needed. | Guide |
Postgres | Start via Docker and import sample databases. | Setup |
Playwright | Install browsers before first run; defaults to chromium. | Setup |
Filesystem | Zero configuration, run directly. | Config |
You can also follow Quickstart for the shortest end-to-end path.
Results and metrics
- Results are organized under ./results/{exp_name}/{model}__{mcp}/run-*/ (JSON + CSV per task).
- Generate a summary with:
# Basic usage
python -m src.aggregators.aggregate_results --exp-name exp
# For k-run experiments with single-run models
python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
- Only models with complete results across all tasks and runs are included in the final summary.
- Includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1.
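For reference, the multi-run metrics can be read as the standard estimators sketched below. This is the commonly used definition of these quantities, not MCPMark's aggregator code; the exact formulas used by src.aggregators.aggregate_results may differ.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # pass@k: probability that at least one of k runs drawn from n recorded runs (c passing) succeeds
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_power_k(n: int, c: int, k: int) -> float:
    # pass^k: probability that all k drawn runs succeed (a stability measure)
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

def avg_at_k(n: int, c: int) -> float:
    # avg@k: mean pass rate over the recorded runs
    return c / n

# Example: a task that passed 3 out of 4 runs
print(pass_at_k(4, 3, 2), pass_power_k(4, 3, 2), avg_at_k(4, 3))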
Models and Tasks
- Model support: MCPMark calls models via LiteLLM (see the LiteLLM Doc). For Anthropic (Claude) extended thinking mode (enabled via --reasoning-effort), we use Anthropic's native API.
- See docs/introduction.md for details and configuration of the models supported in MCPMark.
- To add a new model, edit src/model_config.py. Before adding, check the models/providers supported by LiteLLM (see the LiteLLM Doc).
- Task design principles are described in docs/datasets/task.md. Each task ships with an automated verify.py for objective, reproducible evaluation; see docs/task.md for details and the sketch below.
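As a rough, purely hypothetical illustration of the shape such a verifier can take (the real interface and conventions are the ones described in docs/task.md):
# verify.py -- hypothetical sketch for a filesystem-style task:
# exit with status 0 when the expected state is present, non-zero otherwise.
import sys
from pathlib import Path

def main() -> int:
    expected = Path("output") / "size_report.txt"  # hypothetical artifact name
    if not expected.exists():
        print("FAIL: expected output file is missing")
        return 1
    if "large" not in expected.read_text():
        print("FAIL: expected classification not found")
        return 1
    print("PASS")
    return 0

if __name__ == "__main__":
    sys.exit(main())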
Contributing
Contributions are welcome:
- Add a new task under tasks/<category_id>/<task_id>/ with meta.json, description.md, and verify.py (see the layout sketch below).
- Ensure local checks pass and open a PR.
- See docs/contributing/make-contribution.md.
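On disk, a new task then looks roughly like this (the category and task names below are placeholders):
tasks/
  my_category/        # <category_id> (placeholder)
    my_task/          # <task_id> (placeholder)
      meta.json       # task metadata
      description.md  # task instructions
      verify.py       # automated verification script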
Citation
If you find our work useful for your research, please consider citing:
@misc{mcpmark_2025,
title = {MCPMark: Stress-Testing Comprehensive MCP Use},
author = {The MCPMark Team},
howpublished = {\url{https://github.com/eval-sys/mcpmark}},
year = {2025}
}
License
This project is licensed under the Apache License 2.0; see LICENSE.