web-crawler-mcp-server

A MCP server that provides a web crawling and content extraction tool for AI assistants

GitHubスター

0

ユーザー評価

未評価

お気に入り

0

閲覧数

6

フォーク

0

イシュー

0

README
Web Crawler MCP Server

A Model Context Protocol (MCP) server that provides a web crawling and content extraction tool for AI assistants such as Claude Desktop, Cursor, and other MCP-compatible clients.

Features
  • Extracts and cleans main text content from any public web page.
  • Uses Puppeteer with stealth plugin to bypass anti-bot protections.
  • Returns readable, whitespace-normalized text for LLM consumption.
  • Easy integration with Claude Desktop and other MCP clients.
Prerequisites
  • Node.js (v16 or higher)
  • MCP-compatible client (e.g., Claude Desktop, Cursor)
  • (Optional) Puppeteer dependencies for some Linux environments
Installation
  1. Install dependencies:
    npm install
    
  2. Build the server:
    npm run build
    
Usage

You can run the server directly:

node build/index.js

Or configure it as an MCP server in your client (e.g., Claude Desktop):

{
  "mcpServers": {
    "web-crawler-mcp": {
      "command": "node",
      "args": ["<absolute-path-to>/server/web_crawler/build/index.js"]
    }
  }
}
Available Tool

web-crawler

  • Description: Extracts and returns the cleaned text content from a specified URL.
  • Input:
    • url (string, required): The URL to extract content from.
Example
{
  "tool_name": "web-crawler",
  "arguments": {
    "url": "https://openai.com/news"
  }
}
Development
  • npm run build — Compile TypeScript to JavaScript.
  • npm run watch — Watch and rebuild on changes.
  • npm run inspector — Launch MCP Inspector for debugging.
Notes
  • The server launches a real browser instance (headless: false) for best compatibility.
  • Output is plain text, suitable for LLM input.
  • For advanced parsing, modify the Cheerio logic in src/index.ts.
License

MIT