web-crawler-mcp-server

Author: JonathanHsuHH

An MCP server that provides a web crawling and content extraction tool for AI assistants

Web Crawler MCP Server

A Model Context Protocol (MCP) server that provides a web crawling and content extraction tool for AI assistants such as Claude Desktop, Cursor, and other MCP-compatible clients.

Features

Extracts and cleans main text content from any public web page.
Uses Puppeteer with the stealth plugin to get past common anti-bot protections.
Returns readable, whitespace-normalized text for LLM consumption.
Easy integration with Claude Desktop and other MCP clients.
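
The whitespace normalization mentioned above can be sketched roughly as follows. This is an illustrative guess at what "whitespace-normalized" could mean, not the code in src/index.ts; the function name `normalizeWhitespace` and the exact rules (collapse runs of spaces, keep at most one blank line) are assumptions.

```typescript
// Hypothetical helper: not taken from the project source.
// Collapses noisy whitespace so extracted page text is compact for an LLM.
function normalizeWhitespace(raw: string): string {
  return raw
    .replace(/\u00a0/g, " ")        // non-breaking spaces become plain spaces
    .replace(/[ \t\r\f\v]+/g, " ")  // collapse runs of horizontal whitespace
    .replace(/ *\n */g, "\n")       // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n")     // keep at most one blank line in a row
    .trim();                        // strip leading and trailing whitespace
}

const sample = "  Hello   world \n\n\n\n Next\tline  ";
console.log(JSON.stringify(normalizeWhitespace(sample)));
```

Keeping a single blank line between blocks preserves paragraph boundaries, which tends to help a model follow the structure of the page; a stricter variant could collapse everything to single spaces.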

Prerequisites

Node.js (v16 or higher)
MCP-compatible client (e.g., Claude Desktop, Cursor)
(Optional) Puppeteer dependencies for some Linux environments

Installation

Install dependencies:
```
npm install
```
Build the server:
```
npm run build
```

Usage

You can run the server directly:

```
node build/index.js
```

Or configure it as an MCP server in your client (e.g., Claude Desktop):

```
{
  "mcpServers": {
    "web-crawler-mcp": {
      "command": "node",
      "args": ["<absolute-path-to>/server/web_crawler/build/index.js"]
    }
  }
}
```

Available Tool

web-crawler

Description: Extracts and returns the cleaned text content from a specified URL.
Input:
- url (string, required): The URL to extract content from.
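
A minimal sketch of how the `url` input described above might be validated before crawling. The function name `parseWebCrawlerArgs`, the error messages, and the restriction to `http:` and `https:` are assumptions made for illustration; the actual server may validate differently.

```typescript
// Hypothetical validation helper: not taken from the project source.
interface WebCrawlerArgs {
  url: string;
}

function parseWebCrawlerArgs(input: unknown): WebCrawlerArgs {
  if (typeof input !== "object" || input === null) {
    throw new Error("arguments must be an object");
  }
  const url = (input as Record<string, unknown>).url;
  if (typeof url !== "string" || url.trim() === "") {
    throw new Error("url is required and must be a non-empty string");
  }
  let parsed: URL;
  try {
    parsed = new URL(url); // throws on relative or malformed URLs
  } catch {
    throw new Error(`url is not a valid absolute URL: ${url}`);
  }
  // Assumption: only web pages are crawled, so other schemes are rejected.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`unsupported protocol: ${parsed.protocol}`);
  }
  return { url: parsed.href };
}

console.log(parseWebCrawlerArgs({ url: "https://openai.com/news" }));
```

Rejecting bad input before launching a browser keeps error messages short and avoids starting a Puppeteer session for a request that cannot succeed.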

Example

```
{
  "tool_name": "web-crawler",
  "arguments": {
    "url": "https://openai.com/news"
  }
}
```
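
The example above names the tool and its arguments. On the wire, MCP clients send tool invocations as JSON-RPC 2.0 `tools/call` requests, so the same call would look roughly like the sketch below. The helper `buildWebCrawlerCall` and the request `id` value are hypothetical; most clients build this message for you.

```typescript
// Shape of an MCP tools/call request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: string;
    arguments: Record<string, unknown>;
  };
}

// Hypothetical helper for illustration only.
function buildWebCrawlerCall(id: number, url: string): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "web-crawler",
      arguments: { url },
    },
  };
}

console.log(JSON.stringify(buildWebCrawlerCall(1, "https://openai.com/news"), null, 2));
```

This is mainly useful when debugging with a raw stdio connection or reading traffic in MCP Inspector, where the full JSON-RPC envelope is visible.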

Development

npm run build — Compile TypeScript to JavaScript.
npm run watch — Watch and rebuild on changes.
npm run inspector — Launch MCP Inspector for debugging.

Notes

The server launches a real, visible browser instance (headless: false) for best compatibility, so it must run in an environment that has a display.
Output is plain text, suitable for LLM input.
For advanced parsing, modify the Cheerio logic in src/index.ts.