NCBI-Database-MCP

NCBI-Database-MCP is a Python-based MCP server for querying NCBI databases and retrieving bioinformatics data. It exposes a small set of tools that simplify searching for and fetching that data, and is aimed at researchers doing data analysis.


MCP server for NCBI bioinformatics tools and disease-focused gene expression research

Enable AI assistants to discover gene expression datasets by disease/condition and access comprehensive NCBI databases through natural language. Perfect for researchers studying disease mechanisms and therapeutic targets.

Features
  • Disease-Focused GEO Search - Discover gene expression datasets by disease/condition and organism
  • Comprehensive Study Metadata - Get detailed methodology, platform, and sample information
  • Gene-to-Genomic Conversion - Convert gene names to genomic DNA sequences
  • Multi-Species Support - Human, mouse, and rat datasets
  • Research Methodology Details - RNA-Seq, microarray, ChIP-Seq, and other techniques
  • Direct Database Links - Easy access to full datasets and original studies

Quick Start
Installation
# Clone repository
git clone https://github.com/hpend2373/NCBI-Database-MCP.git
cd NCBI-Database-MCP

# Install dependencies
pip install -r requirements.txt
Basic Usage

Recommended: use the FastMCP server for best performance

# Start the FastMCP server (RECOMMENDED)
./run_fastmcp_gene_server.sh

# Alternative: Standard MCP server (slower startup)
python src/gene_to_genomic_server.py

Why FastMCP?

  • Faster startup - Instant server initialization
  • Easier debugging - Better error messages and logging
  • Built-in monitoring - Performance metrics included
  • Optimized for research - Designed specifically for bioinformatics workflows

Configuration

Add to your MCP client config:

{
  "mcpServers": {
    "ncbi-database": {
      "command": "python",
      "args": ["src/gene_to_genomic_server.py"],
      "cwd": "/path/to/NCBI-Database-MCP",
      "env": {
        "NCBI_API_KEY": "your_api_key_here"
      }
    }
  }
}

Alternative: Set global environment variable

export NCBI_API_KEY="your_api_key_here"

Then use simpler config:

{
  "mcpServers": {
    "ncbi-database": {
      "command": "python",
      "args": ["src/gene_to_genomic_server.py"],
      "cwd": "/path/to/NCBI-Database-MCP"
    }
  }
}
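
As a rough illustration, the config above can also be generated rather than edited by hand. The sketch below is not part of the repository: `build_client_config` is a hypothetical helper that resolves the clone path to an absolute `cwd` and only adds the `env` block when a key is supplied.

```python
import json
import os
from pathlib import Path


def build_client_config(repo_dir, api_key=None):
    """Return an MCP client config dict for the standard server.

    repo_dir is the local clone of NCBI-Database-MCP. api_key is optional
    because NCBI_API_KEY can also be exported globally, as described above.
    """
    server = {
        "command": "python",
        "args": ["src/gene_to_genomic_server.py"],
        # Absolute path so the client can start the server from anywhere.
        "cwd": str(Path(repo_dir).expanduser().resolve()),
    }
    if api_key:
        # Emit the "env" block only when a key is actually available.
        server["env"] = {"NCBI_API_KEY": api_key}
    return {"mcpServers": {"ncbi-database": server}}


if __name__ == "__main__":
    config = build_client_config(".", os.environ.get("NCBI_API_KEY"))
    print(json.dumps(config, indent=2))
```

Run it from the repository root and paste the printed JSON into your MCP client config.
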
Usage Examples

Disease Expression Research (Primary Use Case)
User: "Find gene expression datasets for Alzheimer's disease in humans"
AI: [calls search_geo_datasets] →
Returns 10 datasets with:
- Study methodology (RNA-Seq, Microarray)
- Sample sizes and experimental design
- Platform information (Illumina, Affymetrix)
- Research summaries and direct GEO links
User: "Show me cancer expression studies in mice using RNA sequencing"
AI: [calls search_geo_datasets] →
Filtered results showing:
- RNA-Seq datasets only
- Mouse-specific cancer studies
- Detailed experimental protocols

Gene-to-Genomic Analysis
User: "Get the genomic sequence for BRCA1"
AI: [calls gene_to_genomic_sequence] → Returns genomic DNA sequence in FASTA format

Gene Information & Location
User: "Find information about TP53 gene"
AI: [calls search_gene_info] → Returns gene location, function, and coordinates

Coordinate-Based Sequence Retrieval
User: "Get sequence from chr17:43044295-43125483"
AI: [calls get_genomic_sequence] → Returns DNA sequence for specified coordinates

Available Tools

search_geo_datasets (Primary Tool)

Discover gene expression datasets by disease/condition and organism

Parameters:

  • disease (required) - Disease or condition name
    • Examples: "cancer", "diabetes", "Alzheimer", "heart disease", "depression"
  • organism - Target organism (default: "Homo sapiens")
    • Options: "Homo sapiens", "Mus musculus", "Rattus norvegicus"
  • study_type - Expression study methodology (optional, default: "Expression profiling by high throughput sequencing")
    • Options: "Expression profiling by array", "Expression profiling by high throughput sequencing"
    • Default: RNA-Seq - Most comprehensive and current sequencing technology
  • max_results - Maximum results to return (1-50, default: 10)
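
To make these parameters concrete, here is a minimal sketch of how they could translate into an NCBI E-utilities `esearch` request against the GEO DataSets database (`db=gds`). This README does not show the server's actual query construction, so `build_geo_search_url` and the exact term layout, including the `[Organism]` and `[DataSet Type]` field tags, are assumptions.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
DEFAULT_STUDY_TYPE = "Expression profiling by high throughput sequencing"


def build_geo_search_url(disease, organism="Homo sapiens",
                         study_type=DEFAULT_STUDY_TYPE,
                         max_results=10, api_key=None):
    """Build an esearch URL for GEO DataSets (hypothetical helper)."""
    if not disease or not disease.strip():
        raise ValueError("disease is required")
    if not 1 <= max_results <= 50:
        raise ValueError("max_results must be between 1 and 50")

    # Combine the free-text disease term with fielded filters.
    terms = [f"({disease.strip()})", f'"{organism}"[Organism]']
    if study_type:
        # Pass study_type=None to search without a methodology filter.
        terms.append(f'"{study_type}"[DataSet Type]')

    params = {
        "db": "gds",
        "term": " AND ".join(terms),
        "retmax": max_results,
        "retmode": "json",
    }
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_geo_search_url("lung cancer", organism="Mus musculus", max_results=5))
```

Validating `max_results` before building the URL keeps the 1-50 range from the parameter list in one place.
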

Detailed Output:

  • Dataset Information: GDS accession numbers and titles
  • Study Methodology:
    • RNA-Seq (High-throughput transcriptome sequencing) - DEFAULT
    • Microarray (Hybridization-based gene expression)
    • ChIP-Seq (Chromatin immunoprecipitation sequencing)
    • SAGE (Serial analysis of gene expression)
  • Data Type Classification:
    • Single-Cell RNA-Seq - Individual cell-level gene expression
    • Bulk RNA-Seq - Tissue/population-level gene expression
    • Spatial Transcriptomics - Location-aware gene expression
  • Platform Details: Illumina, Affymetrix, Agilent technologies
  • Experimental Design: Sample counts, tissue types, treatment conditions
  • Research Context: Study summaries and disease relevance
  • Direct Access: Links to full datasets on NCBI GEO

gene_to_genomic_sequence

Convert gene name to genomic DNA sequence

Parameters:

  • gene_name (required) - Gene symbol (e.g., "BRCA1", "TP53")
  • organism - Target organism (default: "human")
  • sequence_type - "genomic", "cds", "mrna", "protein"
  • output_format - "fasta", "genbank", "json"
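
From the client side, invoking this tool is an MCP `tools/call` request. MCP clients normally build this message for you; the sketch below only shows its shape using the parameters listed above. `make_tool_call` is a hypothetical helper, not part of this project.

```python
import json
from itertools import count

# JSON-RPC requires each request to carry its own id.
_request_ids = count(1)


def make_tool_call(tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = make_tool_call(
    "gene_to_genomic_sequence",
    {
        "gene_name": "BRCA1",
        "organism": "human",
        "sequence_type": "genomic",
        "output_format": "fasta",
    },
)
print(json.dumps(request, indent=2))
```

The same helper works for the other tools by changing the tool name and arguments.
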
search_gene_info

Search for gene information and genomic location

Parameters:

  • gene_name (required) - Gene symbol or name
  • organism - Target organism (default: "human")
get_genomic_sequence

Get genomic sequence from chromosome coordinates

Parameters:

  • chromosome (required) - Chromosome accession (e.g., "NC_000017.11")
  • start (required) - Start position
  • end (required) - End position
  • output_format - "fasta", "json"
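
Coordinates are often written as a region string such as chr17:43044295-43125483, while this tool takes a chromosome accession plus start and end positions. The following sketch shows one way to convert between the two. `parse_region` and the lookup table are hypothetical; only the chr17 accession comes from this README, and the GRCh38 assembly is assumed.

```python
import re

# Assumption: GRCh38 RefSeq accessions. Only chr17 is taken from this README;
# extend the table for other chromosomes or assemblies.
CHROM_TO_ACCESSION = {"17": "NC_000017.11"}

# Matches "chr17:43044295-43125483" or "17:43,044,295-43,125,483".
_REGION = re.compile(r"^(?:chr)?([0-9A-Za-z]+):([\d,]+)-([\d,]+)$")


def parse_region(region):
    """Turn a region string into get_genomic_sequence arguments."""
    match = _REGION.match(region.strip())
    if match is None:
        raise ValueError(f"not a region string: {region!r}")
    name, start_text, end_text = match.groups()
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    if name not in CHROM_TO_ACCESSION:
        raise KeyError(f"no accession known for chromosome {name}")
    return {"chromosome": CHROM_TO_ACCESSION[name], "start": start, "end": end}


if __name__ == "__main__":
    print(parse_region("chr17:43044295-43125483"))
```
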
Configuration
Environment Variables

You can configure the server using environment variables:

# Copy example file and edit
cp .env.example .env

# Or set directly
export NCBI_API_KEY="your_api_key_here"

# Get your free API key from: https://www.ncbi.nlm.nih.gov/account/
# Without API key: 3 requests/second
# With API key: 10 requests/second
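
The two limits above imply spacing requests at least 1/3 s apart without a key and 1/10 s apart with one. Below is a minimal client-side throttle for your own scripts. It assumes nothing about how the server itself enforces limits, and `RateLimiter` and `limiter_from_env` are illustrative names.

```python
import os
import time


class RateLimiter:
    """Space calls so they never exceed a fixed requests-per-second rate."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


def limiter_from_env(environ=os.environ):
    """Pick 10 requests/second when NCBI_API_KEY is set, otherwise 3."""
    rate = 10 if environ.get("NCBI_API_KEY") else 3
    return RateLimiter(rate)


if __name__ == "__main__":
    limiter = limiter_from_env()
    for _ in range(3):
        limiter.wait()  # call this before each NCBI request
```
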
Project Structure
NCBI-Database-MCP/
├── README.md                    # Documentation
├── requirements.txt             # Python dependencies
├── pyproject.toml               # Project configuration
├── .env.example                 # Environment variables template
├── run_fastmcp_gene_server.sh   # Launch script
└── src/
    ├── gene_to_genomic_server.py  # Standard MCP server
    └── fastmcp_gene_server.py     # FastMCP server (recommended)

Performance Tips
GEO Dataset Search Optimization
  • Use specific disease terms: prefer "lung cancer" over "cancer", and "type 2 diabetes" over "diabetes"
  • Combine with study types: Filter by methodology for targeted results
  • Start with small result sets: Use max_results=5-10 for initial exploration
  • Organism specificity: Use exact names ("Homo sapiens" not "human")

Troubleshooting
Common Issues

Gene not found

# Check gene name spelling
# Try alternative gene symbols
# Verify organism specification

No GEO datasets found

# Try broader disease terms (e.g., "cancer" instead of "lung adenocarcinoma")
# Check organism name (use "Homo sapiens" not "human")
# Try without study_type filter
# Verify disease spelling and terminology

API rate limiting

# Get free NCBI API key: https://www.ncbi.nlm.nih.gov/account/
# Set NCBI_API_KEY environment variable
# Without key: 3 requests/second limit
# With key: 10 requests/second limit

Network timeouts

# Check internet connection
# Increase timeout values
# Retry failed requests
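
One way to apply the timeout and retry advice in your own scripts is exponential backoff. This is a sketch using only the standard library; `fetch_with_retry`, its default timeout, and its delays are assumptions, not settings exposed by this server.

```python
import time
import urllib.request


def fetch_with_retry(url, attempts=3, timeout=30.0, base_delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, retrying network errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError as error:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            last_error = error
            if attempt < attempts - 1:
                # Wait base_delay, then twice that, and so on.
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

For simplicity this retries every network-level error, including HTTP 4xx responses that will not succeed on retry; a stricter version would inspect the status code first.
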

Happy genomics research!