Kreuzberg

A document intelligence framework for Python. Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.

📖 Complete Documentation

Framework Overview
Document Intelligence Capabilities
  • Text Extraction: High-fidelity text extraction preserving document structure and formatting
  • Metadata Extraction: Comprehensive metadata including author, creation date, language, and document properties
  • Format Support: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
  • OCR Integration: Tesseract OCR with markdown output (default) and table extraction from scanned documents
  • Document Classification: Automatic document type detection (contracts, forms, invoices, receipts, reports)
Technical Architecture
  • Performance: Highest throughput among the Python document processing frameworks benchmarked (30+ files/second on tiny files)
  • Resource Efficiency: 71MB installation, ~360MB runtime memory footprint
  • Extensibility: Plugin architecture for custom extractors via the Extractor base class
  • API Design: Synchronous and asynchronous APIs with consistent interfaces
  • Type Safety: Complete type annotations throughout the codebase
Open Source Foundation

Kreuzberg leverages established open source technologies:

  • Pandoc: Universal document converter for robust format support
  • PDFium: Google's PDF rendering engine for accurate PDF processing
  • Tesseract: Open source OCR engine, long sponsored by Google, for text recognition
  • python-docx / python-pptx: Native Microsoft Office format support
Quick Start
Extract Text with CLI
# Extract text from any file to text format
uvx kreuzberg extract document.pdf > output.txt

# Choose the OCR backend and output format explicitly
uvx kreuzberg extract invoice.pdf --ocr-backend tesseract --output-format text

# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --output-format json
Python Usage

Async (recommended for web apps):

import asyncio

from kreuzberg import extract_file


async def main() -> None:
    # extract_file is a coroutine, so it must be awaited inside an async function
    result = await extract_file("presentation.pptx")
    print(result.content)

    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")


asyncio.run(main())

Sync (for scripts and CLI tools):

from kreuzberg import extract_file_sync

result = extract_file_sync("report.docx")
print(result.content)

# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")
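
When many files need processing, the async API lends itself to bounded concurrency. The sketch below is one possible approach, not part of Kreuzberg: `extract_many` and its `max_concurrency` parameter are hypothetical names, and the extraction coroutine is passed in as an argument so the helper stays independent of the library.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import TypeVar, Union

T = TypeVar("T")


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[T]],
    max_concurrency: int = 4,
) -> list[Union[T, BaseException]]:
    """Run ``extract`` over ``paths`` concurrently, preserving input order."""
    # The semaphore caps how many extractions are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(path: str) -> T:
        async with semaphore:
            return await extract(path)

    # return_exceptions=True returns a failure in place of its result,
    # so one unreadable file does not abort the rest of the batch.
    return await asyncio.gather(
        *(run_one(path) for path in paths), return_exceptions=True
    )
```

With Kreuzberg installed, the `extract_file` coroutine from the async example can be passed as `extract`, for instance `asyncio.run(extract_many(["a.pdf", "b.docx"], extract_file))`.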
Docker

Two optimized images are available:

# Base image (API + CLI + multilingual OCR)
docker run -p 8000:8000 goldziher/kreuzberg

# Core image (+ chunking + crypto + document classification + language detection)
docker run -p 8000:8000 goldziher/kreuzberg-core:latest

# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
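
The same upload can be made from Python using only the standard library. This is a sketch built on the curl call above: it reuses the `file` form field and the `http://localhost:8000/extract` URL from that example, while the helper names `build_multipart` and `extract_via_api` are hypothetical, and the assumption that the server replies with JSON is not confirmed by this README.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple:
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_via_api(path: str, url: str = "http://localhost:8000/extract"):
    """POST a file to the running container and decode the reply.

    Assumption: the response body is JSON.
    """
    file_path = Path(path)
    body, content_type = build_multipart(
        "file", file_path.name, file_path.read_bytes()
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())
```

Calling `extract_via_api("document.pdf")` requires one of the containers above to be listening on port 8000.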

📖 Installation Guide • CLI Documentation • API Reference

Deployment Options
🤖 MCP Server (AI Integration)

Add the server with one command using the claude CLI:

claude mcp add kreuzberg uvx kreuzberg-mcp

Or configure Claude Desktop manually in claude_desktop_config.json:

{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["kreuzberg-mcp"]
    }
  }
}
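
If the config file already lists other servers, the entry has to be merged in rather than pasted over the whole file. A minimal sketch of that merge follows; `add_mcp_server` and `update_config_file` are hypothetical helper names, and the location of claude_desktop_config.json varies by platform, so the path is left to the caller.

```python
import json
from pathlib import Path


def add_mcp_server(config: dict, name: str, command: str, args: list) -> dict:
    """Return a copy of ``config`` with one entry added under "mcpServers"."""
    # Copy the existing servers so the caller's dict is not mutated.
    servers = dict(config.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    return {**config, "mcpServers": servers}


def update_config_file(path: Path) -> None:
    """Read, update, and rewrite a claude_desktop_config.json file in place."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    updated = add_mcp_server(config, "kreuzberg", "uvx", ["kreuzberg-mcp"])
    path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
```

Existing servers and unrelated top-level settings are carried over unchanged.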

MCP capabilities:

  • Extract text from PDFs, images, Office docs, and more
  • Multilingual OCR support with Tesseract
  • Metadata parsing and language detection

📖 MCP Documentation

Supported Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, RTF, TXT, EPUB |
| Images | JPG, PNG, TIFF, BMP, GIF, WEBP |
| Spreadsheets | XLSX, XLS, CSV, ODS |
| Presentations | PPTX, PPT, ODP |
| Web | HTML, XML, MHTML |
| Archives | Supported via extraction |
📊 Performance Characteristics

View comprehensive benchmarks • Benchmark methodology • Detailed Analysis

Technical Specifications
| Metric | Kreuzberg Sync | Kreuzberg Async | Benchmarked |
|---|---|---|---|
| Throughput (tiny files) | 31.78 files/s | 23.94 files/s | Highest throughput |
| Throughput (small files) | 8.91 files/s | 9.31 files/s | Highest throughput |
| Memory footprint | 359.8 MB | 395.2 MB | Lowest usage |
| Installation size | 71 MB | 71 MB | Smallest size |
| Success rate | 100% | 100% | Perfect |
| Supported formats | 18 | 18 | Comprehensive |
Architecture Advantages
  • Native C extensions: Built on PDFium and Tesseract for maximum performance
  • Async/await support: True asynchronous processing with intelligent task scheduling
  • Memory efficiency: Streaming architecture minimizes memory allocation
  • Process pooling: Automatic multiprocessing for CPU-intensive operations
  • Optimized data flow: Efficient data handling with minimal transformations

Benchmark details: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.
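
To get a rough feel for the throughput and success-rate figures on your own files, a timing loop like the one below can help. It is only a sketch and not the project's benchmark methodology: `measure_throughput` is a hypothetical helper, it takes the extraction function as an argument, and it does not measure memory usage.

```python
import time
from collections.abc import Callable, Iterable


def measure_throughput(
    paths: Iterable[str], extract: Callable[[str], object]
) -> dict:
    """Time ``extract`` over ``paths``; report files/s and success rate."""
    succeeded = failed = 0
    start = time.perf_counter()
    for path in paths:
        try:
            extract(path)
            succeeded += 1
        except Exception:
            # A failed extraction still counts toward the files processed.
            failed += 1
    elapsed = time.perf_counter() - start
    total = succeeded + failed
    return {
        "files": float(total),
        "seconds": elapsed,
        "files_per_second": total / elapsed if elapsed > 0 else 0.0,
        "success_rate": succeeded / total if total else 0.0,
    }
```

With Kreuzberg installed, `extract_file_sync` from the sync example can be passed as `extract`. Results depend heavily on file size, format, and hardware, so they are not directly comparable with the table above.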

License

MIT License - see LICENSE for details.