kreuzberg

Author: Goldziher

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

README

Kreuzberg

A document intelligence framework for Python. Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.

📖 Complete Documentation

Framework Overview

Document Intelligence Capabilities

Text Extraction: High-fidelity text extraction preserving document structure and formatting
Metadata Extraction: Comprehensive metadata including author, creation date, language, and document properties
Format Support: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
OCR Integration: Tesseract OCR with markdown output (default) and table extraction from scanned documents
Document Classification: Automatic document type detection (contracts, forms, invoices, receipts, reports)

Technical Architecture

Performance: Highest throughput among the Python document processing frameworks benchmarked (30+ files/second on tiny files in sync mode)
Resource Efficiency: 71MB installation, ~360MB runtime memory footprint
Extensibility: Plugin architecture for custom extractors via the Extractor base class
API Design: Synchronous and asynchronous APIs with consistent interfaces
Type Safety: Complete type annotations throughout the codebase
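
The plugin idea in the Extensibility bullet can be illustrated with a small, self-contained sketch. Everything here is hypothetical: the Extractor class, its extract_bytes method, the Result container, and the Registry are stand-ins written for this example and are not Kreuzberg's actual base class or signatures. Check the API reference before subclassing the real one.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Result:
    """Hypothetical result container: extracted text plus a metadata dict."""

    content: str
    metadata: dict = field(default_factory=dict)


class Extractor(ABC):
    """Hypothetical base class; Kreuzberg's real Extractor may differ."""

    # MIME types this extractor can handle.
    mime_types = frozenset()

    @abstractmethod
    def extract_bytes(self, data: bytes) -> Result:
        """Turn raw file bytes into a Result."""


class PlainTextExtractor(Extractor):
    """Example plugin: decodes UTF-8 text and records its length."""

    mime_types = frozenset({"text/plain"})

    def extract_bytes(self, data: bytes) -> Result:
        text = data.decode("utf-8", errors="replace")
        return Result(content=text, metadata={"char_count": len(text)})


class Registry:
    """Maps MIME types to extractor instances; later registrations win."""

    def __init__(self) -> None:
        self._by_mime = {}

    def register(self, extractor: Extractor) -> None:
        for mime in extractor.mime_types:
            self._by_mime[mime] = extractor

    def extract(self, data: bytes, mime_type: str) -> Result:
        try:
            extractor = self._by_mime[mime_type]
        except KeyError:
            raise ValueError(f"no extractor registered for {mime_type!r}") from None
        return extractor.extract_bytes(data)
```

Dispatching on MIME type keeps the caller unaware of which backend handles a file, which is the property a unified API relies on.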

Open Source Foundation

Kreuzberg leverages established open source technologies:

Pandoc: Universal document converter for robust format support
PDFium: Google's PDF rendering engine for accurate PDF processing
Tesseract: Open source OCR engine (originally developed at HP, later sponsored by Google) for text recognition
Python-docx/pptx: Native Microsoft Office format support

Quick Start

Extract Text with CLI

# Extract text from any file to text format
uvx kreuzberg extract document.pdf > output.txt

# Choose the OCR backend and output format explicitly
uvx kreuzberg extract invoice.pdf --ocr-backend tesseract --output-format text

# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --output-format json

Python Usage

Async (recommended for web apps):

import asyncio

from kreuzberg import extract_file


async def main() -> None:
    # extract_file is a coroutine, so it must be awaited inside an async function
    result = await extract_file("presentation.pptx")
    print(result.content)

    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")


asyncio.run(main())

Sync (for scripts and CLI tools):

from kreuzberg import extract_file_sync

result = extract_file_sync("report.docx")
print(result.content)

# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")
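
When many files need processing, the async API lends itself to bounded concurrency. The sketch below is an illustration, not part of Kreuzberg: extract_many and max_concurrency are hypothetical names, and the extraction coroutine is passed in as a parameter so the helper makes no assumptions about the library's signatures.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Awaitable[T]],
    max_concurrency: int = 4,
) -> List[T]:
    """Run `extract` over all paths, at most `max_concurrency` at a time.

    Results come back in the same order as `paths`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: str) -> T:
        # The semaphore caps how many extractions are in flight at once.
        async with semaphore:
            return await extract(path)

    return await asyncio.gather(*(bounded(p) for p in paths))
```

Inside an async function this would be called as results = await extract_many(["a.pdf", "b.docx"], extract_file), with extract_file imported from kreuzberg as shown above.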

Docker

Two optimized images available:

# Base image (API + CLI + multilingual OCR)
docker run -p 8000:8000 goldziher/kreuzberg

# Core image (+ chunking + crypto + document classification + language detection)
docker run -p 8000:8000 goldziher/kreuzberg-core:latest

# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
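
The same request can be made from Python with only the standard library. This is a minimal sketch: the URL and the "file" field name come from the curl command above, while build_multipart and post_file are hypothetical helper names. The response schema is not described here, so the raw response body is returned as bytes.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes) -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(url: str, path: str, timeout: float = 60.0) -> bytes:
    """POST a file to `url` and return the raw response body."""
    file_path = Path(path)
    body, content_type = build_multipart(
        "file", file_path.name, file_path.read_bytes()
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the container running, post_file("http://localhost:8000/extract", "document.pdf") sends the same upload as the curl example.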

📖 Installation Guide • CLI Documentation • API Reference

Deployment Options

🤖 MCP Server (AI Integration)

Register the server with the claude CLI in one command:

claude mcp add kreuzberg uvx kreuzberg-mcp

Or configure manually in claude_desktop_config.json:

{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["kreuzberg-mcp"]
    }
  }
}
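
If the config file already lists other MCP servers, the entry should be merged in rather than pasted over the whole file. A minimal sketch of that merge follows; add_mcp_server is a hypothetical helper, and the location of claude_desktop_config.json is left as a parameter because it varies by platform.

```python
import json
from pathlib import Path
from typing import List


def add_mcp_server(config_path: Path, name: str, command: str, args: List[str]) -> dict:
    """Insert or update one entry under "mcpServers", keeping the others."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    # setdefault creates the "mcpServers" object only when it is missing.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": args}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling add_mcp_server(path, "kreuzberg", "uvx", ["kreuzberg-mcp"]) produces the same entry as the JSON shown above.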

MCP capabilities:

Extract text from PDFs, images, Office docs, and more
Multilingual OCR support with Tesseract
Metadata parsing and language detection

📖 MCP Documentation

Supported Formats

Category	Formats
Documents	PDF, DOCX, DOC, RTF, TXT, EPUB
Images	JPG, PNG, TIFF, BMP, GIF, WEBP
Spreadsheets	XLSX, XLS, CSV, ODS
Presentations	PPTX, PPT, ODP
Web	HTML, XML, MHTML
Archives	Support via extraction
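
For pre-filtering files before extraction, the table above can be turned into a simple lookup. This sketch only mirrors the extensions listed in the table; SUPPORTED_FORMATS and format_category are hypothetical names, and Kreuzberg's own detection may work differently (for example, by MIME type rather than by extension).

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported Formats table (archives omitted,
# since the table lists no concrete archive extensions).
SUPPORTED_FORMATS = {
    "Documents": {"pdf", "docx", "doc", "rtf", "txt", "epub"},
    "Images": {"jpg", "png", "tiff", "bmp", "gif", "webp"},
    "Spreadsheets": {"xlsx", "xls", "csv", "ods"},
    "Presentations": {"pptx", "ppt", "odp"},
    "Web": {"html", "xml", "mhtml"},
}


def format_category(path: str) -> Optional[str]:
    """Return the table category for a file path, or None if unlisted."""
    ext = Path(path).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_FORMATS.items():
        if ext in extensions:
            return category
    return None
```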

📊 Performance Characteristics

View comprehensive benchmarks • Benchmark methodology • Detailed Analysis

Technical Specifications

Metric	Kreuzberg Sync	Kreuzberg Async	Benchmark result
Throughput (tiny files)	31.78 files/s	23.94 files/s	Highest throughput
Throughput (small files)	8.91 files/s	9.31 files/s	Highest throughput
Memory footprint	359.8 MB	395.2 MB	Lowest usage
Installation size	71 MB	71 MB	Smallest size
Success rate	100%	100%	Perfect
Supported formats	18	18	Comprehensive
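
Throughput figures are easier to plan with when converted to per-file latency and batch time. The sketch below does that arithmetic using the numbers from the table; the function names are hypothetical, and the estimates assume throughput stays constant across a batch, which real workloads may not satisfy.

```python
# Throughput figures copied from the table above, in files per second.
THROUGHPUT = {
    ("tiny", "sync"): 31.78,
    ("tiny", "async"): 23.94,
    ("small", "sync"): 8.91,
    ("small", "async"): 9.31,
}


def ms_per_file(files_per_second: float) -> float:
    """Average milliseconds spent per file at the given throughput."""
    return 1000.0 / files_per_second


def seconds_for_batch(n_files: int, files_per_second: float) -> float:
    """Estimated wall-clock seconds to process n_files at constant throughput."""
    return n_files / files_per_second


def report() -> None:
    """Print latency and a 1,000-file estimate for every table entry."""
    for (size, mode), rate in THROUGHPUT.items():
        print(
            f"{size:>5} / {mode:<5}: {ms_per_file(rate):6.1f} ms per file, "
            f"{seconds_for_batch(1000, rate):6.1f} s per 1,000 files"
        )
```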

Architecture Advantages

Native C extensions: Built on PDFium and Tesseract for maximum performance
Async/await support: True asynchronous processing with intelligent task scheduling
Memory efficiency: Streaming architecture minimizes memory allocation
Process pooling: Automatic multiprocessing for CPU-intensive operations
Optimized data flow: Efficient data handling with minimal transformations
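
The combination of async/await and process pooling listed above follows a common pattern: CPU-bound work is handed to a pool of worker processes so the event loop stays responsive. The sketch below shows the general technique only and is not Kreuzberg's internal implementation; cpu_heavy is a stand-in for work such as OCR, and all function names are hypothetical.

```python
import asyncio
import hashlib
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import List


def cpu_heavy(data: bytes) -> str:
    """Stand-in for CPU-bound work such as OCR; here, repeated hashing."""
    digest = data
    for _ in range(1000):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


async def process_all(items: List[bytes], executor: Executor) -> List[str]:
    """Run cpu_heavy on each item in the executor without blocking the loop."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(executor, cpu_heavy, item) for item in items]
    return await asyncio.gather(*futures)


def run_in_processes(items: List[bytes]) -> List[str]:
    """Convenience wrapper that spreads the work over worker processes."""
    with ProcessPoolExecutor() as pool:
        return asyncio.run(process_all(items, pool))
```

On platforms that start workers by spawning a fresh interpreter, call run_in_processes from under an if __name__ == "__main__": guard.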

Benchmark details: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.

Documentation

License

MIT License - see LICENSE for details.
