kreuzberg
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
Kreuzberg
A document intelligence framework for Python. Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.
Framework Overview
Document Intelligence Capabilities
- Text Extraction: High-fidelity text extraction preserving document structure and formatting
- Metadata Extraction: Comprehensive metadata including author, creation date, language, and document properties
- Format Support: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
- OCR Integration: Tesseract OCR with markdown output (default) and table extraction from scanned documents
- Document Classification: Automatic document type detection (contracts, forms, invoices, receipts, reports)
Technical Architecture
- Performance: Highest throughput among the Python document processing frameworks benchmarked (about 32 files/second on tiny files with the sync API)
- Resource Efficiency: 71MB installation, ~360MB runtime memory footprint
- Extensibility: Plugin architecture for custom extractors via the Extractor base class
- API Design: Synchronous and asynchronous APIs with consistent interfaces
- Type Safety: Complete type annotations throughout the codebase
Open Source Foundation
Kreuzberg leverages established open source technologies:
- Pandoc: Universal document converter for robust format support
- PDFium: Google's PDF rendering engine for accurate PDF processing
- Tesseract: Open source OCR engine for text recognition (originally developed at HP, later sponsored by Google)
- Python-docx/pptx: Native Microsoft Office format support
Quick Start
Extract Text with CLI
# Extract text from any file to text format
uvx kreuzberg extract document.pdf > output.txt
# Extract with the Tesseract OCR backend, writing plain text
uvx kreuzberg extract invoice.pdf --ocr-backend tesseract --output-format text
# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --output-format json
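The same CLI calls can be driven from a script. The following is a minimal sketch that uses only the flags shown above (--ocr-backend, --show-metadata, --output-format); the helper names build_extract_command and run_extract are hypothetical, and the shape of the JSON the CLI prints is not specified here, so the result is returned as parsed but uninterpreted data.

```python
import json
import subprocess


def build_extract_command(path, output_format="json", show_metadata=True, ocr_backend=None):
    """Assemble the argument list for `uvx kreuzberg extract` (hypothetical helper)."""
    cmd = ["uvx", "kreuzberg", "extract", path, "--output-format", output_format]
    if show_metadata:
        cmd.append("--show-metadata")
    if ocr_backend is not None:
        cmd += ["--ocr-backend", ocr_backend]
    return cmd


def run_extract(path, show_metadata=True, ocr_backend=None):
    """Run the CLI with JSON output and parse stdout.

    Assumes the CLI writes a single JSON document to stdout; raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    cmd = build_extract_command(
        path, output_format="json", show_metadata=show_metadata, ocr_backend=ocr_backend
    )
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Keeping command construction separate from execution makes the flag handling easy to check without invoking uvx.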
Python Usage
Async (recommended for web apps):
import asyncio
from kreuzberg import extract_file
async def main() -> None:
    result = await extract_file("presentation.pptx")
    print(result.content)
    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")
asyncio.run(main())
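In a web app the async API is usually called for several files at once. The sketch below shows one way to bound that concurrency with the standard library. extract_many is a hypothetical helper, not part of Kreuzberg, and it takes the extractor as a parameter so any async callable that accepts a path (such as extract_file) can be passed in.

```python
import asyncio


async def extract_many(paths, extract, max_concurrency=4):
    """Run `extract` over all paths, at most `max_concurrency` at a time.

    `extract` is any async callable taking a single path argument.
    Results are returned in the same order as `paths`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path):
        # Hold a semaphore slot for the duration of one extraction.
        async with semaphore:
            return await extract(path)

    return await asyncio.gather(*(bounded(p) for p in paths))
```

With extract_file imported from kreuzberg, a call such as asyncio.run(extract_many(["a.pdf", "b.docx"], extract_file)) would return one result per input path. The concurrency limit of 4 is an arbitrary default, not a recommendation from the benchmarks.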
Sync (for scripts and CLI tools):
from kreuzberg import extract_file_sync
result = extract_file_sync("report.docx")
print(result.content)
# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")
Docker
Two optimized images are available:
# Base image (API + CLI + multilingual OCR)
docker run -p 8000:8000 goldziher/kreuzberg
# Core image (+ chunking + crypto + document classification + language detection)
docker run -p 8000:8000 goldziher/kreuzberg-core:latest
# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
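The curl request above can also be issued from Python with only the standard library. This is a sketch under assumptions: the field name file and the URL come from the curl example, the helper names build_multipart and post_file are hypothetical, and the response format is not documented here, so the raw bytes are returned unparsed.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, data, boundary=None):
    """Encode one file as a multipart/form-data body.

    Returns a (body_bytes, content_type_header) pair.
    """
    boundary = boundary or uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(path, url="http://localhost:8000/extract"):
    """POST a file to the extract endpoint and return the raw response bytes."""
    file_path = Path(path)
    body, content_type = build_multipart("file", file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling post_file("document.pdf") requires one of the containers above to be running on port 8000.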
📖 Installation Guide • CLI Documentation • API Reference
Deployment Options
🤖 MCP Server (AI Integration)
Add to Claude Code with one command (claude is the Claude Code CLI):
claude mcp add kreuzberg uvx kreuzberg-mcp
Or configure Claude Desktop manually in claude_desktop_config.json:
{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["kreuzberg-mcp"]
    }
  }
}
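If claude_desktop_config.json already lists other servers, the kreuzberg entry has to be merged in rather than pasted over the file. A minimal sketch of that merge follows; the helper names are hypothetical, and the config path is passed in by the caller because its location differs by operating system.

```python
import json
from pathlib import Path


def add_kreuzberg_server(config):
    """Return a copy of a config dict with the kreuzberg MCP entry added.

    Existing servers and unrelated top-level keys are preserved; the
    input dict is not modified.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["kreuzberg"] = {"command": "uvx", "args": ["kreuzberg-mcp"]}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path):
    """Rewrite the config file at `path` in place, creating it if it is missing."""
    config_path = Path(path)
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config_path.write_text(json.dumps(add_kreuzberg_server(config), indent=2) + "\n")
```

The client typically needs a restart before it picks up the new server entry.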
MCP capabilities:
- Extract text from PDFs, images, Office docs, and more
- Multilingual OCR support with Tesseract
- Metadata parsing and language detection
Supported Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, RTF, TXT, EPUB |
| Images | JPG, PNG, TIFF, BMP, GIF, WEBP |
| Spreadsheets | XLSX, XLS, CSV, ODS |
| Presentations | PPTX, PPT, ODP |
| Web | HTML, XML, MHTML |
| Archives | Support via extraction |
📊 Performance Characteristics
View comprehensive benchmarks • Benchmark methodology • Detailed Analysis
Technical Specifications
| Metric | Kreuzberg Sync | Kreuzberg Async | Benchmark standing |
|---|---|---|---|
| Throughput (tiny files) | 31.78 files/s | 23.94 files/s | Highest throughput |
| Throughput (small files) | 8.91 files/s | 9.31 files/s | Highest throughput |
| Memory footprint | 359.8 MB | 395.2 MB | Lowest usage |
| Installation size | 71 MB | 71 MB | Smallest size |
| Success rate | 100% | 100% | Perfect |
| Supported formats | 18 | 18 | Comprehensive |
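The throughput rows translate directly into rough batch times. The sketch below copies the four throughput figures from the table and divides a file count by them; it assumes the benchmark rate holds for the whole batch, which real workloads may not match. Note that the sync API is faster on tiny files, while the async API is slightly faster on small files.

```python
# Throughput figures (files/second) copied from the table above.
THROUGHPUT = {
    ("tiny", "sync"): 31.78,
    ("tiny", "async"): 23.94,
    ("small", "sync"): 8.91,
    ("small", "async"): 9.31,
}


def estimated_seconds(file_count, size_class, api):
    """Estimate wall-clock seconds for a batch at the benchmarked throughput."""
    return file_count / THROUGHPUT[(size_class, api)]


if __name__ == "__main__":
    for (size_class, api) in THROUGHPUT:
        seconds = estimated_seconds(1000, size_class, api)
        print(f"1000 {size_class} files, {api}: {seconds:.1f} s")
```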
Architecture Advantages
- Native C extensions: Built on PDFium and Tesseract for maximum performance
- Async/await support: True asynchronous processing with intelligent task scheduling
- Memory efficiency: Streaming architecture minimizes memory allocation
- Process pooling: Automatic multiprocessing for CPU-intensive operations
- Optimized data flow: Efficient data handling with minimal transformations
Benchmark details: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.
Documentation
Quick Links
- Installation Guide - Setup and dependencies
- User Guide - Comprehensive usage guide
- Performance Analysis - Detailed benchmark results
- API Reference - Complete API documentation
- Docker Guide - Container deployment
- REST API - HTTP endpoints
- CLI Guide - Command-line usage
- OCR Configuration - OCR engine setup
License
MIT License - see LICENSE for details.