LLM-Aided OCR - Get More Accurate OCR Outputs with this Open-source App

Hazem Abbas

Nov 19, 2024 — 7 min read


Sometimes, traditional OCR just doesn’t cut it. I’ve tried several tools in the past to get accurate results, but they often fell short. With the power of LLMs and Retrieval-Augmented Generation (RAG), though, you can get much more accurate and better-formatted output, as the project I’m covering today shows.

LLM-Aided OCR is an open-source project that uses natural language processing and large language models (LLMs) to significantly improve OCR results, turning raw OCR text into accurate, well-formatted, readable documents.

Features

PDF to image conversion
OCR using Tesseract
Advanced error correction using LLMs (local or API-based)
Smart text chunking for efficient processing
Markdown formatting option
Header and page number suppression (optional)
Quality assessment of the final output
Support for both local LLMs and cloud-based API providers (OpenAI, Anthropic)
Asynchronous processing for improved performance
Detailed logging for process tracking and debugging
GPU acceleration for local LLM inference
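
To make the asynchronous-processing feature concrete, here is a minimal sketch of correcting several chunks concurrently with Python's asyncio. It is not the project's actual code: `correct_chunk`, `correct_all`, and `max_concurrency` are hypothetical names, and the "correction" is a placeholder that only collapses whitespace where a real LLM call would go.

```python
import asyncio


async def correct_chunk(chunk: str) -> str:
    # Hypothetical placeholder for an LLM call (API or local model).
    # Here it only collapses runs of whitespace.
    await asyncio.sleep(0)  # stands in for network or inference latency
    return " ".join(chunk.split())


async def correct_all(chunks: list[str], max_concurrency: int = 4) -> list[str]:
    # The semaphore caps how many corrections run at the same time,
    # which helps with API rate limits or limited GPU memory.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk: str) -> str:
        async with semaphore:
            return await correct_chunk(chunk)

    # gather returns results in the same order as the input chunks.
    return await asyncio.gather(*(bounded(c) for c in chunks))


if __name__ == "__main__":
    print(asyncio.run(correct_all(["Hello   world", "second\nchunk"])))
```

Capping concurrency is one reasonable design choice here: the chunks are independent, so they can be processed in parallel, but an unbounded burst of requests is rarely what an API provider or a local model can handle.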

Requirements

Python 3.12+
Tesseract OCR engine
pdf2image library
pytesseract
OpenAI API (optional)
Anthropic API (optional)
Local LLM support (optional, requires compatible GGUF model)
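
If you want to check these requirements before running anything, a small script along the following lines could help. It is only a sketch: `check_environment` is a hypothetical helper, and it assumes the Tesseract executable is named `tesseract` and is on your PATH.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python: tuple[int, int] = (3, 12)) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Python version check (the article lists Python 3.12+).
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )

    # Assumption: the Tesseract binary is called "tesseract" and is on PATH.
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")

    # Check that the Python packages can be imported, without importing them.
    for module in ("pdf2image", "pytesseract"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python package '{module}' is not installed")

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("Missing:", problem)
```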

How does it work?

The LLM-Aided OCR project employs a multi-step process to transform raw OCR output into high-quality, readable text:

PDF Conversion: Converts input PDF into images using pdf2image.
OCR: Applies Tesseract OCR to extract text from images.
Text Chunking: Splits the raw OCR output into manageable chunks for processing.
Error Correction: Each chunk undergoes LLM-based processing to correct OCR errors and improve readability.
Markdown Formatting (Optional): Reformats the corrected text into clean, consistent Markdown.
Quality Assessment: An LLM-based evaluation compares the quality of the final output with the original OCR text.
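
The first four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: `ocr_pdf`, `chunk_text`, `correct_document`, and the `max_chars` limit are hypothetical, and the `correct` argument stands in for whatever LLM call you use. The only library calls are `pdf2image.convert_from_path` and `pytesseract.image_to_string`.

```python
from typing import Callable


def ocr_pdf(pdf_path: str) -> str:
    # Imported lazily so the helpers below can be tried
    # without the OCR dependencies installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)  # one PIL image per PDF page
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars where possible.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def correct_document(
    text: str, correct: Callable[[str], str], max_chars: int = 2000
) -> str:
    # `correct` is a hypothetical callable wrapping an LLM prompt such as
    # "fix OCR errors in this text without changing its meaning".
    return "\n\n".join(correct(chunk) for chunk in chunk_text(text, max_chars))


if __name__ == "__main__":
    sample = "Frist paragraph.\n\nSecond paragraph."
    # str.strip is a stand-in corrector; a real one would call an LLM.
    print(correct_document(sample, str.strip, max_chars=20))
```

Splitting on paragraph boundaries instead of fixed character offsets keeps sentences intact, which gives the model enough context to repair errors at the edges of each chunk.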