What is OCR (Optical Character Recognition)?
OCR, or Optical Character Recognition, is a process that converts images containing text into machine-readable text that you can edit, copy, paste, and save.
It is not a new technology: it was created decades ago to help enterprises turn their paperwork into digital documents.
OCR works by recognizing the text characters in image or PDF files, in scanned papers, or directly from a camera's live stream.
It is not limited to printed text: many OCR libraries and frameworks can also extract text from handwritten documents to a certain degree.
Open-source OCR Libraries for developers
1- Tesseract
Tesseract is a free open-source OCR engine for building OCR apps. It supports Unicode (UTF-8) by default and reads many image formats, such as PNG, JPEG, and TIFF. It also supports several output formats, including plain text, hOCR, PDF, invisible-text-only PDF, and TSV.
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.
Tesseract is released as an open-source project under the Apache 2.0 License. It depends on the Leptonica library, which is distributed under the BSD 2-clause License.
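As a rough illustration of the TSV output mentioned above, the sketch below runs the `tesseract` command-line tool and keeps only the words it is reasonably confident about. The helper names `run_tesseract_tsv` and `words_from_tsv` and the 60% threshold are illustrative choices, not part of Tesseract; the sketch assumes the `tesseract` executable is installed and that the TSV columns are named `left`, `top`, `width`, `height`, `conf`, and `text`.

```python
import csv
import io
import shutil
import subprocess


def run_tesseract_tsv(image_path, lang="eng"):
    """Run the tesseract CLI and return its TSV output as a string."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # Using "stdout" as the output base prints the result instead of writing a file.
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "tsv"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def words_from_tsv(tsv_text, min_conf=60.0):
    """Return (text, confidence, (left, top, width, height)) for confident words."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Page, block, and line rows carry a confidence of -1 and no text.
        if not text or conf < min_conf:
            continue
        box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
        words.append((text, conf, box))
    return words
```

Calling `words_from_tsv(run_tesseract_tsv("scan.png"))` on a scanned page (the file name is a placeholder) would then give a list of words with their positions, which is a convenient starting point for highlighting or searching text in the image.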
2- EasyOCR
EasyOCR is a free, ready-to-use OCR solution that supports 80+ languages and scripts, including Arabic, Hebrew, Chinese, Cyrillic, Latin, and Farsi.
The project is under active development by many contributors and is written primarily in Python.
EasyOCR supports several image formats and PDF files, and it reports the position of each piece of text as a bounding box together with a confidence level.
The project uses PyTorch for training and running its models, and text detection uses the CRAFT algorithm.
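To make the bounding boxes and confidence levels more concrete, here is a minimal sketch. It assumes EasyOCR is installed (`pip install easyocr`) and that `readtext` returns one `(corner points, text, confidence)` tuple per detection; the helper names `read_image` and `confident_boxes` and the 0.5 threshold are illustrative only.

```python
def read_image(image_path, languages=("en",)):
    """Run EasyOCR on one image and return its raw detections."""
    import easyocr  # imported lazily so the helper below works without it

    reader = easyocr.Reader(list(languages))  # downloads models on first use
    return reader.readtext(image_path)


def confident_boxes(results, min_conf=0.5):
    """Keep confident detections as (text, confidence, (x_min, y_min, x_max, y_max))."""
    kept = []
    for points, text, conf in results:
        if conf < min_conf:
            continue
        # Each detection is a quadrilateral; reduce it to an axis-aligned box.
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        kept.append((text, float(conf), (min(xs), min(ys), max(xs), max(ys))))
    return kept
```

Reducing each quadrilateral to an axis-aligned rectangle loses the rotation of slanted text, but it is usually enough for cropping or drawing overlays.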
3- Kraken
Kraken is a free open-source OCR system optimized for historical and non-Latin script material. Its features include:
- Fully trainable layout analysis and character recognition
- Right-to-Left, BiDi, and Top-to-Bottom script support
- ALTO, PageXML, abbyyXML, and hOCR output
- Word bounding boxes and character cuts
- Multi-script recognition support
- Public repository of model files
- Lightweight model files
- Variable recognition network architectures
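Output formats such as hOCR are plain HTML, so the word bounding boxes listed above can be read back with the Python standard library alone. The sketch below is not Kraken code. It assumes the hOCR file marks each word with the `ocrx_word` class and stores its box in the `title` attribute as `bbox x0 y0 x1 y1`, which is the convention the hOCR format describes. The class name `HocrWords` is made up for this example.

```python
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every element with class "ocrx_word"."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._box = None  # bounding box of the word currently being read
        self._text = []
        self._depth = 0   # nesting level of markup inside the current word

    def handle_starttag(self, tag, attrs):
        if self._box is not None:
            self._depth += 1  # nested markup inside a word, e.g. <em>
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._box = self._parse_bbox(attrs.get("title") or "")
            self._text = []
            self._depth = 0

    def handle_data(self, data):
        if self._box is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._box is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._box))
        self._box = None

    @staticmethod
    def _parse_bbox(title):
        # The title looks like "bbox 36 92 96 122; x_wconf 96".
        for prop in title.split(";"):
            parts = prop.split()
            if len(parts) >= 5 and parts[0] == "bbox":
                return tuple(int(v) for v in parts[1:5])
        return None
```

After `parser = HocrWords()` and `parser.feed(hocr_text)`, the `parser.words` list holds one entry per recognized word. Words without a parsable `bbox` are skipped.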
4- GNU Ocrad
GNU Ocrad is an OCR (Optical Character Recognition) program and library based on a feature extraction method. It reads images in png or pnm formats and produces text in byte (8-bit) or UTF-8 formats. The formats pbm (bitmap), pgm (greyscale), and ppm (color) are collectively known as pnm.
Ocrad includes a layout analyser able to separate the columns and blocks of text normally found on printed pages.
Ocrad can be used as a stand-alone console application, or as a backend to other programs.
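Because the pnm formats are very simple, a script can produce Ocrad-ready input without any imaging library. The sketch below writes a greyscale image as a binary PGM file and builds a command line for the console application. The function names are illustrative, and the `--format=utf8` option is assumed to be available in the installed Ocrad version.

```python
def write_pgm(path, rows, maxval=255):
    """Write a 2-D list of grey values as a binary PGM ("P5") file."""
    height, width = len(rows), len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same width")
    header = f"P5\n{width} {height}\n{maxval}\n".encode("ascii")
    with open(path, "wb") as fh:
        fh.write(header)
        for row in rows:
            fh.write(bytes(row))  # one byte per pixel, valid while maxval < 256


def ocrad_command(image_path, utf8=True):
    """Build an argument list for the ocrad console application."""
    cmd = ["ocrad"]
    if utf8:
        cmd.append("--format=utf8")  # assumed option; otherwise output is 8-bit
    cmd.append(image_path)
    return cmd
```

The resulting list can be passed to `subprocess.run(..., capture_output=True, text=True)` to use Ocrad as a backend from another program.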
5- GOCR
GOCR is a free open-source OCR program released under the GNU General Public License.
It converts scanned images of text back into text files. Joerg Schulenburg started the program and led the team of developers on SourceForge (SF); since 2010 he has still maintained the package, though with (very) limited time.
GOCR can be used with different front-ends, which makes it very easy to port to different OSes and architectures. It can open many different image formats, and its quality improved on a daily basis until 2010.
6- Ocular
Ocular is a FLOSS (Free/Libre Open Source Software) OCR system for historical and printed documents.
Ocular is written in Java and works seamlessly on Windows, Linux and macOS. It comes with a rich CLI (Command-Line Interface) and supports all popular image formats.
Its features include:
- Unsupervised learning of unknown fonts: requires only document images and a corpus of text.
- Ability to handle noisy documents: inconsistent inking, spacing, vertical alignment, etc.
- Support for multilingual documents, including those that have considerable word-level code-switching.
- Unsupervised learning of orthographic variation patterns including archaic spellings and printer shorthand.
- Simultaneous, joint transcription into both diplomatic (literal) and normalized forms.
7- Attention-based OCR
Attention-based OCR provides state-of-the-art text recognition built on TensorFlow models, along with a Python package that is fully compatible with Google Cloud ML Engine.
This project is based on a model by Qi Guo and Yuntian Deng. You can find the original model in the da03/Attention-OCR repository.
8- Calamari OCR
Calamari OCR Engine is based on OCRopy and Kraken and is written in Python 3. It is designed to be easy to use from the command line while remaining modular enough to be integrated into and customized from other Python scripts.
9- Simple Python OCR
Simple OCR is an open-source OCR app that uses the OpenCV and NumPy Python libraries.
10- docTR
docTR (Document Text Recognition) is a seamless, high-performing, and accessible library for OCR-related tasks powered by deep learning.
docTR is powered by TensorFlow 2 and PyTorch.
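A minimal sketch of how docTR might be used from Python is shown below. It assumes the `python-doctr` package and one of its backends are installed, and that the exported result is a nested dictionary of pages, blocks, lines, and words, each word carrying a `value` and a `confidence`. The helper names `run_doctr` and `lines_from_export` are illustrative.

```python
def run_doctr(image_paths):
    """Run docTR's pretrained end-to-end predictor and export the result as a dict."""
    from doctr.io import DocumentFile  # imported lazily; the helper below is standalone
    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True)
    doc = DocumentFile.from_images(image_paths)
    return model(doc).export()  # nested dict: pages > blocks > lines > words


def lines_from_export(export, min_conf=0.0):
    """Flatten an exported result into one list of text lines per page."""
    pages = []
    for page in export["pages"]:
        lines = []
        for block in page["blocks"]:
            for line in block["lines"]:
                words = [
                    w["value"] for w in line["words"] if w["confidence"] >= min_conf
                ]
                if words:
                    lines.append(" ".join(words))
        pages.append(lines)
    return pages
```

Working on the exported dictionary rather than the result object keeps the post-processing independent of the deep learning backend, so it can be tested without loading any model.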
11- SwiftOCR
SwiftOCR is an open-source OCR library written in the Swift language. It is currently deprecated and no longer maintained.
It uses a neural network for image recognition. SwiftOCR is optimized for recognizing short, one-line alphanumeric codes (e.g. DI4C9CM), and it supports iOS and OS X.
To sum up
OCR technologies and apps are essential for all types of users who wish to convert their paperwork into digital format.
In this list, we covered the best open-source OCR libraries and frameworks that developers can use to build OCR-oriented applications for end-users.
If you know of any other open-source library or framework that we did not mention here, let us know.