frank devilbiss

Modeling Enthusiast & Data Scientist

A Command Line Optical Character Recognition Tool

Recently, I needed to convert a set of image files into one long text file. Optical character recognition (OCR) is an old technology that converts images into text, and there are many GUI tools that will extract text from images. Even so, I had difficulty finding software that would combine the text from multiple images into one text file.

While looking for this capability, a thought struck me. Could I build a tool in Python to do this for me?

The answer, indicated in advance by the very existence of this post, is yes.

Using OCR in Python

Tesseract, an OCR engine, has been in development since 1985 and was released as open source software in 2005. HP created it, and Google later took over sponsoring its development. Python has a wonderful interface to the Tesseract engine (https://github.com/tesseract-ocr/tesseract) called pytesseract. It is ridiculously easy to use:

from PIL import Image
import pytesseract

# Simple image to string conversion
print(pytesseract.image_to_string(Image.open('test.png')))

That’s all there is to it.

The Project

The project takes the simple command above, adds a few image processing steps and wraps it in a loop that can be called from the command line. If you have a directory full of sequentially named image files (e.g. page01.png, page02.png, etc.), you can run the command:

python ocr.py testfiles testfiles/sampleoutput.txt

And voila! You have now transcribed text from all of the pages into a text file called sampleoutput.txt.
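
For a sense of how a script like this can be put together, here is a minimal sketch. It is not the project's actual ocr.py: the function names (ocr_image, transcribe_directory, main), the list of image extensions, and the grayscale conversion standing in for the "image processing steps" are all assumptions made for illustration.

```python
import sys
from pathlib import Path

# Assumed set of image extensions; the real project may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def ocr_image(path):
    """OCR one image with pytesseract after converting it to grayscale."""
    # Imported here so the file-handling loop below works without Tesseract.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        # Grayscale is an assumed preprocessing step, chosen for illustration.
        return pytesseract.image_to_string(img.convert("L"))


def transcribe_directory(input_dir, output_path, ocr=ocr_image):
    """OCR every image in input_dir, in file-name order, into one text file."""
    pages = sorted(
        p for p in Path(input_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    with open(output_path, "w", encoding="utf-8") as out:
        for page in pages:
            # Separate pages with a blank line.
            out.write(ocr(page).strip() + "\n\n")
    return len(pages)


def main(argv):
    """Hypothetical command line entry point: INPUT_DIR OUTPUT_FILE."""
    if len(argv) != 2:
        print("usage: python ocr.py INPUT_DIR OUTPUT_FILE", file=sys.stderr)
        return
    count = transcribe_directory(argv[0], argv[1])
    print(f"Transcribed {count} page(s) into {argv[1]}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Sorting by file name only gives the right page order when the names are zero-padded (page01.png, page02.png, ..., page10.png). Passing the OCR step in as a function is a design choice of this sketch: it lets the directory loop be exercised with a stand-in function on a machine where Tesseract is not installed.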

Check it out: Project Link