OCR with tesseract, python and pytesseract

Python is extremely versatile, and its large community maintains libraries that achieve a lot with a few lines of code. Optical Character Recognition (OCR) is one example: you only need to install tesseract and its Python wrapper, pytesseract.

Applications of OCR

OCR is useful for social networks, where you can extract the text that appears in images and then process it or analyze it statistically.

Here’s another case: imagine a program that scans image boards or social networks, extracts a couple of frames from the posted videos, and links them to a TikTok account using the watermark that appears on each video.

Or consider a page that uploads images of its products with the price written on each one. With OCR you can download and process those images to get all the prices and load them into your database.

Facebook probably uses a similar technology to filter out uploaded images containing text that its policies consider offensive.

Another very common application is converting a PDF book made of scanned page images into text, which is ideal for turning old book scans into EPUB or plain-text files.

Installation of tesseract-ocr

To perform OCR with Python we will need tesseract, which is the library that handles all the heavy lifting and image processing.

Make sure you install a recent tesseract-ocr. There is a huge difference between version 3 and versions 4 and later, because neural networks were introduced in version 4 to improve character recognition. I am using version 5 alpha.

sudo apt install tesseract-ocr
tesseract -v
tesseract 5.0.0-alpha-20201224-3-ge1a3
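
Since the version matters so much, a script can check it before doing any OCR. The following is a minimal sketch, not an official API: the helper names parse_major_version and installed_major_version are made up for this example, and it assumes the version banner starts with "tesseract" followed by the version number, as in the output above.

```python
import re
import subprocess


def parse_major_version(version_output: str) -> int:
    """Return the major version found in the output of `tesseract -v`."""
    # Some builds print "tesseract v5.0.0...", so the "v" is optional.
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    if match is None:
        raise ValueError(f"Unrecognized version output: {version_output!r}")
    return int(match.group(1))


def installed_major_version() -> int:
    """Run `tesseract -v` and parse the result (needs tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "-v"], capture_output=True, text=True, check=True
    )
    # The banner may arrive on either stream, so both are inspected.
    return parse_major_version(result.stdout + result.stderr)
```

A script could call installed_major_version() once at startup and refuse to continue when the result is lower than 4.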

Comparison of OCR performance between tesseract 3 and tesseract 5 alpha: version 5 shows better performance.

Installing languages in tesseract

We can see which languages are installed with --list-langs.

tesseract --list-langs

It may seem obvious, but how well tesseract recognizes the text depends on telling it the correct language. Let’s install the Spanish language.

sudo apt install tesseract-ocr-spa
tesseract --list-langs
List of available languages (3):
eng
osd
spa

You will see that Spanish is now installed, and we can use it to detect the text in our images by adding the -l spa option to the end of our command.

OCR with tesseract

Now let’s put it to the test to recognize text in images, straight from the terminal. I am going to use the following image:

Image with text to be processed
File: imagen_con_texto.jpg

tesseract imagen_con_texto.jpg -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 139
Do you have the time to listen to me whine
...

The “-” at the end of the command tells tesseract to send the results of the analysis to the standard output, so that we can view them in the terminal.

It is possible to tell tesseract which OCR engine to use:

  • 0: for the original tesseract
  • 1: for neural networks
  • 2: tesseract and neural networks
  • 3: Default, whichever is available
tesseract imagen_con_texto.jpg - --oem 1
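
The same terminal call, including the engine modes listed above, can be driven from a script. This is only a sketch under stated assumptions: build_tesseract_command and run_tesseract are hypothetical helper names, and the only flags used are the ones already shown in this post ("-", -l and --oem).

```python
import subprocess
from typing import List, Optional

VALID_OEM_MODES = (0, 1, 2, 3)  # the four engine modes listed above


def build_tesseract_command(
    image_path: str, lang: Optional[str] = None, oem: Optional[int] = None
) -> List[str]:
    """Build the argument list for a tesseract call that prints to stdout."""
    command = ["tesseract", image_path, "-"]  # "-" sends the text to stdout
    if lang is not None:
        command += ["-l", lang]
    if oem is not None:
        if oem not in VALID_OEM_MODES:
            raise ValueError("oem must be 0, 1, 2 or 3")
        command += ["--oem", str(oem)]
    return command


def run_tesseract(
    image_path: str, lang: Optional[str] = None, oem: Optional[int] = None
) -> str:
    """Run tesseract and return the recognized text (needs tesseract on PATH)."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang, oem),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.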

Keep in mind that not all language files work with the original tesseract engine (modes 0 and 2), although the neural network engine is generally the one that gives the best results. You can find models compatible with both the original tesseract and neural networks in the tesseract repository.

You can install them manually by downloading them and moving them to the appropriate folder. In my case it is /usr/local/share/tessdata/, but it may be different on your system.

wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
sudo mv eng.traineddata /usr/local/share/tessdata/
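
To confirm that a manually installed model ended up in the right place, you can list the .traineddata files in the folder. A minimal sketch, assuming the folder path is passed in by the caller; installed_models is a hypothetical helper name, and the folder from my system may not exist on yours.

```python
from pathlib import Path
from typing import List


def installed_models(tessdata_dir: str) -> List[str]:
    """Return the language codes of the *.traineddata files in a folder."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []  # the folder differs between systems, so it may be absent
    # "eng.traineddata" becomes "eng"; other files are ignored.
    return sorted(path.stem for path in folder.glob("*.traineddata"))
```

For example, installed_models("/usr/local/share/tessdata/") should include "eng" after running the two commands above.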

Installing pytesseract

After installing tesseract, we add pytesseract (the Python wrapper) and pillow (for image handling) to our virtual environment.

pipenv install pytesseract pillow

Read text from images with python

First let’s check the languages we have installed.

import pytesseract

print(pytesseract.get_languages())
# ['eng', 'osd', 'spa']
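
The list returned by get_languages() is handy for failing early when a language is missing. The sketch below is one possible approach, not part of pytesseract: missing_languages and ensure_languages are hypothetical helper names.

```python
from typing import Iterable, List


def missing_languages(required: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the required language codes that are not installed, in order."""
    installed = set(available)
    return [code for code in required if code not in installed]


def ensure_languages(required: Iterable[str]) -> None:
    """Raise an error if tesseract lacks any of the required languages."""
    # Imported here so missing_languages() can be used without pytesseract.
    import pytesseract

    missing = missing_languages(required, pytesseract.get_languages())
    if missing:
        raise RuntimeError(f"Install these tesseract languages first: {missing}")
```

Calling ensure_languages(["eng", "spa"]) before processing a batch of images avoids discovering a missing language halfway through.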

Now that we have the languages, we can read the text of our images.

The code is quite short and self-explanatory: we pass the image as an argument to pytesseract’s image_to_string() function.

import pytesseract
from PIL import Image

img = Image.open("nuestra_imagen.jpg") # Open the image with pillow
img.load()
text = pytesseract.image_to_string(img, lang='eng') # Extract image's text
print(text)

# Do you have the time to listen to me whine...

image_to_string() accepts, through its lang argument, the language in which we want it to detect the text.
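
The language and the engine mode can be combined in a single call. This is a sketch under assumptions: join_languages and read_text are hypothetical helper names, several languages are joined with "+" as tesseract's -l option allows, and the engine mode is passed through image_to_string()'s config argument.

```python
from typing import Sequence


def join_languages(codes: Sequence[str]) -> str:
    """Combine language codes into the form tesseract expects: 'eng+spa'."""
    if not codes:
        raise ValueError("At least one language code is required")
    return "+".join(codes)


def read_text(image_path: str, codes: Sequence[str] = ("eng",), oem: int = 3) -> str:
    """Extract text from an image with the given languages and engine mode."""
    # Imported here so join_languages() can be used without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(
            img, lang=join_languages(codes), config=f"--oem {oem}"
        )
```

For an image that mixes English and Spanish, read_text("nuestra_imagen.jpg", ["eng", "spa"], oem=1) would use both language models with the neural network engine.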

pytesseract also comes with a function that returns much more information about the image, image_to_data(), available for tesseract versions 3.05 and higher.

data = pytesseract.image_to_data(img)
print(data)

Return from image_to_data method in tesseract
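
By default image_to_data() returns its result as tab-separated text, with one row per detected element and columns such as left, top, conf and text. The sketch below shows one way to keep only the words tesseract is reasonably sure about. The helper name confident_words and the threshold of 60 are assumptions for illustration, not recommendations from the tesseract documentation.

```python
import csv
import io
from typing import Dict, List


def confident_words(tsv: str, min_conf: float = 60.0) -> List[Dict[str, object]]:
    """Return the words whose confidence is at least min_conf."""
    # QUOTE_NONE because the recognized text may itself contain quotes.
    reader = csv.DictReader(
        io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Rows for pages, blocks and lines have conf -1 and no text.
        if text and conf >= min_conf:
            words.append(
                {
                    "text": text,
                    "conf": conf,
                    "left": int(row["left"]),
                    "top": int(row["top"]),
                }
            )
    return words
```

It can be fed directly with the output shown above: confident_words(pytesseract.image_to_data(img)).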

If you want to learn more, visit the complete tesseract documentation.

Eduardo Zepeda
Web developer and GNU/Linux enthusiast always learning something new. I believe in choosing the right tool for the job and that simplicity is the ultimate sophistication. I'm under the impression that being perfect is the enemy of getting things done. I also believe in the goodnesses of cryptocurrencies outside of monetary speculation.