Sometimes you need to take a written document from the real world (e.g. a letter) and enter information from it into the computer. Reading the document and typing its contents in manually is error-prone, time-consuming and boring. I therefore wrote a simple Python script that takes a PDF, converts its pages into images and extracts the written text using an OCR engine, so the contents of the document can be copied and pasted easily.
I am assuming you have the latest version of Python installed (as of this writing, v3.7.3). We are going to need a few libraries for this: Poppler, pdf2image, Tesseract and pytesseract.
pdf2image is a Python library that wraps Poppler, a PDF rendering library. Tesseract is an open-source OCR (optical character recognition) engine that was originally developed at Hewlett-Packard and has been sponsored by Google since 2006 (https://opensource.google/projects/tesseract). pytesseract wraps this engine for Python.
"Tesseract is an OCR engine with support for unicode and the ability to recognize more than 100 languages out of the box. It can be trained to recognize other languages." (Source: https://opensource.google/projects/tesseract)
Let’s start by installing all the required libraries. To install Tesseract and Poppler, I am relying on Homebrew this time (I usually prefer to build from source manually). To install the Python libraries, I am going to use pip3, the package installer for Python 3.
brew install tesseract
brew install poppler
pip3 install pdf2image
pip3 install pytesseract
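Before moving on, it can help to confirm that the command-line tools really ended up on your PATH, because both Python libraries only call out to them. The following is a minimal sketch using only the standard library. It assumes that pytesseract looks for an executable named tesseract and that pdf2image relies on Poppler's pdftoppm and pdfinfo executables. The function name find_ocr_tools is my own and not part of either library.

```python
import shutil


def find_ocr_tools():
    """Map each required executable to its full path, or None if it is not on PATH."""
    # Assumption: these are the executables the two Python wrappers call.
    required = ("tesseract", "pdftoppm", "pdfinfo")
    return {name: shutil.which(name) for name in required}


for name, path in find_ocr_tools().items():
    print(name, "->", path if path else "NOT FOUND")
```

If one of the tools shows up as NOT FOUND, re-running the matching brew install command is usually the first thing to try.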
Next we are going to write our simple script that will:
- Take a PDF with images (e.g. a letter)
- Convert the PDF into a series of pages
- Iterate over the pages and save them as images to the disk
- Open each image and extract its text into a string
from PIL import Image
import pytesseract
import pdf2image

# Convert each page of the PDF into an image, rendered at 500 DPI
pages = pdf2image.convert_from_path('letter.pdf', 500)

# enumerate() gives every page a number for its file name
for counter, page in enumerate(pages):
    file_name = 'page' + str(counter) + '.jpg'
    # Save the image to the same folder
    page.save(file_name, 'JPEG')
    # Open the saved file as an image
    image_file = Image.open(file_name)
    # Use Tesseract to extract the text from the image
    string_contents = pytesseract.image_to_string(image_file)
    # Print the contents to the console
    print(string_contents)
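The script above saves every page as a JPEG and opens it again right away. If you do not need the image files, the round trip through the disk can probably be skipped, since pdf2image already returns PIL images and pytesseract.image_to_string accepts a PIL image directly. Here is a sketch of that variant. The function name pdf_to_text and its render and ocr parameters are hypothetical helpers of mine (they only exist to make the function easy to try without a real PDF), and the default of 300 DPI is an assumption, not a recommendation from either library.

```python
def pdf_to_text(pdf_path, dpi=300, lang="eng", render=None, ocr=None):
    """Return one string of recognized text per page of the PDF.

    render and ocr are hypothetical hooks; when left as None, the real
    pdf2image and pytesseract functions are imported and used.
    """
    if render is None:
        from pdf2image import convert_from_path as render
    if ocr is None:
        from pytesseract import image_to_string as ocr

    # Each page comes back as an in-memory PIL image
    pages = render(pdf_path, dpi=dpi)
    # Hand every image straight to Tesseract, without saving it first
    return [ocr(page, lang=lang) for page in pages]
```

Calling pdf_to_text('letter.pdf') should then give you a list with the text of each page, which you can print or write to a file. The lang argument is passed on to Tesseract, so a value other than "eng" only works if the matching language data is installed.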
Processing the information can take a few seconds, so be patient. The extracted text is printed to the console, and the image files are saved in the folder you run the script from.
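Part of the waiting time comes from the resolution. The script renders at 500 DPI, which produces fairly large images. As a rough back-of-the-envelope sketch, assuming an A4 page (210 x 297 mm) and an uncompressed 8-bit RGB image, you can estimate the pixel size per page like this. The helper names page_pixels and rgb_megabytes are my own.

```python
MM_PER_INCH = 25.4


def page_pixels(width_mm, height_mm, dpi):
    """Return (width, height) in pixels of a page rendered at the given DPI."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


def rgb_megabytes(width_px, height_px):
    """Approximate size of an uncompressed 8-bit RGB image in megabytes."""
    return width_px * height_px * 3 / 1_000_000


# Assumption: the letter is an A4 page (210 x 297 mm)
for dpi in (200, 300, 500):
    width, height = page_pixels(210, 297, dpi)
    size = rgb_megabytes(width, height)
    print(f"{dpi} DPI: {width} x {height} px, about {size:.0f} MB uncompressed")
```

If the script feels slow, lowering the second argument of convert_from_path is worth a try, although going too low will likely hurt the recognition quality.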
Now here comes the catch. Depending on the quality of the picture (angle, blurriness, etc.), the output can vary, and it’s quite possible that some characters are recognized incorrectly. You should therefore double-check the output and correct errors where necessary.
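Wrongly recognized characters still have to be fixed by hand, but some of the layout noise in the output can be tidied up automatically first. The following is a small sketch using only the standard library. The function name tidy_ocr_text is my own, and the rules are assumptions about what typical OCR output looks like: words split by a hyphen at a line break, trailing spaces and runs of empty lines. Note that the hyphen rule will also merge genuinely hyphenated words that happen to break at the end of a line.

```python
import re


def tidy_ocr_text(text):
    """Clean up common layout noise in OCR output (does not fix wrong characters)."""
    # Rejoin words that were hyphenated across a line break
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Remove spaces and tabs at the end of each line
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop leading and trailing whitespace around the whole text
    return text.strip()


print(tidy_ocr_text("Dear Sir,  \n\n\n\nthis docu-\nment is\n"))
```

You could run every page through this function before printing it, then proofread the result against the original letter.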