Saturday, April 23, 2011

Optical Character Recognition with Tesseract

I was very busy with my final-year work, so I could not write a blog post for months. Today I finally have some time to introduce an amazing tool that will help you when working in the Optical Character Recognition (OCR) domain. Let's get some basic knowledge of OCR before moving on to Tesseract.


What is OCR?
"Translating the handwritten, typewritten or printed text in a scanned image into machine-encoded text."


OCR is important and widely used for converting documents that exist only as images into text. The first major use of OCR was in processing petroleum credit card sales drafts: the application recognized the purchaser from the imprinted credit card account number and introduced a transaction. Most modern scanners offer an OCR feature that directly outputs the text content of the scanned image. This saves a great deal of time that would otherwise be wasted on retyping the image contents. The field involves a lot of pattern recognition and image processing, and it is a very interesting research area in the current computing world.

What is Tesseract?
"The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test and it is probably one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A TIFF reader is built in that will read uncompressed TIFF images. Tesseract is licensed under the Apache License and anyone can use it for free. The core developer on the project is Ray Smith."


Some time back I was involved in a project to modify an OCR module for the Sahana Disaster Management System, so I searched for an OCR engine to integrate with the SahanaOCR system. Tesseract was the best solution I found. It is very powerful at recognizing printed text, and it is also very easy to localize for your own mother language.


How to install Tesseract on your PC?
Tesseract provides binary executables for Windows and installers for Linux systems. The detailed installation guide is available here.


Windows users can directly download the latest exe file here. Installation instructions for Linux users are also provided in the general installation guide. Then you have to download the trained data folder, which is available at the download page. Currently there are trained data sets for many languages, such as English, Chinese and Spanish. You can download any of those tessdata files and extract them into your working directory.
Then go into the directory where tesseract.exe is located and use the following command to recognize an image and write the recognized characters to a text file.

tesseract image outputbasename [-l lang] [configs]

Here, image is the name of the image file you want to run OCR on, and outputbasename is the base name of the text file that will contain the recognized output. The optional -l lang argument selects which language's trained data to use. The images should be in TIFF format, and you can use the Libtiff library for handling compressed images with Tesseract. The other configurations are optional and enable more advanced features.
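If you want to run this command from a script instead of typing it by hand, the following is a minimal Python sketch. It assumes that the tesseract executable is on your PATH and that Tesseract writes its output to outputbasename with a .txt extension. The function names build_tesseract_command and run_ocr, and the file name page.tif, are my own illustrative choices and are not part of Tesseract.

```python
import subprocess
from typing import List, Optional, Sequence


def build_tesseract_command(image: str,
                            output_base: str,
                            lang: Optional[str] = None,
                            configs: Sequence[str] = ()) -> List[str]:
    """Build the argument list for: tesseract image outputbasename [-l lang] [configs]."""
    cmd = ["tesseract", image, output_base]
    if lang:
        # -l selects the trained data (language) to use, e.g. "eng"
        cmd += ["-l", lang]
    # Any remaining config names are passed through unchanged
    cmd += list(configs)
    return cmd


def run_ocr(image: str,
            output_base: str,
            lang: Optional[str] = None,
            configs: Sequence[str] = ()) -> str:
    """Run Tesseract (assumed to be on PATH) and return the recognized text."""
    cmd = build_tesseract_command(image, output_base, lang, configs)
    # check=True raises CalledProcessError if Tesseract exits with an error
    subprocess.run(cmd, check=True)
    # Assumption: the output is written to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    # Hypothetical input file; replace with your own TIFF image
    print(run_ocr("page.tif", "page_output", lang="eng"))
```

Keeping the command construction in its own function makes it easy to check what will be executed before actually calling Tesseract.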


Training Tesseract.
Another amazing feature of Tesseract is that it can be customized for your local language. To do that, we first have to train Tesseract for the new language. Here is the training guide for Tesseract.

To train it, you need a proper training set of characters. Once you have selected a proper training set, you can follow the training procedure described in the guide. After following all the steps you will get a lang.traineddata file, which contains all the newly trained data.


Tessbase API.
Now you have a good idea of how to use Tesseract for OCR, but that only covers using Tesseract on its own. What if we need to integrate it with our own application? For that, Tesseract provides a highly user-friendly API named the Tessbase API: you can use the TessBaseAPI class to call the internal functions provided by Tesseract.


Here is the general forum where common discussions on the Tesseract project are carried out; you can find more details on the project by subscribing to it.
