Today, the world is driven by data. Businesses and organizations that effectively understand their data and derive informed decisions from it excel at what they do. The data a business generates is diverse and vast: handwritten forms, video streams from CCTV cameras, data produced by websites and third-party tools, customer feedback, images, audio, and a hundred other kinds of sources.
One of the major challenges a business owner faces is digitizing data that exists in non-digital formats, such as handwritten text, text in images, and printed documents. Businesses have been using OCR, or optical character recognition, since the 1990s to automate the processing of physical documents and images. OCR-based data extraction is faster, cheaper, and far more efficient than manual data extraction.
Working with documents for OCR can be challenging in many ways, and the source of the input data and its attributes are the key challenges. Images of documents captured on mobile phones or other handheld devices are often skewed because of how they were captured. They may also suffer from lighting issues, cluttered backgrounds such as watermarks or design patterns, low print quality, flash glare when captured in low-light conditions or when the document was printed on glossy material, variation in resolution, and so on. If the input document is skewed, the OCR engine gets confused and produces errors.
Recently, in one of the projects I’ve been working on, the requirement was to extract specific information from images of users’ identity cards. It seemed simple enough; I had worked on projects like this many times before. Then I realized that all the files they had were photos taken on smartphones, not scanned copies, which meant the OCR engine would not be able to discern the text properly.
All the images were quite different from one another because they were taken on various smartphones. We had to find a way to account for variations in angle, lighting, camera resolution, and so on. We also had to consider the different levels of compression each photo would undergo when transferred from the camera to the cloud. Older smartphone models took photos of much lower quality than newer models, which created a major difference in image resolution as well.
Another challenge was extracting text from documents with numerous lines and varying print quality. The template fonts were lighter than the font carrying the identifying information, and the text printed on the identity cards varied in size, font, and clarity. On top of that, the grid lines appeared very close to the text, making segmentation and OCR very difficult.
As there were no standard printing practices, we saw a lot of variation in how characters were printed. Sometimes characters were placed too close together, which could cause errors while reading. For example, if the vehicle type is HONDA, the algorithm might mistake the ‘N’ for an ‘M’ and identify the vehicle as ‘HOMDA’ instead.
What I built, and what anyone who is going to extract data from photographed documents would need to build, is a pre-processing pipeline that makes the input images more ‘understandable’ to OCR engines. The pipeline’s objective is to correct rotation, misalignment, and other issues that affect data extraction. I designed some novel ways to address these issues using existing image processing techniques, and I’m writing this article to share my insights, which I hope will help developers understand the different variables that govern text extraction in OCR.
If the OCR output from your scanned document is missing some characters, or the algorithm tends to favor a certain part of the document, the culprit could be misalignment. OCR engines like Tesseract expect the input document to be properly aligned because the extraction takes line formatting into account, so even slight translations and rotations may affect the output. In an unconstrained environment, if such an OCR engine is exposed through a consumer product, you should expect documents with all kinds of translational errors. A good rule of thumb is to preconfigure a template in the backend if you are clear about the types of documents the application will encounter. The template is an ideal, error-free ground truth of the expected input and can be used to align future inputs, making them invariant to rotation and translation.

A good way to do this is with Oriented FAST and Rotated BRIEF (ORB). ORB identifies key points in the input document, matches them against the ground truth template, and uses these correspondences to align the input. ORB’s parameters can be tweaked to suit the documents you have; through trial and error, you can find the parameter set that gives the best result for your use case. Most of the time, the default parameters will give you good enough results, but if your application has to handle very challenging documents, the best parameters need to be identified through some experimentation.
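To make this concrete, here is a minimal sketch of ORB-based alignment using OpenCV; the feature count, the fraction of matches kept, and the file paths are illustrative assumptions, not the values from my project.

```python
import cv2
import numpy as np

MAX_FEATURES = 500   # number of ORB keypoints to detect (tunable)
KEEP_PERCENT = 0.2   # fraction of best matches to keep (tunable)

def align_to_template(image_path, template_path):
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    template = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)

    # Detect ORB keypoints and compute binary descriptors in both images
    orb = cv2.ORB_create(MAX_FEATURES)
    kp1, des1 = orb.detectAndCompute(image, None)
    kp2, des2 = orb.detectAndCompute(template, None)

    # Match descriptors with Hamming distance and keep the strongest matches
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    matches = matches[:int(len(matches) * KEEP_PERCENT)]

    # Estimate a homography from the matched keypoints and warp the input
    # so that it lines up with the template
    src_pts = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst_pts = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC)

    h, w = template.shape
    return cv2.warpPerspective(image, H, (w, h))
```

The RANSAC step is what makes this robust in practice: a handful of bad keypoint matches will not throw off the estimated transform.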
Once rotation invariance is solved, you can move on to removing unwanted background clutter such as patterns, logos, watermarks, and holograms from the image. For this, make sure the target for extraction, i.e. the printed text, is visible and has good print quality. Computer vision operations like adaptive thresholding, erosion, and dilation can then be used to remove unwanted elements or artifacts from the image. Be mindful of the parameters being used: as with the alignment step, you can start with the default values and work your way toward the set that best suits your requirement.
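As a rough illustration of this clean-up step, the sketch below binarizes the image with adaptive thresholding and then applies a morphological opening, which is simply erosion followed by dilation; the block size and kernel size are assumed starting points you would tune for your own documents.

```python
import cv2
import numpy as np

def remove_background_clutter(gray_image):
    # Adaptive thresholding binarizes the image locally, which suppresses
    # faint watermarks and uneven lighting better than one global threshold.
    # THRESH_BINARY_INV makes the text the white foreground, which is what
    # OpenCV's morphological operations treat as the object of interest.
    binary = cv2.adaptiveThreshold(
        gray_image, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV,
        blockSize=31, C=15)

    # Opening (erosion then dilation) removes small speckles and thin
    # background patterns while keeping the thicker text strokes
    kernel = np.ones((2, 2), np.uint8)
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    # Invert back to black text on a white background for the OCR engine
    return cv2.bitwise_not(opened)
```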
To fix the problem of varying resolutions, we scaled images of arbitrary resolution to pre-configured resolution settings, maintaining a 16:10 aspect ratio for all of them. As soon as an image reached the cloud, it was downscaled or upscaled to the closest matching resolution so that all the aspect ratios stayed the same. Because the images often had different fonts and grid lines, we used key point extraction to improve OCR accuracy. This method required every document uploaded to the system to be the same size, and every attribute associated with the key points (e.g. vehicle type) to be at the exact same location. For example, if the base template for OCR has “vehicle type” at the 80th pixel, then every document uploaded for OCR must have “vehicle type” at the 80th pixel.
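The snippet below sketches how such a normalization step might look; the 1920x1200 target is an assumed 16:10 resolution for illustration, not the exact setting we used.

```python
import cv2

TARGET_W, TARGET_H = 1920, 1200   # assumed 16:10 target resolution

def normalize_resolution(image):
    h, w = image.shape[:2]
    # INTER_AREA tends to work better for shrinking, INTER_CUBIC for
    # enlarging, so pick the interpolation based on the input width
    interp = cv2.INTER_AREA if w > TARGET_W else cv2.INTER_CUBIC
    return cv2.resize(image, (TARGET_W, TARGET_H), interpolation=interp)
```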
We also identified key points, or regions of interest, that contained the required information. This was done by pre-defining the coordinate sets of pixels that would hold relevant data. Identifying regions of interest made text extraction much easier for the algorithm. For closely spaced lettering, we used fuzzy matching to eliminate the resulting errors: if the algorithm identified a vehicle incorrectly, it would look through the database of vehicles and choose the closest match. Finally, Tesseract OCR was used to extract text from the ROIs, and the results were pooled into a JSON response.
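A simplified sketch of this stage might look like the following; the ROI coordinates, field names, and vehicle list are hypothetical, and Python’s difflib stands in for whichever fuzzy matching approach you prefer.

```python
import json
import difflib
import pytesseract

# Pixel regions of interest on the aligned document: (x, y, width, height)
ROIS = {
    "name": (40, 80, 400, 40),
    "vehicle_type": (40, 160, 300, 40),
}

KNOWN_VEHICLES = ["HONDA", "TOYOTA", "SUZUKI"]   # reference list for fuzzy matching

def fuzzy_correct(value, candidates):
    # Map a noisy OCR string (e.g. "HOMDA") to the closest known value
    match = difflib.get_close_matches(value.upper(), candidates, n=1, cutoff=0.6)
    return match[0] if match else value

def extract_fields(aligned_image):
    result = {}
    for field, (x, y, w, h) in ROIS.items():
        roi = aligned_image[y:y + h, x:x + w]
        text = pytesseract.image_to_string(roi).strip()
        if field == "vehicle_type":
            text = fuzzy_correct(text, KNOWN_VEHICLES)
        result[field] = text
    return json.dumps(result)
```

Because the inputs are aligned to the template in the earlier step, fixed pixel coordinates are enough to locate each field reliably.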
Text extraction comes next, and here you can choose your own strategy; popular choices include Tesseract OCR and EasyOCR. If you plan to use Tesseract, read the documentation to understand the different parameters included in the latest releases. Note that Tesseract 4’s LSTM engine is only available in the distribution packages from Ubuntu 18.04 onward, so on older releases you may need to build it from source.
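If you go with Tesseract, a call through pytesseract with explicit engine and page segmentation settings might look like this; the --oem and --psm values below are just reasonable defaults to start from, and the file path is illustrative.

```python
import pytesseract
from PIL import Image

def run_ocr(image_path):
    image = Image.open(image_path)
    # --oem 1 selects the LSTM engine; --psm 6 assumes a single uniform
    # block of text, which suits pre-cropped regions of an identity card
    config = "--oem 1 --psm 6"
    return pytesseract.image_to_string(image, config=config)

print(run_ocr("preprocessed_card.png"))
```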