Intelligent Data Extraction & OCR
Author: Ernest Semerda
“Optical character recognition or optical character reader, often abbreviated as OCR, is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example from a television broadcast).” (Wikipedia: https://en.wikipedia.org/wiki/Optical_character_recognition)
OCR (Optical character recognition) is a solved problem. What isn’t is the intelligence around data extraction.
OCR is about 10% of our IP.
You can push a receipt, bill or invoice into OCR and it will return a blob of unstructured text of average quality. How do you know which number is the total, which is the tax, who the vendor is and so forth? You don’t.
You need to build some contextual intelligence around it. And that’s what we did at Veryfi. But as soon as you start doing this you run into so many issues it’s mind-boggling. All of them are solvable, but only through painful software iterations.
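To make the gap between raw OCR text and labeled fields concrete, here is a minimal sketch in Python. It is not how Veryfi does it. The regular expressions, the function name extract_fields and the "first line is the vendor" rule are all assumptions made up for illustration, and real receipts break rules like these constantly.

```python
import re

# Hypothetical patterns: an amount such as 10.50, optionally prefixed by "$".
AMOUNT = r"\$?\s*(\d+\.\d{2})"
# A line that starts with "Total" (so "Subtotal" is not matched).
TOTAL_RE = re.compile(r"^\s*(?:grand\s+)?total\b[^\d$\n]*" + AMOUNT,
                      re.IGNORECASE | re.MULTILINE)
# A line that starts with a common tax label.
TAX_RE = re.compile(r"^\s*(?:tax|gst|vat|hst)\b[^\d$\n]*" + AMOUNT,
                    re.IGNORECASE | re.MULTILINE)


def extract_fields(ocr_text):
    """Label a few fields in unstructured OCR text using naive rules."""
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    total = TOTAL_RE.search(ocr_text)
    tax = TAX_RE.search(ocr_text)
    return {
        # Assumption: the vendor name is the first non-empty line.
        "vendor": lines[0] if lines else None,
        "total": float(total.group(1)) if total else None,
        "tax": float(tax.group(1)) if tax else None,
    }


sample = "ACME COFFEE\n123 Main St\nSubtotal 10.00\nGST 0.50\nTotal: $10.50\n"
fields = extract_fields(sample)
```

Even this toy version shows why the OCR step alone is not enough: every field needs its own rule, and every rule has exceptions (a tax line that prints a percentage before the amount already defeats the pattern above).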
Take pre-processing of a document. Let’s say you want to use your phone to capture a picture of a receipt. You cannot just push the raw photo into OCR, because OCR will return below-average results.
You have to think this through and either auto-crop the receipt out of the photo (as the Veryfi app does) or crop it manually. Since we are in the business of automating work, manual crop is not a solution. Auto it is.
Phone cameras raise their ISO when light is low. This means more noise indoors, even during the day. So the app has to detect the ambient light and turn on the phone’s light to bring the ISO back down, making the photo clear again. Otherwise you might end up with a blurry image and a failed auto-crop, because the camera cannot see the receipt clearly.
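As a rough illustration of the "detect the ambient light" step, the sketch below averages the brightness of a preview frame and compares it with a cutoff. The function names and the threshold of 60 are hypothetical, and a real app would more likely read the camera's own exposure metadata than loop over pixels.

```python
# Hypothetical cutoff on a 0-255 brightness scale; it would need tuning per device.
LOW_LIGHT_THRESHOLD = 60.0


def mean_luma(pixels):
    """Average Rec. 601 luma (0-255) over an iterable of (r, g, b) tuples."""
    total = 0.0
    count = 0
    for r, g, b in pixels:
        total += 0.299 * r + 0.587 * g + 0.114 * b
        count += 1
    if count == 0:
        raise ValueError("frame has no pixels")
    return total / count


def should_enable_torch(pixels, threshold=LOW_LIGHT_THRESHOLD):
    """Return True when the preview frame looks too dark for a clean capture."""
    return mean_luma(pixels) < threshold


dim_frame = [(10, 10, 10)] * 100
bright_frame = [(200, 200, 200)] * 100
```

Here should_enable_torch(dim_frame) is true and should_enable_torch(bright_frame) is false, which is the decision the app needs before it fires the shutter.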
There are a host of other things to consider too. Blur, pincushion and barrel distortions all have to be detected and corrected. The end user’s connection matters as well: uploads over 3G can take forever, so it is important to apply custom compression based on those parameters before pushing the image over a slow connection. And so forth.
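The connection-aware compression idea can be sketched as a simple lookup from network type to upload settings. The profile table, its numbers and the plan_upload name are assumptions for illustration only. They are not Veryfi's actual parameters.

```python
# Hypothetical profiles: network type -> (longest side in pixels, JPEG quality).
UPLOAD_PROFILES = {
    "2g": (1024, 50),
    "3g": (1600, 65),
    "4g": (2048, 80),
    "wifi": (2560, 90),
}


def plan_upload(width, height, network):
    """Pick a target size and JPEG quality for the given connection type."""
    # Unknown network types fall back to the conservative 3G profile.
    max_side, quality = UPLOAD_PROFILES.get(network.lower(), UPLOAD_PROFILES["3g"])
    # Only ever shrink the image; never upscale a small one.
    scale = min(1.0, max_side / max(width, height))
    return {
        "width": round(width * scale),
        "height": round(height * scale),
        "jpeg_quality": quality,
    }


plan_3g = plan_upload(4000, 3000, "3G")
plan_wifi = plan_upload(800, 600, "wifi")
```

The trade-off is that a smaller, more compressed image uploads faster but leaves the server less detail to work with, so the numbers have to be tuned against OCR accuracy rather than picked by eye.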
At this point we still haven’t OCR’ed the document. Just prepped it. The server’s API will prep it even more and generate multiple slices of it before OCR’ing the document and working out where all the data points are so it can label them properly.
Dmitry (my cofounder) once brought me a receipt and asked me to tell him which country I thought the receipt was from. I saw the GST component on the receipt and…
I said “Australian”!
“Nope,” said Dmitry.
Dmitry continued… “Veryfi AI said it’s Canadian. Look closer.”
And when I looked closer I finally saw what he was referring to. There were other data points, like the vendor address, the language used on the receipt and even the general look and feel of the receipt, that were typical of Canada.
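The lesson from that receipt is that one signal (GST) is ambiguous, while several weak signals together are not. Below is a toy scoring sketch of that idea. The signals, the weights and the guess_country name are hypothetical and far simpler than anything a production model would use.

```python
import re
from collections import Counter

# A Canadian postal code looks like "M5V 3M2": letter, digit, letter, digit, letter, digit.
CA_POSTAL_RE = re.compile(r"\b[A-Z]\d[A-Z] ?\d[A-Z]\d\b")


def guess_country(ocr_text):
    """Score candidate countries from several weak signals; weights are made up."""
    text = ocr_text.upper()
    scores = Counter()
    # GST alone is ambiguous: Australia, Canada and New Zealand all use it.
    if re.search(r"\bGST\b", text):
        scores.update({"AU": 1, "CA": 1, "NZ": 1})
    # HST, PST and QST are Canadian sales taxes.
    if re.search(r"\b(?:HST|PST|QST)\b", text):
        scores.update({"CA": 3})
    # A Canadian-style postal code in the vendor address.
    if CA_POSTAL_RE.search(text):
        scores.update({"CA": 3})
    # ABN is the Australian Business Number.
    if re.search(r"\bABN\b", text):
        scores.update({"AU": 3})
    if not scores:
        return None
    return scores.most_common(1)[0][0]


canadian = "MAPLE CAFE\n220 King St W, Toronto ON M5V 3M2\nGST 0.50\nTotal 10.50"
australian = "HARBOUR CAFE\nABN 12 345 678 901\nGST 1.00\nTotal 11.00"
```

With these made-up weights, guess_country(canadian) picks "CA" because the postal code outvotes the ambiguous GST line, which is the same reasoning I missed when I looked at the receipt.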
If a product cannot achieve upfront ACCURATE REAL-TIME DATA EXTRACTION, then forget any sort of AI/ML work like Tax Categorization.
If you want to do anything with Machine Learning (a branch of AI) you need DATA first. Then you can run algorithms against that data set to do wonderful things.
Solve for DATA and the rest is easy. (Ernest Semerda, Veryfi Cofounder)
If you like this story then I recommend you listen to the whole interview organized by Jeff Cain, Director of the Envestnet | Yodlee Incubator with Veryfi Cofounder, Ernest Semerda.