Capture Data from a Receipt or Invoice in 5 Lines of Python Code
Author: Dmitry Birulia
If you’re looking for real-time data extraction from receipts and invoices, including line items, look no further: you’re in the right place 🙂
Before we dive into the details, let’s agree on the terminology.
There are a lot of applications on the market that promise real-time solutions to capture data from receipts and invoices, but what is real-time?
Let’s imagine you are uploading a file to your Dropbox or Google Drive folder. You select a file and wait for it to finish uploading. Once it’s there, you get a confirmation message that it has been uploaded successfully. If the file is an image or a PDF with a few pages, this happens within a few seconds. Is this real-time? Yes!
That’s exactly what you should be expecting from any application that claims fast or real-time data extraction from documents. You upload a file and all the data is captured in a few seconds. Not in minutes, hours, or days.
OCR (optical character recognition). There are plenty of open source solutions like Tesseract and lots of cheap options like Google Vision or Microsoft’s Read API. You can also build your own OCR with TensorFlow; there are plenty of tutorials on how to do that. When you run a document through OCR you get back raw, unstructured text. OCR is pretty much a solved problem for printed text.
ICR (intelligent character recognition). ICR is advanced OCR that has a self-learning component and reads not only printed text but also handwriting, in different languages, fonts, and styles. In many cases, results from ICR can be more accurate than traditional human data entry.
Reading text (handwritten or printed) from a document is the least complex part of extracting specific fields from receipts and invoices. The true place for AI is in understanding what is what on the document. Once you have unstructured text from ICR, you have to push it through another model to turn it into structured data. The result is a set of extracted fields and their positions on the document.
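To make the shape of that result concrete, here is a minimal sketch in Python. It is not how any real extraction model works: a toy rule (take the word right after a known label) stands in for the model, and the Word and Field types, the label list, and the coordinates are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word and its top-left position, as an OCR/ICR step might report it."""
    text: str
    x: int
    y: int


@dataclass
class Field:
    """One structured field: what it is, its value, and where the value sits on the page."""
    name: str
    value: str
    position: tuple


def label_fields(words, labels=("date", "total")):
    """Toy stand-in for the second model: the word after a known label is that field's value."""
    fields = []
    for prev, cur in zip(words, words[1:]):
        key = prev.text.rstrip(":").lower()
        if key in labels:
            fields.append(Field(key, cur.text, (cur.x, cur.y)))
    return fields


# Hypothetical OCR output for a tiny receipt.
words = [
    Word("ACME", 10, 10),
    Word("Date:", 10, 40),
    Word("2021-03-01", 80, 40),
    Word("Total:", 10, 70),
    Word("$118.50", 80, 70),
]

fields = label_fields(words)
for field in fields:
    print(field)
```

A real model has to cope with labels that are missing, abbreviated, or placed below the value, which is exactly where simple rules like this one stop working.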
Template-based Data Extraction is a great solution when you always get the same document with the same format and layout. For example, if you need to capture data from a scan of a US Passport, you can create one template that tells your algorithm where on the document to look for each field. Fast and easy, right?
However, when it comes to receipts and invoices, there are thousands, if not millions, of ways they are printed. You will need to create a new template for every single format, and if a vendor changes the layout of their invoice, your data extraction will break. Eventually, the use of templates will lead you to a dead end. There’s just no realistic way to keep up with all the changes and new requests.
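The fragility is easy to see in a small sketch. The template below is hypothetical: it maps each field to a line number and a regular expression, which is one simple way to encode "where to look". It works on the layout it was written for and silently returns nothing once the vendor moves things around.

```python
import re

# Hypothetical template: field name -> (line number to read, pattern to apply on that line).
TEMPLATE = {
    "invoice_number": (0, r"Invoice #(\d+)"),
    "total": (2, r"Total: \$([\d.]+)"),
}


def extract_with_template(text, template):
    """Look for each field only at the position the template prescribes."""
    lines = text.splitlines()
    fields = {}
    for name, (line_no, pattern) in template.items():
        match = re.search(pattern, lines[line_no]) if line_no < len(lines) else None
        fields[name] = match.group(1) if match else None
    return fields


# The layout the template was written for.
original_layout = "Invoice #1042\nACME Corp\nTotal: $118.50"
# Same data after the vendor reorders the header and adds a date line.
changed_layout = "ACME Corp\nInvoice #1042\nDate: 2021-03-01\nTotal: $118.50"

print(extract_with_template(original_layout, TEMPLATE))
print(extract_with_template(changed_layout, TEMPLATE))
```

The second call finds neither field even though both are still printed on the invoice, so every layout change means another template to write and maintain.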
Veryfi's API offers true real-time data extraction from receipts and invoices. Veryfi extracts over 50 different fields (including line item data) and has embedded ICR for understanding different languages and handwritten text. Below you will see a Python code example of how to extract data from documents in just 5 lines of code.
First, you will need to install the veryfi package in your working environment, for example by running pip install veryfi.
Once installed, go ahead and register on the Veryfi website to get the necessary tokens. You will need your api_key to access the API.
With those keys you will first instantiate the veryfi client:
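A minimal sketch of that step is shown here. It assumes the veryfi package exposes a Client class whose constructor takes client_id, client_secret, username, and api_key in that order; please check the package README for the version you installed. The environment variable names are invented for this example, and reading credentials from the environment is just one reasonable way to keep them out of source code.

```python
import os

# Hypothetical environment variable names; use whatever fits your deployment.
CREDENTIAL_VARS = (
    "VERYFI_CLIENT_ID",
    "VERYFI_CLIENT_SECRET",
    "VERYFI_USERNAME",
    "VERYFI_API_KEY",
)


def load_credentials(env=os.environ):
    """Return the credentials as a tuple, failing early if any of them is missing."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_VARS)


def make_client(env=os.environ):
    """Build the client; the constructor argument order is an assumption (see the text above)."""
    # Imported here so the rest of the module still loads when veryfi is not installed.
    from veryfi import Client

    client_id, client_secret, username, api_key = load_credentials(env)
    return Client(client_id, client_secret, username, api_key)
```

Calling make_client() once at startup and reusing the returned object keeps the credentials handling in one place.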
And the 5 lines of code below are all you need to process an invoice:
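As a sketch, the call can look like the snippet below. It assumes the client object offers a process_document(file_path, categories=...) method that uploads the file and returns the extracted data as a dict, which is how the veryfi package documents it at the time of writing. The file path and the category list are placeholders.

```python
# Placeholder categories; send whichever ones make sense for your bookkeeping.
CATEGORIES = ["Travel", "Meals", "Utilities", "Maintenance"]


def process_invoice(client, file_path, categories=CATEGORIES):
    """Submit one document for processing and return the parsed response.

    `client` is expected to be a veryfi client instance; the method name and
    keyword argument used here are assumptions based on the package README.
    """
    return client.process_document(file_path, categories=categories)
```

Given a client instance, process_invoice(client, "/tmp/invoice.jpg") blocks until the extraction is finished and hands back the result, so there is no polling or callback to set up.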
This is a synchronous API call that returns all the extracted data in about 3-5 seconds for a single-page document. For PDF documents, each additional page may add another 1-2 seconds.
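Those figures turn into a quick back-of-the-envelope estimate, which can be handy when choosing a request timeout. The helper below is only arithmetic on the numbers quoted above, not a guarantee of actual response times.

```python
def estimated_seconds(pages):
    """Return a (low, high) estimate in seconds for a document with `pages` pages.

    Uses 3-5 seconds for the first page plus 1-2 seconds for each additional page.
    """
    if pages < 1:
        raise ValueError("a document has at least one page")
    extra_pages = pages - 1
    return (3 + 1 * extra_pages, 5 + 2 * extra_pages)


low, high = estimated_seconds(4)
print(f"A 4-page PDF should take roughly {low}-{high} seconds")
```

For a 4-page PDF this gives 6 to 11 seconds, so a timeout comfortably above the high end leaves room for network latency.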
Note 1: Veryfi can capture data from receipts and invoices in different formats: jpg, png, pdf, tiff, html, docx.
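If you want to reject unsupported files before uploading them, a small check on the file extension is enough. This sketch only encodes the list of formats from Note 1; the function name is made up, and it looks at the extension rather than the actual file contents.

```python
from pathlib import Path

# Formats listed in Note 1.
SUPPORTED_FORMATS = {"jpg", "png", "pdf", "tiff", "html", "docx"}


def is_supported(path):
    """Return True if the file extension is one of the listed formats (case-insensitive)."""
    extension = Path(path).suffix.lower().lstrip(".")
    return extension in SUPPORTED_FORMATS


for name in ("receipt.JPG", "invoice.pdf", "notes.txt"):
    print(name, is_supported(name))
```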
Note 2: Veryfi can categorize your document. As shown in the code example above, you can send a list of categories and Veryfi will pick the one that best suits your particular document. For example, if you send an Uber receipt and your list of "categories" is ["Travel", "Meals", "Utilities", "Maintenance"], Veryfi will choose "Travel" as the category for your Uber receipt.
Note 3: Is your application written in a programming language other than Python? No problem. Log in to your Veryfi API account and you will see other code examples in Java, PHP, and cURL. You will also find a Postman collection.
Below is a JSON example of the data extracted from an invoice:
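Once the response is parsed into a Python dict, reading it is plain dictionary access. The sketch below shows one defensive way to do that. Everything in the sample is a placeholder: the field names (vendor, total, line_items, description) are assumptions about the response shape and the values are invented, so treat it as an illustration rather than actual API output.

```python
import json

# Invented stand-in for a response body; not real API output.
SAMPLE_RESPONSE = json.loads("""
{
  "vendor": {"name": "Example Vendor"},
  "total": 30.0,
  "line_items": [
    {"description": "Item A", "total": 10.0},
    {"description": "Item B", "total": 20.0}
  ]
}
""")


def summarize(document):
    """Pull out a few fields, tolerating keys that are missing or null."""
    vendor = (document.get("vendor") or {}).get("name")
    items = [
        (item.get("description"), item.get("total"))
        for item in document.get("line_items") or []
    ]
    return {"vendor": vendor, "total": document.get("total"), "items": items}


summary = summarize(SAMPLE_RESPONSE)
print(summary)
```

Using .get with fallbacks means a receipt without line items, or one where the vendor could not be determined, yields None or an empty list instead of raising a KeyError.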
If you would like to play with the Veryfi API and see the results without registration, you can drag and drop your receipts and invoices in the live demo on this page: https://veryfi.com/api/
If you have questions, feedback, or a feature request, please send us an email at firstname.lastname@example.org or schedule a call with me at https://calendly.com/dmitry-veryfi