The STP Challenge
- OCR Quality: OCR engines convert the text on a scanned document into machine-readable text. The quality of the incoming document determines the quality of this conversion. Distortions introduced during scanning, stains and blotches on the paper, handwriting over printed text, and low-quality scans all affect how an OCR engine “reads” the text. Just like a human, an OCR engine can make mistakes, for example reading the letter “o” as a zero or the letter “B” as the number 8.
- Document Classification: Before a document can be processed, the system needs to know what type of document it is so that the appropriate data extraction model can be used. The type is determined by evaluating either the text or the images on the document, using traditional programmatic methods or a Machine Learning model that is “trained” to classify documents. Multi-page documents are generally harder to classify than single-page documents. It is almost always advisable to segregate documents by type at the intake stage so that they arrive already classified. For example, invoices should be received in a different mailbox from purchase orders.
- Document Data Extraction: Almost all intelligent document processing platforms depend on a machine learning model that is “trained” on sample documents to extract data. Training involves a human identifying the data elements on each document; the model then uses these labeled examples to extract data from new documents. The greater the variety of documents, the larger the training data set needs to be. Once the model is trained and a document is presented to it, the model returns the extracted data along with an additional parameter called a confidence score. The confidence score is really a “familiarity” score: the model’s own estimate of how reliable its extraction is. A confidence threshold is set, and any value that falls below the threshold must be verified by a human. Note: For a document to be processed straight through without human intervention, ALL fields on the document must be extracted with a confidence score at or above the threshold. This can be very challenging when many fields are being extracted, because even a single field with a low confidence score means a human has to look at the document. One way to handle a low-confidence extraction is to compare the extracted data with an independent source of data, if one is available. For example, compare the PO number and the vendor’s name on an invoice with the corresponding PO record in the accounting system; if they match, the low confidence score can be ignored. As humans validate more documents, the data captured from this validation is used to “re-train” the model, and extraction accuracy increases over time. How long this takes depends on how much the documents vary and on the number of documents processed through the solution.
- Post-Processing Errors: These errors generally stem from inconsistencies between the extracted data and the system of record, e.g., a customer name that is not found or a product description that does not match. They are usually resolved through data cleansing and mapping tables.
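The straight-through rule described under Document Data Extraction can be illustrated with a small sketch. It is not taken from any particular platform: the `ExtractedField` class, the function names, the field names, the 0.85 threshold, and the case-insensitive comparison are all assumptions made for illustration. The sketch flags every field whose confidence falls below the threshold, unless the extracted value matches a value from an independent source such as the accounting system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ExtractedField:
    """One extracted value (hypothetical structure, for illustration only)."""
    name: str
    value: str
    confidence: float  # model's self-reported score, assumed to be in 0.0-1.0


def _normalize(text: str) -> str:
    # Assumption: a "match" ignores surrounding whitespace and letter case.
    return text.strip().casefold()


def fields_needing_review(
    fields: List[ExtractedField],
    threshold: float,
    trusted_values: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the names of fields a human still has to verify."""
    trusted_values = trusted_values or {}
    flagged = []
    for field in fields:
        if field.confidence >= threshold:
            continue  # confident enough, no review needed
        expected = trusted_values.get(field.name)
        if expected is not None and _normalize(field.value) == _normalize(expected):
            continue  # low score, but confirmed by the independent source
        flagged.append(field.name)
    return flagged


def is_straight_through(
    fields: List[ExtractedField],
    threshold: float,
    trusted_values: Optional[Dict[str, str]] = None,
) -> bool:
    """A document passes only if no field at all needs human review."""
    return not fields_needing_review(fields, threshold, trusted_values)


# Hypothetical invoice with one low-confidence field.
invoice = [
    ExtractedField("po_number", "PO-10045", 0.62),
    ExtractedField("vendor_name", "Acme Supplies", 0.97),
    ExtractedField("total", "1250.00", 0.91),
]
# Hypothetical values looked up in the accounting system.
accounting_record = {"po_number": "PO-10045", "vendor_name": "Acme Supplies"}

print(is_straight_through(invoice, threshold=0.85))
print(is_straight_through(invoice, threshold=0.85, trusted_values=accounting_record))
```

Without the accounting record, the single low-confidence PO number is enough to route the whole invoice to a human. With it, the PO number is confirmed independently and the invoice goes straight through. A production system would likely need a more careful matching rule than the simple normalized comparison assumed here.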