Today, many technology options are available to help automate mortgage loan processing. Some solutions are well marketed and low cost, with bold claims of vast rule libraries and tremendous results. Others claim that OCR is an antiquated technology, yet go on to apply very low-cost labor, other older technology approaches, or a combination of these.
There are three typical methodologies applied to document classification. Below, we provide a high-level overview of each, along with some discussion of data extraction with each approach:
Full Page OCR for Document Classification
This approach to document classification is distinct from most other classification technologies in that it runs a full-page OCR pass on every page of every document presented to it. Ideally, an entire page is read in less than half a second, and a set of rules is then applied to determine which document type each page belongs to. While this would seem to be an obvious way to identify the very diverse documents found in the mortgage industry, most technology providers are unable to deliver the speed needed to scale with this approach.
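The rule step described above can be sketched as follows. This is a minimal illustration that assumes the OCR engine has already returned the page text as a string. The rule set, the document type names, the `classify_page` function, and the `min_score` threshold are all hypothetical and are not taken from any vendor's product.

```python
import re

# Hypothetical rule set: each document type lists phrases expected in its text.
# The phrases and type names are illustrative only.
RULES = {
    "promissory_note": ["promise to pay", "borrower", "interest rate"],
    "appraisal_report": ["appraised value", "comparable sales", "subject property"],
    "deed_of_trust": ["trustee", "security instrument", "borrower"],
}


def normalize(text):
    """Lowercase and collapse whitespace so phrase matching ignores layout."""
    return re.sub(r"\s+", " ", text.lower())


def classify_page(ocr_text, rules=RULES, min_score=0.5):
    """Return (doc_type, score) for the best-matching rule set.

    The score is the fraction of a type's phrases found in the page text.
    If no type reaches min_score, the type is None (route to manual review).
    """
    text = normalize(ocr_text)
    best_type, best_score = None, 0.0
    for doc_type, phrases in rules.items():
        hits = sum(1 for phrase in phrases if phrase in text)
        score = hits / len(phrases)
        if score > best_score:
            best_type, best_score = doc_type, score
    if best_score < min_score:
        return None, best_score
    return best_type, best_score
```

Because the rules match words and phrases rather than positions on the page, a layout the system has never seen can still be classified as long as its wording is similar.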
Advantages of this approach include:
>>Ability to index document versions the system has never seen before, provided they are lexically similar (the same words and phrases appear throughout)
>>Ability to accurately distinguish between leading pages and following pages, eliminating the need for separator sheets in the scanning process
>>Ability to “discover” data for capture much as a human being does, using words and phrases across the entire document to find key data
>>High-speed OCR allows for almost infinite scalability with a relatively small hardware footprint
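The separator-sheet point can be illustrated with a short sketch. It assumes each scanned page has already been labeled with a document type and a flag saying whether it looks like a leading page (for example, because a title phrase was found in its text). The function name and the tuple format are hypothetical.

```python
def split_into_documents(pages):
    """Group classified pages into documents without separator sheets.

    `pages` is a list of (doc_type, is_leading_page) tuples in scan order.
    A new document starts at every leading page or whenever the type changes.
    """
    documents = []
    for doc_type, is_leading in pages:
        starts_new = (
            not documents
            or is_leading
            or documents[-1]["type"] != doc_type
        )
        if starts_new:
            documents.append({"type": doc_type, "page_count": 1})
        else:
            documents[-1]["page_count"] += 1
    return documents
```

In this sketch, two notes scanned back to back are still separated correctly, because the second note's first page is flagged as a leading page.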
Visual Classification, Also Known as Fingerprinting
This is an old approach that some vendors have remarketed and renamed as AI for use in the mortgage industry. While it does recognize documents and has the advantage of sub-second speed, it is NOT an OCR solution. Instead, it uses an image analysis (non-text-based) approach to identify documents and page types.
This solution attempts to differentiate between document type A and document type B largely by examining the distribution of ink on samples of each document type. This is like a thumbprint analysis: a graphical signature of each document type is learned and remembered.
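A rough sketch of the ink-distribution idea follows. It assumes a page arrives as a binarized image (rows of 0 for white and 1 for ink) that is at least as large as the grid, and it uses a simple grid of ink densities as the signature. Real products likely use richer image features; the function names and the 2x2 grid are assumptions made for illustration.

```python
def ink_signature(image, grid=2):
    """Fraction of ink pixels in each cell of a grid x grid partition.

    `image` is a list of rows, each a list of 0 (white) or 1 (ink) pixels.
    Cells are visited row by row, left to right.
    """
    height, width = len(image), len(image[0])
    signature = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            signature.append(sum(cell) / len(cell))
    return signature


def nearest_type(signature, learned):
    """Return the learned type whose signature is closest (squared distance).

    Every learned signature is compared, so the cost of this lookup grows
    linearly with the number of document variations.
    """
    def distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(learned, key=lambda name: distance(signature, learned[name]))
```

Note that nothing in this sketch reads text: two documents with different wording but similar ink distribution would look alike, and a changed layout would produce a different signature even if the wording stayed the same.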
The advantage of this methodology is:
>>Performance (for the images successfully processed by the image signature method)
Disadvantages of this methodology include:
>>The layout-specific configurations needed for each document variation can take a long time to set up if the number of document variations/types is high.
>>These layout-specific configurations need to change if the layout of a document ever changes.
>>The graphical signature approach tends to be less reliable with more than one hundred document variations/types to compare. This can affect accuracy in some cases.
>>The time to process images tends to be linearly related to the number of document variations/types.
>>This approach presents challenges when attempting to detect document boundaries for multiple page documents and does not provide an ability to extract data from the documents once identified.
The third approach does NOT have the advantage of a sub-second OCR solution, but it does use OCR as part of its document classification and data extraction methodology to enhance its results. In general, the system is a mix of preconfigured rules, a learned knowledge base, and layout-specific configurations. The rules are configured through a GUI, but more complex operations require scripting. The technology is typically configured for mailroom and Accounts Payable environments.
Learning is achieved by running real production data through the system to a human verification step. The system attempts to learn from the document classification and data extraction decisions made by the verification operator.
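The in-production learning loop can be sketched in a few lines. This is a simplified illustration, not a description of any vendor's implementation: it assumes each page can be reduced to some layout key, and it simply tallies the document types that verification operators confirm for that key. The class name, the `layout_key` notion, and the `min_votes` threshold are hypothetical.

```python
from collections import Counter, defaultdict


class VerificationLearner:
    """Tallies operator decisions per layout and suggests the majority type."""

    def __init__(self):
        # layout_key -> Counter of document types confirmed by operators
        self._votes = defaultdict(Counter)

    def record(self, layout_key, operator_type):
        """Store one decision made at the human verification step."""
        self._votes[layout_key][operator_type] += 1

    def suggest(self, layout_key, min_votes=2):
        """Return the most-confirmed type, or None if evidence is too thin."""
        votes = self._votes.get(layout_key)
        if not votes:
            return None
        doc_type, count = votes.most_common(1)[0]
        return doc_type if count >= min_votes else None
```

Each recorded decision changes what the system suggests next, which means the production configuration shifts with every verified page.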
An advantage of this technology is that in-production learning allows rapid use of layout-specific information. Unfortunately, this advantage is also a disadvantage. Many higher-volume sites require regression testing before any configuration change is promoted into production, and this methodology rests on the belief that such testing is not necessary.
Other disadvantages include:
>>As the system adds layout-specific templates, the system gets proportionately slower
>>Separator sheets between multi-page documents are required
>>Production errors occur if layouts change
About The Author
Mark Tinkham is Director of Business Alliances at Paradatec, Inc. For more than twenty-five years, Mark has worked for technology companies that deliver innovative solutions to the financial services industry. For the past ten years, his primary focus has been bringing efficiencies to the mortgage market through industry-leading Optical Character Recognition (OCR).