AI Based Textual Analysis vs. Other Approaches

Today there are many technology options available to assist in the automation of mortgage loan processing. Some solutions are well marketed and low cost, with great claims of vast libraries of rules and an ability to provide tremendous results. There are even vendors that claim OCR is an antiquated technology, yet go on to apply very low-cost labor, other older technologies, or a combination of the two.

Alternative Approaches

There are three typical methodologies applied to document classification. Below, we provide a high-level overview of each, along with some discussion of data extraction with each approach:

Full Page OCR for Document Classification

This approach to document classification is distinct from most other classification technologies in that it uses a full-page OCR pass for every page of every document presented to it. Ideally, an entire page is read in less than half a second and then a set of rules is applied to determine which document type each page belongs to. While this would seem to be an obvious way to approach the task of identifying the very diverse documents found in the mortgage industry, most technology providers are unable to deliver the speed necessary to successfully scale with this approach.

Advantages of this approach include:

>>Ability to index document versions which may have never been seen before by the system assuming they are lexically similar (same words and phrases found throughout)

>>Ability to accurately distinguish between leading pages and following pages, thus eliminating need to include separator sheets in the scanning process

>>Ability to “discover” data for capture much as a human being does, using words and phrases across the entire document to find key data

>>High speed OCR allows for almost infinite scalability with a relatively small hardware footprint
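The idea of applying rules to full-page OCR text, and of telling leading pages from following pages without separator sheets, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the phrase lists in RULES, the "Page 1 of N" leading-page test, and all function names are hypothetical and are not taken from any vendor's product.

```python
import re

# Hypothetical rule set: phrases expected somewhere in the text of each document type.
RULES = {
    "Note": ["promise to pay", "note holder"],
    "Deed of Trust": ["deed of trust", "trustee", "security instrument"],
    "Closing Disclosure": ["closing disclosure", "closing costs", "cash to close"],
}


def normalize(text):
    """Lower-case the OCR text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower())


def classify_page(ocr_text):
    """Pick the document type whose phrases occur most often on the page."""
    text = normalize(ocr_text)
    scores = {
        doc_type: sum(phrase in text for phrase in phrases)
        for doc_type, phrases in RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unknown"


def is_leading_page(ocr_text):
    """Assumed heuristic: a 'Page 1 of N' footer marks the first page of a document."""
    return re.search(r"\bpage 1 of \d+\b", normalize(ocr_text)) is not None


def split_documents(pages):
    """Group consecutive pages into documents without separator sheets."""
    documents = []
    for text in pages:
        doc_type = classify_page(text)
        starts_new = (
            not documents
            or documents[-1]["type"] != doc_type
            or is_leading_page(text)
        )
        if starts_new:
            documents.append({"type": doc_type, "pages": 0})
        documents[-1]["pages"] += 1
    return documents
```

Because the rules look at words and phrases rather than page layout, a page from a never-before-seen version of a Note would still score as a Note as long as it shares the same vocabulary.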

Visual Classification, Also Known as Fingerprinting

This is an old approach which has been remarketed and renamed by some vendors as AI for use in the mortgage industry. While it does recognize documents and has the advantage of sub-second speed, it is NOT an OCR solution. Instead, an image analysis (non-text-based) approach is used to identify document and page types.

This solution attempts to differentiate between document type A and document type B largely by examining the distribution of ink on samples of each document type. This is like a thumbprint analysis: a graphical signature of each document type is learned and remembered.
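A toy version of such an ink-distribution signature is sketched below. It assumes a page is already available as a grid of 0/1 pixels (1 meaning ink) that is at least as large as the signature grid; the grid size, the Euclidean distance, and the function names are illustrative choices, not a description of any particular product.

```python
def ink_signature(image, grid=2):
    """Fraction of inked pixels in each cell of a grid x grid partition of the page."""
    rows, cols = len(image), len(image[0])
    signature = []
    for gr in range(grid):
        for gc in range(grid):
            r0, r1 = gr * rows // grid, (gr + 1) * rows // grid
            c0, c1 = gc * cols // grid, (gc + 1) * cols // grid
            cell = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            signature.append(sum(cell) / len(cell))
    return signature


def distance(a, b):
    """Euclidean distance between two signatures of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def classify_by_signature(image, learned, grid=2):
    """Return the learned document type whose signature is closest to the page's."""
    signature = ink_signature(image, grid)
    return min(learned, key=lambda doc_type: distance(signature, learned[doc_type]))
```

Note that no text is read at any point, and that the final step compares the page against every learned signature in turn.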

Advantages of this methodology include:

>>Performance (for the images successfully processed by the image signature method) 

Disadvantages of this methodology include:

>>The layout-specific configurations needed for each document variation can take a long time to set up if the number of document variations/types is high.

>>These layout-specific configurations need to change if the layout of a document ever changes.

>>The graphical signature approach tends to be less reliable with more than one hundred document variations/types to compare. This can affect accuracy in some cases.

>>The time to process images tends to be linearly related to the number of document variations/types.

>>This approach presents challenges when attempting to detect document boundaries for multiple page documents and does not provide an ability to extract data from the documents once identified.

Dynamic Learning

This approach does NOT have the advantage of a sub-second OCR solution, but it does use OCR as part of its document classification and data extraction methodology to enhance its results. In general, the system is a mix of preconfigured rules, a learned knowledge base, and layout-specific configurations. The rules are configured through a GUI, but more complex operations require scripting. The technology is typically configured for mailroom and Accounts Payable environments.

Learning is achieved by running real production data through the system to a human verification step. The system attempts to learn from the document classification and data extraction decisions made by the verification operator.

An advantage of this technology is that in-production learning allows rapid use of layout-specific information. Unfortunately, this advantage is also a disadvantage: many higher-volume sites require regression testing before any configuration change is promoted into production, and this methodology assumes that such testing is not necessary.

Other disadvantages include:

>>As the system adds layout-specific templates, the system gets proportionately slower

>>Separator sheets between multi-page documents are required

>>Production errors occur if layouts change

Evaluating An OCR Solution for Mortgage Documents

Today there are many OCR technology options available to assist in the automation of mortgage loan processing. Some solutions are well marketed and low cost, with great claims of vast libraries of rules and an ability to provide tremendous results. Unfortunately, the reality is that OCR technology users and prospects are often disappointed in the results of current and past OCR evaluations and initiatives. So, they’re understandably cautious and untrusting. A better evaluation process can go a long way toward overcoming this confusion and disappointment and minimizing the risks involved in choosing an OCR technology vendor.

In order to quickly understand an OCR technology and its capabilities, a blind test with several sample files should be considered the gold standard for an evaluation. This is especially true when it comes to the challenges presented by the many and varying document types and quality levels of document images found in the mortgage industry. Asking vendors if they are willing to perform a test on a never-before-seen sample set of typical loan files, on site and in sight of your evaluation team, is a good first step in shortening your list of viable vendors for your project.

The test is often conducted on-site, rather than at a vendor’s facilities, due in part to the typically confidential nature of the content, but also to minimize concerns about possible skewing of results behind the scenes. Look for a pre-built mortgage OCR library which can offer clients a short evaluation and implementation timeline rather than a requirement to develop processes and rules from the ground up. Ideally, an evaluation should be set up as a One-Day Blind Test. This kind of test is intended to demonstrate the validity of vendor claims so that prospects can be assured that they are considering a proven, robust and scalable solution ready to deliver productivity improvements in weeks rather than months or years.

In the course of a One-Day Blind Test, provided loan files should be indexed by document type and 50-100 data fields should be extracted from various key documents like the Note, Deed of Trust, Closing Disclosure, and Appraisal.  Output results should be provided along with statistical reporting describing automation and processing times.

Unfortunately, many companies base their buying decision primarily on price, only to be disappointed with the lack of true out-of-the box mortgage-specific functionality offered by the product.  In other cases, great claims are made regarding OCR automation, with the reality being something less impressive.  

For qualified opportunities, Paradatec has been performing this process which enables prospective clients to quickly understand the overall levels of automation and speed improvements they will be able to achieve with their technology.  Paradatec calls their evaluation the One-Day Blind Test Challenge.  

Paradatec’s Advanced OCR solutions offer significant efficiencies for classifying large quantities of differing document types and extracting key data elements from those documents.  In the mortgage market, these capabilities allow for the quick and accurate identification of over 500 unique documents in the typical mortgage file, along with capturing nearly any data element from those documents that an organization requires.  For more information, please visit

Looking At OCR Use Cases In Mortgage Lending

With the costs to process each mortgage continuing to rise, lenders must leverage automation to improve profitability and consistency in their business processes.  With the right Advanced Mortgage OCR solution, mortgage companies have been able to reduce their level of manual document indexing and data entry activity, enabling them to process more loans per day at a lower cost per loan – yielding a leaner process and increased profit margins. 

Advanced OCR, More Than Just Reading Characters

An Advanced Mortgage OCR solution needs to do more than just convert document images to text. Once converted, an advanced OCR solution should then be able to interpret that text using Semantic Analysis and artificial intelligence (“AI”) rules engines, much as a human being would process the content. Based on these results, documents can be automatically indexed and relevant data points extracted. This information is then passed to downstream applications for appropriate routing and archival.

A Technology Vendor with a Unique Approach

For today’s most advanced OCR solution, the process begins with a full-page OCR scan of each image. This step is unique and typically completes in less than one second per page. An extremely high-speed OCR process is critical, yet difficult for many vendors to achieve. It is this performance that allows every word on the page to be included in the scope of the AI rules engine analysis, just as a human being would interpret the content. This content evaluation process is unique in its combination of speed and ability to include all page content in the evaluation scope, making it extremely flexible with documents of varying layout (for example, bank statements).

OCR in Action: Use Cases from Leading Lenders

>>TRID Capture and Audit

The ideal OCR solution provides a rigorous tool for a comprehensive review of each TRID transaction. Typically, during the origination process there are several iterations of both a Loan Estimate and a Closing Disclosure. The most efficient TRID Audit solution is able to extract every data element from all initial and re-disclosed Loan Estimates and Closing Disclosures. The system can be configured to either output all of the data from each document iteration, or output just the differences found from the prior document. Output formats should include MISMO v3.3 or custom XML schemas. 

In the case where a loan origination system is generating the TRID disclosures, this differential reporting may be something produced by the LOS itself. However, in the correspondent lending channel, or in the case of a split, “borrower-only” and “seller-only” Closing Disclosure transaction, this Advanced OCR solution closes a gap that the LOS is unable to address. 

In these cases where the lender’s LOS does not generate all iterations of the Closing Disclosure and Loan Estimate, a solution is needed that can natively read PDF or scanned TIFF versions of these documents. This type of TRID Audit solution has been developed and tested to support any layout of these documents from any source.
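The "differences only" output mode described for TRID audits amounts to comparing the fields captured from each disclosure iteration with those of the one before it. A minimal sketch follows, assuming each iteration has already been reduced to a dictionary of field names and values; the field names used with it are hypothetical and do not follow the MISMO schema.

```python
def field_differences(previous, current):
    """Return {field: (old, new)} for every field added, removed, or changed."""
    changes = {}
    for name in sorted(previous.keys() | current.keys()):
        old, new = previous.get(name), current.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


def differential_output(iterations):
    """First iteration in full, then only the differences from the prior iteration."""
    output = []
    for index, fields in enumerate(iterations):
        if index == 0:
            output.append(dict(fields))
        else:
            output.append(field_differences(iterations[index - 1], fields))
    return output
```

A field that is missing from one iteration shows up with None on that side of the pair, so dropped and newly added line items are reported along with changed amounts.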

>>UCD Creation and Audit

The Uniform Closing Dataset (UCD) provides a common industry dataset to support the Consumer Financial Protection Bureau’s (CFPB) Closing Disclosure and its ability to be communicated electronically. 

Loans acquired by the GSEs that closed on or after September 25, 2017 are required to have a UCD XML file and, after June 25, 2018, an embedded PDF of the associated Borrower Closing Disclosure.

Over time the UCD is intended to provide the following benefits:

A. Greater data consistency by promoting better and more efficient data integration and exchange between business partners.

B. A common understanding, as all parties use a consistent approach and language to describe the information on the Closing Disclosure.

C. Improved data accuracy by eliminating the need for proprietary formats that can be costly to maintain and can lead to misinterpretation of the data.

The GSEs are collecting UCD data because it:

A. Helps enhance credit risk management with more data and better quality data.

B. Provides important information to help increase their ability to detect fraud and misrepresentation at loan delivery.

C. Provides additional transparency into the mortgage loan transaction file to help assess whether the loan, as closed, meets the GSE’s eligibility requirements.

According to the GSEs a PDF of the Closing Disclosure needs to be embedded in the UCD because, “The Borrower Closing Disclosure is the definitive record of the fees, charges, and adjustments that occurred in the loan transaction. As such, it is used to validate that the information provided in the UCD submission is complete and accurate.”

July 2018 UPDATE: As the new requirement for embedding a PDF of the Borrower’s Closing Disclosure was beginning to roll out, leading solution providers engineered a solution to audit, and statistically measure, the agreement between the data found on the embedded PDF and the MISMO XML data found in the UCD.

The right solution provides the tools to determine if the data on the embedded PDF Closing Disclosure source document actually matches the same data within the UCD XML file. While this capability is certainly valuable to GSE entities, it is also possible to use this audit for other loan transfers.  As part of a due diligence process, investors may use this capability to verify that a set of loans to be purchased is as advertised and all critical metadata provided is accurate.
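In outline, such an audit compares each value in the XML with the value captured from the embedded PDF and reports a match rate. The sketch below is an assumption-laden illustration: the flat XML layout and the tag names passed to it are invented for the example and are not the MISMO or UCD schema, and the PDF side is assumed to arrive as a dictionary of already-extracted strings.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation


def xml_fields(xml_text):
    """Read a flat, hypothetical XML layout into {tag: text}."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


def normalize(value):
    """Compare amounts numerically where possible, so '$250,000.00' equals '250000'."""
    cleaned = value.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return cleaned.lower()


def audit(xml_text, pdf_fields):
    """Return (match_rate, mismatches) for the XML values versus the PDF values."""
    xml_values = xml_fields(xml_text)
    mismatches = {}
    for name, xml_value in xml_values.items():
        pdf_value = pdf_fields.get(name, "")
        if normalize(xml_value) != normalize(pdf_value):
            mismatches[name] = {"xml": xml_value, "pdf": pdf_value}
    compared = len(xml_values)
    match_rate = (compared - len(mismatches)) / compared if compared else 1.0
    return match_rate, mismatches
```

Normalizing before comparing matters in practice, because the same amount is often formatted differently on a printed form than in a data file.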

>>HMDA Audit

In order to promote compliance with federal consumer protection laws, lenders are required to submit certain borrower demographic data to the federal government. HMDA (Home Mortgage Disclosure Act) disclosures provide the public with information on the home mortgage lending activities of most lenders.

One of the challenges for a lender in reporting HMDA data is to ensure that the documents from which data is pulled are, in fact, the final versions. Many times, errors in HMDA reporting result from data taken from a non-final source document.

The most advanced OCR solution for HMDA Audits searches through an image archive for every version of every document relevant to the HMDA reporting process and automatically determines the final versions. Data is then automatically captured from these final documents via their AI data extraction rules and coalesced into an XML file or spreadsheet to be used for reporting. 

This process provides lenders with a highly automated method for assuring the accuracy of required Loan Application Register (LAR) reporting data and for ensuring database-of-record quality for future reporting needs.
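The two steps described above, picking the final version of each document and then coalescing data from those finals into one record, might look like the following. The article does not say how the final version is determined; this sketch simply assumes the most recently dated version is final, and the record layout and field names are hypothetical.

```python
def final_versions(documents):
    """Keep one document per type. ASSUMPTION: the latest-dated version is the final one."""
    latest = {}
    for doc in documents:
        current = latest.get(doc["type"])
        if current is None or doc["dated"] > current["dated"]:
            latest[doc["type"]] = doc
    return latest


def build_report(documents, wanted):
    """Coalesce the wanted (document type, field) pairs from final versions into one record."""
    finals = final_versions(documents)
    report = {}
    for doc_type, field in wanted:
        doc = finals.get(doc_type)
        report[field] = doc["fields"].get(field) if doc else None
    return report
```

A None value in the resulting record flags a field whose source document was not found in the archive, which is itself useful to know before reporting.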

What’s New in 2018?

>>OnDemand OCR capture (W2s, Paystubs, and Tax forms)

As the industry continues to look for faster and more efficient ways to capture key data from prospective borrowers, a leading OCR provider has been listening.  Their sub-second speed OCR is the ideal technology platform from which to allow borrowers, loan officers and others to submit supporting loan documentation for quick automated document identification and data field capture.

A user may drag a PDF of their Federal IRS 1040 Income Tax form to a browser-based app. The form will be identified and all data fields captured in a short time frame and made immediately available to loan officers and loan origination systems.

>>Necessary but Unique Capabilities

The key capabilities and features of the Paradatec Advanced OCR solution that make these use cases possible are:

A. Sub-second per image full OCR processing

Paradatec advanced indexing and data capture technology is at least 10 times faster than others, which allows them to take an approach others would like to, but just can’t because of their system performance. This capability is unique, and enables Paradatec to evaluate all text on every page, just as a human can but much faster. 

B. Extreme scalability with a small hardware footprint 

Paradatec’s Advanced OCR solution scales from the ability to process over 1,000,000 images daily on a single eight core server to tens of millions of images daily by simply enlisting additional cores into the configuration.

C. Pre-built mortgage OCR library

Over 500 mortgage document types ready to be indexed, and more than 6,000 mortgage loan data fields able to be captured right “out of the box”.  

D. Web services API

Paradatec’s OnDemand OCR feature extends their Advanced OCR capabilities to other applications through seamless integration with a web services API.

E. Document versioning

Documents can be stacked, with like documents consolidated together, to streamline the document versioning process.

F. Bookmarked PDF output

Paradatec’s WritePDF module provides a bookmarked and annotated PDF of the submitted loan package, including a table of contents with links to key data elements within the package.  Clients find this feature invaluable and a significant documentation addition to their inventories of mortgage loans.
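The throughput figure in item B can be related to the sub-second claim in item A with some simple arithmetic. The sketch below assumes round-the-clock processing and perfectly linear scaling across cores, neither of which the article states; under those assumptions, 1,000,000 images per day on eight cores works out to roughly 0.69 core-seconds per image.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def core_seconds_per_image(images_per_day, cores):
    """Average core time available per image, assuming 24-hour operation."""
    return cores * SECONDS_PER_DAY / images_per_day


def cores_needed(images_per_day, seconds_per_image):
    """Cores required for a daily volume, assuming perfectly linear scaling."""
    return math.ceil(images_per_day * seconds_per_image / SECONDS_PER_DAY)


# The single-server figure quoted in item B.
print(core_seconds_per_image(1_000_000, 8))
```

The same functions can be used to estimate how many cores a target daily volume would need at a measured per-image time, bearing in mind that real systems rarely scale perfectly.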

Paradatec’s Advanced Mortgage OCR solution is designed to make mortgage lending faster and more accurate. In 2017, Paradatec’s Mortgage OCR solution processed over 1,500,000,000 images (representing over 2,500,000 loans), helping lenders and servicers streamline their origination, onboarding and compliance obligations by automating document indexing, automating data extraction, meeting tighter service level agreements, and delivering more accurate data much faster than manual data entry alone. In 2018, Paradatec is on track to again exceed the volumes processed and the automation provided to their lender, servicer, and other technology provider clients in the mortgage lending industry.

Tried & True Innovation

As we all know, mortgage lenders are looking for an edge. How do they get that edge? They can start by replacing a paper-driven mortgage process with an automated process. This is where industry specialists like Paradatec can help. For over two decades, Paradatec has focused its skills towards delivering the most efficient, accurate, and flexible freeform document classification and data extraction solution available anywhere. Specifically, Paradatec’s advanced OCR solutions offer significant efficiencies for classifying large quantities of differing document types and extracting key data elements from those documents. In the mortgage market, these out-of-the-box capabilities allow for the quick and accurate identification of nearly 500 unique documents in the typical mortgage file, along with capturing over 6,000 data elements from those documents. Our editor talked to (left to right) Mark Tinkham, the company’s Director of Business Alliances; Paul Fischer, the company’s Director of Professional Services; and Neil Fraser, the company’s Director of US Operations; about how lenders can use technology to improve the mortgage process. Here’s what they said:

Q: So, what does Paradatec specifically do that would be compelling to a mortgage servicing, or lending operation?

MARK TINKHAM: Paradatec streamlines and monitors processes which otherwise require significant human labor. We minimize the need for managing large costly staffs of trained loan file indexers and data key entry operators. We do this while at the same time providing statistical feedback and measurement of accuracy and automation. We provide these efficiencies so that our clients are able to better focus on their customers, manage workload peaks and valleys more easily, and measure results over time.

A good basic example is our ability to automatically identify all the documents in a 500 to 1,000 page loan package, and capture every one of the hundreds of fields on every version of every TRID document (Loan Estimate and Closing Disclosure), every one of the dozens of fields on a Loan Application, Appraisal, Transmittal Summary, Note, Deed of Trust, 4506-T, Income Tax statement and whatever else a client may require.

Q: How does OCR (Optical Character Recognition) technology provide value in today’s Mortgage Industry?

PAUL FISCHER: There are vast differences between some of the lower cost OCR technologies, and the advanced OCR offered by Paradatec. The advantages to using our technology are a dramatically faster, more accurate and less costly process for indexing and capturing data from mortgage loan documents.

The short answer to your question is: we provide our clients with an ability to do in seconds what many operations, using 100% human labor, take hours to do. And, at the same time, we provide results which are more accurate.

Our unique approach to OCR allows us to extend these broad benefits to originators’ and correspondent lenders’ indexing and data ingestion validation, and servicers’ loan onboarding processes.

Our capabilities, out of the box, today include rules to identify approximately 500 mortgage loan document types and extract more than 6,000 fields from those documents.

In addition, we have helped our clients with automating their compliance processes with HMDA loan audits, UCD creation and TRID capture solutions.

Q: Has the industry fully embraced your automation technology?

NEIL FRASER: We think lenders do understand the need for automation, but many may not be aware of the significant and unique competitive advantages our clients continue to realize.

Lately we have been spending more time sharing our many success stories and getting the word out that we can provide powerful efficiencies related to loan automation.   These advantages range from compressing the time it takes to process borrower-provided documents to expediting the loan onboarding process and making compliance audits significantly more automated.

We offer an ability to dramatically reduce the manual efforts related to indexing loan documents, and capturing key data from those document images. Our sub-second per image processing speed is unique and it allows us to take an approach which others are unable to match due to their OCR performance. This speed and ability to scale our processes to tens of millions of images per day on a small hardware footprint are waking up the industry to the possibilities of how their operations will benefit.

So, we are seeing more and more lenders embracing our technology. And, because we continue to add enhancements and find new ways to provide value with our technology we believe our current and future clients will continue to find new and exciting ways to further embrace our solutions.

Q: How is Paradatec’s OCR technology different than others?

MARK TINKHAM: Our extreme focus on OCR technology began more than twenty-five years ago, and since 2007 we have been applying our unique, sub-second per page, small hardware footprint OCR technology to the mortgage industry. With every implementation, we have continued to build more and more out-of-the-box capabilities specific to processing mortgage loan documents. Over the years we have seen various fads and splashy marketing campaigns touting various OCR technologies and approaches, which in reality were not effective.

Recently we’ve seen an increase in the hype with alternative automation strategies. One approach, which isn’t new, and we have seen in years past, is something called visual classification, in which the image ‘fingerprint’ of a page is used for identification rather than the text itself. This approach is fast and used in an attempt at matching our sub-second per page processing speed.

For documents that are graphically focused with minimal text, this may work fine, but mortgage files are loaded with text, and in many cases that text will be key to correctly identifying the document type. For example, many Deed and Rider signature pages can look similar, in that the content many times pushes the signature block to its own page. Our clients want the delineation between these docs, and even between the various Riders, but at a ‘fingerprint’ level these pages can look quite similar, leading to many indexing errors. It’s only when the footer text is discovered and read as “PUD Rider,” “MERS Rider,” or “Deed of Trust” that the correct automated decision can be made, which our solution completes with sub-second speed.

Q: How do you ensure quality control and data accuracy?

PAUL FISCHER: We implement database validation of captured data, and reasonability rules for indexing and data capture. In addition, we provide a process for statistically random reviews and measurements of loan indexing and captured data along with an ability to track user efficiency over time. With the Paradatec Statistics database, our clients are able to generate an unlimited number of useful reports which track processing time, by loan, by user, by time period, even down to the document type and extracted data field level.

In addition, we provide an ability to create a quality review and learning process from production output with our analytical tools. This process is performed as part of the testing and implementation stages, and provides deep insights into the accuracy and automation levels which have been achieved.

As part of an ongoing quality measurement and learning adjustment stage, our clients can be confident that their processes are continuing to perform at the highest levels of quality.

Q: Your Company has released an Application Programming Interface (API). In layman’s terms, what does this do?

PARADATEC: We provide a Web Services API which allows end users to submit loan documents and data for validation to our workflow processes from virtually anywhere.

A use case example would be our OnDemandOCR process, which utilizes our API to allow lenders to submit final Closing Disclosures remotely and receive a MISMO formatted GSE compliant Uniform Closing Dataset (UCD) back as output for review and ultimately submission to the GSEs when loans are presented.

Another use case for our API will allow borrowers to submit documents as part of a loan origination. Our OnDemandOCR process will then identify the document or documents submitted, and automatically extract the key data fields from them.

Q: What are some other manual processes that you have automated within your clients’ operations?

NEIL FRASER: Since our focus on the mortgage industry began, we have continued to find new, and many times dramatic, ways to enhance our clients’ processes.

A little over a year ago, we were asked to re-index approximately two million loans due to some compliance pressure our client was getting to make sure their loan portfolio accurately accounted for the necessary source documents. We were able to assist by processing over 1.2 billion document images in a matter of weeks. In other cases we have been asked to help meet new compliance obligations by significantly streamlining what would otherwise have been extremely costly, labor-intensive efforts.

Our new HMDA Audit capability enables our clients to quickly validate the data on their Loan Application Register (LAR) against the data found on the associated loan source documents. Each loan is processed at less than one second per page and each of the final source documents’ data is compared to the values on the LAR. This process allows our clients to ensure compliance with the Federal Reserve Board’s Regulation C before submission to the Federal Financial Institutions Examination Council (FFIEC).

Our UCD Audit capability enables our clients and the GSEs to automatically compare the MISMO 3.3 data found in a Uniform Closing Dataset against the corresponding values found on the final Closing Disclosure which is embedded in that UCD. This process is performed at an average of one second per page and each of approximately 300 fields extracted are then compared. Differences found between the MISMO data and the extracted data are reported in a MISMO compliant “differences” file. Along with this, we also produce a corrected UCD based on the embedded Closing Disclosure.

Our CCAR FRY_14M offering helps our largest clients comply with the latest CFO attestation requirements related to the Dodd-Frank Stress Test rules for large financial institutions. This process uses our high speed OCR capability and pre-built rules to classify documents, find the final version of key document types, and validate source document data against attestation data. This process can be performed in seconds per loan, and allows our clients to find and correct many of the inaccuracies typically found. In fact, because the original attestation data is typically key entered with human labor, and final document versions are often confused with non-final versions, prior attestation data is often incorrect. Without automation, this compliance risk mitigation step would be cost prohibitive.

Q: Paradatec has more than a decade of experience within the mortgage industry. What new initiatives and innovations have you recently brought to market or have coming up in the near future?

MARK TINKHAM: Some examples of new initiatives, new capabilities, and product features, some of which were mentioned earlier, include:

The Paradatec WriteUCD module for automated creation of GSE compliant UCDs from final Closing Disclosures.

Web Services API to enable our clients to seamlessly integrate our technology using our OnDemandOCR feature.

An ability to capture every field on every version of both the Closing Disclosure, and the Loan Estimate in an average of one second per page.

Our Paradatec WritePDF module for creating fully indexed loans with data fields highlighted in a PDF which includes a table of contents which virtually maps a loan’s documents and key source data.

An ability to automatically identify and capture all the fields on the new HMDA compliant URLA and the new HMDA addendum to the old URLA.

Our new HMDA audit process which can greatly streamline this process for our clients.

Our UCD Audit capability has attracted some significant interest from the GSEs and some of our larger clients.

We’re developing a new handprint discovery feature that will provide large leaps in automation for our post-close clients, which need to validate the required initials and signatures on key loan documents.

Q: How do you see the mortgage industry and the mortgage process of the future evolving?

MARK TINKHAM: Like many other industries, the mortgage industry is experiencing an evolution through the aid of technology. Staying competitive and reducing per-loan processing costs require the use of technology like ours. Industry leaders such as Amazon and Orbitz have made the self-service model, albeit in other market segments, much less daunting, and the time it takes to complete transactions has decreased significantly through this evolution. While the magnitude of the buying decision for a home is obviously much greater than that of buying an airplane ticket or a pair of shoes, the consumer has become comfortable with online transactions to the point that a paper-bound process is viewed as slow and stodgy.


Mark Tinkham is Director of Business Alliances at Paradatec, Inc. Over the past twenty-five plus years, Mark has worked for technology companies that deliver innovative solutions to the financial services industry. For the past ten years, his primary focus has been bringing efficiencies to the mortgage market through industry leading Optical Character Recognition (OCR).


Mark Tinkham thinks:

1.) The digital mortgage won’t eliminate the need for manual data entry.

2.) Our UCD Audit process will be found to be an invaluable tool for those lenders selling loans to the GSEs.

3.) The 20 largest lenders and servicers will all embrace advanced OCR by 2020 out of necessity.


Paul Fischer is Director of Professional Services at Paradatec, Inc.  For nearly 15 years he has focused on the design and installation of document capture, content management, and workflow automation systems for clients in a variety of industries.  Since joining Paradatec in early 2013, his primary focus has been on helping mortgage clients improve their operational efficiencies with Paradatec’s advanced mortgage OCR solution.


Paul Fischer thinks:

1.) Cycle times and cost pressures will continue to drive automation initiatives in the mortgage origination and servicing space.

2.) Document ingestion for mortgage servicing rights (MSR) transfers will become an entirely automated process.

3.) Robotic process automation (RPA) will reduce manual labor by at least 20%, and by much more in many cases.


Neil Fraser is Director of US Operations at Paradatec, a mortgage OCR technology organization that automates the data entry operations of large lenders through intelligent document analysis. Neil was Paradatec’s first US employee and has grown the organization every year since the company incorporated here in 2002.


Neil Fraser thinks:

1.) Redaction of personally identifiable information (PII) will become ubiquitous for any mortgage documents leaving a lender.

2.) Audits involving regulations such as TRID, RESPA, HMDA, etc. will become automated.

3.) As more investors move back into the secondary markets, the need for an audit trail from documents to elements in a loan servicing system database will become a requirement.

Coping With Doc Management Complexities

The lending industry faces the challenge of managing very large volumes of unstructured documents that contain immense amounts of critical data. The process of classifying and keying data from these documents is labor intensive, time consuming and costly due to the sheer volume and complexity of the documents. In an industry where standardizing forms is not possible due to their varying sources and wide variety, an acceptable solution must be able to cope with this complexity.

Founded in 1993, Franklin American Mortgage Company (FAMC), a privately held mortgage-banking firm located in Franklin, Tennessee, is a full-service professional mortgage banker licensed to provide residential mortgages across the nation. FAMC, which offers a host of diverse, flexible mortgage packages for customers with a variety of backgrounds and needs, is committed to helping families and individuals achieve the dream of home ownership through its three divisions: retail, wholesale and correspondent.

FAMC offers borrowers, brokers and lenders the strength and security of a forward-thinking national mortgage company, dedicated to remaining an industry trendsetter. FAMC truly values its relationship with each customer and mortgage professional they work with, maintaining a company tradition of responsiveness and personalized service characteristic of a much smaller organization. This philosophy has enabled FAMC to become one of the fastest growing mortgage bankers in the nation.


The Challenge

The mortgage lending industry presents a number of unique challenges for manually classifying and managing the very large volumes of disparate documents that are ubiquitous within this industry.

>>It is common for a single mortgage loan to comprise 250 to 500 pages of documents of various sizes.

>>A mortgage loan may include over 275 different possible document types.

>>Manually sorting each set of loan documents can be a very labor-intensive and error-prone effort.

>>When scanning loan documents, significant labor is required to simply establish the first and last pages of the multiple page documents. This is most often done using the costly process of inserting “document separator” sheets prior to scanning.

>>To compete in this extremely competitive business, organizations need to look at cutting costs and streamlining their processes.

Manually preparing a batch for scanning by inserting document separator sheets and manually classifying loan documents is a labor-intensive process. Not only is it critical that this process be done accurately, but also that it be done efficiently in order to allow downstream underwriting and servicing decisions to be performed in a timely way.

Project Description

Franklin American Mortgage Company (FAMC) had been looking for an OCR technology vendor to streamline their ADR process and had spent a significant amount of time performing a due diligence process, which compared vendors of these technologies.

“We had attempted to use OCR in the past for Automated Document Recognition (ADR). Due to our prior experience and a variety of technical issues we were very skeptical about OCR.”

Because of the extremely large number of forms, and variations of those forms, that FAMC encounters, they required the flexibility offered by a non-template-based solution. In addition, the ideal solution needed to offer pre-built mortgage logic that would “understand” the vast majority of the document types and variations FAMC was required to recognize. This logic would allow FAMC to rapidly develop an ADR solution customized to their specific needs, using the ideal solution’s copyrighted mortgage rules as its foundation.

Today, FAMC scans millions of pages of mortgage documents per month. They no longer require their employees to insert document separator sheets to prepare a loan for the scanning process. Once scanned, the loans are processed using the industry-leading OCR solution for automated document recognition.

Document boundaries (first and last pages) are defined and document types are automatically identified. These processes are now done faster and with a fraction of the labor formerly required. To ensure extreme accuracy, sophisticated mortgage-lending business rules have been implemented as part of the solution’s exception process.
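
The boundary detection described above can be illustrated with a minimal Python sketch. It is not Paradatec's implementation; it assumes each page has already been classified with a document type and a flag indicating whether it looks like a leading page, and the `Page`, `Document`, and `split_into_documents` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    number: int        # position in the scanned stack
    doc_type: str      # page-level classification result (assumed given)
    is_leading: bool   # True if the page looks like the first page of a document


@dataclass
class Document:
    doc_type: str
    pages: list = field(default_factory=list)


def split_into_documents(pages):
    """Group a flat stack of classified pages into documents.

    A new document starts whenever a leading page is seen, or when the
    document type differs from that of the previous page.
    """
    documents = []
    for page in pages:
        starts_new = (
            not documents
            or page.is_leading
            or page.doc_type != documents[-1].doc_type
        )
        if starts_new:
            documents.append(Document(doc_type=page.doc_type))
        documents[-1].pages.append(page.number)
    return documents


stack = [
    Page(1, "Note", True),
    Page(2, "Note", False),
    Page(3, "Deed of Trust", True),
    Page(4, "Deed of Trust", False),
    Page(5, "Deed of Trust", True),  # a second copy directly behind the first
]
for doc in split_into_documents(stack):
    print(doc.doc_type, doc.pages)
```

Because the leading-page flag alone is enough to start a new document, two copies of the same document type scanned back to back are still separated, which is the job a separator sheet would otherwise do.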

Additional capabilities leveraged successfully at FAMC:

>>Verification provides list of very likely document types to further increase speed of verifying exceptions.

>>Ability to customize how documents are handled based on the division of business the documents come from.

>>Ability to quickly add new document types using the Paradatec exclusive automated learning objects.

>>Database lookups and business rule logic checks to ensure the highest degree of accuracy.

>>No scripting interface, with easily configurable rules to manage FAMC’s highly sophisticated ADR processing application.


The project was completed and is currently in production. The system is able to achieve 80% document recognition while keeping error rates low. This has allowed FAMC to position itself for an anticipated future increase in incoming document volume and provides them with a powerful competitive advantage. Today FAMC is processing millions of images per month, and doing so more accurately and with less production time than was formerly required.

“We asked a number of vendors, including the Paradatec team, to help us perform an extensive due diligence process which included a proof of concept test with our own documents. Paradatec was the clear winner based on our comprehensive vetting process.”

Paradatec’s PROSAR-AIDA is an advanced and unique OCR recognition technology. It is unique in that it utilizes neural networks technology and Artificial Intelligence (AI). PROSAR-AIDA is able to read structured, semi-structured, and unstructured documents. It makes ‘decisions’ about document characteristics in much the same way as a human being does, only many times faster and without human intervention.

PROSAR-AIDA takes a very different approach than other technologies. Because the recognition engine (a Paradatec exclusive) incorporated in PROSAR-AIDA is faster than any full-page OCR product on the market, it is able to process each image in less than two seconds on average. It does this without making any assumptions about content location on the page and without attempting to match zonal OCR templates. PROSAR-AIDA is capable of processing thousands of documents per hour with a single processor core, and it scales further through seamless use of the latest multi-core processor technologies and multi-server environments.

Because of Paradatec’s unique approach, and their ability to leverage a vast quantity of intellectual property, which they have built over the years specifically for mortgage loans, implementations can be completed in a fraction of the time normally required by others.

Loan Document Automation

The mortgage lending industry presents a number of unique challenges for classifying and extracting data from key documents, due in part to the large volumes of disparate documents in most loan files.


New documents and the regulations related to them put a new emphasis on the need for quick and very accurate data. Lenders in particular face significant penalties for inaccurate data and missed delivery deadlines. Sorting and capturing critical data from thousands of diverse documents has historically been labor intensive, slow, and expensive. To stay competitive, and meet these new and constantly changing challenges, automation through technology is no longer optional.

The key is finding a provider that specializes in automated document classification and data capture specifically for mortgage lending and the financial services industries, which scales to process millions of pages per day.

Leading edge OCR solutions offer significant efficiencies for classifying large quantities of differing document types and extracting key data elements from those documents.  In the mortgage market, these capabilities allow for quick and accurate identification of over 500 unique documents in the typical mortgage file, along with the ability to capture nearly any data element from those documents that an organization requires.

Here are some examples of applying this advanced technology to specific mortgage documents:

Application Processing

Extract relevant content from borrower-provided pay stubs, W-2s, bank statements, and tax documents to expedite underwriting and reduce origination costs.

Post-Close Processing

Identification of each document in the loan file, bringing structure to what was a 300+ page blob of content.

Verification that relevant documents have been signed

Compare key data elements from loan file with your systems of record to verify changes haven’t been made without your knowledge.

UCD File Generation

Create the Uniform Closing Dataset (“UCD”) file required (as of Sept 25, 2017) when selling loans to Fannie Mae and Freddie Mac

Reporting And Audit Automation

Extract key loan file data elements to support the following reporting/audit activities:

HMDA reporting – our system is ready to capture the additional demographic data on the new Uniform Residential Loan Application (effective Jan 1, 2018)

RESPA audit

TRID audit

Lenders can no longer afford to manually classify and manage large volumes of disparate documents. Manually preparing a batch for scanning by inserting document separator sheets and manually classifying loan documents is a labor-intensive, inefficient, and error-prone process. Not only is it critical that this process be done accurately, but also that it be done efficiently in order to allow downstream underwriting and servicing decisions to be performed in a timely way.

At the end of the day, it is about finding a provider that focuses its skills on delivering the most efficient, accurate, and flexible freeform document classification and data extraction solution available. The time is now for lenders to reduce manual labor costs and increase the accuracy of classifying and capturing data from loan documents.

Make Sure That Your OCR Tech Works

Paradatec, Inc., a provider of advanced Optical Character Recognition (OCR) solutions for mortgage file processing, announced the availability of their One-Day Blind Test Challenge. Paradatec’s OCR library identifies nearly 500 unique document types in the typical mortgage file, along with extracting over 6,000 data fields from those documents. Combining this library with Paradatec’s sub-second OCR processing engine creates a high level of performance and scalability. Through this new One-Day Blind Test Challenge, qualified organizations can see the power of Paradatec’s out-of-the-box OCR solution for themselves, providing a final buying decision validation point using samples of their own mortgage files.


“Many of our prospects have been disappointed in the results of past OCR initiatives, so they’re understandably cautious. Our One-Day Blind Test Challenge lets them run samples of their loans through our solution to validate our out-of-the-box performance claims. The Challenge will be conducted on-site rather than at our facilities, due in part to the confidential nature of the content, but to also minimize concerns about our skewing any results behind the scenes,” said Neil Fraser, Paradatec’s Director of US Operations.

Fraser continues, “Our mortgage OCR library offers clients a short implementation timeline while other solutions require development from the ground up. This One-Day Blind Test Challenge demonstrates the validity of our claim so prospects can be assured that Paradatec offers a robust and scalable solution ready to deliver productivity improvements in weeks rather than months or years.”

In the course of the Blind Test Challenge, the provided loan files will be indexed by document type; 100 data fields will be extracted from various key documents like the Note, Deed of Trust, Closing Disclosure, Appraisal, and W-2; and a bookmarked PDF of the loan will be produced, with the data extraction fields highlighted using Paradatec’s new WritePDF module. Fraser concludes his statements by saying, “Unfortunately, many companies base their buying decision primarily on price, only to be disappointed with the lack of true out-of-the-box mortgage-specific functionality offered by the product. In other cases, great claims are made regarding OCR automation rates, while the typical experience found with other products is something less impressive. We believe ours is the most expansive OCR offering available, such that we’ll gladly test it on a blind set of loans to show a prospect what makes Paradatec different.”

Vendor Releases Innovative Web Services API

Paradatec, Inc., developer of an Optical Character Recognition (OCR) solution for mortgage file processing, has released their web services API for real-time integration to their clients’ line-of-business applications. This new functionality can seamlessly transfer documents from the loan origination system (LOS) to the Paradatec solution for page classification and data extraction, with the Paradatec-produced results transferred back to the LOS in place of manual data entry.


“As application integration becomes tighter in response to the ongoing compression of service level timeframes, Paradatec’s new web services API stands ready to serve as the OCR extension to our clients’ line-of-business applications. Our first solution to leverage this capability is our new WriteUCD module, in which the final Closing Disclosure (CD) is submitted to us through the web services API, our OCR functionality extracts the relevant data from the CD, and WriteUCD then produces the corresponding Uniform Closing Dataset (UCD) file required by Fannie Mae and Freddie Mac,” said Neil Fraser, Paradatec, Inc.’s Director of US Operations.

“This new functionality allows for seamless OCR processing scaling from small document sets like borrower-provided paystubs and W-2s up to full loan files. With the ability to integrate tightly with any other web service-enabled application, we’re helping our clients create a very rich and efficient application ecosystem.”

Paradatec’s OCR solutions offer significant efficiencies for classifying large quantities of differing document types and extracting key data elements from those documents.  In the mortgage market, these capabilities allow for the quick and accurate identification of over 500 unique documents in the typical mortgage file, along with capturing nearly any data element from those documents that an organization requires.

OCR For Mortgage In Action

The financial services industry is challenged with managing large volumes of documents with varying layouts containing immense amounts of data, part of which is highly critical with regard to compliance. The traditional manual process for classifying and keying data from these documents is time consuming, error prone, and costly due to the sheer volume and complexity of the mortgage documents. In an industry where standardizing forms is not always possible due to their varying systems and points of origination, an acceptable automation solution must be able to properly and compliantly handle this variability.


Top-Five Originator. This bank is one of the largest in the United States. It is a leading lender offering a range of quality home loans, including government and conventional. These loans are provided through multiple channels.



The mortgage lending industry presents a number of unique challenges for classifying and extracting data from key documents. This is due in part to the large volumes of disparate document variations found in most loan files.

>>A typical incoming mortgage loan file may contain 250 to 600+ pages of documents of various sizes, comprising more than 250 potential document types. Older loan files may grow to well over 1000 pages.

>>Manually sorting each set of loan documents is a labor intensive and error prone effort, typically requiring the addition of document separator pages if the file is to be scanned.

>>Due to the sheer labor effort required, the typical level of detailed document sorting possible with a manual approach is very “coarse”. In other words, only the most critical documents and document groups are classified rather than attempting to identify all specific document types. An example of this limitation might be a manual grouping of a series of specific documents into a “Credit Documents Group” rather than breaking these out specifically by document types such as bank statements, credit reports, and brokerage statements.

>>To compete in this extremely competitive market segment, organizations are looking for ways to reduce costs and streamline their processes.

In addition to the challenges described above, this top five originator was looking for a solution to help automate the laborious task of providing data for a number of audit-centric applications. These ad-hoc projects commonly had tight timelines and included wide ranges of loans, and millions of pages to be audited.

Project Description

At the start of the project, this top five originator had in place a sophisticated document capture infrastructure feeding a well-known enterprise content management system. What was missing from this infrastructure was an advanced recognition module that could deal with the document variations expected in an organization serving borrowers across the nation.

The ideal solution needed to provide a seamless interface to this current capture infrastructure. This would greatly simplify the implementation by allowing the existing interfaces to both front-end scanning and back-end image storage to be largely unaffected by the addition of the recognition technology.

Prior to the installation of the new recognition components, a large team would manually classify incoming documents into a moderately broad range of categories or Document Groups. Once these documents had reached the enterprise content management system, a team of underwriters would review, manually enter data, and process the loan.

Limitations of this approach included:

>>Heavy reliance on the skills of the people manually classifying documents and extracting data. Error rates varied from operator to operator. Thus, a loss of a skilled operator for any reason had a negative impact.

>>Time is of the essence in any mortgage-processing environment. Using a human-centric approach meant that processing times were proportional to staff availability at any given moment.

>>People tend to be more expensive than computers and software.

>>Regulatory bodies as well as this originator would have preferred a greater granularity in the way documents were classified. However, this need was outweighed by the complexity and difficulty presented when attempting to teach and maintain a group of individuals in how to classify documents among over 250 possible choices.

The new extraction system was selected after an exhaustive evaluation process. A competing solution was initially tried. However, after months of tests, it was determined that a more advanced solution was available which had a number of capabilities that surpassed other solutions previously tested or reviewed:

>>This new solution was by far the fastest technology available for performing OCR on mortgage documents. Pre-production technical due diligence showed empirically that the system was capable of processing approximately 1 million images per day on a single twelve-core server.

>>This solution was able to use one set of rules to process and recognize all document variations. Because of the extremely large number of documents (and variations of each) that this top five originator encounters, they required the flexibility offered by a non-template-based ADR (Automated Document Recognition) and data extraction solution.

>>This solution offered pre-built mortgage logic, which “understands” the vast majority of the document types and variations that were required to be recognized. This solution allowed this originator to rapidly implement an ADR and data extraction solution for their specific needs.
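
To put the throughput figure quoted above in per-page terms, the following short Python calculation converts 1 million images per day on a twelve-core server into per-second and per-core rates. It assumes the server runs continuously for 24 hours and that load is spread evenly across cores; neither assumption is stated in the source, and the `throughput` helper is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400; assumes round-the-clock processing


def throughput(images_per_day, cores):
    """Convert a daily image volume into per-second and per-core rates."""
    per_second = images_per_day / SECONDS_PER_DAY
    per_core_per_second = per_second / cores
    core_seconds_per_image = 1 / per_core_per_second
    return per_second, per_core_per_second, core_seconds_per_image


per_second, per_core, core_seconds = throughput(1_000_000, 12)
print(f"{per_second:.1f} images/s overall")           # 11.6 images/s overall
print(f"{per_core:.2f} images/s per core")            # 0.96 images/s per core
print(f"{core_seconds:.2f} core-seconds per image")   # 1.04 core-seconds per image
```

Under those assumptions, each core spends roughly one second on each image.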

The initial focus was to implement an ADR solution that supported more than 250 different document types and potentially hundreds of variations of each document type. The vast majority of the pages in a loan are now identified automatically with no human intervention. The remaining exceptions are presented to operators who either accept the first choice page type or choose an alternative.

This system is able to narrow down the page types that are lexically possible based on the text on the page. Because of this, in most cases, the operator can choose from a list of no more than five alternate page types. This reduces errors and review time in the verification process.
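
One way to picture this narrowing step is the Python sketch below. It is only an illustration under simplifying assumptions, not the vendor's rule engine: the phrase lists in `PAGE_TYPE_PHRASES` are hypothetical, and scoring is a plain count of phrase hits on the page's OCR text.

```python
import re

# Hypothetical phrase lists; a production rule library would be far larger.
PAGE_TYPE_PHRASES = {
    "Note": ["promise to pay", "borrower's promise", "interest rate"],
    "Deed of Trust": ["security instrument", "trustee", "power of sale"],
    "Closing Disclosure": ["closing disclosure", "loan terms", "cash to close"],
    "Loan Estimate": ["loan estimate", "estimated closing costs", "cash to close"],
    "W-2": ["wage and tax statement", "employer identification number"],
    "Appraisal": ["appraised value", "comparable sales"],
}


def shortlist_page_types(page_text, max_candidates=5):
    """Return up to max_candidates page types ranked by phrase hits.

    Page types with no matching phrase are treated as lexically
    impossible for this page and are dropped from the list.
    """
    text = re.sub(r"\s+", " ", page_text.lower())
    scores = {}
    for page_type, phrases in PAGE_TYPE_PHRASES.items():
        hits = sum(1 for phrase in phrases if phrase in text)
        if hits:
            scores[page_type] = hits
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_candidates]


sample = "CLOSING DISCLOSURE  Loan Terms ... Estimated Cash to Close $12,000"
print(shortlist_page_types(sample))  # ['Closing Disclosure', 'Loan Estimate']
```

The first entry would be offered to the operator as the default choice, with the remaining entries as the short list of alternatives.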

Upon production implementation of the ADR solution, the focus shifted to automatic data extraction. A list of more than 1500 fields was identified for the first implementation phase of data extraction. Both this project and the ADR work that preceded it were initially implemented in one of the originator’s major channels in order to ensure a wide variety of document sources and variations.

Today both of the projects described above are in full production. The amount of manual labor previously required for these tasks has been reduced significantly. Error rates are lower than those of the manual processes that preceded implementation. The end-to-end processing time has been vastly reduced because much of the human labor has now been replaced by lightning fast computer CPU cycles. Additionally, this top five originator has implemented sophisticated downstream mortgage lending business rules to take advantage of the valuable data generated by the new system.

This top five originator, like any other mortgage lender, is subject to a variety of time-sensitive requests such as internal audits. These audits require that specific data be tabulated from each loan file and reported to the appropriate entity. In some cases, the volume of loans included in these audits can reach into the tens of thousands, with a very limited response timeframe. With the system now in production, it is possible for this organization to be more agile than in the past. New data fields can be configured and tested in a few hours and a million images can now be interrogated for salient data overnight.

Additional capabilities leveraged successfully at this customer include:

>>Verification provides a list of likely document types to further increase speed of verifying exceptions.

>>Ability to customize how documents are handled based on the type of process to be conducted (e.g. origination, servicing, audit, etc.).

>>Ability to quickly recognize additional document types using the automated learning facility.

>>Database lookups and business rule logic checks to ensure the highest degree of data accuracy.

>>No scripting interface, with easily configurable rules to modify customers’ highly sophisticated ADR and data extraction processes.

>>Ability to add processor cores (including new servers) to the environment in a matter of minutes to quickly scale and meet tight deadlines or increased staffing demands.


The project was successfully implemented and released to production on time. As a result of this experience with both the Paradatec staff and the Paradatec solution, this customer is prepared to act as a reference on behalf of Paradatec. Prospective clients are encouraged to take advantage of this opportunity.

Paradatec is rapidly approaching the significant milestone of processing 300,000,000 pages annually for this client alone. As a company, Paradatec processes several billion pages per year.

Paradatec’s solution is an advanced and unique OCR recognition technology. It utilizes neural networks technology and artificial intelligence and is able to read structured, semi-structured, and unstructured documents. It then makes ‘decisions’ about document characteristics in much the same way as a human being does— only many times faster and without human intervention.

Paradatec takes a very different approach from other OCR forms processing technologies in that it is a truly template-free design, allowing the system to easily cope with the varying layouts of each document. In performance terms, Paradatec is capable of processing thousands of documents per hour with a single processor. It provides even further scalability by offering seamless support for the latest in multi-core processor technologies and multi-server configurations.

Per Neil Fraser, Director of US Operations, “To be chosen by such a high-profile client for a project of this size was a vote of confidence for Paradatec and our leading edge technology. I would encourage other similarly placed clients to reach out to Paradatec to set up a ‘One-Day Blind Test Challenge’. In just a day it is possible to see what this technology can do, right out of the box.”

Paradatec Named Verified UCD Producer By Freddie Mac

Paradatec, Inc., a provider of Optical Character Recognition (OCR) solutions for mortgage file processing, announced that it is a verified technology integration vendor for Freddie Mac’s Loan Closing Advisor platform. Paradatec’s WriteUCD module was developed in accordance with Freddie Mac’s requirements for producing valid Uniform Closing Dataset (UCD) files.


The UCD is a common collection of data that mortgage lenders will be required to deliver digitally to Freddie Mac and Fannie Mae starting on Sept. 25, 2017. This requirement is part of the Uniform Mortgage Data Program (UMDP), an industry-wide drive to build a better housing finance system in the United States.

The WriteUCD module leverages Paradatec’s advanced OCR solution for the mortgage market to extract data from closing disclosure (CD) documents in mere seconds per page and then output that data in the required format.

According to Neil Fraser, Paradatec’s Director of US Operations, “We’re pleased to have obtained Freddie Mac validation as our clients need the assurance that they can meet the GSEs’ requirements well in advance of the September deadline. If a lender’s current loan origination system partner or document provider is struggling to produce a valid UCD file, they can sleep soundly knowing that Paradatec has them covered with our new WriteUCD module.”

Paradatec also announces the release of their new AuditUCD module for auditing UCD file content against the closing disclosure contained in the UCD file. Fraser continues, “Since we’re building the UCD file from extracted closing disclosure data, it’s just as easy for us to unpack a UCD file’s content to compare the individual data elements against the values extracted from the submitted CD to verify the integrity of both components in the UCD file. Any elements that don’t match will be flagged in our XML output for further review and resolution. Given the volume of content that will be produced and need verification with this UCD initiative, our solution is uniquely positioned to offer a high degree of automation and operator efficiency.”
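
The element-by-element comparison Fraser describes can be sketched in a few lines of Python. This is a rough illustration only, not the AuditUCD module: it assumes both the UCD content and the values extracted from the CD have already been reduced to simple name/value dictionaries, and the element names and XML output layout are hypothetical rather than taken from the UCD specification.

```python
import xml.etree.ElementTree as ET


def audit_ucd(ucd_values, cd_values):
    """Compare UCD data elements with values extracted from the CD.

    Both arguments are plain dicts of element name -> string value.
    Returns an XML element listing every data element and whether it matched.
    """
    root = ET.Element("AuditResult")
    for name in sorted(set(ucd_values) | set(cd_values)):
        ucd = ucd_values.get(name)
        cd = cd_values.get(name)
        # Compare after trimming whitespace; a missing value never matches.
        matched = ucd is not None and cd is not None and ucd.strip() == cd.strip()
        status = "match" if matched else "mismatch"
        item = ET.SubElement(root, "Element", {"name": name, "status": status})
        ET.SubElement(item, "UCDValue").text = ucd or ""
        ET.SubElement(item, "CDValue").text = cd or ""
    return root


# Hypothetical element names and values, for illustration only.
ucd = {"LoanAmount": "250000.00", "InterestRate": "4.125", "CashToClose": "12000.00"}
cd = {"LoanAmount": "250000.00", "InterestRate": "4.250", "CashToClose": "12000.00"}

result = audit_ucd(ucd, cd)
flagged = [e.get("name") for e in result if e.get("status") == "mismatch"]
print(flagged)  # ['InterestRate']
print(ET.tostring(result, encoding="unicode"))
```

A reviewer would then only need to look at the flagged elements rather than re-keying the entire closing disclosure.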
