
Document data extraction: How to extract data from unstructured documents


What is document data extraction?

Document data extraction is the process of pulling relevant data out of unstructured documents such as scanned copies, PDFs, and even handwritten documents. Using OCR and intelligent document processing (IDP), also known as document AI, relevant information can be extracted from unstructured documents with high accuracy.

Document AI: Use of OCR and AI 

Document AI uses machine learning techniques along with OCR to extract data from scanned documents. OCR is a technology that converts scanned images of text into machine-readable text. It recognizes individual characters and returns the extracted text either line by line or word by word. However, OCR by itself does not understand the layout, structure, or context of a document, and it cannot interpret content beyond recognizing individual characters. Because OCR lacks the ability to comprehend context, it cannot determine the relationships between different elements within a document, such as words, sentences, tables, and images. For instance, it cannot recognize which pieces of information are related to each other, such as identifying the sender and recipient of a letter or associating products with their respective prices on an invoice.
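To make this concrete, here is a minimal sketch of plain OCR using the open-source Tesseract engine through the pytesseract wrapper; the file name is a placeholder, and the Tesseract binary must be installed separately. Note that the output is just a flat string, with no fields, tables, or relationships.

import pytesseract
from PIL import Image

# Load a scanned document image (placeholder path for illustration)
image = Image.open("invoice_scan.png")

# Tesseract recognizes the characters and returns them as one flat
# string: no key/value pairs, no notion of layout or related fields.
raw_text = pytesseract.image_to_string(image)
print(raw_text)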

To overcome the limitations of OCR, AI techniques are applied on top of OCR. Document AI, also known as Intelligent Document Processing, combines OCR with machine learning and NLP (natural language processing) techniques to understand the context, structure, and content of the document.

By leveraging AI, Document AI can comprehend the layout and structure of the document, making it possible to identify and extract relevant information regardless of its location within the document. It can recognize relationships between different elements, extract key data points, and categorize information based on its meaning.

By integrating AI with OCR, businesses and organizations can automate document processing tasks, such as data entry, information extraction, and content analysis. This significantly reduces manual effort, enhances accuracy, and speeds up document-based workflows.

What AI models are used to extract data from documents?

There are several open-source AI models that can be used for data extraction from unstructured documents.

Donut: The paper “OCR-free Document Understanding Transformer” introduced the Donut model, which comprises an image Transformer encoder and an autoregressive text Transformer decoder. This combination enables the Donut model to excel in document understanding tasks, including document image classification, form comprehension, and visual question answering. The authors of this model are Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park.
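As a hedged sketch, the publicly released Donut checkpoint fine-tuned for document visual question answering can be run through Hugging Face Transformers roughly as follows; the image path and question are placeholders for illustration.

import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-docvqa"
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("receipt.png").convert("RGB")
question = "What is the total amount?"
prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"

# Encode the image and the task prompt, then let the decoder generate
# the answer autoregressively (no separate OCR step is needed).
pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids)

# Strip special tokens and the task start token, then convert to JSON.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))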

LayoutXLM: The “LayoutXLM” model was introduced in the research paper titled “LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding.” The authors of this model are Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, and Furu Wei. LayoutXLM is an extension of the LayoutLMv2 model and is specifically designed for multilingual document understanding tasks. It has been trained on a diverse dataset of 53 different languages, making it capable of handling visually-rich documents in various languages.
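A minimal, hedged sketch of loading LayoutXLM with Hugging Face Transformers is shown below. LayoutXLM shares the LayoutLMv2 architecture, so the LayoutLMv2 model classes are used; the number of labels is an illustrative assumption, and the model requires detectron2 and Tesseract to be installed.

from PIL import Image
from transformers import LayoutXLMProcessor, LayoutLMv2ForTokenClassification

processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutxlm-base",
    num_labels=5,  # placeholder label count, e.g. invoice field types
)

image = Image.open("multilingual_invoice.png").convert("RGB")

# The processor runs OCR internally (via Tesseract) and builds the
# token, bounding-box, and image inputs the model expects.
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)

# One predicted label per token; a real pipeline would map these
# back to words and label names.
predicted_labels = outputs.logits.argmax(-1)
print(predicted_labels)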

LayoutLMv3: The research paper “LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking” introduces the LayoutLMv3 model. The authors of this model are Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3 is a refined version of LayoutLMv2 that simplifies the architecture by using patch embeddings (similar to ViT) instead of relying on a CNN backbone. The model is pre-trained with three main objectives: masked language modeling (MLM), masked image modeling (MIM), and word-patch alignment (WPA). This approach enables LayoutLMv3 to effectively handle Document AI tasks that involve both text and image understanding.
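For comparison, here is a hedged sketch of token classification with LayoutLMv3 through Hugging Face Transformers; the checkpoint and label count are illustrative assumptions rather than a production setup, and Tesseract is assumed for the built-in OCR step.

from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    num_labels=7,  # placeholder, e.g. header/question/answer field tags
)

image = Image.open("form.png").convert("RGB")

# The processor performs OCR and produces token ids plus normalized
# bounding boxes; no CNN backbone or detectron2 is needed here.
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)

# One label prediction per token position.
print(outputs.logits.argmax(-1))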

Pretrained and ready-to-use AI models

The above models are trained on general datasets. Developers need to take these models and fine-tune them on specific document types, such as invoices and receipts, to capture data from those documents accurately. There are also pre-trained AI models available for receipts and invoices: developers can use a readily available API to process invoices and receipts and receive the extracted data through a webhook in JSON format, as in the sketch below. The captured data can then easily be integrated into any existing system through APIs.
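The following is a purely illustrative sketch of that workflow; the endpoint URL, authentication header, field names, and webhook payload are hypothetical placeholders, not a specific vendor's API.

import requests

API_URL = "https://api.example.com/v1/invoices"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                          # placeholder credential

# Upload a document and register a webhook to receive the result.
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},
        data={"webhook_url": "https://yourapp.example.com/webhooks/extraction"},
    )

print(response.status_code, response.json())

# Later, the service would POST the structured result to the webhook,
# e.g. {"vendor": "...", "invoice_number": "...", "total": "..."},
# which can then be pushed into an ERP or accounting system via its API.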

AI has changed the way people process documents. It not only saves massive amounts of time but also increases accuracy and overall efficiency in the workflow. The IDP market is growing fast because of these advantages, and more companies are automating their workflows using Intelligent Document Processing.
