Corporate documents such as contracts, reports, invoices, and receipts have complex layouts. Automatically interpreting and analyzing these documents can be highly useful and can lead to AI-driven solutions. However, these documents pose many challenges, as their rich semantics intersect textual and spatial modalities: the complex layout of a document provides important visual cues needed to interpret it effectively.
Document AI (DocAI) has made significant advances in areas such as question answering, classification, and extraction, but real-world applications continue to face persistent hurdles related to accuracy, reliability, contextual understanding, and generalization to new domains.
To address these issues, a team of researchers at JPMorgan AI Research introduced DocLLM, a lightweight extension of traditional large language models (LLMs) that considers both the semantics and the spatial layout of text and was created specifically for reasoning over visual documents.
DocLLM is multimodal in nature because it represents both the semantics and the spatial layout of the text. In contrast to traditional methods, it makes expensive visual encoders unnecessary: spatial layout information is added through bounding box coordinates obtained with optical character recognition (OCR). This design decision reduces processing time, increases model size only slightly, and preserves the causal decoder architecture.
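OCR bounding boxes are typically normalized to a fixed, resolution-independent grid before being fed to a model. The helper below is an illustrative sketch of that common preprocessing step; the function name, the 0–1000 grid scale, and the `(x0, y0, x1, y1)` coordinate convention are assumptions for illustration, not details from the paper.

```python
def normalize_bbox(bbox, page_width, page_height, scale=1000):
    """Map a pixel-space OCR bounding box (x0, y0, x1, y1) onto a
    resolution-independent 0..scale grid, so coordinates from pages
    of different sizes become comparable."""
    x0, y0, x1, y1 = bbox
    return (round(scale * x0 / page_width),
            round(scale * y0 / page_height),
            round(scale * x1 / page_width),
            round(scale * y1 / page_height))

# A token's input would then pair its text with the normalized box,
# e.g. ("Invoice", normalize_bbox((100, 50, 200, 100), 1000, 500)).
```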
The team notes that spatial layout structure alone is sufficient for many document intelligence tasks, such as form understanding, table alignment, and visual question answering. The method extends the standard transformer self-attention mechanism to capture cross-modal interactions by disentangling spatial information from textual information.
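The disentangled attention idea can be sketched as a weighted sum of text-text, text-spatial, spatial-text, and spatial-spatial interaction terms computed under a causal mask. The NumPy toy below only shows the shape of the computation; the weight-matrix names and the `lambdas` mixing coefficients are hypothetical, and the paper defines the exact formulation.

```python
import numpy as np

def disentangled_attention(T, S, Wq_t, Wk_t, Wq_s, Wk_s, lambdas):
    """Toy causal attention mixing text (T) and spatial (S) features.

    The score matrix is a weighted sum of four interaction terms:
    text-text, text-spatial, spatial-text, spatial-spatial.
    """
    Qt, Kt = T @ Wq_t, T @ Wk_t          # text queries/keys
    Qs, Ks = S @ Wq_s, S @ Wk_s          # spatial queries/keys
    d = Qt.shape[-1]
    scores = (lambdas[0] * Qt @ Kt.T +
              lambdas[1] * Qt @ Ks.T +
              lambdas[2] * Qs @ Kt.T +
              lambdas[3] * Qs @ Ks.T) / np.sqrt(d)
    # Causal mask: each token attends only to itself and earlier tokens.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax over the keys.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Because the spatial terms use their own projection matrices, layout can influence attention without being fused into the token embeddings, which is what preserves the lightweight causal-decoder design described above.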
Visual documents often contain fragmented text sections, irregular layouts, and heterogeneous content. To address this, the researchers modify the pre-training objective during the self-supervised pre-training phase: an infilling objective is adopted to accommodate varied text arrangements and cohesive blocks of text. This adjustment allows the model to handle mixed data types, complex layouts, contextual completion, and misaligned text more effectively.
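An infilling objective of this kind can be illustrated with a toy preprocessing function that removes a contiguous block of tokens and asks the model to regenerate it from the surrounding context. The helper and the `<mask>`/`<infill>` token names below are purely illustrative, not DocLLM's actual implementation.

```python
import random

def make_infilling_example(tokens, block_len=3,
                           mask_token="<mask>", sep="<infill>"):
    """Build a (corrupted_input, target) pair for block infilling.

    A contiguous block of tokens is cut out and replaced by a single
    mask token; the model is trained to generate the missing block
    after the separator, using context on both sides.
    """
    start = random.randrange(0, len(tokens) - block_len + 1)
    block = tokens[start:start + block_len]                # target span
    corrupted = tokens[:start] + [mask_token] + tokens[start + block_len:]
    return corrupted + [sep], block
```

Predicting whole blocks rather than single next tokens is what lets the model learn cohesive text segments despite fragmented, misaligned input.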
DocLLM's pre-trained knowledge is then fine-tuned on instruction data drawn from multiple datasets for a variety of document intelligence tasks, including document classification, visual question answering, natural language inference, and key information extraction.
The instruction-tuning data covers both single-page and multi-page documents and can include layout cues such as field delimiters, titles, and captions to help the model understand a document's logical structure. Applied to the Llama2-7B model, DocLLM's changes yielded significant performance improvements ranging from 15% to 61% on four of the five previously unseen datasets.
The team summarizes their main contributions as follows:
- A lightweight extension of a typical LLM, designed specifically for visual document interpretation, was introduced.
- A unique attention mechanism is proposed that can distinguish between textual and spatial information, allowing efficient capture of cross-modal alignment between layout and text.
- Pre-training objectives are outlined to address the problems caused by the irregular layouts of visual documents.
- Specialized instruction-tuning datasets are curated for visual document intelligence tasks to fine-tune the model effectively.
- Detailed experiments were performed, yielding important insights into how the proposed model behaves and performs on visual document understanding tasks.
Check out the paper. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final year student at University of Petroleum and Energy Research, Dehradun, pursuing a Bachelor's degree in Computer Science Engineering with specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills, and a keen interest in learning new skills, leading groups, and managing work in an organized manner.