VisFocus:
Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

ECCV 2024

1Ofir Abramovich, 2Niv Nayman, 2Sharon Fogel, 2Inbal Lavi, 2Ron Litman,
2Shahar Tsiper, 2Royee Tichauer, 2Srikar Appalaraju, 2Shai Mazor, 2R. Manmatha

1Reichman University  2Amazon AWS AI Labs

Abstract

In recent years, notable advancements have been made in the domain of visual document understanding, with the prevailing architecture comprising a cascade of vision and language models. The text component can either be extracted explicitly with the use of external OCR models in OCR-based approaches, or alternatively, the vision model can be endowed with reading capabilities in OCR-free approaches. Typically, the queries to the model are input exclusively to the language component, necessitating the visual features to encompass the entire document. In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. To do so, we replace the down-sampling layers with layers that receive the input prompt and allow highlighting relevant parts of the document, while disregarding others. We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder in place of the prompt, to empower the model with focusing capabilities. Consequently, VisFocus learns to allocate its attention to text patches pertinent to the provided prompt. Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance, achieving state-of-the-art results on various benchmarks.

Video

Overview

VisFocus is an OCR-free method for dense document understanding that aims to better utilize the vision encoder's capacity by directly coupling it with language prompts. This is achieved using a combination of architectural enhancements and a novel pre-training task.

Overview figure
Method Overview.

Method

VisFocus introduces two main contributions:

Localized Masked Prompt Modeling (LMPM)

LMPM aims to guide the model's attention towards relevant sections of a document through the following steps (a code sketch is given below the figure):

Sampling a text snippet from the document's pre-extracted text

Randomly masking portions of the text snippet

Feeding the masked text snippet to the vision encoder

Training the model to predict the masked tokens

LMPM
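
To make the four steps above concrete, here is a minimal sketch of the data preparation for this pre-training task. It is illustrative only: the whitespace tokenization, the <mask> string, and the vision_encoder / language_model placeholders in the comments are hypothetical stand-ins, not the authors' implementation.

import random

MASK_TOKEN = "<mask>"

def build_lmpm_example(document_text, snippet_len=32, mask_ratio=0.15):
    """Sample a local snippet of the document's pre-extracted text and mask random words."""
    words = document_text.split()
    start = random.randint(0, max(0, len(words) - snippet_len))
    snippet = words[start:start + snippet_len]

    masked_snippet, targets = [], []
    for word in snippet:
        if random.random() < mask_ratio:
            masked_snippet.append(MASK_TOKEN)   # hide the word from the vision encoder
            targets.append(word)                # ...and ask the language model to recover it
        else:
            masked_snippet.append(word)
    return " ".join(masked_snippet), targets

# Conceptually, one pre-training step then looks like:
#   visual_feats = vision_encoder(page_image, prompt=masked_snippet)  # snippet replaces the prompt
#   loss = language_model(visual_feats, labels=targets)               # predict the masked words

Because the masked snippet is fed where the user prompt normally goes, the vision encoder learns to focus on the document regions that contain the masked words.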

Vision-Language Merging Attention (ViLMA)

ViLMA layers enable direct interaction between the visual features and the language prompt.
They replace traditional down-sampling layers in the vision encoder, allowing the model to highlight relevant parts of the document based on the input prompt while disregarding less important areas.

ViLMA
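
As a rough illustration of the idea, the sketch below shows a prompt-conditioned patch-merging layer: visual patch tokens cross-attend to the embedded prompt before a standard 2x2 spatial merge. The dimensions, attention configuration, and class name are assumptions made for illustration, not the exact ViLMA design from the paper.

import torch
import torch.nn as nn

class PromptGuidedMerging(nn.Module):
    """Illustrative prompt-conditioned down-sampling layer (not the exact ViLMA design).

    Visual patch tokens attend to the prompt tokens, then 2x2 neighbourhoods are
    merged as in Swin-style encoders, halving the spatial resolution.
    """
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.merge = nn.Linear(4 * dim, 2 * dim)

    def forward(self, visual_tokens, prompt_tokens, h, w):
        # visual_tokens: (B, h*w, dim); prompt_tokens: (B, L, dim); h, w assumed even.
        attended, _ = self.cross_attn(query=visual_tokens,
                                      key=prompt_tokens,
                                      value=prompt_tokens)
        x = self.norm(visual_tokens + attended)   # fuse prompt information into the patches

        # Standard 2x2 patch merging: concatenate four neighbours, project to 2*dim.
        b, _, dim = x.shape
        x = x.reshape(b, h, w, dim)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        x = x.reshape(b, (h // 2) * (w // 2), 4 * dim)
        return self.merge(x)                      # (B, h*w/4, 2*dim)

Placing such layers where the encoder would otherwise down-sample means every resolution reduction is conditioned on the prompt, which is the role ViLMA plays inside the vision encoder.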

Results

Comparison with previous OCR-Free methods on VQA benchmarks.

main_results
VisFocus outperforms previous methods of comparable scale, even when trained on substantially less pre-training data. We report ANLS on DocVQA and InfoVQA, Relaxed Accuracy (RA) on ChartQA, and Exact Match (EM) on OCR-VQA and AI2D. For fully-trained methods, we state only the total number of parameters.
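
For reference, ANLS (Average Normalized Levenshtein Similarity) scores a prediction by its edit similarity to the closest ground-truth answer, zeroing out matches below a threshold of 0.5 as in the standard DocVQA protocol; the dataset score is the mean over questions. A minimal sketch of the per-question score:

def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution / match
        prev = curr
    return prev[-1]

def anls_score(prediction, ground_truths, tau=0.5):
    """Per-question ANLS: similarity to the closest reference, zeroed below tau."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best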

Empirical Analysis

We perform an extensive ablation study on each of VisFocus' components. More ablations can be found in the paper.

ablation_table
Breaking down the contributions of VisFocus’ main components.

ablation_lmpm
Attention maps of the last ViLMA layer, with and without LMPM.

Cite Us

If you find our work useful for your research, please cite us!


@misc{abramovich2024visfocus,
    title={VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding},
    author={Ofir Abramovich and Niv Nayman and Sharon Fogel and Inbal Lavi and Ron Litman and Shahar Tsiper
            and Royee Tichauer and Srikar Appalaraju and Shai Mazor and R. Manmatha},
    year={2024},
    eprint={2407.12594},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}