Li Sun
2025
DocAgent: An Agentic Framework for Multi-Modal Long-Context Document Understanding
Li Sun | Liu He | Shuyue Jia | Yangfan He | Chenyu You
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent advances in large language models (LLMs) have demonstrated significant promise in document understanding and question-answering. Despite the progress, existing approaches can only process short documents due to limited context length or fail to fully leverage multi-modal information. In this work, we introduce DocAgent, a multi-agent framework for long-context document understanding that imitates human reading practice. Specifically, we first extract a structured, tree-formatted outline from documents to help agents identify relevant sections efficiently. Further, we develop an interactive reading interface that enables agents to query and retrieve various types of content dynamically. To ensure answer reliability, we introduce a reviewer agent that cross-checks responses using complementary sources and maintains a task-agnostic memory bank to facilitate knowledge sharing across tasks. We evaluate our method on two long-context document understanding benchmarks, where it bridges the gap to human-level performance by surpassing competitive baselines, while maintaining a short context length. Our code is available at https://github.com/lisun-ai/DocAgent.
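The abstract describes an agentic workflow: a tree-formatted outline for navigation, a reader agent that retrieves only relevant sections, a reviewer agent that cross-checks answers, and a task-agnostic memory bank. The sketch below is an illustrative reconstruction of that workflow, not the released DocAgent code; all class names are hypothetical and the LLM call is stubbed out.

```python
# Illustrative sketch of the DocAgent-style workflow described in the abstract.
# Not the paper's implementation; names and the stubbed llm() call are assumptions.
from dataclasses import dataclass, field


@dataclass
class OutlineNode:
    """One node of the tree-formatted document outline."""
    title: str
    text: str = ""
    children: list["OutlineNode"] = field(default_factory=list)

    def find(self, keyword):
        """Return nodes whose title mentions the keyword (depth-first)."""
        hits = [self] if keyword.lower() in self.title.lower() else []
        for child in self.children:
            hits += child.find(keyword)
        return hits


def llm(prompt):
    # Stand-in for an actual LLM call.
    return f"answer based on: {prompt[:60]}..."


class ReaderAgent:
    """Navigates the outline and answers from retrieved sections only."""
    def __init__(self, outline, memory):
        self.outline, self.memory = outline, memory

    def answer(self, question, keyword):
        sections = self.outline.find(keyword)           # query the outline, not the full document
        context = "\n".join(n.text for n in sections)
        self.memory.setdefault("evidence", []).append(context)  # share evidence across tasks
        return llm(f"{question}\n{context}")


class ReviewerAgent:
    """Cross-checks a draft answer against evidence stored in the shared memory bank."""
    def __init__(self, memory):
        self.memory = memory

    def review(self, question, draft):
        evidence = "\n".join(self.memory.get("evidence", []))
        return llm(f"Cross-check '{draft}' for '{question}' against:\n{evidence}")


doc = OutlineNode("Report", children=[
    OutlineNode("Methods", "We sampled 120 sites."),
    OutlineNode("Results", "Accuracy improved by 12%."),
])
memory = {}                                             # task-agnostic memory bank
reader, reviewer = ReaderAgent(doc, memory), ReviewerAgent(memory)
draft = reader.answer("How many sites were sampled?", "methods")
print(reviewer.review("How many sites were sampled?", draft))
```

Because the reader agent only pulls the outline nodes that match the query, the context passed to the model stays short even for long documents, which is the property the abstract emphasizes.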
2023
From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding
Li Sun | Florian Luisier | Kayhan Batmanghelich | Dinei Florencio | Cha Zhang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens. This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes. This fixed vocabulary limits the model’s robustness to spelling errors and its capacity to adapt to new domains. In this work, we introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach: one at the word level and another at the sequence level. Concretely, we design an intra-word module that uses a shallow Transformer architecture to learn word representations from their characters, and a deep inter-word Transformer module that contextualizes each word representation by attending to the entire word sequence. Our model thus directly operates on character sequences with explicit awareness of word boundaries, but without a biased sub-word or word-level vocabulary. Experiments on various downstream tasks show that our method outperforms strong baselines. We also demonstrate that our hierarchical model is robust to textual corruption and domain shift.
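The two-level architecture in the abstract (a shallow intra-word Transformer over characters, pooled into word vectors that a deeper inter-word Transformer then contextualizes) can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed layer sizes and mean pooling, not the paper's model; positional encodings and the output head are omitted for brevity.

```python
# Minimal sketch of a hierarchical character-to-word encoder (illustrative only).
import torch
import torch.nn as nn


class HierarchicalCharWordEncoder(nn.Module):
    def __init__(self, n_chars=256, d_model=256, intra_layers=2, inter_layers=6):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model, padding_idx=0)
        intra = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        inter = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.intra_word = nn.TransformerEncoder(intra, num_layers=intra_layers)  # shallow, per word
        self.inter_word = nn.TransformerEncoder(inter, num_layers=inter_layers)  # deep, across words

    def forward(self, char_ids):
        # char_ids: (batch, n_words, n_chars_per_word); 0 marks character padding.
        b, w, c = char_ids.shape
        x = self.char_emb(char_ids).view(b * w, c, -1)            # treat each word as a short sequence
        pad = (char_ids == 0).view(b * w, c)
        x = self.intra_word(x, src_key_padding_mask=pad)          # characters attend within a word
        word_vecs = x.mean(dim=1).view(b, w, -1)                  # pool characters into a word vector
        return self.inter_word(word_vecs)                         # words attend across the sentence


enc = HierarchicalCharWordEncoder()
chars = torch.randint(1, 256, (2, 8, 12))     # 2 sentences, 8 words, 12 characters per word
print(enc(chars).shape)                       # torch.Size([2, 8, 256])
```

Because word vectors are built on the fly from characters, the model needs no fixed token vocabulary, which is what gives the approach its robustness to misspellings and unseen domains.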