Xiangyu Wong
2025
Hierarchical Memory Organization for Wikipedia Generation
Eugene J. Yu | Dawei Zhu | Yifan Song | Xiangyu Wong | Jiebin Zhang | Wenxuan Shi | Xiaoguang Li | Qun Liu | Sujian Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.
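The abstract's core idea of organizing fine-grained memory units into a Wikipedia-style hierarchy that guides generation can be illustrated with a minimal data-structure sketch. This is an illustrative assumption about the framework's shape, not the authors' implementation; the class and field names (`MemoryUnit`, `MemoryNode`, `outline`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryUnit:
    """A fine-grained fact extracted from a web document, kept
    traceable via its source for citation linking."""
    text: str
    source_url: str

@dataclass
class MemoryNode:
    """A node in a Wikipedia-style hierarchical memory: a section
    heading, the memory units filed under it, and subsections."""
    heading: str
    units: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def outline(self, depth=0):
        """Render the memory tree as the article outline that would
        guide generation, one indented line per section."""
        lines = ["  " * depth + self.heading]
        for child in self.children:
            lines.extend(child.outline(depth + 1))
        return lines

# build a tiny two-level memory tree
root = MemoryNode("Topic")
history = MemoryNode("History")
history.units.append(MemoryUnit("Founded in 1990.", "https://example.org"))
root.children.append(history)
print("\n".join(root.outline()))
```

Because each sentence of the generated article can point back to the `MemoryUnit` objects in the section it was written from, this structure naturally supports the citation module the abstract describes.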
LongAttn: Selecting Long-context Training Data via Token-level Attention
Longyun Wu | Dawei Zhu | Guangxiang Zhao | Zhuocheng Yu | Junfeng Ran | Xiangyu Wong | Lin Sun | Sujian Li
Findings of the Association for Computational Linguistics: ACL 2025
With the development of large language models (LLMs), there has been an increasing need for significant advancements in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with **long-range dependencies** is crucial. Existing methods for selecting long-context data often rely on sentence-level analysis, which leaves considerable room for improvement in both performance and efficiency. In this paper, we propose a novel token-level framework, **LongAttn**, which leverages the self-attention mechanism of LLMs to measure the **long-range dependencies** of the data. By calculating token-level dependency strength and the distribution uniformity of token scores, LongAttn effectively quantifies **long-range dependencies**, enabling more accurate and efficient data selection. We filter **LongABC-32K** from open-source long-context datasets (ArXiv, Book, and Code). Through our comprehensive experiments, LongAttn has demonstrated its excellent **effectiveness**, **scalability**, and **efficiency**. We will release our code and the high-quality long-context dataset **LongABC-32K** in the future.
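The two quantities the abstract names, per-token dependency strength and distribution uniformity, can be sketched from an attention matrix alone. The scoring below is a hypothetical reading of the abstract, not the paper's published algorithm: dependency strength is taken as the attention mass a token places outside a local window, and uniformity as the normalized entropy of those per-token strengths; the window size and the toy attention matrix are assumptions.

```python
import numpy as np

def long_range_score(attn, window=4):
    """Score one sequence from its token-by-token attention matrix
    (rows = queries, columns = keys, each row summing to 1).

    Returns (mean dependency strength, uniformity), where strength is
    the attention mass placed beyond `window` tokens of the query, and
    uniformity is the entropy of per-token strengths normalized to [0, 1].
    """
    n = attn.shape[0]
    idx = np.arange(n)
    # mask selecting attention targets outside the local window
    distant = np.abs(idx[None, :] - idx[:, None]) > window
    strength = (attn * distant).sum(axis=1)       # per-token long-range mass
    total = strength.sum()
    p = strength / total if total > 0 else np.full(n, 1.0 / n)
    uniformity = -(p * np.log(p + 1e-12)).sum() / np.log(n)
    return strength.mean(), uniformity

# toy causal attention: each token attends uniformly to itself and all
# previous tokens, so later tokens carry more long-range mass
n = 16
attn = np.tril(np.ones((n, n)))
attn /= attn.sum(axis=1, keepdims=True)
mean_strength, uniformity = long_range_score(attn)
print(round(mean_strength, 3), round(uniformity, 3))
```

A data-selection pipeline along these lines would rank candidate sequences by such scores and keep the ones whose attention spreads far and evenly, rather than judging dependency at the sentence level.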