Zhipeng Xu
2026
ThinkNote: Enhancing Knowledge Integration and Utilization of Large Language Models via Constructivist Cognition Modeling
Zhipeng Xu | Zhenghao Liu | Yukun Yan | Shuo Wang | Shi Yu | Zheni Zeng | Chaojun Xiao | Zhiyuan Liu | Ge Yu | Chenyan Xiong
Findings of the Association for Computational Linguistics: EACL 2026
Large Language Models (LLMs) have demonstrated strong performance across a wide range of NLP tasks. However, they often exhibit suboptimal behaviors and inconsistencies when exposed to unfamiliar external information, underscoring their limitations in effectively leveraging such knowledge. Inspired by constructivist learning theory, we propose ThinkNote, a novel framework that enhances the external knowledge utilization of LLMs through a two-stage constructivist cognitive modeling process. Specifically, ThinkNote performs knowledge assimilation to align new information with the model’s parametric memory, forming a coherent internal representation. It then applies thought accommodation to adapt internal reasoning, thereby promoting more consistent and reliable outputs. Extensive experimental results demonstrate that ThinkNote achieves a 10% improvement over strong baseline methods on various question-answering benchmarks. Further analysis indicates that ThinkNote effectively integrates and utilizes external knowledge to help LLMs generate accurate responses and improves their self-consistency. All data and code will be publicly available at https://github.com/OpenMatch/ThinkNote.
2024
Cleaner Pretraining Corpus Curation with Neural Web Scraping
Zhipeng Xu | Zhenghao Liu | Yukun Yan | Zhiyuan Liu | Ge Yu | Chenyan Xiong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
The web contains large-scale, diverse, and abundant information to satisfy the information-seeking needs of humans. Through meticulous data collection, preprocessing, and curation, webpages can be used as a fundamental data resource for language model pretraining. However, as webpages evolve rapidly and grow more intricate, rule-based and feature-based web scrapers are becoming increasingly inadequate. This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) to help extract primary and clean text content from webpages. Experimental results show that NeuScraper surpasses the baseline scrapers with an improvement of more than 20%, demonstrating its potential in extracting higher-quality data to facilitate language model pretraining. All of the code is available at https://github.com/OpenMatch/NeuScraper.