An LLM-based Framework for Domain-Specific Information Extraction: A Case Study in Computer Science and Chemistry

Xungang Gu, Yangjie Tian, Ning Li, Meng Liu, Ruohua Xu, He Zhang, Hanqiu Liu, Yongpan Sheng, Ming Liu


Abstract
Information extraction (IE) in specialized domains like computer science and chemistry is challenged by the poor generalization of traditional models and the knowledge deficits of general-purpose Large Language Models (LLMs). We introduce a robust, LLM-based framework featuring two core contributions: an end-to-end training and inference paradigm that combines continual pre-training (CPT) for knowledge injection, supervised fine-tuning (SFT) for task alignment, and retrieval-augmented generation (RAG) for inference-time enhancement; and a novel LLM-assisted data annotation pipeline for the efficient creation of high-quality training data. Comprehensive experiments demonstrate that while fine-tuning alone yields strong in-domain performance, our complete framework exhibits superior robustness and generalization. It consistently achieves state-of-the-art results in challenging domain-shift and novel-schema scenarios, validating our integrated approach for building adaptable and high-performance domain-specific IE systems.
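The inference-time step the abstract describes — retrieving domain passages and conditioning the fine-tuned model on them — can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the overlap-based retriever, prompt format, and `generate` stub are all placeholders for the paper's actual RAG components.

```python
# Minimal sketch of RAG-augmented extraction at inference time.
# retrieve(), build_prompt(), and the generate stub are illustrative
# assumptions standing in for the framework's real components.

def retrieve(query, corpus, k=2):
    """Rank corpus passages by simple token overlap with the query."""
    q = set(query.lower().split())
    return sorted(
        corpus,
        key=lambda p: len(q & set(p.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(sentence, schema, passages):
    """Prepend retrieved domain context to the extraction instruction."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Domain context:\n{context}\n\n"
        f"Extract entities of types {schema} from:\n{sentence}\n"
    )

def extract(sentence, schema, corpus, generate):
    """Retrieve domain passages, then prompt the (fine-tuned) LLM."""
    passages = retrieve(sentence, corpus)
    return generate(build_prompt(sentence, schema, passages))

# Toy usage: a stub stands in for the CPT+SFT model.
corpus = [
    "Benzene is an aromatic hydrocarbon.",
    "Transformers are a neural network architecture.",
]
stub = lambda prompt: (
    {"Compound": ["benzene"]} if "benzene" in prompt.lower() else {}
)
result = extract("The benzene ring is planar.", ["Compound"], corpus, stub)
```

In this sketch `result` is `{"Compound": ["benzene"]}`: the retriever surfaces the chemistry passage, and the stub "model" extracts the compound mention from the augmented prompt.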
Anthology ID:
2025.alta-main.8
Volume:
Proceedings of The 23rd Annual Workshop of the Australasian Language Technology Association
Month:
November
Year:
2025
Address:
Sydney, Australia
Editors:
Jonathan K. Kummerfeld, Aditya Joshi, Mark Dras
Venue:
ALTA
Publisher:
Association for Computational Linguistics
Pages:
101–111
URL:
https://preview.aclanthology.org/ingest-alta/2025.alta-main.8/
Cite (ACL):
Xungang Gu, Yangjie Tian, Ning Li, Meng Liu, Ruohua Xu, He Zhang, Hanqiu Liu, Yongpan Sheng, and Ming Liu. 2025. An LLM-based Framework for Domain-Specific Information Extraction: A Case Study in Computer Science and Chemistry. In Proceedings of The 23rd Annual Workshop of the Australasian Language Technology Association, pages 101–111, Sydney, Australia. Association for Computational Linguistics.
Cite (Informal):
An LLM-based Framework for Domain-Specific Information Extraction: A Case Study in Computer Science and Chemistry (Gu et al., ALTA 2025)
PDF:
https://preview.aclanthology.org/ingest-alta/2025.alta-main.8.pdf