TaxoAlign: Scholarly Taxonomy Generation Using Language Models

Avishek Lahiri, Yufang Hou, Debarshi Kumar Sanyal


Abstract
Taxonomies play a crucial role in helping researchers structure and navigate knowledge hierarchically. They are also an important part of comprehensive literature surveys. Existing approaches to automatic survey generation do not compare the structure of the generated surveys with that of surveys written by human experts. To address this gap, we present a method for automated taxonomy creation that aims to bring automatically created taxonomies closer to those written by humans. For this purpose, we create the CS-TaxoBench benchmark, which consists of 460 taxonomies extracted from human-written survey papers. We also include an additional test set of 80 taxonomies curated from conference survey papers. We propose TaxoAlign, a three-phase, topic-based, instruction-guided method for scholarly taxonomy generation. Additionally, we propose a stringent automated evaluation framework that measures the structural alignment and semantic coherence of automatically generated taxonomies against those created by human experts. We evaluate our method and various baselines on CS-TaxoBench, using both automated evaluation metrics and human evaluation studies. The results show that TaxoAlign consistently surpasses the baselines on nearly all metrics. The code and data can be found at https://github.com/AvishekLahiri/TaxoAlign.
Anthology ID:
2025.emnlp-main.1536
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
30191–30211
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1536/
Cite (ACL):
Avishek Lahiri, Yufang Hou, and Debarshi Kumar Sanyal. 2025. TaxoAlign: Scholarly Taxonomy Generation Using Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30191–30211, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
TaxoAlign: Scholarly Taxonomy Generation Using Language Models (Lahiri et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1536.pdf
Checklist:
 2025.emnlp-main.1536.checklist.pdf