Benchmarking Hindi Term Extraction in Education: A Dataset and Analysis
Shubhanker Banerjee, Bharathi Raja Chakravarthi, John Philip McCrae
Abstract
This paper introduces the HTEC Hindi Term Extraction Dataset 2.0, a resource designed to support terminology extraction and classification tasks within the education domain. HTEC 2.0 has been developed with the objective of providing a high-quality benchmark dataset for the evaluation of term recognition and classification methodologies in Hindi educational discourse. The dataset consists of 97 documents sourced from Hindi Wikipedia, covering a diverse range of topics relevant to the education sector. Within these documents, 1,702 terms have been manually annotated where each term is defined as a single-word or multi-word expression that conveys a domain-specific meaning. The annotated terms in HTEC 2.0 are systematically categorized into seven distinct classes. Furthermore, this paper outlines the development of annotation guidelines, detailing the criteria used to determine term boundaries and category assignments. By offering a structured dataset with clearly defined term classifications, HTEC 2.0 serves as a valuable resource for researchers working on terminology extraction, domain-specific named entity recognition, and text classification in Hindi.
- Anthology ID:
- 2025.ldk-1.3
- Volume:
- Proceedings of the 5th Conference on Language, Data and Knowledge
- Month:
- September
- Year:
- 2025
- Address:
- Naples, Italy
- Editors:
- Mehwish Alam, Andon Tchechmedjiev, Jorge Gracia, Dagmar Gromann, Maria Pia di Buono, Johanna Monti, Maxim Ionov
- Venues:
- LDK | WS
- Publisher:
- Unior Press
- Pages:
- 19–30
- URL:
- https://preview.aclanthology.org/ldl-25-ingestion/2025.ldk-1.3/
- Cite (ACL):
- Shubhanker Banerjee, Bharathi Raja Chakravarthi, and John Philip McCrae. 2025. Benchmarking Hindi Term Extraction in Education: A Dataset and Analysis. In Proceedings of the 5th Conference on Language, Data and Knowledge, pages 19–30, Naples, Italy. Unior Press.
- Cite (Informal):
- Benchmarking Hindi Term Extraction in Education: A Dataset and Analysis (Banerjee et al., LDK 2025)
- PDF:
- https://preview.aclanthology.org/ldl-25-ingestion/2025.ldk-1.3.pdf