A Dataset for Term Extraction in Hindi
Shubhanker Banerjee, Bharathi Raja Chakravarthi, John Philip McCrae
Abstract
Automatic Term Extraction (ATE) is one of the core problems in natural language processing and forms a key component of text mining pipelines for domain-specific corpora. Complex downstream tasks such as machine translation and summarization of domain-specific texts necessitate the use of term extraction systems. However, developing these systems requires large annotated datasets, and thus little progress has been made on this front for under-resourced languages. As part of ongoing research, this paper presents a dataset for term extraction from Hindi texts. To the best of our knowledge, this is the first dataset that provides term-annotated documents for Hindi. Furthermore, we have evaluated statistical term extraction methods on this dataset, and the results obtained indicate the problems associated with developing term extractors for under-resourced languages.
- Anthology ID:
- 2022.term-1.4
- Volume:
- Proceedings of the Workshop on Terminology in the 21st century: many faces, many places
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Editors:
- Rute Costa, Sara Carvalho, Ana Ostroški Anić, Anas Fahad Khan
- Venue:
- TERM
- Publisher:
- European Language Resources Association
- Pages:
- 19–25
- URL:
- https://aclanthology.org/2022.term-1.4
- Cite (ACL):
- Shubhanker Banerjee, Bharathi Raja Chakravarthi, and John Philip McCrae. 2022. A Dataset for Term Extraction in Hindi. In Proceedings of the Workshop on Terminology in the 21st century: many faces, many places, pages 19–25, Marseille, France. European Language Resources Association.
- Cite (Informal):
- A Dataset for Term Extraction in Hindi (Banerjee et al., TERM 2022)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/2022.term-1.4.pdf