Biomedical Concept Normalization over Nested Entities with Partial UMLS Terminology in Russian

Natalia Loukachevitch, Andrey Sakhovskiy, Elena Tutubalina


Abstract
We present a new manually annotated dataset of PubMed abstracts for concept normalization in Russian. It contains 23,641 entity mentions in 756 documents, linked to 4,544 unique concepts from the UMLS ontology. Compared to existing corpora, the dataset introduces two novel annotation characteristics: the nestedness of named entities and the incompleteness of the Russian medical terminology in UMLS. Of the entity mentions, 4,424 are linked to 1,535 unique English concepts that are absent from the Russian part of the UMLS ontology. We present several baselines for normalization over nested named entities, obtained with state-of-the-art models such as SapBERT. Our experimental results show that models pre-trained on graph structural data from UMLS achieve superior performance in a zero-shot setting on bilingual terminology.
Anthology ID:
2024.lrec-main.213
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
2383–2389
URL:
https://aclanthology.org/2024.lrec-main.213
Cite (ACL):
Natalia Loukachevitch, Andrey Sakhovskiy, and Elena Tutubalina. 2024. Biomedical Concept Normalization over Nested Entities with Partial UMLS Terminology in Russian. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2383–2389, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Biomedical Concept Normalization over Nested Entities with Partial UMLS Terminology in Russian (Loukachevitch et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.213.pdf