Malak Rassem
2024
What Can Diachronic Contexts and Topics Tell Us about the Present-Day Compositionality of English Noun Compounds?
Samin Mahdizadeh Sani
|
Malak Rassem
|
Chris Jenkins
|
Filip Miletić
|
Sabine Schulte im Walde
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Predicting the compositionality of noun compounds such as climate change and tennis elbow is a vital component in natural language understanding. While most previous computational methods that automatically determine the semantic relatedness between compounds and their constituents have applied a synchronic perspective, the current study investigates what diachronic changes in contexts and semantic topics of compounds and constituents reveal about the compounds’ present-day degrees of compositionality. We define a binary classification task that utilizes two diachronic vector spaces based on contextual co-occurrences and semantic topics, and demonstrate that diachronic changes in cosine similarities – measured over context or topic distributions – uncover patterns that distinguish between compounds with low and high present-day compositionality. Despite fewer dimensions in the topic models, the topic space performs on par with the co-occurrence space and captures rather similar information. Temporal similarities between compounds and modifiers as well as between compounds and their prepositional paraphrases predict the compounds’ present-day compositionality with accuracy >0.7.
2023
AlphaMWE-Arabic: Arabic Edition of Multilingual Parallel Corpora with Multiword Expression Annotations
Najet Hadj Mohamed
|
Malak Rassem
|
Lifeng Han
|
Goran Nenadic
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Multiword Expressions (MWEs) have been a bottleneck for Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks due to their idiomaticity, ambiguity, and non-compositionality. Bilingual parallel corpora introducing MWE annotations are very scarce which set another challenge for current Natural Language Processing (NLP) systems, especially in a multilingual setting. This work presents AlphaMWE-Arabic, an Arabic edition of the AlphaMWE parallel corpus with MWE annotations. We introduce how we created this corpus including machine translation (MT), post-editing, and annotations for both standard and dialectal varieties, i.e. Tunisian and Egyptian Arabic. We analyse the MT errors when they meet MWEs-related content, both quantitatively using the human-in-the-loop metric HOPE and qualitatively. We report the current state-of-the-art MT systems are far from reaching human parity performances. We expect our bilingual English-Arabic corpus will be an asset for multilingual research on MWEs such as translation and localisation, as well as for monolingual settings including the study of Arabic-specific lexicography and phrasal verbs on MWEs. Our corpus and experimental data are available at https://github.com/aaronlifenghan/AlphaMWE.
Search