Asım Ersoy
2025
In-Depth Analysis of Arabic-Origin Words in the Turkish Morpholex
Mounes Zaval | Abdullah İhsanoğlu | Asım Ersoy | Olcay Taner Yıldız
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script
MorphoLex is a resource that analyzes words into their roots, prefixes, and suffixes. The Turkish MorphoLex, for example, analyzes 48,472 Turkish words. Unfortunately, it lacks an in-depth analysis of Arabic-origin words and does not include their correct roots. This study analyzes the Arabic-origin words in the Turkish MorphoLex, annotating their roots, morphological patterns, and semantic categories. The methodology developed for this work is adaptable to other languages influenced by Arabic, such as Urdu and Persian, offering broader implications for studying loanword integration across linguistic contexts.
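As a hypothetical illustration of the kind of annotation this describes, the sketch below shows one Arabic-origin Turkish word with its root, morphological pattern, and a semantic category. The field names and schema are assumptions for illustration, not the paper's actual format.

```python
# Hypothetical annotation record for one Arabic-origin word in the
# Turkish MorphoLex. Field names and schema are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ArabicLoanwordEntry:
    word: str               # Turkish surface form
    arabic_root: str        # triliteral Arabic root, radicals dash-separated
    pattern: str            # Arabic morphological pattern (wazn)
    semantic_category: str  # coarse semantic label

# "kitap" (Turkish for "book") borrows Arabic "kitāb",
# built on the root k-t-b ("to write") with the fiʿāl pattern.
entry = ArabicLoanwordEntry(
    word="kitap",
    arabic_root="k-t-b",
    pattern="fiʿāl",
    semantic_category="writing",
)

print(entry)
```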
Tool Calling for Arabic LLMs: Data Strategies and Instruction Tuning
Asım Ersoy | Enes Altinisik | Kareem Mohamed Darwish | Husrev Taha Sencar
Proceedings of The Third Arabic Natural Language Processing Conference
Tool calling is a critical capability that allows Large Language Models (LLMs) to interact with external systems, significantly expanding their utility. However, research and resources for tool calling are predominantly English-centric, leaving a gap in our understanding of how to enable this functionality for other languages, such as Arabic. This paper investigates three key research questions: (1) the necessity of in-language (Arabic) tool-calling data versus relying on cross-lingual transfer, (2) the effect of general-purpose instruction tuning on tool-calling performance, and (3) the value of fine-tuning on specific, high-priority tools. To address these questions, we conduct extensive experiments using base and post-trained variants of an open-weight Arabic LLM. To enable this study, we bridge the resource gap by translating and adapting two open-source tool-calling datasets into Arabic. Our findings provide crucial insights into the optimal strategies for developing robust tool-augmented agents for Arabic.
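To make concrete what "in-language (Arabic) tool-calling data" refers to, here is a minimal hypothetical sketch of one training record in an OpenAI-style function-calling format. The tool name, schema, and Arabic query are illustrative assumptions, not drawn from the paper's datasets.

```python
# Hypothetical Arabic tool-calling training record in an OpenAI-style
# function-calling format. Tool name, schema, and query are assumptions.
import json

record = {
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "messages": [
        # User asks in Arabic: "What is the weather in Doha?"
        {"role": "user", "content": "ما حالة الطقس في الدوحة؟"},
        # Target output: the model emits a structured tool call.
        {"role": "assistant", "tool_calls": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": json.dumps({"city": "الدوحة"},
                                        ensure_ascii=False),
            },
        }]},
    ],
}

print(json.dumps(record, ensure_ascii=False, indent=2))
```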
2023
In What Languages are Generative Language Models the Most Formal? Analyzing Formality Distribution across Languages
Asım Ersoy | Gerson Vizcarra | Tahsin Mayeesha | Benjamin Muller
Findings of the Association for Computational Linguistics: EMNLP 2023
Multilingual generative language models (LMs) are increasingly fluent in a large variety of languages. Trained on the concatenation of corpora in multiple languages, they enable powerful transfer from high-resource languages to low-resource ones. However, it is still unknown what cultural biases are induced in the predictions of these models. In this work, we focus on one language property highly influenced by culture: formality. We analyze the formality distributions of the predictions of XGLM and BLOOM, two popular generative multilingual language models, in five languages. We classify 1,200 generations per language as formal, informal, or incohesive and measure the impact of prompt formality on the predictions. Overall, we observe a diversity of behaviors across the models and languages. For instance, when conditioned on informal prompts, XGLM generates informal text in Arabic and Bengali far more often than BLOOM. In addition, even though both models are highly biased toward the formal style when prompted neutrally, we find that they generate a significant amount of informal predictions even when prompted with formal text. With this work we release 6,000 annotated samples, paving the way for future work on the formality of generative multilingual LMs.