2024
Is Spoken Hungarian Low-resource?: A Quantitative Survey of Hungarian Speech Data Sets
Peter Mihajlik | Katalin Mády | Anna Kohári | Fruzsina Sára Fruzsina | Gábor Kiss | Tekla Etelka Gráczi | A. Seza Doğruöz
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Even though various speech data sets are available in Hungarian, there is no general overview of their types and sizes. To fill this gap, we provide a survey of available data sets in spoken Hungarian across five categories: monolingual, Hungarian parts of multilingual collections, pathological, child-related and dialectal collections. In total, the estimated size of the available data is about 2800 hours (across 7500 speakers), and it represents a rich spoken language diversity. However, the distribution of the data and its alignment with real-life tasks (e.g., speech recognition) is far from optimal, indicating the need for additional larger-scale natural language speech data sets. Our survey presents an overview of available data sets for Hungarian and explains their strengths and weaknesses, which is useful for researchers working on Hungarian across disciplines. In addition, our survey serves as a starting point towards a unified foundational speech model specific to Hungarian.
Who Is Bragging More Online? A Large Scale Analysis of Bragging in Social Media
Mali Jin | Daniel Preotiuc-Pietro | A. Seza Doğruöz | Nikolaos Aletras
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Bragging is the act of uttering statements that are likely to be positively viewed by others, and it is extensively employed in human communication with the aim of building a positive self-image. Social media is a natural platform for users to employ bragging in order to gain admiration, respect, attention and followers from their audiences. Yet, little is known about the scale of bragging online and its characteristics. This paper employs computational sociolinguistics methods to conduct the first large-scale study of bragging behavior on Twitter (U.S.), focusing on its overall prevalence, its temporal dynamics and the impact of demographic factors. Our study shows that the prevalence of bragging decreases over time within the same population of users. In addition, younger, more educated and popular users in the U.S. are more likely to brag. Finally, we conduct an extensive linguistic analysis to unveil specific bragging themes associated with different user traits.
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
Archna Bhatia | Gosse Bouma | A. Seza Doğruöz | Kilian Evang | Marcos Garcia | Voula Giouli | Lifeng Han | Joakim Nivre | Alexandre Rademaker
Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity
Eric Khiu | Hasti Toossi | Jinyu Liu | Jiaxu Li | David Anugraha | Juan Flores | Leandro Roman | A. Seza Doğruöz | En-Shiun Lee
Findings of the Association for Computational Linguistics: EACL 2024
Fine-tuning and testing a multilingual large language model is a challenge for low-resource languages (LRLs) since it is an expensive process. While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we use classical regression models to investigate three factors that can potentially impact model performance: the size of the fine-tuning corpus, the domain similarity between fine-tuning and testing corpora, and the language similarity between source and target languages. Our results indicate that domain similarity has the most important impact on predicting the performance of Machine Translation models.
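As a rough illustration of the kind of analysis this abstract describes (not the authors' code or data), the sketch below fits a classical regression model over the three predictors with scikit-learn; all feature values, the BLEU scores, and the choice of plain linear regression are assumptions for demonstration only.

```python
# Illustrative sketch only: predicting an MT quality score from three
# hand-picked factors with a classical regression model.
# All numbers below are hypothetical placeholders, not data from the paper.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Each row: [fine-tuning corpus size (log10 sentences), domain similarity, language similarity]
X = np.array([
    [3.0, 0.82, 0.65],
    [3.5, 0.40, 0.70],
    [4.0, 0.91, 0.55],
    [4.2, 0.35, 0.60],
    [4.8, 0.77, 0.80],
    [5.0, 0.25, 0.75],
])
# Corresponding (made-up) BLEU scores of the fine-tuned MT model.
y = np.array([11.2, 9.8, 21.5, 14.1, 27.3, 19.0])

model = LinearRegression().fit(X, y)
print("coefficients (size, domain sim., language sim.):", model.coef_)
print("cross-validated R^2:", cross_val_score(LinearRegression(), X, y, cv=3).mean())
```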
A Reproducibility Study on Quantifying Language Similarity: The Impact of Missing Values in the URIEL Knowledge Base
Hasti Toossi | Guo Huai | Jinyu Liu | Eric Khiu | A. Seza Doğruöz | En-Shiun Lee
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
In the pursuit of supporting more languages around the world, tools that characterize properties of languages play a key role in expanding existing multilingual NLP research. In this study, we focus on a widely used typological knowledge base, URIEL, which aggregates linguistic information into numeric vectors. Specifically, we delve into the soundness and reproducibility of the approach taken by URIEL in quantifying language similarity. Our analysis reveals URIEL’s ambiguity in calculating language distances and in handling missing values. Moreover, we find that URIEL does not provide any information about typological features for 31% of the languages it represents, undermining the reliability of the database, particularly for low-resource languages. Our literature review suggests that URIEL and lang2vec are used in papers on diverse NLP tasks, which motivates us to rigorously verify the database, as the effectiveness of these works depends on the reliability of the information the tool provides.
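For readers who want to inspect the missing-value issue themselves, a minimal sketch follows. It assumes the lang2vec package's documented get_features() interface and its convention of encoding missing entries in the raw (non-knn) feature sets as the string '--'; the language codes are arbitrary examples, and this is not the analysis pipeline of the paper.

```python
# Illustrative sketch: counting how many URIEL feature values are missing
# for a given language via the lang2vec package.
# Assumption: missing entries in raw (non-knn) feature sets appear as '--'.
import lang2vec.lang2vec as l2v

langs = ["eng", "hun", "quy"]          # ISO 639-3 codes; chosen arbitrarily
feats = l2v.get_features(langs, "syntax_wals")

for lang in langs:
    values = feats[lang]
    missing = sum(1 for v in values if v == "--")
    print(f"{lang}: {missing}/{len(values)} syntax_wals values missing")
```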
Comparing LLM prompting with Cross-lingual transfer performance on Indigenous and Low-resource Brazilian Languages
David Ifeoluwa Adelani | A. Seza Doğruöz | André Coneglian | Atul Kr. Ojha
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
Large Language Models (LLMs) are transforming NLP for many tasks. However, how LLMs perform NLP tasks for low-resource languages (LRLs) is less explored. In line with the theme track of NAACL’24, we focus on 12 LRLs from Brazil, 2 LRLs from Africa and 2 high-resource languages (HRLs) (e.g., English and Brazilian Portuguese). Our results indicate that the LLMs perform worse on the labeling of LRLs than of HRLs in general. We explain the reasons behind this failure and provide an error analysis through examples from 2 Brazilian LRLs.
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Atul Kr. Ojha | A. Seza Doğruöz | Harish Tayyar Madabushi | Giovanni Da San Martino | Sara Rosenthal | Aiala Rosá
2023
Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
A. Seza Doğruöz | Sunayana Sitaram | Zheng Xin Yong
Findings of the Association for Computational Linguistics: EMNLP 2023
Multilingualism is widespread around the world and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions. However, there is still not much progress in building successful CSW systems, despite the recent advances in Massive Multilingual Language Models (MMLMs). We investigate the reasons behind this setback through a critical study of the existing CSW data sets (68) across language pairs in terms of the collection and preparation (e.g., transcription and annotation) stages. This in-depth analysis reveals that a) most CSW data involves English, ignoring other language pairs/tuples, and b) there are flaws in terms of representativeness in the data collection and preparation stages due to ignoring location-based, socio-demographic and register variation in CSW. In addition, the lack of clarity on the data selection and filtering stages overshadows the representativeness of CSW data sets. We conclude by providing a short check-list to improve representativeness for forthcoming studies involving CSW data collection and preparation.
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
Atul Kr. Ojha | A. Seza Doğruöz | Giovanni Da San Martino | Harish Tayyar Madabushi | Ritesh Kumar | Elisa Sartori
Current Status of NLP in South East Asia with Insights from Multilingualism and Language Diversity
Alham Fikri Aji | Jessica Zosa Forde | Alyssa Marie Loo | Lintang Sutawika | Skyler Wang | Genta Indra Winata | Zheng-Xin Yong | Ruochen Zhang | A. Seza Doğruöz | Yin Lin Tan | Jan Christian Blaise Cruz
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract
Learning from Partially Annotated Data: Example-aware Creation of Gap-filling Exercises for Language Learning
Semere Kiros Bitew | Johannes Deleu | A. Seza Doğruöz | Chris Develder | Thomas Demeester
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
Since performing exercises (including, e.g., practice tests) forms a crucial component of learning, and creating such exercises requires non-trivial effort from the teacher, there is great value in automatic exercise generation in digital tools in education. In this paper, we particularly focus on the automatic creation of gap-filling exercises for language learning, specifically grammar exercises. Since providing any annotation in this domain requires human expert effort, we aim to avoid it entirely and explore the task of converting existing texts into new gap-filling exercises, purely based on an example exercise, without explicit instruction or detailed annotation of the intended grammar topics. We contribute (i) a novel neural network architecture specifically designed for the aforementioned gap-filling exercise generation task, and (ii) a real-world benchmark dataset for French grammar. We show that our model for this French grammar gap-filling exercise generation outperforms a competitive baseline classifier by 8 F1 percentage points, achieving an average F1 score of 82%. Our model implementation and the dataset are made publicly available to foster future research, thus offering a standardized evaluation and baseline solution for the proposed partially annotated data prediction task in grammar exercise creation.
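The paper itself trains a neural architecture from an example exercise; purely as a toy illustration of the underlying task (and not the authors' method), the sketch below blanks out tokens in a new text whose part of speech matches the gapped token of an example exercise. It assumes spaCy and its fr_core_news_sm model are installed; the sentences are invented.

```python
# Toy illustration only: mimic "example-aware" gap creation with a crude
# POS heuristic -- blank tokens in a new text that share the part of speech
# of the gapped token in the example exercise.
import spacy

nlp = spacy.load("fr_core_news_sm")  # assumes: python -m spacy download fr_core_news_sm

def make_gaps(example_sentence: str, gap_word: str, new_text: str) -> str:
    """Blank out tokens in new_text whose POS matches the example's gapped token."""
    target_pos = next(t.pos_ for t in nlp(example_sentence) if t.text == gap_word)
    return " ".join("____" if tok.pos_ == target_pos else tok.text for tok in nlp(new_text))

# The example exercise gaps a conjugated verb ("mange"), so verbs get blanked.
print(make_gaps("Je mange une pomme.", "mange", "Elle prend le train et lit un livre."))
```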
The Open-domain Paradox for Chatbots: Common Ground as the Basis for Human-like Dialogue
Gabriel Skantze | A. Seza Doğruöz
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue
There is a surge in interest in the development of open-domain chatbots, driven by the recent advancements of large language models. The “openness” of the dialogue is expected to be maximized by providing minimal information to the users about the common ground they can expect, including the presumed joint activity. However, evidence suggests that the effect is the opposite. Asking users to “just chat about anything” results in a very narrow form of dialogue, which we refer to as the “open-domain paradox”. In this position paper, we explain this paradox through the theory of common ground as the basis for human-like communication. Furthermore, we question the assumptions behind open-domain chatbots and identify paths forward for enabling common ground in human-computer dialogue.
2022
Automatic Identification and Classification of Bragging in Social Media
Mali Jin | Daniel Preotiuc-Pietro | A. Seza Doğruöz | Nikolaos Aletras
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Bragging is a speech act employed with the goal of constructing a favorable self-image through positive statements about oneself. It is widespread in daily communication and especially popular in social media, where users aim to build a positive image of their persona directly or indirectly. In this paper, we present the first large scale study of bragging in computational linguistics, building on previous research in linguistics and pragmatics. To facilitate this, we introduce a new publicly available data set of tweets annotated for bragging and their types. We empirically evaluate different transformer-based models injected with linguistic information in (a) binary bragging classification, i.e., if tweets contain bragging statements or not; and (b) multi-class bragging type prediction including not bragging. Our results show that our models can predict bragging with macro F1 up to 72.42 and 35.95 in the binary and multi-class classification tasks respectively. Finally, we present an extensive linguistic and error analysis of bragging prediction to guide future research on this topic.
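Since the results above are reported in macro F1, here is a minimal sketch of that metric with scikit-learn on made-up labels; the class names are illustrative and do not necessarily match the bragging types annotated in the data set.

```python
# Minimal sketch (not the paper's evaluation code): macro F1, the metric the
# bragging classification results are reported in, on invented labels.
from sklearn.metrics import f1_score

# Binary task: does a tweet contain a bragging statement? (1 = bragging)
y_true_bin = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred_bin = [1, 0, 1, 1, 0, 0, 0, 0]
print("binary macro F1:", f1_score(y_true_bin, y_pred_bin, average="macro"))

# Multi-class task: bragging type, with "not_bragging" as one of the classes.
# Class names here are illustrative placeholders.
y_true_mc = ["achievement", "not_bragging", "possession", "trait", "not_bragging"]
y_pred_mc = ["achievement", "not_bragging", "trait", "trait", "possession"]
print("multi-class macro F1:", f1_score(y_true_mc, y_pred_mc, average="macro"))
```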
Language Technologies for Low Resource Languages: Sociolinguistic and Multilingual Insights
A. Seza Doğruöz | Sunayana Sitaram
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
There is a growing interest in building language technologies (LTs) for low resource languages (LRLs). However, there are flaws in the planning, data collection and development phases mostly due to the assumption that LRLs are similar to High Resource Languages (HRLs) but only smaller in size. In our paper, we first provide examples of failed LTs for LRLs and provide the reasons for these failures. Second, we discuss the problematic issues with the data for LRLs. Finally, we provide recommendations for building better LTs for LRLs through insights from sociolinguistics and multilingualism. Our goal is not to solve all problems around LTs for LRLs but to raise awareness about the existing issues, provide recommendations toward possible solutions and encourage collaboration across academic disciplines for developing LTs that actually serve the needs and preferences of the LRL communities.
2021
How “open” are the conversations with open-domain chatbots? A proposal for Speech Event based evaluation
A. Seza Doğruöz | Gabriel Skantze
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Open-domain chatbots are supposed to converse freely with humans without being restricted to a topic, task or domain. However, the boundaries and/or contents of open-domain conversations are not clear. To clarify the boundaries of “openness”, we conduct two studies: First, we classify the types of “speech events” encountered in a chatbot evaluation data set (i.e., Meena by Google) and find that these conversations mainly cover the “small talk” category and exclude the other speech event categories encountered in real life human-human communication. Second, we conduct a small-scale pilot study to generate online conversations covering a wider range of speech event categories between two humans vs. a human and a state-of-the-art chatbot (i.e., Blender by Facebook). A human evaluation of these generated conversations indicates a preference for human-human conversations, since the human-chatbot conversations lack coherence in most speech event categories. Based on these results, we suggest (a) using the term “small talk” instead of “open-domain” for the current chatbots which are not that “open” in terms of conversational abilities yet, and (b) revising the evaluation methods to test the chatbot conversations against other speech events.
Open Machine Translation for Low Resource South American Languages (AmericasNLP 2021 Shared Task Contribution)
Shantipriya Parida | Subhadarshi Panda | Amulya Dash | Esau Villatoro-Tello | A. Seza Doğruöz | Rosa M. Ortega-Mendoza | Amadeo Hernández | Yashvardhan Sharma | Petr Motlicek
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
This paper describes team “Tamalli”’s submission to the AmericasNLP 2021 shared task on Open Machine Translation for low-resource South American languages. Our goal was to evaluate different Machine Translation (MT) techniques, statistical and neural-based, under several configuration settings. We obtained the second-best results for the language pairs “Spanish-Bribri”, “Spanish-Asháninka”, and “Spanish-Rarámuri” in the category “Development set not used for training”. Our experiments will serve as a point of reference for researchers working on MT with low-resource languages.
A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies
A. Seza Doğruöz | Sunayana Sitaram | Barbara E. Bullock | Almeida Jacqueline Toribio
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
The analysis of data in which multiple languages are represented has gained popularity among computational linguists in recent years. So far, much of this research focuses mainly on the improvement of computational methods and largely ignores linguistic and social aspects of code-switching (C-S) discussed across a wide range of languages within the long-established literature in linguistics. To fill this gap, we offer a survey of C-S covering the literature in linguistics with a reflection on the key issues in language technologies. From the linguistic perspective, we provide an overview of structural and functional patterns of C-S, focusing on the literature from European and Indian contexts as highly multilingual areas. From the language technologies perspective, we discuss how massive language models fail to represent diverse C-S types due to lack of appropriate training data, lack of robust evaluation benchmarks for C-S (across multilingual situations and types of C-S) and lack of end-to-end systems that cover sociolinguistic aspects of C-S as well. Our survey will be a step towards an outcome of mutual benefit for computational scientists and linguists with a shared interest in multilingualism and C-S.
2017
Proceedings of the Second Workshop on NLP and Computational Social Science
Dirk Hovy | Svitlana Volkova | David Bamman | David Jurgens | Brendan O’Connor | Oren Tsur | A. Seza Doğruöz
Integrating Meaning into Quality Evaluation of Machine Translation
Osman Başkaya | Eray Yildiz | Doruk Tunaoğlu | Mustafa Tolga Eren | A. Seza Doğruöz
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
Machine translation (MT) quality is evaluated through comparisons between MT outputs and human translations (HT). Traditionally, this evaluation relies on form-related features (e.g., lexicon and syntax) and ignores the transfer of meaning reflected in HT outputs. Instead, we evaluate the quality of MT outputs through meaning-related features (e.g., polarity, subjectivity) in two experiments. In the first experiment, the meaning-related features are compared to human rankings individually. In the second experiment, combinations of meaning-related features and other quality metrics are utilized to predict the same human rankings. The results of our experiments confirm the benefit of these features in predicting human evaluation of translation quality in addition to traditional metrics, which focus mainly on form.
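To make the idea of meaning-related features concrete, the sketch below extracts polarity and subjectivity for an MT output and a human translation and compares them. TextBlob is used here only as a stand-in sentiment tool and the example sentences are invented, so this is not the feature set or pipeline of the paper.

```python
# Rough sketch (not the paper's system): compare an MT output to the human
# translation on two meaning-related features, polarity and subjectivity.
from textblob import TextBlob

def meaning_features(text: str):
    """Return (polarity, subjectivity) as simple meaning-related features."""
    sent = TextBlob(text).sentiment
    return sent.polarity, sent.subjectivity

human = "The service was surprisingly good, although a bit slow."   # invented HT
mt_out = "The service was good surprisingly, but little slow."       # invented MT output

h_pol, h_subj = meaning_features(human)
m_pol, m_subj = meaning_features(mt_out)
print(f"polarity gap:     {abs(h_pol - m_pol):.3f}")
print(f"subjectivity gap: {abs(h_subj - m_subj):.3f}")
```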
2016
Computational Sociolinguistics: A Survey
Dong Nguyen | A. Seza Doğruöz | Carolyn P. Rosé | Franciska de Jong
Computational Linguistics, Volume 42, Issue 3 - September 2016
Proceedings of the First Workshop on NLP and Computational Social Science
David Bamman | A. Seza Doğruöz | Jacob Eisenstein | Dirk Hovy | David Jurgens | Brendan O’Connor | Alice Oh | Oren Tsur | Svitlana Volkova
2014
Predicting Code-switching in Multilingual Communication for Immigrant Communities
Evangelos Papalexakis | Dong Nguyen | A. Seza Doğruöz
Proceedings of the First Workshop on Computational Approaches to Code Switching
Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment
Dong Nguyen | Dolf Trieschnigg | A. Seza Doğruöz | Rilana Gravel | Mariët Theune | Theo Meder | Franciska de Jong
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers
Modeling the Use of Graffiti Style Features to Signal Social Relations within a Multi-Domain Learning Paradigm
Mario Piergallini | A. Seza Doğruöz | Phani Gadde | David Adamson | Carolyn Rosé
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics
Predicting Dialect Variation in Immigrant Contexts Using Light Verb Constructions
A. Seza Doğruöz | Preslav Nakov
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
2013
Word Level Language Identification in Online Multilingual Communication
Dong Nguyen | A. Seza Doğruöz
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing