David Langlois
Processing North African Arabic dialects presents significant challenges due to high lexical variability, frequent code-switching with French, and the use of both Arabic and Latin scripts. We address this with a phoneme-based normalization strategy that maps Arabic and French text into a simplified representation (Arabic rendered in Latin script), reflecting native reading patterns. Using this method, we pretrain BERT-based models on normalized Modern Standard Arabic and French only and evaluate them on Named Entity Recognition (NER) and text classification. Experiments show that normalized standard-language corpora yield competitive performance on North African dialect tasks; in zero-shot NER, Ar_20k surpasses dialect-pretrained baselines. Normalization improves vocabulary alignment, indicating that normalized standard corpora can suffice for developing dialect-supportive models.
Arabic dialects present major challenges for natural language processing (NLP) due to their diglossic nature, phonetic variability, and the scarcity of resources. To address this, we introduce a phoneme-like transcription approach that enables the training of robust language models for North African Dialects (NADs) using only formal language data, without the need for dialect-specific corpora. Our key insight is that Arabic dialects are highly phonetic, with NADs particularly influenced by European languages. This motivated us to develop a novel approach in which we convert Arabic script into a Latin-based representation, allowing our language model, ABDUL, to benefit from existing Latin-script corpora. Our method demonstrates strong performance in multi-label emotion classification and named entity recognition (NER) across various Arabic dialects. ABDUL achieves results comparable to or better than specialized and multilingual models such as DarijaBERT, DziriBERT, and mBERT. Notably, in the NER task, ABDUL outperforms mBERT by 5% in F1-score for Modern Standard Arabic (MSA), Moroccan, and Algerian Arabic, despite using a vocabulary four times smaller than mBERT.
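As a rough illustration of the phoneme-like normalization described in the two abstracts above, the sketch below maps a handful of Arabic characters to a simplified Latin form while passing Latin-script text through unchanged. The mapping table is a hypothetical toy (note the French-style "ch"), not the scheme actually used to train ABDUL.

# Toy phoneme-like normalization: hypothetical mapping, for illustration only.
PHONEME_MAP = {
    "ب": "b", "ت": "t", "ج": "j", "د": "d", "ر": "r",
    "س": "s", "ش": "ch", "م": "m", "ل": "l",
    "ا": "a", "و": "w", "ي": "i",
}

def normalize(text: str) -> str:
    # Map Arabic script to a Latin phoneme-like form; Latin characters
    # (e.g., code-switched French) pass through unchanged.
    return "".join(PHONEME_MAP.get(ch, ch) for ch in text)

print(normalize("سلام"))  # -> "slam"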
Pre-trained word embeddings (for example, from BERT-like models) have been successfully used in a variety of downstream tasks. However, do all embeddings obtained from models of the same architecture encode information in the same way? Does the size of a model correlate with the quality of its information encoding? In this paper, we attempt to dissect the dimensions of several BERT-like models trained on French to find where grammatical information (gender, plurality, part of speech) and semantic features might be encoded. In addition, we propose a framework for comparing the quality of encoding in different models.
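A minimal sketch of the kind of per-dimension probing this abstract alludes to, assuming precomputed embeddings X and grammatical labels y (random stand-ins here): a linear probe is fit on each dimension separately, and dimensions whose single-feature probes score well are candidates for encoding the property. This is a generic probing recipe, not the paper's exact framework.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))    # stand-in for BERT-like token embeddings
y = rng.integers(0, 2, size=200)  # stand-in grammatical labels (e.g., gender)

# Probe each dimension separately with a linear classifier.
scores = [cross_val_score(LogisticRegression(), X[:, [d]], y, cv=3).mean()
          for d in range(X.shape[1])]
best = int(np.argmax(scores))
print(f"dimension {best} scores {scores[best]:.2f}")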
More than 13 million people suffer a stroke each year. Aphasia is a language disorder, usually caused by a stroke that damages a specific area of the brain controlling the expression and understanding of language. It is characterized by a disturbance of the linguistic code affecting the encoding and/or decoding of language. Our project aims to propose a method that helps a person suffering from aphasia communicate better with those around them. To this end, we propose a machine translation system capable of correcting aphasic errors and helping the patient communicate more easily. Building such a system requires a parallel corpus; to our knowledge, no such corpus exists, especially for French. The main challenge and objective of this work is therefore to build a parallel corpus composed of sentences with aphasic errors and their corresponding corrections. We show how we create a pseudo-aphasia corpus from real data, and then demonstrate the feasibility of translating from aphasic data to natural language. Preliminary results show that the deep learning methods we used achieve correct translations corresponding to a BLEU score of 38.6.
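To make the pseudo-corpus idea concrete, here is a purely hypothetical sketch that corrupts clean sentences with aphasia-like errors (omitting short function-like words, occasionally repeating a word), pairing each corrupted sentence with its correction. The error model is invented for illustration and is not the one used in the paper.

import random

random.seed(0)

def corrupt(sentence: str) -> str:
    # Hypothetical error model: drop some short words (omission) and
    # occasionally repeat a word (perseveration).
    out = []
    for word in sentence.split():
        if len(word) <= 3 and random.random() < 0.5:
            continue
        out.append(word)
        if random.random() < 0.1:
            out.append(word)
    return " ".join(out)

clean = "je voudrais un verre d'eau s'il vous plait"
print((corrupt(clean), clean))  # (pseudo-aphasic sentence, its correction)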
This demonstration of automatic video summarization and translation results from our work in the AMIS project. The goal of the project was to help a traveler understand the news in a foreign country. To this end, the project proposes to automatically summarize and translate a video in a foreign language (here, Arabic). Another objective of the project was to compare the opinions and sentiments expressed in several comparable videos. The demonstration covers the summarization, transcription, and translation aspects. The examples shown make it possible to understand and qualitatively assess the results of the project.
Automatic speech recognition for Arabic is a very challenging task. Although the classical techniques for Automatic Speech Recognition (ASR) can be applied efficiently to Arabic speech recognition, it is essential to take the language's specificities into account to improve system performance. In this article, we focus on Modern Standard Arabic (MSA) speech recognition. We introduce the challenges related to the Arabic language, namely its complex morphology and the absence of short vowels in written text, which leads to several potential, often conflicting, vowelizations for each grapheme. We develop an ASR system for MSA using the Kaldi toolkit. Several acoustic and language models are trained. We obtain a Word Error Rate (WER) of 14.42 for the baseline system and a 12.2% relative improvement by rescoring the lattice and rewriting the output with the correct Hamza above or below Alif.
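For reference, the Word Error Rate reported above is the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below computes it directly; this is the standard definition, independent of the Kaldi pipeline used in the paper.

def wer(reference: str, hypothesis: str) -> float:
    # Word Error Rate: edit distance over words / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sit down"))  # 2 edits / 3 ref words ≈ 0.67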
Building multilingual opinionated models requires multilingual corpora annotated with opinion labels. Unfortunately, such corpora are rare. In this work, we consider opinions as subjective or objective. In this paper, we introduce an annotation method that can be reliably transferred across topic domains and across languages. The method starts by building a classifier that labels sentences as subjective or objective using training data from the movie-review domain, which is in English. The annotation can be transferred to another language by classifying the English sentences of a parallel corpus and transferring the labels to the corresponding sentences in the other language. We also shed light on the link between opinion mining and statistical language modelling, and on how such corpora are useful for domain-specific language modelling. We show that the distinction between subjective and objective sentences tends to be stable across domains and languages. Our experiments show that language models trained on an objective (respectively subjective) corpus yield better perplexities on objective (respectively subjective) test data.
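A hedged sketch of the transfer scheme described above: a subjective/objective classifier trained on English data labels the English side of a parallel corpus, and each label is copied to the aligned sentence in the other language. The tiny training set and the TF-IDF plus logistic-regression classifier are stand-ins, not the paper's exact setup.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy English training data (stand-in for the movie-review corpus).
texts = ["a stunning, moving film", "the film runs 120 minutes",
         "i loved every scene", "it was released in 1999"]
labels = ["subj", "obj", "subj", "obj"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Classify the English side of a parallel corpus and copy the label
# to the aligned sentence in the other language.
parallel = [("what a wonderful story", "quelle histoire merveilleuse"),
            ("the meeting starts at noon", "la réunion commence à midi")]
for en, fr in parallel:
    print(fr, "->", clf.predict([en])[0])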
In this article, we present a new approach to machine translation based on inter-lingual triggers. We first explain the concept of inter-lingual triggers and the way they are determined. We then present the various experiments conducted with these triggers in order to integrate them as well as possible into a complete machine translation process. To this end, we build translation tables from the inter-lingual triggers using different methods. We then compare our trigger-based translation system to a state-of-the-art system based on IBM Model 3 (Brown et al., 1993). Our tests show that the translations generated by our system improve the BLEU score (Papineni et al., 2001) by 2.4% compared to those produced by the state-of-the-art system.
In this paper, we propose a new phrase-based translation model based on inter-lingual triggers. The originality of our method is twofold. First, we identify common source phrases. Then, we use inter-lingual triggers to retrieve their translations. Furthermore, we treat the extraction of phrase translations as an optimization problem: we use a simulated annealing algorithm to find the best phrase translations among all those determined by the inter-lingual triggers, where the best phrases are those that improve translation quality in terms of BLEU score. Tests are carried out on movie subtitle corpora and show that our phrase-based machine translation (PBMT) system outperforms a state-of-the-art PBMT system by almost 7 points.
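Inter-lingual triggers, as used in the two abstracts above, score how strongly a source word predicts a target word across aligned sentence pairs. The sketch below estimates pointwise mutual information on a toy parallel corpus; the scoring function and corpus are illustrative assumptions, not the exact formulation of the papers.

import math
from collections import Counter

pairs = [("the house is red", "la maison est rouge"),
         ("the house is big", "la maison est grande"),
         ("the car is red", "la voiture est rouge")]

n = len(pairs)
src_cnt, tgt_cnt, joint = Counter(), Counter(), Counter()
for src, tgt in pairs:
    s, t = set(src.split()), set(tgt.split())
    src_cnt.update(s)
    tgt_cnt.update(t)
    joint.update((a, b) for a in s for b in t)

def pmi(a: str, b: str) -> float:
    # Pointwise mutual information between a source and a target word:
    # positive when they co-occur more often than chance.
    return math.log(joint[(a, b)] * n / (src_cnt[a] * tgt_cnt[b]))

print(pmi("house", "maison"), pmi("house", "rouge"))  # strong vs. weak trigger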
In the context of statistical language modeling, we show that it is possible to use an n-gram model with a history that is not necessarily the one with which it was trained. For example, an adverb present in the history may be irrelevant to the prediction and should therefore be ignored by shifting the history used for prediction. Our study covers classical n-gram models and distant n-gram models, and is applied to the bigram case. We present four usage cases for two bigram models: distant and non-distant. We show that a history-dependent linear combination of these four cases improves the perplexity of the classical bigram model by 14%. We also examine some combination cases that highlight the histories for which the models we propose perform well.
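A minimal sketch of the interpolation idea, under simplifying assumptions: a classical bigram P(w | w-1) is combined with a distance-2 bigram P(w | w-2) that skips the immediately preceding word. The paper's history-dependent weights are replaced here by a single fixed lambda, and the toy corpus is a stand-in.

from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

uni = Counter(corpus)
big = Counter(zip(corpus, corpus[1:]))   # (w-1, w) counts: classical bigram
dist = Counter(zip(corpus, corpus[2:]))  # (w-2, w) counts: distant bigram

def p(counts, h, w):
    # Conditional probability of w given the single history word h.
    return counts[(h, w)] / uni[h] if uni[h] else 0.0

def p_combined(w2, w1, w, lam=0.7):
    # Fixed-weight interpolation of the classical and distant bigrams;
    # the paper makes this weight depend on the history.
    return lam * p(big, w1, w) + (1 - lam) * p(dist, w2, w)

print(p_combined("the", "cat", "sat"))  # 0.7*P(sat|cat) + 0.3*P(sat|the, dist)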