Julien Perez
Double-blind peer review is central to scientific conferences, yet biases persist. OpenReview has introduced more transparency by making papers, reviews, and decisions public. This work explores the use of large language models (LLMs) to assist different stages of the reviewing process: producing meta-reviews and detecting bias and subjectivity in reviews. The study draws on ICLR data from 2017 to 2022 and includes quantitative analyses and blind human evaluations. The results aim to encourage more efficient and fairer scientific reviewing.
In the context of the growing use of LLMs, the need for efficient, automatic grounding in sources is becoming essential, particularly for historical documents. The ability of LLMs to identify relevant sources is no longer merely one link in a chain whose final goal is answer generation; it is a fundamental issue for analysis, warranting evaluation in its own right. Which strategies, models, and parameters give historians the best capabilities for exploring a large and noisy corpus? This article offers a first attempt at evaluating the retriever in a RAG setting applied to the parliamentary debates of the French Third Republic.
Automated assessment in project-based programming education traditionally relies on unit tests to judge students' code submissions, emphasizing functional correctness. However, such tests often overlook qualitative aspects of code, such as readability or modularity. This study examines the potential of large language models (LLMs) to assess programming submissions, comparing their results with those of unit tests. Using a large dataset of student submissions to a collection of software development projects, we apply statistical analyses, predictive modeling, and several comparisons to evaluate the effectiveness of LLMs. Our results show a significant correlation between LLM assessments, for given prompts, and unit tests. Predictive models show that LLM scores can be approximated from unit-test results, and the student rankings produced by the two approaches are strongly correlated. These findings remain robust even when noise is injected into student submissions. The results suggest that LLMs, by capturing additional dimensions of performance, can enrich educational assessment frameworks, offering a more nuanced and comprehensive overall approach.
Fine-tuning a large language model on downstream tasks has become a commonly adopted process in Natural Language Processing (NLP) (CITATION). However, such a process, when combined with current transformer-based (CITATION) architectures, shows several limitations when the target task requires reasoning over long documents. In this work, we introduce a novel hierarchical propagation layer that spreads information between multiple transformer windows. We adopt a hierarchical approach where the input is divided into multiple blocks, independently processed by scaled dot-product attention and combined across successive layers. We validate the effectiveness of our approach on three extractive summarization corpora of long scientific papers and news articles. We compare our approach to standard and pre-trained language-model-based summarizers and report state-of-the-art results for long-document summarization and comparable results for shorter documents.
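The block-wise processing described in this abstract can be sketched minimally as follows. This is an illustrative toy, not the paper's actual layer: the identity stands in for per-window attention, and mean pooling stands in for the learned propagation step; the function name and pooling choices are assumptions.

```python
import numpy as np

def hierarchical_propagation(x, window=4):
    """Toy sketch: process fixed-size windows independently, then
    share a pooled summary across all windows (the 'propagation').

    x: (seq_len, d) token embeddings; seq_len assumed divisible
    by `window`.
    """
    n, d = x.shape
    blocks = x.reshape(n // window, window, d)
    # Local step: in the paper each window is processed independently
    # by scaled dot-product attention; the identity stands in here.
    local = blocks
    # Propagation step: pool one summary per window, then broadcast
    # the mean of all summaries back into every window.
    summaries = local.mean(axis=1)        # (n_windows, d)
    global_ctx = summaries.mean(axis=0)   # (d,)
    return (local + global_ctx).reshape(n, d)
```

The point of the sketch is the information flow: tokens never attend across window boundaries directly, yet every window receives a signal derived from all the others between layers.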
Machine reading using differentiable reasoning models has recently shown remarkable progress. In this context, End-to-End trainable Memory Networks (MemN2N) have demonstrated promising performance on simple natural-language reasoning tasks such as factual reasoning and basic deduction. However, other tasks, namely multi-fact question answering, positional reasoning, and dialog-related tasks, remain challenging, particularly because they require more complex interactions between the memory and controller modules composing this family of models. In this paper, we introduce a novel end-to-end memory access regulation mechanism inspired by recent progress on the connection short-cutting principle in the field of computer vision. Concretely, we develop a Gated End-to-End trainable Memory Network architecture (GMemN2N). From the machine learning perspective, this new capability is learned in an end-to-end fashion without any additional supervision signal, which is, to the best of our knowledge, the first of its kind. Our experiments show significant improvements on the most challenging tasks in the 20 bAbI dataset, without the use of any domain knowledge. We then show improvements on the Dialog bAbI tasks, including the real human-bot conversation-based Dialog State Tracking Challenge (DSTC-2) dataset. On these two datasets, our model sets a new state of the art.
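The gated memory access described above can be illustrated with a minimal sketch of one memory hop. Parameter names, shapes, and the exact gate placement are assumptions for illustration, not the paper's code: a standard MemN2N read is blended with the previous controller state through a learned sigmoid gate, in the spirit of highway/short-cut connections.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_memory_hop(u, memory_keys, memory_values, W_gate, b_gate):
    """One gated memory hop (illustrative sketch).

    u: controller state, shape (d,)
    memory_keys, memory_values: memory embeddings, shape (n, d)
    W_gate, b_gate: learned gate parameters, shapes (d, d) and (d,)
    """
    # Standard MemN2N read: softmax attention over memory slots.
    scores = memory_keys @ u
    p = np.exp(scores - scores.max())
    p /= p.sum()
    o = memory_values.T @ p                  # read-out vector, shape (d,)

    # Per-dimension gate regulating how much of the read-out is let
    # through versus how much of the old controller state is kept.
    g = sigmoid(W_gate @ u + b_gate)
    return o * g + u * (1.0 - g)
```

With the gate fully closed the hop is a no-op (the controller state passes through unchanged), and with it fully open the hop behaves like a plain MemN2N update; training learns everything in between end to end.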
In an end-to-end dialog system, the aim of dialog state tracking is to accurately estimate a compact representation of the current dialog status from a sequence of noisy observations produced by the speech recognition and natural language understanding modules. This paper introduces a novel method of dialog state tracking based on the general paradigm of machine reading and proposes to solve it using an End-to-End Memory Network, MemN2N, a memory-enhanced neural network architecture. We evaluate the proposed approach on the second Dialog State Tracking Challenge (DSTC-2) dataset. The corpus has been converted for the occasion in order to frame hidden state variable inference as a question-answering task over a sequence of utterances extracted from a dialog. We show that the proposed tracker gives encouraging results. We then propose to extend the DSTC-2 dataset with requirements for specific reasoning capabilities such as counting, list maintenance, yes-no question answering, and indefinite knowledge management. Finally, we report encouraging results with our proposed MemN2N-based tracking model.
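The corpus conversion this abstract describes — framing state inference as question answering over the utterance sequence — can be sketched as a small data transform. The field names and the question template below are illustrative assumptions, not the paper's actual format.

```python
def dialog_to_qa(utterances, slot):
    """Sketch: turn a dialog's utterance sequence into one
    question-answering instance a MemN2N can consume, where the
    utterances fill the memory and the query asks for a slot value.
    """
    return {
        "story": list(utterances),  # memory content, one slot per utterance
        "question": f"What is the current value of the {slot} slot?",
    }
```

A usage example: `dialog_to_qa(["hello", "i want cheap thai food"], "food")` yields a two-utterance story and the question about the `food` slot, which the tracker answers at each turn of the dialog.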
There have been many attempts at automatically recognising author personality traits from text, typically incorporating linguistic features with conventional machine learning models, e.g. linear regression or Support Vector Machines. In this work, we propose to use deep-learning-based models with atomic features of text – the characters – to build hierarchical, vectorial word and sentence representations for the task of trait inference. On a corpus of tweets, this method shows state-of-the-art performance across five traits and three languages (English, Spanish and Italian) compared with prior work in author profiling. The results, supported by preliminary visualisation work, are encouraging for the ability to detect complex human traits.
Many methods have been used to recognise author personality traits from text, typically combining linguistic feature engineering with shallow learning models, e.g. linear regression or Support Vector Machines. This work uses deep-learning-based models and atomic features of text, the characters, to build hierarchical, vectorial word and sentence representations for trait inference. This method, applied to a corpus of tweets, shows state-of-the-art performance across five traits compared with prior work. The results, supported by preliminary visualisation work, are encouraging for the ability to detect complex human traits.