Swarnadeep Bhar
Fixing paper assignments
We present a taxonomy of hallucinations in LLMs, classified into three categories: unfaithful hallucinations, factual contradictions, and factual fabrications. These hallucinations can arise from the pre-training and alignment data, leading to misinformation, biases, and knowledge errors. Training methods can introduce problems such as overfitting, snowball effects, or sycophancy. Decoding strategies can also make models overconfident and prone to assigning probability to incorrect outputs. We present a survey of hallucination detection and mitigation: NLP methods, such as fact-checking and classification, as well as LLM-based methods. Hallucination mitigation strategies include improving the quality of pre-training data, injecting new knowledge (e.g., with RAG), optimization, SFT, and RLHF, as well as decoding methods.
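Among the detection methods surveyed is NLP-style fact verification. The sketch below shows one way such a check can be implemented with an off-the-shelf NLI classifier, flagging a generation that contradicts the reference passage it was grounded on. The roberta-large-mnli checkpoint, the example sentences, and the probability threshold are illustrative assumptions, not choices made in the paper.

```python
# Minimal sketch: flag a generated statement as a likely factual contradiction
# by checking it against a reference passage with an off-the-shelf NLI model.
# The checkpoint and threshold below are assumptions for illustration only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumed NLI checkpoint; any entailment model would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def nli_label(premise: str, hypothesis: str) -> tuple[str, float]:
    """Return the predicted NLI label (entailment/neutral/contradiction) and its probability."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    idx = int(probs.argmax())
    return model.config.id2label[idx], float(probs[idx])

def looks_like_contradiction(reference: str, generated: str, threshold: float = 0.5) -> bool:
    """Heuristic detector: the generation contradicts the reference it should be grounded on."""
    label, prob = nli_label(reference, generated)
    return label.lower() == "contradiction" and prob >= threshold

reference = "The Eiffel Tower was completed in 1889 for the Exposition Universelle."
generated = "The Eiffel Tower was completed in 1925."
print(looks_like_contradiction(reference, generated))  # expected: True
```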
Transformer-based language models have been shown to be highly effective for several NLP tasks. In this article, we consider three transformer models, BERT, RoBERTa, and XLNet, in both small and large versions, and investigate how faithful their representations are with respect to the semantic content of texts. We formalize a notion of semantic faithfulness, in which the semantic content of a text should causally figure in a model’s inferences in question answering. We then test this notion by observing a model’s behavior when answering questions about a story after performing two novel semantic interventions: deletion intervention and negation intervention. While transformer models achieve high performance on standard question answering tasks, we show that they fail to be semantically faithful once we perform these interventions, for a significant number of cases (∼50% of cases for the deletion intervention, and a ∼20% drop in accuracy for the negation intervention). We then propose an intervention-based training regime that can mitigate the undesirable effects of the deletion intervention by a significant margin (from ∼50% to ∼6%). We analyze the inner workings of the models to better understand the effectiveness of intervention-based training for the deletion intervention. However, we show that this training does not attenuate other aspects of semantic unfaithfulness, such as the models’ inability to handle the negation intervention or to capture the predicate–argument structure of texts. We also test InstructGPT, via prompting, for its ability to handle the two interventions and to capture predicate–argument structure. While InstructGPT models achieve very high performance on the predicate–argument structure task, they fail to respond adequately to our deletion and negation interventions.
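As a rough illustration of the two interventions, the sketch below deletes or negates the sentence that supports an answer and checks whether a generic extractive QA model changes its prediction. The toy story, the question, and the DistilBERT-SQuAD checkpoint are assumptions for the example; the paper evaluates BERT, RoBERTa, and XLNet on its own question answering data.

```python
# Minimal sketch of deletion and negation interventions on an extractive QA model.
# The story/question and the DistilBERT-SQuAD checkpoint are illustrative assumptions.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

story = (
    "Maria put the keys in the drawer. "
    "Then she left the house to buy groceries."
)
question = "Where are the keys?"

# Deletion intervention: remove the sentence that supports the answer.
deleted_story = "Then she left the house to buy groceries."

# Negation intervention: negate the supporting sentence instead of removing it.
negated_story = (
    "Maria did not put the keys in the drawer. "
    "Then she left the house to buy groceries."
)

for name, context in [("original", story), ("deletion", deleted_story), ("negation", negated_story)]:
    result = qa(question=question, context=context)
    print(f"{name:>9}: {result['answer']!r} (score={result['score']:.2f})")

# A semantically faithful model should no longer answer "the drawer" after either
# intervention; extractive models often still do, which is the kind of failure measured here.
```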
Despite great performance on many tasks, language models (LMs) still struggle with reasoning, sometimes providing responses that cannot possibly be true because they stem from logical incoherence. We call such responses strong hallucinations and prove that they follow from the way an LM computes its internal representations for logical operators and the outputs it derives from those representations. Focusing on negation, we provide a novel solution in which negation is treated not as another element of a latent representation, but as an operation over an LM’s latent representations that constrains how they may evolve. We show that our approach improves model performance on cloze prompting and natural language inference tasks involving negation, without requiring training on sparse negative data.
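To make the idea of negation as an operation over latent representations concrete, here is a toy PyTorch sketch in which a learned operator is applied to the latent vector of an affirmative sentence instead of encoding the token "not". The encoder, operator, scoring head, and training signal are all illustrative assumptions and do not reproduce the paper's construction.

```python
# Toy sketch: negation as an operator on latent space rather than a token embedding.
# Every module and the loss below are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn

HIDDEN = 64

class ToyEncoder(nn.Module):
    """Stand-in encoder mapping a bag of token ids to a latent vector."""
    def __init__(self, vocab_size: int = 1000, hidden: int = HIDDEN):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, hidden)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.emb(token_ids)

class NegationOperator(nn.Module):
    """Negation as a learned map over latent representations."""
    def __init__(self, hidden: int = HIDDEN):
        super().__init__()
        self.linear = nn.Linear(hidden, hidden, bias=False)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.linear(z)

encoder = ToyEncoder()
negate = NegationOperator()
scorer = nn.Linear(HIDDEN, 1)  # e.g. a cloze / NLI scoring head

tokens = torch.tensor([[5, 42, 7, 301]])  # made-up ids for an affirmative sentence
z = encoder(tokens)                       # latent representation of the sentence
z_neg = negate(z)                         # latent representation of its negation

# One possible training signal: the operator should flip the score of the original,
# i.e. the two sigmoid scores should sum to 1. Further constraints (e.g. an
# involution-like negate(negate(z)) ≈ z) could also be imposed.
loss = (torch.sigmoid(scorer(z)) + torch.sigmoid(scorer(z_neg)) - 1.0).pow(2).mean()
loss.backward()
print(float(loss))
```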
With the advent of large language models (LLMs), the trend in NLP has been to train LLMs on vast amounts of data to solve diverse language understanding and generation tasks. The list of LLM successes is long and varied. Nevertheless, several recent papers provide empirical evidence that LLMs fail to capture important aspects of linguistic meaning. Focusing on universal quantification, we provide a theoretical foundation for these empirical findings by proving that LLMs cannot learn certain fundamental semantic properties including semantic entailment and consistency as they are defined in formal semantics. More generally, we show that LLMs are unable to learn concepts beyond the first level of the Borel Hierarchy, which imposes severe limits on the ability of LMs, both large and small, to capture many aspects of linguistic meaning. This means that LLMs will operate without formal guarantees on tasks that require entailments and deep linguistic understanding.
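For reference, the standard model-theoretic definitions of entailment and consistency that the abstract appeals to can be written as follows; the notation is ours, not the paper's.

```latex
% Standard model-theoretic definitions (notation ours, not the paper's).
% Entailment: every model of the premises is also a model of the conclusion.
\[
\Gamma \models \varphi
\iff
\text{for every model } \mathcal{M}:\ \bigl(\forall \psi \in \Gamma,\ \mathcal{M} \models \psi\bigr) \Rightarrow \mathcal{M} \models \varphi.
\]
% Consistency: the set of sentences has at least one model.
\[
\Gamma \text{ is consistent}
\iff
\exists\, \mathcal{M} \text{ such that } \mathcal{M} \models \psi \text{ for all } \psi \in \Gamma.
\]
% Example with a universal quantifier:
\[
\{\, \forall x\, (P(x) \rightarrow Q(x)),\ P(a) \,\} \models Q(a).
\]
```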