This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
Randy Goebel
Fixing paper assignments
Please select all papers that do not belong to this person.
Indicate below which author they should be assigned to.
Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-lingual reasoning abilities. This dual limitation makes it challenging to comprehensively assess LLMs’ performance in multilingual settings. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-lingual comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs. The results reveal significant disparities in the multilingual capabilities of LLMs: while they perform well in high-resource languages, their performance declines markedly in low-resource languages, particularly African languages. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
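For illustration, a minimal sketch of the kind of cross-lingual evaluation loop such a parallel benchmark enables follows; the data layout and the query_model helper are assumptions made here for the example, not part of MMLU-ProX itself.

```python
# Minimal sketch of per-language accuracy over parallel multiple-choice questions.
# Assumptions (not from the abstract): each question is a dict with "question",
# "options", and "answer" keys; query_model is a hypothetical helper that returns
# the model's chosen option label for one question.
from typing import Callable, Dict, List

def evaluate_parallel_benchmark(
    questions_by_lang: Dict[str, List[dict]],
    query_model: Callable[[str, List[str]], str],
) -> Dict[str, float]:
    """Compute accuracy per language on the same (parallel) set of questions."""
    accuracy = {}
    for lang, questions in questions_by_lang.items():
        correct = 0
        for q in questions:
            prediction = query_model(q["question"], q["options"])
            correct += int(prediction == q["answer"])
        accuracy[lang] = correct / len(questions)
    return accuracy
```

Because every language shares identical questions, per-language accuracies computed this way are directly comparable.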
Explaining the predictions of a deep neural network (DNN) is a challenging problem. Many attempts at interpreting those predictions have focused on attribution-based methods, which assess the contributions of individual features to each model prediction. However, attribution-based explanations are not always faithful to the target model; e.g., noisy gradients can result in unfaithful feature attributions for back-propagation methods. We present a method to learn explanation-specific representations while constructing deep network models for text classification. These representations can be used to faithfully interpret black-box predictions, i.e., to highlight the most important input features and their role in any particular prediction. We show that learning such representations improves model interpretability across various tasks, in both qualitative and quantitative evaluations, while preserving predictive performance.
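As a rough illustration of learning token-importance weights jointly with a classifier, a minimal PyTorch sketch follows; the architecture is an assumption made for the example and is not the authors' model.

```python
# Illustrative sketch: a classifier whose pooling weights are learned alongside
# the predictions and can be read off as per-token importance scores.
import torch
import torch.nn as nn

class WeightedPoolingClassifier(nn.Module):
    """Learns per-token importance weights together with the classifier."""
    def __init__(self, vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.scorer = nn.Linear(embed_dim, 1)        # per-token importance score
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids: torch.Tensor):
        embedded = self.embedding(token_ids)                       # (batch, seq, dim)
        weights = torch.softmax(self.scorer(embedded).squeeze(-1), dim=-1)
        pooled = torch.einsum("bs,bsd->bd", weights, embedded)     # weighted average
        return self.classifier(pooled), weights                    # logits + importance weights
```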
We discuss a variety of approaches to building a robust depression-level detection model from longer social media posts (i.e., Reddit depression forum posts) using a BERT model pre-trained on mental health text. Further, we report experimental results for a strategy that selects excerpts from long texts and then fine-tunes the BERT model, to work around memory constraints when processing such texts. We show that, with a domain-specific BERT, we can achieve reasonable accuracy with a fixed text size (in this case 200 tokens) for this task. In addition, we can use short-text classifiers to extract relevant excerpts from the long text and achieve slightly better accuracy, albeit at the cost of the processing time needed to extract such excerpts.
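A rough sketch of the fixed-size excerpt idea, using the Hugging Face transformers API, is given below; the checkpoint name, label count, and sentence scorer are placeholders rather than the paper's exact setup.

```python
# Sketch: rank sentences of a long post, then truncate to a 200-token excerpt
# before classification with a domain-specific BERT.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "mental-health-bert"  # hypothetical domain-specific checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)  # label count assumed

def encode_excerpt(post: str, score_sentence, max_tokens: int = 200):
    """Put the highest-scoring sentences first, then truncate to the token budget."""
    sentences = sorted(post.split("."), key=score_sentence, reverse=True)
    excerpt = ". ".join(sentences)
    return tokenizer(excerpt, truncation=True, max_length=max_tokens, return_tensors="pt")

# Usage sketch: logits = model(**encode_excerpt(post, score_sentence)).logits
```

Here score_sentence stands in for whatever short-text relevance classifier is used to pick the excerpts.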
Neural networks (NNs) applied to natural language processing (NLP) are becoming deeper and more complex, making them increasingly difficult to understand and interpret. Even in applications of limited scope on fixed data, the creation of these complex “black boxes” creates substantial challenges for debugging, understanding, and generalization. But rapid development in this field has now led to building more straightforward and interpretable models. We propose a new technique (DISK-CSV) to distill knowledge concurrently from any neural network architecture for text classification, capturing it as a lightweight interpretable/explainable classifier. Across multiple datasets, our approach achieves better performance than the target black box. In addition, our approach provides better explanations than existing techniques.
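For context, a generic sketch of distilling a black-box text classifier into a lightweight, inspectable linear student follows; it illustrates the general idea of a transparent distilled classifier, not the specific DISK-CSV procedure.

```python
# Generic sketch: fit a bag-of-words logistic regression to the labels produced
# by a black-box text classifier, so its weights can be inspected directly.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def distill_to_linear(texts, blackbox_predict):
    """Train an interpretable linear student on the black box's predictions."""
    vectorizer = CountVectorizer(max_features=20000)
    features = vectorizer.fit_transform(texts)
    teacher_labels = [blackbox_predict(t) for t in texts]   # teacher predictions as targets
    student = LogisticRegression(max_iter=1000).fit(features, teacher_labels)
    # Per-class word weights in student.coef_ act as a global explanation.
    return vectorizer, student
```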
We propose a new self-explainable model for Natural Language Processing (NLP) text classification tasks. Our approach constructs explanations concurrently with the formulation of classification predictions. To do so, we extract a rationale from the text, then use it to predict a concept of interest as the final prediction. We provide three types of explanations: 1) rationale extraction, 2) a measure of feature importance, and 3) clustering of concepts. In addition, we show how our model can be compressed without applying complicated compression techniques. We experimentally demonstrate our explainability approach on a number of well-known text classification datasets.
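A schematic PyTorch sketch of the extract-a-rationale-then-predict pattern described above follows; the module layout and rationale length are illustrative assumptions, not the authors' architecture.

```python
# Sketch: score tokens, keep a fixed-length rationale, and predict from the
# rationale only, so the selected tokens double as the explanation.
import torch
import torch.nn as nn

class RationaleClassifier(nn.Module):
    """Select a fixed-length token rationale, then predict from it alone."""
    def __init__(self, vocab_size: int, dim: int, num_classes: int, keep: int = 20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rationale_scorer = nn.Linear(dim, 1)    # scores each token
        self.predictor = nn.Linear(dim, num_classes)
        self.keep = keep                              # rationale length in tokens

    def forward(self, token_ids: torch.Tensor):
        embedded = self.embed(token_ids)                         # (batch, seq, dim)
        scores = self.rationale_scorer(embedded).squeeze(-1)     # (batch, seq)
        top = scores.topk(min(self.keep, scores.size(1)), dim=-1).indices
        mask = torch.zeros_like(scores).scatter(-1, top, 1.0)    # hard rationale mask
        pooled = (embedded * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        # Note: training the scorer through this hard selection needs a relaxation
        # (e.g., a straight-through estimator); omitted here for brevity.
        return self.predictor(pooled), mask                      # prediction + rationale
```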
For compounding languages, a great part of the topical semantics is conveyed via nominal compounds. Various applications of natural language processing can profit from explicit access to these compounds, provided by a lexicon. The best way to acquire such a resource is to harvest corpora that represent the domain in question. For Chinese, a significant difficulty lies in the fact that the text comes as a string of characters, segmented only by sentence boundaries. Extraction algorithms that rely solely on context variety do not perform precisely enough. We propose a pipeline of filters that starts from a candidate set established by accessor variety and then employs several methods to improve precision. For the experiments, the Xinhua part of the Chinese Gigaword Corpus was used. We extracted a random sample of 200 story texts with 119,509 Hanzi characters. All compound words in this evaluation corpus were tagged, segmented into their morphemes, and augmented with the POS information of their segments. A cascade of filters applied to a preliminary set of compound candidates led to a very high precision of over 90%, measured over types. The result also holds for a small corpus, where a solely contextual method introduces too much noise, even for the longer compounds. Introducing mutual information (MI) into the basic candidacy algorithm led to a much higher recall with still reasonable precision for subsequent manual processing. Especially for the four-character compounds, which in our sample represent over 40% of the target data, the method is sufficiently effective to support the rapid construction of compound dictionaries from domain corpora.
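A small sketch of the accessor-variety candidate step that such a pipeline starts from is given below; the thresholds and n-gram lengths are illustrative choices, not the paper's settings, and the subsequent precision filters are omitted.

```python
# Sketch of accessor-variety candidate extraction from unsegmented Chinese text:
# a character n-gram is kept when the numbers of distinct characters observed
# immediately to its left and to its right both reach a threshold.
from collections import defaultdict

def accessor_variety_candidates(text: str, max_len: int = 4, threshold: int = 3):
    left, right = defaultdict(set), defaultdict(set)
    for n in range(2, max_len + 1):
        for i in range(len(text) - n):
            gram = text[i:i + n]
            if i > 0:
                left[gram].add(text[i - 1])    # distinct left neighbors
            right[gram].add(text[i + n])        # distinct right neighbors
    return [g for g in right
            if min(len(left[g]), len(right[g])) >= threshold]
```

In practice the text would first be split at sentence boundaries, and the resulting candidate set would then be passed through the precision-improving filters described above.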