Understanding the discussion moves that teachers and students use to engage in classroom discussions is important for supporting pre-service teacher learning and teacher educators. This work introduces a novel conversational multi-label corpus of teaching transcripts collected from a simulated classroom environment for Conversational Argument Move AnaLysis (CAMAL). The dataset covers various argumentation moves used by pre-service teachers and students in mathematics and science classroom discussions. It includes 165 transcripts of discussions that pre-service elementary teachers facilitated in a simulated classroom environment with five student avatars. The discussion transcripts were annotated by education assessment experts for nine argumentation moves (a.k.a. intents) used by the pre-service teachers and students during the discussions. In this paper, we describe the dataset, our annotation framework, and the models we employed to detect argumentation moves. Our experiments with state-of-the-art models demonstrate the complexity of the CAMAL task presented in the dataset. The results reveal that models combining CNN and LSTM structures with speaker ID graphs improved the F1-score of our baseline models for detecting speakers’ intents by a large margin. The complexity of the CAMAL task creates research opportunities for future studies. We share the dataset, the source code, and the annotation framework publicly at http://github.com/uonlp/camal-dataset.
Extensive training datasets represent one of the important factors for the impressive learning capabilities of large language models (LLMs). However, the training datasets for current LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency for training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable datasets to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline to achieve the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is released on Hugging Face to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
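The five stages named above can be illustrated with a minimal Python sketch. It is not the CulturaX pipeline: the toy language check, the blocklist, the thresholds, and every function name below are assumptions chosen only to show how such stages might be chained.

```python
import hashlib
import re
from urllib.parse import urlparse

# Hypothetical blocklist and thresholds, used only for illustration.
BLOCKED_DOMAINS = {"spam.example"}
MIN_WORDS = 5
MAX_SYMBOL_RATIO = 0.3


def identify_language(text):
    """Toy stand-in for a real language identifier (assumption)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "unknown"
    ascii_ratio = sum(c.isascii() for c in letters) / len(letters)
    return "en" if ascii_ratio > 0.9 else "other"


def url_allowed(url):
    """URL-based filtering: drop documents from blocklisted hosts."""
    return urlparse(url).hostname not in BLOCKED_DOMAINS


def passes_metrics(text):
    """Metric-based cleaning: minimum length and maximum symbol ratio."""
    if len(text.split()) < MIN_WORDS:
        return False
    symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
    return symbols / len(text) <= MAX_SYMBOL_RATIO


def refine(text):
    """Document refinement: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(docs, expected_lang="en"):
    """Run the stages in order and drop exact duplicates by content hash."""
    seen = set()
    kept = []
    for doc in docs:
        if identify_language(doc["text"]) != expected_lang:
            continue
        if not url_allowed(doc["url"]):
            continue
        if not passes_metrics(doc["text"]):
            continue
        text = refine(doc["text"])
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append({"url": doc["url"], "text": text})
    return kept
```

Exact hashing only catches identical documents; near-duplicate detection at corpus scale would need a different technique, which this sketch does not attempt.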
Event Causality Identification (ECI) is the task of detecting causal relations between events mentioned in text. Although this task has been extensively studied for English materials, it is under-explored for many other languages. A major reason for this issue is the lack of multilingual datasets that provide consistent annotations for event causality relations in multiple non-English languages. To address this issue, we introduce a new multilingual dataset for ECI, called MECI. The dataset employs consistent annotation guidelines for five typologically different languages, i.e., English, Danish, Spanish, Turkish, and Urdu. Our dataset thus enables a new research direction on cross-lingual transfer learning for ECI. Our extensive experiments demonstrate the high quality of MECI, which can provide ample challenges and directions for future research. We will publicly release MECI to promote research on multilingual ECI.
Event extraction (EE) is one of the fundamental tasks for information extraction whose goal is to identify mentions of events and their participants in text. Due to its importance, different methods and datasets have been introduced for EE. However, existing EE datasets are limited to formally written documents such as news articles or scientific papers. As such, the challenges of EE in informal and noisy texts are not adequately studied. In particular, video transcripts constitute an important domain that can benefit tremendously from EE systems (e.g., for video retrieval), but it has not been studied in the EE literature due to the lack of necessary datasets. To address this limitation, we propose the first large-scale EE dataset obtained from transcripts of streamed videos on the video hosting platform Behance to promote future research in this area. In addition, we extensively evaluate existing state-of-the-art EE methods on our new dataset. We demonstrate that such systems cannot achieve adequate performance on the proposed dataset, revealing challenges and opportunities for further research.
Existing works on information extraction (IE) have mainly solved the four main tasks separately (entity mention recognition, relation extraction, event trigger detection, and argument extraction), thus failing to benefit from inter-dependencies between tasks. This paper presents a novel deep learning model to simultaneously solve the four tasks of IE in a single model (called FourIE). Compared to the few prior works on jointly performing four IE tasks, FourIE features two novel contributions to capture inter-dependencies between tasks. First, at the representation level, we introduce an interaction graph between instances of the four tasks that is used to enrich the prediction representation for one instance with those from related instances of other tasks. Second, at the label level, we propose a dependency graph for the information types in the four IE tasks that captures the connections between the types expressed in an input sentence. A new regularization mechanism is introduced to enforce the consistency between the golden and predicted type dependency graphs to improve representation learning. We show that the proposed model achieves state-of-the-art performance for joint IE in both monolingual and multilingual learning settings with three different languages.
We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks for over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines on sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing while maintaining competitive performance for tokenization, multi-word token expansion, and lemmatization over 90 Universal Dependencies treebanks. Despite the use of a large pretrained transformer, our toolkit is still efficient in memory usage and speed. This is achieved by our novel plug-and-play mechanism with Adapters, where a multilingual pretrained transformer is shared across pipelines for different languages. Our toolkit, along with pretrained models and code, is publicly available at: https://github.com/nlp-uoregon/trankit. A demo website for our toolkit is also available at: http://nlp.uoregon.edu/trankit. Finally, we create a demo video for Trankit at: https://youtu.be/q0KGP3zGjGc.
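The idea of sharing one backbone across languages while swapping small language-specific adapters can be sketched in plain Python. This is a toy illustration under stated assumptions, not Trankit's code or API: the `Adapter` and `SharedBackbonePipeline` classes, the residual bottleneck form, and the stand-in backbone function are all hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


class Adapter:
    """Residual bottleneck: hidden + up(relu(down(hidden))) (assumed form)."""

    def __init__(self, down, up):
        self.down = down  # projects to a small bottleneck dimension
        self.up = up      # projects back to the hidden dimension

    def __call__(self, hidden):
        update = matvec(self.up, relu(matvec(self.down, hidden)))
        return [h + u for h, u in zip(hidden, update)]


class SharedBackbonePipeline:
    """One backbone held in memory; only the active adapter changes."""

    def __init__(self, backbone):
        self.backbone = backbone
        self.adapters = {}
        self.active = None

    def add_language(self, lang, adapter):
        self.adapters[lang] = adapter

    def set_active(self, lang):
        if lang not in self.adapters:
            raise KeyError(f"no adapter registered for {lang!r}")
        self.active = lang

    def encode(self, features):
        hidden = self.backbone(features)
        return self.adapters[self.active](hidden)
```

Because only the adapter weights differ per language, adding a language in this sketch costs a small extra parameter set rather than another copy of the backbone.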
Recent studies on event detection (ED) have shown that the syntactic dependency graph can be employed in graph convolutional neural networks (GCNs) to achieve state-of-the-art performance. However, the computation of the hidden vectors in such graph-based models is agnostic to the trigger candidate words, potentially retaining information that is irrelevant to the trigger candidate for event prediction. In addition, the current models for ED fail to exploit the overall contextual importance scores of the words, which can be obtained via the dependency tree, to boost performance. In this study, we propose a novel gating mechanism to filter noisy information in the hidden vectors of the GCN models for ED based on the information from the trigger candidate. We also introduce novel mechanisms to achieve contextual diversity for the gates and importance score consistency for the graphs and models in ED. The experiments show that the proposed model achieves state-of-the-art performance on two ED datasets.
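A trigger-conditioned gate of the kind described above can be sketched as follows. The abstract does not give the exact formulation, so this is a generic element-wise sigmoid gate computed from the trigger candidate's vector; the function names and the single weight matrix are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def trigger_gate(trigger_vec, weight):
    """Compute one gate value in (0, 1) per hidden dimension.

    gate[j] = sigmoid(sum_k weight[j][k] * trigger_vec[k])
    """
    return [
        sigmoid(sum(w * t for w, t in zip(row, trigger_vec)))
        for row in weight
    ]


def apply_gate(hidden_vectors, gate):
    """Scale every word's hidden vector element-wise by the same gate."""
    return [[g * h for g, h in zip(gate, vec)] for vec in hidden_vectors]
```

In this sketch the gate depends only on the trigger candidate, so the same word sequence yields different filtered hidden vectors for different candidates.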
Current event detection models under supervised learning settings fail to transfer to new event types. Few-shot learning has not been explored in event detection even though it allows a model to generalize well to new event types. In this work, we formulate event detection as a few-shot learning problem so that event detection can be extended to new event types. We propose two novel loss factors that match examples in the support set to provide more training signals to the model. Moreover, these training signals can be applied in many metric-based few-shot learning models. Our extensive experiments on the ACE-2005 dataset (under a few-shot learning setting) show that the proposed method can improve the performance of few-shot learning.
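For readers unfamiliar with metric-based few-shot learning, a minimal prototype-based classifier is sketched below. It illustrates the general family only and does not implement the two proposed loss factors; the function names and the choice of squared Euclidean distance are assumptions.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototypes(support):
    """Map each event type to the mean of its support-set vectors.

    `support` is a list of (vector, label) pairs.
    """
    by_label = defaultdict(list)
    for vector, label in support:
        by_label[label].append(vector)
    return {label: mean_vector(vecs) for label, vecs in by_label.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Assign the query to the event type with the nearest prototype."""
    return min(
        prototypes,
        key=lambda label: squared_distance(query, prototypes[label]),
    )
```

New event types can be handled at test time by supplying a few labeled support examples for them, with no retraining of the classifier in this sketch.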
Traditional event detection classifies a word or a phrase in a given sentence into a set of predefined event types. The limitation of such a predefined set is that it prevents the adaptation of event detection models to new event types. We study a novel formulation of event detection that describes types via several keywords to match the contexts in documents. This facilitates the application of the models to new types. We introduce a novel feature-based attention mechanism for convolutional neural networks for event detection in the new formulation. Our extensive experiments demonstrate the benefits of the new formulation for new type extension for event detection as well as the proposed attention mechanism for this problem.