Viet Dac Lai


2024

pdf
CAMAL: A Novel Dataset for Multi-label Conversational Argument Move Analysis
Viet Dac Lai | Duy Ngoc Pham | Jonathan Steinberg | Jamie Mikeska | Thien Huu Nguyen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Understanding the discussion moves that teachers and students use to engage in classroom discussions is important to support pre-service teacher learning and teacher educators. This work introduces a novel conversational multi-label corpus of teaching transcripts collected from a simulated classroom environment for Conversational Argument Move AnaLysis (CAMAL). The dataset offers various argumentation moves used by pre-service teachers and students in mathematics and science classroom discussions. The dataset includes 165 transcripts from these discussions that pre-service elementary teachers facilitated in a simulated classroom environment of five student avatars. The discussion transcripts were annotated by education assessment experts for nine argumentation moves (aka. intents) used by the pre-service teachers and students during the discussions. In this paper, we describe the dataset, our annotation framework, and the models we employed to detect argumentation moves. Our experiments with state-of-the-art models demonstrate the complexity of the CAMAL task presented in the dataset. The result reveals that models that combined CNN and LSTM structures with speaker ID graphs improved the F1-score of our baseline models to detect speakers’ intents by a large margin. Given the complexity of the CAMAL task, it creates research opportunities for future studies. We share the dataset, the source code, and the annotation framework publicly at http://github.com/uonlp/camal-dataset.

pdf
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Thuat Nguyen | Chien Van Nguyen | Viet Dac Lai | Hieu Man | Nghia Trung Ngo | Franck Dernoncourt | Ryan A. Rossi | Thien Huu Nguyen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Extensive training datasets represent one of the important factors for the impressive learning capabilities of large language models (LLMs). However, these training datasets for current LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency for training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable dataset to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is released in Hugging Face facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.

2022

pdf
MECI: A Multilingual Dataset for Event Causality Identification
Viet Dac Lai | Amir Pouran Ben Veyseh | Minh Van Nguyen | Franck Dernoncourt | Thien Huu Nguyen
Proceedings of the 29th International Conference on Computational Linguistics

Event Causality Identification (ECI) is the task of detecting causal relations between events mentioned in the text. Although this task has been extensively studied for English materials, it is under-explored for many other languages. A major reason for this issue is the lack of multilingual datasets that provide consistent annotations for event causality relations in multiple non-English languages. To address this issue, we introduce a new multilingual dataset for ECI, called MECI. The dataset employs consistent annotation guidelines for five typologically different languages, i.e., English, Danish, Spanish, Turkish, and Urdu. Our dataset thus enable a new research direction on cross-lingual transfer learning for ECI. Our extensive experiments demonstrate high quality for MECI that can provide ample research challenges and directions for future research. We will publicly release MECI to promote research on multilingual ECI.

pdf
Event Extraction in Video Transcripts
Amir Pouran Ben Veyseh | Viet Dac Lai | Franck Dernoncourt | Thien Huu Nguyen
Proceedings of the 29th International Conference on Computational Linguistics

Event extraction (EE) is one of the fundamental tasks for information extraction whose goal is to identify mentions of events and their participants in text. Due to its importance, different methods and datasets have been introduced for EE. However, existing EE datasets are limited to formally written documents such as news articles or scientific papers. As such, the challenges of EE in informal and noisy texts are not adequately studied. In particular, video transcripts constitute an important domain that can benefit tremendously from EE systems (e.g., video retrieval), but has not been studied in EE literature due to the lack of necessary datasets. To address this limitation, we propose the first large-scale EE dataset obtained for transcripts of streamed videos on the video hosting platform Behance to promote future research in this area. In addition, we extensively evaluate existing state-of-the-art EE methods on our new dataset. We demonstrate that such systems cannot achieve adequate performance on the proposed dataset, revealing challenges and opportunities for further research effort.

pdf bib
Proceedings of the First Workshop On Transcript Understanding
Franck Dernoncourt | Thien Huu Nguyen | Viet Dac Lai | Amir Pouran Ben Veyseh | Trung H. Bui | David Seunghyun Yoon
Proceedings of the First Workshop On Transcript Understanding

2021

pdf
Cross-Task Instance Representation Interactions and Label Dependencies for Joint Information Extraction with Graph Convolutional Networks
Minh Van Nguyen | Viet Dac Lai | Thien Huu Nguyen
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Existing works on information extraction (IE) have mainly solved the four main tasks separately (entity mention recognition, relation extraction, event trigger detection, and argument extraction), thus failing to benefit from inter-dependencies between tasks. This paper presents a novel deep learning model to simultaneously solve the four tasks of IE in a single model (called FourIE). Compared to few prior work on jointly performing four IE tasks, FourIE features two novel contributions to capture inter-dependencies between tasks. First, at the representation level, we introduce an interaction graph between instances of the four tasks that is used to enrich the prediction representation for one instance with those from related instances of other tasks. Second, at the label level, we propose a dependency graph for the information types in the four IE tasks that captures the connections between the types expressed in an input sentence. A new regularization mechanism is introduced to enforce the consistency between the golden and predicted type dependency graphs to improve representation learning. We show that the proposed model achieves the state-of-the-art performance for joint IE on both monolingual and multilingual learning settings with three different languages.

pdf
Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing
Minh Van Nguyen | Viet Dac Lai | Amir Pouran Ben Veyseh | Thien Huu Nguyen
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines over sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing while maintaining competitive performance for tokenization, multi-word token expansion, and lemmatization over 90 Universal Dependencies treebanks. Despite the use of a large pretrained transformer, our toolkit is still efficient in memory usage and speed. This is achieved by our novel plug-and-play mechanism with Adapters where a multilingual pretrained transformer is shared across pipelines for different languages. Our toolkit along with pretrained models and code are publicly available at: https://github.com/nlp-uoregon/trankit. A demo website for our toolkit is also available at: http://nlp.uoregon.edu/trankit. Finally, we create a demo video for Trankit at: https://youtu.be/q0KGP3zGjGc.

pdf
Event Extraction from Historical Texts: A New Dataset for Black Rebellions
Viet Dac Lai | Minh Van Nguyen | Heidi Kaufman | Thien Huu Nguyen
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf
Event Detection: Gate Diversity and Syntactic Importance Scores for Graph Convolution Neural Networks
Viet Dac Lai | Tuan Ngo Nguyen | Thien Huu Nguyen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recent studies on event detection (ED) have shown that the syntactic dependency graph can be employed in graph convolution neural networks (GCN) to achieve state-of-the-art performance. However, the computation of the hidden vectors in such graph-based models is agnostic to the trigger candidate words, potentially leaving irrelevant information for the trigger candidate for event prediction. In addition, the current models for ED fail to exploit the overall contextual importance scores of the words, which can be obtained via the dependency tree, to boost the performance. In this study, we propose a novel gating mechanism to filter noisy information in the hidden vectors of the GCN models for ED based on the information from the trigger candidate. We also introduce novel mechanisms to achieve the contextual diversity for the gates and the importance score consistency for the graphs and models in ED. The experiments show that the proposed model achieves state-of-the-art performance on two ED datasets.

pdf
Extensively Matching for Few-shot Learning Event Detection
Viet Dac Lai | Thien Huu Nguyen | Franck Dernoncourt
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events

Current event detection models under supervised learning settings fail to transfer to new event types. Few-shot learning has not been explored in event detection even though it allows a model to perform well with high generalization on new event types. In this work, we formulate event detection as a few-shot learning problem to enable to extend event detection to new event types. We propose two novel loss factors that matching examples in the support set to provide more training signals to the model. Moreover, these training signals can be applied in many metric-based few-shot learning models. Our extensive experiments on the ACE-2005 dataset (under a few-shot learning setting) show that the proposed method can improve the performance of few-shot learning.

2019

pdf
Extending Event Detection to New Types with Learning from Keywords
Viet Dac Lai | Thien Nguyen
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Traditional event detection classifies a word or a phrase in a given sentence for a set of prede- fined event types. The limitation of such pre- defined set is that it prevents the adaptation of the event detection models to new event types. We study a novel formulation of event detec- tion that describes types via several keywords to match the contexts in documents. This fa- cilitates the operation of the models to new types. We introduce a novel feature-based attention mechanism for convolutional neural networks for event detection in the new for- mulation. Our extensive experiments demon- strate the benefits of the new formulation for new type extension for event detection as well as the proposed attention mechanism for this problem