Derek Pham


2022

pdf
ErAConD: Error Annotated Conversational Dialog Dataset for Grammatical Error Correction
Xun Yuan | Derek Pham | Sam Davidson | Zhou Yu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Currently available grammatical error correction (GEC) datasets are compiled using essays or other long-form text written by language learners, limiting the applicability of these datasets to other domains such as informal writing and conversational dialog. In this paper, we present a novel GEC dataset consisting of parallel original and corrected utterances drawn from open-domain chatbot conversations; this dataset is, to our knowledge, the first GEC dataset targeted to a human-machine conversational setting. We also present a detailed annotation scheme which ranks errors by perceived impact on comprehension, making our dataset more representative of real-world language learning applications. To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model. Experimental results show the effectiveness of our data in improving GEC model performance in a conversational scenario.

pdf
Towards Unsupervised Morphological Analysis of Polysynthetic Languages
Sujay Khandagale | Yoann Léveillé | Samuel Miller | Derek Pham | Ramy Eskander | Cass Lowry | Richard Compton | Judith Klavans | Maria Polinsky | Smaranda Muresan
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Polysynthetic languages present a challenge for morphological analysis due to the complexity of their words and the lack of high-quality annotated datasets needed to build and/or evaluate computational models. The contribution of this work is twofold. First, using linguists’ help, we generate and contribute high-quality annotated data for two low-resource polysynthetic languages for two tasks: morphological segmentation and part-of-speech (POS) tagging. Second, we present the results of state-of-the-art unsupervised approaches for these two tasks on Adyghe and Inuktitut. Our findings show that for these polysynthetic languages, using linguistic priors helps the task of morphological segmentation and that using stems rather than words as the core unit of abstraction leads to superior performance on POS tagging.