Anh-Duc Vu

Also published as: Anh-duc Vu

2025

pdf bib abs
Estimation of Text Difficulty in the Context of Language Learning
Anisia Katinskaia | Anh-Duc Vu | Jue Hou | Ulla Vanhatalo | Yiheng Wu | Roman Yangarber
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Easy language and text simplification are currently topical research questions, with important applications in many contexts, and with various approaches under active investigation, including prompt-based methods. The estimation of the level of difficulty of a text becomes a crucial challenge when the estimator is employed in a simplification workflow as a quality-control mechanism. It can act as a critic in frameworks where it can guide other models, which are responsible for generating text at a specified level of difficulty, as determined by the user’s needs.We present our work in the context of simplified Finnish. We discuss problems in collecting corpora for training models for estimation of text difficulty, and our experiments with estimation models.The results of the experiments are promising: the models appear usable both for assessment and for deployment as a component in a larger simplification framework.

pdf bib abs
A Bayesian Approach to Inferring Prerequisite Structures and Topic Difficulty in Language Learning
Anh-Duc Vu | Jue Hou | Anisia Katinskaia | Ching-Fan Sheu | Roman Yangarber
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Understanding how linguistic topics are related to each another is essential for designing effective and adaptive second-language (L2) instruction. We present a data-driven framework to model topic dependencies and their difficulty within a L2 learning curriculum. First, we estimate topic difficulty and student ability using a three-parameter Item Response Theory (IRT) model. Second, we construct topic-level knowledge graphs—as directed acyclic graphs (DAGs)—to capture the prerequisite relations among the topics, comparing a threshold-based method with the statistical Grow-Shrink Markov Blanket algorithm. Third, we evaluate the alignment between IRT-inferred topic difficulty and the structure of the graphs using edge-level and global ordering metrics. Finally, we compare the IRT-based estimates of learner ability with assessments of the learners provided by teachers to validate the model’s effectiveness in capturing learner proficiency. Our results show a promising agreement between the inferred graphs, IRT estimates, and human teachers’ assessments, highlighting the framework’s potential to support personalized learning and adaptive curriculum design in intelligent tutoring systems.

2024

pdf bib abs
What Do Transformers Know about Government?
Jue Hou | Anisia Katinskaia | Lari Kotilainen | Sathianpong Trangcasanchai | Anh-Duc Vu | Roman Yangarber
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper investigates what insights about linguistic features and what knowledge about the structure of natural language can be obtained from the encodings in transformer language models. In particular, we explore how BERT encodes the government relation between constituents in a sentence. We use several probing classifiers, and data from two morphologically rich languages. Our experiments show that information about government is encoded across all transformer layers, but predominantly in the early layers of the model. We find that, for both languages, a small number of attention heads encode enough information about the government relations to enable us to train a classifier capable of discovering new, previously unknown types of government, never seen in the training data. Currently, data is lacking for the research community working on grammatical constructions, and government in particular. We release the Government Bank—a dataset defining the government relations for thousands of lemmas in the languages in our experiments.

pdf bib abs
Intelligent Tutor to Support Teaching and Learning of Tatar
Alsu Zakirova | Jue Hou | Anisia Katinskaia | Anh-Duc Vu | Roman Yangarber
Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024)

This paper presents our work on tools to support the Tatar language, using Revita, a web-based Intelligent Tutoring System for language teaching and learning. The system allows the users — teachers and learners — to upload arbitrary authentic texts, and automatically creates exercises based on these texts that engage the learners in active production of language. It provides graduated feedback when they make mistakes, and performs continuous assessment, based on which the system selects exercises for the learners at the appropriate level. The assessment also helps the students maintain their learning pace, and helps the teachers to monitor their progress.The paper describes the functionality currently implemented for Tatar, which enables learners — who possess basic proficiency beyond the beginner level — to improve their competency, using texts of their choice as learning content. Support for Tatar is being developed to increase public interest in learning the language of this important regional minority, as well as to to provide tools for improving fluency to “heritage speakers” — those who have substantial passive competency, but lack active fluency and need support for regular practice.

2023

pdf bib abs
Linguistic Constructs Represent the Domain Model in Intelligent Language Tutoring
Anisia Katinskaia | Jue Hou | Anh-duc Vu | Roman Yangarber
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

This paper presents the development of the AI-based language-learning platform, Revita. It is an intelligent online tutor, developed to support learners of multiple languages, from lower-intermediate toward advanced levels. It has been in pilot use with hundreds of students at several universities, whose feedback and needs shape the development. One of the main emerging features of Revita is the system of linguistic constructs to represent the domain knowledge. The system of constructs is developed in collaboration with experts in language pedagogy. Constructs define the types of exercises, the content of the feedback, and enable detailed modeling and evaluation of learner progress.

pdf bib abs
Effects of sub-word segmentation on performance of transformer language models
Jue Hou | Anisia Katinskaia | Anh-Duc Vu | Roman Yangarber
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Language modeling is a fundamental task in natural language processing, which has been thoroughly explored with various architectures and hyperparameters. However, few studies focus on the effect of sub-word segmentation on the performance of language models (LMs). In this paper, we compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation — Morfessor and StateMorph. We train the models for several languages — including ones with very rich morphology — and compare their performance with different segmentation algorithms, vocabulary sizes, and model sizes. The results show that training with morphological segmentation allows the LMs to: (1) achieve lower perplexity, (2) converge more efficiently in terms of training time, and (3) achieve equivalent or better evaluation scores on downstream tasks. Lastly, we show that (4) LMs of smaller size using morphological segmentation can perform comparably to models of larger size trained with BPE — both in terms of (1) perplexity and (3) scores on downstream tasks. Points (2) and (4) impact on sustainability, since they reduce the model cost; and while 2 reduces cost only in the training phase, 4 does so also in the inference phase.