2025
Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs
Xiulin Yang | Tatsuya Aoyama | Yuekun Yao | Ethan Wilcox
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Do language models (LMs) offer insights into human language learning? A common argument against this idea is that because their architecture and training paradigm differ so vastly from those of humans, LMs can learn arbitrary inputs as easily as natural languages. We test this claim by training LMs to model impossible and typologically unattested languages. Unlike previous work, which has focused exclusively on English, we conduct experiments on 12 languages from 4 language families with two newly constructed parallel corpora. Our results show that while GPT-2 small can largely distinguish attested languages from their impossible counterparts, it does not achieve perfect separation between all the attested languages and all the impossible ones. We further test whether GPT-2 small distinguishes typologically attested from unattested languages with different NP orders by manipulating word order based on Greenberg’s Universal 20. We find that the model’s perplexity scores do not distinguish attested vs. unattested word orders, while its performance on the generalization test does. These findings suggest that LMs exhibit some human-like inductive biases, though these biases are weaker than those found in human learners.
2024
Predicting generalization performance with correctness discriminators
Yuekun Yao | Alexander Koller
Findings of the Association for Computational Linguistics: EMNLP 2024
The ability to predict an NLP model’s accuracy on unseen, potentially out-of-distribution data is a prerequisite for trustworthiness. We present a novel model that establishes upper and lower bounds on the accuracy, without requiring gold labels for the unseen data. We achieve this by training a *discriminator* which predicts whether the output of a given sequence-to-sequence model is correct or not. We show across a variety of tagging, parsing, and semantic parsing tasks that the gold accuracy is reliably between the predicted upper and lower bounds, and that these bounds are remarkably close together.
Simple and effective data augmentation for compositional generalization
Yuekun Yao | Alexander Koller
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Compositional generalization, the ability to predict complex meanings from training on simpler sentences, poses challenges for powerful pretrained seq2seq models. In this paper, we show that data augmentation methods that sample MRs and backtranslate them can be effective for compositional generalization, but only if we sample from the right distribution. Remarkably, sampling from a uniform distribution performs almost as well as sampling from the test distribution, and greatly outperforms earlier methods that sampled from the training distribution. We further conduct experiments to investigate why this happens and where the benefit of such data augmentation methods comes from.
2023
SLOG: A Structural Generalization Benchmark for Semantic Parsing
Bingzhi Li | Lucia Donatelli | Alexander Koller | Tal Linzen | Yuekun Yao | Najoung Kim
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
The goal of compositional generalization benchmarks is to evaluate how well models generalize to new complex linguistic expressions. Existing benchmarks often focus on lexical generalization, the interpretation of novel lexical items in syntactic structures familiar from training; structural generalization tasks, where a model needs to interpret syntactic structures that are themselves unfamiliar from training, are often underrepresented, resulting in overly optimistic perceptions of how well models can generalize. We introduce SLOG, a semantic parsing dataset that extends COGS (Kim and Linzen, 2020) with 17 structural generalization cases. In our experiments, the generalization accuracy of Transformer models, including pretrained ones, only reaches 40.6%, while a structure-aware parser only achieves 70.8%. These results are far from the near-perfect accuracy existing models achieve on COGS, demonstrating the role of SLOG in foregrounding the large discrepancy between models’ lexical and structural generalization capacities.
2022
Structural generalization is hard for sequence-to-sequence models
Yuekun Yao | Alexander Koller
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Sequence-to-sequence (seq2seq) models have been successful across many NLP tasks, including ones that require predicting linguistic structure. However, recent work on compositional generalization has shown that seq2seq models achieve very low accuracy in generalizing to linguistic structures that were not seen in training. We present new evidence that this is a general limitation of seq2seq models that is present not just in semantic parsing, but also in syntactic parsing and in text-to-text tasks, and that this limitation can often be overcome by neurosymbolic models that have linguistic knowledge built in. We further report on experiments that give initial answers on the reasons for these limitations.
2020
Dynamic Masking for Improved Stability in Online Spoken Language Translation
Yuekun Yao | Barry Haddow
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
ELITR Non-Native Speech Translation at IWSLT 2020
Dominik Macháček | Jonáš Kratochvíl | Sangeet Sagar | Matúš Žilinec | Ondřej Bojar | Thai-Son Nguyen | Felix Schneider | Philip Williams | Yuekun Yao
Proceedings of the 17th International Conference on Spoken Language Translation
This paper describes the ELITR system submission for the non-native speech translation task at IWSLT 2020. We describe systems for offline ASR, real-time ASR, and our cascaded approach to offline SLT and real-time SLT. We select our primary candidates from a pool of pre-existing systems, develop a new end-to-end general ASR system, and train a hybrid ASR system on non-native speech. The provided small validation set prevents us from carrying out complex validation, but we submit all the unselected candidates for contrastive evaluation on the test set.