Matthew Finlayson


2022

pdf
What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment
Matthew Finlayson | Kyle Richardson | Ashish Sabharwal | Peter Clark
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The instruction learning paradigm—where a model learns to perform new tasks from task descriptions alone—has become popular in research on general-purpose models. The capabilities of large transformer models as instruction learners, however, remain poorly understood. We use a controlled synthetic environment to characterize such capabilities. Specifically, we use the task of deciding whether a given string matches a regular expression (viewed as an instruction) to identify properties of tasks, instructions, and instances that make instruction learning challenging. For instance, we find that our model, a fine-tuned T5-based text2text transformer, struggles with large regular languages, suggesting that less precise instructions are challenging for models. Instruction executions that require tracking longer contexts of prior steps are also difficult. We use our findings to systematically construct a challenging instruction learning dataset, which we call Hard RegSet. Fine-tuning on Hard RegSet, our large transformer learns to correctly interpret (with at least 90% accuracy) only 65.6% of test instructions, and 11%-24% of the instructions in out-of-distribution generalization settings. We thus propose Hard RegSet as a challenging instruction learning dataset, and a controlled environment for studying instruction learning.

pdf
LILA: A Unified Benchmark for Mathematical Reasoning
Swaroop Mishra | Matthew Finlayson | Pan Lu | Leonard Tang | Sean Welleck | Chitta Baral | Tanmay Rajpurohit | Oyvind Tafjord | Ashish Sabharwal | Peter Clark | Ashwin Kalyan
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Mathematical reasoning skills are essential for general-purpose intelligentsystems to perform tasks from grocery shopping to climate modeling.Towards evaluating and improving AI systems in this domain, we proposeLILA, a unified mathematical reasoning benchmark consisting of 23 diversetasks along four dimensions:(i) mathematical abilities e.g., arithmetic, calculus (ii) language format e.g., question-answering, fill-in-the-blanks (iii) language diversity e.g., no language, simple language (iv) external knowledge e.g., commonsense, physics. We construct our benchmark by extending 20 datasets benchmark by collecting task instructions and solutions in the form of Python programs,thereby obtaining explainable solutions in addition to the correct answer.We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation.Finally, we introduce BHASKARA,a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models),while the best performing model only obtains 60.40%,indicating the room for improvement in general mathematical reasoning and understanding.

2021

pdf
Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models
Matthew Finlayson | Aaron Mueller | Sebastian Gehrmann | Stuart Shieber | Tal Linzen | Yonatan Belinkov
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Targeted syntactic evaluations have demonstrated the ability of language models to perform subject-verb agreement given difficult contexts. To elucidate the mechanisms by which the models accomplish this behavior, this study applies causal mediation analysis to pre-trained neural language models. We investigate the magnitude of models’ preferences for grammatical inflections, as well as whether neurons process subject-verb agreement similarly across sentences with different syntactic structures. We uncover similarities and differences across architectures and model sizes—notably, that larger models do not necessarily learn stronger preferences. We also observe two distinct mechanisms for producing subject-verb agreement depending on the syntactic structure of the input sentence. Finally, we find that language models rely on similar sets of neurons when given sentences with similar syntactic structure.