Aviad Sar-Shalom

Also published as: Aviad Sar-shalom


2023

Curating Datasets for Better Performance with Example Training Dynamics
Aviad Sar-Shalom | Roy Schwartz
Findings of the Association for Computational Linguistics: ACL 2023

The landscape of NLP research is dominated by large-scale models training on colossal datasets, relying on data quantity rather than quality. As an alternative to this landscape, we propose a method for weighing the relative importance of examples in a dataset based on their Example Training Dynamics (ETD; Swayamdipta et al., 2020), a set of metrics computed during training. We propose a new way of computing the ETD of a dataset, and show that they can be used to improve performance in both in-distribution and out-of-distribution testing. We show that ETD can be transferable, i.e., they can be computed once and used for training different models, effectively reducing their computation cost. Finally, we suggest an active learning approach for computing ETD during training rather than as a preprocessing step, an approach that is not as effective but dramatically reduces the extra computational costs.
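
To make "metrics computed during training" concrete, here is a minimal sketch of the two per-example training dynamics defined by Swayamdipta et al. (2020): confidence (the mean probability assigned to the gold label across epochs) and variability (the standard deviation of that probability). It illustrates only those baseline definitions, not the new computation this paper proposes. The function name `training_dynamics` and the input layout are assumptions made for the example.

```python
from statistics import mean, pstdev


def training_dynamics(gold_probs_per_epoch):
    """Compute per-example confidence and variability.

    gold_probs_per_epoch[e][i] is the probability the model assigned to the
    gold label of example i after epoch e (hypothetical input layout).
    """
    n_examples = len(gold_probs_per_epoch[0])
    stats = []
    for i in range(n_examples):
        # Collect this example's gold-label probability at every epoch.
        probs = [epoch[i] for epoch in gold_probs_per_epoch]
        stats.append({
            "confidence": mean(probs),     # mean over epochs
            "variability": pstdev(probs),  # population std over epochs
        })
    return stats


# Toy input: 3 epochs, 2 examples. The first example is learned gradually,
# the second is predicted confidently from the start.
toy_probs = [
    [0.2, 0.9],
    [0.5, 0.9],
    [0.8, 0.9],
]
toy_stats = training_dynamics(toy_probs)
```

In this toy input the first example has high variability and the second has high confidence with no variability; how such statistics are turned into example weights is specific to the paper and is not shown here.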

2022

Dyna-bAbI: unlocking bAbI’s potential with dynamic synthetic benchmarking
Ronen Tamari | Kyle Richardson | Noam Kahlon | Aviad Sar-shalom | Nelson F. Liu | Reut Tsarfaty | Dafna Shahaf
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics

While neural language models often perform surprisingly well on natural language understanding (NLU) tasks, their strengths and limitations remain poorly understood. Controlled synthetic tasks are thus an increasingly important resource for diagnosing model behavior. In this work we focus on story understanding, a core competency for NLU systems. However, the main synthetic resource for story understanding, the bAbI benchmark, lacks a systematic mechanism for controllable task generation. We develop Dyna-bAbI, a dynamic framework providing fine-grained control over task generation in bAbI. We demonstrate our ideas by constructing three new tasks requiring compositional generalization, an important evaluation setting absent from the original benchmark. We tested both special-purpose models developed for bAbI and state-of-the-art pre-trained methods, and found that while both approaches solve the original tasks (99% accuracy), neither approach succeeded in the compositional generalization setting, indicating the limitations of the original training data. We explored ways to augment the original data, and found that though diversifying training data was far more useful than simply increasing dataset size, it was still insufficient for driving robust compositional generalization (with 70% accuracy for complex compositions). Our results underscore the importance of highly controllable task generators for creating robust NLU systems through a virtuous cycle of model and data development.