Shir Ashury-Tahan
Also published as: Shir Ashury Tahan
2026
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness in LLMs
Shir Ashury-Tahan | Yifan Mai | Rajmohan C | Ariel Gera | Yotam Perlitz | Asaf Yehudai | Elron Bandel | Leshem Choshen | Eyal Shnarch | Percy Liang | Michal Shmueli-Scheuer
Proceedings of the First Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (SURGeLLM 2026)
Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. We further find that no single table format consistently yields superior performance. However, evaluating models across multiple formats is essential for a reliable assessment of their capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that reasoning over table tasks remains a significant challenge. The leaderboard, data and code are publicly available.
2024
Data-driven Coreference-based Ontology Building
Shir Ashury Tahan | Amir David Nissan Cohen | Nadav Cohen | Yoram Louzoun | Yoav Goldberg
Findings of the Association for Computational Linguistics: EMNLP 2024
While coreference resolution is traditionally used as a component in individual document understanding, in this work we take a more global view and explore what we can learn about a domain from the set of all document-level coreference relations that are present in a large corpus. We derive coreference chains from a corpus of 30 million biomedical abstracts and construct a graph based on the string phrases within these chains, establishing connections between phrases if they co-occur within the same coreference chain. We then use the graph structure and the betweenness centrality measure to distinguish between edges denoting hierarchy, identity and noise, assign directionality to edges denoting hierarchy, and split nodes (strings) that correspond to multiple distinct concepts. The result is a rich, data-driven ontology over concepts in the biomedical domain, parts of which overlap significantly with human-authored ontologies. We release the coreference chains and resulting ontology under a creative-commons license.
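As an illustration of the graph construction described in this abstract, the following is a minimal sketch, not the authors' implementation. It assumes each coreference chain is given as a list of phrase strings, links two phrases when they share a chain, and computes unnormalized node betweenness with Brandes' algorithm. The function names `build_phrase_graph` and `betweenness` and the toy chains are hypothetical, and the abstract does not say whether node or edge betweenness is used, so node betweenness here is an assumption.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_phrase_graph(chains):
    """Connect two phrases whenever they appear in the same coreference chain."""
    graph = defaultdict(set)
    for chain in chains:
        phrases = sorted(set(chain))
        for phrase in phrases:
            graph[phrase]  # register the node even if the chain has one phrase
        for a, b in combinations(phrases, 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def betweenness(graph):
    """Unnormalized node betweenness (Brandes, unweighted undirected graph)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        order = []                          # nodes in BFS visiting order
        preds = {v: [] for v in graph}      # shortest-path predecessors
        sigma = dict.fromkeys(graph, 0)     # number of shortest paths
        dist = dict.fromkeys(graph, -1)
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):           # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


# Toy chains (made up for illustration only).
chains = [
    ["aspirin", "the drug"],
    ["ibuprofen", "the drug"],
    ["aspirin", "acetylsalicylic acid"],
]
graph = build_phrase_graph(chains)
centrality = betweenness(graph)
```

In this toy graph the generic phrase "the drug" bridges two otherwise unrelated drug names, which is the kind of structural signal a centrality measure can expose. How the paper turns such scores into hierarchy, identity and noise labels is not specified in the abstract and is not modeled here.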
Label-Efficient Model Selection for Text Generation
Shir Ashury Tahan | Ariel Gera | Benjamin Sznajder | Leshem Choshen | Liat Ein-Dor | Eyal Shnarch
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models based on preference annotations. DiffUse reduces the required amount of annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model for selecting between models, prompts and configurations. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations – by up to 75% – while maintaining high evaluation reliability.
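The instance-selection idea in this abstract can be sketched roughly as follows. This is not the DiffUse implementation: the abstract does not name the clustering algorithm, so a small k-means with deterministic farthest-point initialization is swapped in as an assumption, and the function names (`diff_vectors`, `select_instances`) and toy embeddings are hypothetical. The sketch assumes the outputs of two models on the same inputs have already been embedded as equal-length vectors and that `1 <= k <= len(diffs)`.

```python
import math


def diff_vectors(emb_a, emb_b):
    """Per-instance difference between the two models' output embeddings."""
    return [[x - y for x, y in zip(a, b)] for a, b in zip(emb_a, emb_b)]


def _dist(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def select_instances(diffs, k, iters=20):
    """Cluster difference vectors and return one instance index per cluster."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [diffs[0]]
    while len(centers) < k:
        centers.append(max(diffs, key=lambda d: min(_dist(d, c) for c in centers)))
    clusters = []
    for _ in range(iters):
        # Assign every instance to its nearest center.
        clusters = [[] for _ in centers]
        for i, d in enumerate(diffs):
            nearest = min(range(len(centers)), key=lambda j: _dist(d, centers[j]))
            clusters[nearest].append(i)
        # Move each center to the mean of its members (keep it if empty).
        centers = [
            [sum(col) / len(members) for col in zip(*(diffs[i] for i in members))]
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
    # Annotate the instance closest to each cluster center.
    return sorted(
        min(members, key=lambda i: _dist(diffs[i], centers[j]))
        for j, members in enumerate(clusters)
        if members
    )


# Toy embeddings (made up): instances 0-1 differ in one way, 2-3 in another.
emb_a = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.1]]
emb_b = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
selected = select_instances(diff_vectors(emb_a, emb_b), k=2)
```

Here `selected` holds one index from each of the two groups, so an annotator would label two preference pairs instead of four. The iterative stopping rule mentioned in the abstract is not covered by this sketch.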