Rotem Dror


Zero-Shot On-the-Fly Event Schema Induction
Rotem Dror | Haoyu Wang | Dan Roth
Findings of the Association for Computational Linguistics: EACL 2023

What are the events involved in a pandemic outbreak? What steps should be taken when planning a wedding? The answers to these questions can be found by collecting many documents on the complex event of interest, extracting relevant information, and analyzing it. We present a new approach in which large language models are utilized to generate source documents that allow predicting, given a high-level event definition, the specific events, arguments, and relations between them to construct a schema that describes the complex event in its entirety.Using our model, complete schemas on any topic can be generated on-the-fly without any manual data collection, i.e., in a zero-shot manner. Moreover, we develop efficient methods to extract pertinent information from texts and demonstrate in a series of experiments that these schemas are considered to be more complete than human-curated ones in the majority of examined scenarios. Finally, we show that this framework is comparable in performance with previous supervised schema induction methods that rely on collecting real texts and even reaching the best score in the prediction task.


Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics
Daniel Deutsch | Rotem Dror | Dan Roth
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and propose changes to rectify this disconnect. First, we calculate the system score for an automatic metric using the full test set instead of the subset of summaries judged by humans, which is currently standard practice. We demonstrate how this small change leads to more precise estimates of system-level correlations. Second, we propose to calculate correlations only on pairs of systems that are separated by small differences in automatic scores which are commonly observed in practice. This allows us to demonstrate that our best estimate of the correlation of ROUGE to human judgments is near 0 in realistic scenarios. The results from the analyses point to the need to collect more high-quality human judgments and to improve automatic metrics when differences in system scores are small.

RESIN-11: Schema-guided Event Prediction for 11 Newsworthy Scenarios
Xinya Du | Zixuan Zhang | Sha Li | Pengfei Yu | Hongwei Wang | Tuan Lai | Xudong Lin | Ziqi Wang | Iris Liu | Ben Zhou | Haoyang Wen | Manling Li | Darryl Hannan | Jie Lei | Hyounghun Kim | Rotem Dror | Haoyu Wang | Michael Regan | Qi Zeng | Qing Lyu | Charles Yu | Carl Edwards | Xiaomeng Jin | Yizhu Jiao | Ghazaleh Kazeminejad | Zhenhailong Wang | Chris Callison-Burch | Mohit Bansal | Carl Vondrick | Jiawei Han | Dan Roth | Shih-Fu Chang | Martha Palmer | Heng Ji
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

We introduce RESIN-11, a new schema-guided event extraction&prediction framework that can be applied to a large variety of newsworthy scenarios. The framework consists of two parts: (1) an open-domain end-to-end multimedia multilingual information extraction system with weak-supervision and zero-shot learningbased techniques. (2) schema matching and schema-guided event prediction based on our curated schema library. We build a demo website based on our dockerized system and schema library publicly available for installation ( We also include a video demonstrating the system.

On the Limitations of Reference-Free Evaluations of Generated Text
Daniel Deutsch | Rotem Dror | Dan Roth
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

There is significant interest in developing evaluation metrics which accurately estimate the quality of generated text without the aid of a human-written reference text, which can be time consuming and expensive to collect or entirely unavailable in online applications. However, in this work, we demonstrate that these reference-free metrics are inherently biased and limited in their ability to evaluate generated text, and we argue that they should not be used to measure progress on tasks like machine translation or summarization. We show how reference-free metrics are equivalent to using one generation model to evaluate another, which has several limitations: (1) the metrics can be optimized at test time to find the approximate best-possible output, (2) they are inherently biased toward models which are more similar to their own, and (3) they can be biased against higher-quality outputs, including those written by humans. Therefore, we recommend that reference-free metrics should be used as diagnostic tools for analyzing and understanding model behavior instead of measures of how well models perform a task, in which the goal is to achieve as high of a score as possible.


A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods
Daniel Deutsch | Rotem Dror | Dan Roth
Transactions of the Association for Computational Linguistics, Volume 9

Abstract The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, nor whether differences between two metrics’ correlations reflect a true difference or if it is due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high uncertainty in the reliability of automatic metrics. Further, although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do so in some evaluation settings.1


pdf bib
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Shruti Rijhwani | Jiangming Liu | Yizhong Wang | Rotem Dror
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop


Deep Dominance - How to Properly Compare Deep Neural Models
Rotem Dror | Segev Shlomov | Roi Reichart
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Comparing between Deep Neural Network (DNN) models based on their performance on unseen data is crucial for the progress of the NLP field. However, these models have a large number of hyper-parameters and, being non-convex, their convergence point depends on the random values chosen at initialization and during training. Proper DNN comparison hence requires a comparison between their empirical score distributions on unseen data, rather than between single evaluation scores as is standard for more simple, convex models. In this paper, we propose to adapt to this problem a recently proposed test for the Almost Stochastic Dominance relation between two distributions. We define the criteria for a high quality comparison method between DNNs, and show, both theoretically and through analysis of extensive experimental results with leading DNN models for sequence tagging tasks, that the proposed test meets all criteria while previously proposed methods fail to do so. We hope the test we propose here will set a new working practice in the NLP community.


The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing
Rotem Dror | Gili Baumer | Segev Shlomov | Roi Reichart
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Statistical significance testing is a standard statistical tool designed to ensure that experimental results are not coincidental. In this opinion/ theoretical paper we discuss the role of statistical significance testing in Natural Language Processing (NLP) research. We establish the fundamental concepts of significance testing and discuss the specific aspects of NLP tasks, experimental setups and evaluation measures that affect the choice of significance tests in NLP research. Based on this discussion we propose a simple practical protocol for statistical significance test selection in NLP setups and accompany this protocol with a brief survey of the most relevant tests. We then survey recent empirical papers published in ACL and TACL during 2017 and show that while our community assigns great value to experimental results, statistical significance testing is often ignored or misused. We conclude with a brief discussion of open issues that should be properly addressed so that this important tool can be applied. in NLP research in a statistically sound manner.


Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets
Rotem Dror | Gili Baumer | Marina Bogomolov | Roi Reichart
Transactions of the Association for Computational Linguistics, Volume 5

With the ever growing amount of textual data from a large variety of languages, domains, and genres, it has become standard to evaluate NLP algorithms on multiple datasets in order to ensure a consistent performance across heterogeneous setups. However, such multiple comparisons pose significant challenges to traditional statistical analysis methods in NLP and can lead to erroneous conclusions. In this paper we propose a Replicability Analysis framework for a statistically sound analysis of multiple comparisons between algorithms for NLP tasks. We discuss the theoretical advantages of this framework over the current, statistically unjustified, practice in the NLP literature, and demonstrate its empirical value across four applications: multi-domain dependency parsing, multilingual POS tagging, cross-domain sentiment classification and word similarity prediction.


The Structured Weighted Violations Perceptron Algorithm
Rotem Dror | Roi Reichart
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing