Michal Shmueli-Scheuer
Also published as: Michal Shmueli Scheuer
2026
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness in LLMs
Shir Ashury-Tahan | Yifan Mai | Rajmohan C | Ariel Gera | Yotam Perlitz | Asaf Yehudai | Elron Bandel | Leshem Choshen | Eyal Shnarch | Percy Liang | Michal Shmueli-Scheuer
Proceedings of the First Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (SURGeLLM 2026)
Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. We further find that no single table format consistently yields superior performance. However, evaluating models across multiple formats is essential for a reliable assessment of their capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that reasoning over table tasks remains a significant challenge. The leaderboard, data and code are publicly available.
Stop Guessing When to Stop Testing: Efficient Model Evaluation with Just Enough Data
Ofir Arviv | Kristjan Greenewald | Yotam Perlitz | Hadar Mulian | Michal Shmueli-Scheuer | Leshem Choshen
Findings of the Association for Computational Linguistics: ACL 2026
The inherent rigidity of fixed-size benchmarks makes them an inefficient tool for model evaluation. Diverse evaluation objectives, including model ranking, model selection and testing throughout development, demand varying levels of statistical power. The mismatch between fixed sample sizes and these diverse needs results in either excessive computational cost or compromised reliability, a critical concern for model evaluation. To overcome these limitations, we call for adoption of sequential testing in our field. We provide an adaptive evaluation framework that offers a principled way to navigate the trade-off between efficiency and reliability in model evaluation. Our framework combines the established statistical paradigm of sequential testing with stopping criteria tailored to common evaluation needs such as diminishing returns detection and minimum detectable effect size. We demonstrate its ability to adaptively manage the efficiency-reliability trade-off on the Open VLM Leaderboard, including, for example, an 80% reduction in computational cost compared to fixed-size evaluation (with a 2.5-point CI width allowance) while maintaining statistical significance.
A Survey on Evaluation of LLM-based Agents
Asaf Yehudai | Lilach Eden | Alan Li | Guy Uziel | Yilun Zhao | Roy Bar-Haim | Arman Cohan | Michal Shmueli-Scheuer
Findings of the Association for Computational Linguistics: ACL 2026
LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasingly capable agents. We analyze the field of agent evaluation across five perspectives: (1) Core LLM capabilities needed for agentic workflows, such as planning and tool use; (2) Application-specific benchmarks such as web and SWE agents; (3) Evaluation of generalist agents; (4) Analysis of agent benchmarks’ core dimensions; and (5) Evaluation frameworks and tools for agent developers. Our analysis reveals current trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, scalable evaluation methods.
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Asaf Yehudai | Lilach Eden | Michal Shmueli-Scheuer
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation or creating a static taxonomy of agent errors. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.
2025
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Ofir Arviv | Miruna Clinciu | Kaustubh Dhole | Rotem Dror | Sebastian Gehrmann | Eliya Habba | Itay Itzhak | Simon Mille | Yotam Perlitz | Enrico Santus | João Sedoc | Michal Shmueli Scheuer | Gabriel Stanovsky | Oyvind Tafjord
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Eliya Habba | Ofir Arviv | Itay Itzhak | Yotam Perlitz | Elron Bandel | Leshem Choshen | Michal Shmueli-Scheuer | Gabriel Stanovsky
Findings of the Association for Computational Linguistics: ACL 2025
Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more at: https://slab-nlp.github.io/DOVE
Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models
George Kour | Itay Nakash | Michal Shmueli-Scheuer | Ateret Anaby Tavor
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it’s crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs’ subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated the effect of increasing the test-time compute, through reasoning and self-reflection mechanisms, on those metrics. While effective in other tasks, our results show that these mechanisms offer only limited gains in our domain. Furthermore, we reveal that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend. POBS: https://ibm.github.io/POBS
2024
Efficient Benchmarking (of Language Models)
Yotam Perlitz | Elron Bandel | Ariel Gera | Ofir Arviv | Liat Ein-Dor | Eyal Shnarch | Noam Slonim | Michal Shmueli-Scheuer | Leshem Choshen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts has raised little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off. We propose to evaluate the reliability of such decisions by using a new measure, Decision Impact on Reliability (DIoR for short). We find, for example, that a benchmark leader may change by merely removing a low-ranked model from the benchmark, and observe that a correct benchmark ranking can be obtained by considering only a fraction of the evaluation examples. Based on our findings, we outline a set of concrete recommendations for efficient benchmark design and utilization practices. To take a step further, we use our findings to propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by 100x or more.
Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI
Elron Bandel | Yotam Perlitz | Elad Venezian | Roni Friedman | Ofir Arviv | Matan Orbach | Shachar Don-Yehiya | Dafna Sheinwald | Ariel Gera | Leshem Choshen | Michal Shmueli-Scheuer | Yoav Katz
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)
In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. Join the Unitxt community at https://github.com/IBM/unitxt
Navigating the Modern Evaluation Landscape: Considerations in Benchmarks and Frameworks for Large Language Models (LLMs)
Leshem Choshen | Ariel Gera | Yotam Perlitz | Michal Shmueli-Scheuer | Gabriel Stanovsky
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries
General-Purpose Language Models have changed the world of Natural Language Processing, if not the world itself. The evaluation of such versatile models, while supposedly similar to evaluation of generation models before them, in fact presents a host of new evaluation challenges and opportunities. In this Tutorial, we will start from the building blocks of evaluation. The tutorial welcomes people from diverse backgrounds and assumes little familiarity with metrics, datasets, prompts and benchmarks. It will lay the foundations and explain the basics and their importance, while touching on the major points and breakthroughs of the recent era of evaluation. It will also compare traditional evaluation methods – which are still widely used – to newly developed methods. We will contrast new to old approaches, from evaluating on many-task benchmarks rather than on dedicated datasets to efficiency constraints, and from testing stability and prompts on in-context learning to using the models themselves as evaluation metrics. Finally, the tutorial will cover practical issues, ranging from reviewing widely-used benchmarks and prompt banks to efficient evaluation.
2023
Active Learning for Natural Language Generation
Yotam Perlitz | Ariel Gera | Michal Shmueli-Scheuer | Dafna Sheinwald | Noam Slonim | Liat Ein-Dor
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
The field of Natural Language Generation (NLG) suffers from a severe shortage of labeled data due to the extremely expensive and time-consuming process involved in manual annotation. A natural approach for coping with this problem is active learning (AL), a well-known machine learning technique for improving annotation efficiency by selectively choosing the most informative examples to label. However, while AL has been well-researched in the context of text classification, its application to NLG remains largely unexplored. In this paper, we present a first systematic study of active learning for NLG, considering a diverse set of tasks and multiple leading selection strategies, and harnessing a strong instruction-tuned model. Our results indicate that the performance of existing AL strategies is inconsistent, surpassing the baseline of random example selection in some cases but not in others. We highlight some notable differences between the classification and generation scenarios, and analyze the selection behaviors of existing AL strategies. Our findings motivate exploring novel approaches for applying AL to generation tasks.
2022
Overview of the First Shared Task on Multi Perspective Scientific Document Summarization (MuP)
Arman Cohan | Guy Feigenblat | Tirthankar Ghosal | Michal Shmueli-Scheuer
Proceedings of the Third Workshop on Scholarly Document Processing
We present the main findings of the MuP 2022 shared task, the first shared task on multi-perspective scientific document summarization. The task provides a testbed representing challenges for summarization of scientific documents, and facilitates development of better models to leverage summaries generated from multiple perspectives. We received 139 total submissions from 9 teams. We evaluated submissions both by automated metrics (i.e., ROUGE) and by human judgments on faithfulness, coverage, and readability, which provided a more nuanced view of the differences between the systems. While we observe encouraging results from the participating teams, we conclude that there is still significant room left for improving summarization leveraging multiple references. Our dataset is available at https://github.com/allenai/mup.
Overview of the Third Workshop on Scholarly Document Processing
Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Michal Shmueli-Scheuer | Anita de Waard | Lucy Lu Wang
Proceedings of the Third Workshop on Scholarly Document Processing
With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These needs have motivated an increasing focus on computational methods for enhancing search, summarization, and analysis of scholarly documents. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts in this area, and enable shared access to published research, we held the 3rd Workshop on Scholarly Document Processing (SDP) at COLING as a hybrid event (https://sdproc.org/2022/). The SDP workshop consisted of a research track, three invited talks and five Shared Tasks: 1) MSLR22: Multi-Document Summarization for Literature Reviews, 2) DAGPap22: Detecting automatically generated scientific papers, 3) SV-Ident 2022: Survey Variable Identification in Social Science Publications, 4) SKGG: Scholarly Knowledge Graph Generation, 5) MuP 2022: Multi Perspective Scientific Document Summarization. The program was geared towards NLP, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.
Proceedings of the Third Workshop on Scholarly Document Processing
Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Michal Shmueli-Scheuer | Anita de Waard | Lucy Lu Wang
Proceedings of the Third Workshop on Scholarly Document Processing
Quality Controlled Paraphrase Generation
Elron Bandel | Ranit Aharonov | Michal Shmueli-Scheuer | Ilya Shnayderman | Noam Slonim | Liat Ein-Dor
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Paraphrase generation has been widely used in various downstream tasks. Most tasks benefit mainly from high-quality paraphrases, namely those that are semantically similar to, yet linguistically diverse from, the original sentence. Generating high-quality paraphrases is challenging as it becomes increasingly hard to preserve meaning as linguistic diversity increases. Recent works achieve good results by controlling specific aspects of the paraphrase, such as its syntactic tree. However, they do not allow direct control over the quality of the generated paraphrase, and suffer from low flexibility and scalability. Here we propose QCPG, a quality-guided controlled paraphrase generation model that allows directly controlling the quality dimensions. Furthermore, we suggest a method that, given a sentence, identifies points in the quality control space that are expected to yield optimal generated paraphrases. We show that our method is able to generate paraphrases which maintain the original meaning while achieving higher diversity than the uncontrolled baseline. The models, the code, and the data can be found at https://github.com/IBM/quality-controlled-paraphrase-generation.
2021
Overview of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang
Proceedings of the Second Workshop on Scholarly Document Processing
With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These needs have motivated an increasing focus on computational methods for enhancing search, summarization, and analysis of scholarly documents. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts in this area, and enable shared access to published research, we held the 2nd Workshop on Scholarly Document Processing (SDP) at NAACL 2021 as a virtual event (https://sdproc.org/2021/). The SDP workshop consisted of a research track, three invited talks, and three Shared Tasks (LongSumm 2021, SCIVER, and 3C). The program was geared towards the application of NLP, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.
Proceedings of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert M. Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang
Proceedings of the Second Workshop on Scholarly Document Processing
2020
Overview and Insights from the Shared Tasks at Scholarly Document Processing 2020: CL-SciSumm, LaySumm and LongSumm
Muthu Kumar Chandrasekaran | Guy Feigenblat | Eduard Hovy | Abhilasha Ravichander | Michal Shmueli-Scheuer | Anita de Waard
Proceedings of the First Workshop on Scholarly Document Processing
We present the results of three Shared Tasks held at the Scholarly Document Processing Workshop at EMNLP 2020: CL-SciSumm, LaySumm and LongSumm. We report on each of the tasks, which received 18 submissions in total, with some submissions addressing two or three of the tasks. In summary, the quality and quantity of the submissions show that there is ample interest in scholarly document summarization, and the state of the art in this domain is at a midway point between being an impossible task and one that is fully resolved.
Overview of the First Workshop on Scholarly Document Processing (SDP)
Muthu Kumar Chandrasekaran | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Eduard Hovy | Philipp Mayr | Michal Shmueli-Scheuer | Anita de Waard
Proceedings of the First Workshop on Scholarly Document Processing
In addition to keeping up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. To address these challenges, computational work on enhancing search, summarization, and analysis of scholarly documents has flourished. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts and enable shared access to published research, we held the 1st Workshop on Scholarly Document Processing at EMNLP 2020 as a virtual event. The SDP workshop consisted of a research track (including a poster session), two invited talks and three Shared Tasks (CL-SciSumm, LaySumm and LongSumm), geared towards easier access to scientific methods and results. Website: https://ornlcda.github.io/SDProc
Proceedings of the First Workshop on Scholarly Document Processing
Muthu Kumar Chandrasekaran | Anita de Waard | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Eduard Hovy | Petr Knoth | David Konopnicki | Philipp Mayr | Robert M. Patton | Michal Shmueli-Scheuer
Proceedings of the First Workshop on Scholarly Document Processing
2019
Bot2Vec: Learning Representations of Chatbots
Jonathan Herzig | Tommy Sandbank | Michal Shmueli-Scheuer | David Konopnicki
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)
Chatbots (i.e., bots) are becoming widely used in multiple domains, along with supporting bot programming platforms. These platforms are equipped with novel testing tools aimed at improving the quality of individual chatbots. Doing so requires an understanding of what sort of bots are being built (captured by their underlying conversation graphs) and how well they perform (derived through analysis of conversation logs). In this paper, we propose a new model, Bot2Vec, that embeds bots into a compact representation based on their structure and usage logs. Then, we utilize Bot2Vec representations to improve the quality of two bot analysis tasks. Using conversation data and graphs of more than 90 bots, we show that Bot2Vec representations improve detection performance by more than 16% for both tasks.
TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks
Guy Lev | Michal Shmueli-Scheuer | Jonathan Herzig | Achiya Jerbi | David Konopnicki
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Currently, no large-scale training data is available for the task of scientific paper summarization. In this paper, we propose a novel method that automatically generates summaries for scientific papers by utilizing videos of talks at scientific conferences. We hypothesize that such talks constitute a coherent and concise description of the papers’ content, and can form the basis for good summaries. We collected 1716 papers and their corresponding videos, and created a dataset of paper summaries. A model trained on this dataset achieves performance similar to that of models trained on a dataset of manually created summaries. In addition, we validated the quality of our summaries with human experts.
A Summarization System for Scientific Documents
Shai Erera | Michal Shmueli-Scheuer | Guy Feigenblat | Ora Peled Nakash | Odellia Boni | Haggai Roitman | Doron Cohen | Bar Weiner | Yosi Mass | Or Rivlin | Guy Lev | Achiya Jerbi | Jonathan Herzig | Yufang Hou | Charles Jochim | Martin Gleize | Francesca Bonin | Debasis Ganguly | David Konopnicki
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations
We present a novel system providing summaries for Computer Science publications. Through a qualitative user study, we identified the most valuable scenarios for discovery, exploration and understanding of scientific documents. Based on these findings, we built a system that retrieves and summarizes scientific documents for a given information need, either in the form of a free-text query or by choosing categorized values such as scientific tasks, datasets and more. Our system ingested 270,000 papers, and its summarization module aims to generate concise yet detailed summaries. We validated our approach with human experts.
2018
Detecting Egregious Conversations between Customers and Virtual Agents
Tommy Sandbank | Michal Shmueli-Scheuer | Jonathan Herzig | David Konopnicki | John Richards | David Piorkowski
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
Virtual agents are becoming a prominent channel of interaction in customer service. Not all customer interactions are smooth, however, and some can become almost comically bad. In such instances, a human agent might need to step in and salvage the conversation. Detecting bad conversations is important since disappointing customer service may threaten customer loyalty and impact revenue. In this paper, we outline an approach to detecting such egregious conversations, using behavioral cues from the user, patterns in agent responses, and user-agent interaction. Using logs of two commercial systems, we show that using these features improves the detection F1-score by around 20% over using textual features alone. In addition, we show that those features are common across two quite different domains and, arguably, universal.
2017
Neural Response Generation for Customer Service based on Personality Traits
Jonathan Herzig | Michal Shmueli-Scheuer | Tommy Sandbank | David Konopnicki
Proceedings of the 10th International Conference on Natural Language Generation
We present a neural response generation model that generates responses conditioned on a target personality. The model learns high-level features based on the target personality, and uses them to update its hidden state. Our model achieves performance improvements in both perplexity and BLEU scores over a baseline sequence-to-sequence model, and is validated by human judges.
2016
Co-authors
- Guy Feigenblat 10
- Yotam Perlitz 8
- Anita De Waard 7
- Tirthankar Ghosal 7
- David Konopnicki 7
- Leshem Choshen 6
- Arman Cohan 6
- Dayne Freitag 6
- Jonathan Herzig 6
- Philipp Mayr 6
- Ofir Arviv 5
- Elron Bandel 5
- Ariel Gera 5
- Petr Knoth 5
- Drahomira Herrmannova 4
- Kyle Lo 4
- Lucy Lu Wang 4
- Muthu Kumar Chandrasekaran 3
- Liat Ein Dor 3
- Eduard Hovy 3
- Tommy Sandbank 3
- Noam Slonim 3
- Gabriel Stanovsky 3
- Asaf Yehudai 3
- Iz Beltagy 2
- Lilach Eden 2
- Eliya Habba 2
- Keith Hall 2
- Itay Itzhak 2
- Achiya Jerbi 2
- Guy Lev 2
- Robert M. Patton 2
- Dafna Sheinwald 2
- Eyal Shnarch 2
- Kuansan Wang 2
- Ranit Aharonov 1
- Daniel Altman 1
- Shir Ashury-Tahan 1
- Roy Bar-Haim 1
- Odellia Boni 1
- Francesca Bonin 1
- Rajmohan C 1
- Miruna Clinciu 1
- Doron Cohen 1
- Kaustubh Dhole 1
- Shachar Don-Yehiya 1
- Rotem Dror 1
- Shai Erera 1
- Roni Friedman 1
- Debasis Ganguly 1
- Sebastian Gehrmann 1
- Martin Gleize 1
- Kristjan Greenewald 1
- Yufang Hou 1
- Charles Jochim 1
- Yoav Katz 1
- George Kour 1
- Alan Li 1
- Percy Liang 1
- Yifan Mai 1
- Yosi Mass 1
- Simon Mille 1
- Hadar Mulian 1
- Itay Nakash 1
- Matan Orbach 1
- Robert Patton 1
- Ora Peled Nakash 1
- David Piorkowski 1
- Anat Rafaeli 1
- Abhilasha Ravichander 1
- John Richards 1
- Or Rivlin 1
- Haggai Roitman 1
- Enrico Santus 1
- João Sedoc 1
- Ilya Shnayderman 1
- David Spivak 1
- Oyvind Tafjord 1
- Ateret Anaby Tavor 1
- Guy Uziel 1
- Elad Venezian 1
- Bar Weiner 1
- Yilun Zhao 1