Jason Eisner
Also published as: Jason M. Eisner
2026
LLMs Know More About Numbers than They Can Say
Fengting Yuchi | Li Du | Jason Eisner
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: “Which is larger, 5.7 × 10² or 580?” This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the *log-magnitudes* of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a *pair* of numerals encodes their *ranking*, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe’s log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models’ internal magnitude representations can enhance their numerical reasoning capabilities.
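The mixed-notation comparison and the link between log-magnitude error and relative error can be illustrated with a few lines of arithmetic. This is only a sketch: the helper names `log_magnitude` and `relative_error` are hypothetical, and nothing here reproduces the paper's probes or data.

```python
import math

def log_magnitude(mantissa: float, exponent: int = 0) -> float:
    """Return log10 of mantissa × 10^exponent (mantissa must be positive)."""
    return math.log10(mantissa) + exponent

def relative_error(predicted_log10: float, true_value: float) -> float:
    """Relative error of the number recovered from a predicted log10-magnitude."""
    recovered = 10.0 ** predicted_log10
    return abs(recovered - true_value) / true_value

# The comparison from the abstract: 5.7 × 10^2 versus 580.
a = log_magnitude(5.7, 2)
b = log_magnitude(580)
larger = "580" if b > a else "5.7 × 10^2"

# Illustrative only: a log10 prediction that is off by 0.01
# recovers a number that is off by a factor of 10^0.01.
err = relative_error(b + 0.01, 580.0)
```

Comparing in log space makes both notations commensurable, and a small additive error in the log-magnitude becomes a multiplicative error in the recovered number.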
2025
MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools
Nishant Subramani | Jason Eisner | Justin Svegliato | Benjamin Van Durme | Yu Su | Sam Thomson
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logit lens and then computes similarity scores between each layer’s generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at https://github.com/microsoft/mice_for_cats.
2024
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia
Giovanni Monea | Maxime Peyrard | Martin Josifoski | Vishrav Chaudhary | Jason Eisner | Emre Kiciman | Hamid Palangi | Barun Patra | Robert West
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have an impressive ability to draw on novel information supplied in their context. Yet the mechanisms underlying this contextual grounding remain unknown, especially in situations where contextual information contradicts factual knowledge stored in the parameters, which LLMs also excel at recalling. Favoring the contextual information is critical for retrieval-augmented generation methods, which enrich the context with up-to-date information, hoping that grounding can rectify outdated or noisy stored knowledge. We present a novel method to study grounding abilities using Fakepedia, a novel dataset of counterfactual texts constructed to clash with a model’s internal parametric knowledge. We benchmark various LLMs with Fakepedia and conduct a causal mediation analysis of LLM components when answering Fakepedia queries, based on our Masked Grouped Causal Tracing (MGCT) method. Through this analysis, we identify distinct computational patterns between grounded and ungrounded responses. We finally demonstrate that distinguishing grounded from ungrounded responses is achievable through computational analysis alone. Our results, together with existing findings about factual recall mechanisms, provide a coherent narrative of how grounding and factual recall mechanisms interact within LLMs.
LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error
Boshi Wang | Hao Fang | Jason Eisner | Benjamin Van Durme | Yu Su
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM’s ‘imagination’ to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.
LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
Helia Hashemi | Jason Eisner | Corby Rosset | Benjamin Van Durme | Chris Kedzie
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges—indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be _combined_ to _predict_ each human judge’s annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges’ assessment of overall user satisfaction, on a scale of 1–4, with RMS error < 0.5, a 2× improvement over the uncalibrated baseline.
Learning to Retrieve Iteratively for In-Context Learning
Yunmo Chen | Tongfei Chen | Harsh Jhamtani | Patrick Xia | Richard Shin | Jason Eisner | Benjamin Van Durme
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
We introduce iterative retrieval, a novel framework that empowers retrievers to make iterative decisions through policy optimization. Finding an optimal portfolio of retrieved items is a combinatorial optimization problem, generally considered NP-hard. This approach provides a learned approximation to such a solution, meeting specific task requirements under a given family of large language models (LLMs). We propose a training procedure based on reinforcement learning, incorporating feedback from LLMs. We instantiate an iterative retriever for composing in-context learning (ICL) exemplars and apply it to various semantic parsing tasks that demand synthesized programs as outputs. By adding only 4M additional parameters for state encoding, we convert an off-the-shelf dense retriever into a stateful iterative retriever, outperforming previous methods in selecting ICL exemplars on semantic parsing datasets such as CalFlow, TreeDST, and MTOP. Additionally, the trained iterative retriever generalizes across different inference LLMs beyond the one used during training.
Language-to-Code Translation with a Single Labeled Example
Kaj Bostrom | Harsh Jhamtani | Hao Fang | Sam Thomson | Richard Shin | Patrick Xia | Benjamin Van Durme | Jason Eisner | Jacob Andreas
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Tools for translating natural language into code promise natural, open-ended interaction with databases, web APIs, and other software systems. However, this promise is complicated by the diversity and continual development of these systems, each with its own interface and distinct set of features. Building a new language-to-code translator, even starting with a large language model (LM), typically requires annotating a large set of natural language commands with their associated programs. In this paper, we describe ICIP (In-Context Inverse Programming), a method for bootstrapping a language-to-code system using mostly (or entirely) unlabeled programs written using a potentially unfamiliar (but human-readable) library or API. ICIP uses a pre-trained LM to assign candidate natural language descriptions to these programs, then iteratively refines the descriptions to ensure global consistency. Across nine different application domains from the Overnight and Spider benchmarks and text-davinci-003 and CodeLlama-7b-Instruct models, ICIP outperforms a number of prompting baselines. Indeed, in a “nearly unsupervised” setting with only a single annotated program and 100 unlabeled examples, it achieves up to 85% of the performance of a fully supervised system.
Interpreting User Requests in the Context of Natural Language Standing Instructions
Nikita Moghe | Patrick Xia | Jacob Andreas | Jason Eisner | Benjamin Van Durme | Harsh Jhamtani
Findings of the Association for Computational Linguistics: NAACL 2024
Users of natural language interfaces, frequently powered by Large Language Models (LLMs), must often repeat their full set of preferences each time they make a similar request. We describe an approach to LLM-based dialogue modeling in which persistent user constraints and preferences – collectively termed standing instructions – are provided as additional context for such interfaces. For example, when a user states “I’m hungry”, a previously expressed preference for Persian food can be automatically added to the LLM prompt, influencing the search for relevant restaurants. We develop NLSI, a language-to-program dataset consisting of over 2.4K English dialogues spanning 17 domains, in which each dialogue is paired with a user profile (a set of user-specific standing instructions) and corresponding structured representations (a sequence of API calls). A key challenge in NLSI is to identify which subset of the standing instructions is applicable to a given dialogue. NLSI contains diverse phenomena, from simple preferences to interdependent instructions such as triggering a hotel search whenever the user is booking tickets to an event. We conduct experiments on NLSI using prompting with large language models and various retrieval approaches, achieving a maximum of 46% exact match on API prediction. Our results demonstrate the challenges in identifying the relevant standing instructions and their interpretation into API calls.
Do Androids Know They’re Only Dreaming of Electric Sheep?
Sky CH-Wang | Benjamin Van Durme | Jason Eisner | Chris Kedzie
Findings of the Association for Computational Linguistics: ACL 2024
We design probes trained on the internal representations of a transformer language model to predict its hallucinatory behavior on three grounded generation tasks. To train the probes, we annotate for span-level hallucination on both sampled (organic) and manually edited (synthetic) reference outputs. Our probes are narrowly trained and we find that they are sensitive to their training domain: they generalize poorly from one task to another or from synthetic to organic hallucinations. However, on in-domain data, they can reliably detect hallucinations at many transformer layers, achieving 95% of their peak performance as early as layer 4. Here, probing proves accurate for evaluating hallucination, outperforming several contemporary baselines and even surpassing an expert human annotator in response-level detection F1. Similarly, on span-level labeling, probes are on par or better than the expert annotator on two out of three generation tasks. Overall, we find that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available.
When is a Language Process a Language Model?
Li Du | Holden Lee | Jason Eisner | Ryan Cotterell
Findings of the Association for Computational Linguistics: ACL 2024
A language model may be viewed as a 𝛴-valued stochastic process for some alphabet 𝛴. However, in some pathological situations, such a stochastic process may “leak” probability mass onto the set of infinite strings and hence is not equivalent to the conventional view of a language model as a distribution over ordinary (finite) strings. Such ill-behaved language processes are referred to as *non-tight* in the literature. In this work, we study conditions of tightness through the lens of stochastic processes. In particular, by regarding the symbol as marking a stopping time and using results from martingale theory, we give characterizations of tightness that generalize our previous work [(Du et al. 2023)](https://arxiv.org/abs/2212.10502).
Decision-Oriented Dialogue for Human-AI Collaboration
Jessy Lin | Nicholas Tomlin | Jacob Andreas | Jason Eisner
Transactions of the Association for Computational Linguistics, Volume 12
We describe a class of tasks called decision-oriented dialogues, in which AI assistants such as large language models (LMs) must collaborate with one or more humans via natural language to help them make complex decisions. We formalize three domains in which users face everyday decisions: (1) choosing an assignment of reviewers to conference papers, (2) planning a multi-step itinerary in a city, and (3) negotiating travel plans for a group of friends. In each of these settings, AI assistants and users have disparate abilities that they must combine to arrive at the best decision: Assistants can access and process large amounts of information, while users have preferences and constraints external to the system. For each task, we build a dialogue environment where agents receive a reward based on the quality of the final decision they reach. We evaluate LMs in self-play and in collaboration with humans and find that they fall short compared to human assistants, achieving much lower rewards despite engaging in longer dialogues. We highlight a number of challenges models face in decision-oriented dialogues, ranging from goal-directed behavior to reasoning and optimization, and release our environments as a testbed for future work.
2023
Efficient Semiring-Weighted Earley Parsing
Andreas Opedal | Ran Zmigrod | Tim Vieira | Ryan Cotterell | Jason Eisner
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present Earley’s (1970) context-free parsing algorithm as a deduction system, incorporating various known and new speed-ups. In particular, our presentation supports a known worst-case runtime improvement from Earley’s (1970) O(N³|G||R|), which is unworkable for the large grammars that arise in natural language processing, to O(N³|G|), which matches the complexity of CKY on a binarized version of the grammar G. Here N is the length of the sentence, |R| is the number of productions in G, and |G| is the total length of those productions. We also provide a version that achieves runtime of O(N³|M|) with |M| ≤ |G| when the grammar is represented compactly as a single finite-state automaton M (this is partly novel). We carefully treat the generalization to semiring-weighted deduction, preprocessing the grammar like Stolcke (1995) to eliminate the possibility of deduction cycles, and further generalize Stolcke’s method to compute the weights of sentence prefixes. We also provide implementation details for efficient execution, ensuring that on a preprocessed grammar, the semiring-weighted versions of our methods have the same asymptotic runtime and space requirements as the unweighted methods, including sub-cubic runtime on some grammars.
Privacy-Preserving Domain Adaptation of Semantic Parsers
Fatemehsadat Mireshghallah | Yu Su | Tatsunori Hashimoto | Jason Eisner | Richard Shin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Task-oriented dialogue systems often assist users with personal or confidential matters. For this reason, the developers of such a system are generally prohibited from observing actual usage. So how can they know where the system is failing and needs more training data or new functionality? In this work, we study ways in which realistic user utterances can be generated synthetically, to help increase the linguistic and functional coverage of the system, without compromising the privacy of actual users. To this end, we propose a two-stage Differentially Private (DP) generation method which first generates latent semantic parses, and then generates utterances based on the parses. Our proposed approach improves MAUVE by 2.5× and parse tree function-type overlap by 1.3× relative to current approaches for private synthetic data generation, improving both on fluency and semantic coverage. We further validate our approach on a realistic domain adaptation task of adding new functionality from private user data to a semantic parser, and show overall gains of 8.5 percentage points on its accuracy with the new feature.
A Measure-Theoretic Characterization of Tight Language Models
Li Du | Lucas Torroba Hennigen | Tiago Pimentel | Clara Meister | Jason Eisner | Ryan Cotterell
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings. In most cases, the estimated distribution sums to 1 over all finite strings. However, in some pathological cases, probability mass can “leak” onto the set of infinite sequences. In order to characterize the notion of leakage more precisely, this paper offers a measure-theoretic treatment of language modeling. We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense. We also generalize characterizations of tightness proposed in previous works.
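The notion of leakage can be made concrete with a toy process. The sketch below is not from the paper: `termination_mass` and both end-of-string schedules are hypothetical, chosen only so that one process terminates with probability 1 and the other does not.

```python
def termination_mass(eos_prob, max_len: int) -> float:
    """Total probability of strings that end within max_len steps.

    eos_prob(t) is the (hypothetical) probability of emitting the
    end-of-string symbol at step t, given no earlier termination.
    """
    alive = 1.0      # probability that the string is still unfinished
    finished = 0.0   # probability mass on strings that have ended
    for t in range(1, max_len + 1):
        p = eos_prob(t)
        finished += alive * p
        alive *= 1.0 - p
    return finished

# Constant stopping probability: the mass on finite strings approaches 1.
tight = termination_mass(lambda t: 0.1, 10_000)

# Stopping probability 1/(t+1)^2 decays too quickly: the mass on finite
# strings approaches 1/2, so the remainder sits on infinite sequences.
leaky = termination_mass(lambda t: 1.0 / (t + 1) ** 2, 10_000)
```

For the second schedule the survival probability after T steps telescopes to (T+2)/(2(T+1)), which tends to 1/2 rather than 0.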
Contrastive Decoding: Open-ended Text Generation as Optimization
Xiang Lisa Li | Ari Holtzman | Daniel Fried | Percy Liang | Jason Eisner | Tatsunori Hashimoto | Luke Zettlemoyer | Mike Lewis
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Given a language model (LM), maximum probability is a poor decoding objective for open-ended generation, because it produces short and repetitive text. On the other hand, sampling can often produce incoherent text that drifts from the original topics. We propose contrastive decoding (CD), a reliable decoding approach that optimizes a contrastive objective subject to a plausibility constraint. The contrastive objective returns the difference between the likelihood under a large LM (called the expert, e.g. OPT-13B) and a small LM (called the amateur, e.g. OPT-125M), and the constraint ensures that the outputs are plausible. CD is inspired by the fact that the failures of larger LMs (e.g., repetition, incoherence) are even more prevalent in smaller LMs, and that this difference signals which texts should be preferred. CD requires zero additional training, and produces higher quality text than decoding from the larger LM alone. It also works across model scales (OPT-13B and GPT2-1.5B) and significantly outperforms four strong decoding algorithms (e.g., nucleus, top-k) in automatic and human evaluations across Wikipedia, news and story domains.
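A single decoding step under this objective might look like the sketch below. The abstract does not spell out the form of the plausibility constraint, so the threshold used here (keep tokens whose expert probability is at least `alpha` times the expert's top probability) is an assumption, as are the function name and the toy probabilities over a partial vocabulary.

```python
import math

def contrastive_scores(expert_logprobs, amateur_logprobs, alpha=0.1):
    """Score next-token candidates for one decoding step.

    Tokens whose expert probability falls below alpha times the expert's
    top probability are ruled out (assumed form of the plausibility
    constraint); the rest are scored by expert minus amateur log-prob.
    """
    cutoff = math.log(alpha) + max(expert_logprobs.values())
    return {
        tok: (lp - amateur_logprobs[tok]) if lp >= cutoff else -math.inf
        for tok, lp in expert_logprobs.items()
    }

# Toy next-token probabilities (hypothetical values).
expert = {"the": math.log(0.5), "Paris": math.log(0.3), "zzz": math.log(0.01)}
amateur = {"the": math.log(0.6), "Paris": math.log(0.05), "zzz": math.log(0.0001)}

scores = contrastive_scores(expert, amateur)
best = max(scores, key=scores.get)
```

Without the constraint, "zzz" would win because the amateur finds it even less likely than the expert does; the cutoff removes such implausible tokens before the contrast is applied.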
Toward Interactive Dictation
Belinda Z. Li | Jason Eisner | Adam Pauls | Sam Thomson
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Voice dictation is an increasingly important text input modality. Existing systems that allow both dictation and editing-by-voice restrict their command language to flat templates invoked by trigger words. In this work, we study the feasibility of allowing users to interrupt their dictation with spoken editing commands in open-ended natural language. We introduce a new task and dataset, TERTiUS, to experiment with such systems. To support this flexibility in real-time, a system must incrementally segment and classify spans of speech as either dictation or command, and interpret the spans that are commands. We experiment with using large pre-trained language models to predict the edited text, or alternatively, to predict a small text-editing program. Experiments show a natural trade-off between model accuracy and latency: a smaller model achieves 30% end-state accuracy with 1.3 seconds of latency, while a larger model achieves 55% end-state accuracy with 7 seconds of latency.
On the Intersection of Context-Free and Regular Languages
Clemente Pasti | Andreas Opedal | Tiago Pimentel | Tim Vieira | Jason Eisner | Ryan Cotterell
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
The Bar-Hillel construction is a classic result in formal language theory. It shows, by a simple construction, that the intersection of a context-free language and a regular language is itself context-free. In the construction, the regular language is specified by a finite-state automaton. However, neither the original construction (Bar-Hillel et al., 1961) nor its weighted extension (Nederhof and Satta, 2003) can handle finite-state automata with ε-arcs. While it is possible to remove ε-arcs from a finite-state automaton efficiently without modifying the language, such an operation modifies the automaton’s set of paths. We give a construction that generalizes the Bar-Hillel construction to the case where the desired automaton has ε-arcs, and further prove that our generalized construction leads to a grammar that encodes the structure of both the input automaton and grammar while retaining the asymptotic size of the original construction.
Non-Programmers Can Label Programs Indirectly via Active Examples: A Case Study with Text-to-SQL
Ruiqi Zhong | Charlie Snell | Dan Klein | Jason Eisner
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Can non-programmers annotate natural language utterances with complex programs that represent their meaning? We introduce APEL, a framework in which non-programmers select among candidate programs generated by a seed semantic parser (e.g., Codex). Since they cannot understand the candidate programs, we ask them to select indirectly by examining the programs’ input-output examples. For each utterance, APEL actively searches for a simple input on which the candidate programs tend to produce different outputs. It then asks the non-programmers only to choose the appropriate output, thus allowing us to infer which program is correct and could be used to fine-tune the parser. As a first case study, we recruited human non-programmers to use APEL to re-annotate SPIDER, a text-to-SQL dataset. Our approach achieved the same annotation accuracy as the original expert annotators (75%) and exposed many subtle errors in the original annotations.
The Whole Truth and Nothing But the Truth: Faithful and Controllable Dialogue Response Generation with Dataflow Transduction and Constrained Decoding
Hao Fang | Anusha Balakrishnan | Harsh Jhamtani | John Bufe | Jean Crawford | Jayant Krishnamurthy | Adam Pauls | Jason Eisner | Jacob Andreas | Dan Klein
Findings of the Association for Computational Linguistics: ACL 2023
In a real-world dialogue system, generated text must be truthful and informative while remaining fluent and adhering to a prescribed style. Satisfying these constraints simultaneously is difficult for the two predominant paradigms in language generation: neural language modeling and rule-based generation. We describe a hybrid architecture for dialogue response generation that combines the strengths of both paradigms. The first component of this architecture is a rule-based content selection model defined using a new formal framework called dataflow transduction, which uses declarative rules to transduce a dialogue agent’s actions and their results (represented as dataflow graphs) into context-free grammars representing the space of contextually acceptable responses. The second component is a constrained decoding procedure that uses these grammars to constrain the output of a neural language model, which selects fluent utterances. Our experiments show that this system outperforms both rule-based and learned approaches in human evaluations of fluency, relevance, and truthfulness.
Time-and-Space-Efficient Weighted Deduction
Jason Eisner
Transactions of the Association for Computational Linguistics, Volume 11
Many NLP algorithms have been described in terms of deduction systems. Unweighted deduction allows a generic forward-chaining execution strategy. For weighted deduction, however, efficient execution should propagate the weight of each item only after it has converged. This means visiting the items in topologically sorted order (as in dynamic programming). Toposorting is fast on a materialized graph; unfortunately, materializing the graph would take extra space. Is there a generic weighted deduction strategy which, for every acyclic deduction system and every input, uses only a constant factor more time and space than generic unweighted deduction? After reviewing past strategies, we answer this question in the affirmative by combining ideas of Goodman (1999) and Kahn (1962). We also give an extension to cyclic deduction systems, based on Tarjan (1972).
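The baseline that the abstract contrasts against, toposorting a materialized graph and propagating each weight only once it has converged, can be sketched with Kahn's algorithm. This is an illustration of that baseline only, not the paper's space-efficient strategy; the function name, the edge-list representation, and the use of ordinary sum and product on floats are all assumptions.

```python
from collections import defaultdict, deque

def propagate_weights(edges, axioms):
    """Sum-product weights over an acyclic graph in topological order.

    edges:  list of (source, target, weight) triples.
    axioms: dict mapping starting items to their initial weights.
    An item's weight is pushed to its successors only after all of its
    incoming edges have been processed, i.e. once it has converged.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    items = set(axioms)
    for src, dst, w in edges:
        successors[src].append((dst, w))
        indegree[dst] += 1
        items.update((src, dst))

    weight = defaultdict(float, axioms)
    ready = deque(item for item in items if indegree[item] == 0)
    order = []
    while ready:
        item = ready.popleft()
        order.append(item)
        for dst, w in successors[item]:
            weight[dst] += weight[item] * w
            indegree[dst] -= 1
            if indegree[dst] == 0:   # all contributions to dst are in
                ready.append(dst)
    if len(order) != len(items):
        raise ValueError("graph has a cycle")
    return dict(weight), order

edges = [("a", "b", 0.5), ("a", "c", 0.5), ("b", "d", 1.0), ("c", "d", 2.0)]
weights, order = propagate_weights(edges, {"a": 1.0})
```

Here item `d` is visited last, after both `b` and `c` have contributed, so its weight 0.5 × 1.0 + 0.5 × 2.0 is final before anything could depend on it. The `successors` and `indegree` tables are exactly the materialized graph whose extra space the abstract says it wants to avoid.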
2022
Online Semantic Parsing for Latency Reduction in Task-Oriented Dialogue
Jiawei Zhou | Jason Eisner | Michael Newman | Emmanouil Antonios Platanios | Sam Thomson
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Standard conversational semantic parsing maps a complete user utterance into an executable program, after which the program is executed to respond to the user. This could be slow when the program contains expensive function calls. We investigate the opportunity to reduce latency by predicting and executing function calls while the user is still speaking. We introduce the task of online semantic parsing for this purpose, with a formal latency reduction metric inspired by simultaneous machine translation. We propose a general framework with first a learned prefix-to-program prediction module, and then a simple yet effective thresholding heuristic for subprogram selection for early execution. Experiments on the SMCalFlow and TreeDST datasets show our approach achieves large latency reduction with good parsing quality, with a 30%–65% latency reduction depending on function execution time and allowed cost.
Algorithms for Acyclic Weighted Finite-State Automata with Failure Arcs
Anej Svete | Benjamin Dayan | Ryan Cotterell | Tim Vieira | Jason Eisner
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Weighted finite-state automata (WFSAs) are commonly used in NLP. Failure transitions are a useful extension for compactly representing backoffs or interpolation in n-gram models and CRFs, which are special cases of WFSAs. Unfortunately, applying standard algorithms for computing the pathsum requires expanding these compact failure transitions. As a result, naïve computation of the pathsum in acyclic WFSAs with failure transitions runs in O(|Q|^2 |Σ|) (O(|Q||Σ|) for deterministic WFSAs) while the equivalent algorithm in normal WFSAs runs in O(|E|), where E represents the set of transitions, Q the set of states, and Σ the alphabet. In this work, we present more efficient algorithms for computing the pathsum in sparse acyclic WFSAs, i.e., WFSAs with average out symbol fraction s ≪ 1. In those, backward runs in O(s|Q||Σ|). We propose an algorithm for semiring-weighted automata which runs in O(|E| + s|Σ||Q||T_max| log |Σ|), where |T_max| is the size of the largest connected component of failure transitions. Additionally, we propose faster algorithms for two specific cases. For ring-weighted WFSAs we propose an algorithm with complexity O(|E| + s|Σ||Q||π_max|), where |π_max| denotes the longest path length of failure transitions stemming from q and Σ(q) the set of symbols on the outgoing transitions from q. For semiring-weighted WFSAs whose failure transition topology satisfies a condition exemplified by CRFs, we propose an algorithm with complexity O(|E| + s|Σ||Q| log |Σ|).
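For orientation, the O(|E|) figure quoted for normal acyclic WFSAs corresponds to the standard backward algorithm. The sketch below covers only that ordinary case, with no failure transitions, over the real sum-product semiring. The function name, the edge format, and the assumption that states are numbered in topological order are illustrative choices, not taken from the paper.

```python
def backward_pathsum(num_states, edges, initial, final):
    """Pathsum of an ordinary acyclic WFSA (no failure arcs) in O(|Q| + |E|).

    edges:   list of (src, symbol, dst, weight); states are assumed to be
             numbered topologically, so every edge has src < dst.
    initial: dict mapping state -> initial weight.
    final:   dict mapping state -> final weight.
    """
    out = [[] for _ in range(num_states)]
    for src, _symbol, dst, weight in edges:
        if not src < dst:
            raise ValueError("states must be topologically numbered")
        out[src].append((dst, weight))

    # beta[q] = total weight of all paths from q to a final state.
    beta = [final.get(q, 0.0) for q in range(num_states)]
    for q in reversed(range(num_states)):
        for dst, weight in out[q]:
            beta[q] += weight * beta[dst]   # beta[dst] is already complete
    return sum(w * beta[q] for q, w in initial.items())

example_edges = [(0, "a", 1, 0.5), (0, "b", 2, 0.25), (1, "a", 2, 2.0)]
pathsum = backward_pathsum(3, example_edges, initial={0: 1.0}, final={2: 1.0})
```

Each transition is visited once, which is where the linear dependence on |E| comes from. Expanding failure transitions into ordinary ones before running this loop is what produces the larger naïve bounds cited in the abstract.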
When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems
Elias Stengel-Eskin | Emmanouil Antonios Platanios | Adam Pauls | Sam Thomson | Hao Fang | Benjamin Van Durme | Jason Eisner | Yu Su
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
In natural language understanding (NLU) production systems, users’ evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation into this incremental symbol learning scenario. Our analysis reveals a troubling quirk in building broad-coverage NLU systems: as the training dataset grows, performance on a small set of new symbols often decreases. We show that this trend holds for multiple mainstream models on two common NLU tasks: intent recognition and semantic parsing. Rejecting class imbalance as the sole culprit, we reveal that the trend is closely associated with an effect we call source signal dilution, where strong lexical cues for the new symbol become diluted as the training dataset grows. Selectively dropping training examples to prevent dilution often reverses the trend, showing the over-reliance of mainstream neural NLU models on simple lexical cues.
2021
Constrained Language Models Yield Few-Shot Semantic Parsers
Richard Shin | Christopher Lin | Sam Thomson | Charles Chen | Subhro Roy | Emmanouil Antonios Platanios | Adam Pauls | Dan Klein | Jason Eisner | Benjamin Van Durme
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
We explore the use of large pretrained language models as few-shot semantic parsers. The goal in semantic parsing is to generate a structured meaning representation given a natural language input. However, language models are trained to generate natural language. To bridge the gap, we use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation. Our results demonstrate that with only a small amount of data and very little code to convert into English-like representations, our blueprint for rapidly bootstrapping semantic parsers leads to surprisingly effective performance on multiple community tasks, greatly exceeding baseline methods also trained on the same limited data.
Searching for More Efficient Dynamic Programs
Tim Vieira | Ryan Cotterell | Jason Eisner
Findings of the Association for Computational Linguistics: EMNLP 2021
Computational models of human language often involve combinatorial problems. For instance, a probabilistic parser may marginalize over exponentially many trees to make predictions. Algorithms for such problems often employ dynamic programming and are not always unique. Finding one with optimal asymptotic runtime can be unintuitive, time-consuming, and error-prone. Our work aims to automate this laborious process. Given an initial correct declarative program, we search for a sequence of semantics-preserving transformations to improve its running time as much as possible. To this end, we describe a set of program transformations, a simple metric for assessing the efficiency of a transformed program, and a heuristic search procedure to improve this metric. We show that in practice, automated search—like the mental search performed by human programmers—can find substantial improvements to the initial program. Empirically, we show that many speed-ups described in the NLP literature could have been discovered automatically by our system.
Limitations of Autoregressive Models and Their Alternatives
Chu-Cheng Lin | Aaron Jaech | Xin Li | Matthew R. Gormley | Jason Eisner
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Standard autoregressive language models perform only polynomial-time computation to compute the probability of the next symbol. While this is attractive, it means they cannot model distributions whose next-symbol probability is hard to compute. Indeed, they cannot even model them well enough to solve associated easy decision problems for which an engineer might want to consult a language model. These limitations apply no matter how much computation and data are used to train the model, unless the model is given access to oracle parameters that grow superpolynomially in sequence length. Thus, simply training larger autoregressive language models is not a panacea for NLP. Alternatives include energy-based models (which give up efficient sampling) and latent-variable autoregressive models (which give up efficient scoring of a given string). Both are powerful enough to escape the above limitations.
Learning How to Ask: Querying LMs with Mixtures of Soft Prompts
Guanghui Qin | Jason Eisner
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Natural-language prompts have recently been used to coax pretrained language models into performing other AI tasks, using a fill-in-the-blank paradigm (Petroni et al., 2019) or a few-shot extrapolation paradigm (Brown et al., 2020). For example, language models retain factual knowledge from their training corpora that can be extracted by asking them to “fill in the blank” in a sentential prompt. However, where does this prompt come from? We explore the idea of learning prompts by gradient descent—either fine-tuning prompts taken from previous work, or starting from random initialization. Our prompts consist of “soft words,” i.e., continuous vectors that are not necessarily word type embeddings from the language model. Furthermore, for each task, we optimize a mixture of prompts, learning which prompts are most effective and how to ensemble them. Across multiple English LMs and tasks, our approach hugely outperforms previous methods, showing that the implicit factual knowledge in language models was previously underestimated. Moreover, this knowledge is cheap to elicit: random initialization is nearly as good as informed initialization.
2020
A Corpus for Large-Scale Phonetic Typology
Elizabeth Salesky | Eleanor Chodroff | Tiago Pimentel | Matthew Wiesner | Ryan Cotterell | Alan W Black | Jason Eisner
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
A major hurdle in data-driven research on typology is having sufficient data in many languages to draw meaningful conclusions. We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants. Access to such data can greatly facilitate investigation of phonetic typology at a large scale and across many languages. However, it is non-trivial and computationally intensive to obtain such alignments for hundreds of languages, many of which have few to no resources presently available. We describe the methodology to create our corpus, discuss caveats with current methods and their impact on the utility of this data, and illustrate possible research directions through a series of case studies on the 48 highest-quality readings. Our corpus and scripts are publicly available for non-commercial use at https://voxclamantisproject.github.io.
Task-Oriented Dialogue as Dataflow Synthesis
Jacob Andreas | John Bufe | David Burkett | Charles Chen | Josh Clausman | Jean Crawford | Kate Crim | Jordan DeLoach | Leah Dorner | Jason Eisner | Hao Fang | Alan Guo | David Hall | Kristin Hayes | Kellie Hill | Diana Ho | Wendy Iwaszuk | Smriti Jha | Dan Klein | Jayant Krishnamurthy | Theo Lanman | Percy Liang | Christopher H. Lin | Ilya Lintsbakh | Andy McGovern | Aleksandr Nisnevich | Adam Pauls | Dmitrij Petters | Brent Read | Dan Roth | Subhro Roy | Jesse Rusak | Beth Short | Div Slomin | Ben Snyder | Stephon Striplin | Yu Su | Zachary Tellman | Sam Thomson | Andrei Vorobev | Izabela Witoszko | Jason Wolfe | Abby Wray | Yuchen Zhang | Alexander Zotov
Transactions of the Association for Computational Linguistics, Volume 8
We describe an approach to task-oriented dialogue in which dialogue state is represented as a dataflow graph. A dialogue agent maps each user utterance to a program that extends this graph. Programs include metacomputation operators for reference and revision that reuse dataflow fragments from previous turns. Our graph-based state enables the expression and manipulation of complex user intents, and explicit metacomputation makes these intents easier for learned models to predict. We introduce a new dataset, SMCalFlow, featuring complex dialogues about events, weather, places, and people. Experiments show that dataflow graphs and metacomputation substantially improve representability and predictability in these natural dialogues. Additional experiments on the MultiWOZ dataset show that our dataflow representation enables an otherwise off-the-shelf sequence-to-sequence model to match the best existing task-specific state tracking model. The SMCalFlow dataset, code for replicating experiments, and a public leaderboard are available at https://www.microsoft.com/en-us/research/project/dataflow-based-dialogue-semantic-machines.
1997
Co-authors
- Ryan Cotterell 26
- Benjamin Van Durme 12
- Matthew R. Gormley 7
- Noah A. Smith 7
- Sam Thomson 7
- Christo Kirov 6
- Philipp Koehn 6
- Adithya Renduchintala 6
- Tim Vieira 6
- Jacob Andreas 5
- Mark Dredze 5
- Markus Dreyer 5
- Hao Fang 5
- Mans Hulden 5
- Sanjeev Khudanpur 5
- Sabrina J. Mielke 5
- Adam Pauls 5
- David A. Smith 5
- Yu Su 5
- Dingquan Wang 5
- Patrick Xia 5
- Harsh Jhamtani 4
- Dan Klein 4
- Zhifei Li 4
- Richard Shin 4
- John Sylak-Glassman 4
- Ekaterina Vylomova 4
- David Yarowsky 4
- Nicholas Andrews 3
- Li Du 3
- Damianos Karakos 3
- Rebecca Knowles 3
- Xiang Lisa Li 3
- Chu-Cheng Lin 3
- Nanyun Peng 3
- Tiago Pimentel 3
- Emmanouil Antonios Platanios 3
- Brian Roark 3
- Géraldine Walther 3
- John Bufe 2
- Charles Chen, Jr. 2
- Jean Crawford 2
- Manaal Faruqui 2
- Eric Goldlust 2
- Tatsunori B. Hashimoto 2
- Jeffrey Heinz 2
- Chris Kedzie 2
- Jayant Krishnamurthy 2
- Sandra Kübler 2
- Percy Liang 2
- Arya D. McCarthy 2
- Andreas Opedal 2
- Subhro Roy 2
- Giorgio Satta 2
- Jason Smith 2
- Veselin Stoyanov 2
- Roy Tromble 2
- Ziyuan Wang 2
- Omar Zaidan 2
- Anusha Balakrishnan 1
- Breck Baldwin 1
- Timothy Baldwin 1
- Srinivas Bangalore 1
- Alan W. Black 1
- Kaj Bostrom 1
- David Burkett 1
- Sky CH-Wang 1
- Vishrav Chaudhary 1
- Yunmo Chen 1
- Tongfei Chen 1
- Eleanor Chodroff 1
- Josh Clausman 1
- Trevor Cohn 1
- Michael Collins 1
- Kate Crim 1
- Hal Daumé III 1
- Benjamin Dayan 1
- Jordan DeLoach 1
- Leah Dorner 1
- Francis Ferraro 1
- Daniel Fried 1
- Matthias Gallé 1
- Kyle Gorman 1
- Alan Guo 1
- David Hall 1
- Helia Hashemi 1
- Kristin Hayes 1
- He He 1
- Kellie Hill 1
- Diana Ho 1
- Ari Holtzman 1
- Wendy Iwaszuk 1
- Aaron Jaech 1
- Smriti Jha 1
- Martin Josifoski 1
- Lauri Karttunen 1
- Emre Kiciman 1
- Theo Lanman 1
- Holden Lee 1
- Lori Levin 1
- Mike Lewis 1
- Xin Li 1
- Belinda Z. Li 1
- Christopher H. Lin 1
- Christopher Lin 1
- Jessy Lin 1
- Ilya Lintsbakh 1
- Patrick Littell 1
- Andy McGovern 1
- Clara Meister 1
- Fatemehsadat Mireshghallah 1
- Nikita Moghe 1
- Giovanni Monea 1
- Michael Newman 1
- Garrett Nicolai 1
- Aleksandr Nisnevich 1
- Hamid Palangi 1
- Clemente Pasti 1
- Barun Patra 1
- Michael Paul 1
- Dmitrij Petters 1
- Maxime Peyrard 1
- Christine Piatko 1
- Adam Poliak 1
- Carey Priebe 1
- Guanghui Qin 1
- Ariadna Quattoni 1
- Guillaume Rabusseau 1
- Dragomir Radev 1
- Pushpendre Rastogi 1
- Adwait Ratnaparkhi 1
- Brent Read 1
- Jeff Reynar 1
- Joseph Rosenzweig 1
- Corby Rosset 1
- Dan Roth 1
- Jesse Rusak 1
- Elizabeth Salesky 1
- Anoop Sarkar 1
- Hinrich Schütze 1
- Beth Short 1
- Miikka Silfverberg 1
- Div Slomin 1
- Charlie Snell 1
- Ben Snyder 1
- Elias Stengel-Eskin 1
- Stephon Striplin 1
- Nishant Subramani 1
- Justin Svegliato 1
- Anej Svete 1
- Zachary Tellman 1
- Alain Thèriault 1
- Nicholas Tomlin 1
- Lucas Torroba Hennigen 1
- Andrei Vorobev 1
- Boshi Wang 1
- Robert West 1
- Matthew Wiesner 1
- Izabela Witoszko 1
- Jason Wolfe 1
- Abby Wray 1
- Fengting Yuchi 1
- Luke Zettlemoyer 1
- Yuchen Zhang 1
- Ruiqi Zhong 1
- Jiawei Zhou 1
- Hao Zhu 1
- Ran Zmigrod 1
- Alexander Zotov 1
- Katharina von der Wense 1