Mark Johnson

2023

Large Language Models (LLMs) are claimed to be capable of Natural Language Inference (NLI), necessary for applied tasks like question answering and summarization. We present a series of behavioral studies on several LLM families (LLaMA, GPT-3.5, and PaLM) which probe their behavior using controlled experiments. We establish two biases originating from pretraining which predict much of their behavior, and show that these are major sources of hallucination in generative LLMs. First, memorization at the level of sentences: we show that, regardless of the premise, models falsely label NLI test samples as entailing when the hypothesis is attested in training data, and that entities are used as “indices” to access the memorized data. Second, statistical patterns of usage learned at the level of corpora: we further show a similar effect when the premise predicate is less frequent than that of the hypothesis in the training data, a bias following from previous studies. We demonstrate that LLMs perform significantly worse on NLI test samples which do not conform to these biases than those which do, and we offer these as valuable controls for future LLM evaluation.

Smoothing Entailment Graphs with Language Models
Nick McKenna | Tianyi Li | Mark Johnson | Mark Steedman
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

2021

Mention Flags (MF): Constraining Transformer-based Text Generators
Yufei Wang | Ian Wood | Stephen Wan | Mark Dras | Mark Johnson
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper focuses on Seq2Seq (S2S) constrained text generation where the text generator is constrained to mention specific words which are inputs to the encoder in the generated outputs. Pre-trained S2S models or a Copy Mechanism are trained to copy the surface tokens from encoders to decoders, but they cannot guarantee constraint satisfaction. Constrained decoding algorithms always produce hypotheses satisfying all constraints. However, they are computationally expensive and can lower the generated text quality. In this paper, we propose Mention Flags (MF), which traces whether lexical constraints are satisfied in the generated outputs in an S2S decoder. The MF models can be trained to generate tokens in a hypothesis until all constraints are satisfied, guaranteeing high constraint satisfaction. Our experiments on the Common Sense Generation task (CommonGen) (Lin et al., 2020), End2end Restaurant Dialog task (E2ENLG) (Dušek et al., 2020) and Novel Object Captioning task (nocaps) (Agrawal et al., 2019) show that the MF models maintain higher constraint satisfaction and text quality than the baseline models and other constrained decoding algorithms, achieving state-of-the-art performance on all three tasks. These results are achieved with a much lower run-time than constrained decoding algorithms. We also show that the MF models work well in the low-resource setting.

Integrating Lexical Information into Entity Neighbourhood Representations for Relation Prediction
Ian Wood | Mark Johnson | Stephen Wan
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Relation prediction informed from a combination of text corpora and curated knowledge bases, combining knowledge graph completion with relation extraction, is a relatively little studied task. A system that can perform this task has the ability to extend an arbitrary set of relational database tables with information extracted from a document corpus. OpenKi[1] addresses this task through extraction of named entities and predicates via OpenIE tools then learning relation embeddings from the resulting entity-relation graph for relation prediction, outperforming previous approaches. We present an extension of OpenKi that incorporates embeddings of text-based representations of the entities and the relations. We demonstrate that this results in a substantial performance increase over a system without this information.

Blindness to Modality Helps Entailment Graph Mining
Liane Guillou | Sander Bijl de Vroe | Mark Johnson | Mark Steedman
Proceedings of the Second Workshop on Insights from Negative Results in NLP

Understanding linguistic modality is widely seen as important for downstream tasks such as Question Answering and Knowledge Graph Population. Entailment Graph learning might also be expected to benefit from attention to modality. We build Entailment Graphs using a news corpus filtered with a modality parser, and show that stripping modal modifiers from predicates in fact increases performance. This suggests that for some tasks, the pragmatics of modal modification of predicates allows them to contribute as evidence of entailment.

ECOL-R: Encouraging Copying in Novel Object Captioning with Reinforcement Learning
Yufei Wang | Ian Wood | Stephen Wan | Mark Johnson
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Novel Object Captioning is a zero-shot Image Captioning task requiring describing objects not seen in the training captions, but for which information is available from external object detectors. The key challenge is to select and describe all salient detected novel objects in the input images. In this paper, we focus on this challenge and propose the ECOL-R model (Encouraging Copying of Object Labels with Reinforced Learning), a copy-augmented transformer model that is encouraged to accurately describe the novel object labels. This is achieved via a specialised reward function in the SCST reinforcement learning framework (Rennie et al., 2017) that encourages novel object mentions while maintaining the caption quality. We further restrict the SCST training to the images where detected objects are mentioned in reference captions to train the ECOL-R model. We additionally improve our copy mechanism via Abstract Labels, which transfer knowledge from known to novel object types, and a Morphological Selector, which determines the appropriate inflected forms of novel object labels. The resulting model sets new state-of-the-art on the nocaps (Agrawal et al., 2019) and held-out COCO (Hendricks et al., 2016) benchmarks.

Open-Domain Contextual Link Prediction and its Complementarity with Entailment Graphs
Mohammad Javad Hosseini | Shay B. Cohen | Mark Johnson | Mark Steedman
Findings of the Association for Computational Linguistics: EMNLP 2021

An open-domain knowledge graph (KG) has entities as nodes and natural language relations as edges, and is constructed by extracting (subject, relation, object) triples from text. The task of open-domain link prediction is to infer missing relations in the KG. Previous work has used standard link prediction for the task. Since triples are extracted from text, we can ground them in the larger textual context in which they were originally found. However, standard link prediction methods only rely on the KG structure and ignore the textual context that each triple was extracted from. In this paper, we introduce the new task of open-domain contextual link prediction which has access to both the textual context and the KG structure to perform link prediction. We build a dataset for the task and propose a model for it. Our experiments show that context is crucial in predicting missing relations. We also demonstrate the utility of contextual link prediction in discovering context-independent entailments between relations, in the form of entailment graphs (EG), in which the nodes are the relations. The reverse holds too: context-independent EGs assist in predicting relations in context.

Drawing inferences between open-domain natural language predicates is a necessity for true language understanding. There has been much progress in unsupervised learning of entailment graphs for this purpose. We make three contributions: (1) we reinterpret the Distributional Inclusion Hypothesis to model entailment between predicates of different valencies, like DEFEAT(Biden, Trump) entails WIN(Biden); (2) we actualize this theory by learning unsupervised Multivalent Entailment Graphs of open-domain predicates; and (3) we demonstrate the capabilities of these graphs on a novel question answering task. We show that directional entailment is more helpful for inference than non-directional similarity on questions of fine-grained semantics. We also show that drawing on evidence across valencies answers more questions than by using only the same valency evidence.

2020

Incorporating Temporal Information in Entailment Graph Mining
Liane Guillou | Sander Bijl de Vroe | Mohammad Javad Hosseini | Mark Johnson | Mark Steedman
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)

We present a novel method for injecting temporality into entailment graphs to address the problem of spurious entailments, which may arise from similar but temporally distinct events involving the same pair of entities. We focus on the sports domain in which the same pairs of teams play on different occasions, with different outcomes. We present an unsupervised model that aims to learn entailments such as win/lose → play, while avoiding the pitfall of learning non-entailments such as win ↛ lose. We evaluate our model on a manually constructed dataset, showing that incorporating time intervals and applying a temporal window around them are effective strategies.

Improving Disfluency Detection by Self-Training a Self-Attentive Model
Paria Jamshid Lou | Mark Johnson
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Self-attentive neural syntactic parsers using contextualized word embeddings (e.g. ELMo or BERT) currently produce state-of-the-art results in joint parsing and disfluency detection in speech transcripts. Since the contextualized word embeddings are pre-trained on a large amount of unlabeled data, using additional unlabeled data to train a neural model might seem redundant. However, we show that self-training — a semi-supervised technique for incorporating unlabeled data — sets a new state-of-the-art for the self-attentive parser on disfluency detection, demonstrating that self-training provides benefits orthogonal to the pre-trained contextualized word representations. We also show that ensembling self-trained parsers provides further gains for disfluency detection.

Transactions of the Association for Computational Linguistics, Volume 8
Mark Johnson | Brian Roark | Ani Nenkova
Transactions of the Association for Computational Linguistics, Volume 8

End-to-End Speech Recognition and Disfluency Removal
Paria Jamshid Lou | Mark Johnson
Findings of the Association for Computational Linguistics: EMNLP 2020

Disfluency detection is usually an intermediate step between an automatic speech recognition (ASR) system and a downstream task. By contrast, this paper aims to investigate the task of end-to-end speech recognition and disfluency removal. We specifically explore whether it is possible to train an ASR model to directly map disfluent speech into fluent transcripts, without relying on a separate disfluency detection model. We show that end-to-end models do learn to directly generate fluent transcripts; however, their performance is slightly worse than a baseline pipeline approach consisting of an ASR system and a specialized disfluency detection model. We also propose two new metrics for evaluating integrated ASR and disfluency removal models. The findings of this paper can serve as a benchmark for further research on the task of end-to-end speech recognition and disfluency removal in the future.

2019

Duality of Link Prediction and Entailment Graph Induction
Mohammad Javad Hosseini | Shay B. Cohen | Mark Johnson | Mark Steedman
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Link prediction and entailment graph induction are often treated as different problems. In this paper, we show that these two problems are actually complementary. We train a link prediction model on a knowledge graph of assertions extracted from raw text. We propose an entailment score that exploits the new facts discovered by the link prediction model, and then form entailment graphs between relations. We further use the learned entailments to predict improved link prediction scores. Our results show that the two tasks can benefit from each other. The new entailment score outperforms prior state-of-the-art results on a standard entailment dataset and the new link prediction scores show improvements over the raw link prediction scores.

How to Best Use Syntax in Semantic Role Labelling
Yufei Wang | Mark Johnson | Stephen Wan | Yifang Sun | Wei Wang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

There are many different ways in which external information might be used in an NLP task. This paper investigates how external syntactic information can be used most effectively in the Semantic Role Labeling (SRL) task. We evaluate three different ways of encoding syntactic parses and three different ways of injecting them into a state-of-the-art neural ELMo-based SRL sequence labelling model. We show that using a constituency representation as input features improves performance the most, achieving a new state-of-the-art for non-ensemble SRL models on the in-domain CoNLL’05 and CoNLL’12 benchmarks.

This paper describes a spoken-language end-to-end task-oriented dialogue system for small embedded devices such as home appliances. While the current system implements a smart alarm clock with advanced calendar scheduling functionality, the system is designed to make it easy to port to other application domains (e.g., the dialogue component factors out domain-specific execution from domain-general actions such as requesting and updating slot values). The system does not require internet connectivity because all components, including speech recognition, natural language understanding, dialogue management, execution and text-to-speech, run locally on the embedded device (our demo uses a Raspberry Pi). This simplifies deployment, minimizes server costs and, most importantly, eliminates user privacy risks. A demo video for the alarm domain is available at youtu.be/N3IBMGocvHU

Transactions of the Association for Computational Linguistics, Volume 7
Lillian Lee | Mark Johnson | Brian Roark | Ani Nenkova
Transactions of the Association for Computational Linguistics, Volume 7

Neural Constituency Parsing of Speech Transcripts
Paria Jamshid Lou | Yufei Wang | Mark Johnson
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

This paper studies the performance of a neural self-attentive parser on transcribed speech. Speech presents parsing challenges that do not appear in written text, such as the lack of punctuation and the presence of speech disfluencies (including filled pauses, repetitions, corrections, etc.). Disfluencies are especially problematic for conventional syntactic parsers, which typically fail to find any EDITED disfluency nodes at all. This motivated the development of special disfluency detection systems, and special mechanisms added to parsers specifically to handle disfluencies. However, we show here that neural parsers can find EDITED disfluency nodes, and the best neural parsers find them with an accuracy surpassing that of specialized disfluency detection systems, thus making these specialized mechanisms unnecessary. This paper also investigates a modified loss function that puts more weight on EDITED nodes. It also describes tree-transformations that simplify the disfluency detection task by providing alternative encodings of disfluencies and syntactic information.

2018

VnCoreNLP: A Vietnamese Natural Language Processing Toolkit
Thanh Vu | Dat Quoc Nguyen | Dai Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We present an easy-to-use and fast toolkit, namely VnCoreNLP—a Java NLP annotation pipeline for Vietnamese. Our VnCoreNLP supports key natural language processing (NLP) tasks including word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks. We release VnCoreNLP to provide rich linguistic annotations to facilitate research work on Vietnamese NLP. Our VnCoreNLP is open-source and available at: https://github.com/vncorenlp/VnCoreNLP

AMR dependency parsing with a typed semantic algebra
Jonas Groschwitz | Matthias Lindemann | Meaghan Fowlie | Mark Johnson | Alexander Koller
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a semantic parser for Abstract Meaning Representations which learns to parse strings into tree representations of the compositional structure of an AMR graph. This allows us to use standard neural techniques for supertagging and dependency tree parsing, constrained by a linguistically principled type system. We present two approximative decoding algorithms, which achieve state-of-the-art accuracy and outperform strong baselines.

Semantic parsing requires training data that is expensive and slow to collect. We apply active learning to both traditional and “overnight” data collection approaches. We show that it is possible to obtain good training hyperparameters from seed data which is only a small fraction of the full dataset. We show that uncertainty sampling based on least confidence score is competitive in traditional data collection but not applicable for overnight collection. We propose several active learning strategies for overnight data collection and show that different example selection strategies per domain perform best.

Predicting accuracy on large datasets from smaller pilot data
Mark Johnson | Peter Anderson | Mark Dras | Mark Steedman
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Because obtaining training data is often the most difficult part of an NLP or ML project, we develop methods for predicting how much data is required to achieve a desired test accuracy by extrapolating results from models trained on a small pilot training dataset. We model how accuracy varies as a function of training size on subsets of the pilot data, and use that model to predict how much training data would be required to achieve the desired accuracy. We introduce a new performance extrapolation task to evaluate how well different extrapolations predict accuracy on larger training sets. We show that details of hyperparameter optimisation and the extrapolation models can have dramatic effects in a document classification task. We believe this is an important first step in developing methods for estimating the resources required to meet specific engineering performance targets.

A Fast and Accurate Vietnamese Word Segmenter
Dat Quoc Nguyen | Dai Quoc Nguyen | Thanh Vu | Mark Dras | Mark Johnson
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Transactions of the Association for Computational Linguistics, Volume 6
Lillian Lee | Mark Johnson | Kristina Toutanova | Brian Roark
Transactions of the Association for Computational Linguistics, Volume 6

This paper presents a new method for learning typed entailment graphs from text. We extract predicate-argument structures from multiple-source news corpora, and compute local distributional similarity scores to learn entailments between predicates with typed arguments (e.g., person contracted disease). Previous work has used transitivity constraints to improve local decisions, but these constraints are intractable on large graphs. We instead propose a scalable method that learns globally consistent similarity scores based on new soft constraints that consider both the structures across typed entailment graphs and inside each graph. Learning takes only a few hours to run over 100K predicates and our results show large improvements over local similarity scores on two entailment data sets. We further show improvements over paraphrases and entailments from the Paraphrase Database, and prior state-of-the-art entailment graphs. We show that the entailment graphs improve performance in a downstream task.

Disfluency Detection using Auto-Correlational Neural Networks
Paria Jamshid Lou | Peter Anderson | Mark Johnson
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In recent years, the natural language processing community has moved away from task-specific feature engineering, i.e., researchers discovering ad-hoc feature representations for various tasks, in favor of general-purpose methods that learn the input representation by themselves. However, state-of-the-art approaches to disfluency detection in spontaneous speech transcripts currently still depend on an array of hand-crafted features, and other representations derived from the output of pre-existing systems such as language models or dependency parsers. As an alternative, this paper proposes a simple yet effective model for automatic disfluency detection, called an auto-correlational neural network (ACNN). The model uses a convolutional neural network (CNN) and augments it with a new auto-correlation operator at the lowest layer that can capture the kinds of “rough copy” dependencies that are characteristic of repair disfluencies in speech. In experiments, the ACNN model outperforms the baseline CNN on a disfluency detection task with a 5% increase in f-score, which is close to the previous best result on this task.

2017

Idea density for predicting Alzheimer’s disease from transcribed speech
Kairit Sirts | Olivier Piguet | Mark Johnson
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Idea Density (ID) measures the rate at which ideas or elementary predications are expressed in an utterance or in a text. Lower ID is found to be associated with an increased risk of developing Alzheimer’s disease (AD) (Snowdon et al., 1996; Engelman et al., 2010). ID has been used in two different versions: propositional idea density (PID) counts the expressed ideas and can be applied to any text, while semantic idea density (SID) counts pre-defined information content units and is naturally more applicable to normative domains, such as picture description tasks. In this paper, we develop DEPID, a novel dependency-based method for computing PID, and its variant DEPID-R, which makes it possible to exclude repeated ideas, a feature characteristic of AD speech. We conduct the first comparison of automatically extracted PID and SID in the diagnostic classification task on two different AD datasets covering both closed-topic and free-recall domains. While SID performs better on the normative dataset, adding PID leads to a small but significant improvement (+1.7 F-score). On the free-topic dataset, PID performs better than SID as expected (77.6 vs 72.3 in F-score), but adding the features derived from the word embedding clustering underlying the automatic SID increases the results considerably, leading to an F-score of 84.8.

Extending semantic parsing systems to new domains and languages is a highly expensive, time-consuming process, so making effective use of existing resources is critical. In this paper, we describe a transfer learning method using crosslingual word embeddings in a sequence-to-sequence model. On the NLmaps corpus, our approach achieves state-of-the-art accuracy of 85.7% for English. Most importantly, we observed a consistent improvement for German compared with several baseline domain adaptation techniques. As a by-product of this approach, our models that are trained on a combination of English and German utterances perform reasonably well on code-switching utterances which contain a mixture of English and German, even though the training data does not contain any such. As far as we know, this is the first study of code-switching in semantic parsing. We manually constructed the set of code-switching test utterances for the NLmaps corpus and achieve 78.3% accuracy on this dataset.

A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing
Dat Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present a novel neural network model that learns POS tagging and graph-based dependency parsing jointly. Our model uses bidirectional LSTMs to learn feature representations shared for both POS tagging and dependency parsing tasks, thus handling the feature-engineering problem. Our extensive experiments, on 19 languages from the Universal Dependencies project, show that our model outperforms the state-of-the-art neural network-based Stack-propagation model for joint POS tagging and transition-based dependency parsing, resulting in a new state of the art. Our code is open-source and available together with pre-trained models at: https://github.com/datquocnguyen/jPTDP

From Word Segmentation to POS Tagging for Vietnamese
Dat Quoc Nguyen | Thanh Vu | Dai Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2017

Guided Open Vocabulary Image Captioning with Constrained Beam Search
Peter Anderson | Basura Fernando | Mark Johnson | Stephen Gould
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Existing image captioning models do not generalize well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real world applications dealing with images in the wild. We address this problem using a flexible approach that enables existing deep captioning architectures to take advantage of image taggers at test time, without re-training. Our method uses constrained beam search to force the inclusion of selected tag words in the output, and fixed, pretrained word embeddings to facilitate vocabulary expansion to previously unseen tag words. Using this approach we achieve state of the art results for out-of-domain captioning on MSCOCO (and improved results for in-domain captioning). Perhaps surprisingly, our results significantly outperform approaches that incorporate the same tag predictions into the learning algorithm. We also show that we can significantly improve the quality of generated ImageNet captions by leveraging ground-truth labels.

Transactions of the Association for Computational Linguistics, Volume 5
Lillian Lee | Mark Johnson | Kristina Toutanova
Transactions of the Association for Computational Linguistics, Volume 5

A constrained graph algebra for semantic parsing with AMRs
Jonas Groschwitz | Meaghan Fowlie | Mark Johnson | Alexander Koller
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Long papers

Unsupervised Text Segmentation Based on Native Language Characteristics
Shervin Malmasi | Mark Dras | Mark Johnson | Lan Du | Magdalena Wolska
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most work on segmenting text does so on the basis of topic changes, but it can be of interest to segment by other, stylistically expressed characteristics such as change of authorship or native language. We propose a Bayesian unsupervised text segmentation approach to the latter. While baseline models achieve essentially random segmentation on our task, indicating its difficulty, a Bayesian model that incorporates appropriately compact language models and alternating asymmetric priors can achieve scores on the standard metrics around halfway to perfect segmentation.

Disfluency Detection using a Noisy Channel Model and a Deep Neural Language Model
Paria Jamshid Lou | Mark Johnson
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper presents a model for disfluency detection in spontaneous speech transcripts called the LSTM Noisy Channel Model. The model uses a Noisy Channel Model (NCM) to generate n-best candidate disfluency analyses and a Long Short-Term Memory (LSTM) language model to score the underlying fluent sentences of each analysis. The LSTM language model scores, along with other features, are used in a MaxEnt reranker to identify the most plausible analysis. We show that using an LSTM language model in the reranking process of the noisy channel disfluency model improves the state-of-the-art in disfluency detection.

2016

Using Left-corner Parsing to Encode Universal Structural Constraints in Grammar Induction
Hiroshi Noji | Yusuke Miyao | Mark Johnson
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Unsupervised Pre-training With Seq2Seq Reconstruction Loss for Deep Relation Extraction Models
Zhuang Li | Lizhen Qu | Qiongkai Xu | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2016

An empirical study for Vietnamese dependency parsing
Dat Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2016

Transactions of the Association for Computational Linguistics, Volume 4
Lillian Lee | Mark Johnson | Kristina Toutanova
Transactions of the Association for Computational Linguistics, Volume 4

STransE: a novel embedding model of entities and relationships in knowledge bases
Dat Quoc Nguyen | Kairit Sirts | Lizhen Qu | Mark Johnson
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Grammar induction from (lots of) words alone
John K Pate | Mark Johnson
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Grammar induction is the task of learning syntactic structure in a setting where that structure is hidden. Grammar induction from words alone is interesting because it is similar to the problem that a child learning a language faces. Previous work has typically assumed richer but cognitively implausible input, such as POS tag annotated data, which makes that work less relevant to human language acquisition. We show that grammar induction from words alone is in fact feasible when the model is provided with sufficient training data, and present two new streaming or mini-batch algorithms for PCFG inference that can learn from millions of words of training data. We compare the performance of these algorithms to a batch algorithm that learns from less data. The minibatch algorithms outperform the batch algorithm, showing that cheap inference with more data is better than intensive inference with less data. Additionally, we show that the harmonic initialiser, which previous work identified as essential when learning from small POS-tag annotated corpora (Klein and Manning, 2004), is not superior to a uniform initialisation.

Efficient techniques for parsing with tree automata
Jonas Groschwitz | Alexander Koller | Mark Johnson
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Neighborhood Mixture Model for Knowledge Base Completion
Dat Quoc Nguyen | Kairit Sirts | Lizhen Qu | Mark Johnson
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

2015

An Incremental Algorithm for Transition-based CCG Parsing
Bharat Ram Ambati | Tejaswini Deoskar | Mark Johnson | Mark Steedman
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Sign constraints on feature weights improve a joint model of word segmentation and phonology
Mark Johnson | Joe Pater | Robert Staubs | Emmanuel Dupoux
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

An Improved Non-monotonic Transition System for Dependency Parsing
Matthew Honnibal | Mark Johnson
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

A Computationally Efficient Algorithm for Learning Topical Collocation Models
Zhendong Zhao | Lan Du | Benjamin Börschinger | John K Pate | Massimiliano Ciaramita | Mark Steedman | Mark Johnson
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Using Entity Information from a Knowledge Base to Improve Relation Extraction
Lan Du | Anish Kumar | Mark Johnson | Massimiliano Ciaramita
Proceedings of the Australasian Language Technology Association Workshop 2015

Do POS Tags Help to Learn Better Morphological Segmentations?
Kairit Sirts | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2015

More Efficient Topic Modelling Through a Noun Only Approach
Fiona Martin | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2015

Improving Topic Coherence with Latent Feature Word Representations in MAP Estimation for Topic Modeling
Dat Quoc Nguyen | Kairit Sirts | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2015

Improving Topic Models with Latent Feature Word Representations
Dat Quoc Nguyen | Richard Billingsley | Lan Du | Mark Johnson
Transactions of the Association for Computational Linguistics, Volume 3

Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.

2014

Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems
Bogdan Ludusan | Maarten Versteegh | Aren Jansen | Guillaume Gravier | Xuan-Nga Cao | Mark Johnson | Emmanuel Dupoux
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The unsupervised discovery of linguistic terms from either continuous phoneme transcriptions or raw speech has attracted increasing interest in recent years, from both a theoretical and a practical standpoint. Yet there exists no commonly accepted evaluation method for systems performing term discovery. Here, we propose such an evaluation toolbox, drawing ideas from both speech technology and natural language processing. We first transform the speech-based output into a symbolic representation and compute five types of evaluation metrics on this representation: the quality of acoustic matching, the quality of the clusters found, and the quality of the alignment with real words (type, token, and boundary scores). We tested our approach on two term discovery systems taking speech as input, and one using symbolic input. The latter was run using both the gold transcription and a transcription obtained from an automatic speech recognizer, in order to simulate the case when only imperfect symbolic information is available. The results obtained are analysed using the proposed evaluation metrics, and the implications of these metrics are discussed.

The Effect of Dependency Representation Scheme on Syntactic Language Modelling
Sunghwan Kim | John Pate | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2014

Exploring the Role of Stress in Bayesian Word Segmentation using Adaptor Grammars
Benjamin Börschinger | Mark Johnson
Transactions of the Association for Computational Linguistics, Volume 2

Stress has long been established as a major cue in word segmentation for English infants. We show that enabling a current state-of-the-art Bayesian word segmentation model to take advantage of stress cues noticeably improves its performance. We find that the improvements range from 10 to 4%, depending on both the use of phonotactic cues and, to a lesser extent, the amount of evidence available to the learner. We also find that, particularly early on, stress cues are much more useful for our model than phonotactic cues by themselves, consistent with the finding that children do seem to use stress cues before they use phonotactic cues. Finally, we study how the model’s knowledge about stress patterns evolves over time. We find not only that our model correctly acquires the most frequent patterns relatively quickly, but also that the Unique Stress Constraint at the heart of a previously proposed model does not need to be built in and can instead be acquired jointly with word segmentation.

Joint Incremental Disfluency Detection and Dependency Parsing
Matthew Honnibal | Mark Johnson
Transactions of the Association for Computational Linguistics, Volume 2

We present an incremental dependency parsing model that jointly performs disfluency detection. The model handles speech repairs using a novel non-monotonic transition system, and includes several novel classes of features. For comparison, we evaluated two pipeline systems, using state-of-the-art disfluency detectors. The joint model performed better on both tasks, with a parse accuracy of 90.5% and 84.0% accuracy at disfluency detection. The model runs in expected linear time, and processes over 550 tokens a second.

Syllable weight encodes mostly the same information for English word segmentation as dictionary stress
John K Pate | Mark Johnson
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Modelling function words improves unsupervised word segmentation
Mark Johnson | Anne Christophe | Emmanuel Dupoux | Katherine Demuth
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Unsupervised Word Segmentation in Context
Gabriel Synnaeve | Isabelle Dautriche | Benjamin Börschinger | Mark Johnson | Emmanuel Dupoux
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

The effect of non-tightness on Bayesian estimation of PCFGs
Shay B. Cohen | Mark Johnson
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A joint model of word segmentation and phonological variation for English word-final /t/-deletion
Benjamin Börschinger | Mark Johnson | Katherine Demuth
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Topic Segmentation with a Structured Topic Model
Lan Du | Wray Buntine | Mark Johnson
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Modeling Graph Languages with Grammars Extracted via Tree Decompositions
Bevan Keeley Jones | Sharon Goldwater | Mark Johnson
Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing

Why is English so easy to segment?
Abdellah Fourtassi | Benjamin Börschinger | Mark Johnson | Emmanuel Dupoux
Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL)

Grammars and Topic Models
Mark Johnson
Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13)

A Non-Monotonic Arc-Eager Transition System for Dependency Parsing
Matthew Honnibal | Yoav Goldberg | Mark Johnson
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

Parsing entire discourses as very long strings: Capturing topic continuity in grounded language learning
Minh-Thang Luong | Michael C. Frank | Mark Johnson
Transactions of the Association for Computational Linguistics, Volume 1

Grounded language learning, the task of mapping from natural language to a representation of meaning, has attracted more and more interest in recent years. In most work on this topic, however, utterances in a conversation are treated independently and discourse structure information is largely ignored. In the context of language acquisition, this independence assumption discards cues that are important to the learner, e.g., the fact that consecutive utterances are likely to share the same referent (Frank et al., 2013). The current paper describes an approach to the problem of simultaneously modeling grounded language at the sentence and discourse levels. We combine ideas from parsing and grammar induction to produce a parser that can handle long input strings with thousands of tokens, creating parse trees that represent full discourses. By casting grounded language learning as a grammatical inference task, we use our parser to extend the work of Johnson et al. (2012), investigating the importance of discourse continuity in children’s language acquisition and its interaction with social cues. Our model boosts performance in a language acquisition task and yields good discourse segmentations compared with human annotators.

2006

While both spoken and written language processing stand to benefit from parsing, the standard Parseval metrics (Black et al., 1991) and their canonical implementation (Sekine and Collins, 1997) are only useful for text. The Parseval metrics are undefined when the words input to the parser do not match the words in the gold standard parse tree exactly, and word errors are unavoidable with automatic speech recognition (ASR) systems. To fill this gap, we have developed a publicly available tool for scoring parses that implements a variety of metrics which can handle mismatches in words and segmentations, including: alignment-based bracket evaluation, alignment-based dependency evaluation, and a dependency evaluation that does not require alignment. We describe the different metrics, how to use the tool, and the outcome of an extensive set of experiments on the sensitivity.

Reranking and Self-Training for Parser Adaptation
David McClosky | Eugene Charniak | Mark Johnson
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Contextual Dependencies in Unsupervised Word Segmentation
Sharon Goldwater | Thomas L. Griffiths | Mark Johnson
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Effective Self-Training for Parsing
David McClosky | Eugene Charniak | Mark Johnson
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference