Kevin Knight

2021

This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task organised as part of the Conference on Machine Translation (WMT) 2021. In the news task, participants were asked to build machine translation systems for any of 10 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation.

We present MeetDot, a videoconferencing system with live translation captions overlaid on screen. The system aims to facilitate conversation between people who speak different languages, thereby reducing communication barriers between multilingual participants. Currently, our system supports speech and captions in 4 languages and combines automatic speech recognition (ASR) and machine translation (MT) in a cascade. We use the re-translation strategy to translate the streamed speech, resulting in caption flicker. Additionally, our system has very strict latency requirements to have acceptable call quality. We implement several features to enhance user experience and reduce their cognitive load, such as smooth scrolling captions and reducing caption flicker. The modular architecture allows us to integrate different ASR and MT services in our backend. Our system provides an integrated evaluation suite to optimize key intrinsic evaluation metrics such as accuracy, latency and erasure. Finally, we present an innovative cross-lingual word-guessing game as an extrinsic evaluation metric to measure end-to-end system performance. We plan to make our system open-source for research purposes.

Learning Mathematical Properties of Integers
Maria Ryskina | Kevin Knight
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Embedding words in high-dimensional vector spaces has proven valuable in many natural language applications. In this work, we investigate whether similarly-trained embeddings of integers can capture concepts that are useful for mathematical applications. We probe the integer embeddings for mathematical knowledge, apply them to a set of numerical reasoning tasks, and show that by learning the representations from mathematical sequence data, we can substantially improve over number embeddings learned from English text corpora.

2020

The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2020) featured this year six challenge tracks: (i) Simultaneous speech translation, (ii) Video speech translation, (iii) Offline speech translation, (iv) Conversational speech translation, (v) Open domain translation, and (vi) Non-native speech translation. A total of teams participated in at least one of the tracks. This paper introduces each track’s goal, data and evaluation metrics, and reports the results of the received submissions.

Learning to Pronounce Chinese Without a Pronunciation Dictionary
Christopher Chu | Scot Fang | Kevin Knight
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We demonstrate a program that learns to pronounce Chinese text in Mandarin, without a pronunciation dictionary. From non-parallel streams of Chinese characters and Chinese pinyin syllables, it establishes a many-to-many mapping between characters and pronunciations. Using unsupervised methods, the program effectively deciphers writing into speech. Its token-level character-to-syllable accuracy is 89%, which significantly exceeds the 22% accuracy of prior work.

Solving Historical Dictionary Codes with a Neural Language Model
Christopher Chu | Raphael Valenti | Kevin Knight
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We solve difficult word-based substitution codes by constructing a decoding lattice and searching that lattice with a neural language model. We apply our method to a set of enciphered letters exchanged between US Army General James Wilkinson and agents of the Spanish Crown in the late 1700s and early 1800s, obtained from the US Library of Congress. We are able to decipher 75.1% of the cipher-word tokens correctly.

Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Kam-Fai Wong | Kevin Knight | Hua Wu
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Parallel Corpus Filtering via Pre-trained Language Models
Boliang Zhang | Ajay Nagesh | Kevin Knight
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained, but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods. In this paper, we propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models. We measure sentence parallelism by leveraging the multilingual capability of BERT and use the Generative Pre-training (GPT) language model as a domain filter to balance data domains. We evaluate the proposed method on the WMT 2018 Parallel Corpus Filtering shared task, and on our own web-crawled Japanese-Chinese parallel corpus. Our method significantly outperforms baselines and achieves a new state-of-the-art. In an unsupervised setting, our method achieves comparable performance to the top-1 supervised method. We also evaluate on a web-crawled Japanese-Chinese parallel corpus that we make publicly available.

To assist the human review process, we build a novel ReviewRobot to automatically assign a review score and write comments for multiple categories such as novelty and meaningful comparison. A good review needs to be knowledgeable, namely that the comments should be constructive and informative to help improve the paper; and explainable by providing detailed evidence. ReviewRobot achieves these goals via three steps: (1) We perform domain-specific Information Extraction to construct a knowledge graph (KG) from the target paper under review, a related work KG from the papers cited by the target paper, and a background KG from a large collection of previous papers in the domain. (2) By comparing these three KGs, we predict a review score and detailed structured knowledge as evidence for each review category. (3) We carefully select and generalize human review sentences into templates, and apply these templates to transform the review scores and evidence into natural language comments. Experimental results show that our review score predictor reaches 71.4%-100% accuracy. Human assessment by domain experts shows that 41.7%-70.5% of the comments generated by ReviewRobot are valid and constructive, and better than human-written ones 20% of the time. Thus, ReviewRobot can serve as an assistant for paper reviewers, program chairs and authors.

This paper describes the DiDi AI Labs’ submission to the WMT2020 news translation shared task. We participate in the translation direction of Chinese->English. In this direction, we use the Transformer as our baseline model and integrate several techniques for model enhancement, including data filtering, data selection, back-translation, fine-tuning, model ensembling, and re-ranking. As a result, our submission achieves a BLEU score of 36.6 in Chinese->English.

2019

We present a PaperRobot who performs as an automatic research assistant by (1) conducting deep understanding of a large collection of human-written papers in a target domain and constructing comprehensive background knowledge graphs (KGs); (2) creating new ideas by predicting links from the background KGs, by combining graph attention and contextual text attention; (3) incrementally writing some key elements of a new paper based on memory-attention networks: from the input title along with predicted related entities to generate a paper abstract, from the abstract to generate conclusion and future work, and finally from future work to generate a title for a follow-on paper. Turing Tests, where a biomedical domain expert is asked to compare a system output and a human-authored string, show PaperRobot generated abstracts, conclusion and future work sections, and new titles are chosen over human-written ones up to 30%, 24% and 12% of the time, respectively.

Translating Translationese: A Two-Step Approach to Unsupervised Machine Translation
Nima Pourdamghani | Nada Aldarrab | Marjan Ghazvininejad | Kevin Knight | Jonathan May
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Given a rough, word-by-word gloss of a source language sentence, target language natives can uncover the latent, fully-fluent rendering of the translation. In this work we explore this intuition by breaking translation into a two step process: generating a rough gloss by means of a dictionary and then ‘translating’ the resulting pseudo-translation, or ‘Translationese’ into a fully fluent translation. We build our Translationese decoder once from a mish-mash of parallel data that has the target language in common and then can build dictionaries on demand using unsupervised techniques, resulting in rapidly generated unsupervised neural MT systems for many source languages. We apply this process to 14 test languages, obtaining better or comparable translation results on high-resource languages than previously published unsupervised MT studies, and obtaining good quality results for low-resource languages that have never been used in an unsupervised MT scenario.

2018

Towards Controllable Story Generation
Nanyun Peng | Marjan Ghazvininejad | Jonathan May | Kevin Knight
Proceedings of the First Workshop on Storytelling

We present a general framework of analyzing existing story corpora to generate controllable and creative new stories. The proposed framework needs little manual annotation to achieve controllable story generation. It creates a new interface for humans to interact with computers to generate personalized stories. We apply the framework to build recurrent neural network (RNN)-based generation models to control story ending valence and storyline. Experiments show that our methods successfully achieve the control and enhance the coherence of stories through introducing storylines. With additional control factors, the generation model gets lower perplexity, and yields more coherent stories that are faithful to the control factors according to human evaluation.

Creative Language Encoding under Censorship
Heng Ji | Kevin Knight
Proceedings of the First Workshop on Natural Language Processing for Internet Freedom

People often create obfuscated language for online communication to avoid Internet censorship, share sensitive information, express strong sentiment or emotion, plan for secret actions, trade illegal products, or simply hold interesting conversations. In this position paper we systematically categorize human-created obfuscated language on various levels, investigate their basic mechanisms, and give an overview of automated techniques needed to simulate human encoding. These encoders have potential to frustrate and evade, co-evolve with dynamic human or automated decoders, and produce interesting and adoptable code words. We also summarize remaining challenges for future research on the interaction between Natural Language Processing (NLP) and encryption, and leveraging NLP techniques for encoding and decoding.

We aim to automatically generate natural language descriptions about an input structured knowledge base (KB). We build our generation framework based on a pointer network which can copy facts from the input KB, and add two attention mechanisms: (i) slot-aware attention to capture the association between a slot type and its corresponding slot value; and (ii) a new table position self-attention to capture the inter-dependencies among related slots. For evaluation, besides standard metrics including BLEU, METEOR, and ROUGE, we propose a KB reconstruction based metric by extracting a KB from the generation output and comparing it with the input KB. We also create a new data set which includes 106,216 pairs of structured KBs and their corresponding natural language descriptions for two distinct entity types. Experiments show that our approach significantly outperforms state-of-the-art methods. The reconstructed KB achieves 68.8% - 72.6% F-score.

There are few corpora that endeavor to represent the semantic content of entire documents. We present a corpus that accomplishes one way of capturing document level semantics, by annotating coreference and similar phenomena (bridging and implicit roles) on top of gold Abstract Meaning Representations of sentence-level semantics. We present a new corpus of this annotation, with analysis of its quality, alongside a plausible baseline for comparison. It is hoped that this Multi-Sentence AMR corpus (MS-AMR) may become a feasible method for developing rich representations of document meaning, useful for tasks such as information extraction and question answering.

Multi-lingual Common Semantic Space Construction via Cluster-consistent Word Embedding
Lifu Huang | Kyunghyun Cho | Boliang Zhang | Heng Ji | Kevin Knight
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We construct a multilingual common semantic space based on distributional semantics, where words from multiple languages are projected into a shared space via which all available resources and knowledge can be shared across multiple languages. Beyond word alignment, we introduce multiple cluster-level alignments and enforce the word clusters to be consistently distributed across multiple languages. We exploit three signals for clustering: (1) neighbor words in the monolingual word embedding space; (2) character-level information; and (3) linguistic properties (e.g., apposition, locative suffix) derived from linguistic structure knowledge bases available for thousands of languages. We introduce a new cluster-consistent correlational neural network to construct the common semantic space by aligning words as well as clusters. Intrinsic evaluation on monolingual and multilingual QVEC tasks shows our approach achieves significantly higher correlation with linguistic features which are extracted from manually crafted lexical resources than state-of-the-art multi-lingual embedding learning methods do. Using low-resource language name tagging as a case study for extrinsic evaluation, our approach achieves up to 14.6% absolute F-score gain over the state of the art on cross-lingual direct transfer. Our approach is also shown to be robust even when the size of bilingual dictionary is small.

Recurrent Neural Networks as Weighted Language Recognizers
Yining Chen | Sorcha Gilroy | Andreas Maletti | Jonathan May | Kevin Knight
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We investigate the computational complexity of various problems for simple recurrent neural networks (RNNs) as formal models for recognizing weighted languages. We focus on the single-layer, ReLU-activation, rational-weight RNNs with softmax, which are commonly used in natural language processing applications. We show that most problems for such RNNs are undecidable, including consistency, equivalence, minimization, and the determination of the highest-weighted string. However, for consistent RNNs the last problem becomes decidable, although the solution length can surpass all computable bounds. If additionally the string is limited to polynomial length, the problem becomes NP-complete. In summary, this shows that approximations and heuristic algorithms are necessary in practical applications of those RNNs.

Neural Poetry Translation
Marjan Ghazvininejad | Yejin Choi | Kevin Knight
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We present the first neural poetry translation system. Unlike previous works that often fail to produce any translation for fixed rhyme and rhythm patterns, our system always translates a source text to an English poem. Human evaluation of the translations ranks the quality as acceptable 78.2% of the time.

Using Word Vectors to Improve Word Alignments for Low Resource Machine Translation
Nima Pourdamghani | Marjan Ghazvininejad | Kevin Knight
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We present a method for improving word alignments using word similarities. This method is based on encouraging common alignment links between semantically similar words. We use word vectors trained on monolingual data to estimate similarity. Our experiments on translating fifteen languages into English show consistent BLEU score improvements across the languages.

ELISA-EDL: A Cross-lingual Entity Extraction, Linking and Localization System
Boliang Zhang | Ying Lin | Xiaoman Pan | Di Lu | Jonathan May | Kevin Knight | Heng Ji
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We demonstrate ELISA-EDL, a state-of-the-art re-trainable system to extract entity mentions from low-resource languages, link them to external English knowledge bases, and visualize locations related to disaster topics on a world heatmap. We make all of our data sets, resources and system training and testing APIs publicly available for research purpose.

Modeling Naive Psychology of Characters in Simple Commonsense Stories
Hannah Rashkin | Antoine Bosselut | Maarten Sap | Kevin Knight | Yejin Choi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Understanding a narrative requires reading between the lines and reasoning about the unspoken but obvious implications about events and people’s mental states — a capability that is trivial for humans but remarkably hard for machines. To facilitate research addressing this challenge, we introduce a new annotation framework to explain naive psychology of story characters as fully-specified chains of mental states with respect to motivations and emotional reactions. Our work presents a new large-scale dataset with rich low-level annotations and establishes baseline performance on several new tasks, suggesting avenues for future research.

We present a paper abstract writing system based on an attentive neural sequence-to-sequence model that can take a title as input and automatically generate an abstract. We design a novel Writing-editing Network that can attend to both the title and the previously generated abstract drafts and then iteratively revise and polish the abstract. With two series of Turing tests, where the human judges are asked to distinguish the system-generated abstracts from human-written ones, our system passes Turing tests by junior domain experts at a rate up to 30% and by non-expert at a rate up to 80%.

Out-of-the-box Universal Romanization Tool uroman
Ulf Hermjakob | Jonathan May | Kevin Knight
Proceedings of ACL 2018, System Demonstrations

We present uroman, a tool for converting text in myriads of languages and scripts such as Chinese, Arabic and Cyrillic into a common Latin-script representation. The tool relies on Unicode data and other tables, and handles nearly all character sets, including some that are quite obscure such as Tibetan and Tifinagh. uroman converts digital numbers in various scripts to Western Arabic numerals. Romanization enables the application of string-similarity metrics to texts from different scripts without the need and complexity of an intermediate phonetic representation. The tool is freely and publicly available as a Perl script suitable for inclusion in data processing pipelines and as an interactive demo web page.

Translating a Language You Don’t Know In the Chinese Room
Ulf Hermjakob | Jonathan May | Michael Pust | Kevin Knight
Proceedings of ACL 2018, System Demonstrations

In a corruption of John Searle’s famous AI thought experiment, the Chinese Room (Searle, 1980), we twist its original intent by enabling humans to translate text, e.g. from Uyghur to English, even if they don’t have any prior knowledge of the source language. Our enabling tool, which we call the Chinese Room, is equipped with the same resources made available to a machine translation engine. We find that our superior language model and world knowledge allows us to create perfectly fluent and nearly adequate translations, with human expertise required only for the target language. The Chinese Room tool can be used to rapidly create small corpora of parallel data when bilingual translators are not readily available, in particular for low-resource languages.

2017

Biomedical Event Extraction using Abstract Meaning Representation
Sudha Rao | Daniel Marcu | Kevin Knight | Hal Daumé III
BioNLP 2017

We propose a novel, Abstract Meaning Representation (AMR) based approach to identifying molecular events/interactions in biomedical text. Our key contributions are: (1) an empirical validation of our hypothesis that an event is a subgraph of the AMR graph, (2) a neural network-based model that identifies such an event subgraph given an AMR, and (3) a distant supervision based approach to gather additional training data. We evaluate our approach on the 2013 Genia Event Extraction dataset and show promising results.

Cross-lingual Name Tagging and Linking for 282 Languages
Xiaoman Pan | Boliang Zhang | Jonathan May | Joel Nothman | Kevin Knight | Heng Ji
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating “silver-standard” annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and non-Wikipedia data.

Speeding Up Neural Machine Translation Decoding by Shrinking Run-time Vocabulary
Xing Shi | Kevin Knight
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We speed up Neural Machine Translation (NMT) decoding by shrinking run-time target vocabulary. We experiment with two shrinking approaches: Locality Sensitive Hashing (LSH) and word alignments. Using the latter method, we get a 2x overall speed-up over a highly-optimized GPU implementation, without hurting BLEU. On certain low-resource language pairs, the same methods improve BLEU by 0.5 points. We also report a negative result for LSH on GPUs, due to relatively large overhead, though it was successful on CPUs. Compared with Locality Sensitive Hashing (LSH), decoding with word alignments is GPU-friendly, orthogonal to existing speedup methods and more robust across language pairs.

Hafez: an Interactive Poetry Generation System
Marjan Ghazvininejad | Xing Shi | Jay Priyadarshi | Kevin Knight
Proceedings of ACL 2017, System Demonstrations

Deciphering Related Languages
Nima Pourdamghani | Kevin Knight
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We present a method for translating texts between close language pairs. The method does not require parallel data, and it does not require the languages to be written in the same script. We show results for six language pairs: Afrikaans/Dutch, Bosnian/Serbian, Danish/Swedish, Macedonian/Bulgarian, Malaysian/Indonesian, and Polish/Belorussian. We report BLEU scores showing our method to outperform others that do not use parallel data.

Embracing Non-Traditional Linguistic Resources for Low-resource Language Name Tagging
Boliang Zhang | Di Lu | Xiaoman Pan | Ying Lin | Halidanmu Abudukelimu | Heng Ji | Kevin Knight
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Current supervised name tagging approaches are inadequate for most low-resource languages due to the lack of annotated data and actionable linguistic knowledge. All supervised learning methods (including deep neural networks (DNN)) are sensitive to noise and thus they are not quite portable without massive clean annotations. We found that the F-scores of DNN-based name taggers drop rapidly (20%-30%) when we replace clean manual annotations with noisy annotations in the training data. We propose a new solution to incorporate many non-traditional language universal resources that are readily available but rarely explored in the Natural Language Processing (NLP) community, such as the World Atlas of Linguistic Structure, CIA names, PanLex and survival guides. We acquire and encode various types of non-traditional linguistic resources into a DNN name tagger. Experiments on three low-resource languages show that feeding linguistic knowledge can make DNN significantly more robust to noise, achieving 8%-22% absolute F-score gains on name tagging without using any human annotation.

2016

Extracting Structured Scholarly Information from the Machine Translation Literature
Eunsol Choi | Matic Horvat | Jonathan May | Kevin Knight | Daniel Marcu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Understanding the experimental results of a scientific paper is crucial to understanding its contribution and to comparing it with related work. We introduce a structured, queryable representation for experimental results and a baseline system that automatically populates this representation. The representation can answer compositional questions such as: “Which are the best published results reported on the NIST 09 Chinese to English dataset?” and “What are the most important methods for speeding up phrase-based decoding?” Answering such questions usually involves lengthy literature surveys. Current machine reading for academic papers does not usually consider the actual experiments, but mostly focuses on understanding abstracts. We describe annotation work to create an initial ⟨scientific paper; experimental results representation⟩ corpus. The corpus is composed of 67 papers which were manually annotated with a structured representation of experimental results by domain experts. Additionally, we present a baseline algorithm that characterizes the difficulty of the inference task.

Leveraging Entity Linking and Related Language Projection to Improve Name Transliteration
Ying Lin | Xiaoman Pan | Aliya Deri | Heng Ji | Kevin Knight
Proceedings of the Sixth Named Entity Workshop

Obfuscating Gender in Social Media Writing
Sravana Reddy | Kevin Knight
Proceedings of the First Workshop on NLP and Computational Social Science

Unsupervised Neural Hidden Markov Models
Ke M. Tran | Yonatan Bisk | Ashish Vaswani | Daniel Marcu | Kevin Knight
Proceedings of the Workshop on Structured Prediction for NLP

Generating English from Abstract Meaning Representations
Nima Pourdamghani | Kevin Knight | Ulf Hermjakob
Proceedings of the 9th International Natural Language Generation conference

Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Kevin Knight | Ani Nenkova | Owen Rambow
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Multi-Source Neural Translation
Barret Zoph | Kevin Knight
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Name Tagging for Low-resource Incident Languages based on Expectation-driven Learning
Boliang Zhang | Xiaoman Pan | Tianlu Wang | Ashish Vaswani | Heng Ji | Kevin Knight | Daniel Marcu
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies
Barret Zoph | Ashish Vaswani | Jonathan May | Kevin Knight
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Multi-media Approach to Cross-lingual Entity Knowledge Transfer
Di Lu | Xiaoman Pan | Nima Pourdamghani | Shih-Fu Chang | Heng Ji | Kevin Knight
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Grapheme-to-Phoneme Models for (Almost) Any Language
Aliya Deri | Kevin Knight
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Generating Topical Poetry
Marjan Ghazvininejad | Xing Shi | Yejin Choi | Kevin Knight
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Does String-Based Neural MT Learn Source Syntax?
Xing Shi | Inkit Padhi | Kevin Knight
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Transfer Learning for Low-Resource Neural Machine Translation
Barret Zoph | Deniz Yuret | Jonathan May | Kevin Knight
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Why Neural Translations are the Right Length
Xing Shi | Kevin Knight | Deniz Yuret
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

How Much Information Does a Human Translator Add to the Original?
Barret Zoph | Marjan Ghazvininejad | Kevin Knight
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Parsing English into Abstract Meaning Representation Using Syntax-Based Machine Translation
Michael Pust | Ulf Hermjakob | Kevin Knight | Daniel Marcu | Jonathan May
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Unifying Bayesian Inference and Vector Space Models for Improved Decipherment
Qing Dou | Ashish Vaswani | Kevin Knight | Chris Dyer
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

How to Make a Frenemy: Multitape FSTs for Portmanteau Generation
Aliya Deri | Kevin Knight
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Unsupervised Entity Linking with Abstract Meaning Representation
Xiaoman Pan | Taylor Cassidy | Ulf Hermjakob | Heng Ji | Kevin Knight
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
How to Memorize a Random 60-Bit String
Marjan Ghazvininejad | Kevin Knight
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Invited Talk: How Much Information Does a Human Translator Add to the Original?
Kevin Knight
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2014

pdf abs
Mapping Between English Strings and Reentrant Semantic Graphs
Fabienne Braune | Daniel Bauer | Kevin Knight
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We investigate formalisms for capturing the relation between semantic graphs and English strings. Semantic graph corpora have spurred recent interest in graph transduction formalisms, but it is not yet clear whether such formalisms are a good fit for natural language data, in particular for describing how semantic reentrancies correspond to English pronouns, zero pronouns, reflexives, passives, nominalizations, etc. We introduce a data set that focuses on these problems, we build grammars to capture the graph/string relation in this data, and we evaluate those grammars for conciseness and accuracy.

pdf
How to Speak a Language without Knowing It
Xing Shi | Kevin Knight | Heng Ji
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Aligning context-based statistical models of language with brain activity during reading
Leila Wehbe | Ashish Vaswani | Kevin Knight | Tom Mitchell
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf
Aligning English Strings with Abstract Meaning Representation Graphs
Nima Pourdamghani | Yang Gao | Ulf Hermjakob | Kevin Knight
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf
Beyond Parallel Data: Joint Word Alignment and Decipherment Improves Machine Translation
Qing Dou | Ashish Vaswani | Kevin Knight
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf
Cipher Type Detection
Malte Nuhn | Kevin Knight
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Source languages with complex word-formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweight technique for dealing with this fragmentation, based on bilingual data, and we also present a combination of linguistic and statistical techniques for dealing with rare and unknown words. Taking these techniques together, we demonstrate +1.3 and +1.6 BLEU increases on top of strong baselines for Arabic-English machine translation.

pdf bib abs
Using Bilingual Chinese-English Word Alignments to Resolve PP-attachment Ambiguity in English
Victoria Fossum | Kevin Knight
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

Errors in English parse trees impact the quality of syntax-based MT systems trained using those parses. Frequent sources of error for English parsers include PP-attachment ambiguity, NP-bracketing ambiguity, and coordination ambiguity. Not all ambiguities are preserved across languages. We examine a common type of ambiguity in English that is not preserved in Chinese: given a sequence “VP NP PP”, should the PP be attached to the main verb, or to the object noun phrase? We present a discriminative method for exploiting bilingual Chinese-English word alignments to resolve this ambiguity in English. On a held-out test set of Chinese-English parallel sentences, our method achieves 86.3% accuracy on this PP-attachment disambiguation task, an improvement of 4% over the accuracy of the baseline Collins parser (82.3%).

pdf
Name Translation in Statistical Machine Translation - Learning When to Transliterate
Ulf Hermjakob | Kevin Knight | Hal Daumé III
Proceedings of ACL-08: HLT

pdf
Using Syntax to Improve Word Alignment Precision for Syntax-Based Machine Translation
Victoria Fossum | Kevin Knight | Steven Abney
Proceedings of the Third Workshop on Statistical Machine Translation

2007

pdf bib
Statistical machine translation
Kevin Knight | Philipp Koehn
Proceedings of Machine Translation Summit XI: Tutorials

pdf bib
Automatic language translation generation help needs badly
Kevin Knight
Proceedings of the Workshop on Using corpora for natural language generation

pdf
Syntactic Re-Alignment Models for Machine Translation
Jonathan May | Kevin Knight
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf
Binarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy
Wei Wang | Kevin Knight | Daniel Marcu
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf
What Can Syntax-Based MT Learn from Phrase-Based MT?
Steve DeNeefe | Kevin Knight | Wei Wang | Daniel Marcu
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib
Capitalizing Machine Translation
Wei Wang | Kevin Knight | Daniel Marcu
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf
Relabeling Syntax Trees to Improve Syntax-Based Machine Translation Quality
Bryant Huang | Kevin Knight
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf
Synchronous Binarization for Machine Translation
Hao Zhang | Liang Huang | Daniel Gildea | Kevin Knight
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf
A Better N-Best List: Practical Determinization of Weighted Finite Tree Automata
Jonathan May | Kevin Knight
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf
Scalable Inference and Training of Context-Rich Syntactic Translation Models
Michel Galley | Jonathan Graehl | Kevin Knight | Daniel Marcu | Steve DeNeefe | Wei Wang | Ignacio Thayer
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf
Unsupervised Analysis for Decipherment Problems
Kevin Knight | Anish Nair | Nishit Rathod | Kenji Yamada
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf
SPMT: Statistical Machine Translation with Syntactified Target Language Phrases
Daniel Marcu | Wei Wang | Abdessamad Echihabi | Kevin Knight
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

pdf bib
A Syntax-Directed Translator with Extended Domain of Locality
Liang Huang | Kevin Knight | Aravind Joshi
Proceedings of the Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing

pdf abs
Statistical Syntax-Directed Translation with Extended Domain of Locality
Liang Huang | Kevin Knight | Aravind Joshi
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

In syntax-directed translation, the source-language input is first parsed into a parse-tree, which is then recursively converted into a string in the target-language. We model this conversion by an extended tree-to-string transducer that has multi-level trees on the source-side, which gives our system more expressive power and flexibility. We also define a direct probability model and use a linear-time dynamic programming algorithm to search for the best derivation. The model is then extended to the general log-linear framework in order to incorporate other features like n-gram language models. We devise a simple-yet-effective algorithm to generate non-duplicate k-best translations for n-gram rescoring. Preliminary experiments on English-to-Chinese translation show a significant improvement in terms of translation quality compared to a state-of-the-art phrase-based system.

2005

pdf bib
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)
Kevin Knight | Hwee Tou Ng | Kemal Oflazer
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

pdf
Interactively Exploring a Machine Translation Model
Steve DeNeefe | Kevin Knight | Hayward H. Chan
Proceedings of the ACL Interactive Poster and Demonstration Sessions

pdf
ISI’s 2005 Statistical Machine Translation Entries
Steve DeNeefe | Kevin Knight
Proceedings of the Second International Workshop on Spoken Language Translation

2004

pdf
Training Tree Transducers
Jonathan Graehl | Kevin Knight
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

pdf
What’s in a translation rule?
Michel Galley | Mark Hopkins | Kevin Knight | Daniel Marcu
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

pdf bib
Introduction to statistical machine translation
Philipp Koehn | Kevin Knight
Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Tutorial Descriptions

pdf
Language Weaver Arabic->English MT
Daniel Marcu | Alex Fraser | William Wong | Kevin Knight
Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages

2003

pdf abs
Syntax-based language models for statistical machine translation
Eugene Charniak | Kevin Knight | Kenji Yamada
Proceedings of Machine Translation Summit IX: Papers

We present a syntax-based language model for use in noisy-channel machine translation. In particular, a language model based upon that described in (Cha01) is combined with the syntax-based translation model described in (YK01). The resulting system was used to translate 347 sentences from Chinese to English and compared with the results of an IBM-model-4-based system, as well as that of (YK02), all trained on the same data. The translations were sorted into four groups: good/bad syntax crossed with good/bad meaning. While the total number of translations that preserved meaning was the same for (YK02) and the syntax-based system (and both higher than the IBM-model-4-based system), the syntax-based system had 45% more translations that also had good syntax than did (YK02) (and approximately 70% more than IBM Model 4). The number of translations that did not preserve meaning, but at least had good grammar, also increased, though to less avail.

pdf bib abs
Language Weaver: the next generation of machine translation
Bryce Benjamin | Laurie Gerber | Kevin Knight | Daniel Marcu
Proceedings of Machine Translation Summit IX: System Presentations

We introduce a new generation of commercial translation software, based primarily on statistical learning and statistical language models.

pdf bib abs
Teaching statistical machine translation
Kevin Knight
Workshop on Teaching Translation Technologies and Tools

This paper describes some resources for introducing concepts of statistical machine translation. Students using these resources are not required to have any particular background in computational linguistics or mathematics.

pdf
Empirical Methods for Compound Splitting
Philipp Koehn | Kevin Knight
10th Conference of the European Chapter of the Association for Computational Linguistics

pdf
Feature-Rich Statistical Translation of Noun Phrases
Philipp Koehn | Kevin Knight
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

pdf
Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences
Bo Pang | Kevin Knight | Daniel Marcu
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

pdf
Cognates Can Improve Statistical Translation Models
Grzegorz Kondrak | Daniel Marcu | Kevin Knight
Companion Volume of the Proceedings of HLT-NAACL 2003 - Short Papers

pdf
What’s New in Statistical Machine Translation
Kevin Knight | Philipp Koehn
Companion Volume of the Proceedings of HLT-NAACL 2003 - Tutorial Abstracts

2002

pdf
A Decoder for Syntax-based Statistical MT
Kenji Yamada | Kevin Knight
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

pdf
Translating Named Entities Using Monolingual and Bilingual Resources
Yaser Al-Onaizan | Kevin Knight
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

bib
Statistical machine translation
Kevin Knight
Proceedings of the 9th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Tutorials

pdf
Machine Transliteration of Names in Arabic Texts
Yaser Al-Onaizan | Kevin Knight
Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages

pdf bib
Learning a Translation Lexicon from Monolingual Corpora
Philipp Koehn | Kevin Knight
Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition

pdf bib
The Importance of Lexicalized Syntax Models for Natural Language Generation Tasks
Hal Daume III | Kevin Knight | Irene Langkilde-Geary | Daniel Marcu | Kenji Yamada
Proceedings of the International Natural Language Generation Conference

pdf abs
Using a large monolingual corpus to improve translation accuracy
Radu Soricut | Kevin Knight | Daniel Marcu
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

The existence of a phrase in a large monolingual corpus is very useful information, and so is its frequency. We introduce an alternative approach to automatic translation of phrases/sentences that operationalizes this observation. We use a statistical machine translation system to produce alternative translations and a large monolingual corpus to (re)rank these translations. Our results show that this combination yields better translations, especially when translating out-of-domain phrases/sentences. Our approach can also be used to automatically construct parallel corpora from monolingual resources.

pdf abs
Translation by the numbers: Language Weaver
Bryce Benjamin | Kevin Knight | Daniel Marcu
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: System Descriptions

Pre-market prototype - to be available commercially in the second or third quarter of 2003.

2001

pdf
Fast Decoding and Optimal Decoding for Machine Translation
Ulrich Germann | Michael Jahr | Kevin Knight | Daniel Marcu | Kenji Yamada
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

pdf
A Syntax-based Statistical Translation Model
Kenji Yamada | Kevin Knight
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

pdf
Knowledge Sources for Word-Level Translation Models
Philipp Koehn | Kevin Knight
Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing

2000

Statistical machine translation
Kevin Knight
Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas: Tutorial Descriptions

1999

pdf
Decoding complexity in word-replacement translation models
Kevin Knight
Computational Linguistics, Volume 25, Number 4, December 1999

pdf
A Computational Approach to Deciphering Unknown Scripts
Kevin Knight | Kenji Yamada
Unsupervised Learning in Natural Language Processing

1998

pdf
Generation that Exploits Corpus-Based Statistical Knowledge
Irene Langkilde | Kevin Knight
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

pdf
Generation that Exploits Corpus-Based Statistical Knowledge
Irene Langkilde | Kevin Knight
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

pdf
Machine Transliteration
Kevin Knight | Jonathan Graehl
Computational Linguistics, Volume 24, Number 4, December 1998

pdf
Translating Names and Technical Terms in Arabic Text
Bonnie Glover | Kevin Knight
Computational Approaches to Semitic Languages

pdf
The Practical Value of N-Grams Is in Generation
Irene Langkilde | Kevin Knight
Natural Language Generation

pdf abs
Translation with finite-state devices
Kevin Knight | Yaser Al-Onaizan
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Technical Papers

Statistical models have recently been applied to machine translation with interesting results. Algorithms for processing these models have not received wide circulation, however. By contrast, general finite-state transduction algorithms have been applied in a variety of tasks. This paper gives a finite-state reconstruction of statistical translation and demonstrates the use of standard tools to compute statistically likely translations. Ours is the first translation algorithm for “fertility/permutation” statistical models to be described in replicable detail.

1997

pdf
Machine Transliteration
Kevin Knight | Jonathan Graehl
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics

1996

1995

pdf
Two-Level, Many-Paths Generation
Kevin Knight | Vasileios Hatzivassiloglou
33rd Annual Meeting of the Association for Computational Linguistics

1994

1993

pdf
Building a Large Ontology for Machine Translation
Kevin Knight
Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993

1991

We describe an interlingua-based approach to machine translation, in which a DRS representation of the source text is used as the interlingua representation. A target DRS is then created and used to construct the target text. We describe several advantages of this level of representation. We also argue that problems of translation mismatch and divergence should properly be viewed not as translation problems per se but rather as generation problems, although the source text can be used to guide the target generator. The system we have built relies exclusively on monolingual linguistic descriptions that are also, for the most part, bi-directional.