Jaime G. Carbonell

CMU

Also published as: Jaime Carbonell, Jaime G. Carbonell Jr

Other people with similar names: Jaime R. Carbonell (BBN; d. 1973)

2021

pdf abs
StructSum: Summarization via Structured Representations
Vidhisha Balachandran | Artidoro Pagnoni | Jay Yoon Lee | Dheeraj Rajagopal | Jaime Carbonell | Yulia Tsvetkov
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Abstractive text summarization aims at compressing the information of a long source document into a rephrased, condensed summary. Despite advances in modeling techniques, abstractive summarization models still suffer from several key challenges: (i) layout bias: they overfit to the style of training corpora; (ii) limited abstractiveness: they are optimized to copying n-grams from the source rather than generating novel abstractive summaries; (iii) lack of transparency: they are not interpretable. In this work, we propose a framework based on document-level structure induction for summarization to address these challenges. To this end, we propose incorporating latent and explicit dependencies across sentences in the source document into end-to-end single-document summarization models. Our framework complements standard encoder-decoder summarization models by augmenting them with rich structure-aware document representations based on implicitly learned (latent) structures and externally-derived linguistic (explicit) structures. We show that our summarization framework, trained on the CNN/DM dataset, improves the coverage of content in the source documents, generates more abstractive summaries by generating more novel n-grams, and incorporates interpretable sentence-level structures, while performing on par with standard baselines.

2020

pdf abs
Soft Gazetteers for Low-Resource Named Entity Recognition
Shruti Rijhwani | Shuyan Zhou | Graham Neubig | Jaime Carbonell
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Traditional named entity recognition models use gazetteers (lists of entities) as features to improve performance. Although modern neural network models do not require such hand-crafted features for strong performance, recent work has demonstrated their utility for named entity recognition on English data. However, designing such features for low-resource languages is challenging, because exhaustive entity gazetteers do not exist in these languages. To address this problem, we propose a method of “soft gazetteers” that incorporates ubiquitously available information from English knowledge bases, such as Wikipedia, into neural named entity recognition models through cross-lingual entity linking. Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.

pdf abs
Efficient Meta Lifelong-Learning with Limited Memory
Zirui Wang | Sanket Vaibhav Mehta | Barnabas Poczos | Jaime Carbonell
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Current natural language processing models work well on a single task, yet they often fail to continuously learn new tasks without forgetting previous ones as they are re-trained throughout their lifetime, a challenge known as lifelong learning. State-of-the-art lifelong language learning methods store past examples in episodic memory and replay them at both training and inference time. However, as we show later in our experiments, there are three significant impediments: (1) needing unrealistically large memory module to achieve good performance, (2) suffering from negative transfer, (3) requiring multiple local adaptation steps for each test example that significantly slows down the inference speed. In this paper, we identify three common principles of lifelong learning methods and propose an efficient meta-lifelong framework that combines them in a synergistic fashion. To achieve sample efficiency, our method trains the model in a manner that it learns a better initialization for local adaptation. Extensive experiments on text classification and question answering benchmarks demonstrate the effectiveness of our framework by achieving state-of-the-art performance using merely 1% memory size and narrowing the gap with multi-task learning. We further show that our method alleviates both catastrophic forgetting and negative transfer at the same time.

pdf abs
Improving Candidate Generation for Low-resource Cross-lingual Entity Linking
Shuyan Zhou | Shruti Rijhwani | John Wieting | Jaime Carbonell | Graham Neubig
Transactions of the Association for Computational Linguistics, Volume 8

Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in the low-resource languages by utilizing resources in closely related languages, but the performance still lags far behind their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple, but effective: We experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in Top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.1

2019

pdf abs
CMU-01 at the SIGMORPHON 2019 Shared Task on Crosslinguality and Context in Morphology
Aditi Chaudhary | Elizabeth Salesky | Gayatri Bhat | David R. Mortensen | Jaime Carbonell | Yulia Tsvetkov
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper presents the submission by the CMU-01 team to the SIGMORPHON 2019 task 2 of Morphological Analysis and Lemmatization in Context. This task requires us to produce the lemma and morpho-syntactic description of each token in a sequence, for 107 treebanks. We approach this task with a hierarchical neural conditional random field (CRF) model which predicts each coarse-grained feature (eg. POS, Case, etc.) independently. However, most treebanks are under-resourced, thus making it challenging to train deep neural models for them. Hence, we propose a multi-lingual transfer training regime where we transfer from multiple related languages that share similar typology.

pdf abs
A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers
Aditi Chaudhary | Jiateng Xie | Zaid Sheikh | Graham Neubig | Jaime Carbonell
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Most state-of-the-art models for named entity recognition (NER) rely on the availability of large amounts of labeled data, making them challenging to extend to new, lower-resourced languages. However, there are now many proposed solutions to this problem involving either cross-lingual transfer learning, which learns from other highly resourced languages, or active learning, which efficiently selects effective training data based on model predictions. In this paper, we ask the question: given this recent progress, and some amount of human annotation, what is the most effective method for efficiently creating high-quality entity recognizers in under-resourced languages? Based on extensive experimentation using both simulated and real human annotation, we settle on a recipe of starting with a cross-lingual transferred model, then performing targeted annotation of only uncertain entity spans in the target language, minimizing annotator effort. Results demonstrate that cross-lingual transfer is a powerful tool when very little data can be annotated, but an entity-targeted annotation strategy can achieve competitive accuracy quickly, with just one-tenth of training data.

pdf abs
Learning Rhyming Constraints using Structured Adversaries
Harsh Jhamtani | Sanket Vaibhav Mehta | Jaime Carbonell | Taylor Berg-Kirkpatrick
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Existing recurrent neural language models often fail to capture higher-level structure present in text: for example, rhyming patterns present in poetry. Much prior work on poetry generation uses manually defined constraints which are satisfied during decoding using either specialized decoding procedures or rejection sampling. The rhyming constraints themselves are typically not learned by the generator. We propose an alternate approach that uses a structured discriminator to learn a poetry generator that directly captures rhyming constraints in a generative adversarial setup. By causing the discriminator to compare poems based only on a learned similarity matrix of pairs of line ending words, the proposed approach is able to successfully learn rhyming patterns in two different English poetry datasets (Sonnet and Limerick) without explicitly being provided with any phonetic information

pdf abs
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
Zihang Dai | Zhilin Yang | Yiming Yang | Jaime Carbonell | Quoc Le | Ruslan Salakhutdinov
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.

pdf abs
Domain Adaptation of Neural Machine Translation by Lexicon Induction
Junjie Hu | Mengzhou Xia | Graham Neubig | Jaime Carbonell
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

It has been previously noted that neural machine translation (NMT) is very sensitive to domain shift. In this paper, we argue that this is a dual effect of the highly lexicalized nature of NMT, resulting in failure for sentences with large numbers of unknown words, and lack of supervision for domain-specific words. To remedy this problem, we propose an unsupervised adaptation method which fine-tunes a pre-trained out-of-domain NMT model using a pseudo-in-domain corpus. Specifically, we perform lexicon induction to extract an in-domain lexicon, and construct a pseudo-parallel in-domain corpus by performing word-for-word back-translation of monolingual in-domain target sentences. In five domains over twenty pairwise adaptation settings and two model architectures, our method achieves consistent improvements without using any in-domain parallel sentences, improving up to 14 BLEU over unadapted models, and up to 2 BLEU over strong back-translation baselines.

2018

pdf abs
Neural Cross-Lingual Named Entity Recognition with Minimal Resources
Jiateng Xie | Zhilin Yang | Graham Neubig | Noah A. Smith | Jaime Carbonell
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

For languages with no annotated resources, unsupervised transfer of natural language processing models such as named-entity recognition (NER) from resource-rich languages would be an appealing capability. However, differences in words and word order across languages make it a challenging problem. To improve mapping of lexical items across languages, we propose a method that finds translations based on bilingual word embeddings. To improve robustness to word order differences, we propose to use self-attention, which allows for a degree of flexibility with respect to word order. We demonstrate that these methods achieve state-of-the-art or competitive NER performance on commonly tested languages under a cross-lingual setting, with much lower resource requirements than past approaches. We also evaluate the challenges of applying these methods to Uyghur, a low-resource language.

pdf abs
DeepCx: A transition-based approach for shallow semantic parsing with complex constructional triggers
Jesse Dunietz | Jaime Carbonell | Lori Levin
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper introduces the surface construction labeling (SCL) task, which expands the coverage of Shallow Semantic Parsing (SSP) to include frames triggered by complex constructions. We present DeepCx, a neural, transition-based system for SCL. As a test case for the approach, we apply DeepCx to the task of tagging causal language in English, which relies on a wider variety of constructions than are typically addressed in SSP. We report substantial improvements over previous tagging efforts on a causal language dataset. We also propose ways DeepCx could be extended to still more difficult constructions and to other semantic domains once appropriate datasets become available.

pdf abs
Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations
Aditi Chaudhary | Chunting Zhou | Lori Levin | Graham Neubig | David R. Mortensen | Jaime Carbonell
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Much work in Natural Language Processing (NLP) has been for resource-rich languages, making generalization to new, less-resourced languages challenging. We present two approaches for improving generalization to low-resourced languages by adapting continuous word representations using linguistically motivated subword units: phonemes, morphemes and graphemes. Our method requires neither parallel corpora nor bilingual dictionaries and provides a significant gain in performance over previous methods relying on these resources. We demonstrate the effectiveness of our approaches on Named Entity Recognition for four languages, namely Uyghur, Turkish, Bengali and Hindi, of which Uyghur and Bengali are low resource languages, and also perform experiments on Machine Translation. Exploiting subwords with transfer learning gives us a boost of +15.2 NER F1 for Uyghur and +9.7 F1 for Bengali. We also show improvements in the monolingual setting where we achieve (avg.) +3 F1 and (avg.) +1.35 BLEU.

pdf abs
Towards Semi-Supervised Learning for Deep Semantic Role Labeling
Sanket Vaibhav Mehta | Jay Yoon Lee | Jaime Carbonell
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Neural models have shown several state-of-the-art performances on Semantic Role Labeling (SRL). However, the neural models require an immense amount of semantic-role corpora and are thus not well suited for low-resource languages or domains. The paper proposes a semi-supervised semantic role labeling method that outperforms the state-of-the-art in limited SRL training corpora. The method is based on explicitly enforcing syntactic constraints by augmenting the training objective with a syntactic-inconsistency loss component and uses SRL-unlabeled instances to train a joint-objective LSTM. On CoNLL-2012 English section, the proposed semi-supervised training with 1%, 10% SRL-labeled data and varying amounts of SRL-unlabeled data achieves +1.58, +0.78 F1, respectively, over the pre-trained models that were trained on SOTA architecture with ELMo on the same SRL-labeled data. Additionally, by using the syntactic-inconsistency loss on inference time, the proposed model achieves +3.67, +2.1 F1 over pre-trained model on 1%, 10% SRL-labeled data, respectively.

2017

pdf abs
The BECauSE Corpus 2.0: Annotating Causality and Overlapping Relations
Jesse Dunietz | Lori Levin | Jaime Carbonell
Proceedings of the 11th Linguistic Annotation Workshop

Language of cause and effect captures an essential component of the semantics of a text. However, causal language is also intertwined with other semantic relations, such as temporal precedence and correlation. This makes it difficult to determine when causation is the primary intended meaning. This paper presents BECauSE 2.0, a new version of the BECauSE corpus with exhaustively annotated expressions of causal language, but also seven semantic relations that are frequently co-present with causation. The new corpus shows high inter-annotator agreement, and yields insights both about the linguistic expressions of causation and about the process of annotating co-present semantic relations.

pdf abs
Automatically Tagging Constructions of Causation and Their Slot-Fillers
Jesse Dunietz | Lori Levin | Jaime Carbonell
Transactions of the Association for Computational Linguistics, Volume 5

This paper explores extending shallow semantic parsing beyond lexical-unit triggers, using causal relations as a test case. Semantic parsing becomes difficult in the face of the wide variety of linguistic realizations that causation can take on. We therefore base our approach on the concept of constructions from the linguistic paradigm known as Construction Grammar (CxG). In CxG, a construction is a form/function pairing that can rely on arbitrary linguistic and semantic features. Rather than codifying all aspects of each construction’s form, as some attempts to employ CxG in NLP have done, we propose methods that offload that problem to machine learning. We describe two supervised approaches for tagging causal constructions and their arguments. Both approaches combine automatically induced pattern-matching rules with statistical classifiers that learn the subtler parameters of the constructions. Our results show that these approaches are promising: they significantly outperform naïve baselines for both construction recognition and cause and effect head matches.

2016

pdf
Generation from Abstract Meaning Representation using Tree Transducers
Jeffrey Flanigan | Chris Dyer | Noah A. Smith | Jaime Carbonell
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
CMU at SemEval-2016 Task 8: Graph-based AMR Parsing with Infinite Ramp Loss
Jeffrey Flanigan | Chris Dyer | Noah A. Smith | Jaime Carbonell
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
Phonologically Aware Neural Model for Named Entity Recognition in Low Resource Transfer Settings
Akash Bharadwaj | David Mortensen | Chris Dyer | Jaime Carbonell
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf abs
Leveraging Multilingual Training for Limited Resource Event Extraction
Andrew Hsi | Yiming Yang | Jaime Carbonell | Ruochen Xu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Event extraction has become one of the most important topics in information extraction, but to date, there is very limited work on leveraging cross-lingual training to boost performance. We propose a new event extraction approach that trains on multiple languages using a combination of both language-dependent and language-independent features, with particular focus on the case where target domain training data is of very limited size. We show empirically that multilingual training can boost performance for the tasks of event trigger extraction and event argument extraction on the Chinese ACE 2005 dataset.

2015

pdf
Annotating Causal Language Using Corpus Lexicography of Constructions
Jesse Dunietz | Lori Levin | Jaime Carbonell
Proceedings of the 9th Linguistic Annotation Workshop

pdf
Extending a Single-Document Summarizer to Multi-Document: a Hierarchical Approach
Luís Marujo | Ricardo Ribeiro | David Martins de Matos | João Neto | Anatole Gershman | Jaime Carbonell
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf
Frame-Semantic Role Labeling with Heterogeneous Annotations
Meghana Kshirsagar | Sam Thomson | Nathan Schneider | Jaime Carbonell | Noah A. Smith | Chris Dyer
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

This paper describes a suite of tools for extracting conventionalized metaphors in English, Spanish, Farsi, and Russian. The method depends on three significant resources for each language: a corpus of conventionalized metaphors, a table of conventionalized conceptual metaphors (CCM table), and a set of extraction rules. Conventionalized metaphors are things like “escape from poverty” and “burden of taxation”. For each metaphor, the CCM table contains the metaphorical source domain word (such as “escape”) the target domain word (such as “poverty”) and the grammatical construction in which they can be found. The extraction rules operate on the output of a dependency parser and identify the grammatical configurations (such as a verb with a prepositional phrase complement) that are likely to contain conventional metaphors. We present results on detection rates for conventional metaphors and analysis of the similarity and differences of source domains for conventional metaphors in the four languages.

pdf
A Discriminative Graph-Based Parser for the Abstract Meaning Representation
Jeffrey Flanigan | Sam Thomson | Jaime Carbonell | Chris Dyer | Noah A. Smith
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Cross-Lingual Information to the Rescue in Keyword Extraction
Chung-Chi Huang | Maxine Eskenazi | Jaime Carbonell | Lun-Wei Ku | Ping-Che Yang
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2013

pdf
Large-Scale Discriminative Training for Statistical Machine Translation Using Held-Out Line Search
Jeffrey Flanigan | Chris Dyer | Jaime Carbonell
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
The Effects of Lexical Resource Quality on Preference Violation Detection
Jesse Dunietz | Lori Levin | Jaime Carbonell
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf abs
Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization
Luís Marujo | Anatole Gershman | Jaime Carbonell | Robert Frederking | João P. Neto
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Fast and effective automated indexing is critical for search and personalized services. Key phrases that consist of one or more words and represent the main concepts of the document are often used for the purpose of indexing. In this paper, we investigate the use of additional semantic features and pre-processing steps to improve automatic key phrase extraction. These features include the use of signal words and freebase categories. Some of these features lead to significant improvements in the accuracy of the results. We also experimented with 2 forms of document pre-processing that we call light filtering and co-reference normalization. Light filtering removes sentences from the document, which are judged peripheral to its main content. Co-reference normalization unifies several written forms of the same named entity into a unique form. We also needed a Gold Standard ― a set of labeled documents for training and evaluation. While the subjective nature of key phrase selection precludes a true Gold Standard, we used Amazon's Mechanical Turk service to obtain a useful approximation. Our data indicates that the biggest improvements in performance were due to shallow semantic features, news categories, and rhetorical signals (nDCG 78.47% vs. 68.93%). The inclusion of deeper semantic features such as Freebase sub-categories was not beneficial by itself, but in combination with pre-processing, did cause slight improvements in the nDCG scores.

2011

pdf
Multi-Strategy Approaches to Active Learning for Statistical Machine Translation
Vamshi Ambati | Stephan Vogel | Jaime Carbonell
Proceedings of Machine Translation Summit XIII: Papers

pdf
Active Learning with Multiple Annotations for Comparable Data Classification Task
Vamshi Ambati | Sanjika Hewavitharana | Stephan Vogel | Jaime Carbonell
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

2010

pdf bib
Active Semi-Supervised Learning for Improving Word Alignment
Vamshi Ambati | Stephan Vogel | Jaime Carbonell
Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing

pdf
Monolingual Distributional Profiles for Word Substitution in Machine Translation
Rashmi Gangadharaiah | Ralf D. Brown | Jaime Carbonell
Coling 2010: Posters

pdf
Chunk-Based EBMT
Jae Dong Kim | Ralf Brown | Jaime Carbonell
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

pdf
Automatic Determination of Number of clusters for creating Templates in Example-Based Machine Translation
Rashmi Gangadharaiah | Ralf Brown | Jaime Carbonell
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

pdf abs
Active Learning and Crowd-Sourcing for Machine Translation
Vamshi Ambati | Stephan Vogel | Jaime Carbonell
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Large scale parallel data generation for new language pairs requires intensive human effort and availability of experts. It becomes immensely difficult and costly to provide Statistical Machine Translation (SMT) systems for most languages due to the paucity of expert translators to provide parallel data. Even if experts are present, it appears infeasible due to the impending costs. In this paper we propose Active Crowd Translation (ACT), a new paradigm where active learning and crowd-sourcing come together to enable automatic translation for low-resource language pairs. Active learning aims at reducing cost of label acquisition by prioritizing the most informative data for annotation, while crowd-sourcing reduces cost by using the power of the crowds to make do for the lack of expensive language experts. We experiment and compare our active learning strategies with strong baselines and see significant improvements in translation quality. Similarly, our experiments with crowd-sourcing on Mechanical Turk have shown that it is possible to create parallel corpora using non-experts and with sufficient quality assurance, a translation system that is trained using this corpus approaches expert quality.

pdf
Active Learning-Based Elicitation for Semi-Supervised Word Alignment
Vamshi Ambati | Stephan Vogel | Jaime Carbonell
Proceedings of the ACL 2010 Conference Short Papers

2009

pdf
Proactive Learning for Building Machine Translation Systems for Minority Languages
Vamshi Ambati | Jaime Carbonell
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

pdf
Active Learning in Example-Based Machine Translation
Rashmi Gangadharaiah | Ralf D. Brown | Jaime Carbonell
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

pdf bib
Extraction of Syntactic Translation Models from Parallel Data using Syntax from Source and Target Languages
Vamshi Ambati | Alon Lavie | Jaime Carbonell
Proceedings of Machine Translation Summit XII: Posters

2008

pdf
Cluster-Based Query Expansion for Statistical Question Answering
Lucian Vlad Lita | Jaime Carbonell
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Producing machine translation (MT) for the many minority languages in the world is a serious challenge. Minority languages typically have few resources for building MT systems. For many minor languages there is little machine readable text, few knowledgeable linguists, and little money available for MT development. For these reasons, our research programs on minority language MT have focused on leveraging to the maximum extent two resources that are available for minority languages: linguistic structure and bilingual informants. All natural languages contain linguistic structure. And although the details of that linguistic structure vary from language to language, language universals such as context-free syntactic structure and the paradigmatic structure of inflectional morphology, allow us to learn the specific details of a minority language. Similarly, most minority languages possess speakers who are bilingual with the major language of the area. This paper discusses our efforts to utilize linguistic structure and the translation information that bilingual informants can provide in three sub-areas of our rapid development MT program: morphology induction, syntactic transfer rule learning, and refinement of imperfect learned rules.

pdf
Evaluating an Agglutinative Segmentation Model for ParaMor
Christian Monson | Alon Lavie | Jaime Carbonell | Lori Levin
Proceedings of the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology

2007

pdf
Improving transfer-based MT systems with automatic refinements
Ariadna Font Llitjós | Jaime Carbonell | Alon Lavie
Proceedings of Machine Translation Summit XI: Papers

pdf
ParaMor: Minimally Supervised Induction of Paradigm Structure and Morphological Analysis
Christian Monson | Jaime Carbonell | Alon Lavie | Lori Levin
Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology

pdf
Combining Probability-Based Rankers for Action-Item Detection
Paul N. Bennett | Jaime G. Carbonell
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

2006

Context-Based Machine TranslationTM (CBMT) is a new paradigm for corpus-based translation that requires no parallel text. Instead, CBMT relies on a lightweight translation model utilizing a fullform bilingual dictionary and a sophisticated decoder using long-range context via long n-grams and cascaded overlapping. The translation process is enhanced via in-language substitution of tokens and phrases, both for source and target, when top candidates cannot be confirmed or resolved in decoding. Substitution utilizes a synonym and near-synonym generator implemented as a corpus-based unsupervised learning process. Decoding requires a very large target-language-only corpus, and while substitution in target can be performed using that same corpus, substitution in source requires a separate (and smaller) source monolingual corpus. Spanish-to-English CBMT was tested on Spanish newswire text, achieving a BLEU score of 0.6462 in June 2006, the highest BLEU reported for any language pair. Further testing also shows that quality increases above the reported score as the target corpus size increases and as dictionary coverage of source words and phrases becomes more complete.

pdf bib
Presentation
Jaime Carbonell
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Panel on hybrid machine translation: why and how?

pdf
Spectral Clustering for Example Based Machine Translation
Rashmi Gangadharaiah | Ralf Brown | Jaime Carbonell
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

2005

pdf
A framework for interactive and automatic refinement of transfer-based machine translation
Ariadna Font Llitjós | Jaime G. Carbonell | Alon Lavie
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf
Symmetric probabilistic alignment for example-based translation
Jae Dong Kim | Ralf D. Brown | Peter J. Jansen | Jaime G. Carbonell
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf
Symmetric Probabilistic Alignment
Ralf D. Brown | Jae Dong Kim | Peter J. Jansen | Jaime G. Carbonell
Proceedings of the ACL Workshop on Building and Using Parallel Texts

2004

pdf abs
Error analysis of two types of grammar for the purpose of automatic rule refinement
Ariadna Font Llitjós | Katharina Probst | Jaime Carbonell
Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers

This paper compares a manually written MT grammar and a grammar learned automatically from an English-Spanish elicitation corpus with the ultimate purpose of automatically refining the translation rules. The experiment described here shows that the kind of automatic refinement operations required to correct a translation not only varies depending on the type of error, but also on the type of grammar. This paper describes the two types of grammars and gives a detailed error analysis of their output, indicating what kinds of refinements are required in each case.

pdf
The Translation Correction Tool: English-Spanish User Studies
Ariadna Font Llitjós | Jaime Carbonell
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf
Developing Language Resources for a Transnational Digital Government System
Violetta Cavalli-Sforza | Jaime G. Carbonell | Peter J. Jansen
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf
Challenges in using an example-based MT system for a transnational digital government project
Violetta Cavalli-Sforza | Ralf D. Brown | Jaime G. Carbonell | Peter G. Jansen | Jae Dong Kim
Proceedings of the 9th EAMT Workshop: Broadening horizons of machine translation and its applications

pdf
Unsupervised Induction of Natural Language Morphology Inflection Classes
Christian Monson | Alon Lavie | Jaime Carbonell | Lori Levin
Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology

pdf
Instance-Based Question Answering: A Data-Driven Approach
Lucian Vlad Lita | Jaime Carbonell
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

2003

pdf abs
Reducing boundary friction using translation-fragment overlap
Ralf D. Brown | Rebecca Hutchinson | Paul N. Bennett | Jaime G. Carbonell | Peter Jansen
Proceedings of Machine Translation Summit IX: Papers

Many corpus-based Machine Translation (MT) systems generate a number of partial translations which are then pieced together rather than immediately producing one overall translation. While this makes them more robust to ill-formed input, they are subject to disfluencies at phrasal translation boundaries even for well-formed input. We address this “boundary friction” problem by introducing a method that exploits overlapping phrasal translations and the increased confidence in translation accuracy they imply. We specify an efficient algorithm for producing translations using overlap. Finally, our empirical analysis indicates that this approach produces higher quality translations than the standard method of combining non-overlapping fragments generated by our Example-Based MT (EBMT) system in a peak-to-peak comparison.

2002

pdf
Design and Evolution of a Language Technologies Curriculum
Robert Frederking | Eric H. Nyberg | Teruko Mitamura | Jaime G. Carbonell
Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics

Machine Translation of minority languages presents unique challenges, including the paucity of bilingual training data and the unavailability of linguistically-trained speakers. This paper focuses on a machine learning approach to transfer-based MT, where data in the form of translations and lexical alignments are elicited from bilingual speakers, and a seeded version-space learning algorithm formulates and refines transfer rules. A rule-generalization lattice is defined based on LFG-style f-structures, permitting generalization operators in the search for the most general rules consistent with the elicited data. The paper presents these methods and illustrates examples.

2001

NICE is a machine translation project for low-density languages. We are building a tool that will elicit a controlled corpus from a bilingual speaker who is not an expert in linguistics. The corpus is intended to cover major typological phenomena, as it is designed to work for any language. Using implicational universals, we strive to minimize the number of sentences that each informant has to translate. From the elicited sentences, we learn transfer rules with a version space algorithm. Our vision for MT in the future is one in which systems can be quickly trained for new languages by native speakers, so that speakers of minor languages can participate in education, health care, government, and internet without having to give up their languages.

2000

pdf
Multi-Document Summarization By Sentence Extraction
Jade Goldstein | Vibhu Mittal | Jaime Carbonell | Mark Kantrowitz
NAACL-ANLP 2000 Workshop: Automatic Summarization

1998

pdf
Summarization: (1) Using MMR for Diversity- Based Reranking and (2) Evaluating Summaries
Jade Goldstein | Jaime Carbonell
TIPSTER TEXT PROGRAM PHASE III: Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998

1996

Knowledge-based interlingual machine translation systems produce semantically accurate translations, but typically require massive knowledge acquisition. This paper describes KANT, a system that reduces this requirement to produce practical, scalable, and accurate KBMT applications. First, the set of requirements is discussed, then the full KANT architecture is illustrated, and finally results from a fully implemented prototype are presented.

pdf
Session 3: Machine Translation
Jaime Carbonell
Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California, February 19-22, 1991