Khalil Sima’an
Also published as: K. Sima’an
2025
How Aligned Are Unimodal Language and Graph Encodings of Chemical Molecules?
Congfeng Cao | Zhi Zhang | Jelke Bloem | Khalil Sima’an
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Chemical molecules can be represented as graphs or as language descriptions. Training unimodal models on graphs results in different encodings than training them on language. Therefore, the existing literature force-aligns the unimodal models during training to use them in downstream applications such as drug discovery. But to what extent are graph and language unimodal model representations inherently aligned, i.e., aligned prior to any force-alignment training? Knowing this is useful for more expedient and effective force-alignment. For the first time, we explore methods to gauge the alignment of graph and language unimodal models. We find compelling differences between models in their ability to represent slight structural differences without force-alignment. We also present a unified unimodal alignment (U2A) benchmark for gauging the inherent alignment between graph and language encoders, which we make available with this paper.
2024
Continual Reinforcement Learning for Controlled Text Generation
Velizar Shulev | Khalil Sima’an
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Controlled Text Generation (CTG) steers the generation of continuations of a given context (prompt) by a Large Language Model (LLM) towards texts possessing a given attribute (e.g., topic, sentiment). In this paper we view CTG as a Continual Learning problem: how to learn at every step to steer next-word generation, without having to wait for end-of-sentence. This continual view is useful for online applications such as CTG for speech, where end-of-sentence is often uncertain. We depart from an existing model, Plug-and-Play Language Models (PPLM), which perturbs the context at each step to better predict next words that possess the desired attribute. While PPLM is intricate and has many hyper-parameters, we provide a proof that the PPLM objective function can be reduced to a Continual Reinforcement Learning (CRL) reward function, thereby simplifying PPLM and endowing it with a better-understood learning framework. Subsequently, we present the first CTG algorithm that is fully based on CRL and exhibit promising empirical results.
2022
Passing Parser Uncertainty to the Transformer: Labeled Dependency Distributions for Neural Machine Translation
Dongqi Liu | Khalil Sima’an
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
Existing syntax-enriched neural machine translation (NMT) models work either with the single most-likely unlabeled parse or the set of n-best unlabeled parses coming out of an external parser. Passing a single or n-best parses to the NMT model risks propagating parse errors. Furthermore, unlabeled parses represent only syntactic groupings without their linguistically relevant categories. In this paper we explore the question: Does passing both parser uncertainty and labeled syntactic knowledge to the Transformer improve its translation performance? This paper contributes a novel method for infusing the whole labeled dependency distributions (LDD) of the source sentence’s dependency forest into the self-attention mechanism of the encoder of the Transformer. Experimental results on three language pairs demonstrate that the proposed approach outperforms both the vanilla Transformer and the single best-parse Transformer model across several evaluation metrics.
2018
Deep Generative Model for Joint Alignment and Word Representation
Miguel Rios | Wilker Aziz | Khalil Sima’an
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
This work exploits translation data as a source of semantically relevant learning signal for models of word representation. In particular, we exploit equivalence through translation as a form of distributional context and jointly learn how to embed and align with a deep generative model. Our EmbedAlign model embeds words in their complete observed context and learns by marginalisation of latent lexical alignments. Moreover, it embeds words as posterior probability densities, rather than point estimates, which allows us to compare words in context using a measure of overlap between distributions (e.g. KL divergence). We investigate our model’s performance on a range of lexical semantics tasks, achieving competitive results on several standard benchmarks including natural language inference, paraphrasing, and text similarity.
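The closing point of the abstract, comparing words as densities rather than points, can be illustrated with a small sketch. This is not the paper's code; it assumes diagonal Gaussian posteriors and uses the standard closed-form KL divergence between them (the helper name `kl_diag_gaussians` is ours):

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """Closed-form KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) )."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    return 0.5 * np.sum(
        np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0
    )

# Identical densities have zero divergence; shifting one mean increases it,
# giving a graded (asymmetric) notion of overlap between word representations.
same = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0])
```

Here `same` is exactly 0 and `shifted` is 0.5, showing how density overlap yields a graded comparison that point embeddings with a fixed distance would not.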
2017
Elastic-substitution decoding for Hierarchical SMT: efficiency, richer search and double labels
Gideon Maillette de Buy Wenniger | Khalil Sima’an | Andy Way
Proceedings of Machine Translation Summit XVI: Research Track
Graph Convolutional Encoders for Syntax-aware Neural Machine Translation
Jasmijn Bastings | Ivan Titov | Wilker Aziz | Diego Marcheggiani | Khalil Sima’an
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
We present a simple and effective approach to incorporating syntactic structure into neural attention-based encoder-decoder models for machine translation. We rely on graph-convolutional networks (GCNs), a recent class of neural networks developed for modeling graph-structured data. Our GCNs use predicted syntactic dependency trees of source sentences to produce representations of words (i.e. hidden states of the encoder) that are sensitive to their syntactic neighborhoods. GCNs take word representations as input and produce word representations as output, so they can easily be incorporated as layers into standard encoders (e.g., on top of bidirectional RNNs or convolutional neural networks). We evaluate their effectiveness with English-German and English-Czech translation experiments for different types of encoders and observe substantial improvements over their syntax-agnostic versions in all the considered setups.
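The core operation described above, word representations in and syntactically contextualized word representations out, can be sketched minimally. Note the paper's syntactic GCN additionally uses direction- and label-specific weights and edge gates; this is only the plain graph-convolution skeleton over a dependency adjacency matrix, with illustrative shapes:

```python
import numpy as np

def gcn_layer(H, A, W, b):
    """One plain graph-convolution step: each word's new vector is a
    ReLU-transformed sum over its syntactic neighbours (A includes self-loops)."""
    return np.maximum(0.0, A @ H @ W + b)

rng = np.random.default_rng(0)
n_words, d = 4, 8
H = rng.normal(size=(n_words, d))   # encoder states, one row per word
A = np.eye(n_words)                 # self-loops keep each word's own state
A[0, 1] = A[1, 0] = 1.0             # a dependency arc, treated here as undirected
W = rng.normal(size=(d, d))
b = np.zeros(d)

H_out = gcn_layer(H, A, W, b)       # same shape: word reps in, word reps out
```

Because input and output shapes match, such a layer can be stacked on top of any encoder (e.g. a bidirectional RNN), which is exactly the plug-in property the abstract highlights.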
Alternative Objective Functions for Training MT Evaluation Metrics
Miloš Stanojević | Khalil Sima’an
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
MT evaluation metrics are tested for correlation with human judgments either at the sentence- or the corpus-level. Trained metrics ignore corpus-level judgments and are trained for high sentence-level correlation only. We show that training for only one objective (sentence or corpus level) can not only harm performance on the other objective, but can also be suboptimal for the objective being optimized. To this end we present a metric trained at the corpus level and show an empirical comparison against a metric trained at the sentence level, exemplifying how their performance may vary per language pair, type and level of judgment. Subsequently we propose a model trained to optimize both objectives simultaneously and show that it is far more stable than, and on average outperforms, both models on both objectives.
2016
Hierarchical Permutation Complexity for Word Order Evaluation
Miloš Stanojević | Khalil Sima’an
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Existing approaches for evaluating word order in machine translation work with metrics computed directly over a permutation of word positions in system output relative to a reference translation. However, every permutation factorizes into a permutation tree (PET) built of primal permutations, i.e., atomic units that do not factorize any further. In this paper we explore the idea that permutations factorizing into (on average) shorter primal permutations should represent simpler ordering as well. Consequently, we contribute Permutation Complexity, a class of metrics over PETs and their extension to forests, and define tight metrics, a sub-class of metrics implementing this idea. Subsequently we define example tight metrics and empirically test them in word order evaluation. Experiments on the WMT13 data sets for ten language pairs show that a tight metric is more often than not better than the baselines.
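The notion of a primal permutation, the atomic unit from which PETs are built, can be made concrete with a small illustrative check (a sketch of the definition, not code from the paper): a permutation is primal when no proper contiguous sub-span of length at least two maps to a contiguous range of values, i.e. it does not factorize further.

```python
def is_primal(perm):
    """True iff no proper sub-span of length >= 2 covers a contiguous
    range of values, i.e. the permutation does not factorize further."""
    n = len(perm)
    for i in range(n):
        lo = hi = perm[i]
        for j in range(i + 1, n):
            lo, hi = min(lo, perm[j]), max(hi, perm[j])
            # a sub-span is a factorizable block when its value range
            # is exactly as wide as the span itself
            if j - i + 1 < n and hi - lo + 1 == j - i + 1:
                return False
    return True

# (2,4,1,3) is the shortest primal permutation beyond the atomic 12 and 21;
# (2,1,4,3) factorizes into two swapped pairs under a monotone root.
print(is_primal([2, 4, 1, 3]))  # True
print(is_primal([2, 1, 4, 3]))  # False
```

Under the abstract's hypothesis, a permutation like (2,1,4,3), which decomposes into short primal pieces, represents simpler reordering than (2,4,1,3), which is a single atomic unit.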
Universal Reordering via Linguistic Typology
Joachim Daiber | Miloš Stanojević | Khalil Sima’an
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
In this paper we explore the novel idea of building a single universal reordering model from English to a large number of target languages. To build this model we exploit typological features of word order for a large number of target languages together with source (English) syntactic features, and we train this model on a single combined parallel corpus representing all (22) involved language pairs. We contribute experimental evidence for the usefulness of linguistically defined typological features for building such a model. When the universal reordering model is used for preordering followed by monotone translation (no reordering inside the decoder), our experiments show that this pipeline gives comparable or improved translation performance relative to a phrase-based baseline for a large number of language pairs (12 out of 22) from diverse language families.
Word Alignment without NULL Words
Philip Schulz | Wilker Aziz | Khalil Sima’an
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Adapting to All Domains at Once: Rewarding Domain Invariance in SMT
Hoang Cuong | Khalil Sima’an | Ivan Titov
Transactions of the Association for Computational Linguistics, Volume 4
Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced sub-domains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains. We observe consistent improvements over a baseline which does not explicitly reward domain invariance.
Examining the Relationship between Preordering and Word Order Freedom in Machine Translation
Joachim Daiber | Miloš Stanojević | Wilker Aziz | Khalil Sima’an
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers
ILLC-UvA Adaptation System (Scorpio) at WMT’16 IT-DOMAIN Task
Hoang Cuong | Stella Frank | Khalil Sima’an
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
A Shared Task on Multimodal Machine Translation and Crosslingual Image Description
Lucia Specia | Stella Frank | Khalil Sima’an | Desmond Elliott
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
Multi30K: Multilingual English-German Image Descriptions
Desmond Elliott | Stella Frank | Khalil Sima’an | Lucia Specia
Proceedings of the 5th Workshop on Vision and Language
Factoring Adjunction in Hierarchical Phrase-Based SMT
Sophie Arnoult | Khalil Sima’an
Proceedings of the 2nd Deep Machine Translation Workshop
2015
Machine translation with source-predicted target morphology
Joachim Daiber | Khalil Sima’an
Proceedings of Machine Translation Summit XV: Papers
The EXPERT project: Advancing the state of the art in hybrid translation technologies
Constantin Orasan | Alessandro Cattelan | Gloria Corpas Pastor | Josef van Genabith | Manuel Herranz | Juan José Arevalillo | Qun Liu | Khalil Sima’an | Lucia Specia
Proceedings of Translating and the Computer 37
Reordering Grammar Induction
Miloš Stanojević | Khalil Sima’an
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
Latent Domain Word Alignment for Heterogeneous Corpora
Hoang Cuong | Khalil Sima’an
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
BEER 1.1: ILLC UvA submission to metrics and tuning task
Miloš Stanojević | Khalil Sima’an
Proceedings of the Tenth Workshop on Statistical Machine Translation
Modelling the Adjunct/Argument Distinction in Hierarchical Phrase-Based SMT
Sophie Arnoult | Khalil Sima’an
Proceedings of the 1st Deep Machine Translation Workshop
Delimiting Morphosyntactic Search Space with Source-Side Reordering Models
Joachim Daiber | Khalil Sima’an
Proceedings of the 1st Deep Machine Translation Workshop
2014
Latent Domain Translation Models in Mix-of-Domains Haystack
Hoang Cuong | Khalil Sima’an
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers
Fitting Sentence Level Translation Evaluation with Many Dense Features
Miloš Stanojević | Khalil Sima’an
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Latent Domain Phrase-based Models for Adaptation
Hoang Cuong | Khalil Sima’an
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
All Fragments Count in Parser Evaluation
Jasmijn Bastings | Khalil Sima’an
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
PARSEVAL, the default paradigm for evaluating constituency parsers, calculates parsing success (Precision/Recall) as a function of the number of matching labeled brackets across the test set. Nodes in constituency trees, however, are connected together to reflect important linguistic relations such as predicate-argument and direct-dominance relations between categories. In this paper, we present FREVAL, a generalization of PARSEVAL, where precision and recall are calculated not only for individual brackets, but also for co-occurring, connected brackets (i.e. fragments). FREVAL fragment precision (FLP) and recall (FLR) interpolate the match across the whole spectrum of fragment sizes, ranging from those consisting of individual nodes (labeled brackets) to those consisting of full parse trees. We provide evidence that FREVAL is informative for inspecting relative parser performance by comparing a range of existing parsers.
BEER: BEtter Evaluation as Ranking
Miloš Stanojević | Khalil Sima’an
Proceedings of the Ninth Workshop on Statistical Machine Translation
Bilingual Markov Reordering Labels for Hierarchical SMT
Gideon Maillette de Buy Wenniger | Khalil Sima’an
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation
Evaluating Word Order Recursively over Permutation-Forests
Miloš Stanojević | Khalil Sima’an
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation
How Synchronous are Adjuncts in Translation Data?
Sophie Arnoult | Khalil Sima’an
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation
2013
Hierarchical Alignment Decomposition Labels for Hiero Grammar Rules
Gideon Maillette de Buy Wenniger | Khalil Sima’an
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation
A Formal Characterization of Parsing Word Alignments by Synchronous Grammars with Empirical Evidence to the ITG Hypothesis.
Gideon Maillette de Buy Wenniger | Khalil Sima’an
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation
2012
Adjunct Alignment in Translation Data with an Application to Phrase Based Statistical Machine Translation
Sophie Arnoult | Khalil Sima’an
Proceedings of the 16th Annual Conference of the European Association for Machine Translation
2011
Context-Sensitive Syntactic Source-Reordering by Statistical Transduction
Maxim Khalilov | Khalil Sima’an
Proceedings of 5th International Joint Conference on Natural Language Processing
Learning Hierarchical Translation Structure with Linguistic Annotations
Markos Mylonakis | Khalil Sima’an
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
ILLC-UvA translation system for EMNLP-WMT 2011
Maxim Khalilov | Khalil Sima’an
Proceedings of the Sixth Workshop on Statistical Machine Translation
Learning Structural Dependencies of Words in the Zipfian Tail
Tejaswini Deoskar | Markos Mylonakis | Khalil Sima’an
Proceedings of the 12th International Conference on Parsing Technologies
2010
Source reordering using MaxEnt classifiers and supertags
Maxim Khalilov | Khalil Sima’an
Proceedings of the 14th Annual Conference of the European Association for Machine Translation
ILLC-UvA machine translation system for the IWSLT 2010 evaluation
Maxim Khalilov | Khalil Sima’an
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign
Modeling Morphosyntactic Agreement in Constituency-Based Parsing of Modern Hebrew
Reut Tsarfaty | Khalil Sima’an
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages
Learning Probabilistic Synchronous CFGs for Phrase-Based Translation
Markos Mylonakis | Khalil Sima’an
Proceedings of the Fourteenth Conference on Computational Natural Language Learning
A Discriminative Syntactic Model for Source Permutation via Tree Transduction
Maxim Khalilov | Khalil Sima’an
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation
2009
An Alternative to Head-Driven Approaches for Parsing a (Relatively) Free Word-Order Language
Reut Tsarfaty | Khalil Sima’an | Remko Scha
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
A Syntactified Direct Translation Model with Linear-time Decoding
Hany Hassan | Khalil Sima’an | Andy Way
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
Lexicalized Semi-incremental Dependency Parsing
Hany Hassan | Khalil Sima’an | Andy Way
Proceedings of the International Conference RANLP-2009
Smoothing fine-grained PCFG lexicons
Tejaswini Deoskar | Mats Rooth | Khalil Sima’an
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)
2008
Relational-Realizational Parsing
Reut Tsarfaty | Khalil Sima’an
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
Phrase Translation Probabilities with ITG Priors and Smoothing as Learning Objective
Markos Mylonakis | Khalil Sima’an
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing
Subdomain Sensitive Statistical Parsing using Raw Corpora
Barbara Plank | Khalil Sima’an
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Modern statistical parsers are trained on large annotated corpora (treebanks). These treebanks usually consist of sentences addressing different subdomains (e.g. sports, politics, music), which implies that the statistics gathered by current statistical parsers are mixtures of subdomains of language use. In this paper we present a method that exploits raw subdomain corpora gathered from the web to introduce subdomain sensitivity into a given parser. We employ statistical techniques for creating an ensemble of domain sensitive parsers, and explore methods for amalgamating their predictions. Our experiments show that introducing domain sensitivity by exploiting raw corpora can improve over a tough, state-of-the-art baseline.
2007
Supertagged Phrase-Based Statistical Machine Translation
Hany Hassan | Khalil Sima’an | Andy Way
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics
Smoothing a Lexicon-based POS Tagger for Arabic and Hebrew
Saib Manour | Khalil Sima’an | Yoad Winter
Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources
Three-Dimensional Parametrization for Parsing Morphologically Rich Languages
Reut Tsarfaty | Khalil Sima’an
Proceedings of the Tenth International Conference on Parsing Technologies
2006
Corpus Variations for Translation Lexicon Induction
Rebecca Hwa | Carol Nichols | Khalil Sima’an
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers
Lexical mappings (word translations) between languages are an invaluable resource for multilingual processing. While the problem of extracting lexical mappings from parallel corpora is well-studied, the task is more challenging when the language samples are from non-parallel corpora. The goal of this work is to investigate one such scenario: finding lexical mappings between dialects of a diglossic language, in which people conduct their written communications in a prestigious formal dialect, but they communicate verbally in a colloquial dialect. Because the two dialects serve different socio-linguistic functions, parallel corpora do not naturally exist between them. An example of a diglossic dialect pair is Modern Standard Arabic (MSA) and Levantine Arabic. In this paper, we evaluate the applicability of a standard algorithm for inducing lexical mappings between comparable corpora (Rapp, 1999) to such diglossic corpora pairs. The focus of the paper is an in-depth error analysis, exploring the notion of relatedness in diglossic corpora and scrutinizing the effects of various dimensions of relatedness (such as mode, topic, style, and statistics) on the quality of the resulting translation lexicon.
2005
Choosing an Optimal Architecture for Segmentation and POS-Tagging of Modern Hebrew
Roy Bar-Haim | Khalil Sima’an | Yoad Winter
Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages
2004
BioGrapher: Biography Questions as a Restricted Domain Question Answering Task
Oren Tsur | Maarten de Rijke | Khalil Sima’an
Proceedings of the Conference on Question Answering in Restricted Domains
2003
On maximizing metrics for syntactic disambiguation
Khalil Sima’an
Proceedings of the Eighth International Conference on Parsing Technologies
Given a probabilistic parsing model and an evaluation metric for scoring the match between parse-trees, e.g., PARSEVAL [Black et al., 1991], this paper addresses the problem of how to select the on average best scoring parse-tree for an input sentence. Common wisdom dictates that it is optimal to select the parse with the highest probability, regardless of the evaluation metric. In contrast, the Maximizing Metrics (MM) method [Goodman, 1998, Stolcke et al., 1997] proposes that an algorithm that optimizes the evaluation metric itself constitutes the optimal choice. We study the MM method within parsing. We observe that the MM proposition does not always hold for tree-bank models, and that optimizing weak metrics is not interesting for semantic processing. Subsequently, we state an alternative proposition: the optimal algorithm must maximize the metric that scores parse-trees according to linguistically relevant features. We present new algorithms that optimize metrics taking into account increasingly more linguistic features, and present experiments in support of our claim.
2001
Robust Data Oriented Parsing of Speech Utterances
Khalil Sima’an
Proceedings of the Seventh International Workshop on Parsing Technologies
2000
Tree-gram Parsing: Lexical Dependencies and Structural Relations
K. Sima’an
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics
1997
Explanation-Based Learning of Data-Oriented Parsing
K. Sima’an
CoNLL97: Computational Natural Language Learning
Co-authors
- Miloš Stanojević 9
- Hoang Cuong 5
- Maxim Khalilov 5
- Sophie Arnoult 4
- Wilker Aziz 4
- Joachim Daiber 4
- Gideon Maillette de Buy Wenniger 4
- Markos Mylonakis 4
- Reut Tsarfaty 4
- Andy Way 4
- Stella Frank 3
- Hany Hassan Awadalla 3
- Lucia Specia 3
- Jasmijn Bastings 2
- Tejaswini Deoskar 2
- Desmond Elliott 2
- Ivan Titov 2
- Yoad Winter 2
- Roy Bar-Haim 1
- Jelke Bloem 1
- Congfeng Cao 1
- Alessandro Cattelan 1
- Gloria Corpas Pastor 1
- Manuel Herranz 1
- Rebecca Hwa 1
- Juan José Arevalillo 1
- Qun Liu 1
- Dongqi Liu 1
- Saib Manour 1
- Diego Marcheggiani 1
- Carol Nichols 1
- Constantin Orasan 1
- Barbara Plank 1
- Miguel Rios 1
- Mats Rooth 1
- Remko Scha 1
- Philip Schulz 1
- Velizar Shulev 1
- Oren Tsur 1
- Zhi Zhang 1
- Maarten de Rijke 1
- Josef van Genabith 1