2015
NeRoSim: A System for Measuring and Interpreting Semantic Textual Similarity
Rajendra Banjade | Nobal Bikram Niraula | Nabin Maharjan | Vasile Rus | Dan Stefanescu | Mihai Lintean | Dipesh Gautam
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
2014
The DARE Corpus: A Resource for Anaphora Resolution in Dialogue Based Intelligent Tutoring Systems
Nobal Niraula | Vasile Rus | Rajendra Banjade | Dan Stefanescu | William Baggett | Brent Morgan
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We describe the DARE corpus, an annotated data set focusing on pronoun resolution in tutorial dialogue. Although data sets for general purpose anaphora resolution exist, they are not suitable for dialogue based Intelligent Tutoring Systems. To the best of our knowledge, no data set is currently available for pronoun resolution in dialogue based intelligent tutoring systems. The described DARE corpus consists of 1,000 annotated pronoun instances collected from conversations between high-school students and the intelligent tutoring system DeepTutor. The data set is publicly available.
Latent Semantic Analysis Models on Wikipedia and TASA
Dan Ștefănescu | Rajendra Banjade | Vasile Rus
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper introduces a collection of freely available Latent Semantic Analysis models built on the entire English Wikipedia and the TASA corpus. The models differ not only in their source, Wikipedia versus TASA, but also in the linguistic items they focus on: all words, content-words, nouns-verbs, and main concepts. Generating such models from large datasets (e.g. Wikipedia), which can provide broad coverage of the vocabulary in actual use, is computationally challenging, which is why large LSA models are rarely available. Our experiments show that for the task of word-to-word similarity, the scores assigned by these models are strongly correlated with human judgment, outperforming many other frequently used measures, and comparable to the state of the art.
2013
Wikipedia as an SMT Training Corpus
Dan Tufiș | Radu Ion | Ștefan Dumitrescu | Dan Ștefănescu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013
SEMILAR: The Semantic Similarity Toolkit
Vasile Rus | Mihai Lintean | Rajendra Banjade | Nobal Niraula | Dan Stefanescu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations
2012
ROMBAC: The Romanian Balanced Annotated Corpus
Radu Ion | Elena Irimia | Dan Ştefănescu | Dan Tufiș
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This article describes the collection, processing and validation of a large balanced corpus for Romanian. The annotation types and structure of the corpus are briefly reviewed. It was constructed at the Research Institute for Artificial Intelligence of the Romanian Academy in the context of an international project (METANET4U). The processing covers tokenization, POS-tagging, lemmatization and chunking. The corpus is in XML format generated by our in-house annotation tools; the corpus encoding schema is XCES compliant and the metadata specification conforms to the METANET recommendations. To the best of our knowledge, this is the first large and richly annotated corpus for Romanian. ROMBAC is intended to be the foundation of a linguistic environment containing a reference corpus for contemporary Romanian and a comprehensive collection of interoperable processing tools.
ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora
Mārcis Pinnis | Radu Ion | Dan Ştefănescu | Fangzhong Su | Inguna Skadiņa | Andrejs Vasiļjevs | Bogdan Babych
Proceedings of the ACL 2012 System Demonstrations
Romanian to English automatic MT experiments at IWSLT12 – system description paper
Ştefan Daniel Dumitrescu | Radu Ion | Dan Ştefănescu | Tiberiu Boroş | Dan Tufiş
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign
The paper presents the system developed by RACAI for the IWSLT 2012 competition, TED task, MT track, Romanian to English translation. We describe the starting baseline phrase-based SMT system, the experiments conducted to adapt the language and translation models and our post-translation cascading system designed to improve the translation without external resources. We further present our attempts at creating a better controlled decoder than the open-source Moses system offers.
Hybrid Parallel Sentence Mining from Comparable Corpora
Dan Ștefănescu | Radu Ion | Sabine Hunsicker
Proceedings of the 16th Annual Conference of the European Association for Machine Translation
2011
Experiments with a Differential Semantics Annotation for WordNet 3.0
Dan Tufiş | Dan Ştefănescu
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)
2010
RACAI: Unsupervised WSD Experiments @ SemEval-2, Task 17
Radu Ion | Dan Ştefănescu
Proceedings of the 5th International Workshop on Semantic Evaluation
A Differential Semantics Approach to the Annotation of Synsets in WordNet
Dan Tufiş | Dan Ştefănescu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We describe a new method for sentiment load annotation of the synsets of a wordnet, along the principles of Osgood's Semantic Differential theory and extending the Kamp and Marx calculus, by taking into account not only the WordNet structure but also the SUMO/MILO (Niles & Pease, 2001) and DOMAINS (Bentivogli et al., 2004) knowledge sources. We discuss the method to annotate all the synsets in PWN2.0, irrespective of their part of speech. As the number of possible factors (semantic oppositions, along which the synsets are ranked) is very large, we also developed an application allowing the text analyst to select the most discriminating factors for the type of text to be analyzed. Once the factors have been selected, the underlying wordnet is marked up on the fly and can be used for the intended textual analysis. We anticipate that these annotations can be imported into wordnets for other languages, provided they are aligned to PWN2.0. The method for the synsets annotation generalizes the usual subjectivity mark-up (positive, negative and objective) according to a user-based multi-criteria differential semantics model.
2008
RACAI’s Linguistic Web Services
Dan Tufiş | Radu Ion | Alexandru Ceauşu | Dan Ştefănescu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Nowadays, there are hundreds of Natural Language Processing applications and resources for different languages that are developed and/or used, with a few notable exceptions, almost exclusively by their creators. Assuming that the right to use a particular application or resource is licensed by the rightful owner, the user is faced with the often not so easy task of interfacing it with his/her own systems. Even if standards are defined that provide a unified way of encoding resources, there are few cases in which the resources are actually coded in conformance to the standard (and, at present, there is no such thing as general NLP application interoperability). The Semantic Web came with the promise that the web would be a universal medium for information exchange, whatever its content. In this context, the present article outlines a collection of linguistic web services for Romanian and English, developed at the Research Institute for AI of the Romanian Academy (RACAI), which provide a standardized way of calling particular NLP operations and extracting the results without caring about what exactly is going on in the background.
A Hybrid Approach to Extracting and Classifying Verb+Noun Constructions
Amalia Todiraşcu | Dan Tufiş | Ulrich Heid | Christopher Gledhill | Dan Ştefanescu | Marion Weller | François Rousselot
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
We present the main findings and preliminary results of an ongoing project aimed at developing a system for collocation extraction based on contextual morpho-syntactic properties. We explored two hybrid extraction methods: the first method applies language-independent statistical techniques followed by linguistic filtering, while the second approach, available only for German, is based on a set of lexico-syntactic patterns to extract collocation candidates. To define extraction and filtering patterns, we studied a specific collocation category, the Verb-Noun constructions, using a model inspired by systemic functional grammar and proposing a three-level analysis: lexical, functional and semantic criteria. From a tagged and lemmatized corpus, we identify some contextual morpho-syntactic properties that help to filter the output of the statistical methods and to extract some potentially interesting VN constructions (complex predicates vs complex predicators). The extracted candidates are validated and classified manually.
2006
Improved Lexical Alignment by Combining Multiple Reified Alignments
Dan Tufiş | Radu Ion | Alexandru Ceauşu | Dan Ştefănescu
11th Conference of the European Chapter of the Association for Computational Linguistics
Aligning Multilingual Thesauri
Dan Ştefănescu | Dan Tufiş
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The aligning and merging of ontologies with overlapping information is currently one of the most active domains of investigation in the Semantic Web community. Multilingual lexical ontologies thesauri are fundamental knowledge sources for most NLP projects addressing multilinguality. The alignment of multilingual lexical knowledge sources has various applications ranging from knowledge acquisition to semantic validation of interlingual equivalence of presumably the same meaning expressed in different languages. In this paper, we present a general method for aligning ontologies, which was used to align a conceptual thesaurus, lexicalized in 20 languages, with a partial version of it lexicalized in Romanian. The objective of our work was to align the existing terms in the Romanian Eurovoc to the terms in the English Eurovoc and to automatically update the Romanian Eurovoc. The general formulation of the ontology alignment problem was set up along the lines established by the Heterogeneity group of the KnowledgeWeb consortium, but the actual case study was motivated by the needs of a specific NLP project.
Acquis Communautaire Sentence Alignment using Support Vector Machines
Alexandru Ceauşu | Dan Ştefănescu | Dan Tufiş
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Sentence alignment is a task that requires not only accuracy, as possible errors can affect further processing, but also low computational cost and language-pair independence. Although many implementations do not use translation equivalents because they are dependent on the language pair, this feature is required to increase accuracy. The paper presents a hybrid sentence aligner that has two alignment iterations. The first iteration is based mostly on sentence length, and the second is based on a translation equivalents table estimated from the results of the first iteration. The aligner uses a Support Vector Machine classifier to discriminate between positive and negative examples of sentence pairs.
2005
Combined Word Alignments
Dan Tufiş | Radu Ion | Alexandru Ceauşu | Dan Ştefănescu
Proceedings of the ACL Workshop on Building and Using Parallel Texts