Ştefan Daniel Dumitrescu

Also published as: Stefan Daniel Dumitrescu, Ștefan Daniel Dumitrescu, Ștefan Dumitrescu

2020

Introducing RONEC - the Romanian Named Entity Corpus
Stefan Daniel Dumitrescu | Andrei-Marius Avram
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present RONEC - the Named Entity Corpus for the Romanian language. The corpus contains over 26,000 entities, belonging to 16 distinct classes, in about 5,000 annotated sentences. The sentences have been extracted from a copyright-free newspaper and cover several styles. This corpus represents the first initiative in the Romanian language space specifically targeted at named entity recognition. It is available in BRAT and CoNLL-U Plus formats, and it is free to use and extend at github.com/dumitrescustefan/ronec

2018

NLP-Cube: End-to-End Raw Text Processing With Neural Networks
Tiberiu Boros | Stefan Daniel Dumitrescu | Ruxandra Burtica
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We introduce NLP-Cube: an end-to-end Natural Language Processing framework, evaluated in CoNLL’s “Multilingual Parsing from Raw Text to Universal Dependencies 2018” Shared Task. It performs sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing. Based entirely on recurrent neural networks and written in Python, this ready-to-use open-source system is freely available on GitHub. For each task we describe and discuss its specific network architecture, closing with an overview of the results obtained in the competition.

Attention-free encoder decoder for morphological processing
Stefan Daniel Dumitrescu | Tiberiu Boros
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

2017

Fast and Accurate Decision Trees for Natural Language Processing Tasks
Tiberiu Boros | Stefan Daniel Dumitrescu | Sonia Pipa
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Decision trees have previously been employed in many machine-learning tasks such as part-of-speech tagging, lemmatization, morphological-attribute resolution, letter-to-sound conversion and statistical-parametric speech synthesis. In this paper we introduce an optimized tree-computation algorithm, which is based on the original ID3 algorithm. We also introduce a tree-pruning method that uses a development set to delete nodes from over-fitted models. The latter algorithm also uses a result-caching method for speed-up. Our algorithm is almost 200 times faster than a naive implementation and yields accurate results on our test datasets.
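
For readers unfamiliar with ID3, the split criterion it builds on can be sketched as below. This is a minimal illustration of textbook information gain only, not the optimized tree-computation algorithm, the pruning method or the caching scheme described in the paper. The function names, the toy features (`suffix`, `capitalized`) and the labels are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the examples on one feature."""
    # Group the labels by the value each example has for this feature.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the resulting partitions.
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """Textbook ID3 step: pick the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: predict a coarse tag from two surface features.
rows = [
    {"suffix": "ing", "capitalized": False},
    {"suffix": "ing", "capitalized": True},
    {"suffix": "ed", "capitalized": False},
    {"suffix": "ed", "capitalized": True},
]
labels = ["VERB", "VERB", "ADJ", "ADJ"]

chosen = best_feature(rows, labels, ["suffix", "capitalized"])
print(chosen)
```

In this toy set the `suffix` feature separates the two classes perfectly while `capitalized` carries no information, so ID3 would split on `suffix` first. A naive implementation recomputes these counts at every node, which is the kind of cost an optimized variant would aim to reduce.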

CASSANDRA: A multipurpose configurable voice-enabled human-computer-interface
Tiberiu Boros | Stefan Daniel Dumitrescu | Sonia Pipa
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

Voice-enabled human-computer interfaces (HCI) that integrate automatic speech recognition, text-to-speech synthesis and natural language understanding have become a commodity, brought about by the spread of smartphones and other gadgets into our daily lives. Smart assistants are able to respond to simple queries (similar to text-based question-answering systems), perform simple tasks (call a number, reject a call, etc.) and help organize appointments. In this paper we introduce a newly created process automation platform that enables the user to control applications and home appliances and to query the system for information using a natural voice interface. We offer an overview of the technologies that enabled us to construct our system and we present different usage scenarios in home and office environments.

RACAI’s Natural Language Processing pipeline for Universal Dependencies
Stefan Daniel Dumitrescu | Tiberiu Boros | Dan Tufis
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper presents RACAI’s approach, experiments and results at the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. We handle raw text and we cover tokenization, sentence splitting, word segmentation, tagging, lemmatization and parsing. All results are reported under strict training, development and testing conditions, in which the corpora provided for the shared task are used “as is”, without any modifications to the composition of the train and development sets.

2016

RACAI Entry for the IWSLT 2016 Shared Task
Sonia Pipa | Alin Florentin Vasile | Ioana Ionașcu | Stefan Daniel Dumitrescu | Tiberiu Boros
Proceedings of the 13th International Conference on Spoken Language Translation

Spoken Language Translation is currently a hot topic in the research community. This task is very complex, involving automatic speech recognition, text normalization and machine translation. We present our speech translation system, which was compared against the other systems participating in the IWSLT 2016 Shared Task. We introduce our ASR system for English and our MT system for the English to French (En-Fr) and English to German (En-De) language pairs. Additionally, for the English to French challenge we introduce a methodology that enhances statistical phrase-based translation with translation equivalents deduced from monolingual corpora using neural word embeddings.

The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language
Dan Tufiș | Verginica Barbu Mititelu | Elena Irimia | Ștefan Daniel Dumitrescu | Tiberiu Boroș
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The article describes the current status of a large national project, CoRoLa, aiming at building a reference corpus for the contemporary Romanian language. Unlike many other national corpora, CoRoLa contains only IPR-cleared texts and speech data, obtained from some of the country’s most representative publishing houses, broadcasting agencies, editorial offices, newspapers and popular bloggers. The targets are 500 million tokens for the written component and 300 hours of recordings for the oral one. Texts are selected according to their functional style, domain and subdomain, also with an eye to international practice. A metadata file (following the CMDI model) is associated with each text file. Collected texts are cleaned and converted into a format compatible with the tools for automatic processing (segmentation, tokenization, lemmatization, part-of-speech tagging). The paper also presents up-to-date statistics about the structure of the corpus almost two years before its official launch. The corpus will be freely available for searching. Users will be able to download the results of their searches, as well as the original files where this does not conflict with the stipulations in the protocols we have with text providers.

2014

RSS-TOBI - A Prosodically Enhanced Romanian Speech Corpus
Tiberiu Boroș | Adriana Stan | Oliver Watts | Stefan Daniel Dumitrescu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper introduces a recent development of a Romanian speech corpus to include prosodic annotations of the speech data in the form of ToBI labels. We describe the methodology for determining the required pitch patterns that are common in the Romanian language, annotate the speech resource, and then provide a comparison of two text-to-speech synthesis systems to establish the benefits of adding this type of information to our speech resource. The result is a publicly available speech dataset which can be used to further develop speech synthesis systems or to automatically learn the prediction of ToBI labels from text in the Romanian language.

News about the Romanian Wordnet
Verginica Barbu Mititelu | Ștefan Daniel Dumitrescu | Dan Tufiș
Proceedings of the Seventh Global Wordnet Conference

RACAI GEC – A hybrid approach to Grammatical Error Correction
Tiberiu Boroș | Stefan Daniel Dumitrescu | Adrian Zafiu | Verginica Barbu Mititelu | Ionut Paul Văduva
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task

2013

Wikipedia as an SMT Training Corpus
Dan Tufiș | Radu Ion | Ștefan Dumitrescu | Dan Ștefănescu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

Cascaded Phrase-Based Statistical Machine Translation Systems
Dan Tufiş | Ștefan Daniel Dumitrescu
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

Romanian to English automatic MT experiments at IWSLT12 – system description paper
Ştefan Daniel Dumitrescu | Radu Ion | Dan Ştefănescu | Tiberiu Boroş | Dan Tufiş
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

The paper presents the system developed by RACAI for the IWSLT 2012 competition, TED task, MT track, Romanian to English translation. We describe the starting baseline phrase-based SMT system, the experiments conducted to adapt the language and translation models, and our post-translation cascading system designed to improve the translation without external resources. We further present our attempts at creating a better-controlled decoder than the open-source Moses system offers.