Patrick Wambacq

2018

A Comparison of Different Punctuation Prediction Approaches in a Translation Context
Vincent Vandeghinste | Lyan Verwimp | Joris Pelemans | Patrick Wambacq
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We test a series of techniques to predict punctuation and its effect on machine translation (MT) quality. Several techniques for punctuation prediction are compared: language modeling techniques, such as n-grams and long short-term memories (LSTMs), sequence labeling LSTMs (unidirectional and bidirectional), and monolingual phrase-based, hierarchical and neural MT. For actual translation, phrase-based, hierarchical and neural MT are investigated. We observe that for punctuation prediction, phrase-based statistical MT and neural MT reach similar results, and are best used as a preprocessing step which is followed by neural MT to perform the actual translation. Implicit punctuation insertion by a dedicated neural MT system, trained on unpunctuated source and punctuated target, yields similar results.
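The sequence-labeling formulation mentioned in this abstract can be illustrated with a minimal data-preparation sketch. It assumes each word is labeled with the punctuation mark that follows it (or a "no punctuation" label); the label set, the function names, and the tie-breaking rule are hypothetical choices for illustration, not details taken from the paper.

```python
# Hypothetical label set; the paper's actual punctuation inventory is not given here.
PUNCT = {",", ".", "?", "!", ";", ":"}

def to_labeled_sequence(tokens, none_label="O"):
    """Strip punctuation tokens and label each word with the mark that followed it."""
    words, labels = [], []
    for tok in tokens:
        if tok in PUNCT:
            # Punctuation before any word is dropped; if several marks
            # follow one word, the last one wins (an assumption of this sketch).
            if labels:
                labels[-1] = tok
        else:
            words.append(tok)
            labels.append(none_label)
    return words, labels

def restore(words, labels, none_label="O"):
    """Re-insert predicted punctuation after each word."""
    out = []
    for word, label in zip(words, labels):
        out.append(word)
        if label != none_label:
            out.append(label)
    return out

tokens = "yes , we can . can we ?".split()
words, labels = to_labeled_sequence(tokens)
print(words)
print(labels)
```

A sequence labeler would be trained to predict `labels` from `words`; `restore` then produces punctuated text that can be passed on to a translation system.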

We present the highlights of the now-finished 4-year SCATE project. It was completed in February 2018 and funded by the Flemish Government IWT-SBO, project No. 130041.

State Gradients for RNN Memory Analysis
Lyan Verwimp | Hugo Van hamme | Vincent Renkens | Patrick Wambacq
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We present a framework for analyzing what the state in RNNs remembers from its input embeddings. We compute the gradients of the states with respect to the input embeddings and decompose the gradient matrix with Singular Value Decomposition to analyze which directions in the embedding space are best transferred to the hidden state space, characterized by the largest singular values. We apply our approach to LSTM language models and investigate to what extent and for how long certain classes of words are remembered on average for a certain corpus. Additionally, the extent to which a specific property or relationship is remembered by the RNN can be tracked by comparing a vector characterizing that property with the direction(s) in embedding space that are best preserved in hidden state space.
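A minimal sketch of the analysis described in this abstract, under simplifying assumptions: a plain tanh RNN stands in for the LSTM, the Jacobian of a later state with respect to an earlier input embedding is accumulated analytically, and only the top singular direction is estimated by power iteration rather than a full SVD. All function names and dimensions are hypothetical and not taken from the authors' code.

```python
import math
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def final_state(W, U, inputs):
    """Run h_s = tanh(W x_s + U h_{s-1}) from a zero state; return the last h."""
    h = [0.0] * len(W)
    for x in inputs:
        h = [math.tanh(a + b) for a, b in zip(matvec(W, x), matvec(U, h))]
    return h

def state_jacobian(W, U, inputs, t):
    """Jacobian of the final state with respect to the input embedding at step t."""
    hidden = len(W)
    h = [0.0] * hidden
    jac = None
    for s, x in enumerate(inputs):
        h = [math.tanh(a + b) for a, b in zip(matvec(W, x), matvec(U, h))]
        d = [1.0 - v * v for v in h]  # derivative of tanh at each unit
        if s == t:
            # d h_t / d x_t = diag(d) W
            jac = [[d[i] * w for w in W[i]] for i in range(hidden)]
        elif s > t:
            # chain rule: d h_s / d x_t = diag(d) U (d h_{s-1} / d x_t)
            prev = matmul(U, jac)
            jac = [[d[i] * v for v in prev[i]] for i in range(hidden)]
    return jac

def top_singular(jac, iters=200, seed=0):
    """Estimate the largest singular value and its right singular vector
    (a direction in embedding space) by power iteration on J^T J."""
    rng = random.Random(seed)
    jt = [list(col) for col in zip(*jac)]
    v = [rng.uniform(-1.0, 1.0) for _ in jac[0]]
    for _ in range(iters):
        v = matvec(jt, matvec(jac, v))
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    sigma = math.sqrt(sum(c * c for c in matvec(jac, v)))
    return sigma, v
```

Calling `state_jacobian` with increasing distance between `t` and the end of the sequence, and comparing the resulting singular values, gives a rough picture of how quickly information about an input fades from the state in this toy setting.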

TF-LM: TensorFlow-based Language Modeling Toolkit
Lyan Verwimp | Hugo Van hamme | Patrick Wambacq
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Character-Word LSTM Language Models
Lyan Verwimp | Joris Pelemans | Hugo Van hamme | Patrick Wambacq
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We present a Character-Word Long Short-Term Memory Language Model which both reduces the perplexity with respect to a baseline word-level language model and reduces the number of parameters of the model. Character information can reveal structural (dis)similarities between words and can even be used when a word is out-of-vocabulary, thus improving the modeling of infrequent and unknown words. By concatenating word and character embeddings, we achieve up to 2.77% relative improvement on English compared to a baseline model with a similar number of parameters and 4.57% on Dutch. Moreover, we also outperform baseline word-level models with a larger number of parameters.
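The concatenation of word and character embeddings described in this abstract can be sketched as follows. This is an illustration under assumed choices: randomly initialized lookup tables, a fixed number of character slots per word, and `<pad>`/`<unk>` entries for short or unknown inputs. The helper names and dimensions are hypothetical and not taken from the paper.

```python
import random

def make_table(keys, dim, rng):
    """Build a toy embedding table: one random vector of size `dim` per key."""
    return {k: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for k in keys}

def char_word_input(word, word_table, char_table, n_chars,
                    pad="<pad>", unk="<unk>"):
    """Concatenate the word embedding with the embeddings of its first
    `n_chars` characters, padding words that are shorter than that."""
    word_vec = word_table.get(word, word_table[unk])
    chars = list(word[:n_chars]) + [pad] * max(0, n_chars - len(word))
    char_vecs = [char_table.get(c, char_table[pad]) for c in chars]
    return word_vec + [v for vec in char_vecs for v in vec]

rng = random.Random(0)
word_table = make_table(["the", "cat", "<unk>"], 6, rng)
char_table = make_table(list("abcdefghijklmnopqrstuvwxyz") + ["<pad>"], 2, rng)

vec = char_word_input("cats", word_table, char_table, n_chars=3)
print(len(vec))  # word dimension plus n_chars times character dimension
```

Here "cats" is out of vocabulary and falls back to the `<unk>` word vector, yet its character slots still carry the embeddings of "c", "a" and "t", which is the kind of sub-word information the abstract refers to.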

2016

SCALE: A Scalable Language Engineering Toolkit
Joris Pelemans | Lyan Verwimp | Kris Demuynck | Hugo Van hamme | Patrick Wambacq
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present SCALE, a new Python toolkit that contains two extensions to n-gram language models. The first extension is a novel technique to model compound words called Semantic Head Mapping (SHM). The second extension, Bag-of-Words Language Modeling (BagLM), bundles popular models such as Latent Semantic Analysis and Continuous Skip-grams. Both extensions scale to large data and can be integrated into first-pass ASR decoding. The toolkit is open source, includes working examples and can be found at http://github.com/jorispelemans/scale.

2015

2014

Speech Recognition Web Services for Dutch
Joris Pelemans | Kris Demuynck | Hugo Van hamme | Patrick Wambacq
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present three applications in the domain of Automatic Speech Recognition for Dutch, all of which are developed using our in-house speech recognition toolkit SPRAAK. The speech-to-text transcriber is a large vocabulary continuous speech recognizer, optimized for Southern Dutch. It can select components and adjust parameters on the fly, based on the observed conditions in the audio, and was recently extended with the capability of adding new words to the lexicon. The grapheme-to-phoneme converter generates possible pronunciations for Dutch words, based on lexicon lookup and linguistic rules. The speech-text alignment system takes audio and text as input and constructs a time-aligned output in which every word receives exact begin and end times. All three applications (and others) are freely available, after registration, as a web application at http://www.spraak.org/webservice/ and can also be accessed as a web service from automated tools.

2004

Automatic Phonemic Labeling and Segmentation of Spoken Dutch
Kris Demuynck | Tom Laureys | Patrick Wambacq | Dirk Van Compernolle
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

The CGN corpus (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of the phonemic annotations and the corresponding segmentations. First, we detail the processes used to generate possible pronunciations for each sentence and to select the most likely one. Next, we identify the remaining difficulties when handling the CGN data and explain how we solved them. We conclude with an evaluation of the quality of the resulting transcriptions and segmentations.

2002

An Improved Algorithm for the Automatic Segmentation of Speech Corpora
Tom Laureys | Kris Demuynck | Jacques Duchateau | Patrick Wambacq
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)