Arnaud Delhay


EMO&LY (EMOtion and AnomaLY) : A new corpus for anomaly detection in an audiovisual stream with emotional context.
Cédric Fayet | Arnaud Delhay | Damien Lolive | Pierre-François Marteau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


Large Linguistic Corpus Reduction with SCP Algorithms
Nelly Barbot | Olivier Boëffard | Jonathan Chevelu | Arnaud Delhay
Computational Linguistics, Volume 41, Issue 3 - September 2015


Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora
Nelly Barbot | Olivier Boeffard | Arnaud Delhay
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Set covering algorithms are efficient tools for optimal linguistic corpus reduction. The optimality of such a process is directly related to the descriptive features of the sentences of a reference corpus. This article experimentally examines the behaviour of three algorithms: a greedy approach and a Lagrangian relaxation based one giving importance to rare events, and a third one considering the Kullback-Leibler divergence between a reference and the ongoing distribution of events. The analysis of the content of the reduced corpora shows that the first two approaches remain the most effective at compressing a corpus while guaranteeing a minimal content. The variant which minimises the Kullback-Leibler divergence guarantees a distribution of events close to a reference distribution, as expected; however, the price for this solution is a much larger corpus. In the proposed experiments, we have also evaluated a mixed approach considering a random complement to the smallest coverings.
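
As an illustration of the divergence-driven variant mentioned in this abstract, the following is a minimal sketch, not the authors' implementation. It assumes the divergence is computed as D(reference || ongoing), that the ongoing distribution is smoothed with a small epsilon, and that one sentence is added per step; the names `kl_divergence`, `normalise` and `pick_next` and the toy data are hypothetical.

```python
import math
from collections import Counter


def kl_divergence(p, q, eps=1e-9):
    """D(p || q) summed over the events of p; q is floored at eps (assumption)."""
    return sum(
        pv * math.log(pv / max(q.get(event, 0.0), eps))
        for event, pv in p.items()
        if pv > 0
    )


def normalise(counts):
    """Turn event counts into a probability distribution."""
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()} if total else {}


def pick_next(selected_counts, candidates, reference):
    """Return the candidate sentence whose addition brings the ongoing
    distribution closest to the reference, with the resulting divergence."""
    best_id, best_kl = None, math.inf
    for sid, events in candidates.items():
        trial = selected_counts + Counter(events)
        kl = kl_divergence(reference, normalise(trial))
        if kl < best_kl:
            best_id, best_kl = sid, kl
    return best_id, best_kl


# Toy data: the current selection contains only event "a".
reference = {"a": 0.5, "b": 0.5}
selected = Counter({"a": 2})
candidates = {"s1": ["a", "a"], "s2": ["b", "b"]}
best_id, best_kl = pick_next(selected, candidates, reference)
```

On this toy data the sketch prefers the sentence that rebalances the selection towards the reference. A criterion of this kind rewards matching frequencies rather than minimal size, which is consistent with the abstract's remark that this variant yields a larger corpus.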


Comparing Set-Covering Strategies for Optimal Corpus Design
Jonathan Chevelu | Nelly Barbot | Olivier Boeffard | Arnaud Delhay
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This article addresses the problem of the linguistic content of a speech corpus. Depending on the target task, the phonological and linguistic content of the corpus is controlled by collecting a set of sentences which covers a preset description of phonological attributes under the constraint of an overall duration as small as possible. This goal is classically achieved by greedy algorithms, which however do not guarantee the optimality of the desired cover. In recent work, a Lagrangian-based algorithm, called LamSCP, has been used to extract coverings of diphonemes from a large corpus in French, giving better results than a greedy algorithm. We propose to continue comparing both algorithms in terms of shortest duration, stability and robustness by computing multi-represented diphoneme or triphoneme coverings. These coverings, extracted from a corpus in English, correspond to very large scale optimization problems. For each experiment, LamSCP improves on the greedy results by 3.9 to 9.7 percent.
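
The greedy baseline described in this abstract can be sketched as follows. This is a minimal illustration, assuming each unit needs to be covered only once (the paper considers multi-represented coverings) and that sentences are scored by newly covered units per second of duration; the function name `greedy_cover` and the toy sentences are hypothetical, and LamSCP itself is not reproduced here.

```python
def greedy_cover(sentences, required):
    """Greedy set covering under a duration cost.

    sentences: dict mapping sentence id -> (duration, set of units)
    required:  set of units (e.g. diphonemes) that must be covered
    Returns the list of selected sentence ids, in selection order.
    """
    uncovered = set(required)
    chosen = []
    while uncovered:
        best_id, best_score = None, 0.0
        for sid, (duration, units) in sentences.items():
            if sid in chosen:
                continue
            # Score: newly covered units per unit of duration (assumption).
            score = len(units & uncovered) / duration
            if score > best_score:
                best_id, best_score = sid, score
        if best_id is None:
            raise ValueError("some required units cannot be covered")
        chosen.append(best_id)
        uncovered -= sentences[best_id][1]
    return chosen


# Toy corpus: durations in seconds, units written as diphoneme-like labels.
sentences = {
    "s1": (4.0, {"a-b", "b-c", "c-d"}),
    "s2": (0.8, {"a-b"}),
    "s3": (2.0, {"b-c", "c-d"}),
    "s4": (2.5, {"a-b", "c-d"}),
}
required = {"a-b", "b-c", "c-d"}
cover = greedy_cover(sentences, required)
total_duration = sum(sentences[sid][0] for sid in cover)
```

Such a procedure always returns a valid cover when one exists, but, as the abstract notes, it offers no guarantee that the total duration is minimal.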