Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora

Nelly Barbot, Olivier Boeffard, Arnaud Delhay


Abstract
Set covering algorithms are efficient tools for computing an optimal reduction of a linguistic corpus. The optimality of such a process is directly related to the descriptive features of the sentences of a reference corpus. This article experimentally examines the behaviour of three algorithms: a greedy approach and a Lagrangian relaxation based one giving importance to rare events, and a third one considering the Kullback-Leibler divergence between a reference distribution and the ongoing distribution of events. The analysis of the content of the reduced corpora shows that the first two approaches remain the most effective for compressing a corpus while guaranteeing a minimal content. The variant that minimises the Kullback-Leibler divergence yields, as expected, a distribution of events close to the reference distribution; however, the price for this solution is a much larger corpus. In the proposed experiments, we have also evaluated a mixed approach that adds a random complement to the smallest coverings.
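To illustrate the kind of corpus reduction the abstract refers to, here is a minimal sketch of the textbook greedy set-covering heuristic. It is not the authors' implementation: the function name `greedy_cover`, the toy corpus, and the plain "most uncovered units first" scoring are assumptions made for illustration, and the rare-event weighting mentioned in the abstract is not modelled.

```python
def greedy_cover(sentences, required=None):
    """Pick sentences until every required unit is covered.

    sentences: dict mapping a sentence id to the set of units it contains
               (hypothetical representation, e.g. phoneme or diphone labels).
    required:  units that must be covered; defaults to all units in the corpus.
    Returns the list of selected sentence ids, in selection order.
    """
    if required is None:
        required = set().union(*sentences.values())
    uncovered = set(required)
    chosen = []
    while uncovered:
        # Select the sentence covering the most still-uncovered units.
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        gain = sentences[best] & uncovered
        if not gain:
            raise ValueError("some required units appear in no sentence")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy corpus: four sentences described by the units they contain.
corpus = {
    "s1": {"a", "b"},
    "s2": {"b", "c", "d"},
    "s3": {"a"},
    "s4": {"d", "e"},
}
selection = greedy_cover(corpus)
```

On this toy corpus the sketch keeps three of the four sentences while still covering every unit; a real system would replace the simple count with a score that favours rare units, as the abstract describes.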
Anthology ID:
L12-1192
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
969–974
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/381_Paper.pdf
Cite (ACL):
Nelly Barbot, Olivier Boeffard, and Arnaud Delhay. 2012. Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 969–974, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora (Barbot et al., LREC 2012)
PDF:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/381_Paper.pdf