Abstract
Set covering algorithms are efficient tools for optimally reducing a linguistic corpus. The optimality of such a process is directly related to the descriptive features of the sentences of a reference corpus. This article experimentally examines the behaviour of three algorithms: a greedy approach and a Lagrangian relaxation based one, which give importance to rare events, and a third one that considers the Kullback-Leibler divergence between a reference distribution and the ongoing distribution of events. The analysis of the content of the reduced corpora shows that the first two approaches remain the most effective at compressing a corpus while guaranteeing a minimal content. The variant that minimises the Kullback-Leibler divergence guarantees, as expected, a distribution of events close to the reference distribution; however, the price for this solution is a much larger corpus. In the proposed experiments, we have also evaluated a mixed approach that adds a random complement to the smallest coverings.
- Anthology ID:
- L12-1192
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 969–974
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/381_Paper.pdf
- Cite (ACL):
- Nelly Barbot, Olivier Boeffard, and Arnaud Delhay. 2012. Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 969–974, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora (Barbot et al., LREC 2012)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/381_Paper.pdf
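
To make the greedy set-covering strategy named in the abstract more concrete, here is a minimal sketch. It is not the authors' implementation: it assumes each sentence is described by a set of unit labels that must each be covered at least once, and the function name `greedy_cover`, the toy `corpus`, and its labels are hypothetical. The rare-event weighting, the Lagrangian relaxation, and the Kullback-Leibler variant are not modelled.

```python
def greedy_cover(sentences):
    """Select sentence ids until every unit label is covered at least once.

    sentences: dict mapping a sentence id to the set of unit labels it contains
    (hypothetical representation, chosen for illustration only).
    """
    # Every label that appears somewhere in the corpus must be covered.
    uncovered = set().union(*sentences.values())
    selected = []
    while uncovered:
        # Pick the sentence that covers the most still-uncovered labels;
        # iterating over sorted ids makes tie-breaking deterministic.
        best = max(sorted(sentences), key=lambda s: len(sentences[s] & uncovered))
        selected.append(best)
        uncovered -= sentences[best]
    return selected


# Toy corpus with made-up unit labels.
corpus = {
    "s1": {"a", "b", "c"},
    "s2": {"a", "d"},
    "s3": {"d", "e"},
    "s4": {"b"},
}
cover = greedy_cover(corpus)  # ["s1", "s3"]
```

In this toy case two of the four sentences suffice to cover all five labels. A greedy choice of this kind is a heuristic and does not guarantee the smallest possible covering in general.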