Improving in-domain data selection for small in-domain sets

Mohammed Mediani, Joshua Winebarger, Alexander Waibel


Abstract
Finding sufficient in-domain text data for language modeling is a recurrent challenge. Several methods have been proposed for selecting the parts of out-of-domain text data that most closely resemble the in-domain data, using a small amount of the latter. Including this new “near-domain” data in training can potentially lead to better language model performance, while reducing training resources relative to incorporating all data. One popular, state-of-the-art selection process based on cross-entropy scores makes use of in-domain and out-of-domain language models. In order to compensate for the limited availability of the in-domain data required for this method, we introduce enhancements to two of its steps. Firstly, we improve the procedure for drawing the out-of-domain sample data used for selection. Secondly, we use word associations to extend the underlying vocabulary of the sample language models used for scoring. These enhancements are applied to selecting text for language modeling of talks given in a technical subject area. Besides comparing perplexity, we judge the resulting language models by their performance in automatic speech recognition and machine translation tasks. We evaluate our method in different contexts and show that it yields consistent improvements, up to 2% absolute reduction in word error rate and 0.3 BLEU points. We achieve these improvements even given a much smaller in-domain set.
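The cross-entropy-based selection that the abstract builds on can be illustrated with a minimal sketch. It assumes the common cross-entropy difference formulation, in which each candidate sentence is scored by its cross-entropy under an in-domain model minus its cross-entropy under a model trained on an out-of-domain sample, and low-scoring sentences are kept. The add-one smoothed unigram models, the function names (`train_unigram`, `cross_entropy`, `select`), and the `threshold` parameter are all hypothetical simplifications for illustration; the sketch does not implement the paper's sampling or vocabulary-extension enhancements.

```python
import math
from collections import Counter

UNK = "<unk>"

def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary plus <unk>.
    (A toy stand-in for the n-gram models normally used; an assumption.)"""
    counts = Counter(w if w in vocab else UNK
                     for s in sentences for w in s.split())
    total = sum(counts.values())
    size = len(vocab) + 1  # +1 for the <unk> token
    return lambda w: (counts[w if w in vocab else UNK] + 1) / (total + size)

def cross_entropy(sentence, prob):
    """Per-word cross-entropy (in bits) of a sentence under a model."""
    words = sentence.split()
    return -sum(math.log2(prob(w)) for w in words) / max(len(words), 1)

def select(in_domain, out_sample, pool, threshold=0.0):
    """Keep pool sentences whose score H_in(s) - H_out(s) is below threshold,
    ordered from most to least in-domain-like."""
    vocab = {w for s in in_domain + out_sample for w in s.split()}
    p_in = train_unigram(in_domain, vocab)
    p_out = train_unigram(out_sample, vocab)
    scored = [(cross_entropy(s, p_in) - cross_entropy(s, p_out), s)
              for s in pool]
    return [s for score, s in sorted(scored) if score < threshold]

in_domain = ["the speech recognizer decodes audio",
             "the language model scores words"]
out_sample = ["the cat sat on the mat", "dogs chase the cat"]
pool = ["the language model decodes speech", "the cat chased dogs"]
selected = select(in_domain, out_sample, pool)
```

In this toy setting only the first pool sentence is selected, since it is more probable under the in-domain model than under the out-of-domain sample model. Both models share one vocabulary so that their cross-entropies are comparable.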
Anthology ID:
2014.iwslt-papers.14
Volume:
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers
Month:
December 4-5
Year:
2014
Address:
Lake Tahoe, California
Venue:
IWSLT
SIG:
SIGSLT
Pages:
249–256
URL:
https://aclanthology.org/2014.iwslt-papers.14
Cite (ACL):
Mohammed Mediani, Joshua Winebarger, and Alexander Waibel. 2014. Improving in-domain data selection for small in-domain sets. In Proceedings of the 11th International Workshop on Spoken Language Translation: Papers, pages 249–256, Lake Tahoe, California.
Cite (Informal):
Improving in-domain data selection for small in-domain sets (Mediani et al., IWSLT 2014)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2014.iwslt-papers.14.pdf