Abstract
Although current transcription systems can achieve high recognition performance, they still struggle to transcribe speech in very noisy environments. Transcription quality has a direct impact on classification tasks that use text features. In this paper, we propose to identify the themes of telephone conversation services with the classical Term Frequency-Inverse Document Frequency method using the Gini purity criterion (TF-IDF-Gini) and with a Latent Dirichlet Allocation (LDA) approach. Both approaches are coupled with a Support Vector Machine (SVM) classifier to solve the theme identification problem. Results show the effectiveness of the proposed LDA-based method compared to the classical TF-IDF-Gini approach in the context of highly imperfect automatic transcriptions. Finally, we discuss the impact of discriminative and non-discriminative words extracted by both methods in terms of transcription accuracy.
- Anthology ID:
- L14-1621
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 1309–1314
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/8_Paper.pdf
- Cite (ACL):
- Mohamed Morchid, Richard Dufour, and Georges Linarès. 2014. A LDA-Based Topic Classification Approach From Highly Imperfect Automatic Transcriptions. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1309–1314, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- A LDA-Based Topic Classification Approach From Highly Imperfect Automatic Transcriptions (Morchid et al., LREC 2014)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/8_Paper.pdf