Abstract
State-of-the-art statistical machine translation (SMT) techniques require good-quality parallel data to build a translation model. The availability of large parallel corpora has increased rapidly over the past decade. However, these newly developed parallel data often contain significant noise. In this paper, we describe our approach for separating good-quality parallel sentence pairs from noisy parallel data. We use 10 different features within a Support Vector Machine (SVM)-based model for our classification task. We report reasonably good classification accuracy and its positive effect on overall MT accuracy.
- Anthology ID:
- L14-1248
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 41–45
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/272_Paper.pdf
- Cite (ACL):
- Sandipan Dandapat and Declan Groves. 2014. MTWatch: A Tool for the Analysis of Noisy Parallel Data. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 41–45, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- MTWatch: A Tool for the Analysis of Noisy Parallel Data (Dandapat & Groves, LREC 2014)