Abstract
Preprocessing is a preliminary step in many fields, including IR and NLP. The effect of basic preprocessing settings on text summarization is well studied for English. However, to the best of our knowledge, no such effort exists for the Urdu language. In this study, we analyze the effect of basic preprocessing settings on single-document text summarization for Urdu through various experiments on a benchmark corpus. The analysis uses state-of-the-art algorithms for extractive summarization and examines the effect of stopword removal, lemmatization, and stemming. The results show that these preprocessing settings improve the summarization results.
- Anthology ID:
- L16-1585
- Volume:
- Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
- Month:
- May
- Year:
- 2016
- Address:
- Portorož, Slovenia
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 3686–3693
- URL:
- https://aclanthology.org/L16-1585
- Cite (ACL):
- Muhammad Humayoun and Hwanjo Yu. 2016. Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3686–3693, Portorož, Slovenia. European Language Resources Association (ELRA).
- Cite (Informal):
- Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization (Humayoun & Yu, LREC 2016)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/L16-1585.pdf
- Code
- humsha/USCorpus
- Data
- CC-News