Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level

Iqra Ali, Hidetaka Kamigaito, Taro Watanabe


Abstract
Paraphrase detection is the task of identifying whether two sentences are semantically similar. It plays an important role in maintaining the integrity of written work, for example in plagiarism detection and text reuse detection. Previous research has focused on developing large corpora for English. However, no research has been conducted on sentence-level paraphrase detection in the low-resource Pashto language. To bridge this gap, we introduce the first fully manually annotated Pashto sentential paraphrase detection corpus, collected from authentic cases in journalism and covering 10 different domains, including Sports, Health, and Environment. Our proposed corpus contains 6,727 sentences, of which 3,687 are paraphrased and 3,040 are non-paraphrased. Experimental findings reveal that our proposed corpus is sufficient to train XLM-RoBERTa to detect paraphrased sentence pairs in Pashto with an F1 score of 84%. To compare our corpus with those in other languages, we also applied our fine-tuned model to Indonesian and English paraphrase datasets in a zero-shot manner, achieving F1 scores of 82% and 78%, respectively. This result indicates that the quality of our corpus is at least comparable to that of commonly used datasets. It is a pioneering contribution to the field. We will publicly release a subset of 1,800 instances from our corpus, free from any licensing issues.
Anthology ID:
2024.lrec-main.1011
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
11574–11581
URL:
https://aclanthology.org/2024.lrec-main.1011
Cite (ACL):
Iqra Ali, Hidetaka Kamigaito, and Taro Watanabe. 2024. Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 11574–11581, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level (Ali et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-3/2024.lrec-main.1011.pdf
Optional supplementary material:
 2024.lrec-main.1011.OptionalSupplementaryMaterial.zip