Ammar Alsalka
2020
Constructing a Bilingual Hadith Corpus Using a Segmentation Tool
Shatha Altammami | Eric Atwell | Ammar Alsalka
Proceedings of the Twelfth Language Resources and Evaluation Conference
This article describes the process of gathering and constructing a bilingual parallel corpus of Islamic Hadith, the set of narratives reporting different aspects of the Prophet Muhammad’s life. The corpus data is gathered from the six canonical Hadith collections using a custom segmentation tool that automatically segments and annotates the two Hadith components with 92% accuracy. This Hadith segmenter minimises the cost of language resource creation and produces consistent results, independent of the prior knowledge and experience that usually influence human annotators. The corpus includes more than 10M tokens and will be freely available via the LREC repository.
2019
Text Segmentation Using N-grams to Annotate Hadith Corpus
Shatha Altammami | Eric Atwell | Ammar Alsalka
Proceedings of the 3rd Workshop on Arabic Corpus Linguistics