Abstract
This paper presents a large corpus created from the original Quranic text, in which semantically similar or related verses are linked together. The corpus is intended as an evaluation resource for computational linguists investigating similarity and relatedness in short texts, and it can also be used to evaluate paraphrase analysis and machine translation tasks. The dataset is characterised by: (1) the quality of its relatedness assignments: because it incorporates relations marked by well-known domain experts, it could be considered a gold standard corpus for various evaluation tasks; and (2) its size: over 7,600 pairs of related verses were collected from scholarly sources, with several levels of degree of relatedness. The dataset could be extended to over 13,500 pairs of related verses by applying the commutative property to strongly related pairs. It has been incorporated into online query pages where users can visualise, for a given verse, a network of all directly and indirectly related verses. Empirical experiments showed that only 33% of related pairs shared root words, which emphasises the need to go beyond common lexical matching methods and to incorporate semantic, domain-knowledge, and other corpus-based approaches as well.
- Anthology ID:
- L12-1051
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 2295–2302
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/190_Paper.pdf
- Cite (ACL):
- Abdul-Baquee Sharaf and Eric Atwell. 2012. QurSim: A corpus for evaluation of relatedness in short texts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2295–2302, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- QurSim: A corpus for evaluation of relatedness in short texts (Sharaf & Atwell, LREC 2012)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/190_Paper.pdf
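
The abstract mentions two operations on the verse pairs: extending the dataset by applying the commutative property to strongly related pairs, and showing a network of directly and indirectly related verses for a given verse. The following is a minimal sketch of both ideas, not the authors' implementation. The verse identifiers (`v1` to `v4`), the tuple layout `(source, target, strongly_related)`, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical sample data: (source_verse, target_verse, strongly_related).
# These identifiers are placeholders, not actual entries from the corpus.
pairs = [
    ("v1", "v2", True),
    ("v2", "v3", True),
    ("v1", "v4", False),
]


def symmetric_extension(pairs):
    """Add the reversed pair for every strongly related pair (no duplicates)."""
    seen = {(a, b) for a, b, _ in pairs}
    extended = list(pairs)
    for a, b, strong in pairs:
        if strong and (b, a) not in seen:
            extended.append((b, a, True))
            seen.add((b, a))
    return extended


def related_network(pairs, start):
    """Breadth-first search over the directed pairs.

    Returns a dict mapping each reachable verse to its hop distance from
    `start`: distance 1 means directly related, more than 1 means indirectly.
    """
    graph = defaultdict(set)
    for a, b, _ in pairs:
        graph[a].add(b)

    dist = {start: 0}
    queue = deque([start])
    while queue:
        verse = queue.popleft()
        for neighbour in graph.get(verse, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[verse] + 1
                queue.append(neighbour)

    del dist[start]  # report only the related verses, not the query itself
    return dist


extended = symmetric_extension(pairs)
print(len(pairs), "->", len(extended))
print(related_network(extended, "v3"))
```

In this toy example only the two strong pairs gain a reversed counterpart, so the weaker `v1`/`v4` link stays one-directional. This mirrors why the extended dataset described in the abstract is smaller than twice the original size.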