Exploiting Wikipedia as a Knowledge Base for the Extraction of Linguistic Resources: Application on Arabic-French Comparable Corpora and Bilingual Lexicons

Rahma Sellami, Fatiha Sadat, Lamia Hadrich Belguith


Abstract
We present simple and effective methods for extracting comparable corpora and bilingual lexicons from Wikipedia. We exploit the large scale and the structure of Wikipedia articles to extract two resources that are useful for natural language processing applications. We build a comparable corpus from Wikipedia using categories as topic restrictions, and we extract bilingual lexicons from inter-language links, aligned with either a statistical method or a combined statistical and linguistic method.
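As a rough illustration of the lexicon idea in the abstract, the sketch below pairs article titles that are connected by inter-language links and treats each pair as a candidate lexicon entry. It is not the authors' implementation: the function names `clean_title` and `lexicon_from_links` are hypothetical, the title pairs are hand-written toy data rather than real Wikipedia output, and the statistical or linguistic word alignment mentioned in the abstract is not shown.

```python
import re


def clean_title(title):
    """Drop a trailing parenthetical disambiguator, e.g. 'Mercure (planète)' -> 'Mercure'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", title).strip()


def lexicon_from_links(title_pairs):
    """Map each source title to the set of target titles it is linked to."""
    lexicon = {}
    for src, tgt in title_pairs:
        src, tgt = clean_title(src), clean_title(tgt)
        if src and tgt:  # skip pairs that become empty after cleaning
            lexicon.setdefault(src, set()).add(tgt)
    return lexicon


# Toy Arabic-French title pairs standing in for real inter-language links (assumed data).
pairs = [
    ("تونس", "Tunisie"),
    ("لغة", "Langue"),
    ("عطارد", "Mercure (planète)"),
]

lexicon = lexicon_from_links(pairs)
for source, targets in lexicon.items():
    print(source, "->", sorted(targets))
```

Multi-word titles would still need word-level alignment before they yield single-word entries, which is where an alignment step such as the one the abstract mentions would come in.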
Anthology ID:
2012.amta-caas14.10
Volume:
Fourth Workshop on Computational Approaches to Arabic-Script-based Languages
Month:
November 1
Year:
2012
Address:
San Diego, California, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
72–79
URL:
https://aclanthology.org/2012.amta-caas14.10
Cite (ACL):
Rahma Sellami, Fatiha Sadat, and Lamia Hadrich Belguith. 2012. Exploiting Wikipedia as a Knowledge Base for the Extraction of Linguistic Resources: Application on Arabic-French Comparable Corpora and Bilingual Lexicons. In Fourth Workshop on Computational Approaches to Arabic-Script-based Languages, pages 72–79, San Diego, California, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Exploiting Wikipedia as a Knowledge Base for the Extraction of Linguistic Resources: Application on Arabic-French Comparable Corpora and Bilingual Lexicons (Sellami et al., AMTA 2012)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2012.amta-caas14.10.pdf