2010
Cooperation for Arabic Language Resources and Tools — The MEDAR Project
Bente Maegaard | Mohamed Attia | Khalid Choukri | Olivier Hamon | Steven Krauwer | Mustafa Yaseen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The paper describes some of the work carried out within the European-funded project MEDAR. The project has three streams of activity: the technical stream, the cooperation stream and the dissemination stream. MEDAR first updated the existing surveys and the BLARK for Arabic, after which the technical stream focused on machine translation. The consortium identified a number of freely available MT systems and then customized two versions of the well-known MOSES package. The consortium also addressed the need to package MOSES for English to Arabic (while the main MT stream is on Arabic to English). For performance assessment, the partners produced test data that made it possible to run an evaluation campaign with 5 different systems (including some from outside the consortium) and two online ones. Both the MT baselines and the collected data will be made available via the ELRA catalogue. The cooperation stream focuses mostly on the Cooperation Roadmap for Human Language Technologies (HLT) for Arabic, which is directed towards Arabic HLT in the region in general. The purpose of the roadmap is to outline areas and priorities for collaboration, both between EU countries and Arabic-speaking countries and more generally: between countries, between universities, and last but not least between universities and industry.
2008
A Compact Arabic Lexical Semantics Language Resource Based on the Theory of Semantic Fields
Mohamed Attia | Mohsen Rashwan | Ahmed Ragheb | Mohamed Al-Badrashiny | Husein Al-Basoumy
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Applications of statistical Arabic NLP in general, and text mining in particular, along with the underlying tools, perform much better when the statistical processing operates on deeper factorization(s) of the language rather than on raw text. Lexical semantic factorization is very important in this respect because of its feasibility, its high level of abstraction, and the language independence of its output. At the core of such a factorization lies an Arabic lexical semantic database. While building this language resource (LR), we had to go beyond the conventional approach of collecting words exclusively from dictionaries and thesauri, which alone cannot produce satisfactory coverage of this highly inflective and derivative language. This paper is hence devoted to the design and implementation of an Arabic lexical semantics LR that enables the retrieval of the possible senses of any given Arabic word with high coverage. Instead of tying full Arabic words to their possible senses, our LR flexibly relates Arabic lexical compounds, constrained morphologically and by PoS tags, to a predefined limited set of semantic fields across which the standard semantic relations are defined. With the aid of the same large-scale Arabic morphological analyzer and PoS tagger at runtime, the possible senses of virtually any given Arabic word are retrievable.
2006
Building Annotated Written and Spoken Arabic LRs in NEMLAR Project
M. Yaseen | M. Attia | B. Maegaard | K. Choukri | N. Paulsson | S. Haamid | S. Krauwer | C. Bendahman | H. Fersøe | M. Rashwan | B. Haddad | C. Mukbel | A. Mouradi | A. Al-Kufaishi | M. Shahin | N. Chenfour | A. Ragheb
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The NEMLAR project (Network for Euro-Mediterranean LAnguage Resource and human language technology development and support, www.nemlar.org) was a project supported by the EC with partners from Europe and Arabic countries, whose objective was to build a network of specialized partners to promote and support the development of Arabic Language Resources (LRs) in the Mediterranean region. The project focused on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language industry and communication players, and establishing a protocol for developing and identifying a Basic Language Resource Kit (BLARK) for Arabic. The BLARK is defined as the minimal set of language resources necessary for any pre-competitive research and education, as well as for the development of crucial components for any future NLP industry. Following the identification of high-priority resources, the NEMLAR partners agreed to focus on and produce three main resources: 1) an annotated Arabic written corpus of about 500 K words, 2) an Arabic speech corpus for TTS applications of 2x5 hours, and 3) an Arabic broadcast news speech corpus of 40 hours of Modern Standard Arabic. For each of the three LRs, the underlying linguistic models and assumptions, technical specifications, methodologies for collecting and building the resource, and validation and verification mechanisms were defined and applied.