Niklas Paulsson

Also published as: N. Paulsson


The ETAPE corpus for the evaluation of speech-based TV content processing in the French language
Guillaume Gravier | Gilles Adda | Niklas Paulsson | Matthieu Carré | Aude Giraudel | Olivier Galibert
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper presents a comprehensive overview of existing data for the evaluation of spoken content processing in a multimedia framework for the French language. We focus on the ETAPE corpus, which will be made publicly available by ELDA in mid-2012, after completion of the evaluation campaign, and recall existing resources resulting from previous evaluation campaigns. The ETAPE corpus consists of 30 hours of TV and radio broadcasts, selected to cover a wide variety of topics and speaking styles, emphasizing spontaneous speech and multiple speaker areas.


Quick Rich Transcriptions of Arabic Broadcast News Speech Data
Chomicha Bendahman | Meghan Glenn | Djamel Mostefa | Niklas Paulsson | Stephanie Strassel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes the collection and transcription of a large set of Arabic broadcast news speech data. In total, more than 2000 hours of data was transcribed. The transcription factor for the broadcast news data was reduced by using a method such as Quick Rich Transcription (QRTR) and by reducing the number of quality controls performed on the data. The data was collected from several Arabic TV and radio sources and covers both Modern Standard Arabic and dialectal Arabic. The orthographic transcriptions included segmentation, speaker turns, topics, sentence unit types and a minimal noise mark-up. The transcripts were produced as part of the GALE project.

LILA: Cellular Telephone Speech Databases from Asia
Eric Sanders | Asuncion Moreno | Herbert Tropf | Lynette Melnar | Nurit Dekel | Breanna Gillies | Niklas Paulsson
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The goal of the LILA project was the collection of speech databases over cellular telephone networks for five languages in three Asian countries. Three languages were recorded in India: Hindi by first language speakers, Hindi by second language speakers and Indian English. Furthermore, Mandarin was recorded in China and Korean in South Korea. The databases are part of the SpeechDat family and follow the SpeechDat rules in many respects. All databases have been finished and have passed the validation tests. Both Hindi databases and the Korean database will be available to the public for sale.


Building Annotated Written and Spoken Arabic LRs in NEMLAR Project
M. Yaseen | M. Attia | B. Maegaard | K. Choukri | N. Paulsson | S. Haamid | S. Krauwer | C. Bendahman | H. Fersøe | M. Rashwan | B. Haddad | C. Mukbel | A. Mouradi | A. Al-Kufaishi | M. Shahin | N. Chenfour | A. Ragheb
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The NEMLAR project (Network for Euro-Mediterranean LAnguage Resource and human language technology development and support) was a project supported by the EC with partners from Europe and Arabic countries, whose objective was to build a network of specialized partners to promote and support the development of Arabic Language Resources (LRs) in the Mediterranean region. The project focused on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language industry and communication players, and establishing a protocol for developing and identifying a Basic Language Resource Kit (BLARK) for Arabic. The BLARK is defined as the minimal set of language resources necessary to do any pre-competitive research and education, in addition to the development of crucial components for any future NLP industry. Following the identification of high-priority resources, the NEMLAR partners agreed to focus on and produce three main resources: 1) an annotated Arabic written corpus of about 500 K words, 2) an Arabic speech corpus for TTS applications of 2x5 hours, and 3) an Arabic broadcast news speech corpus of 40 hours of Modern Standard Arabic. For each of the three LRs, the underlying linguistic models and assumptions, technical specifications, methodologies for collection and building, and validation and verification mechanisms were defined and applied.


Network of Data Centres (NetDC): BNSC - An Arabic Broadcast News Speech Corpus
Khalid Choukri | Mahtab Nikkhou | Niklas Paulsson
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)