2024
pdf
bib
Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024
Ingo Siegert
|
Khalid Choukri
Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024
pdf
bib
abs
Compliance by Design Methodologies in the Legal Governance Schemes of European Data Spaces
Kossay Talmoudi
|
Khalid Choukri
|
Isabelle Gavanon
Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024
Creating novel ways of sharing data to boost the digital economy has been one of the growing priorities of the European Union. In order to realise a set of data-sharing modalities, the European Union funds several projects that aim to put in place Common Data Spaces. These infrastructures are set to be a catalyst for the data economy. However, their implementation faces many hurdles. Legal compliance is still one of the major ambiguities of European Common Data Spaces and many initiatives intend to proactively integrate legal compliance schemes in the architecture of sectoral Data Spaces. The various initiatives must navigate a complex web of cross-cutting legal frameworks, including contract law, data protection, intellectual property, protection of trade secrets, competition law, European sovereignty, and cybersecurity obligations. As the conceptualisation of Data Spaces evolves and shows signs of differentiation from one sector to another, it is important to showcase the legal repercussions of the options of centralisation and decentralisation that can be observed in different Data Spaces. This paper will thus delve into their legal requirements and attempt to sketch out a stepping stone for understanding legal governance in data spaces.
pdf
abs
Common European Language Data Space
Georg Rehm
|
Stelios Piperidis
|
Khalid Choukri
|
Andrejs Vasiļjevs
|
Katrin Marheinecke
|
Victoria Arranz
|
Aivars Bērziņš
|
Miltos Deligiannis
|
Dimitris Galanis
|
Maria Giagkou
|
Katerina Gkirtzou
|
Dimitris Gkoumas
|
Annika Grützner-Zahn
|
Athanasia Kolovou
|
Penny Labropoulou
|
Andis Lagzdiņš
|
Elena Leitner
|
Valérie Mapelli
|
Hélène Mazo
|
Simon Ostermann
|
Stefania Racioppa
|
Mickaël Rigault
|
Leon Voukoutis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The Common European Language Data Space (LDS) is an integral part of the EU data strategy, which aims at developing a single market for data. Its decentralised technical infrastructure and governance scheme are currently being developed by the LDS project, which also has dedicated tasks for proof-of-concept prototypes, handling legal aspects, raising awareness and promoting the LDS through events and social media channels. The LDS is part of a broader vision for establishing all necessary components to develop European large language models.
2023
pdf
bib
abs
FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN
Milind Agarwal
|
Sweta Agrawal
|
Antonios Anastasopoulos
|
Luisa Bentivogli
|
Ondřej Bojar
|
Claudia Borg
|
Marine Carpuat
|
Roldano Cattoni
|
Mauro Cettolo
|
Mingda Chen
|
William Chen
|
Khalid Choukri
|
Alexandra Chronopoulou
|
Anna Currey
|
Thierry Declerck
|
Qianqian Dong
|
Kevin Duh
|
Yannick Estève
|
Marcello Federico
|
Souhir Gahbiche
|
Barry Haddow
|
Benjamin Hsu
|
Phu Mon Htut
|
Hirofumi Inaguma
|
Dávid Javorský
|
John Judge
|
Yasumasa Kano
|
Tom Ko
|
Rishu Kumar
|
Pengwei Li
|
Xutai Ma
|
Prashant Mathur
|
Evgeny Matusov
|
Paul McNamee
|
John P. McCrae
|
Kenton Murray
|
Maria Nadejde
|
Satoshi Nakamura
|
Matteo Negri
|
Ha Nguyen
|
Jan Niehues
|
Xing Niu
|
Atul Kr. Ojha
|
John E. Ortega
|
Proyag Pal
|
Juan Pino
|
Lonneke van der Plas
|
Peter Polák
|
Elijah Rippeth
|
Elizabeth Salesky
|
Jiatong Shi
|
Matthias Sperber
|
Sebastian Stüker
|
Katsuhito Sudoh
|
Yun Tang
|
Brian Thompson
|
Kevin Tran
|
Marco Turchi
|
Alex Waibel
|
Mingxuan Wang
|
Shinji Watanabe
|
Rodolfo Zevallos
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.
2022
pdf
bib
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Nicoletta Calzolari
|
Frédéric Béchet
|
Philippe Blache
|
Khalid Choukri
|
Christopher Cieri
|
Thierry Declerck
|
Sara Goggi
|
Hitoshi Isahara
|
Bente Maegaard
|
Joseph Mariani
|
Hélène Mazo
|
Jan Odijk
|
Stelios Piperidis
Proceedings of the Thirteenth Language Resources and Evaluation Conference
pdf
abs
Language Resources to Support Language Diversity – the ELRA Achievements
Valérie Mapelli
|
Victoria Arranz
|
Khalid Choukri
|
Hélène Mazo
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This article highlights ELRA’s latest achievements in the field of Language Resources (LRs) identification, sharing and production. It also reports on ELRA’s involvement in several national and international projects, as well as in the organization of events for the support of LRs and related Language Technologies, including for under-resourced languages. Over the past few years, ELRA, together with its operational agency ELDA, has continued to increase its catalogue offer of LRs, establishing worldwide partnerships for the production of various types of LRs (SMS, tweets, crawled data, MT aligned data, speech LRs, sentiment-based data, etc.). Through their consistent involvement in EU-funded projects, ELRA and ELDA have contributed to improving access to multilingual information in the context of the pandemic, developing tools for the de-identification of texts in the legal and medical domains, supporting the EU eTranslation Machine Translation system, and setting up a European platform providing access to both resources and services. In December 2019, ELRA co-organized the LT4All conference, whose main topics were Language Technologies for enabling linguistic diversity and multilingualism worldwide. Moreover, although LREC was cancelled in 2020, ELRA published the LREC 2020 proceedings for the Main conference and Workshops papers, and carried on its dissemination activities while targeting the new LREC edition for 2022.
pdf
abs
MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents
Victoria Arranz
|
Khalid Choukri
|
Montse Cuadros
|
Aitor García Pablos
|
Lucie Gianola
|
Cyril Grouin
|
Manuel Herranz
|
Patrick Paroubek
|
Pierre Zweigenbaum
Proceedings of the Workshop on Ethical and Legal Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Data In Language Resources within the 13th Language Resources and Evaluation Conference
This paper presents the outcomes of the MAPA project, a set of annotated corpora for 24 languages of the European Union and an open-source customisable toolkit able to detect and substitute sensitive information in text documents from any domain, using state-of-the-art, deep learning-based named entity recognition techniques. In the context of the project, the toolkit has been developed and tested on administrative, legal and medical documents, obtaining state-of-the-art results. As a result of the project, 24 dataset packages have been released and the de-identification toolkit is available as open source.
pdf
abs
Legal and Ethical Challenges in Recording Air Traffic Control Speech
Mickaël Rigault
|
Claudia Cevenini
|
Khalid Choukri
|
Martin Kocour
|
Karel Veselý
|
Igor Szoke
|
Petr Motlicek
|
Juan Pablo Zuluaga-Gomez
|
Alexander Blatt
|
Dietrich Klakow
|
Allan Tart
|
Pavel Kolčárek
|
Jan Černocký
Proceedings of the Workshop on Ethical and Legal Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Data In Language Resources within the 13th Language Resources and Evaluation Conference
In this paper the authors detail the various legal and ethical issues faced during the ATCO2 project. This project is aimed at developing tools to automatically collect and transcribe air traffic conversations, especially conversations between pilots and air traffic control towers. The authors discuss issues related to intellectual property, public data, privacy, and general ethics in the collection of air-traffic control speech.
2021
pdf
abs
European Language Grid: A Joint Platform for the European Language Technology Community
Georg Rehm
|
Stelios Piperidis
|
Kalina Bontcheva
|
Jan Hajic
|
Victoria Arranz
|
Andrejs Vasiļjevs
|
Gerhard Backfried
|
Jose Manuel Gomez-Perez
|
Ulrich Germann
|
Rémi Calizzano
|
Nils Feldhus
|
Stefanie Hegele
|
Florian Kintzel
|
Katrin Marheinecke
|
Julian Moreno-Schneider
|
Dimitris Galanis
|
Penny Labropoulou
|
Miltos Deligiannis
|
Katerina Gkirtzou
|
Athanasia Kolovou
|
Dimitris Gkoumas
|
Leon Voukoutis
|
Ian Roberts
|
Jana Hamrlova
|
Dusan Varis
|
Lukas Kacena
|
Khalid Choukri
|
Valérie Mapelli
|
Mickaël Rigault
|
Julija Melnika
|
Miro Janosik
|
Katja Prinz
|
Andres Garcia-Silva
|
Cristian Berrio
|
Ondrej Klejch
|
Steve Renals
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Europe is a multilingual society, in which dozens of languages are spoken. The only option to enable and to benefit from multilingualism is through Language Technologies (LT), i.e., Natural Language Processing and Speech Technologies. We describe the European Language Grid (ELG), which is targeted to evolve into the primary platform and marketplace for LT in Europe by providing one umbrella platform for the European LT landscape, including research and industry, enabling all stakeholders to upload, share and distribute their services, products and resources. At the end of our EU project, which will establish a legal entity in 2022, the ELG will provide access to approx. 1300 services for all European languages as well as thousands of data sets.
2020
pdf
bib
Proceedings of the Twelfth Language Resources and Evaluation Conference
Nicoletta Calzolari
|
Frédéric Béchet
|
Philippe Blache
|
Khalid Choukri
|
Christopher Cieri
|
Thierry Declerck
|
Sara Goggi
|
Hitoshi Isahara
|
Bente Maegaard
|
Joseph Mariani
|
Hélène Mazo
|
Asuncion Moreno
|
Jan Odijk
|
Stelios Piperidis
Proceedings of the Twelfth Language Resources and Evaluation Conference
pdf
abs
The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm
|
Katrin Marheinecke
|
Stefanie Hegele
|
Stelios Piperidis
|
Kalina Bontcheva
|
Jan Hajič
|
Khalid Choukri
|
Andrejs Vasiļjevs
|
Gerhard Backfried
|
Christoph Prinz
|
José Manuel Gómez-Pérez
|
Luc Meertens
|
Paul Lukowicz
|
Josef van Genabith
|
Andrea Lösch
|
Philipp Slusallek
|
Morten Irgens
|
Patrick Gatellier
|
Joachim Köhler
|
Laure Le Bars
|
Dimitra Anastasiou
|
Albina Auksoriūtė
|
Núria Bel
|
António Branco
|
Gerhard Budin
|
Walter Daelemans
|
Koenraad De Smedt
|
Radovan Garabík
|
Maria Gavriilidou
|
Dagmar Gromann
|
Svetla Koeva
|
Simon Krek
|
Cvetana Krstev
|
Krister Lindén
|
Bernardo Magnini
|
Jan Odijk
|
Maciej Ogrodniczuk
|
Eiríkur Rögnvaldsson
|
Mike Rosner
|
Bolette Pedersen
|
Inguna Skadiņa
|
Marko Tadić
|
Dan Tufiș
|
Tamás Váradi
|
Kadri Vider
|
Andy Way
|
François Yvon
Proceedings of the Twelfth Language Resources and Evaluation Conference
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.
pdf
abs
European Language Grid: An Overview
Georg Rehm
|
Maria Berger
|
Ela Elsholz
|
Stefanie Hegele
|
Florian Kintzel
|
Katrin Marheinecke
|
Stelios Piperidis
|
Miltos Deligiannis
|
Dimitris Galanis
|
Katerina Gkirtzou
|
Penny Labropoulou
|
Kalina Bontcheva
|
David Jones
|
Ian Roberts
|
Jan Hajič
|
Jana Hamrlová
|
Lukáš Kačena
|
Khalid Choukri
|
Victoria Arranz
|
Andrejs Vasiļjevs
|
Orians Anvari
|
Andis Lagzdiņš
|
Jūlija Meļņika
|
Gerhard Backfried
|
Erinç Dikici
|
Miroslav Janosik
|
Katja Prinz
|
Christoph Prinz
|
Severin Stampler
|
Dorothea Thomas-Aniola
|
José Manuel Gómez-Pérez
|
Andres Garcia Silva
|
Christian Berrío
|
Ulrich Germann
|
Steve Renals
|
Ondrej Klejch
Proceedings of the Twelfth Language Resources and Evaluation Conference
With 24 official EU and many additional languages, multilingualism in Europe and an inclusive Digital Single Market can only be enabled through Language Technologies (LTs). European LT business is dominated by hundreds of SMEs and a few large players. Many are world-class, with technologies that outperform the global players. However, European LT business is also fragmented – by nation states, languages, verticals and sectors, significantly holding back its impact. The European Language Grid (ELG) project addresses this fragmentation by establishing the ELG as the primary platform for LT in Europe. The ELG is a scalable cloud platform, providing, in an easy-to-integrate way, access to hundreds of commercial and non-commercial LTs for all European languages, including running tools and services as well as data sets and resources. Once fully operational, it will enable the commercial and non-commercial European LT community to deposit and upload their technologies and data sets into the ELG, to deploy them through the grid, and to connect with other resources. The ELG will boost the Multilingual Digital Single Market towards a thriving European LT community, creating new jobs and opportunities. Furthermore, the ELG project organises two open calls for up to 20 pilot projects. It also sets up 32 national competence centres and the European LT Council for outreach and coordination purposes.
pdf
abs
Making Metadata Fit for Next Generation Language Technology Platforms: The Metadata Schema of the European Language Grid
Penny Labropoulou
|
Katerina Gkirtzou
|
Maria Gavriilidou
|
Miltos Deligiannis
|
Dimitris Galanis
|
Stelios Piperidis
|
Georg Rehm
|
Maria Berger
|
Valérie Mapelli
|
Michael Rigault
|
Victoria Arranz
|
Khalid Choukri
|
Gerhard Backfried
|
José Manuel Gómez-Pérez
|
Andres Garcia-Silva
Proceedings of the Twelfth Language Resources and Evaluation Conference
The current scientific and technological landscape is characterised by the increasing availability of data resources and processing tools and services. In this setting, metadata have emerged as a key factor facilitating management, sharing and usage of such digital assets. In this paper we present ELG-SHARE, a rich metadata schema catering for the description of Language Resources and Technologies (processing and generation services and tools, models, corpora, term lists, etc.), as well as related entities (e.g., organizations, projects, supporting documents, etc.). The schema powers the European Language Grid platform that aims to be the primary hub and marketplace for industry-relevant Language Technology in Europe. ELG-SHARE has been based on various metadata schemas, vocabularies, and ontologies, as well as related recommendations and guidelines.
pdf
abs
The Multilingual Anonymisation Toolkit for Public Administrations (MAPA) Project
Ēriks Ajausks
|
Victoria Arranz
|
Laurent Bié
|
Aleix Cerdà-i-Cucó
|
Khalid Choukri
|
Montse Cuadros
|
Hans Degroote
|
Amando Estela
|
Thierry Etchegoyhen
|
Mercedes García-Martínez
|
Aitor García-Pablos
|
Manuel Herranz
|
Alejandro Kohan
|
Maite Melero
|
Mike Rosner
|
Roberts Rozis
|
Patrick Paroubek
|
Artūrs Vasiļevskis
|
Pierre Zweigenbaum
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
We describe the MAPA project, funded under the Connecting Europe Facility programme, whose goal is the development of an open-source de-identification toolkit for all official European Union languages. The project runs from January 2020 until December 2021.
pdf
bib
Proceedings of the 1st International Workshop on Language Technology Platforms
Georg Rehm
|
Kalina Bontcheva
|
Khalid Choukri
|
Jan Hajič
|
Stelios Piperidis
|
Andrejs Vasiļjevs
Proceedings of the 1st International Workshop on Language Technology Platforms
pdf
abs
ELRI: A Decentralised Network of National Relay Stations to Collect, Prepare and Share Language Resources
Thierry Etchegoyhen
|
Borja Anza Porras
|
Andoni Azpeitia
|
Eva Martínez Garcia
|
José Luis Fonseca
|
Patricia Fonseca
|
Paulo Vale
|
Jane Dunne
|
Federico Gaspari
|
Teresa Lynn
|
Helen McHugh
|
Andy Way
|
Victoria Arranz
|
Khalid Choukri
|
Hervé Pusset
|
Alexandre Sicard
|
Rui Neto
|
Maite Melero
|
David Perez
|
António Branco
|
Ruben Branco
|
Luís Gomes
Proceedings of the 1st International Workshop on Language Technology Platforms
We describe the European Language Resource Infrastructure (ELRI), a decentralised network to help collect, prepare and share language resources. The infrastructure was developed within a project co-funded by the Connecting Europe Facility Programme of the European Union, and has been deployed in the four Member States participating in the project, namely France, Ireland, Portugal and Spain. ELRI provides sustainable and flexible means to collect and share language resources via National Relay Stations, to which members of public institutions can freely subscribe. The infrastructure includes fully automated data processing engines to facilitate the preparation, sharing and wider reuse of useful language resources that can help optimise human and automated translation services in the European Union.
2018
bib
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Nicoletta Calzolari
|
Khalid Choukri
|
Christopher Cieri
|
Thierry Declerck
|
Sara Goggi
|
Koiti Hasida
|
Hitoshi Isahara
|
Bente Maegaard
|
Joseph Mariani
|
Hélène Mazo
|
Asuncion Moreno
|
Jan Odijk
|
Stelios Piperidis
|
Takenobu Tokunaga
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
pdf
Data Management Plan (DMP) for Language Data under the New General Data Protection Regulation (GDPR)
Pawel Kamocki
|
Valérie Mapelli
|
Khalid Choukri
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
pdf
European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management
Andrea Lösch
|
Valérie Mapelli
|
Stelios Piperidis
|
Andrejs Vasiļjevs
|
Lilli Smal
|
Thierry Declerck
|
Eileen Schnur
|
Khalid Choukri
|
Josef van Genabith
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
pdf
Automatic Identification of Maghreb Dialects Using a Dictionary-Based Approach
Houda Saâdane
|
Hosni Seffih
|
Christian Fluhr
|
Khalid Choukri
|
Nasredine Semmar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
pdf
abs
ELRI - European Language Resources Infrastructure
Thierry Etchegoyhen
|
Borja Anza Porras
|
Andoni Azpeitia
|
Eva Martínez Garcia
|
Paulo Vale
|
José Luis Fonseca
|
Teresa Lynn
|
Jane Dunne
|
Federico Gaspari
|
Andy Way
|
Victoria Arranz
|
Khalid Choukri
|
Vladimir Popescu
|
Pedro Neiva
|
Rui Neto
|
Maite Melero
|
David Perez Fernandez
|
Antonio Branco
|
Ruben Branco
|
Luis Gomes
Proceedings of the 21st Annual Conference of the European Association for Machine Translation
We describe the European Language Resources Infrastructure project, whose main aim is the provision of an infrastructure to help collect, prepare and share language resources that can in turn improve translation services in Europe.
2016
bib
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Nicoletta Calzolari
|
Khalid Choukri
|
Thierry Declerck
|
Sara Goggi
|
Marko Grobelnik
|
Bente Maegaard
|
Joseph Mariani
|
Helene Mazo
|
Asuncion Moreno
|
Jan Odijk
|
Stelios Piperidis
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
pdf
abs
ELRA Activities and Services
Khalid Choukri
|
Valérie Mapelli
|
Hélène Mazo
|
Vladimir Popescu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
After celebrating its 20th anniversary in 2015, ELRA is carrying on its strong involvement in the HLT field. To share ELRA’s expertise from the past 21 years, this article begins with a presentation of ELRA’s strategic Data and LR Management Plan for wide use by the language communities. Then, we further report on ELRA’s activities and services provided since LREC 2014. When looking at the cataloguing and licensing activities, we can see that ELRA has been active in moving the Meta-Share repository toward new development steps, supporting Europe in obtaining accurate LRs within the Connecting Europe Facility programme, promoting the use of LR citation, and creating the ELRA License Wizard web portal. The article further elaborates on the recent LR production activities of various written, speech and video resources, commissioned by public and private customers. In parallel, ELDA has also worked on several EU-funded projects centred on strategic issues related to the European Digital Single Market. The last part gives an overview of the latest dissemination activities, with a special focus on the celebration of its 20th anniversary organised in Dubrovnik (Croatia) and the follow-up of LREC, as well as the launch of the new ELRA portal.
pdf
abs
Language Resource Citation: the ISLRN Dissemination and Further Developments
Valérie Mapelli
|
Vladimir Popescu
|
Lin Liu
|
Khalid Choukri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This article presents the latest dissemination activities and technical developments that were carried out for the International Standard Language Resource Number (ISLRN) service. It also recalls the main principle and submission process for providers to obtain their 13-digit ISLRN identifier. Up to March 2016, 2100 Language Resources were allocated an ISLRN number, not only ELRA’s and LDC’s catalogued Language Resources, but also the ones from other important organisations like the Joint Research Centre (JRC) and the Resource Management Agency (RMA) who expressed their strong support for this initiative. In the research field, not only assigning a unique identification number is important, but also referring to a Language Resource as an object per se (like publications) has now become an obvious requirement. The ISLRN could also become an important parameter to be considered to compute a Language Resource Impact Factor (LRIF) in order to recognize the merits of the producers of Language Resources. Integrating the ISLRN number into a LR-oriented bibliographical reference is thus part of the objective. The idea is to make use of a BibTeX entry that would take into account Language Resources items, including ISLRN. The ISLRN being a requested field within the LREC 2016 submission, we expect that several other LRs will be allocated an ISLRN number by the conference date. With this expansion, this number aims to be a widely used LR citation instrument within works referring to LRs.
pdf
abs
New Developments in the LRE Map
Vladimir Popescu
|
Lin Liu
|
Riccardo Del Gratta
|
Khalid Choukri
|
Nicoletta Calzolari
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper we describe the new developments brought to LRE Map, especially in terms of the user interface of the Web application, of the searching of the information therein, and of the data model updates.
pdf
abs
The ELRA License Wizard
Valérie Mapelli
|
Vladimir Popescu
|
Lin Liu
|
Meritxell Fernández Barrera
|
Khalid Choukri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
To allow an easy understanding of the various licenses that exist for the use of Language Resources (ELRA’s, META-SHARE’s, Creative Commons’, etc.), ELRA has developed a License Wizard to help the right-holders share/distribute their resources under the appropriate license. It also aims to be exploited by users to better understand the legal obligations that apply in various licensing situations. The present paper elaborates on the functionalities of this web configurator, which makes it possible to select a number of legal features and obtain the user license adapted to the user’s selection, to define which user licenses they would like to select in order to distribute their Language Resources, and to integrate the user license terms into a Distribution Agreement that could be proposed to ELRA or META-SHARE for further distribution through the ELRA Catalogue of Language Resources. Thanks to a flexible back office, the structure of the legal feature selection can easily be reviewed to include other features that may be relevant for other licenses. Integrating contributions from other initiatives thus aims to be one of the obvious next steps, with a special focus on CLARIN and Linked Data experiences.
pdf
abs
Enhancing Cross-border EU E-commerce through Machine Translation: Needed Language Resources, Challenges and Opportunities
Meritxell Fernández Barrera
|
Vladimir Popescu
|
Antonio Toral
|
Federico Gaspari
|
Khalid Choukri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper discusses the role that statistical machine translation (SMT) can play in the development of cross-border EU e-commerce, by highlighting extant obstacles and identifying relevant technologies to overcome them. In this sense, it firstly proposes a typology of e-commerce static and dynamic textual genres and it identifies those that may be more successfully targeted by SMT. The specific challenges concerning the automatic translation of user-generated content are discussed in detail. Secondly, the paper highlights the risk of data sparsity inherent to e-commerce and it explores the state-of-the-art strategies to achieve domain adequacy via adaptation. Thirdly, it proposes a robust workflow for the development of SMT systems adapted to the e-commerce domain by relying on inexpensive methods. Given the scarcity of user-generated language corpora for most language pairs, the paper proposes to obtain monolingual target-language data to train language models and aligned parallel corpora to tune and evaluate MT systems by means of crowdsourcing.
2014
bib
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Nicoletta Calzolari
|
Khalid Choukri
|
Thierry Declerck
|
Hrafn Loftsson
|
Bente Maegaard
|
Joseph Mariani
|
Asuncion Moreno
|
Jan Odijk
|
Stelios Piperidis
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
pdf
abs
The ETAPE speech processing evaluation
Olivier Galibert
|
Jeremy Leixa
|
Gilles Adda
|
Khalid Choukri
|
Guillaume Gravier
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The ETAPE evaluation is the third evaluation in automatic speech recognition and associated technologies in a series which started with ESTER. This evaluation proposed some new challenges, by proposing TV and radio shows with prepared and spontaneous speech, annotation and evaluation of overlapping speech, a cross-show condition in speaker diarization, and new, complex but very informative named entities in the information extraction task. This paper presents the whole campaign, including the data annotated, the metrics used and the anonymized system results. All the data created in the evaluation, hopefully including system outputs, will be distributed through the ELRA catalogue in the future.
pdf
abs
META-SHARE: One year after
Stelios Piperidis
|
Harris Papageorgiou
|
Christian Spurk
|
Georg Rehm
|
Khalid Choukri
|
Olivier Hamon
|
Nicoletta Calzolari
|
Riccardo del Gratta
|
Bernardo Magnini
|
Christian Girardi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper presents META-SHARE (www.meta-share.eu), an open language resource infrastructure, and its usage since its Europe-wide deployment in early 2013. META-SHARE is a network of repositories that store language resources (data, tools and processing services) documented with high-quality metadata, aggregated in central inventories allowing for uniform search and access. META-SHARE was developed by META-NET (www.meta-net.eu) and aims to serve as an important component of a language technology marketplace for researchers, developers, professionals and industrial players, catering for the full development cycle of language technology, from research through to innovative products and services. The observed usage in its initial steps, the steadily increasing number of network nodes, resources, users, queries, views and downloads are all encouraging and considered as supportive of the choices made so far. In tandem, take-up activities like direct linking and processing of datasets by language processing services as well as metadata transformation to RDF are expected to open new avenues for data and resources linking and boost the organic growth of the infrastructure while facilitating language technology deployment by much wider research communities and industrial sectors.
pdf
abs
ELRA’s Consolidated Services for the HLT Community
Victoria Arranz
|
Khalid Choukri
|
Valérie Mapelli
|
Hélène Mazo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper emphasises ELRA’s contribution to the HLT field thanks to the consolidation of its services since LREC 2012. Among the most recent contributions is the establishment of the International Standard Language Resource Number (ISLRN), with the creation and exploitation of an associated web portal to enable the procurement of unique identifiers for Language Resources. Interoperability, consolidation and synchronization also remain a strong focus in ELRA’s cataloguing work, in particular with ELRA’s involvement in the META-SHARE project, whose platform is to become ELRA’s next instrument of sharing LRs. Since the last LREC, ELRA has continued its action to offer free LRs to the research community. Cooperation is another watchword within ELRA’s activities on multiple aspects: 1) at the legal level, ELRA is supporting the EC in identifying the gaps to be filled to reach harmonized copyright regulations for the HLT community in Europe; 2) at the production level, ELRA is participating in several international projects, in the field of LR production and evaluation of technologies; 3) at the communication level, ELRA has organised the NLP12 meeting with the aim of boosting co-operation and strengthening the bridges between various communities.
2012
bib
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Nicoletta Calzolari
|
Khalid Choukri
|
Thierry Declerck
|
Mehmet Uğur Doğan
|
Bente Maegaard
|
Joseph Mariani
|
Asuncion Moreno
|
Jan Odijk
|
Stelios Piperidis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
pdf
abs
ELRA in the heart of a cooperative HLT world
Valérie Mapelli
|
Victoria Arranz
|
Matthieu Carré
|
Hélène Mazo
|
Djamel Mostefa
|
Khalid Choukri
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper aims at giving an overview of ELRA’s recent activities. The first part elaborates on ELRA’s means of boosting the sharing of Language Resources (LRs) within the HLT community through its catalogues, the LRE-Map initiative, as well as its work towards the integration of its LRs within the META-SHARE open infrastructure. The second part shows how ELRA helps in the development and evaluation of HLT, in particular through its participation in numerous collaborative projects for the production of resources and platforms to facilitate their production and exploitation. A third part focuses on ELRA’s work for clearing IPR issues in an HLT-oriented context, one of its latest initiatives being its involvement in a Fair Research Act proposal to promote easy access to LRs for the widest community. Finally, the last part elaborates on recent actions for disseminating information and promoting cooperation in the field, e.g. the Language Library being launched at LREC2012 and the creation of an International Standard LR Number, an LR unique identifier to enable the accurate identification of LRs. Among the other messages ELRA will be conveying to the attendees are the announcement of a set of freely available resources, the establishment of an LR and Evaluation forum, etc.
pdf
abs
The FLaReNet Strategic Language Resource Agenda
Claudia Soria
|
Núria Bel
|
Khalid Choukri
|
Joseph Mariani
|
Monica Monachini
|
Jan Odijk
|
Stelios Piperidis
|
Valeria Quochi
|
Nicoletta Calzolari
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The FLaReNet Strategic Agenda highlights the most pressing needs for the sector of Language Resources and Technologies and presents a set of recommendations for its development and progress in Europe, as issued from a three-year consultation of the FLaReNet European project. The FLaReNet recommendations are organised around nine dimensions: a) documentation b) interoperability c) availability, sharing and distribution d) coverage, quality and adequacy e) sustainability f) recognition g) development h) infrastructure and i) international cooperation. As such, they cover a broad range of topics and activities, spanning over production and use of language resources, licensing, maintenance and preservation issues, infrastructures for language resources, resource identification and sharing, evaluation and validation, interoperability and policy issues. The intended recipients belong to a large set of players and stakeholders in Language Resources and Technology, ranging from individuals to research and education institutions, to policy-makers, funding agencies, SMEs and large companies, service and media providers. The main goal of these recommendations is to serve as an instrument to support stakeholders in planning for and addressing the urgencies of the Language Resources and Technologies of the future.
pdf
abs
New language resources for the Pashto language
Djamel Mostefa
|
Khalid Choukri
|
Sylvie Brunessaux
|
Karim Boudahmane
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper reports on the development of new language resources for the Pashto language, a very low-resource language spoken in Afghanistan and Pakistan. In the scope of a multilingual data collection project, three large corpora are collected for Pashto. Firstly, a monolingual text corpus of 100 million words is produced. Secondly, a 100-hour speech database is recorded and manually transcribed. Finally, a bilingual Pashto-French parallel corpus of around 2 million is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto with a special focus on Machine Translation.
pdf
abs
Using the International Standard Language Resource Number: Practical and Technical Aspects
Khalid Choukri
|
Victoria Arranz
|
Olivier Hamon
|
Jungyeul Park
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper describes the International Standard Language Resource Number (ISLRN), a new identification schema for Language Resources where a Language Resource is provided with a unique and universal name using a standardized nomenclature. This will ensure that Language Resources be identified, accessed and disseminated in a unique manner, thus allowing them to be recognized with proper references in all activities concerning Human Language Technologies as well as in all documents and scientific papers. This would allow, for instance, the formal identification of potentially repeated resources across different repositories, the formal referencing of language resources and their correct use when different versions are processed by tools.
pdf
abs
An Analytical Model of Language Resource Sustainability
Khalid Choukri
|
Victoria Arranz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper elaborates on a sustainability model for Language Resources, both at a descriptive and analytical level. The first part, devoted to the descriptive model, elaborates on the definition of this concept both from a general point of view and from the Human Language Technology and Language Resources perspective. The paper also intends to list an exhaustive number of factors that have an impact on this sustainability. These factors will be clustered into Pillars so as to ease understanding as well as the prediction of LR sustainability itself. Rather than simply identifying a set of LRs that have been in use for a while and that one can consider as sustainable, the paper aims at first clarifying and (re)defining the concept of sustainability by also connecting it to other domains. Then it also presents a detailed decomposition of all dimensions of Language Resource features that can contribute to and/or have an impact on such sustainability. Such analysis will also help anticipate and forecast sustainability for an LR before taking any decisions concerning design and production.
2011
pdf
Proposal for the International Standard Language Resource Number
Khalid Choukri
|
Jungyeul Park
|
Olivier Hamon
|
Victoria Arranz
Proceedings of the Workshop on Language Resources, Technology and Services in the Sharing Paradigm
pdf
Evaluation Methodology and Results for English-to-Arabic MT
Olivier Hamon
|
Khalid Choukri
Proceedings of Machine Translation Summit XIII: Papers
2010
bib
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Nicoletta Calzolari
|
Khalid Choukri
|
Bente Maegaard
|
Joseph Mariani
|
Jan Odijk
|
Stelios Piperidis
|
Mike Rosner
|
Daniel Tapias
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
pdf
abs
The LREC Map of Language Resources and Technologies
Nicoletta Calzolari
|
Claudia Soria
|
Riccardo Del Gratta
|
Sara Goggi
|
Valeria Quochi
|
Irene Russo
|
Khalid Choukri
|
Joseph Mariani
|
Stelios Piperidis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper we present the LREC Map of Language Resources and Tools, an innovative feature introduced with this LREC. The purpose of the Map is to shed light on the vast amount of resources and tools that represent the background of the research presented at LREC, in the attempt to fill in a gap in the community knowledge about the resources and tools that are used or created worldwide. It also aims at a change of culture in the field, actively engaging each researcher in the documentation task about resources. The Map has been developed on the basis of the information provided by LREC authors during the submission of papers to the LREC 2010 conference and the LREC workshops, and contains information about almost 2000 resources. The paper illustrates the motivation behind this initiative, its main characteristics, its relevance and future impact in the field, the metadata used to describe the resources, and finally presents some of the most relevant findings.
pdf
abs
Evaluation Protocol and Tools for Question-Answering on Speech Transcripts
Nicolas Moreau
|
Olivier Hamon
|
Djamel Mostefa
|
Sophie Rosset
|
Olivier Galibert
|
Lori Lamel
|
Jordi Turmo
|
Pere R. Comas
|
Paolo Rosso
|
Davide Buscaldi
|
Khalid Choukri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Question Answering (QA) technology aims at providing relevant answers to natural language questions. Most Question Answering research has focused on mining document collections containing written texts to answer written questions. In addition to written sources, a large (and growing) amount of potentially interesting information appears in spoken documents, such as broadcast news, speeches, seminars, meetings or telephone conversations. The QAST track (Question-Answering on Speech Transcripts) was introduced in CLEF to investigate the problem of question answering in such audio documents. This paper describes in detail the evaluation protocol and tools designed and developed for the CLEF-QAST evaluation campaigns that took place between 2007 and 2009. We first recall the data, question sets, and submission procedures that were produced or set up during these three campaigns. As for the evaluation procedure, the interface that was developed to ease the assessors' work is described. In addition, this paper introduces a methodology for a semi-automatic evaluation of QAST systems based on time slot comparisons. Finally, the QAST Evaluation Package 2007-2009 resulting from these evaluation campaigns is also introduced.
pdf
abs
Cooperation for Arabic Language Resources and Tools — The MEDAR Project
Bente Maegaard
|
Mohamed Attia
|
Khalid Choukri
|
Olivier Hamon
|
Steven Krauwer
|
Mustafa Yaseen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The paper describes some of the work carried out within the European funded project MEDAR. The project has three streams of activity: the technical stream, the cooperation stream and the dissemination stream. MEDAR has first updated the existing surveys and BLARK for Arabic, and then the technical stream focused on machine translation. The consortium identified a number of freely available MT systems and then customized two versions of the well-known MOSES package. The consortium addressed the need to package MOSES for English to Arabic (while the main MT stream is on Arabic to English). For performance assessment purposes, the partners produced test data that allowed carrying out an evaluation campaign with 5 different systems (including some from outside the consortium) and two online ones. Both the MT baselines and the collected data will be made available via the ELRA catalogue. The cooperation stream focuses mostly on the cooperation roadmap for Human Language Technologies for Arabic, a roadmap for the region directed towards Arabic HLT in general. It is the purpose of the roadmap to outline areas and priorities for collaboration, in terms of collaboration between EU countries and Arabic speaking countries, as well as cooperation in general: between countries, between universities, and last but not least between universities and industry.
pdf
abs
ELRA’s Services 15 Years on...Sharing and Anticipating the Community
Victoria Arranz
|
Khalid Choukri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
15 years have gone by and ELRA continues embracing the needs of the HLT community to design its services and to implement them through its operational body, ELDA. The needs of the community have become much more ambitious...Larger language resources (LR), better quality ones (how do we reach a compromise between price, maybe free, and quality?), more annotations, at different levels and for different modalities...easy access to these LRs and solved IPR issues, appropriate and adaptable licensing schemas...large activity in HLT evaluation, both in terms of setting up the evaluation and in helping produce all necessary data, protocols, specifications as well as conducting the whole process...producing the LRs researchers and developers need, LRs for a wide variety of activities and technologies...for development, for training, for evaluation...Disseminating all knowledge in the field, whether generated at ELRA or elsewhere...keeping the community up to date with what goes on regularly (LREC conferences, LangTech, Newsletters, HLT Evaluation Portal, etc.). Needless to say, part of ELRA’s evolution implies facing and anticipating the realities of the new Internet and data exchange era and remaining an LR backbone...looking into new models of LR data centres and platforms, LR access and exchange via web services, new models for infrastructures and repositories with even higher collaboration to make it happen. ELRA/ELDA participate in a number of international projects focused on this new production and sharing schema that will be detailed in the current paper.
pdf
abs
A Road Map for Interoperable Language Resource Metadata
Christopher Cieri
|
Khalid Choukri
|
Nicoletta Calzolari
|
D. Terence Langendoen
|
Johannes Leveling
|
Martha Palmer
|
Nancy Ide
|
James Pustejovsky
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
LRs remain expensive to create and thus rare relative to demand across languages and technology types. The accidental re-creation of an LR that already exists is a nearly unforgivable waste of scarce resources that is unfortunately not so easy to avoid. The number of catalogs the HLT researcher must search, with their different formats, makes it possible to overlook an existing resource. This paper sketches the sources of this problem and outlines a proposal to rectify it, along with a new vision of LR cataloging that will facilitate the documentation and exploitation of a much wider range of LRs than previously considered.
2009
pdf
End-to-End Evaluation in Simultaneous Translation
Olivier Hamon
|
Christian Fügen
|
Djamel Mostefa
|
Victoria Arranz
|
Muntsin Kolss
|
Alex Waibel
|
Khalid Choukri
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
2008
bib
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Nicoletta Calzolari
|
Khalid Choukri
|
Bente Maegaard
|
Joseph Mariani
|
Jan Odijk
|
Stelios Piperidis
|
Daniel Tapias
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
pdf
abs
Data Collection for the CHIL CLEAR 2007 Evaluation Campaign
Nicolas Moreau
|
Djamel Mostefa
|
Rainer Stiefelhagen
|
Susanne Burger
|
Khalid Choukri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper describes in detail the data that was collected and annotated during the third and final year of the CHIL project. This data was used for the CLEAR evaluation campaign in spring 2007. The paper also introduces the CHIL Evaluation Package 2007 that resulted from this campaign including a complete description of the performed evaluation tasks. This evaluation package will be made available to the community through the ELRA General Catalogue.
pdf
abs
Latest Developments in ELRA’s Services
Valérie Mapelli
|
Victoria Arranz
|
Hélène Mazo
|
Khalid Choukri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper describes the latest developments in ELRA’s services within the field of Language Resources (LR). These developments focus on 4 main groups of activities: the identification and distribution of Language Resources; the production of LRs; the evaluation of Human Language Technology (HLT), and the dissemination of information in the field. ELRA’s initial work on the distribution of language resources has evolved throughout the years, currently covering a much wider range of activities that have been considered crucial for the current needs of the R&D community and the good health of the LR world. Regarding distribution, considerable work has been done on a broader identification, which does not only consider resources to be immediately negotiated for distribution but which aims to inform on all available resources. This has been the seed for the Universal Catalogue. Furthermore, a Catalogue of LRs with favourable conditions for R&D has also been created. Moreover, the different activities regarding identification on demand, production within different frameworks, evaluation of language technologies and participation in evaluation campaigns, as well as our very specific focus on information dissemination are described in detail in this paper.
pdf
abs
The INFILE Project: a Crosslingual Filtering Systems Evaluation Campaign
Romaric Besançon
|
Stéphane Chaudiron
|
Djamel Mostefa
|
Ismaïl Timimi
|
Khalid Choukri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The InFile project (INformation, FILtering, Evaluation) is a cross-language adaptive filtering evaluation campaign, sponsored by the French National Research Agency. The campaign is organized by the CEA LIST, ELDA and the University of Lille3-GERiiCO. It has an international scope as it is a pilot track of the CLEF 2008 campaigns. The corpus is built from a collection of about 1.4 million newswires (10 GB) in three languages, Arabic, English and French, provided by the French news agency Agence France-Presse (AFP) and selected from a 3-year period. The profiles corpus is made of 50 profiles, of which 30 concern general news and events (national and international affairs, politics, sports, etc.) and 20 concern scientific and technical subjects.
pdf
abs
MEDAR: Collaboration between European and Mediterranean Arabic Partners to Support the Development of Language Technology for Arabic
Bente Maegaard
|
Mohammed Atiyya
|
Khalid Choukri
|
Steven Krauwer
|
Chafic Mokbel
|
Mustafa Yaseen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
After the successful completion of the NEMLAR project 2003-2005, a new opportunity for a project was opened by the European Commission, and a group of largely the same partners is now executing the MEDAR project. MEDAR will be updating the surveys and BLARK for Arabic already made, and will then focus on machine translation (and other tools for translation) and information retrieval with a focus on language resources, tools and evaluation for these applications. A very important part of the MEDAR project is to reinforce and extend the NEMLAR network and to create a cooperation roadmap for Human Language Technologies for Arabic. It is expected that the cooperation roadmap will attract wide attention from other parties and that it can help create a larger platform for collaborative projects. Finally, the project will focus on dissemination of knowledge about existing resources and tools, as well as actors and activities; this will happen through a newsletter, a website and an international conference which will follow up on the Cairo conference of 2004. Dissemination to user communities will also be important, e.g. through participation in translators' conferences. The goal of these activities is to create a stronger and lasting collaboration between EU countries and Arabic speaking countries.
pdf
abs
A Guide for the Production of Reusable Language Resources
Victoria Arranz
|
Franck Gandcher
|
Valérie Mapelli
|
Khalid Choukri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The project described in this paper is funded by the French Ministry of Research. It aims at providing producers of Language Resources, and HLT players in general, with a guide which offers technical, legal and strategic recommendations/guidelines for the reuse of their Language Resources. The guide is dedicated in particular to academic laboratories which produce Language Resources and may benefit from further advice to start development, but also to any HLT player who wishes to follow the best practices in this field. The guidelines focus on the different steps of a Language Resource's life, i.e. specifications, production, validation, distribution, and maintenance. This paper gives a brief overview of the guide, and describes a) technical formats, standards and best practices which correspond to the current state of the art, for different types of resources, whether written or spoken, at different steps of the production line, b) legal issues and models/templates which can be used for the dissemination of Language Resources as widely as possible, c) strategic issues, by offering a dissemination plan which takes into account all types of constraints faced by HLT community players.
2007
pdf
End-to-end evaluation of a speech-to-speech translation system in TC-STAR
Olivier Hamon
|
Djamel Mostefa
|
Khalid Choukri
Proceedings of Machine Translation Summit XI: Papers
pdf
Assessing human and automated quality judgments in the French MT evaluation campaign CESTA
Olivier Hamon
|
Anthony Hartley
|
Andrei Popescu-Belis
|
Khalid Choukri
Proceedings of Machine Translation Summit XI: Papers
pdf
bib
Proceedings of the Workshop on Automatic procedures in MT evaluation
Gregor Thurmair
|
Khalid Choukri
|
Bente Maegaard
Proceedings of the Workshop on Automatic procedures in MT evaluation
MT evaluation & TC-STAR
Khalid Choukri
|
Olivier Hamon
|
Djamel Mostefa
Proceedings of the Workshop on Automatic procedures in MT evaluation
2006
bib
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Nicoletta Calzolari
|
Khalid Choukri
|
Aldo Gangemi
|
Bente Maegaard
|
Joseph Mariani
|
Jan Odijk
|
Daniel Tapias
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
pdf
abs
Terminological Resources Acquisition Tools: Toward a User-oriented Evaluation Model
Widad Mustafa El Hadi
|
Ismail Timimi
|
Marianne Dabbadie
|
Khalid Choukri
|
Olivier Hamon
|
Yun-Chuang Chiao
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper describes the CESART project, which deals with the evaluation of terminological resources acquisition tools. The objective of the project is to propose and validate an evaluation protocol allowing one to objectively evaluate and compare different systems for terminology applications such as terminological resource creation and semantic relation extraction. The project also aims to create quality-controlled resources such as domain-specific corpora, automatic scoring tools, etc.
pdf
abs
TC-STAR: New language resources for ASR and SLT purposes
Henk van den Heuvel
|
Khalid Choukri
|
Christian Gollan
|
Asuncion Moreno
|
Djamel Mostefa
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In TC-STAR a variety of Language Resources (LR) is being produced. In this contribution we address the resources that have been created for Automatic Speech Recognition and Spoken Language Translation. As yet, these are 14 LR in total: two training SLR for ASR (English and Spanish), three development LR and three evaluation LR for ASR (English, Spanish, Mandarin), and three development LR and three evaluation LR for SLT (English-Spanish, Spanish-English, Mandarin-English). In this paper we describe the properties, validation, and availability of these resources.
pdf
abs
Building Annotated Written and Spoken Arabic LRs in NEMLAR Project
M. Yaseen
|
M. Attia
|
B. Maegaard
|
K. Choukri
|
N. Paulsson
|
S. Haamid
|
S. Krauwer
|
C. Bendahman
|
H. Fersøe
|
M. Rashwan
|
B. Haddad
|
C. Mukbel
|
A. Mouradi
|
A. Al-Kufaishi
|
M. Shahin
|
N. Chenfour
|
A. Ragheb
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The NEMLAR project: Network for Euro-Mediterranean LAnguage Resource and human language technology development and support (www.nemlar.org) was a project supported by the EC with partners from Europe and Arabic countries, whose objective was to build a network of specialized partners to promote and support the development of Arabic Language Resources (LRs) in the Mediterranean region. The project focused on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language industry and communication players, establishing a protocol for developing and identifying a Basic Language Resource Kit (BLARK) for Arabic, and assessing first priority requirements. The BLARK is defined as the minimal set of language resources that is necessary to do any pre-competitive research and education, in addition to the development of crucial components for any future NLP industry. Following the identification of high priority resources, the NEMLAR partners agreed to focus on and produce three main resources, which are 1) an annotated Arabic written corpus of about 500 K words, 2) an Arabic speech corpus for TTS applications of 2x5 hours, and 3) an Arabic broadcast news speech corpus of 40 hours of Modern Standard Arabic. For each of the resources, underlying linguistic models and assumptions of the corpus, technical specifications, methodologies for the collection and building of the resources, and validation and verification mechanisms were put in place and applied for the three LRs.
pdf
abs
Unified Lexicon and Unified Morphosyntactic Specifications for Written and Spoken Italian
Monica Monachini
|
Nicoletta Calzolari
|
Khalid Choukri
|
Jochen Friedrich
|
Giulio Maltese
|
Michele Mammini
|
Jan Odijk
|
Marisa Ulivieri
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The goal of this paper is (1) to illustrate a specific procedure for merging different monolingual lexicons, focussing on techniques for detecting and mapping equivalent lexical entries, and (2) to sketch a production model that enables one to obtain lexical resources via unification of existing data. We describe the creation of a Unified Lexicon (UL) from a common sample of the Italian PAROLE-SIMPLE-CLIPS phonological lexicon and of the Italian LCSTAR pronunciation lexicon. We expand previous experiments carried out at ILC-CNR: based on a detailed mechanism for mapping grammatical classifications of candidate UL entries, a consensual set of Unified Morphosyntactic Specifications (UMS) shared by lexica for the written and spoken areas is proposed. The impact of the UL on cross-validation issues is analysed: by looking into conflicts, mismatches and diverging classifications can be detected in both resources. The work presented is in line with the activities promoted by ELRA towards the development of methods for packaging new language resources by combining independently created resources, and was carried out as part of the ELRA Production Committee activities. ELRA aims to exploit the UL experience to carry out such merging activities for resources available on the ELRA catalogue in order to fulfill the users' needs.
pdf
abs
The BLARK concept and BLARK for Arabic
Bente Maegaard
|
Steven Krauwer
|
Khalid Choukri
|
Lise Damsgaard Jørgensen
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The EU project NEMLAR (Network for Euro-Mediterranean LAnguage Resources) on Arabic language resources carried out two surveys on the availability of Arabic LRs in the region, and on industrial requirements. The project also worked out a BLARK (Basic Language Resource Kit) for Arabic. In this paper we describe the further development of the BLARK concept made during the work on a BLARK for Arabic, as well as the results for Arabic.
pdf
abs
Language Resources Production Models: the Case of the INTERA Multilingual Corpus and Terminology
Maria Gavrilidou
|
Penny Labropoulou
|
Stelios Piperidis
|
Voula Giouli
|
Nicoletta Calzolari
|
Monica Monachini
|
Claudia Soria
|
Khalid Choukri
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper reports on the multilingual Language Resources (MLRs), i.e. parallel corpora and terminological lexicons for less widely digitally available languages, that have been developed in the INTERA project and the methodology adopted for their production. Special emphasis is given to the reality factors that have influenced the MLRs development approach and their final constitution. Building on the experience gained in the project, a production model has been elaborated, suggesting ways and techniques that can be exploited in order to improve LRs production taking into account realistic issues.
pdf
abs
CESTA: First Conclusions of the Technolangue MT Evaluation Campaign
O. Hamon
|
A. Popescu-Belis
|
K. Choukri
|
M. Dabbadie
|
A. Hartley
|
W. Mustafa El Hadi
|
M. Rajman
|
I. Timimi
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This article outlines the evaluation protocol and provides the main results of the French Evaluation Campaign for Machine Translation Systems, CESTA. Following the initial objectives and evaluation plans, the evaluation metrics are briefly described: along with fluency and adequacy assessed by human judges, a number of recently proposed automated metrics are used. Two evaluation campaigns were organized, the first one in the general domain, and the second one in the medical domain. Up to six systems translating from English into French, and two systems translating from Arabic into French, took part in the campaign. The numerical results illustrate the differences between classes of systems, and provide interesting indications about the reliability of the automated metrics for French as a target language, both by comparison to human judges and using correlations between metrics. The corpora that were produced, as well as the information about the reliability of metrics, constitute reusable resources for MT evaluation.
pdf
abs
Corpus description of the ESTER Evaluation Campaign for the Rich Transcription of French Broadcast News
S. Galliano
|
E. Geoffrois
|
G. Gravier
|
J.-F. Bonastre
|
D. Mostefa
|
K. Choukri
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper presents the audio corpus developed in the framework of the ESTER evaluation campaign of French broadcast news transcription systems. This corpus includes 100 hours of manually annotated recordings and 1,677 hours of non-transcribed data. The manual annotations include the detailed verbatim orthographic transcription, the speaker turns and identities, information about acoustic conditions, and named entities. Additional resources generated by automatic speech processing systems, such as phonetic alignments and word graphs, are also described.
pdf
abs
Evaluation of Automatic Speech Recognition and Speech Language Translation within TC-STAR: Results from the first evaluation campaign
Djamel Mostefa
|
Olivier Hamon
|
Khalid Choukri
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper reports on the evaluation activities conducted in the first year of the TC-STAR project. The TC-STAR project, financed by the European Commission within the Sixth Framework Program, is envisaged as a long-term effort to advance research in the core technologies of Speech-to-Speech Translation (SST). SST technology is a combination of Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text To Speech (TTS).
pdf
abs
Evaluation of multimodal components within CHIL: The evaluation packages and results
Djamel Mostefa
|
Marie-Neige Garcia
|
Khalid Choukri
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This article describes the first CHIL evaluation campaign, in which 12 technologies were evaluated. The major outcomes of the first evaluation campaign are the so-called Evaluation Packages. An evaluation package is the full documentation (definition and description of the evaluation methodologies, protocols and metrics) alongside the data sets and software scoring tools, which an organisation needs in order to perform the evaluation of one or more systems for a given technology. These evaluation packages will be made available to the community through the ELDA General Catalogue.
2005
pdf
abs
Evaluation of Machine Translation with Predictive Metrics beyond BLEU/NIST: CESTA Evaluation Campaign # 1
Sylvain Surcin
|
Olivier Hamon
|
Antony Hartley
|
Martin Rajman
|
Andrei Popescu-Belis
|
Widad Mustafa El Hadi
|
Ismaïl Timimi
|
Marianne Dabbadie
|
Khalid Choukri
Proceedings of Machine Translation Summit X: Papers
In this paper, we report on the results of a full-size evaluation campaign of various MT systems. This campaign is novel compared to the classical DARPA/NIST MT evaluation campaigns in the sense that French is the target language, and that it includes an experiment of meta-evaluation of various metrics claiming to better predict different attributes of translation quality. We first describe the campaign, its context, its protocol and the data we used. Then we summarise the results obtained by the participating systems and discuss the meta-evaluation of the metrics used.
2004
pdf
abs
The Future of Evaluation for Cross-Language Information Retrieval Systems
Carol Peters
|
Martin Braschler
|
Khalid Choukri
|
Julio Gonzalo
|
Michael Kluck
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
The objective of the Cross-Language Evaluation Forum (CLEF) is to promote research in the multilingual information access domain. In this short paper, we list the achievements of CLEF during its first four years of activity and describe how the range of tasks has been considerably expanded during this period. The aim of the paper is to demonstrate the importance of evaluation initiatives with respect to system research and development and to show how essential it is for such initiatives to keep abreast of and even anticipate the emerging needs of both system developers and application communities if they are to have a future.
pdf
abs
Collection of SLR in the Asian-Pacific Area
Asunción Moreno
|
Khalid Choukri
|
Phil Hall
|
Henk van den Heuvel
|
Eric Sanders
|
Francesco Senia
|
Herbert Tropf
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
The goal of this project (LILA) is the collection of a large number of spoken databases for training Automatic Speech Recognition Systems for telephone applications in the Asian Pacific area. Specifications follow those of SpeechDat-like databases. Utterances will be recorded directly from calls made either from fixed or cellular telephones and are composed of read text and answers to specific questions. The project is driven by a consortium composed of a large number of industrial companies. Each company is in charge of the production of two databases. The consortium shares the databases produced in the project. The goal of the project should be reached within the year 2005.
pdf
abs
The French MEDIA/EVALDA Project: the Evaluation of the Understanding Capability of Spoken Language Dialogue Systems
Laurence Devillers
|
Hélène Maynard
|
Sophie Rosset
|
Patrick Paroubek
|
Kevin McTait
|
D. Mostefa
|
Khalid Choukri
|
Laurent Charnay
|
Caroline Bousquet
|
Nadine Vigouroux
|
Frédéric Béchet
|
Laurent Romary
|
Jean-Yves Antoine
|
J. Villaneau
|
Myriam Vergnes
|
J. Goulian
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
The aim of the MEDIA project is to design and test a methodology for the evaluation of context-dependent and independent spoken dialogue systems. We propose an evaluation paradigm based on the use of test suites from real-world corpora and a common semantic representation and common metrics. This paradigm should allow us to diagnose the context-sensitive understanding capability of dialogue systems. This paradigm will be used within an evaluation campaign involving several sites, all of which will carry out the task of querying information from a database.
pdf
abs
The C-ORAL-ROM CORPUS. A Multilingual Resource of Spontaneous Speech for Romance Languages
Emanuela Cresti
|
Fernanda Bacelar do Nascimento
|
Antonio Moreno Sandoval
|
Jean Veronis
|
Philippe Martin
|
Khalid Choukri
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
The C-ORAL-ROM project has delivered a multilingual corpus of spontaneous speech for the main Romance languages (Italian, French, Portuguese and Spanish). The collection aims to represent the variety of speech acts performed in everyday language and to enable the description of prosodic and syntactic structures in the four Romance languages. Sampling criteria are defined in a corpus design scheme. C-ORAL-ROM adopts two different sampling strategies, one for the formal and one for the informal part: while a set of typical domains of application is selected to document the formal use of language, the informal part documents speech variation using parameters referring to the event’s structure (dialogue vs. monologue) and the sociological domain of use (family-private vs. public). The four Romance corpora are tagged with respect to terminal and non-terminal prosodic breaks. Terminal breaks are assumed to be the more relevant cues for the identification of relevant linguistic domains in spontaneous speech (utterances). Relations with other concurrent criteria are discussed. The multimedia storage of the C-ORAL-ROM corpus is based on this principle; each textual string ending with a terminal break is aligned, through the Win Pitch speech software, to its acoustic counterpart, generating the database of all utterances.
pdf
ENABLER Thematic Network of National Projects: Technical, Strategic and Political Issues of LRs
Nicoletta Calzolari
|
Khalid Choukri
|
Maria Gavrilidou
|
Bente Maegaard
|
Paola Baroni
|
Hanne Fersøe
|
Alessandro Lenci
|
Valérie Mapelli
|
Monica Monachini
|
Stelios Piperidis
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
pdf
OrienTel - Telephony Databases Across Northern Africa and the Middle East
Dorota Iskra
|
Rainer Siemund
|
Jamal Borno
|
Asuncion Moreno
|
Ossama Emam
|
Khalid Choukri
|
Oren Gedge
|
Herbert Tropf
|
Albino Nogueiras
|
Imed Zitouni
|
Anastasios Tsopanoglou
|
Nikos Fakotakis
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
pdf
Development of New Telephone Speech Databases for French: the NEOLOGOS Project
Elisabeth Pinto
|
Delphine Charlet
|
Hélène François
|
Djamel Mostefa
|
Olivier Boëffard
|
Dominique Fohr
|
Odile Mella
|
Frédéric Bimbot
|
Khalid Choukri
|
Yann Philip
|
Francis Charpentier
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
pdf
The ESTER Evaluation Campaign for the Rich Transcription of French Broadcast News
G. Gravier
|
J-F. Bonastre
|
E. Geoffrois
|
S. Galliano
|
K. McTait
|
K. Choukri
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
pdf
Recent Activities within the European Language Resources Association: Issues on Sharing Language Resources and Evaluation
Khalid Choukri
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
pdf
Network of Data Centres (NetDC): BNSC - An Arabic Broadcast News Speech Corpus
Khalid Choukri
|
Mahtab Nikkhou
|
Niklas Paulsson
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
pdf
Technolangue: A Permanent Evaluation and Information Infrastructure
Valérie Mapelli
|
Maria Nava
|
Sylvain Surcin
|
Djamel Mostefa
|
Khalid Choukri
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2003
pdf
Setting up an Evaluation Infrastructure for Human Language Technologies in Europe
Kevin McTait
|
Khalid Choukri
Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: are evaluation methods, metrics and resources reusable?
2002
pdf
OrienTel - Multilingual access to interactive communication services for the Mediterranean and the Middle East
Rainer Siemund
|
Barbara Heuft
|
Khalid Choukri
|
Ossama Emam
|
Emmanuel Maragoudakis
|
Herbert Tropf
|
Oren Gedge
|
Sherrie Shammass
|
Asuncion Moreno
|
Albino Nogueiras Rodriguez
|
Imed Zitouni
|
Dorota Iskra
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
pdf
Give me a bug. a framework for a bug report service
Henk van den Heuvel
|
Khalid Choukri
|
Harald Höge
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
pdf
The C-ORAL-ROM Project. New methods for spoken language archives in a multilingual romance corpus
Emanuela Cresti
|
Massimo Moneglia
|
Fernanda Bacelar do Nascimento
|
Antonio Moreno Sandoval
|
Jean Veronis
|
Philippe Martin
|
Khalid Choukri
|
Valerie Mapelli
|
Daniele Falavigna
|
Antonio Cid
|
Claude Blum
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
2000
pdf
SLR Validation: Present State of Affairs and Prospects
Henk van den Heuvel
|
Lou Boves
|
Khalid Choukri
|
Simo Goddijn
|
Eric Sanders
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
pdf
For a Repository of NLP Tools
Stéphane Chaudiron
|
Khalid Choukri
|
Audrey Mance
|
Valérie Mapelli
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
pdf
Survey of Language Engineering Needs: a Language Resources Perspective
Jeffrey Allen
|
Khalid Choukri
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
pdf
SPEECHDAT-CAR. A Large Speech Database for Automotive Environments
Asunción Moreno
|
Børge Lindberg
|
Christoph Draxler
|
Gaël Richard
|
Khalid Choukri
|
Stephan Euler
|
Jeffrey Allen
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
pdf
Recent Developments within the European Language Resources Association (ELRA)
Khalid Choukri
|
Audrey Mance
|
Valérie Mapelli
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)