2018
Measuring Innovation in Speech and Language Processing Publications.
Joseph Mariani | Gil Francopoulo | Patrick Paroubek
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Predictive Modeling: Guessing the NLP Terms of Tomorrow
Gil Francopoulo | Joseph Mariani | Patrick Paroubek
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Predictive modeling, often called “predictive analytics” in a commercial context, encompasses a variety of statistical techniques that analyze historical and present facts to make predictions about unknown events. Often the unknown events are in the future, but prediction can be applied to any type of unknown whether it be in the past or future. In our case, we present some experiments applying predictive modeling to the usage of technical terms within the NLP domain.
The CAMOMILE Collaborative Annotation Platform for Multi-modal, Multi-lingual and Multi-media Documents
Johann Poignant | Mateusz Budnik | Hervé Bredin | Claude Barras | Mickael Stefas | Pierrick Bruneau | Gilles Adda | Laurent Besacier | Hazim Ekenel | Gil Francopoulo | Javier Hernando | Joseph Mariani | Ramon Morros | Georges Quénot | Sophie Rosset | Thomas Tamisier
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper, we describe the organization and the implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analysis which can be performed on 3M data, the structure of the server was kept intentionally simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated with a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored specifically to the needed task can then be developed in an agile way, relying on simple but reliable services for the management of the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server; during a dry run experiment, the manual annotation of 716 speech segments was thus propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed as open source.
A Study of Reuse and Plagiarism in LREC papers
Gil Francopoulo | Joseph Mariani | Patrick Paroubek
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The aim of this experiment is to present an easy way to compare fragments of texts in order to detect (supposed) results of copy & paste operations between articles in the domain of Natural Language Processing (NLP). The search space of the comparisons is a corpus labeled as NLP4NLP gathering a large part of the NLP field. The study is centered on LREC papers in both directions, first with an LREC paper borrowing a fragment of text from the collection, and secondly in the reverse direction with fragments of LREC documents borrowed and inserted in the collection.
A Study of Reuse and Plagiarism in Speech and Natural Language Processing papers
Joseph Mariani | Gil Francopoulo | Patrick Paroubek
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)
Providing and Analyzing NLP Terms for our Community
Gil Francopoulo | Joseph Mariani | Patrick Paroubek | Frédéric Vernier
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)
By its very nature, the Natural Language Processing (NLP) community is a priori the best equipped to study the evolution of its own publications, but work in this direction is rare, and only recently have we seen a few attempts at charting the field. In this paper, we use the algorithms, resources, standards, tools and common practices of the NLP field to build a list of terms characteristic of ongoing research, by mining a large corpus of scientific publications, aiming at the greatest possible exhaustivity and covering the longest possible time span. Studying the evolution of this term list through time reveals interesting insights into the dynamics of the field. The availability of the term database and (for a large part) of the corpus makes many further comparative studies possible, in addition to providing a test bed for a new graphical interface designed to perform visual time analytics of large thesauri.
2014
Rediscovering 15 Years of Discoveries in Language Resources and Evaluation: The LREC Anthology Analysis
Joseph Mariani | Patrick Paroubek | Gil Francopoulo | Olivier Hamon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper aims at analyzing the content of the LREC conferences contained in the ELRA Anthology over the past 15 years (1998-2013). It follows similar exercises that have been conducted, such as the survey on the IEEE ICASSP conference series from 1976 to 1990, which served in the launching of the ESCA Eurospeech conference, a survey of the Association for Computational Linguistics (ACL) over 50 years of existence, which was presented at the ACL conference in 2012, or a survey over the 25 years (1987-2012) of the conferences contained in the ISCA Archive, presented at Interspeech 2013. It first contains an analysis of the evolution of the number of papers and authors over time, including the study of their gender, nationality and affiliation, and of the collaboration among authors. It then studies the funding sources of the research investigations that are reported in the papers. It conducts an analysis of the evolution of the research topics within the community over time. It finally looks at reuse and plagiarism in the papers. The survey shows the present trends in the conference series and in the Language Resources and Evaluation scientific community. Conducting this survey also demonstrated the importance of a clear and unique identification of authors, papers and other sources to facilitate the analysis. This survey is preliminary, as many other aspects also deserve attention. But we hope it will help in better understanding and forging our community in the global village.
Facing the Identification Problem in Language-Related Scientific Data Analysis.
Joseph Mariani | Christopher Cieri | Gil Francopoulo | Patrick Paroubek | Marine Delaborde
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper describes the problems that must be addressed when studying large amounts of data over time which require entity normalization applied not to the usual genres of news or political speech, but to the genre of academic discourse about language resources, technologies and sciences. It reports on the normalization processes that had to be applied to produce data usable for computing statistics in three past studies on the LRE Map, the ISCA Archive and the LDC Bibliography. It shows the need for human expertise during normalization and the necessity to adapt the work to the study objectives. It investigates possible improvements for reducing the workload necessary to produce comparable results. Through this paper, we show the necessity to define and agree on international persistent and unique identifiers.
2013
Improving Minor Opinion Polarity Classification with Named Entity Analysis (L’apport des Entités Nommées pour la classification des opinions minoritaires) [in French]
Amel Fraisse | Patrick Paroubek | Gil Francopoulo
Proceedings of TALN 2013 (Volume 2: Short Papers)
2012
The LRE Map. Harmonising Community Descriptions of Resources
Nicoletta Calzolari | Riccardo Del Gratta | Gil Francopoulo | Joseph Mariani | Francesco Rubino | Irene Russo | Claudia Soria
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Accurate and reliable documentation of Language Resources is an indisputable need: documentation is the gateway to discovery of Language Resources, a necessary step towards promoting the data economy. Language resources that are not documented virtually do not exist: for this reason every initiative able to collect and harmonise metadata about resources represents a valuable opportunity for the NLP community. In this paper we describe the LRE Map, reporting statistics on resources associated with LREC2012 papers and providing comparisons with LREC2010 data. The LRE Map, jointly launched by FLaReNet and ELRA in conjunction with the LREC 2010 Conference, is an instrument for enhancing the availability of information about resources, whether new or already existing. It aims to reinforce and facilitate the use of standards in the community. The LRE Map web interface makes it possible to search according to a fixed set of metadata and to view the details of extracted resources. The LRE Map continues to collect bottom-up input about resources from authors of other conferences through the standard submission process. This will help broaden the notion of language resources and attract to the field neighboring disciplines that so far have been only marginally involved in the standard notion of language resources.
The META-SHARE Metadata Schema for the Description of Language Resources
Maria Gavrilidou | Penny Labropoulou | Elina Desipri | Stelios Piperidis | Haris Papageorgiou | Monica Monachini | Francesca Frontini | Thierry Declerck | Gil Francopoulo | Victoria Arranz | Valerie Mapelli
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper presents a metadata model for the description of language resources proposed in the framework of the META-SHARE infrastructure, aiming to cover both datasets and tools/technologies used for their processing. It places the model in the overall framework of metadata models, describes the basic principles and features of the model, elaborates on the distinction between minimal and maximal versions thereof, briefly presents the integrated environment supporting the LRs description and search and retrieval processes and concludes with work to be done in the future for the improvement of the model.
2011
A Metadata Schema for the Description of Language Resources (LRs)
Maria Gavrilidou | Penny Labropoulou | Stelios Piperidis | Monica Monachini | Francesca Frontini | Gil Francopoulo | Victoria Arranz | Valérie Mapelli
Proceedings of the Workshop on Language Resources, Technology and Services in the Sharing Paradigm
2010
MLIF : A Metamodel to Represent and Exchange Multilingual Textual Information
Samuel Cruz-Lara | Gil Francopoulo | Laurent Romary | Nasredine Semmar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The fast evolution of language technology has produced pressing needs in standardization. The multiplicity of language resource representation levels and the specialization of these representations make interaction between linguistic resources and the components that manipulate them difficult. In this paper, we describe the MultiLingual Information Framework (MLIF ― ISO CD 24616). MLIF is a metamodel which allows the representation and the exchange of multilingual textual information. This generic metamodel is designed to provide a common platform for all the tools developed around the existing multilingual data exchange formats. This platform provides, on the one hand, a set of generic data categories for various application domains, and on the other hand, strategies for interoperability with existing standards. The objective is to reach a better convergence between heterogeneous standardisation activities that are taking place in the domains of data modeling (XML; W3C), text management (TEI; TEIC), multilingual information (TMX-LISA; XLIFF-OASIS) and multimedia (SMILText; W3C). This is a work in progress within ISO-TC37 in order to define a new ISO standard.
PASSAGE Syntactic Representation: a Minimal Common Ground for Evaluation
Anne Vilnat | Patrick Paroubek | Eric Villemonte de la Clergerie | Gil Francopoulo | Marie-Laure Guénot
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The current PASSAGE syntactic representation is the result of 9 years of constant evolution with the aim of providing a common ground for evaluating parsers of French, whatever their type and supporting theory. In this paper we present the latest developments concerning the formalism and show first, through a review of basic linguistic phenomena, that it is a plausible minimal common ground for representing French syntax in the context of generic black box quantitative objective evaluation. For the phenomena reviewed, which include the notion of syntactic head, apposition, control and coordination, we explain how the PASSAGE representation relates to other syntactic representation schemes for French and English, slightly extending the annotation to address English when needed. Second, we describe the XML format chosen for PASSAGE and show that it is compliant with the latest proposals for linguistic annotation standards. We conclude by discussing the influence that corpus-based evaluation has on the characteristics of syntactic representation when one wishes to assess the performance of any kind of parser.
2008
Large Scale Production of Syntactic Annotations to Move Forward
Anne Vilnat | Gil Francopoulo | Olivier Hamon | Sylvain Loiseau | Patrick Paroubek | Eric Villemonte de la Clergerie
Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation
2007
Evaluating SYNLEX (Évaluer SYNLEX) [in French]
Ingrid Falk | Gil Francopoulo | Claire Gardent
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
SYNLEX is a syntactic lexicon extracted semi-automatically from the LADL tables. Like the other syntactic lexicons of French that are available and usable for NLP (LEFFF, DICOVALENCE), it is incomplete and has not been evaluated in a way that determines its recall and precision against a reference lexicon. We present an approach that fills these gaps at least partially. The approach relies on methods developed for automatic lexicon acquisition. A syntactic lexicon distinct from SYNLEX is acquired from a corpus of 82 million words and then used to validate and complete SYNLEX. The recall and precision of this improved version of SYNLEX are then computed against a reference lexicon extracted from DICOVALENCE.
Modeling the Inflection Paradigms of Arabic Verbs According to the LMF Standard - ISO 24613 (Modélisation des paradigmes de flexion des verbes arabes selon la norme LMF - ISO 24613) [in French]
Aïda Khemakhem | Bilel Gargouri | Abdelhamid Abdelwahed | Gil Francopoulo
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Posters
In this article, we specify the inflection paradigms of Arabic verbs in accordance with version 9 of LMF (Lexical Markup Framework), the future ISO 24613 standard that deals with the standardization of lexical databases. The specification of these paradigms is based on a combination of roots and patterns. In particular, we highlight the root endings that are sensitive to the addition of suffixes, in order to cover situations not considered in existing work. The verbal inflection paradigms we propose constitute an intensional description of ArabicLDB (Arabic Lexical DataBase), a standardized lexical database for the Arabic language. Our work is illustrated by the implementation of an Arabic verb conjugator built from ArabicLDB.
2006
Lexical Markup Framework (LMF)
Gil Francopoulo | Monte George | Nicoletta Calzolari | Monica Monachini | Nuria Bel | Mandy Pet | Claudia Soria
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Optimizing the production, maintenance and extension of lexical resources is one of the crucial aspects impacting Natural Language Processing (NLP). A second aspect involves optimizing the process leading to their integration in applications. In this respect, we believe that the production of a consensual specification on lexicons can be a useful aid for the various NLP actors. Within ISO, the purpose of LMF is to define a standard for lexicons. LMF is a model that provides a common standardized framework for the construction of NLP lexicons. The goals of LMF are to provide a common model for the creation and use of lexical resources, to manage the exchange of data between and among these resources, and to enable the merging of a large number of individual electronic resources to form extensive global electronic resources. In this paper, we describe the work in progress within the sub-group ISO-TC37/SC4/WG4. Experts from many countries have been consulted in order to take into account best practices in many languages for (we hope) all kinds of NLP lexicons.
Lexical Markup Framework (LMF) for NLP Multilingual Resources
Gil Francopoulo | Nuria Bel | Monte George | Nicoletta Calzolari | Monica Monachini | Mandy Pet | Claudia Soria
Proceedings of the Workshop on Multilingual Language Resources and Interoperability
2004
Standards going concrete : from LMF to Morphalou
Laurent Romary | Susanne Salmon-Alt | Gil Francopoulo
Proceedings of the Workshop on Enhancing and Using Electronic Dictionaries
1988
Language Learning as Problem Solving
Michael Zock | Gil Francopoulo | Abdellatif Laroui
Coling Budapest 1988 Volume 2: International Conference on Computational Linguistics