Sabine Barreaux
The Machine Translation corpus is composed of scientific publications drawn from the ISTEX repository. Designed as a use case, it makes it possible to explore the history of machine translation through the metadata and full texts available for each of its documents. On the one hand, the metadata offer a first view of the machine translation landscape through bibliometric dashboards. On the other hand, applying text-mining tools to the full texts brings out information that would remain inaccessible without a close reading of the articles. The corpus is explored with Lodex, an open-source software package dedicated to showcasing structured data.
An essential prerequisite for many NLP and text-mining activities, building a corpus may require several processing phases to improve its quality and thus obtain the best results from automatic analysis. The post-processing applied to such a corpus, in particular to guarantee the relevance of its content and the homogeneity of its format, proves all the more costly and tedious when the working corpus was assembled imprecisely. This demonstration takes advantage of the ISTEX platform and its associated services to build, through an iterative cycle, a homogeneous corpus of scientifically relevant publications that can easily be used by text-mining tools.
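As an illustration of the iterative corpus-building cycle mentioned above, here is a minimal Python sketch that queries the ISTEX search API and inspects the returned hits before refining the query. It assumes the public endpoint at https://api.istex.fr/document/ with Lucene-style `q`, `size`, and `output` parameters and a JSON `hits` field; the exact parameter names and response fields should be checked against the current ISTEX documentation.

```python
import requests

ISTEX_API = "https://api.istex.fr/document/"  # assumed public ISTEX search endpoint

def search_istex(query, size=50):
    """Query the ISTEX search API and return the list of matching hits.

    Assumes a Lucene-style `q` parameter and a JSON response containing a
    `hits` list; adjust the parameters if the API differs.
    """
    params = {"q": query, "size": size,
              "output": "title,publicationDate,corpusName"}
    response = requests.get(ISTEX_API, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("hits", [])

if __name__ == "__main__":
    # One pass of the iterative cycle: query, inspect the results, refine the query.
    hits = search_istex('"machine translation" AND language:eng')
    for hit in hits[:5]:
        print(hit.get("publicationDate"), "-", hit.get("title"))
```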
To exploit scientific publications from research worldwide for TDM purposes, the ISTEX platform enriched its data with value-added information to ease access to its full-text documents. We designed an experiment to explore new enrichment possibilities for these documents, focusing on the recognition of scientific named entities that could be integrated into ISTEX resources. This led us to test two tools for detecting animal species names on a corpus of 100 documents in zoology. As a result, the French scientific community is provided with an annotated reference corpus that can be used to measure these tools’ performance.
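As a rough illustration of how the two species-name detection tools could be scored against the annotated reference corpus, the sketch below computes exact-span precision, recall, and F1. The tuple format and the sample annotations are hypothetical, not taken from the actual corpus.

```python
def precision_recall_f1(gold, predicted):
    """Exact-span precision/recall/F1 for entity annotations.

    `gold` and `predicted` are sets of (doc_id, start, end, label) tuples.
    """
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical annotations for one zoology document.
gold = {("doc1", 12, 29, "SPECIES"), ("doc1", 87, 103, "SPECIES")}
pred = {("doc1", 12, 29, "SPECIES"), ("doc1", 140, 152, "SPECIES")}
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```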
Keyphrase extraction is the task of finding phrases that represent the important content of a document, that is, textual units covering the most important topics it develops. The output keyphrases of automatic extraction methods are typically evaluated by comparing them to manually assigned reference keyphrases: each output keyphrase is considered correct if it matches one of the references. However, the choice of the appropriate textual unit (keyphrase) for a topic is sometimes subjective, and evaluating by exact matching underestimates performance. This paper presents a dataset of evaluation scores assigned to automatically extracted keyphrases by human evaluators. Along with the reference keyphrases, these manual evaluations can be used to validate new evaluation measures: a measure that is highly correlated with the manual evaluation is appropriate for evaluating automatic keyphrase extraction methods.
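To make the underlying idea concrete, here is a small Python sketch that scores extracted keyphrases by exact matching and then correlates those scores with human judgments using Spearman's rank correlation. The documents, keyphrases, and manual scores are invented for illustration, and the exact-match scorer is a generic stand-in rather than the paper's evaluation code.

```python
from scipy.stats import spearmanr

def exact_match_f1(output, reference):
    """F1 between output and reference keyphrases, counting a keyphrase as
    correct only on an exact (case-insensitive) match."""
    out = {k.lower() for k in output}
    ref = {k.lower() for k in reference}
    tp = len(out & ref)
    p = tp / len(out) if out else 0.0
    r = tp / len(ref) if ref else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical per-document data: extracted keyphrases, references, human score.
documents = [
    (["neural networks", "speech recognition"], ["neural network", "speech recognition"], 0.8),
    (["machine translation", "decoder"], ["statistical machine translation", "decoding"], 0.6),
    (["terminology extraction"], ["term extraction"], 0.4),
]
measure_scores = [exact_match_f1(out, ref) for out, ref, _ in documents]
human_scores = [h for _, _, h in documents]
rho, _ = spearmanr(measure_scores, human_scores)
print(f"correlation with manual evaluation: {rho:.2f}")
```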
The Quaero program organized a set of evaluations of terminology extraction systems in 2010 and 2011. This initiative had three objectives: to evaluate the behavior and scalability of term extractors with respect to corpus size, to assess progress between different versions of the same systems, and to measure the influence of corpus type. The protocol was a comparative analysis of 32 runs against a gold standard, with scores computed using metrics that take gradual relevance into account. Systems produced by Quaero partners and publicly available systems were evaluated on pharmacology corpora composed of European patents or abstracts of scientific articles, all in English. The gold standard was an unstructured version of the pharmacology thesaurus used by INIST-CNRS for indexing purposes. Most systems scaled to the large corpora, clear differences were observed between versions of the same systems, and results were better on scientific articles than on patents. During the ongoing adjudication phase, domain experts are enriching the thesaurus with terms found by several systems.
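The Quaero metrics themselves are not reproduced here; the following sketch only illustrates the general idea of scoring a run against a gold standard with gradual rather than binary relevance, using an invented grading scheme and toy pharmacology terms.

```python
def graded_relevance(term, gold_terms):
    """Assign a graded relevance score to an extracted term: 1.0 for an exact
    match with the gold standard, 0.5 when the term is a sub-term or super-term
    of a gold entry, 0.0 otherwise.  This grading scheme is illustrative only,
    not the metric used in the Quaero protocol."""
    t = term.lower()
    gold = [g.lower() for g in gold_terms]
    if t in gold:
        return 1.0
    if any(t in g or g in t for g in gold):
        return 0.5
    return 0.0

def run_score(run, gold_terms):
    """Average graded relevance over all terms produced by one run."""
    return sum(graded_relevance(t, gold_terms) for t in run) / len(run) if run else 0.0

gold = ["calcium channel blocker", "beta-adrenergic antagonist"]
run = ["calcium channel blocker", "channel blocker", "aspirin"]
print(run_score(run, gold))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```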