Adrien Barbaresi


2022

Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
Piotr Bański | Adrien Barbaresi | Simon Clematide | Marc Kupietz | Harald Lüngen
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)

2021

Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction
Adrien Barbaresi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

An essential operation in web corpus construction consists in retaining the desired content while discarding the rest. Another challenge is finding one’s way through websites. This article introduces a text discovery and extraction tool published under an open-source license. Its installation and use are straightforward, notably from Python and on the command line. The software allows for main text, comment, and metadata extraction, while also providing building blocks for web crawling tasks. A comparative evaluation on real-world data demonstrates its usefulness and reports the performance of other available solutions. The contributions of this paper are threefold: it references the software, features a benchmark, and provides a meaningful baseline for similar tasks. The tool performs significantly better than other open-source solutions in this evaluation and in external benchmarks.
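
To illustrate what "retaining the desired content while discarding the rest" can mean in practice, here is a minimal sketch using only the Python standard library. It is not the algorithm or API of the tool described above: the tag lists `SKIP_TAGS` and `KEEP_TAGS` and the function `extract_main_text` are hypothetical names chosen for this example, and the heuristic is deliberately naive.

```python
from html.parser import HTMLParser

# Assumption: containers treated as boilerplate in this toy heuristic.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumption: tags whose text counts as main content.
KEEP_TAGS = {"p", "h1", "h2", "h3", "li", "blockquote"}


class MainTextParser(HTMLParser):
    """Collect text from content tags that sit outside boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate container
        self.keep_depth = 0  # > 0 while inside a content-bearing tag
        self.blocks = []     # finished text blocks, in document order
        self._buffer = []    # text fragments of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in KEEP_TAGS and self.skip_depth == 0:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in KEEP_TAGS and self.keep_depth > 0:
            self.keep_depth -= 1
            if self.keep_depth == 0:
                # Normalize whitespace and store the finished block.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.blocks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self.skip_depth == 0 and self.keep_depth > 0:
            self._buffer.append(data)


def extract_main_text(html):
    """Return the kept text blocks joined by newlines."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.blocks)
```

On well-formed pages this keeps headings and paragraphs and drops navigation, scripts, and footers. Real-world pages need far more robust handling, which is the gap that dedicated extraction tools aim to fill.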

2020

Bien choisir son outil d’extraction de contenu à partir du Web (Choosing the appropriate tool for Web Content Extraction)
Gaël Lejeune | Adrien Barbaresi
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 4 : Démonstrations et résumés d'articles internationaux

We propose a demonstration of textual content extraction from web pages and of its evaluation. We focus on web pages containing text (news articles, online magazines, and blogs) and show that the texts can vary greatly along several dimensions: diachronic, geographic, and typological. Consequently, the tools and the corresponding evaluation measures should be treated with caution: the commonly used indicators that are supposed to guide end users' choice of the appropriate tool are both imprecise and difficult to interpret.

Que recèlent les données textuelles issues du web ? (What do text data from the Web have to hide?)
Adrien Barbaresi | Gaël Lejeune
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). 2e atelier Éthique et TRaitemeNt Automatique des Langues (ETeRNAL)

The opportunistic collection and use of textual data taken from the web are subject to a series of ethical, methodological, and epistemological problems that deserve the attention of the scientific community. We present empirical studies of their impact in linguistics and NLP, focusing on form (data extraction methods) as well as on substance (corpus content).

Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
Piotr Bański | Adrien Barbaresi | Simon Clematide | Marc Kupietz | Harald Lüngen | Ines Pisetta
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora

Proceedings of the 12th Web as Corpus Workshop
Adrien Barbaresi | Felix Bildhauer | Roland Schäfer | Egon Stemle
Proceedings of the 12th Web as Corpus Workshop

Out-of-the-Box and into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools
Adrien Barbaresi | Gaël Lejeune
Proceedings of the 12th Web as Corpus Workshop

This article examines extraction methods designed to retain the main text content of web pages and discusses how the extraction could be oriented and evaluated: can and should it be as generic as possible to ensure opportunistic corpus construction? The evaluation is grounded in a comparative benchmark of open-source tools used on pages in five different languages (Chinese, English, Greek, Polish and Russian); it features several metrics to obtain more fine-grained differentiations. Our experiments highlight the diversity of web page layouts across languages or publishing countries. These discrepancies are reflected in diverging performance, so the right tool has to be chosen accordingly.
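
As an illustration of how an extraction tool's output can be scored against a reference text, the sketch below computes bag-of-tokens precision, recall, and F1. The abstract does not name the metrics used in the benchmark, so this is one common choice and not necessarily the paper's; `token_prf` and its `tokenize` parameter are hypothetical names introduced here.

```python
from collections import Counter


def token_prf(extracted, gold, tokenize=str.split):
    """Bag-of-tokens precision, recall and F1 of `extracted` against `gold`.

    `tokenize` defaults to whitespace splitting; pass `list` to compare
    characters instead, which may suit languages written without spaces.
    """
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(gold))
    # Multiset intersection: each token counts at most as often as in both.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A tool that returns the whole page scores high recall but low precision, while an overly aggressive tool shows the opposite pattern. Reporting both alongside F1 gives a more fine-grained picture than a single score.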

2018

Computationally efficient discrimination between language varieties with large feature vectors and regularized classifiers
Adrien Barbaresi
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

The present contribution revolves around efficient approaches to language classification which have been field-tested in the VarDial evaluation campaign. The methods used in several language identification tasks comprising different language types are presented and their results are discussed, giving insights into real-world application of regularization, linear classifiers and corresponding linguistic features. The use of a specially adapted Ridge classifier proved useful in 2 tasks out of 3. The overall approach (XAC) has slightly outperformed most of the other systems on the DFS task (Dutch and Flemish) and on the ILI task (Indo-Aryan languages), while its comparative performance was poorer on the GDI task (Swiss German dialects).

A corpus of German political speeches from the 21st century
Adrien Barbaresi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A database of German definitory contexts from selected web sources
Adrien Barbaresi | Lothar Lemnitzer | Alexander Geyken
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Discriminating between Similar Languages using Weighted Subword Features
Adrien Barbaresi
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

The present contribution revolves around a contrastive subword n-gram model which has been tested in the Discriminating between Similar Languages shared task. I present and discuss the method used in this 14-way language identification task comprising varieties of 6 main language groups. It features the following characteristics: (1) the preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams features can be used as a system in a straightforward way; my approach outperforms most of the systems used in the DSL shared task (3rd rank).
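
The three characteristics listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the submitted system: the n-gram range (2 to 4), the use of raw counts with add-one smoothing in place of the paper's unspecified weighting, and the names `char_ngrams` and `NgramNaiveBayes` are all choices made for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=2, n_max=4):
    """Yield character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over sparse character n-gram counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # additive smoothing constant
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()         # label -> number of documents
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = Counter(char_ngrams(text))
            self.doc_counts[label] += 1
            self.counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total = sum(self.counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for gram in char_ngrams(text):
                if gram not in self.vocab:
                    continue  # n-grams unseen in training are ignored
                numerator = self.counts[label][gram] + self.alpha
                score += math.log(numerator / (total + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A toy usage example with two invented training sentences per language:

```python
model = NgramNaiveBayes().fit(
    ["the cat sat on the mat", "this is the house",
     "die katze sitzt auf der matte", "das ist das haus"],
    ["en", "en", "de", "de"],
)
print(model.predict("the house is on the mat"))
```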

2016

Efficient construction of metadata-enhanced web corpora
Adrien Barbaresi
Proceedings of the 10th Web as Corpus Workshop

An Unsupervised Morphological Criterion for Discriminating Similar Languages
Adrien Barbaresi
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

In this study, conducted on the occasion of the Discriminating between Similar Languages shared task, I introduce an additional decision factor focusing on the token and subtoken level. The motivation behind this submission is to test whether a morphologically informed criterion can add linguistically relevant information to global categorization and thus improve performance. The contributions of this paper are (1) a description of the unsupervised, low-resource method; (2) an evaluation and analysis of its raw performance; and (3) an assessment of its impact within a model comprising common indicators used in language identification. I present and discuss the systems used in task A, a 12-way language identification task comprising varieties of five main language groups. Additionally, I introduce a new off-the-shelf Naive Bayes classifier using a contrastive word and subword n-gram model (“Bayesline”) which outperforms the best submissions.

2014

Finding Viable Seed URLs for Web Corpora: A Scouting Approach and Comparative Study of Available Sources
Adrien Barbaresi
Proceedings of the 9th Web as Corpus Workshop (WaC-9)

Focused Web Corpus Crawling
Roland Schäfer | Adrien Barbaresi | Felix Bildhauer
Proceedings of the 9th Web as Corpus Workshop (WaC-9)

2013

Crawling microblogging services to gather language-classified URLs. Workflow and case study
Adrien Barbaresi
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop

2011

La complexité linguistique Méthode d’analyse
Adrien Barbaresi
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues (articles courts)

Linguistic complexity covers different phenomena whose relationship needs to be modeled. The work in progress described here offers a reflection on linguistic and technical approaches to this notion and applies a scan of texts that strives to contribute to their enrichment. This surface-level processing, carried out according to a list of criteria that sometimes approximate more elaborate logics, attempts to provide a “reasonable” picture of complexity.