2024
FRAPPE: FRAming, Persuasion, and Propaganda Explorer
Ahmed Sajwani
|
Alaa El Setohy
|
Ali Mekky
|
Diana Turmakhan
|
Lara Hassan
|
Mohamed El Zeftawy
|
Omar El Herraoui
|
Osama Mohammed Afzal
|
Qisheng Liao
|
Tarek Mahmoud
|
Zain Muhammad Mujahid
|
Muhammad Umar Salman
|
Muhammad Arslan Manzoor
|
Massa Baali
|
Jakub Piskorski
|
Nicolas Stefanovitch
|
Giovanni Da San Martino
|
Preslav Nakov
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
The abundance of news sources and the urgent demand for reliable information have led to serious concerns about the threat of misleading information. In this paper, we present FRAPPE, a FRAming, Persuasion, and Propaganda Explorer system. FRAPPE goes beyond conventional news analysis of articles and unveils the intricate linguistic techniques used to shape readers’ opinions and emotions. Our system allows users not only to analyze individual articles for their genre, framings, and use of persuasion techniques, but also to draw comparisons between the strategies of persuasion and framing adopted by a diverse pool of news outlets and countries across multiple languages for different topics, thus providing a comprehensive understanding of how information is presented and manipulated. FRAPPE is publicly accessible at https://frappe.streamlit.app/ and a video explaining our system is available at https://www.youtube.com/watch?v=3RlTfSVnZmk
Cross-lingual Named Entity Corpus for Slavic Languages
Jakub Piskorski
|
Michał Marcińczuk
|
Roman Yangarber
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper presents a corpus manually annotated with named entities for six Slavic languages — Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017–2023 as a part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits — single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture with the pre-trained multilingual models — XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.
Exploring the Usability of Persuasion Techniques for Downstream Misinformation-related Classification Tasks
Nikolaos Nikolaidis
|
Jakub Piskorski
|
Nicolas Stefanovitch
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We systematically explore the predictive power of features derived from Persuasion Techniques detected in texts, for solving different tasks of interest for media analysis; notably: detecting mis/disinformation, fake news, propaganda, partisan news and conspiracy theories. Firstly, we propose a set of meaningful features, aiming to capture the persuasiveness of a text. Secondly, we assess the discriminatory power of these features in different text classification tasks on 8 selected datasets from the literature using two metrics. We also evaluate the per-task discriminatory power of each Persuasion Technique and report on different insights. We find that most of these features have a noticeable potential to distinguish conspiracy theories, hyperpartisan news and propaganda, while we observed mixed results in the context of fake news detection.
2023
SemEval-2023 Task 3: Detecting the Category, the Framing, and the Persuasion Techniques in Online News in a Multi-lingual Setup
Jakub Piskorski
|
Nicolas Stefanovitch
|
Giovanni Da San Martino
|
Preslav Nakov
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
We describe SemEval-2023 task 3 on Detecting the Category, the Framing, and the Persuasion Techniques in Online News in a Multilingual Setup: the dataset, the task organization process, the evaluation setup, the results, and the participating systems. The task focused on news articles in nine languages: six known to the participants upfront (English, French, German, Italian, Polish, and Russian) and three additional ones revealed to the participants at the testing phase (Spanish, Greek, and Georgian). The task featured three subtasks: (1) determining the genre of the article (opinion, reporting, or satire), (2) identifying one or more frames used in an article from a pool of 14 generic frames, and (3) identifying the persuasion techniques used in each paragraph of the article, using a taxonomy of 23 persuasion techniques. This was a very popular task: a total of 181 teams registered to participate, and 41 eventually made an official submission on the test set.
Holistic Inter-Annotator Agreement and Corpus Coherence Estimation in a Large-scale Multilingual Annotation Campaign
Nicolas Stefanovitch
|
Jakub Piskorski
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
In this paper, we report on the complexity of persuasion technique annotation in the context of a large multilingual annotation campaign involving 6 languages and approximately 40 annotators. We highlight the techniques that appear to be difficult for humans to annotate and elaborate on our findings on the causes of this phenomenon. We introduce Holistic IAA, a new word embedding-based annotator agreement metric, and we report on various experiments using this metric and its correlation with the traditional Inter Annotator Agreement (IAA) metrics. However, given somewhat limited and loose interaction between annotators, i.e., only a few annotators annotate the same document subsets, we try to devise a way to assess the coherence of the entire dataset and strive to find a good proxy for IAA between annotators tasked to annotate different documents and in different languages, for which classical IAA metrics cannot be applied.
Multilingual Multifaceted Understanding of Online News in Terms of Genre, Framing, and Persuasion Techniques
Jakub Piskorski
|
Nicolas Stefanovitch
|
Nikolaos Nikolaidis
|
Giovanni Da San Martino
|
Preslav Nakov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present a new multilingual multifacet dataset of news articles, each annotated for genre (objective news reporting vs. opinion vs. satire), framing (what key aspects are highlighted), and persuasion techniques (logical fallacies, emotional appeals, ad hominem attacks, etc.). The persuasion techniques are annotated at the span level, using a taxonomy of 23 fine-grained techniques grouped into 6 coarse categories. The dataset contains 1,612 news articles covering recent news on current topics of public interest in six European languages (English, French, German, Italian, Polish, and Russian), with more than 37k annotated spans of persuasion techniques. We describe the dataset and the annotation process, and we report the evaluation results of multilabel classification experiments using state-of-the-art multilingual transformers at different levels of granularity: token-level, sentence-level, paragraph-level, and document-level.
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
Jakub Piskorski
|
Michał Marcińczuk
|
Preslav Nakov
|
Maciej Ogrodniczuk
|
Senja Pollak
|
Pavel Přibáň
|
Piotr Rybak
|
Josef Steinberger
|
Roman Yangarber
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
On Experiments of Detecting Persuasion Techniques in Polish and Russian Online News: Preliminary Study
Nikolaos Nikolaidis
|
Nicolas Stefanovitch
|
Jakub Piskorski
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
This paper reports on the results of preliminary experiments on the detection of persuasion techniques in online news in Polish and Russian, using a taxonomy of 23 persuasion techniques. The evaluation addresses different aspects, namely, the granularity of the persuasion technique category, i.e., coarse- (6 labels) versus fine-grained (23 labels), and the focus of the classification, i.e., at which level the labels are detected (subword, sentence, or paragraph). We compare the performance of mono- versus multi-lingual-trained state-of-the-art transformer-based models in this context.
Slav-NER: the 4th Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic languages
Roman Yangarber
|
Jakub Piskorski
|
Anna Dmitrieva
|
Michał Marcińczuk
|
Pavel Přibáň
|
Piotr Rybak
|
Josef Steinberger
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
This paper describes Slav-NER: the 4th Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. This version of the Challenge covers three languages and five entity types. It is organized as part of the 9th Slavic Natural Language Processing Workshop, co-located with the EACL 2023 Conference. Seven teams registered and three participated actively in the competition. Performance for the named entity recognition and normalization tasks reached 90% F1 measure, much higher than reported in the first edition of the Challenge, but similar to the results reported in the latest edition. Performance for the entity linking task for individual languages reached the range of 72-80% F1 measure. Detailed evaluation information is available on the Shared Task web page.
2022
Resources and Experiments on Sentiment Classification for Georgian
Nicolas Stefanovitch
|
Jakub Piskorski
|
Sopho Kharazi
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper presents, to the best of our knowledge, the first ever publicly available annotated dataset for sentiment classification and semantic polarity dictionary for Georgian. The characteristics of these resources and the process of their creation are described in detail. The results of various experiments on the performance of both lexicon- and machine learning-based models for Georgian sentiment classification are also reported. Both 3-label (positive, neutral, negative) and 4-label settings (same labels + mixed) are considered. The machine learning models explored include, i.a., logistic regression, SVMs, and transformer-based models. We also explore transfer learning- and translation-based (to a well-supported language) approaches. The obtained results for Georgian are on par with the state-of-the-art results in sentiment classification for well-studied languages when using training data of comparable size.
2021
Exploring Linguistically-Lightweight Keyword Extraction Techniques for Indexing News Articles in a Multilingual Set-up
Jakub Piskorski
|
Nicolas Stefanovitch
|
Guillaume Jacquet
|
Aldo Podavini
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation
This paper presents a study of state-of-the-art unsupervised and linguistically unsophisticated keyword extraction algorithms, based on statistic-, graph-, and embedding-based approaches, including, i.a., Total Keyword Frequency, TF-IDF, RAKE, KPMiner, YAKE, KeyBERT, and variants of TextRank-based keyword extraction algorithms. The study was motivated by the need to select the most appropriate technique to extract keywords for indexing news articles in a real-world large-scale news analysis engine. The algorithms were evaluated on a corpus of circa 330 news articles in 7 languages. The overall best F1 scores for all languages on average were obtained using a combination of the recently introduced YAKE algorithm and KPMiner (20.1%, 46.6% and 47.2% for exact, partial and fuzzy matching resp.).
Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021): Workshop and Shared Task Report
Ali Hürriyetoğlu
|
Hristo Tanev
|
Vanni Zavarella
|
Jakub Piskorski
|
Reyyan Yeniterzi
|
Osman Mutlu
|
Deniz Yuret
|
Aline Villavicencio
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)
This workshop is the fourth issue of a series of workshops on automatic extraction of socio-political events from news, organized by the Emerging Market Welfare Project, with the support of the Joint Research Centre of the European Commission and with contributions from many other prominent scholars in this field. The purpose of this series of workshops is to foster research and development of reliable, valid, robust, and practical solutions for automatically detecting descriptions of socio-political events, such as protests, riots, wars and armed conflicts, in text streams. This year's workshop contributors make use of state-of-the-art NLP technologies, such as Deep Learning, Word Embeddings and Transformers, and cover a wide range of topics from text classification to news bias detection. Around 40 teams have registered and 15 teams contributed to three tasks: i) multilingual protest news detection, ii) fine-grained classification of socio-political events, and iii) discovering Black Lives Matter protest events. The workshop also highlights two keynote and four invited talks about various aspects of creating event data sets and multi- and cross-lingual machine learning in few- and zero-shot settings.
Fine-grained Event Classification in News-like Text Snippets - Shared Task 2, CASE 2021
Jacek Haneczok
|
Guillaume Jacquet
|
Jakub Piskorski
|
Nicolas Stefanovitch
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)
This paper describes the Shared Task on Fine-grained Event Classification in News-like Text Snippets. The Shared Task is divided into three sub-tasks: (a) classification of text snippets reporting socio-political events (25 classes) for which a vast amount of training data exists, although exhibiting different structure and style vis-a-vis test data, (b) enhancement to a generalized zero-shot learning problem, where 3 additional event types were introduced in advance, but without any training data (‘unseen’ classes), and (c) further extension, which introduced 2 additional event types, announced shortly prior to the evaluation phase. The reported Shared Task focuses on classification of events in English texts and is organized as part of the Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021), co-located with the ACL-IJCNLP 2021 Conference. Four teams participated in the task. The best-performing systems for the three aforementioned sub-tasks achieved 83.9%, 79.7% and 77.1% weighted F1 scores, respectively.
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Bogdan Babych
|
Olga Kanishcheva
|
Preslav Nakov
|
Jakub Piskorski
|
Lidia Pivovarova
|
Vasyl Starko
|
Josef Steinberger
|
Roman Yangarber
|
Michał Marcińczuk
|
Senja Pollak
|
Pavel Přibáň
|
Marko Robnik-Šikonja
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Slav-NER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski
|
Bogdan Babych
|
Zara Kancheva
|
Olga Kanishcheva
|
Maria Lebedeva
|
Michał Marcińczuk
|
Preslav Nakov
|
Petya Osenova
|
Lidia Pivovarova
|
Senja Pollak
|
Pavel Přibáň
|
Ivaylo Radev
|
Marko Robnik-Šikonja
|
Vasyl Starko
|
Josef Steinberger
|
Roman Yangarber
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.
2020
TF-IDF Character N-grams versus Word Embedding-based Models for Fine-grained Event Classification: A Preliminary Study
Jakub Piskorski
|
Guillaume Jacquet
Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020
Automating the detection of event mentions in online texts and their classification vis-a-vis domain-specific event type taxonomies has been acknowledged by many organisations worldwide to be of paramount importance in order to facilitate the process of intelligence gathering. This paper reports on some preliminary experiments comparing various linguistically-lightweight approaches for fine-grained event classification based on short text snippets reporting on events. In particular, we compare the performance of a TF-IDF-weighted character n-gram SVM-based model versus SVMs trained on various off-the-shelf pre-trained word embeddings (GloVe, BERT, FastText) as features. We exploit a relatively large event corpus consisting of circa 610K short text event descriptions classified using 25 event categories that cover political violence and protest events. The best results, i.e., 83.5% macro and 92.4% micro F1 score, were obtained using the TF-IDF-weighted character n-gram model.
New Benchmark Corpus and Models for Fine-grained Event Classification: To BERT or not to BERT?
Jakub Piskorski
|
Jacek Haneczok
|
Guillaume Jacquet
Proceedings of the 28th International Conference on Computational Linguistics
We introduce a new set of benchmark datasets derived from ACLED data for fine-grained event classification and compare the performance of various state-of-the-art models on these datasets, including SVM based on TF-IDF character n-grams and neural context-free embeddings (GLOVE and FASTTEXT) as well as deep learning-based BERT with its contextual embeddings. The best results in terms of micro (94.3-94.9%) and macro F1 (86.0-88.9%) were obtained using BERT transformer, with simpler TF-IDF character n-gram based SVM being an interesting alternative. Further, we discuss the pros and cons of the considered benchmark models in terms of their robustness and the dependence of the classification performance on the size of training data.
2019
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec
|
Michał Marcińczuk
|
Preslav Nakov
|
Jakub Piskorski
|
Lidia Pivovarova
|
Jan Šnajder
|
Josef Steinberger
|
Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski
|
Laska Laskova
|
Michał Marcińczuk
|
Lidia Pivovarova
|
Pavel Přibáň
|
Josef Steinberger
|
Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams participated in the competition, which covered four languages and five entity types. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all four languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.
JRC TMA-CC: Slavic Named Entity Recognition and Linking. Participation in the BSNLP-2019 shared task
Guillaume Jacquet
|
Jakub Piskorski
|
Hristo Tanev
|
Ralf Steinberger
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
We report on the participation of the JRC Text Mining and Analysis Competence Centre (TMA-CC) in the BSNLP-2019 Shared Task, which focuses on named-entity recognition, lemmatisation and cross-lingual linking. We propose a hybrid system combining a rule-based approach and light ML techniques. We use multilingual lexical resources such as JRC-NAMES and BABELNET together with a named entity guesser to recognise names. In a second step, we combine known names with wild cards to increase recognition recall by also capturing inflection variants. In a third step, we increase precision by filtering these name candidates with automatically learnt inflection patterns derived from name occurrences in large news article collections. Our major requirement is to achieve high precision. We achieved an average of 65% F-measure with 93% precision on the four languages.
2018
On Training Classifiers for Linking Event Templates
Jakub Piskorski
|
Fredi Šarić
|
Vanni Zavarella
|
Martin Atkinson
Proceedings of the Workshop Events and Stories in the News 2018
The paper reports on exploring various machine learning techniques and a range of textual and meta-data features to train classifiers for linking related event templates automatically extracted from online news. With the best model using textual features only we achieved 94.7% (92.9%) F1 score on GOLD (SILVER) dataset. These figures were further improved to 98.6% (GOLD) and 97% (SILVER) F1 score by adding meta-data features, mainly thanks to the strong discriminatory power of automatically extracted geographical information related to events.
2017
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec
|
Jakub Piskorski
|
Lidia Pivovarova
|
Jan Šnajder
|
Josef Steinberger
|
Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages
Jakub Piskorski
|
Lidia Pivovarova
|
Jan Šnajder
|
Josef Steinberger
|
Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
This paper describes the outcomes of the first challenge on multilingual named entity recognition that aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching. It was organised in the context of the 6th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2017 conference. Although eleven teams signed up for the evaluation, due to the complexity of the task(s) and short time available for elaborating a solution, only two teams submitted results on time. The reported evaluation figures reflect the relatively higher level of complexity of named entity-related tasks in the context of processing texts in Slavic languages. Since the duration of the challenge goes beyond the date of the publication of this paper, an updated picture of the participating systems and their corresponding performance can be found on the web page of the challenge.
Multi-word Entity Classification in a Highly Multilingual Environment
Sophie Chesney
|
Guillaume Jacquet
|
Ralf Steinberger
|
Jakub Piskorski
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
This paper describes an approach for the classification of millions of existing multi-word entities (MWEntities), such as organisation or event names, into thirteen category types, based only on the tokens they contain. In order to classify our very large in-house collection of multilingual MWEntities into an application-oriented set of entity categories, we trained and tested distantly-supervised classifiers in 43 languages based on MWEntities extracted from BabelNet. The best-performing classifier was the multi-class SVM using a TF.IDF-weighted data representation. Interestingly, one unique classifier trained on a mix of all languages consistently performed better than classifiers trained for individual languages, reaching an averaged F1-value of 88.8%. In this paper, we present the training and test data, including a human evaluation of its accuracy, describe the methods used to train the classifiers, and discuss the results.
On the Creation of a Security-Related Event Corpus
Martin Atkinson
|
Jakub Piskorski
|
Hristo Tanev
|
Vanni Zavarella
Proceedings of the Events and Stories in the News Workshop
This paper reports on an effort of creating a corpus of structured information on security-related events automatically extracted from on-line news, part of which has been manually curated. The main motivation behind this effort is to provide material to the NLP community working on event extraction that could be used both for training and evaluation purposes.
2015
The 5th Workshop on Balto-Slavic Natural Language Processing
Jakub Piskorski
|
Lidia Pivovarova
|
Jan Šnajder
|
Hristo Tanev
|
Roman Yangarber
The 5th Workshop on Balto-Slavic Natural Language Processing
Open Relation Extraction for Polish: Preliminary Experiments
Jakub Piskorski
The 5th Workshop on Balto-Slavic Natural Language Processing
2013
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing
Jakub Piskorski
|
Lidia Pivovarova
|
Hristo Tanev
|
Roman Yangarber
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing
On Named Entity Recognition in Targeted Twitter Streams in Polish.
Jakub Piskorski
|
Maud Ehrmann
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing
2011
Exploring the Usefulness of Cross-lingual Information Fusion for Refining Real-time News Event Extraction: A Preliminary Study
Jakub Piskorski
|
Jenya Belayeva
|
Martin Atkinson
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011
2008
Online-Monitoring of Security-Related Events
Martin Atkinson
|
Jakub Piskorski
|
Bruno Pouliquen
|
Ralf Steinberger
|
Hristo Tanev
|
Vanni Zavarella
Coling 2008: Companion volume: Demonstrations
2007
Proceedings of the Workshop on Balto-Slavonic Natural Language Processing
Jakub Piskorski
|
Hristo Tanev
Proceedings of the Workshop on Balto-Slavonic Natural Language Processing
Lemmatization of Polish Person Names
Jakub Piskorski
|
Marcin Sydow
|
Anna Kupść
Proceedings of the Workshop on Balto-Slavonic Natural Language Processing
2006
Linguistic Suite for Polish Cadastral System
Witold Abramowicz
|
Agata Filipowska
|
Jakub Piskorski
|
Krzysztof Węcel
|
Karol Wieloch
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper reports on an endeavour of creating basic linguistic resources for geo-referencing of Polish free-text documents. We have defined a fine-grained named entity hierarchy, produced an exhaustive gazetteer, and developed named-entity grammars for Polish. Additionally, an annotated corpus for the cadastral domain was prepared for evaluation purposes. Our baseline approach to geo-referencing is based on the application of the aforementioned resources and a lightweight co-referencing technique which utilizes the Jaro-Winkler string-similarity metric. We carried out a detailed evaluation of detecting locations, organizations and persons, which revealed that the best results are obtained via application of a combined grammar for all types. The application of lightweight co-referencing for organizations and persons improves recall but deteriorates precision, and no gain is observed for locations. The paper is accompanied by a demo, a geo-referencing application capable of: (a) finding documents and text fragments based on named entities and (b) populating the spatial ontology from texts.
2005
Modelling of a Gazetteer Look-up Component
Jakub Piskorski
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts
2004
Integrated Language Technologies for Multilingual Information Services in the MEMPHIS Project
Walter Kasper
|
Jörg Steffen
|
Jakub Piskorski
|
Paul Buitelaar
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Extraction of Polish Named-Entities
Jakub Piskorski
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2003
Integrating Information Extraction and Automatic Hyperlinking
Stephan Busemann
|
Witold Drozdzynski
|
Hans-Ulrich Krieger
|
Jakub Piskorski
|
Ulrich Schaefer
|
Hans Uszkoreit
|
Feiyu Xu
The Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics
2002
A Flexible XML-based Regular Compiler for Creation and Conversion of Linguistic Resources
Jakub Piskorski
|
Witold Drożdżyński
|
Oliver Scherf
|
Feiyu Xu
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
A Domain Adaptive Approach to Automatic Acquisition of Domain Relevant Terms and their Relations with Bootstrapping
Feiyu Xu
|
Daniela Kurz
|
Jakub Piskorski
|
Sven Schmeier
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
An Integrated Architecture for Shallow and Deep Processing
Berthold Crysmann
|
Anette Frank
|
Bernd Kiefer
|
Stefan Mueller
|
Guenter Neumann
|
Jakub Piskorski
|
Ulrich Schaefer
|
Melanie Siegel
|
Hans Uszkoreit
|
Feiyu Xu
|
Markus Becker
|
Hans-Ulrich Krieger
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics
2000
A Divide-and-Conquer Strategy for Shallow Parsing of German Free Texts
Gunter Neumann
|
Christian Braun
|
Jakub Piskorski
Sixth Applied Natural Language Processing Conference