Marcus Klang


2020

pdf
Hedwig: A Named Entity Linker
Marcus Klang | Pierre Nugues
Proceedings of the Twelfth Language Resources and Evaluation Conference

Named entity linking is the task of identifying mentions of named things in text, such as “Barack Obama” or “New York”, and linking these mentions to unique identifiers. In this paper, we describe Hedwig, an end-to-end named entity linker, which uses a combination of word and character BILSTM models for mention detection, a Wikidata and Wikipedia-derived knowledge base with global information aggregated over nine language editions, and a PageRank algorithm for entity linking. We evaluated Hedwig on the TAC2017 dataset, consisting of news texts and discussion forums, and we obtained a final score of 59.9% on CEAFmC+, an improvement over our previous generation linker Ugglan, and a trilingual entity link score of 71.9%.

2019

pdf
Docria: Processing and Storing Linguistic Data with Wikipedia
Marcus Klang | Pierre Nugues
Proceedings of the 22nd Nordic Conference on Computational Linguistics

The availability of user-generated content has increased significantly over time. Wikipedia is one example of a corpora which spans a huge range of topics and is freely available. Storing and processing these corpora requires flexible documents models as they may contain malicious and incorrect data. Docria is a library which attempts to address this issue by providing a solution which can be used with small to large corpora, from laptops using Python interactively in a Jupyter notebook to clusters running map-reduce frameworks with optimized compiled code. Docria is available as open-source code.

2018

pdf
Linking, Searching, and Visualizing Entities in Wikipedia
Marcus Klang | Pierre Nugues
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf
Docforia: A Multilayer Document Model
Marcus Klang | Pierre Nugues
Proceedings of the 21st Nordic Conference on Computational Linguistics

2016

pdf
Pairing Wikipedia Articles Across Languages
Marcus Klang | Pierre Nugues
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)

Wikipedia has become a reference knowledge source for scores of NLP applications. One of its invaluable features lies in its multilingual nature, where articles on a same entity or concept can have from one to more than 200 different versions. The interlinking of language versions in Wikipedia has undergone a major renewal with the advent of Wikidata, a unified scheme to identify entities and their properties using unique numbers. However, as the interlinking is still manually carried out by thousands of editors across the globe, errors may creep in the assignment of entities. In this paper, we describe an optimization technique to match automatically language versions of articles, and hence entities, that is only based on bags of words and anchors. We created a dataset of all the articles on persons we extracted from Wikipedia in six languages: English, French, German, Russian, Spanish, and Swedish. We report a correct match of at least 94.3% on each pair.

pdf
WIKIPARQ: A Tabulated Wikipedia Resource Using the Parquet Format
Marcus Klang | Pierre Nugues
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Wikipedia has become one of the most popular resources in natural language processing and it is used in quantities of applications. However, Wikipedia requires a substantial pre-processing step before it can be used. For instance, its set of nonstandardized annotations, referred to as the wiki markup, is language-dependent and needs specific parsers from language to language, for English, French, Italian, etc. In addition, the intricacies of the different Wikipedia resources: main article text, categories, wikidata, infoboxes, scattered into the article document or in different files make it difficult to have global view of this outstanding resource. In this paper, we describe WikiParq, a unified format based on the Parquet standard to tabulate and package the Wikipedia corpora. In combination with Spark, a map-reduce computing framework, and the SQL query language, WikiParq makes it much easier to write database queries to extract specific information or subcorpora from Wikipedia, such as all the first paragraphs of the articles in French, or all the articles on persons in Spanish, or all the articles on persons that have versions in French, English, and Spanish. WikiParq is available in six language versions and is potentially extendible to all the languages of Wikipedia. The WikiParq files are downloadable as tarball archives from this location: http://semantica.cs.lth.se/wikiparq/.

pdf
Multilingual Supervision of Semantic Annotation
Peter Exner | Marcus Klang | Pierre Nugues
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we investigate the annotation projection of semantic units in a practical setting. Previous approaches have focused on using parallel corpora for semantic transfer. We evaluate an alternative approach using loosely parallel corpora that does not require the corpora to be exact translations of each other. We developed a method that transfers semantic annotations from one language to another using sentences aligned by entities, and we extended it to include alignments by entity-like linguistic units. We conducted our experiments on a large scale using the English, Swedish, and French language editions of Wikipedia. Our results show that the annotation projection using entities in combination with loosely parallel corpora provides a viable approach to extending previous attempts. In addition, it allows the generation of proposition banks upon which semantic parsers can be trained.

pdf
Langforia: Language Pipelines for Annotating Large Collections of Documents
Marcus Klang | Pierre Nugues
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

In this paper, we describe Langforia, a multilingual processing pipeline to annotate texts with multiple layers: formatting, parts of speech, named entities, dependencies, semantic roles, and entity links. Langforia works as a web service, where the server hosts the language processing components and the client, the input and result visualization. To annotate a text or a Wikipedia page, the user chooses an NLP pipeline and enters the text in the interface or selects the page URL. Once processed, the results are returned to the client, where the user can select the annotation layers s/he wants to visualize. We designed Langforia with a specific focus for Wikipedia, although it can process any type of text. Wikipedia has become an essential encyclopedic corpus used in many NLP projects. However, processing articles and visualizing the annotations are nontrivial tasks that require dealing with multiple markup variants, encodings issues, and tool incompatibilities across the language versions. This motivated the development of a new architecture. A demonstration of Langforia is available for six languages: English, French, German, Spanish, Russian, and Swedish at http://vilde.cs.lth.se:9000/ as well as a web API: http://vilde.cs.lth.se:9000/api. Langforia is also provided as a standalone library and is compatible with cluster computing.

2015

pdf
A Distant Supervision Approach to Semantic Role Labeling
Peter Exner | Marcus Klang | Pierre Nugues
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics