2024
MoCCA: A Model of Comparative Concepts for Aligning Constructicons
Arthur Lorenzi | Peter Ljunglöf | Ben Lyngfelt | Tiago Timponi Torrent | William Croft | Alexander Ziem | Nina Böbel | Linnéa Bäckström | Peter Uhrig | Ely E. Matos
Proceedings of the 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024
This paper presents MoCCA, a Model of Comparative Concepts for Aligning Constructicons under development by a consortium of research groups building Constructicons of different languages including Brazilian Portuguese, English, German and Swedish. The Constructicons will be aligned by using comparative concepts (CCs) providing language-neutral definitions of linguistic properties. The CCs are drawn from typological research on grammatical categories and constructions, and from FrameNet frames, organized in a conceptual network. Language-specific constructions are linked to the CCs in accordance with general principles. MoCCA is organized into files of two types: a largely static CC Database file and multiple Linking files containing relations between constructions in a Constructicon and the CCs. Tools are planned to facilitate visualization of the CC network and linking of constructions to the CCs. All files and guidelines will be versioned, and a mechanism is set up to report cases where a language-specific construction cannot be easily linked to existing CCs.
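The two-file organization described in the abstract can be pictured with a small data model. The sketch below is purely illustrative: the class names, field names, and identifiers (`ComparativeConcept`, `Link`, `CC-001`, `sv-cx-17`, and so on) are hypothetical placeholders and are not taken from the MoCCA specification. It only shows how a static CC table and a per-Constructicon list of links could be checked for constructions that have no usable link yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComparativeConcept:
    """One entry of the (hypothetical) CC Database file."""
    cc_id: str
    definition: str


@dataclass(frozen=True)
class Link:
    """One entry of a (hypothetical) Linking file for a single Constructicon."""
    construction_id: str
    cc_id: str


def unlinked_constructions(construction_ids, links, cc_database):
    """Return the constructions that have no link to a CC present in the database."""
    linked = {link.construction_id for link in links if link.cc_id in cc_database}
    return [cid for cid in construction_ids if cid not in linked]


# Placeholder data; the identifiers and definitions are invented for illustration.
cc_database = {
    "CC-001": ComparativeConcept("CC-001", "placeholder definition A"),
    "CC-002": ComparativeConcept("CC-002", "placeholder definition B"),
}
links = [
    Link("sv-cx-17", "CC-001"),
    Link("sv-cx-18", "CC-999"),  # points at a CC that is not in the database
]
constructions = ["sv-cx-17", "sv-cx-18", "sv-cx-19"]

to_report = unlinked_constructions(constructions, links, cc_database)
print(to_report)
```

Keeping the links in a separate structure from the CC table mirrors the abstract's split between a largely static database and per-language linking files; the list returned here would be the candidates for the reporting mechanism the abstract mentions.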
2023
A Pipeline for the Creation of Multimodal Corpora from YouTube Videos
Nathan Dykes | Anna Wilson | Peter Uhrig
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing
2019
The_Illiterati: Part-of-Speech Tagging for Magahi and Bhojpuri without even knowing the alphabet
Thomas Proisl | Peter Uhrig | Andreas Blombach | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Sefora Mammarella
Proceedings of the First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers
2016
SoMaJo: State-of-the-art tokenization for German web and social media texts
Thomas Proisl | Peter Uhrig
Proceedings of the 10th Web as Corpus Workshop
2012
Efficient Dependency Graph Matching with the IMS Open Corpus Workbench
Thomas Proisl | Peter Uhrig
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
State-of-the-art dependency representations such as the Stanford Typed Dependencies may represent the grammatical relations in a sentence as directed, possibly cyclic graphs. Querying a syntactically annotated corpus for grammatical structures that are represented as graphs requires graph matching, which is a non-trivial task. In this paper, we present an algorithm for graph matching that is tailored to the properties of large, syntactically annotated corpora. The implementation of the algorithm is built on top of the popular IMS Open Corpus Workbench, allowing corpus linguists to re-use existing infrastructure. An evaluation of the resulting software, CWB-treebank, shows that its performance in real world applications, such as a web query interface, compares favourably to implementations that rely on a relational database or a dedicated graph database while at the same time offering a greater expressive power for queries. An intuitive graphical interface for building the query graphs is available via the Treebank.info project.
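To make the graph-matching task in the abstract concrete, here is a minimal sketch of querying one dependency-annotated sentence with a small query graph. It uses naive backtracking over token assignments and is not the algorithm implemented in CWB-treebank. The function name `find_matches`, the data layout, and the example sentence with its annotation are assumptions made for illustration only.

```python
def find_matches(query_nodes, query_edges, tokens, edges):
    """Find all ways to map query variables onto distinct tokens of one sentence.

    query_nodes: dict mapping a variable name to required token attributes
    query_edges: list of (governor_var, relation, dependent_var)
    tokens:      list of attribute dicts, one per token
    edges:       set of (governor_index, relation, dependent_index)
    """
    variables = list(query_nodes)
    results = []

    def consistent(assignment):
        # Every query edge whose endpoints are both bound must exist in the sentence.
        for gov, rel, dep in query_edges:
            if gov in assignment and dep in assignment:
                if (assignment[gov], rel, assignment[dep]) not in edges:
                    return False
        return True

    def extend(position, assignment):
        if position == len(variables):
            results.append(dict(assignment))
            return
        var = variables[position]
        for index, token in enumerate(tokens):
            if index in assignment.values():
                continue  # each token may be bound to at most one variable
            if all(token.get(k) == v for k, v in query_nodes[var].items()):
                assignment[var] = index
                if consistent(assignment):
                    extend(position + 1, assignment)
                del assignment[var]

    extend(0, {})
    return results


# Illustrative sentence "She gave him a book" with a hand-written annotation.
tokens = [
    {"word": "She", "pos": "PRP"},
    {"word": "gave", "pos": "VBD"},
    {"word": "him", "pos": "PRP"},
    {"word": "a", "pos": "DT"},
    {"word": "book", "pos": "NN"},
]
edges = {(1, "nsubj", 0), (1, "iobj", 2), (1, "dobj", 4), (4, "det", 3)}

# Query: a past-tense verb with any subject and a noun as direct object.
query_nodes = {"verb": {"pos": "VBD"}, "subj": {}, "obj": {"pos": "NN"}}
query_edges = [("verb", "nsubj", "subj"), ("verb", "dobj", "obj")]

matches = find_matches(query_nodes, query_edges, tokens, edges)
print(matches)
```

Because the sentence edges are stored as a plain set of triples, the sketch places no restriction on cycles in the graph. Its cost grows quickly with the number of query variables, which is the kind of problem a matcher tailored to large corpora has to address.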