Daan Broeder

Also published as: D. Broeder

2022

Language Technologies for the Creation of Multilingual Terminologies. Lessons Learned from the SSHOC Project
Federica Gamba | Francesca Frontini | Daan Broeder | Monica Monachini
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper is framed in the context of the SSHOC project and aims at exploring how Language Technologies can help in promoting and facilitating multilingualism in the Social Sciences and Humanities (SSH). Although most SSH researchers produce culturally and societally relevant work in their local languages, metadata and vocabularies used in the SSH domain to describe and index research data are currently mostly in English. We thus investigate Natural Language Processing and Machine Translation approaches in view of providing resources and tools to foster multilingual access to and discovery of SSH content across different languages. As case studies, we create and deliver as freely, openly available data a set of multilingual metadata concepts and an automatically extracted multilingual Data Stewardship terminology. The two case studies also allow us to evaluate the performance of state-of-the-art tools and to derive a set of recommendations on how best to apply them. Although not adapted to the specific domain, the employed tools prove to be a valid asset to translation tasks. Nonetheless, validation of results by domain experts proficient in the language is an unavoidable phase of the whole workflow.

2020

Proceedings of the Workshop about Language Resources for the SSH Cloud
Daan Broeder | Maria Eskevich | Monica Monachini
Proceedings of the Workshop about Language Resources for the SSH Cloud

LR4SSHOC: The Future of Language Resources in the Context of the Social Sciences and Humanities Open Cloud
Daan Broeder | Maria Eskevich | Monica Monachini
Proceedings of the Workshop about Language Resources for the SSH Cloud

This paper outlines the future of language resources and identifies their potential contribution for creating and sustaining the social sciences and humanities (SSH) component of the European Open Science Cloud (EOSC).

2014

Towards automatic quality assessment of component metadata
Thorsten Trippel | Daan Broeder | Matej Durco | Oddrun Ohren
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Measuring the quality of metadata is only possible by assessing the quality of the underlying schema and the metadata instance. We propose some factors that are measurable automatically for metadata according to the CMD framework, taking into account the variability of schemas that can be defined in this framework. The factors include, among others, the number of elements, the (re-)use of reusable components, and the number of filled-in elements. The resulting score can serve as an indicator of the overall quality of the CMD instance, used for feedback to metadata providers or to provide an overview of the overall quality of metadata within a repository. The score is independent of specific schemas and generalizable. An overall assessment of harvested metadata is provided in the form of statistical summaries and the distribution, based on a corpus of harvested metadata. The score is implemented in XQuery and can be used in tools, editors and repositories.

The DWAN framework: Application of a web annotation framework for the general humanities to the domain of language resources
Przemyslaw Lenkiewicz | Olha Shkaravska | Twan Goosen | Daan Broeder | Menzo Windhouwer | Stephanie Roth | Olof Olsson
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Researchers share large amounts of digital resources, which offer new chances for cooperation. Collaborative annotation systems are meant to support this. Often these systems are targeted at a specific task or domain, e.g., annotation of a corpus. The DWAN framework for web annotation is generic and can support a wide range of tasks and domains. A key feature of the framework is its support for caching representations of the annotated resource. This allows showing the context of the annotation even if the resource has changed or has been removed. The paper describes the design and implementation of the framework. Use cases provided by researchers are well in line with the key characteristics of the DWAN annotation framework.

Experiences with the ISOcat Data Category Registry
Daan Broeder | Ineke Schuurman | Menzo Windhouwer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The ISOcat Data Category Registry has been a joint project of both ISO TC 37 and the European CLARIN infrastructure. In this paper the experiences of using ISOcat in CLARIN are described and evaluated. This evaluation clarifies the requirements of CLARIN with regard to a semantic registry to support its semantic interoperability needs. A simpler model based on concepts instead of data categories and a simpler workflow based on community recommendations will address these needs better and offer the required flexibility.

2012

Federated Search: Towards a Common Search Infrastructure
Herman Stehouwer | Matej Durco | Eric Auer | Daan Broeder
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Within scientific institutes there exist many language resources. These resources are often quite specialized and relatively unknown. The current infrastructural initiatives try to tackle this issue by collecting metadata about the resources and establishing centers with stable repositories to ensure the availability of the resources. It would be beneficial if the researcher could, by means of a simple query, determine which resources and which centers contain information useful to his or her research, or even work on a set of distributed resources as a virtual corpus. In this article we propose an architecture for a distributed search environment allowing researchers to perform searches in a set of distributed language resources.

Proper Language Resource Centers
Willem Elbers | Daan Broeder | Dieter van Uytvanck
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Language resource centers allow researchers to reliably deposit their structured data together with associated meta data and run services operating on this deposited data. We are looking into possibilities to create long-term persistency of both the deposited data and the services operating on this data. Challenges, both technical and non-technical, that need to be solved are the need to replicate more than just the data, proper identification of the digital objects in a distributed environment by making use of persistent identifiers and the set-up of a proper authentication and authorization domain including the management of the authorization information on the digital objects. We acknowledge the investment that most language resource centers have made in their current infrastructure. Therefore one of the most important requirements is the loose coupling with existing infrastructures without the need to make many changes. This shift from a single language resource center into a federated environment of many language resource centers is discussed in the context of a real world center: The Language Archive supported by the Max Planck Institute for Psycholinguistics.

Standardizing a Component Metadata Infrastructure
Daan Broeder | Dieter van Uytvanck | Maria Gavrilidou | Thorsten Trippel | Menzo Windhouwer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the status of the standardization efforts of a Component Metadata approach for describing Language Resources with metadata. Different linguistic and Language & Technology communities such as CLARIN, META-SHARE and NaLiDa use this component approach and see its standardization as a matter for cooperation that has the possibility to create a large interoperable domain of joint metadata. Starting with an overview of the component metadata approach together with the related semantic interoperability tools and services, such as the ISOcat data category registry and the relation registry, we explain the standardization plan and efforts for component metadata within ISO TC37/SC4. Finally, we present information about uptake and plans of the use of component metadata within the three mentioned linguistic and L&T communities.

Citing on-line Language Resources
Daan Broeder | Dieter van Uytvanck | Gunter Senft
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Although the possibility of referring or citing on-line data from publications is seen at least theoretically as an important means to provide immediate testable proof or simple illustration of a line of reasoning, the practice has not been wide-spread yet and no extensive experience has been gained about the possibilities and problems of referring to raw data-sets. This paper makes a case to investigate the possibility and need of persistent data visualization services that facilitate the inspection and evaluation of the cited data.

The Language Archive — a new hub for language resources
Sebastian Drude | Daan Broeder | Paul Trilsbeek | Peter Wittenburg
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This contribution presents The Language Archive (TLA), a new unit at the MPI for Psycholinguistics, discussing the current developments in management of scientific data, considering the need for new data research infrastructures. Although several initiatives worldwide in the realm of language resources aim at the integration, preservation and mobilization of research data, the state of such scientific data is still often problematic. Data are often not well organized and archived and not described by metadata ― even unique data such as field-work observational data on endangered languages is still mostly on perishable carriers. New data centres are needed that provide trusted, quality-reviewed, persistent services and suitable tools and that take legal and ethical issues seriously. The CLARIN initiative has established criteria for suitable centres. TLA is in a good position to be one of such centres. It is based on three essential pillars: (1) A data archive; (2) management, access and annotation tools; (3) archiving and software expertise for collaborative projects. The archive hosts mostly observational data on small languages worldwide and language acquisition data, but also data resulting from experiments.

2010

We describe our computer-supported framework to overcome the rule of metadata schism. It combines the use of controlled vocabularies, managed by a data category registry, with a component-based approach, where the categories can be combined to yield complex metadata structures. A metadata scheme devised in this way will thus be grounded in its use of categories. Schema designers will profit from existing prefabricated larger building blocks, motivating re-use at a larger scale. The common base of any two metadata schemes within this framework will solve, at least to a good extent, the semantic interoperability problem, and consequently, further promote systematic use of metadata for existing resources and tools to be shared.

Virtual Language Observatory: The Portal to the Language Resources and Technology Universe
Dieter Van Uytvanck | Claus Zinn | Daan Broeder | Peter Wittenburg | Mariano Gardellini
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Over the years, the field of Language Resources and Technology (LRT) has developed a tremendous amount of resources and tools. However, there is no ready-to-use map that researchers could use to gain a good overview and steadfast orientation when searching for, say corpora or software tools to support their studies. It is rather the case that information is scattered across project- or organisation-specific sites, which makes it hard if not impossible for less-experienced researchers to gather all relevant material. Clearly, the provision of metadata is central to resource and software exploration. However, in the LRT field, metadata comes in many forms, tastes and qualities, and therefore substantial harmonization and curation efforts are required to provide researchers with metadata-based guidance. To address this issue a broad alliance of LRT providers (CLARIN, the Linguist List, DOBES, DELAMAN, DFKI, ELRA) have initiated the Virtual Language Observatory portal to provide a low-barrier, easy-to-follow entry point to language resources and tools; it can be accessed via http://www.clarin.eu/vlo

2008

Foundation of a Component-based Flexible Registry for Language Resources and Technology
Daan Broeder | Thierry Declerck | Erhard Hinrichs | Stelios Piperidis | Laurent Romary | Nicoletta Calzolari | Peter Wittenburg
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Within the CLARIN e-science infrastructure project it is foreseen to develop a component-based registry for metadata for Language Resources and Language Technology. With this registry it is hoped to overcome the problems of the currently available systems with respect to inflexible fixed schema, unsuitable terminology and interoperability problems. The registry will address interoperability needs by referring to a shared vocabulary registered in data category registries as they are suggested by ISO.

Building a Federation of Language Resource Repositories: the DAM-LR Project and its Continuation within CLARIN.
Daan Broeder | David Nathan | Sven Strömqvist | Remco van Veenendaal
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The DAM-LR project aims at virtually integrating various European language resource archives that allow users to navigate and operate in a single unified domain of language resources. This type of integration introduces Grid technology to the humanities disciplines and forms a federation of archives. The complete architecture is designed based on a few well-known components. This is considered the basis for building a research infrastructure for Language Resources as is planned within the CLARIN project. The DAM-LR project was purposefully started with only a small number of participants for flexibility and to avoid complex contract negotiations with respect to legal issues. Now that we have gained insights into the basic technology issues and organizational issues, it is foreseen that the federation will be expanded considerably within the CLARIN project that will also address the associated legal issues.

A Grid of Regional Language Archives
Paul Trilsbeek | Daan Broeder | Tobias Valkenhoef | Peter Wittenburg
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

About two years ago, the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, started an initiative to install regional language archives in various places around the world, particularly in places where a large number of endangered languages exist and are being documented. These digital archives make use of the LAT archiving framework that the MPI has developed over the past nine years. This framework consists of a number of web-based tools for depositing, organizing and utilizing linguistic resources in a digital archive. The regional archives are in principle autonomous archives, but they can decide to share metadata descriptions and language resources with the MPI archive in Nijmegen and become part of a grid of linked LAT archives. By doing so, they will also take advantage of the long-term preservation strategy of the MPI archive. This paper describes the reasoning behind this initiative and how in practice such an archive is set up.

2006

LAMUS: the Language Archive Management and Upload System
Daan Broeder | Andreas Claus | Freddy Offenga | Romuald Skiba | Paul Trilsbeek | Peter Wittenburg
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

LAMUS is a web-based service that allows researchers to deposit their language resources into a language resources archive. It was developed at the MPI for Psycholinguistics to provide stricter control of archive coherence and consistency and to allow wider use of the archiving facilities without increasing the workload for archive and corpus managers. LAMUS is based on the use of the IMDI metadata standard for language resources and offers metadata search and browsing over the archive.

Keywords: Language Archiving, Resource Management

Technologies for a Federation of Language Resource Archives
Daan Broeder | Freddy Offenga | Peter Wittenburg | Peter van der Kamp | David Nathan | Sven Strömqvist
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The DAM-LR project aims at virtually integrating various European language resource archives that allow users to navigate and operate in a single unified domain of language resources. This type of integration introduces Grid technology to the humanities disciplines and forms a federation of archives. It is the basis for establishing a research infrastructure for language resources which will finally enable eHumanities. Currently, the complete architecture is designed based on a few well-known components and some components are already tested. Based on the technological insights gathered and due to discussions within the international DELAMAN network the ethical and organizational basis for such a federation is defined.

Foundations of Modern Language Resource Archives
Peter Wittenburg | Daan Broeder | Wolfgang Klein | Stephen Levinson | Laurent Romary
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

A number of serious reasons will convince an increasing number of researchers to store their relevant material in centers which we will call "language resource archives". They combine the duty of taking care of long-term preservation with the task of giving different user groups access to their material. Access here is meant in the sense that an active interaction with the data will be made possible to support the integration of new data, new versions or commentaries of all sorts. Modern Language Resource Archives will have to adhere to a number of basic principles to fulfill all requirements and they will have to be involved in federations to create joint language resource domains making it even simpler for the researchers to access the data. This paper makes an attempt to formulate the essential pillars language resource archives have to adhere to.

Metadata Profile in the ISO Data Category Registry
Freddy Offenga | Daan Broeder | Peter Wittenburg | Julien Ducret | Laurent Romary
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Metadata descriptions of language resources become an increasing necessity since the sheer amount of language resources is increasing rapidly and especially since we are now creating infrastructures to access these resources via the web through integrated domains of language resource archives. Yet, the metadata frameworks offered for the domain of language resources (IMDI and OLAC), although mature, are not as widely accepted as necessary. The lack of confidence in the stability and persistence of the concepts and formats introduced by these metadata sets seems to be one argument for people to not invest the time needed for metadata creation. The introduction of these concepts into an ISO standardization process may convince contributors to make use of the terminology. The availability of the ISO Data Category Registry that includes a metadata profile will also offer the opportunity for researchers to construct their own metadata set tailored to the needs of the project at hand, but nevertheless supporting interoperability.

Comparison of Resource Discovery Methods
Alex Klassmann | Freddy Offenga | Daan Broeder | Romuald Skiba | Peter Wittenburg
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

It is an ongoing debate whether categorical systems created by some experts are an appropriate way to help users find useful resources on the internet. However, for the much more restricted domain of language documentation such a category system might still prove reasonable if not indispensable. This article gives an overview of the particular IMDI category set and presents a rough evaluation of its practical use at the Max-Planck-Institute Nijmegen.

2004

Architecture for Distributed Language Resource Management and Archiving
Peter Wittenburg | Heidi Johnson | Markus Buchhorn | Hennie Brugman | Daan Broeder
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

An architecture is presented that provides an integrated framework for managing, archiving and accessing language resources. This architecture was discussed in the DELAMAN network – a world-wide network of archives holding material about endangered languages. Such a framework will be built upon a metadata infrastructure, a mechanism to resolve unique resource identifiers, user and access rights management components. These components are closely related and have to be based on redundant and distributed services. For all these components existing middleware seems to be available, however, it has to be checked how they can interact with each other.

Cross-Disciplinary Integration of Metadata Descriptions
Peter Wittenburg | Greg Gulrajani | Daan Broeder | Marcus Uneson
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Using Profiles for IMDI Metadata Creation
Daan Broeder | Peter Wittenburg | Onno Crasborn
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Towards Metadata Interoperability
Peter Wittenburg | Daan Broeder | Paul Buitelaar
Proceedings of the Workshop on NLP and XML (NLPXML-2004): RDF/RDFS and OWL in Language Technology
