This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
DaanBroeder
Also published as:
D. Broeder
Fixing paper assignments
Please select all papers that do not belong to this person.
Indicate below which author they should be assigned to.
This paper is framed in the context of the SSHOC project and aims at exploring how Language Technologies can help in promoting and facilitating multilingualism in the Social Sciences and Humanities (SSH). Although most SSH researchers produce culturally and societally relevant work in their local languages, metadata and vocabularies used in the SSH domain to describe and index research data are currently mostly in English. We thus investigate Natural Language Processing and Machine Translation approaches in view of providing resources and tools to foster multilingual access and discovery to SSH content across different languages. As case studies, we create and deliver as freely, openly available data a set of multilingual metadata concepts and an automatically extracted multilingual Data Stewardship terminology. The two case studies allow as well to evaluate performances of state-of-the-art tools and to derive a set of recommendations as to how best apply them. Although not adapted to the specific domain, the employed tools prove to be a valid asset to translation tasks. Nonetheless, validation of results by domain experts proficient in the language is an unavoidable phase of the whole workflow.
This paper outlines the future of language resources and identifies their potential contribution for creating and sustaining the social sciences and humanities (SSH) component of the European Open Science Cloud (EOSC).
Measuring the quality of metadata is only possible by assessing the quality of the underlying schema and the metadata instance. We propose some factors that are measurable automatically for metadata according to the CMD framework, taking into account the variability of schemas that can be defined in this framework. The factors include among others the number of elements, the (re-)use of reusable components, the number of filled in elements. The resulting score can serve as an indicator of the overall quality of the CMD instance, used for feedback to metadata providers or to provide an overview of the overall quality of metadata within a reposi-tory. The score is independent of specific schemas and generalizable. An overall assessment of harvested metadata is provided in form of statistical summaries and the distribution, based on a corpus of harvested metadata. The score is implemented in XQuery and can be used in tools, editors and repositories.
Researchers share large amounts of digital resources, which offer new chances for cooperation. Collaborative annotation systems are meant to support this. Often these systems are targeted at a specific task or domain, e.g., annotation of a corpus. The DWAN framework for web annotation is generic and can support a wide range of tasks and domains. A key feature of the framework is its support for caching representations of the annotated resource. This allows showing the context of the annotation even if the resource has changed or has been removed. The paper describes the design and implementation of the framework. Use cases provided by researchers are well in line with the key characteristics of the DWAN annotation framework.
The ISOcat Data Category Registry has been a joint project of both ISO TC 37 and the European CLARIN infrastructure. In this paper the experiences of using ISOcat in CLARIN are described and evaluated. This evaluation clarifies the requirements of CLARIN with regard to a semantic registry to support its semantic interoperability needs. A simpler model based on concepts instead of data cate-gories and a simpler workflow based on community recommendations will address these needs better and offer the required flexibility.
Within scientific institutes there exist many language resources. These resources are often quite specialized and relatively unknown. The current infrastructural initiatives try to tackle this issue by collecting metadata about the resources and establishing centers with stable repositories to ensure the availability of the resources. It would be beneficial if the researcher could, by means of a simple query, determine which resources and which centers contain information useful to his or her research, or even work on a set of distributed resources as a virtual corpus. In this article we propose an architecture for a distributed search environment allowing researchers to perform searches in a set of distributed language resources.
Language resource centers allow researchers to reliably deposit their structured data together with associated meta data and run services operating on this deposited data. We are looking into possibilities to create long-term persistency of both the deposited data and the services operating on this data. Challenges, both technical and non-technical, that need to be solved are the need to replicate more than just the data, proper identification of the digital objects in a distributed environment by making use of persistent identifiers and the set-up of a proper authentication and authorization domain including the management of the authorization information on the digital objects. We acknowledge the investment that most language resource centers have made in their current infrastructure. Therefore one of the most important requirements is the loose coupling with existing infrastructures without the need to make many changes. This shift from a single language resource center into a federated environment of many language resource centers is discussed in the context of a real world center: The Language Archive supported by the Max Planck Institute for Psycholinguistics.
This paper describes the status of the standardization efforts of a Component Metadata approach for describing Language Resources with metadata. Different linguistic and Language & Technology communities as CLARIN, META-SHARE and NaLiDa use this component approach and see its standardization of as a matter for cooperation that has the possibility to create a large interoperable domain of joint metadata. Starting with an overview of the component metadata approach together with the related semantic interoperability tools and services as the ISOcat data category registry and the relation registry we explain the standardization plan and efforts for component metadata within ISO TC37/SC4. Finally, we present information about uptake and plans of the use of component metadata within the three mentioned linguistic and L&T communities.
Although the possibility of referring or citing on-line data from publications is seen at least theoretically as an important means to provide immediate testable proof or simple illustration of a line of reasoning, the practice has not been wide-spread yet and no extensive experience has been gained about the possibilities and problems of referring to raw data-sets. This paper makes a case to investigate the possibility and need of persistent data visualization services that facilitate the inspection and evaluation of the cited data.
This contribution presents The Language Archive (TLA), a new unit at the MPI for Psycholinguistics, discussing the current developments in management of scientific data, considering the need for new data research infrastructures. Although several initiatives worldwide in the realm of language resources aim at the integration, preservation and mobilization of research data, the state of such scientific data is still often problematic. Data are often not well organized and archived and not described by metadata ― even unique data such as field-work observational data on endangered languages is still mostly on perishable carriers. New data centres are needed that provide trusted, quality-reviewed, persistent services and suitable tools and that take legal and ethical issues seriously. The CLARIN initiative has established criteria for suitable centres. TLA is in a good position to be one of such centres. It is based on three essential pillars: (1) A data archive; (2) management, access and annotation tools; (3) archiving and software expertise for collaborative projects. The archive hosts mostly observational data on small languages worldwide and language acquisition data, but also data resulting from experiments.
We describe our computer-supported framework to overcome the rule of metadata schism. It combines the use of controlled vocabularies, managed by a data category registry, with a component-based approach, where the categories can be combined to yield complex metadata structures. A metadata scheme devised in this way will thus be grounded in its use of categories. Schema designers will profit from existing prefabricated larger building blocks, motivating re-use at a larger scale. The common base of any two metadata schemes within this framework will solve, at least to a good extent, the semantic interoperability problem, and consequently, further promote systematic use of metadata for existing resources and tools to be shared.
Over the years, the field of Language Resources and Technology (LRT) has developed a tremendous amount of resources and tools. However, there is no ready-to-use map that researchers could use to gain a good overview and steadfast orientation when searching for, say corpora or software tools to support their studies. It is rather the case that information is scattered across project- or organisation-specific sites, which makes it hard if not impossible for less-experienced researchers to gather all relevant material. Clearly, the provision of metadata is central to resource and software exploration. However, in the LRT field, metadata comes in many forms, tastes and qualities, and therefore substantial harmonization and curation efforts are required to provide researchers with metadata-based guidance. To address this issue a broad alliance of LRT providers (CLARIN, the Linguist List, DOBES, DELAMAN, DFKI, ELRA) have initiated the Virtual Language Observatory portal to provide a low-barrier, easy-to-follow entry point to language resources and tools; it can be accessed via http://www.clarin.eu/vlo
Within the CLARIN e-science infrastructure project it is foreseen to develop a component-based registry for metadata for Language Resources and Language Technology. With this registry it is hoped to overcome the problems of the current available systems with respect to inflexible fixed schema, unsuitable terminology and interoperability problems. The registry will address interoperability needs by refering to a shared vocabulary registered in data category registries as they are suggested by ISO.
The DAM-LR project aims at virtually integrating various European language resource archives that allow users to navigate and operate in a single unified domain of language resources. This type of integration introduces Grid technology to the humanities disciplines and forms a federation of archives. The complete architecture is designed based on a few well-known components .This is considered the basis for building a research infrastructure for Language Resources as is planned within the CLARIN project. The DAM-LR project was purposefully started with only a small number of participants for flexibility and to avoid complex contract negotiations with respect to legal issues. Now that we have gained insights into the basic technology issues and organizational issues, it is foreseen that the federation will be expanded considerably within the CLARIN project that will also address the associated legal issues.
About two years ago, the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, started an initiative to install regional language archives in various places around the world, particularly in places where a large number of endangered languages exist and are being documented. These digital archives make use of the LAT archiving framework that the MPI has developed over the past nine years. This framework consists of a number of web-based tools for depositing, organizing and utilizing linguistic resources in a digital archive. The regional archives are in principle autonomous archives, but they can decide to share metadata descriptions and language resources with the MPI archive in Nijmegen and become part of a grid of linked LAT archives. By doing so, they will also take advantage of the long-term preservation strategy of the MPI archive. This paper describes the reasoning behind this initiative and how in practice such an archive is set up.
Language Archiving, Resource Management LAMUS is a web-based service that allows researchers to deposit their language resources into a language resources archive. It was developed at the MPI for Psycholinguistics for stricter control of the archive coherence and consistency and allowing wider use of the archiving facilities without increasing the workload for archive and corpus managers. LAMUS is based on the use of IMDI metadata standard for language resources and offers metadata search and browsing over the archive.
The DAM-LR project aims at virtually integrating various European language resource archives that allow users to navigate and operate in a single unified domain of language resources. This type of integration introduces Grid technology to the humanities disciplines and forms a federation of archives. It is the basis for establishing a research infrastructure for language resources which will finally enable eHumanities. Currently, the complete architecture is designed based on a few well-known components and some components are already tested. Based on the technological insights gathered and due to discussions within the international DELAMAN network the ethical and organizational basis for such a federation is defined.
A number of serious reasons will convince an increasing amount of researchers to store their relevant material in centers which we will call "language resource archives". They combine the duty of taking care of long-term preservation as well as the task to give access to their material to different user groups. Access here is meant in the sense that an active interaction with the data will be made possible to support the integration of new data, new versions or commentaries of all sorts. Modern Language Resource Archives will have to adhere to a number of basic principles to fulfill all requirements and they will have to be involved in federations to create joint language resource domains making it even simpler for the researchers to access the data. This paper makes an attempt to formulate the essential pillars language resource archives have to adhere to.
Metadata descriptions of language resources become an increasing necessity since the shear amount of language resources is increasing rapidly and especially since we are now creating infrastuctures to access these resources via the web through integrated domains of language resource archives. Yet, the metadata frameworks offered for the domain of language resources (IMDI and OLAC), although mature, are not as widely accepted as necessary. The lack of confidence in the stability and persistence of the concepts and formats introduced by these metadata sets seems to be one argument for people to not invest the time needed for metadata creation. The introduction of these concepts into an ISO standardization process may convince contributors to make use of the terminology. The availability of the ISO Data Category Registry that includes a metadata profile will also offer the opportunity for researchers to construct their own metadata set tailored to the needs of the project at hand, but nevertheless supporting interoperability.
It is an ongoing debate whether categorical systems created by some experts are an appropriate way to help users finding useful resources in the internet. However for the much more restricted domain of language documentation such a category system might still prove reasonable if not indispensable. This article gives an overview over the particular IMDI category set and presents a rough evaluation of its practical use at the Max-Planck-Institute Nijmegen.
An architecture is presented that provides an integrated framework for managing, archiving and accessing language resources. This architecture was discussed in the DELAMAN network – a world-wide network of archives holding material about endangered languages. Such a framework will be built upon a metadata infrastructure, a mechanism to resolve unique resource identifiers, user and access rights management components. These components are closely related and have to be based on redundant and distributed services. For all these components existing middleware seems to be available, however, it has to be checked how they can interact with each other.