Language resources are valuable assets, both for institutions and researchers. To safeguard these resources requirements for repository systems and data management have been specified by various branch organizations, e.g., CLARIN and the Data Seal of Approval. This paper describes these and some additional ones posed by the authors’ home institutions. And it shows how they are met by FLAT, to provide a new home for language resources. The basis of FLAT is formed by the Fedora Commons repository system. This repository system can meet many of the requirements out-of-the box, but still additional configuration and some development work is needed to meet the remaining ones, e.g., to add support for Handles and Component Metadata. This paper describes design decisions taken in the construction of FLAT’s system architecture via a mix-and-match strategy, with a preference for the reuse of existing solutions. FLAT is developed and used by the Meertens Institute and The Language Archive, but is also freely available for anyone in need of a CLARIN-compliant repository for their language resources.
Researchers share large amounts of digital resources, which offer new chances for cooperation. Collaborative annotation systems are meant to support this. Often these systems are targeted at a specific task or domain, e.g., annotation of a corpus. The DWAN framework for web annotation is generic and can support a wide range of tasks and domains. A key feature of the framework is its support for caching representations of the annotated resource. This allows showing the context of the annotation even if the resource has changed or has been removed. The paper describes the design and implementation of the framework. Use cases provided by researchers are well in line with the key characteristics of the DWAN annotation framework.
The ISOcat Data Category Registry has been a joint project of both ISO TC 37 and the European CLARIN infrastructure. In this paper the experiences of using ISOcat in CLARIN are described and evaluated. This evaluation clarifies the requirements of CLARIN with regard to a semantic registry to support its semantic interoperability needs. A simpler model based on concepts instead of data cate-gories and a simpler workflow based on community recommendations will address these needs better and offer the required flexibility.
The Lexical Markup Framework (ISO 24613:2008) provides a core class diagram and various extensions as the basis for constructing lexical resources. Unfortunately the informative Document Type Definition provided by the standard and other available LMF serializations lack support for many of the powerful features of the model. This paper describes RELISH LMF, which unlocks the full power of the LMF model by providing a set of extensible modern schema modules. As use cases RELISH LL LMF and support by LEXUS, an online lexicon tool, are described.
The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resource descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records. In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph. This information is further rendered in an interactive graph viewer: the SMC Browser. The resulting interactive graph offers an intuitive view on the complex interrelations of the discussed dataset revealing clusters of more similar profiles. This information is useful both for metadata modellers, for metadata curation tasks as well as for general audience seeking for a ‘big picture’ of the complex CMD data domain.
Within the European CLARIN infrastructure ISOcat is used to enable both humans and computer programs to find specific resources even when they use different terminology or data structures. In order to do so, it should be clear which concepts are used in these resources, both at the level of metadata for the resource as well as its content, and what is meant by them. The concepts can be specified in ISOcat. SCHEMAcat enables us to relate the concepts used by a resource, while RELcat enables to type these relationships and add relationships beyond resource boundaries. This way these three registries together allow us (and the programs) to find what we are looking for.
This paper describes the status of the standardization efforts of a Component Metadata approach for describing Language Resources with metadata. Different linguistic and Language & Technology communities as CLARIN, META-SHARE and NaLiDa use this component approach and see its standardization of as a matter for cooperation that has the possibility to create a large interoperable domain of joint metadata. Starting with an overview of the component metadata approach together with the related semantic interoperability tools and services as the ISOcat data category registry and the relation registry we explain the standardization plan and efforts for component metadata within ISO TC37/SC4. Finally, we present information about uptake and plans of the use of component metadata within the three mentioned linguistic and L&T communities.
The RELISH project promotes language-oriented research by addressing a two-pronged problem: (1) the lack of harmonization between digital standards for lexical information in Europe and America, and (2) the lack of interoperability among existing lexicons of endangered languages, in particular those created with the Shoebox/Toolbox lexicon building software. The cooperation partners in the RELISH project are the University of Frankfurt (FRA), the Max Planck Institute for Psycholinguistics (MPI Nijmegen), and Eastern Michigan University, the host of the Linguist List (ILIT). The project aims at harmonizing key European and American digital standards whose divergence has hitherto impeded international collaboration on language technology for resource creation and analysis, as well as web services for archive access. Focusing on several lexicons of endangered languages, the project will establish a unified way of referencing lexicon structure and linguistic concepts, and develop a procedure for migrating these heterogeneous lexicons to a standards-compliant format. Once developed, the procedure will be generalizable to the large store of lexical resources involved in the LEGO and DoBeS projects.
The ISOcat Data Category Registry contains basically a flat and easily extensible list of data category specifications. To foster reuse and standardization only very shallow relationships among data categories are stored in the registry. However, to assist crosswalks, possibly based on personal views, between various (application) domains and to overcome possible proliferation of data categories more types of ontological relationships need to be specified. RELcat is a first prototype of a Relation Registry, which allows storing arbitrary relationships. These relationships can reflect the personal view of one linguist or a larger community. The basis of the registry is a relation type taxonomy that can easily be extended. This allows on one hand to load existing sets of relations specified in, for example, an OWL (2) ontology or SKOS taxonomy. And on the other hand allows algorithms that query the registry to traverse the stored semantic network to remain ignorant of the original source vocabulary. This paper describes first experiences with RELcat and explains some initial design decisions.
We describe our computer-supported framework to overcome the rule of metadata schism. It combines the use of controlled vocabularies, managed by a data category registry, with a component-based approach, where the categories can be combined to yield complex metadata structures. A metadata scheme devised in this way will thus be grounded in its use of categories. Schema designers will profit from existing prefabricated larger building blocks, motivating re-use at a larger scale. The common base of any two metadata schemes within this framework will solve, at least to a good extent, the semantic interoperability problem, and consequently, further promote systematic use of metadata for existing resources and tools to be shared.
In this paper, we describe a unifying approach to tackle data heterogeneity issues for lexica and related resources. We present LEXUS, our software that implements the Lexical Markup Framework (LMF) to uniformly describe and manage lexica of different structures. LEXUS also makes use of a central Data Category Registry (DCR) to address terminological issues with regard to linguistic concepts as well as the handling of working and object languages. Finally, we report on ViCoS, a LEXUS extension, providing support for the definition of arbitrary semantic relations between lexical entries or parts thereof.
To achieve true interoperability for valuable linguistic resources different levels of variation need to be addressed. ISO Technical Committee 37, Terminology and other language and content resources, is developing a Data Category Registry. This registry will provide a reusable set of data categories. A new implementation, dubbed ISOcat, of the registry is currently under construction. This paper shortly describes the new data model for data categories that will be introduced in this implementation. It goes on with a sketch of the standardization process. Completed data categories can be reused by the community. This is done by either making a selection of data categories using the ISOcat web interface, or by other tools which interact with the ISOcat system using one of its various Application Programming Interfaces. Linguistic resources that use data categories from the registry should include persistent references, e.g. in the metadata or schemata of the resource, which point back to their origin. These data category references can then be used to determine if two or more resources share common semantics, thus providing a level of interoperability close to the source data and a promising layer for semantic alignment on higher levels.