2022
Automated speech tools for helping communities process restricted-access corpora for language revival efforts
Nay San | Martijn Bartelds | Tolulope Ogunremi | Alison Mount | Ruben Thompson | Michael Higgins | Roy Barker | Jane Simpson | Dan Jurafsky
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Many archival recordings of speech from endangered languages remain unannotated and inaccessible to community members and language learning programs. One bottleneck is the time-intensive nature of annotation. An even narrower bottleneck occurs for recordings with access constraints, such as language that must be vetted or filtered by authorised community members before annotation can begin. We propose a privacy-preserving workflow to widen both bottlenecks for recordings where speech in the endangered language is intermixed with a more widely used language such as English for meta-linguistic commentary and questions (e.g., What is the word for ‘tree’?). We integrate voice activity detection (VAD), spoken language identification (SLI), and automatic speech recognition (ASR) to transcribe the metalinguistic content, which an authorised person can quickly scan to triage recordings that can be annotated by people with lower levels of access. We report work in progress on processing 136 hours of archival audio containing a mix of English and Muruwari. Our collaborative work with the Muruwari custodian of the archival materials shows that this workflow reduces metalanguage transcription time by 20% even given only minimal amounts of annotated training data: 10 utterances per language for SLI and, for ASR, at most 39 minutes and possibly as little as 39 seconds.
2006
Collaborative Annotation that Lasts Forever: Using Peer-to-Peer Technology for Disseminating Corpora and Language Resources
Magesh Balasubramanya | Michael Higgins | Peter Lucas | Jeff Senn | Dominic Widdows
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper describes a peer-to-peer architecture for representing and disseminating linguistic corpora, linguistic annotation, and resources such as lexical databases and gazetteers. The architecture is based upon a Universal Database technology in which all information is represented in globally identified, extensible bundles of attribute-value pairs. These objects are replicated at will between peers in the network, and the business rules that implement replication involve checking digital signatures and proper attribution of data, to guard against tampering and abuse of copyright. Universal identifiers enable comprehensive standoff annotation and commentary. A carefully constructed publication mechanism is described that enables different users to subscribe to material provided by trusted publishers on recognized topics or themes. Access to content and related annotation is provided by distributed indexes, represented using the same underlying data objects as the rest of the database.
The Information Commons Gazetteer
Peter Lucas | Magesh Balasubramanya | Dominic Widdows | Michael Higgins
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Advances in location-aware computing and the convergence of geographic and textual information systems will require a comprehensive, extensible, information-rich framework, called the Information Commons Gazetteer, that can be freely disseminated to small devices in a modular fashion. This paper describes the infrastructure and datasets used to create such a resource. The Gazetteer makes use of MAYA Design's Universal Database Architecture, a peer-to-peer system based upon bundles of attribute-value pairs with universally unique identity, and sophisticated indexing and data fusion tools. The Gazetteer consists primarily of publicly available geographic information from various agencies, organized into a well-defined, scalable hierarchy of worldwide administrative divisions and populated places. The data from various sources are imported into the commons incrementally and are fused with existing data in an iterative process, allowing rich information to evolve over time. Such a flexible and distributed public resource of geographic places and place names allows both researchers and practitioners to realize location-aware computing in an efficient and useful way in the near future by eliminating redundant, time-consuming fusion of disparate sources.