Steven Moran

Also published as: Steve Moran

2022

TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP
Steven Moran | Christian Bentz | Ximena Gutierrez-Vasques | Olga Pelloni | Tanja Samardzic
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present the TeDDi sample, a diversity sample of text data for language comparison and multilingual Natural Language Processing. The TeDDi sample currently features 89 languages based on the typological diversity sample in the World Atlas of Language Structures. It consists of more than 20k texts and is accompanied by open-source corpus processing tools. The aim of TeDDi is to facilitate text-based quantitative analysis of linguistic diversity. We describe in detail the TeDDi sample, how it was created, data availability, and its added value for NLP and linguistic research.

2020

The ACQDIV Corpus Database and Aggregation Pipeline
Anna Jancso | Steven Moran | Sabine Stoll
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the ACQDIV corpus database and aggregation pipeline, a tool developed as part of the European Research Council (ERC) funded project ACQDIV, which aims to identify the universal cognitive processes that allow children to acquire any language. The corpus database represents 15 corpora from 14 typologically maximally diverse languages. Here we give an overview of the project, database, and our extensible software package for adding more corpora to the current language sample. Lastly, we discuss how we use the corpus database to mine for universal patterns in child language acquisition corpora and we describe avenues for future research.

SegBo: A Database of Borrowed Sounds in the World’s Languages
Eitan Grossman | Elad Eisen | Dmitry Nikolaev | Steven Moran
Proceedings of the Twelfth Language Resources and Evaluation Conference

Phonological segment borrowing is a process through which languages acquire new contrastive speech sounds as the result of borrowing new words from other languages. Despite the fact that phonological segment borrowing is documented in many of the world’s languages, to date there has been no large-scale quantitative study of the phenomenon. In this paper, we present SegBo, a novel cross-linguistic database of borrowed phonological segments. We describe our data aggregation pipeline and the resulting language sample. We also present two short case studies based on the database. The first deals with the impact of large colonial languages on the sound systems of the world’s languages; the second deals with universals of borrowing in the domain of rhotic consonants.

2019

Is Word Segmentation Child’s Play in All Languages?
Georgia R. Loukatou | Steven Moran | Damian Blasi | Sabine Stoll | Alejandrina Cristia
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

When learning language, infants need to break down the flow of input speech into minimal word-like units, a process best described as unsupervised bottom-up segmentation. Proposed strategies include several segmentation algorithms, but only cross-linguistically robust algorithms could be plausible candidates for human word learning, since infants have no initial knowledge of the ambient language. We report on the stability in performance of 11 conceptually diverse algorithms on a selection of 8 typologically distinct languages. The results constitute evidence that some segmentation algorithms are cross-linguistically valid and thus could be considered as potential strategies employed by all infants.

2018

Towards faithfully visualizing global linguistic diversity
Garland McNew | Curdin Derungs | Steven Moran
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

BDPROTO: A Database of Phonological Inventories from Ancient and Reconstructed Languages
Egidio Marsico | Sebastien Flavier | Annemarie Verkerk | Steven Moran
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Cross-linguistically Small World Networks are Ubiquitous in Child-directed Speech
Steven Moran | Danica Pajović | Sabine Stoll
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminologies, and metadata repositories. We present and summarize five years of progress on the development of the cloud and of advancements in open data in linguistics, and we describe recent community activities. The paper aims to serve as a guideline to orient and involve researchers with the community and/or Linguistic Linked Open Data.

The ACQDIV Database: Min(d)ing the Ambient Language
Steven Moran
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

One of the most pressing questions in cognitive science remains unanswered: what cognitive mechanisms enable children to learn any of the world’s 7000 or so languages? Much has been discovered about specific learning mechanisms in specific languages. However, given the remarkable diversity of language structures (Evans and Levinson 2009, Bickel 2014), the burning question remains: what are the underlying processes that make language acquisition possible, despite substantial cross-linguistic variation in phonology, morphology, syntax, etc.? To investigate these questions, a comprehensive cross-linguistic database of longitudinal child language acquisition corpora from maximally diverse languages has been built.

2014

A Crowdsourcing Smartphone Application for Swiss German: Putting Language Documentation in the Hands of the Users
Jean-Philippe Goldman | Adrian Leeman | Marie-José Kolly | Ingrid Hove | Ibrahim Almajai | Volker Dellwo | Steven Moran
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This contribution describes an ongoing project, a smartphone application called Voice Äpp, which is a follow-up to a previous application called Dialäkt Äpp. The main purpose of both apps is to identify the user's Swiss German dialect on the basis of the dialectal variations of 15 words. The result is returned as one or more geographical points on a map. In Dialäkt Äpp, launched in 2013, the user provides his or her own pronunciation through buttons, while the Voice Äpp, currently in development, asks users to pronounce the word and uses speech recognition techniques to identify the variants and localize the user. This second app is more challenging from a technical point of view but nevertheless recovers the nature of dialect variation of spoken language. Besides, the Voice Äpp takes its users on a journey in which they explore the individuality of their own voices, answering questions such as: How high is my voice? How fast do I speak? Do I speak faster than users in the neighbouring city?

2013

Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
Steven Moran | Martin Brümmer
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

Linguistic Resources Enhanced with Geospatial Information
Richard Littauer | Boris Villazon-Terrazas | Steven Moran
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

An Open Source Toolkit for Quantitative Historical Linguistics
Johann-Mattis List | Steven Moran
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

This paper describes the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation (OKFN). The OWLG is an initiative concerned with linguistic data by scholars from diverse fields, including linguistics, NLP, and information science. The primary goal of the working group is to promote the idea of open linguistic resources, to develop means for their representation and to encourage the exchange of ideas across different disciplines. This paper summarizes the progress of the working group, goals that have been identified, problems that we are going to address, and recent activities and ongoing developments. Here, we put particular emphasis on the development of a Linked Open Data (sub-)cloud of linguistic resources that is currently being pursued by several OWLG members.

2009

An Ontology for Accessing Transcription Systems (OATS)
Steven Moran
Proceedings of the First Workshop on Language Technologies for African Languages

2008

Lexicon schemas and their use are discussed in this paper from the perspective of lexicographers and field linguists. A variety of lexicon schemas have been developed, with goals ranging from computational lexicography (DATR) through archiving (LIFT, TEI) to standardization (LMF, FSR). A number of requirements for lexicon schemas are given. The lexicon schemas are introduced and compared to each other in terms of conversion and usability for this particular user group, using a common lexicon entry and providing examples for each schema under consideration. The formats are assessed and the final recommendation is given for the potential users, namely to request standard compliance from the developers of the tools used. This paper should foster a discussion between authors of standards, lexicographers and field linguists.
