Eric Sanders
The Dutch Dialect Database (also known as the ‘Nederlandse Dialectenbank’) contains dialectal variations of Dutch that were recorded all over the Netherlands in the second half of the twentieth century. A subset of these recordings, comprising about 300 hours, was enriched with manual orthographic transcriptions that use non-standard approximations of dialectal speech. In this paper we describe the creation of a corpus containing both the audio recordings and their corresponding transcriptions, and we focus on our method for aligning the recordings with the transcriptions and the metadata.
LeiLanD (Leiden Language Data) is a searchable catalogue initiated by the Leiden University Centre for Linguistics (LUCL) with the support of CLARIAH. The catalogue contains metadata about language datasets collected at LUCL and other institutes of Leiden University. This paper describes a project to FAIRify the datasets: their findability and accessibility are increased through the standardised metadata format CMDI, so as to obtain a rich metadata description for all resources and to make them findable through CLARIN’s Virtual Language Observatory. The paper describes the creation of the catalogue and the steps that led from unstructured metadata to the CMDI standard. This FAIRification of LeiLanD has enhanced the findability and accessibility of an incredibly diverse collection of language datasets.
Twitter has been used as a textual resource in attempts to predict the outcome of elections for over a decade. A body of literature suggests that this is not consistently possible. In this paper we test the hypothesis that mentions of political parties in tweets are better correlated with the appearance of party names in newspapers than with the intention of the tweeter to vote for that party. Five Dutch national elections are used in this study. We find only a small positive, negligible difference in Pearson’s correlation coefficient, as well as in the absolute error, between the tweets-news relation and the tweets-elections relation. However, we find a larger correlation and a smaller absolute error between party mentions in newspapers and the outcome of the elections in four of the five elections. This suggests that newspapers are a better starting point for predicting the election outcome than tweets.
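As a toy illustration (not the paper's code or data), the comparison above can be run with scipy: Pearson's r and mean absolute error between tweet-based party shares and news-based shares, and between tweet-based shares and the election results. All numbers below are made up.

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical mention/vote shares for five parties.
    tweet_share    = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
    news_share     = np.array([0.28, 0.27, 0.19, 0.16, 0.10])
    election_share = np.array([0.26, 0.24, 0.22, 0.17, 0.11])

    for name, other in [("news", news_share), ("elections", election_share)]:
        r, _ = pearsonr(tweet_share, other)
        mae = np.mean(np.abs(tweet_share - other))
        print(f"tweets vs {name}: r = {r:.2f}, MAE = {mae:.3f}")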
We study the relation between the number of mentions of political parties in tweets in the last weeks before an election and the election results. In this paper we focus on the Dutch parliamentary elections of 2012 and the provincial (and senate) elections of 2011 and 2015. With raw counts, without adaptations, we achieve a mean absolute error (MAE) of 2.71% for 2011, 2.02% for 2012 and 2.89% for 2015. A set of over 17,000 tweets containing political party names was annotated by at least three annotators per tweet on ten features denoting communicative intent (including the presence of sarcasm, the message’s polarity, the presence of an explicit voting endorsement or explicit voting advice, etc.). The annotations were used to create oracle (gold-standard) filters: tweets with or without a certain majority annotation are held out from the tweet counts, with the goal of attaining lower MAEs. With a grid search we tested all combinations of filters and their corresponding MAEs to find the best filter ensemble. The filters turned out to show markedly different behaviour for the three elections, and only a small MAE improvement is possible when optimizing on all three elections. Larger improvements for one election are possible, but result in a deterioration of the MAE for the other elections.
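The oracle-filter grid search can be sketched as follows. This is a minimal illustration under assumed feature names and toy data, not the study's actual pipeline: each annotated feature may be unused, required, or excluded, and the combination yielding the lowest MAE against the real results is kept.

    from itertools import product

    # Toy stand-ins for the ~17,000 annotated tweets: (party, annotations).
    tweets = [
        ("PartyA", {"sarcasm": True,  "endorsement": False}),
        ("PartyA", {"sarcasm": False, "endorsement": True}),
        ("PartyB", {"sarcasm": False, "endorsement": False}),
    ]
    actual_share = {"PartyA": 0.55, "PartyB": 0.45}  # hypothetical results
    features = ["sarcasm", "endorsement"]

    def mae(filter_spec):
        # Keep only tweets whose annotations match the filter; a feature
        # maps to True (require it), False (exclude it), or is absent.
        kept = [t for t in tweets
                if all(t[1][f] == want for f, want in filter_spec.items())]
        if not kept:
            return float("inf")
        shares = {p: sum(1 for q, _ in kept if q == p) / len(kept)
                  for p in actual_share}
        return sum(abs(shares[p] - actual_share[p])
                   for p in actual_share) / len(actual_share)

    # Grid search over all filter combinations.
    candidates = (dict((f, w) for f, w in zip(features, ws) if w is not None)
                  for ws in product([None, True, False], repeat=len(features)))
    best = min(candidates, key=mae)
    print(best, mae(best))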
We set out to investigate whether TV ratings and mentions of TV programmes on the Twitter social media platform are correlated. If such a correlation exists, Twitter may be used as an alternative source for estimating viewer popularity. Moreover, Twitter-based rating estimates may be generated during the programme, or even before it. We count the occurrences of programme-specific hashtags in an archive of Dutch tweets for eleven popular TV shows broadcast in the Netherlands in one season, and perform correlation tests. Overall we find a strong correlation of 0.82; the correlation remains strong, 0.79, if tweets are counted in the half hour before broadcast time. However, the two most popular TV shows account for most of the positive effect; if we leave out the most popular and the second most popular TV shows, the correlation drops to moderate or weak. Also, within a TV show, correlations between ratings and tweet counts are mostly weak, while correlations between the TV ratings of consecutive broadcasts are strong. In the absence of information on previous shows, Twitter-based counts may be a viable alternative to classic estimation methods for TV ratings. Estimates are more reliable for more popular TV shows.
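A minimal sketch of the pre-broadcast counting condition mentioned above, with invented tweets and timestamps: programme-specific hashtags are counted in a fixed window before broadcast time, and these counts can then be correlated with ratings as in the previous sketch.

    from datetime import datetime, timedelta

    tweets = [  # (timestamp, text) -- toy archive entries
        (datetime(2013, 3, 1, 20, 5), "zin in #boerzoektvrouw"),
        (datetime(2013, 3, 1, 20, 25), "#boerzoektvrouw begint zo"),
        (datetime(2013, 3, 1, 21, 0), "mooi was dat #boerzoektvrouw"),
    ]

    def count_before(hashtag, broadcast_start, minutes=30):
        # Count tweets with the hashtag in the window before broadcast.
        lo = broadcast_start - timedelta(minutes=minutes)
        return sum(1 for ts, text in tweets
                   if lo <= ts < broadcast_start and hashtag in text.lower())

    print(count_before("#boerzoektvrouw", datetime(2013, 3, 1, 20, 30)))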
We developed a web application for crowdsourcing transcriptions of Dutch words spoken by Spanish L2 learners. In this paper we discuss the design of the application and the influence of metadata and various forms of feedback. Useful data were obtained from 159 participants, with an average of over 20 transcriptions per item, which seems a satisfactory result for this type of research. Informing participants about how many items they still had to complete, rather than how many they had already completed, turned out to be an incentive to do more items. Assigning participants a score for their performance made it more attractive for them to carry out the transcription task, but this also seemed to influence their performance. We discuss possible advantages and disadvantages in connection with the aim of the research and consider lessons for designing future experiments.
This paper describes the process of semi-automatically converting dictionaries from paper to structured text (a database) and the integration of these dictionaries into the CLARIN infrastructure, in order to make them accessible and retrievable for the research community. The case study at hand is the curation of 42 fascicles of the Dictionaries of the Brabantic and Limburgian dialects, and 6 fascicles of the Dictionary of dialects in Gelderland.
This paper describes the Dutch LESLLA data and its curation. LESLLA stands for Low-Educated Second Language and Literacy Acquisition. The data was collected for research in this field and would have disappeared had it not been saved. Within the CLARIN project Data Curation Service, the data was turned into a spoken language resource and made available to other researchers.
The VALID Data Archive is an open multimedia data archive (under construction) with data from speakers suffering from language impairments. We report on a pilot project in the CLARIN-NL framework in which five data resources were curated. For all data sets concerned, written informed consent was obtained from the participants or their caretakers. All materials were anonymized. The audio files were converted into wav (linear PCM) files and the transcriptions into CHAT or ELAN format. Research data that consisted of test, SPSS and Excel files were documented and converted into CSV files. All data sets were provided with appropriate CMDI metadata files. A new CMDI metadata profile for this type of data resource was established, and care was taken to use ISOcat metadata categories to optimize interoperability. After curation, all data were deposited at the Max Planck Institute for Psycholinguistics Nijmegen, where persistent identifiers are linked to all resources. The content of the transcriptions in CHAT and plain text format can be searched with the TROVA search engine.
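The tabular-data conversions described above can be illustrated with pandas (an assumption for illustration; the project does not specify its tooling). pandas reads SPSS .sav files via pyreadstat and Excel files via openpyxl; the file names below are hypothetical.

    import pandas as pd

    spss_df = pd.read_spss("test_scores.sav")      # hypothetical file name
    excel_df = pd.read_excel("participants.xlsx")  # hypothetical file name

    # Write interoperable, plain-text CSV versions for archiving.
    spss_df.to_csv("test_scores.csv", index=False)
    excel_df.to_csv("participants.csv", index=False)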
We present a web-based tool for retrieving and annotating audio fragments of, for example, interviews. Our collection contains 250 interviews with veterans of Dutch conflicts and military missions. The audio files of the interviews were disclosed using ASR technology focused on keyword retrieval. The resulting transcripts were stored in a MySQL database together with metadata, summary texts and keywords, and carefully indexed. Retrieved fragments can be played back and annotated. Annotations can be kept personal or be shared with other users. The tool and formats comply with CLARIN standards. A demo version of the tool is available at http://wwwlands2.let.kun.nl/spex/annotationtooldemo.
In this paper a collection of chats and tweets from the Netherlands and Flanders is described. The chats and tweets are part of the freely available SoNaR corpus, a 500-million-word text corpus of the Dutch language. Recruitment, metadata, anonymisation and IPR issues are discussed. To illustrate the differences in language use between the various text types and across other parameters (like gender and age), a simple text analysis in the form of unigram frequency lists is carried out. Furthermore, a website is presented with which users can retrieve their own frequency lists.
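The unigram frequency lists can be computed along these lines; a minimal sketch with a naive tokeniser and invented messages, not the corpus pipeline itself.

    import re
    from collections import Counter

    def unigram_frequencies(texts):
        # Lowercase, split on word characters, and count.
        counts = Counter()
        for text in texts:
            counts.update(re.findall(r"\w+", text.lower()))
        return counts.most_common()

    chats  = ["hoi hoe gaat het", "hoi alles goed"]
    tweets = ["mooi weer vandaag", "wat een mooi doelpunt"]
    print(unigram_frequencies(chats)[:5])   # per-text-type frequency list
    print(unigram_frequencies(tweets)[:5])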
Enhanced Publications (EPs) are a new way to publish scientific and other results in an electronic article. The advantage of EPs is that the relation between the article and the underlying data facilitates the peer review process and other quality assessment activities. Due to the link between the publication and the research data, the publication can be much richer than a paper edition permits. We present an example of EPs in which links are made to interview fragments that include transcripts, audio segments, annotations and metadata. EPs call for a new paradigm of research methodology in which digital persistent access to research data is a central issue. In this contribution we highlight (1) the research data as it is archived and curated, (2) the concept of an “enhanced publication” and its scientific value, (3) the “fragment fitter tool”, a language processing tool to facilitate the creation of EPs, and (4) IPR issues related to the re-use of the interview data.
Within the framework of the Dutch-Flemish programme STEVIN, the JASMIN-CGN (Jongeren, Anderstaligen en Senioren in Mens-machine Interactie Corpus Gesproken Nederlands) project was carried out, aimed at collecting speech from children, non-natives and elderly people. The JASMIN-CGN project extends the Spoken Dutch Corpus (CGN) along three dimensions. First, by collecting a corpus of contemporary Dutch as spoken by children of different age groups, elderly people and non-natives with different mother tongues, an extension along the age and mother tongue dimensions was achieved. In addition, we collected speech material in a communication setting that was not envisaged in the CGN: human-machine interaction. One third of the data was collected in Flanders and two thirds in the Netherlands. In this paper we report on our experiences in collecting this corpus and describe some of the important decisions that we made in the attempt to combine efficiency and high quality.
Research into spoken language has become more visual over the years. Both fundamental and applied research have progressively included gestures, gaze and facial expression. Corpora of multi-modal conversational speech are rare and frequently difficult to use due to privacy and copyright restrictions. We present a freely available annotated corpus, gratis and libre, of high-quality video recordings of face-to-face conversational speech. Annotations include orthography, POS tags, and automatically generated phoneme transcriptions and word boundaries. In addition, labeling of both simple conversational function and gaze direction has been performed. Within the bounds of the law, everything has been done to remove copyright and use restrictions. Annotations have been processed into RDBMS tables that allow SQL queries and direct connections to statistical software. From our experiences we would like to advocate the formulation of best practices for both the legal handling and the database storage of recordings and annotations.
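The RDBMS layout makes queries like the following possible; this is a sketch with a hypothetical, simplified schema (SQLite in-memory, word and gaze tiers with start/end times), not the corpus's actual tables.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE words (speaker TEXT, word TEXT, start REAL, end REAL);
        CREATE TABLE gaze  (speaker TEXT, direction TEXT, start REAL, end REAL);
    """)
    con.executemany("INSERT INTO words VALUES (?,?,?,?)",
                    [("A", "ja", 0.0, 0.4), ("A", "precies", 0.4, 1.1)])
    con.executemany("INSERT INTO gaze VALUES (?,?,?,?)",
                    [("A", "partner", 0.0, 0.9), ("A", "away", 0.9, 2.0)])

    # Words spoken while the speaker looked at the conversation partner.
    rows = con.execute("""
        SELECT w.word, w.start
        FROM words w JOIN gaze g
          ON w.speaker = g.speaker
         AND w.start >= g.start AND w.start < g.end
        WHERE g.direction = 'partner'
    """).fetchall()
    print(rows)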
The goal of the LILA project was the collection of speech databases over cellular telephone networks for five languages in three Asian countries. Three languages were recorded in India: Hindi by first-language speakers, Hindi by second-language speakers, and Indian English. Furthermore, Mandarin was recorded in China and Korean in South Korea. The databases are part of the SpeechDat family and follow the SpeechDat rules in many respects. All databases have been finished and have passed the validation tests. Both Hindi databases and the Korean database will be available to the public for sale.
The goal of this project (LILA) is the collection of a large number of spoken databases for training automatic speech recognition systems for telephone applications in the Asian Pacific area. Specifications follow those of SpeechDat-like databases. Utterances will be recorded directly from calls made from either fixed or cellular telephones, and consist of read text and answers to specific questions. The project is driven by a consortium composed of a large number of industrial companies. Each company is in charge of the production of two databases. The consortium shares the databases produced in the project. The goal of the project should be reached within the year 2005.
This paper deals with the quality evaluation (validation) of Spoken Language Resources (SLR). The current situation in terms of relevant validation criteria and procedures is briefly presented. Next, a number of validation issues related to new data formats (XML-based annotations, UTF-16 encoding) are discussed. Further, new validation cycles that were introduced in a series of new projects like SpeeCon and OrienTel are addressed: prompt sheet validation, lexicon validation and pre-release validation. Finally, SPEX's current and future validation activities are outlined.