Franciska de Jong


pdf bib
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference
Darja Fišer | Maria Eskevich | Jakob Lenardič | Franciska de Jong
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference


pdf bib
Proceedings of the Second ParlaCLARIN Workshop
Darja Fišer | Maria Eskevich | Franciska de Jong
Proceedings of the Second ParlaCLARIN Workshop

CLARIN: Distributed Language Resources and Technology in a European Infrastructure
Maria Eskevich | Franciska de Jong | Alexander König | Darja Fišer | Dieter Van Uytvanck | Tero Aalto | Lars Borin | Olga Gerassimenko | Jan Hajic | Henk van den Heuvel | Neeme Kahusk | Krista Liin | Martin Matthiesen | Stelios Piperidis | Kadri Vider
Proceedings of the 1st International Workshop on Language Technology Platforms

CLARIN is a European Research Infrastructure providing access to digital language resources and tools from across Europe and beyond to researchers in the humanities and social sciences. This paper focuses on CLARIN as a platform for the sharing of language resources. It zooms in on the service offer for the aggregation of language repositories and the value proposition for a number of communities that benefit from the enhanced visibility of their data and services as a result of integration in CLARIN. The enhanced findability of language resources is serving the social sciences and humanities (SSH) community at large and supports research communities that aim to collaborate based on virtual collections for a specific domain. The paper also addresses the wider landscape of service platforms based on language technologies which has the potential of becoming a powerful set of interoperable facilities to a variety of communities of use.

Interoperability in an Infrastructure Enabling Multidisciplinary Research: The case of CLARIN
Franciska de Jong | Bente Maegaard | Darja Fišer | Dieter van Uytvanck | Andreas Witt
Proceedings of the Twelfth Language Resources and Evaluation Conference

CLARIN is a European Research Infrastructure providing access to language resources and technologies for researchers in the humanities and social sciences. It supports the use and study of language data in general and aims to increase the potential for comparative research of cultural and societal phenomena across the boundaries of languages and disciplines, all in line with the European agenda for Open Science. Data infrastructures such as CLARIN have recently embarked on the emerging frameworks for the federation of infrastructural services, such as the European Open Science Cloud and the integration of services resulting from multidisciplinary collaboration in federated services for the wider SSH domain. In this paper we describe the interoperability requirements that arise through the existing ambitions and the emerging frameworks. The interoperability theme will be addressed at several levels, including organisation and ecosystem, design of workflow services, data curation, performance measurement and collaboration.


CLARIN: Towards FAIR and Responsible Data Science Using Language Resources
Franciska de Jong | Bente Maegaard | Koenraad De Smedt | Darja Fišer | Dieter Van Uytvanck
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


COMMIT at SemEval-2017 Task 5: Ontology-based Method for Sentiment Analysis of Financial Headlines
Kim Schouten | Flavius Frasincar | Franciska de Jong
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our submission to Task 5 of SemEval 2017, Fine-Grained Sentiment Analysis on Financial Microblogs and News, where we limit ourselves to performing sentiment analysis on news headlines only (track 2). The approach presented in this paper uses a Support Vector Machine to do the required regression, and besides unigrams and a sentiment tool, we use various ontology-based features. To this end we created a domain ontology that models various concepts from the financial domain. This allows us to model the sentiment of actions depending on which entity they are affecting (e.g., ‘decreasing debt’ is positive, but ‘decreasing profit’ is negative). The presented approach yielded a cosine distance of 0.6810 on the official test data, resulting in the 12th position.


Survey: Computational Sociolinguistics: A Survey
Dong Nguyen | A. Seza Doğruöz | Carolyn P. Rosé | Franciska de Jong
Computational Linguistics, Volume 42, Issue 3 - September 2016


COMMIT-P1WP3: A Co-occurrence Based Approach to Aspect-Level Sentiment Analysis
Kim Schouten | Flavius Frasincar | Franciska de Jong
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Applying prosodic speech features in mental health care: An exploratory study in a life-review intervention for depression
Sanne M.A. Lamers | Khiet P. Truong | Bas Steunenberg | Franciska de Jong | Gerben J. Westerhof
Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment
Dong Nguyen | Dolf Trieschnigg | A. Seza Doğruöz | Rilana Gravel | Mariët Theune | Theo Meder | Franciska de Jong
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Croatian Memories
Arjan van Hessen | Franciska de Jong | Stef Scagliola | Tanja Petrovic
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this contribution we describe a collection of approximately 400 video interviews recorded in the context of the project Croatian Memories (CroMe) with the objective of documenting personal war-related experiences. The value of this type of sources is threefold: they contain information that is missing in written sources, they can contribute to the process of reconciliation, and they provide a basis for reuse of data in disciplines with an interest in narrative data. The CroMe collection is not primarily designed as a linguistic corpus, but is the result of an archival effort to collect so-called oral history data. For researchers in the fields of natural language processing and speech analy¬sis this type of life-stories may function as an object trouvé containing real-life language data that can prove to be useful for the purpose of modelling specific aspects of human expression and communication.


Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus
Martin Reynaert | Nelleke Oostdijk | Orphée De Clercq | Henk van den Heuvel | Franciska de Jong
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In The Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Beside its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts - SMS, chat - the problem is even more complex, issues such as anonimity, recognizability and citation right, all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and the smaller - of commissioned size - more privacy compliant SoNaR, IPR-cleared for commercial purposes as well.


pdf bib
Invited Talk: NLP and the Humanities: The Revival of an Old Liaison
Franciska de Jong
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)


Evaluation of Spoken Document Retrieval for Historic Speech Collections
Willemijn Heeren | Franciska de Jong | Laurens van der Werff | Marijn Huijbregts | Roeland Ordelman
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The re-use of spoken word audio collections maintained by audiovisual archives is severely hindered by their generally limited access. The CHoral project, which is part of the CATCH program funded by the Dutch Research Council, aims to provide users of speech archives with online, instead of on-location, access to relevant fragments, instead of full documents. To meet this goal, a spoken document retrieval framework is being developed. In this paper the evaluation efforts undertaken so far to assess and improve various aspects of the framework are presented. These efforts include (i) evaluation of the automatically generated textual representations of the spoken word documents that enable word-based search, (ii) the development of measures to estimate the quality of the textual representations for use in information retrieval, and (iii) studies to establish the potential user groups of the to-be-developed technology, and the first versions of the user interface supporting online access to spoken word collections.


Event-Coreference across Multiple, Multi-lingual Sources in the Mumis Project
Horacio Saggion | Jan Kuper | Hamish Cunningham | Thierry Declerck | Peter Wittenburg | Marco Puts | Eduard Hoenkamp | Franciska de Jong | Yorick Wilks