Jussi Karlgren


2020

pdf bib
100,000 Podcasts: A Spoken English Document Corpus
Ann Clifton | Sravana Reddy | Yongze Yu | Aasish Pappu | Rezvaneh Rezapour | Hamed Bonab | Maria Eskevich | Gareth Jones | Jussi Karlgren | Ben Carterette | Rosie Jones
Proceedings of the 28th International Conference on Computational Linguistics

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. This is orders of magnitude larger than previous speech corpora used for search and summarization. Our results show that the size and variability of this corpus opens up new avenues for research.

2019

pdf bib
Team Harry Friberg at SemEval-2019 Task 4: Identifying Hyperpartisan News through Editorially Defined Metatopics
Nazanin Afsarmanesh | Jussi Karlgren | Peter Sumbler | Nina Viereckel
Proceedings of the 13th International Workshop on Semantic Evaluation

This report describes the starting point for a simple rule based hypothesis testing excercise on identifying hyperpartisan news items carried out by the Harry Friberg team from Gavagai. We used manually crafted metatopics, topics which often appear in hyperpartisan texts as rant conduits, together with tonality analysis to identify general characteristics of hyperpartisan news items. While the precision of the resulting effort is less than stellar— our contribution ranked 37th of the 42 successfully submitted experiments with overly high recall (95%) and low precision (54%)—we believe we have a model which allows us to continue exploring the underlying features of what the subgenre of hyperpartisan news items is characterised by.

2016

pdf bib
The Gavagai Living Lexicon
Magnus Sahlgren | Amaru Cuba Gyllensten | Fredrik Espinoza | Ola Hamfors | Jussi Karlgren | Fredrik Olsson | Per Persson | Akshay Viswanathan | Anders Holst
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.

2015

pdf bib
Inferring the location of authors from words in their texts
Max Berggren | Jussi Karlgren | Robert Östling | Mikael Parkvall
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2013

pdf bib
New Measures to Investigate Term Typology by Distributional Data
Jussi Karlgren
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2011

pdf bib
Experiments to investigate the utility of nearest neighbour metrics based on linguistically informed features for detecting textual plagiarism
Per Almquist | Jussi Karlgren
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2010

pdf bib
Cross-Lingual Comparison between Distributionally Determined Word Similarity Networks
Olof Görnerup | Jussi Karlgren
Proceedings of TextGraphs-5 - 2010 Workshop on Graph-based Methods for Natural Language Processing

pdf bib
Uncertainty Detection as Approximate Max-Margin Sequence Labelling
Oscar Täckström | Sumithra Velupillai | Martin Hassel | Gunnar Eriksson | Hercules Dalianis | Jussi Karlgren
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

2008

pdf bib
Experiments to Investigate the Connection between Case Distribution and Topical Relevance of Search Terms in an Information Retrieval Setting
Jussi Karlgren | Hercules Dalianis | Bart Jongejan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We have performed a set of experiments made to investigate the utility of morphological analysis to improve retrieval of documents written in languages with relatively large morphological variation in a practical commercial setting, using the SiteSeeker search system developed and marketed by Euroling Ab. The objective of the experiments was to evaluate different lemmatisers and stemmers to determine which would be the most practical for the task at hand: highly interactive, relatively high precision web searches in commercial customer-oriented document collections. This paper gives an overview of some of the results for Finnish and German, and describes specifically one experiment designed to investigate the case distribution of nouns in a highly inflectional language (Finnish) and the topicality of the nouns in target texts. We find that topical nouns taken from queries are distributed differently over relevant and non-relevant documents depending on their grammatical case.

2007

pdf bib
SICS: Valence annotation based on seeds in word space
Magnus Sahlgren | Jussi Karlgren | Gunnar Eriksson
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf bib
Creating bilingual lexica using reference wordlists for alignment of monolingual semantic vector spaces
Jon Holmlund | Magnus Sahlgren | Jussi Karlgren
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

pdf bib
Compound terms and their constituent elements in information retrieval
Jussi Karlgren
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

pdf bib
Multilingual interactive experiments with Flickr
Paul D. Clough | Julio Gonzales | Jussi Karlgren
Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources

1998

pdf bib
Assembling a Balanced Corpus from the Internet
Johan Dewe | Jussi Karlgren | Ivan Bretan
Proceedings of the 11th Nordic Conference of Computational Linguistics (NODALIDA 1998)

1994

pdf bib
Clustering Sentences – Making Sense of Synonymous Sentences
Jussi Karlgren | Björn Gambäck | Christer Samuelsson
Proceedings of the 9th Nordic Conference of Computational Linguistics (NODALIDA 1993)

pdf bib
Dilemma - An Instant Lexicographer
Hans Karlgren | Jussi Karlgren | Magnus Nordstrom | Paul Pettersson | Bengt Wahrolen
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

pdf bib
Recognizing Text Genres With Simple Metrics Using Discriminant Analysis
Jussi Karlgren | Douglass Cutting
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

1993

pdf bib
A Speech to Speech Translation System Built From Standard Components
Manny Rayner | Hiyan Alshawi | Ivan Bretan | David Carter | Vassilios Digalakis | Bjorn Gamback | Jaan Kaja | Jussi Karlgren | Bertil Lyberg | Steve Pulman | Patti Price | Christer Samuelsson
Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993