We present a multilingual dictionary of dirty words. We have collected about 3,200 dirty words in several languages and built a database of them. English has the most words in the database, though there are also several hundred dirty words in, for instance, Japanese. Words are classified by their general meaning, such as which part of the human anatomy they refer to. Words can also be assigned a nuance label indicating whether a word is, for example, a cute word used when speaking to children, a very rude word, or a clinical word. The database is available online and will hopefully be enlarged over time. It has already been used in research on, for instance, automatic joke generation and emotion detection.
We implement several different methods for generating jokes in English. The common theme is to intentionally produce poor utterances by breaking Grice's maxims of conversation. The generated jokes are evaluated and compared to human-made jokes. In general they are quite weak, though a few score highly and many score higher than the most boring human joke.
In this paper we present a novel method for automatic text summarization through text extraction, using computational semantics. The new idea is to view all the extracted text as a whole and compute a score for the total impact of the summary, instead of ranking, for instance, individual sentences. A greedy search strategy is used to explore the space of possible summaries and select the highest-scoring summary among those found. The aim has been to construct a summarizer that can be quickly assembled, using only a few basic language tools, for languages that lack large amounts of structured or annotated data or advanced tools for linguistic processing. The proposed method is largely language independent, though we only evaluate it on English in this paper, using ROUGE scores on texts from, among others, DUC 2004 task 2. On this task our method performs better than several of the systems evaluated there, but worse than the best systems.
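The idea of scoring a candidate summary as a whole and searching greedily over extracts can be illustrated with a minimal sketch. It is not the method from the paper: the scoring function here is a plain bag-of-words cosine similarity between the summary and the full text, used only as a stand-in for the semantic score, and the search simply adds one sentence at a time. The names `bag_of_words`, `summary_score`, `greedy_summarize` and the `max_words` limit are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a semantic representation."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summary_score(chosen, sentences, doc_vector):
    """Score the extracted text as one unit, not sentence by sentence."""
    text = " ".join(sentences[i] for i in sorted(chosen))
    return cosine(bag_of_words(text), doc_vector)


def greedy_summarize(sentences, max_words):
    """Greedily add the sentence that most improves the whole-summary score."""
    doc_vector = bag_of_words(" ".join(sentences))
    chosen, best = set(), 0.0
    while True:
        candidate, candidate_score = None, best
        for i in range(len(sentences)):
            if i in chosen:
                continue
            trial = chosen | {i}
            # Skip extensions that would exceed the length budget.
            if sum(len(sentences[j].split()) for j in trial) > max_words:
                continue
            score = summary_score(trial, sentences, doc_vector)
            if score > candidate_score:
                candidate, candidate_score = i, score
        if candidate is None:  # no sentence improves the score within budget
            break
        chosen.add(candidate)
        best = candidate_score
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `greedy_summarize(sentences, max_words=100)` on a list of sentence strings returns the selected sentences in document order. Because each step rescoring the whole summary, a sentence that merely repeats already selected content tends to add little, which is the intuition behind scoring the extract as a unit.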
Finding the Correct Interpretation of Swedish Compounds, a Statistical Approach
Jonas Sjöbergh | Viggo Kann
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)