J. A. Meaney
2026
SemEval-2026 Task 1: MWAHAHA, Models Write Automatic Humor And Humans Annotate
Santiago Castro | Luis Chiruzzo | Santiago Góngora | Naihao Deng | Salar Rahili | Ignacio Sastre | Aiala Rosá | Victoria Amoroso | Guillermo Rey | Guillermo Moncecchi | J. A. Meaney | Juan José Prada | Rada Mihalcea
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
We present SemEval-2026 Task 1: MWAHAHA (Models Write Automatic Humor And Humans Annotate), the first shared task on general-purpose humor generation. Systems must produce short jokes in English, Spanish, and Chinese under lexical or topical constraints (Subtask A) and generate humorous captions for GIFs (Subtask B). To discourage memorization and ensure fairness, all jokes must meet specific criteria, such as using infrequent word pairs or relating to recent news headlines. Evaluation is conducted through pairwise human preference judgments in a Chatbot Arena-style setting, yielding Elo-based rankings. The task attracted 309 registered users, with 37 teams submitting systems to the evaluation phase. Participating systems employ a wide range of NLP techniques, including generate-then-rank pipelines, reinforcement learning, parameter-efficient fine-tuning, retrieval-augmented generation, humor-theory-grounded prompting, and persona-based strategies. Our Gemini 2.5 Flash baseline, using simple prompts, tied for first place in all subtasks, and the majority of elaborate multi-stage pipelines only marginally surpassed it with overlapping confidence intervals. More work is necessary to outperform the simple usage of state-of-the-art large language models. We release all evaluation data, prompts, and leaderboard results to support future research in computational humor generation.
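As an illustration of how pairwise preference judgments can be turned into Elo-style ratings, here is a minimal sketch. The abstract does not specify the exact rating procedure, so the sequential update rule, the K-factor of 32, the starting rating of 1000, and the system names are all assumptions made for this example, not the task's actual implementation.

```python
from collections import defaultdict


def elo_ratings(comparisons, k=32.0, base=1000.0):
    """Compute Elo-style ratings from (winner, loser) pairs.

    k and base are illustrative defaults, not values taken from the task.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        r_win, r_lose = ratings[winner], ratings[loser]
        # Probability the winner was expected to win under the Elo model.
        expected_win = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / 400.0))
        # The winner gains what the loser gives up, so the total is conserved.
        delta = k * (1.0 - expected_win)
        ratings[winner] = r_win + delta
        ratings[loser] = r_lose - delta
    return dict(ratings)


# Hypothetical judgments: each tuple is (preferred system, other system).
judgments = [("system_a", "system_b"), ("system_a", "system_c"), ("system_c", "system_b")]
for name, rating in sorted(elo_ratings(judgments).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Sequential Elo updates depend on the order of the comparisons, which is one reason arena-style leaderboards are often reported with confidence intervals.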
2024
Testing and Adapting the Representational Abilities of Large Language Models on Folktales in Low-Resource Languages
J. A. Meaney | Beatrice Alex | William Lamb
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Folktales are a rich resource of knowledge about the society and culture of a civilisation. Digital folklore research aims to use automated techniques to better understand these folktales, and it relies on abstract representations of the textual data. Although a number of large language models (LLMs) claim to be able to represent low-resource languages such as Irish and Gaelic, we present two classification tasks to explore how useful these representations are, and three adaptations to improve the performance of these models. We find that adapting the models to work with longer sequences and continuing pre-training on the domain of folktales improve classification performance, although these findings are tempered by the impressive performance of a baseline SVM with non-contextual features.
2021
SemEval 2021 Task 7: HaHackathon, Detecting and Rating Humor and Offense
J. A. Meaney | Steven Wilson | Luis Chiruzzo | Adam Lopez | Walid Magdy
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
SemEval 2021 Task 7, HaHackathon, was the first shared task to combine the previously separate domains of humor detection and offense detection. We collected 10,000 texts from Twitter and the Kaggle Short Jokes dataset, and had each annotated for humor and offense by 20 annotators aged 18-70. Our subtasks were binary humor detection, prediction of humor and offense ratings, and a novel controversy task: to predict if the variance in the humor ratings was higher than a specific threshold. The subtasks attracted 36-58 submissions, with most of the participants choosing to use pre-trained language models. Many of the highest performing teams also implemented additional optimization techniques, including task-adaptive training and adversarial training. The results suggest that the participating systems are well suited to humor detection, but that humor controversy is a more challenging task. We discuss which models excel in this task, which auxiliary techniques boost their performance, and analyze the errors which were not captured by the best systems.
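The controversy label described above can be sketched as a simple variance test over one text's humor ratings. The abstract does not state the threshold or which variance estimator was used, so the population variance, the threshold values, and the ratings below are assumptions chosen only for illustration.

```python
from statistics import pvariance


def is_controversial(humor_ratings, threshold):
    """Return True when the variance of the ratings exceeds the threshold.

    Uses population variance; the original task may have used another estimator.
    """
    return pvariance(humor_ratings) > threshold


# Hypothetical ratings from four annotators for two texts.
split_opinion = [1, 1, 5, 5]   # annotators disagree strongly
consensus = [3, 3, 3, 3]       # annotators agree

print(is_controversial(split_opinion, threshold=2.0))
print(is_controversial(consensus, threshold=2.0))
```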
2020
Smash at SemEval-2020 Task 7: Optimizing the Hyperparameters of ERNIE 2.0 for Humor Ranking and Rating
J. A. Meaney | Steven Wilson | Walid Magdy
Proceedings of the Fourteenth Workshop on Semantic Evaluation
The use of pre-trained language models such as BERT and ULMFiT has become increasingly popular in shared tasks, due to their powerful language modelling capabilities. Our entry to SemEval uses ERNIE 2.0, a language model which is pre-trained on a large number of tasks to enrich the semantic and syntactic information learned. ERNIE’s knowledge masking pre-training task is a unique method for learning about named entities, and we hypothesise that it may be of use in a dataset which is built on news headlines and which contains many named entities. We optimize the hyperparameters in a regression and classification model and find that the hyperparameters we selected helped to make bigger gains in the classification model than the regression model.
Crossing the Line: Where do Demographic Variables Fit into Humor Detection?
J. A. Meaney
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Recent humor classification shared tasks have struggled with two issues: either the data comprises a highly constrained genre of humor which does not broadly represent humor, or the data is so indiscriminate that the inter-annotator agreement on its humor content is drastically low. These tasks typically average over all annotators’ judgments, in spite of the fact that humor is a highly subjective phenomenon. We argue that demographic factors influence whether a text is perceived as humorous or not. We propose the addition of demographic information about the humor annotators in order to bin ratings more sensibly. We also suggest the addition of an ‘offensive’ label to distinguish between different generations, in terms of humor. This would allow for more nuanced shared tasks and could lead to better performance on downstream tasks, such as content moderation.