Marek Grześ

Also published as: Marek Grzes


2025

pdf bib
Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique
Piotr Sawicki | Marek Grzes | Dan Brown | Fabricio Goes
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs), introducing a novel methodology for poetry evaluation. Using a 90-poem dataset with a ground truth based on publication venue, we demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges. Our method, which leverages forced-choice ranking within small, randomized batches, enabled Claude-3-Opus to achieve a Spearman’s Rank Correlation of 0.87 with the ground truth, dramatically outperforming the best human non-expert evaluation (SRC = 0.38). The LLM assessments also exhibited high inter-rater reliability, underscoring the methodology’s robustness. These findings establish that LLMs, when guided by a comparative framework, can be effective and reliable tools for assessing poetry, paving the way for their broader application in other creative domains.

2020

pdf bib
SESAM at SemEval-2020 Task 8: Investigating the Relationship between Image and Text in Sentiment Analysis of Memes
Lisa Bonheme | Marek Grzes
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents our submission to task 8 (memotion analysis) of the SemEval 2020 competition. We explain the algorithms that were used to learn our models along with the process of tuning the algorithms and selecting the best model. Since meme analysis is a challenging task with two distinct modalities, we studied the impact of different multimodal representation strategies. The results of several approaches to dealing with multimodal data are therefore discussed in the paper. We found that alignment-based strategies did not perform well on memes. Our quantitative results also showed that images and text were uncorrelated. Fusion-based strategies did not show significant improvements and using one modality only (text or image) tends to lead to better results when applied with the predictive models that we used in our research.

2019

pdf bib
Relating RNN Layers with the Spectral WFA Ranks in Sequence Modelling
Farhana Ferdousi Liza | Marek Grzes
Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges

We analyse Recurrent Neural Networks (RNNs) to understand the significance of multiple LSTM layers. We argue that the Weighted Finite-state Automata (WFA) trained using a spectral learning algorithm are helpful to analyse RNNs. Our results suggest that multiple LSTM layers in RNNs help learning distributed hidden states, but have a smaller impact on the ability to learn long-term dependencies. The analysis is based on the empirical results, however relevant theory (whenever possible) was discussed to justify and support our conclusions.

2016

pdf bib
An Improved Crowdsourcing Based Evaluation Technique for Word Embedding Methods
Farhana Ferdousi Liza | Marek Grześ
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP