2025
pdf
bib
abs
PAKTON: A Multi-Agent Framework for Question Answering in Long Legal Agreements
Raptopoulos Petros
|
Giorgos Filandrianos
|
Maria Lymperaiou
|
Giorgos Stamou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Contract review is a complex and time-intensive task that typically demands specialized legal expertise, rendering it largely inaccessible to non-experts. Moreover, legal interpretation is rarely straightforward—ambiguity is pervasive, and judgments often hinge on subjective assessments. Compounding these challenges, contracts are usually confidential, restricting their use with proprietary models and necessitating reliance on open-source alternatives. To address these challenges, we introduce PAKTON: a fully open-source, end-to-end, multi-agent framework with plug-and-play capabilities. PAKTON is designed to handle the complexities of contract analysis through collaborative agent workflows and a novel retrieval-augmented generation (RAG) component, enabling automated legal document review that is more accessible, adaptable, and privacy-preserving. Experiments demonstrate that PAKTON outperforms both general-purpose and pretrained models in predictive accuracy, retrieval performance, explainability, completeness, and grounded justifications as evaluated through a human study and validated with automated metrics.
pdf
bib
abs
Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations
Giorgos Filandrianos
|
Angeliki Dimitriou
|
Maria Lymperaiou
|
Konstantinos Thomas
|
Giorgos Stamou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The advent of Large Language Models (LLMs) has revolutionized product recommenders, yet their susceptibility to adversarial manipulation poses critical challenges, particularly in real-world commercial applications. Our approach is the first one to tap into human psychological principles, seamlessly modifying product descriptions, making such manipulations hard to detect. In this work, we investigate cognitive biases as black-box adversarial strategies, drawing parallels between their effects on LLMs and human purchasing behavior. Through extensive evaluation across models of varying scale, we find that certain biases, such as social proof, consistently boost product recommendation rate and ranking, while others, like scarcity and exclusivity, surprisingly reduce visibility. Our results demonstrate that cognitive biases are deeply embedded in state-of-the-art LLMs, leading to highly unpredictable behavior in product recommendations and posing significant challenges for effective mitigation.
pdf
bib
abs
Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms
Orfeas Menis Mastromichalakis
|
Giorgos Filandrianos
|
Maria Symeonaki
|
Giorgos Stamou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Machine Translation (MT) systems frequently encounter gender-ambiguous occupational terms, where they must assign gender without explicit contextual cues. While individual translations in such cases may not be inherently biased, systematic patterns—such as consistently translating certain professions with specific genders—can emerge, reflecting and perpetuating societal stereotypes. This ambiguity challenges traditional instance-level single-answer evaluation approaches, as no single gold standard translation exists. To address this, we introduce GRAPE, a probability-based metric designed to evaluate gender bias by analyzing aggregated model responses. Alongside this, we present GAMBIT, a benchmarking dataset in English with gender-ambiguous occupational terms. Using GRAPE, we evaluate several MT systems and examine whether their gendered translations in Greek and French align with or diverge from societal stereotypes, real-world occupational gender distributions, and normative standards.
pdf
bib
abs
Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models
Elena Stringli
|
Maria Lymperaiou
|
Giorgos Filandrianos
|
Athanasios Voulodimos
|
Giorgos Stamou
Findings of the Association for Computational Linguistics: ACL 2025
Inverse tasks can uncover potential reasoning gaps as Large Language Models (LLMs) scale up. In this work, we explore the redefinition task, in which we assign alternative values to well-known physical constants and units of measure, prompting LLMs to respond accordingly. Our findings show that not only does model performance degrade with scale, but its false confidence also rises. Moreover, while factors such as prompting strategies or response formatting are influential, they do not preclude LLMs from anchoring to memorized values.
2024
pdf
bib
abs
Optimal and efficient text counterfactuals using Graph Neural Networks
Dimitris Lymperopoulos
|
Maria Lymperaiou
|
Giorgos Filandrianos
|
Giorgos Stamou
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
As NLP models become increasingly integral to decision-making processes, the need for explainability and interpretability has become paramount. In this work, we propose a framework that achieves the aforementioned by generating semantically edited inputs, known as counterfactual interventions, which change the model prediction, thus providing a form of counterfactual explanations for the model. We frame the search for optimal counterfactual interventions as a graph assignment problem and employ a GNN to solve it, thus achieving high efficiency. We test our framework on two NLP tasks - binary sentiment classification and topic classification - and show that the generated edits are contrastive, fluent and minimal, while the whole process remains significantly faster than other state-of-the-art counterfactual editors.
pdf
bib
abs
Enhancing adversarial robustness in Natural Language Inference using explanations
Alexandros Koulakos
|
Maria Lymperaiou
|
Giorgos Filandrianos
|
Giorgos Stamou
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
The surge of state-of-the-art transformer-based models has undoubtedly pushed the limits of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the underexplored task of Natural Language Inference (NLI), since models trained on popular well-suited datasets are susceptible to adversarial attacks, allowing subtle input interventions to mislead the model. In this work, we validate the usage of natural language explanation as a model-agnostic defence strategy through extensive experimentation: only by fine-tuning a classifier on the explanation rather than premise-hypothesis inputs, robustness under various adversarial attacks is achieved in comparison to explanation-free baselines. Moreover, since there is no standard strategy for testing the semantic validity of the generated explanations, we research the correlation of widely used language generation metrics with human perception, in order for them to serve as a proxy towards robust NLI models. Our approach is resource-efficient and reproducible without significant computational limitations.
pdf
bib
abs
BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training
Nikitas Theodoropoulos
|
Giorgos Filandrianos
|
Vassilis Lyberatos
|
Maria Lymperaiou
|
Giorgos Stamou
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
We describe our contribution to the Strict and Strict-Small tracks of the 2nd iteration of the BabyLM Challenge. The shared task is centered around efficient pre-training given data constraints motivated by human development. In response, we study the effect of synthetic story data in language pre-training using *TinyStories*: a recently introduced dataset of short stories. Initially, we train GPT-Neo models on subsets of *TinyStories*, while varying the amount of available data. We find that, even with access to less than 100M words, the models are able to generate high-quality, original completions to a given story, and acquire substantial linguistic knowledge. To measure the effect of synthetic story data, we train *LTG-BERT* encoder models on a combined dataset of: a subset of *TinyStories*, story completions generated by GPT-Neo, and a subset of the *BabyLM* dataset. Our experimentation reveals that synthetic data can occasionally offer modest gains, but overall have a negative influence on linguistic understanding. Our work offers an initial study on synthesizing story data in low resource settings and underscores their potential for augmentation in data-constrained language modeling. We publicly release our models and implementation on our GitHub.
pdf
bib
abs
Puzzle Solving using Reasoning of Large Language Models: A Survey
Panagiotis Giadikiaroglou
|
Maria Lymperaiou
|
Giorgos Filandrianos
|
Giorgos Stamou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Exploring the capabilities of Large Language Models (LLMs) in puzzle solving unveils critical insights into their potential and challenges in AI, marking a significant step towards understanding their applicability in complex reasoning tasks. This survey leverages a unique taxonomy—dividing puzzles into rule-based and rule-less categories—to critically assess LLMs through various methodologies, including prompting techniques, neuro-symbolic approaches, and fine-tuning. Through a critical review of relevant datasets and benchmarks, we assess LLMs’ performance, identifying significant challenges in complex puzzle scenarios. Our findings highlight the disparity between LLM capabilities and human-like reasoning, particularly in those requiring advanced logical inference. The survey underscores the necessity for novel strategies and richer datasets to advance LLMs’ puzzle-solving proficiency and contribute to AI’s logical reasoning and creative problem-solving advancements.
pdf
bib
abs
”I Never Said That”: A dataset, taxonomy and baselines on response clarity classification
Konstantinos Thomas
|
Giorgos Filandrianos
|
Maria Lymperaiou
|
Chrysoula Zerva
|
Giorgos Stamou
Findings of the Association for Computational Linguistics: EMNLP 2024
Equivocation and ambiguity in public speech are well-studied discourse phenomena, especially in political science and analysis of political interviews. Inspired by the well-grounded theory on equivocation, we aim to resolve the closely related problem of response clarity in questions extracted from political interviews, leveraging the capabilities of Large Language Models (LLMs) and human expertise. To this end, we introduce a novel taxonomy that frames the task of detecting and classifying response clarity and a corresponding clarity classification dataset which consists of question-answer (QA) pairs drawn from political interviews and annotated accordingly. Our proposed two-level taxonomy addresses the clarity of a response in terms of the information provided for a given question (high-level) and also provides a fine-grained taxonomy of evasion techniques that relate to unclear, ambiguous responses (lower-level). We combine ChatGPT and human annotators to collect, validate and annotate discrete QA pairs from political interviews, to be used for our newly introduced response clarity task. We provide a detailed analysis and conduct several experiments with different model architectures, sizes and adaptation methods to gain insights and establish new baselines over the proposed dataset and task.