Bartosz Żuk


2025

This study addresses the fundamental task of discourse unit detection, the critical initial step in discourse parsing. We analyze how various discourse frameworks conceptualize and structure discourse units, with a focus on their underlying taxonomies and theoretical assumptions. While approaches to discourse segmentation vary considerably, the extent to which these conceptual divergences influence practical implementations remains insufficiently studied. To address this gap, we investigate similarities and differences in segmentation across several English datasets, segmented and annotated according to distinct discourse frameworks, using simple rule-based heuristics. We evaluate the effectiveness of the rules against gold-standard segmentation, while also examining their variability and cross-framework generalizability. Additionally, we manually compare a sample of rule-based segmentation outputs against benchmark segmentation, identifying points of convergence and divergence. Our findings indicate that discourse frameworks align strongly at the level of segmentation: particular clauses consistently serve as the primary boundaries of discourse units. Discrepancies arise mainly in the treatment of other structures, such as adpositional phrases, appositions, interjections, and parenthesised text segments, which are inconsistently marked as separate discourse units across formalisms.
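As an illustration of what evaluating rule-based segmentation against a gold standard can look like, the sketch below scores predicted discourse-unit boundaries with precision, recall, and F1. It is a minimal example under stated assumptions, not the study's own evaluation code: the function name `boundary_prf`, the representation of boundaries as token indices, and the sample data are all hypothetical.

```python
def boundary_prf(gold, pred):
    """Score predicted segment boundaries against gold boundaries.

    Both arguments are iterables of token indices at which a discourse
    unit starts (an assumed representation, chosen for illustration).
    Returns (precision, recall, f1).
    """
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)  # boundaries found in both sets
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: gold units start at tokens 0, 5, 9;
# a rule-based segmenter proposes starts at tokens 0, 5, 7, 12.
p, r, f = boundary_prf([0, 5, 9], [0, 5, 7, 12])
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Running the same function per dataset would give one simple way to compare how well a single rule set transfers across frameworks, under the assumption that all datasets share a tokenization.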
Alignment is the critical process of minimizing harmful outputs by teaching large language models (LLMs) to prefer safe, helpful, and appropriate responses. While alignment research and datasets remain overwhelmingly English-centric, ensuring safety across diverse linguistic and cultural contexts requires localized resources. In this paper, we introduce PLLuM-Align, the first Polish preference dataset, created entirely through human annotation. The dataset includes response rating, ranking, and multi-turn dialog data. Designed to reflect the linguistic subtleties and cultural norms of Polish, this resource lays the groundwork for better-aligned Polish LLMs and contributes to the broader goal of multilingual alignment in underrepresented languages.