Clarifying the research framing of NLP artefacts (e.g., models and datasets) is crucial to aligning research with practical applications when researchers claim that their findings have real-world impact. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking, using an existing dataset, and hate speech detection, for which we annotate a new dataset. In both domains, we achieve consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation.
Large language models are increasingly expected to adapt to individual users, reflecting differences in preferences, values, and communication styles. To evaluate whether models can serve diverse populations, we introduce MTPA, a benchmark that leverages large-scale survey data (WVS, EVS, GSS) to construct real, hyper-granular personas spanning demographics, beliefs, and values. Unlike prior benchmarks that rely on synthetic profiles or narrow trait prediction, MTPA conditions models on real personas and systematically tests their behavior across core alignment tasks. We show that persona conditioning exposes pluralistic misalignment: while aggregate metrics suggest models are truthful and safe, subgroup-specific evaluations reveal hidden pockets of degraded factuality, fairness disparities, and inconsistent value alignment. Alongside the benchmark, we release a dataset, toolkit, and baseline evaluations. MTPA is designed with extensibility and sustainability in mind: as the underlying survey datasets are regularly updated, MTPA supports regular integration of new populations and user traits.
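The contrast between aggregate and subgroup-specific evaluation described above can be illustrated with a minimal sketch. This is not MTPA's toolkit or its metrics: the function name `accuracy_by_group`, the group labels, and the toy records are all hypothetical, chosen only to show how a high overall score can coexist with a much lower score for a small subgroup.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall accuracy, per-group accuracy) for (group, correct) pairs.

    Hypothetical helper for illustration; assumes `records` is non-empty.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {g: hits[g] / totals[g] for g in totals}
    return overall, per_group


# Toy data (invented): a large group "A" and a small group "B".
records = (
    [("A", True)] * 90
    + [("A", False)] * 5
    + [("B", True)] * 2
    + [("B", False)] * 3
)

overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.2f}")
```

In this toy setting the overall figure is dominated by the larger group, so the weaker result for the smaller group only becomes visible once scores are broken down per group.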
Scientific writing is a challenging task, particularly for novice researchers who often rely on feedback from experienced peers. Recent work has primarily focused on improving surface form and style rather than manuscript content. In this paper, we propose a novel task: automated focused feedback generation for scientific writing assistance. We present SWIF2T: a Scientific WrIting Focused Feedback Tool. It is designed to generate specific, actionable and coherent comments, which identify weaknesses in a scientific paper and/or propose revisions to it. Our approach consists of four components (planner, investigator, reviewer and controller), implemented using multiple Large Language Models (LLMs). We compile a dataset of 300 peer reviews citing weaknesses in scientific papers and conduct human evaluation. The results demonstrate that SWIF2T’s feedback is superior to that of other approaches in specificity, reading comprehension, and overall helpfulness. In our analysis, we also identify cases where automatically generated reviews were judged better than human ones, suggesting opportunities for integrating AI-generated feedback into scientific writing.
Prior research has shown that typical fact-checking models for stand-alone claims struggle with claims made in conversation. As a solution, fine-tuning these models on dialogue data has been proposed. However, creating separate models for each use case is impractical, and we show that fine-tuning models for dialogue results in poor performance on typical fact-checking. To overcome this challenge, we present techniques that allow us to use the same models for both dialogue and typical fact-checking. These mainly focus on retrieval adaptation and transforming conversational inputs so that they can be accurately processed by models trained on stand-alone claims. We demonstrate that a typical fact-checking model incorporating these techniques is competitive with state-of-the-art models for dialogue, while maintaining its performance on stand-alone claims.