Filippo Menczer

2026

Large Language Models Require Curated Context for Reliable Political Fact-Checking—Even with Reasoning and Web Search
Matthew R. DeVerna | Kai-Cheng Yang | Harry Yaojun Yan | Filippo Menczer
Findings of the Association for Computational Linguistics: ACL 2026

Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools—and millions of users already rely on them for verification—rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning- and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated high-quality context is a promising path for automated fact-checking.

pdf bib abs

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
Zoher Kachwala | Bao Tran Truong | Rasika Muralidharan | Haewoon Kwak | Jisun An | Filippo Menczer
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Social media are shifting towards pluralism — community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available.

2024

pdf bib abs

REMATCH: Robust and Efficient Matching of Local Knowledge Graphs to Improve Structural and Semantic Similarity
Zoher Kachwala | Jisun An | Haewoon Kwak | Filippo Menczer
Findings of the Association for Computational Linguistics: NAACL 2024

Knowledge graphs play a pivotal role in various applications, such as question-answering and fact-checking. Abstract Meaning Representation (AMR) represents text as knowledge graphs. Evaluating the quality of these graphs involves matching them structurally to each other and semantically to the source text. Existing AMR metrics are inefficient and struggle to capture semantic similarity. We also lack a systematic evaluation benchmark for assessing structural similarity between AMR graphs. To overcome these limitations, we introduce a novel AMR similarity metric, rematch, alongside a new evaluation for structural similarity called RARE. Among state-of-the-art metrics, rematch ranks second in structural similarity; and first in semantic similarity by 1–5 percentage points on the STS-B and SICK-R benchmarks. Rematch is also five times faster than the next most efficient metric.

Co-authors

Bao Tran Truong 1

Harry Yaojun Yan 1

Kai-Cheng Yang 1

Venues

Findings2
ACL1

Fix author