HoWToBench: Holistic Evaluation for LLM’s Capability in Human-level Writing using Tree of Writing

Andrew Zhuoer Feng; Cunxiang Wang; Yu Luo; Lin Fan; Irene Zhou; Zikang Wang; Xiaotao Gu; Jie Tang; Hongning Wang; Minlie Huang

Abstract

Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. Neither traditional reference-based metrics nor modern LLM-as-a-judge methods adequately assess LLM performance on thousand-word-level, open-ended writing. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency that often arises when an LLM-as-a-judge aggregates all sub-features in text evaluation. ToW uses a tree-structured workflow that explicitly models the aggregation weights of sub-features. We also present HoWToBench, a large-scale Chinese writing benchmark encompassing **12** genres and **1302** instructions across three task categories: contextual **completion**, outline-**guided** writing, and **open**-ended generation. ToW mitigates these biases, achieving a **0.93** Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, whereas ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, indicating that these scores cannot be improved simply by piling more information into the input.
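The idea of explicitly modeling aggregation weights in a tree can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the sub-feature names, the weights, and the leaf scores below are all hypothetical, and the weight-normalized mean is only one plausible way to aggregate.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node in a hypothetical evaluation tree (not the paper's schema)."""
    name: str
    weight: float = 1.0              # weight relative to sibling nodes (assumed)
    score: Optional[float] = None    # set on leaves only
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Return a leaf's score, or the weight-normalized mean of its children."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of {node.name!r} have non-positive total weight")
    weighted_sum = sum(child.weight * aggregate(child) for child in node.children)
    return weighted_sum / total_weight


# Illustrative tree: every name, weight, and score here is made up.
tree = Node("overall", children=[
    Node("content", weight=0.6, children=[
        Node("relevance", weight=0.5, score=8.0),
        Node("depth", weight=0.5, score=6.0),
    ]),
    Node("language", weight=0.4, children=[
        Node("fluency", weight=0.75, score=9.0),
        Node("style", weight=0.25, score=5.0),
    ]),
])

overall = aggregate(tree)
print(round(overall, 2))  # 7.4
```

Because the weights are stated at every level, the contribution of each sub-feature to the final score can be read off the tree directly, rather than being left implicit in a single judge prompt.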

Anthology ID: 2026.acl-long.317
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6986–7034
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.317/
Cite (ACL): Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Lin Fan, Irene Zhou, Zikang Wang, Xiaotao Gu, Jie Tang, Hongning Wang, and Minlie Huang. 2026. HoWToBench: Holistic Evaluation for LLM’s Capability in Human-level Writing using Tree of Writing. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6986–7034, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): HoWToBench: Holistic Evaluation for LLM’s Capability in Human-level Writing using Tree of Writing (Feng et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.317.pdf
Checklist: 2026.acl-long.317.checklist.pdf