bytF: How Good Are Byte Level N-Gram F-Scores for Automatic Machine Translation Evaluation?

Raj Dabre, Kaing Hour, Haiyue Song


Abstract
Recently, chrF and chrF++ have become the preferred metrics over BLEU for automatic n-gram evaluation of machine translation. Because they operate on character-level n-grams, they appear to correlate better with human judgments than word-level metrics when translating into morphologically rich languages. However, for non-Latin languages with sub-character-level structures, we can go one step further, to bytes. To this end, we propose bytF, which captures sub-character-level information by considering byte-level n-grams. Furthermore, we augment it to bytF+ and bytF++, which add character and word n-gram backoffs. On machine translation metric meta-evaluation datasets from English into 5 Indian languages, Chinese and Japanese, we show that bytF and its variants are comparably (give minimum difference) or significantly better (give maximum difference) correlated with human judgments at the segment level than chrF and chrF++. We often observe that backing off to characters and words for bytF, and to words for chrF, does not yield the highest correlation with human judgments. We also observe that using default n-gram values often leads to scores with poorer correlations with human judgments, indicating the need for well-studied and tuned n-gram metrics for efficacy.
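The core idea of a byte-level n-gram F-score can be illustrated with a short Python sketch. This is not the authors' implementation: the function names `byte_ngrams` and `byte_fscore` are hypothetical, the defaults `max_n=6` and `beta=2.0` are borrowed from common chrF settings purely as an assumption (the abstract itself cautions that default n-gram values often correlate poorly with human judgments), and the sketch omits the character and word backoffs of bytF+ and bytF++ as well as any whitespace handling.

```python
from collections import Counter


def byte_ngrams(text: str, n: int) -> Counter:
    """Count the n-grams over the UTF-8 bytes of `text`."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def byte_fscore(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Illustrative byte-level n-gram F-beta score in [0, 1].

    Assumption: precision and recall are averaged over n-gram orders
    1..max_n and then combined, as is commonly done for chrF.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_counts = byte_ngrams(hyp, n)
        ref_counts = byte_ngrams(ref, n)
        if not hyp_counts or not ref_counts:
            continue  # one side is too short to have n-grams of this order
        # Clipped overlap: each n-gram is matched at most min(count) times.
        overlap = sum((hyp_counts & ref_counts).values())
        precisions.append(overlap / sum(hyp_counts.values()))
        recalls.append(overlap / sum(ref_counts.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


if __name__ == "__main__":
    # A single Devanagari letter occupies three UTF-8 bytes, so byte n-grams
    # see structure below the character level.
    print(len("क".encode("utf-8")))
    print(byte_fscore("नमस्ते दुनिया", "नमस्ते दुनिया"))
```

Because one character in many non-Latin scripts spans several UTF-8 bytes, two different characters can still share byte n-grams, which is the kind of partial credit a purely character-level metric cannot give.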
Anthology ID:
2025.mtsummit-1.29
Volume:
Proceedings of Machine Translation Summit XX: Volume 1
Month:
June
Year:
2025
Address:
Geneva, Switzerland
Editors:
Pierrette Bouillon, Johanna Gerlach, Sabrina Girletti, Lise Volkart, Raphael Rubino, Rico Sennrich, Ana C. Farinha, Marco Gaido, Joke Daems, Dorothy Kenny, Helena Moniz, Sara Szoc
Venue:
MTSummit
Publisher:
European Association for Machine Translation
Pages:
378–387
URL:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.mtsummit-1.29/
Cite (ACL):
Raj Dabre, Kaing Hour, and Haiyue Song. 2025. bytF: How Good Are Byte Level N-Gram F-Scores for Automatic Machine Translation Evaluation?. In Proceedings of Machine Translation Summit XX: Volume 1, pages 378–387, Geneva, Switzerland. European Association for Machine Translation.
Cite (Informal):
bytF: How Good Are Byte Level N-Gram F-Scores for Automatic Machine Translation Evaluation? (Dabre et al., MTSummit 2025)
PDF:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.mtsummit-1.29.pdf