Abstract
We develop two new metrics that build on top of the COMET architecture. Our main contribution is collecting a corpus of human judgements ten times larger than COMET's and investigating how to filter out problematic human judgements. We propose filtering out human judgements where the human reference is statistically worse than the machine translation. Furthermore, we average the scores of all identical segments evaluated multiple times. Results comparing automatic metrics against source-based DA and MQM-style human judgements show state-of-the-art performance on system-level pairwise system ranking. We release both of our metrics for public use.
- Anthology ID:
- 2022.wmt-1.47
- Volume:
- Proceedings of the Seventh Conference on Machine Translation (WMT)
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates (Hybrid)
- Venue:
- WMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 541–548
- URL:
- https://aclanthology.org/2022.wmt-1.47
- Cite (ACL):
- Tom Kocmi, Hitokazu Matsushita, and Christian Federmann. 2022. MS-COMET: More and Better Human Judgements Improve Metric Performance. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 541–548, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Cite (Informal):
- MS-COMET: More and Better Human Judgements Improve Metric Performance (Kocmi et al., WMT 2022)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2022.wmt-1.47.pdf
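The two data-cleaning steps described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the segment IDs, system names, and scores are invented, and where the paper applies a statistical test to decide that a human reference is worse than a machine translation, this sketch simply compares averaged scores as a stand-in.

```python
# Illustrative sketch of the abstract's two cleaning steps:
# (1) average scores of identical segments judged multiple times,
# (2) flag segments whose human reference is outscored by an MT system.
# All data below is a toy example; field names are hypothetical.
from collections import defaultdict
from statistics import mean

judgements = [
    # (segment_id, system, human score)
    ("seg1", "human-ref", 70.0),
    ("seg1", "human-ref", 74.0),   # same segment judged twice
    ("seg1", "mt-A", 90.0),
    ("seg2", "human-ref", 85.0),
    ("seg2", "mt-A", 60.0),
]

# Step 1: average the scores of all identical (segment, system) pairs.
grouped = defaultdict(list)
for seg, sys_name, score in judgements:
    grouped[(seg, sys_name)].append(score)
averaged = {key: mean(vals) for key, vals in grouped.items()}

# Step 2: flag segments where the human reference scores worse than some
# MT output (the paper uses a statistical test; a plain comparison of the
# averaged scores stands in for it here).
mt_systems = {sys_name for _, sys_name in averaged} - {"human-ref"}
suspect = []
for (seg, sys_name), ref_score in averaged.items():
    if sys_name != "human-ref":
        continue
    for mt in mt_systems:
        mt_score = averaged.get((seg, mt))
        if mt_score is not None and mt_score > ref_score:
            suspect.append(seg)
            break

print(sorted(suspect))  # segments whose reference is outscored by MT
```

In this toy example, "seg1" is flagged (its reference averages 72.0 versus 90.0 for mt-A) and would be filtered out, while "seg2" is kept.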