2025
Exploring In-Image Machine Translation with Real-World Background
Yanzhi Tian | Zeming Liu | Zhengyang Liu | Yuhang Guo
Findings of the Association for Computational Linguistics: ACL 2025
In-Image Machine Translation (IIMT) aims to translate texts within images from one language to another. Previous research on IIMT was primarily conducted in simplified scenarios, such as images of one-line text with a black font on a white background, which are far from reality and impractical for real-world applications. To make IIMT research practically valuable, it is essential to consider a complex scenario in which the text backgrounds are derived from real-world images. To facilitate research on IIMT in complex scenarios, we design an IIMT dataset that includes subtitle text with real-world backgrounds. However, previous IIMT models perform inadequately in such scenarios. To address this issue, we propose the DebackX model, which separates the background and the text-image from the source image, performs translation directly on the text-image, and fuses the translated text-image with the background to generate the target image. Experimental results show that our model achieves improvements in both translation quality and visual effect.
2023
BIT-ACT: An Ancient Chinese Translation System Using Data Augmentation
Li Zeng | Yanzhi Tian | Yingyu Shan | Yuhang Guo
Proceedings of ALT2023: Ancient Language Translation Workshop
This paper describes a translation model from ancient Chinese to modern Chinese and English for the EvaHan 2023 competition, a subtask of the Ancient Language Translation 2023 challenge. During the training of our model, we applied various data augmentation techniques and used SiKu-RoBERTa as part of our model architecture. The results indicate that back translation improves the model’s performance, but double back translation introduces noise and harms performance. Fine-tuning on the original dataset helps mitigate this issue.
In-Image Neural Machine Translation with Segmented Pixel Sequence-to-Sequence Model
Yanzhi Tian | Xiang Li | Zeming Liu | Yuhang Guo | Bin Wang
Findings of the Association for Computational Linguistics: EMNLP 2023
In-Image Machine Translation (IIMT) aims to convert images containing texts from one language to another. Traditional approaches to this task are cascade methods, which apply optical character recognition (OCR) followed by neural machine translation (NMT) and text rendering. However, cascade methods suffer from the compounding errors of OCR and NMT, leading to a decrease in translation quality. In this paper, we propose an end-to-end model instead of the OCR, NMT, and text-rendering pipeline. Our neural architecture adopts an encoder-decoder paradigm with segmented pixel sequences as inputs and outputs. Through end-to-end training, our model yields improvements across several dimensions: (i) it achieves higher translation quality by avoiding error propagation, (ii) it is robust to out-of-domain data, and (iii) it is insensitive to incomplete words. To validate the effectiveness of our method and support future research, we construct a dataset containing 4M pairs of De-En images and train our end-to-end model on it. The experimental results show that our approach outperforms both the cascade method and the current end-to-end model.
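The “segmented pixel sequences” above can be pictured as slicing the text image into fixed-width vertical strips that the encoder consumes in order. A minimal sketch of one plausible segmentation (the column-strip interpretation, function name, and parameters are assumptions for illustration, not the paper’s exact preprocessing):

```python
def segment_pixels(image, seg_w):
    """Split an H x W image (a list of pixel rows) into a left-to-right
    sequence of H x seg_w segments, so a 2-D image becomes a 1-D
    sequence a seq2seq model can consume."""
    w = len(image[0])
    return [[row[x:x + seg_w] for row in image] for x in range(0, w, seg_w)]

# A 2x4 image cut into width-2 strips yields two 2x2 segments.
segments = segment_pixels([[1, 2, 3, 4],
                           [5, 6, 7, 8]], seg_w=2)
```

In practice each strip would be flattened or embedded before entering the encoder; the point here is only how a 2-D image is linearized into a sequence.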
The Xiaomi AI Lab’s Speech Translation Systems for IWSLT 2023 Offline Task, Simultaneous Task and Speech-to-Speech Task
Wuwei Huang | Mengge Liu | Xiang Li | Yanzhi Tian | Fengyu Yang | Wen Zhang | Jian Luan | Bin Wang | Yuhang Guo | Jinsong Su
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This system description paper introduces the systems submitted by Xiaomi AI Lab to the three tracks of the IWSLT 2023 Evaluation Campaign, namely the offline speech translation (Offline-ST) track, the offline speech-to-speech translation (Offline-S2ST) track, and the simultaneous speech translation (Simul-ST) track. All our submissions for these three tracks involve only the English-Chinese language direction. Our English-Chinese speech translation systems are constructed using large-scale pre-trained models as the foundation. Specifically, we fine-tune these models’ corresponding components for various downstream speech translation tasks. Moreover, we implement several popular techniques, such as data filtering, data augmentation, speech segmentation, and model ensembling, to improve the system’s overall performance. Extensive experiments show that our systems achieve a significant improvement over strong baseline systems in terms of the automatic evaluation metrics.
2022
BIT-Xiaomi’s System for AutoSimTrans 2022
Mengge Liu | Xiang Li | Bao Chen | Yanzhi Tian | Tianwei Lan | Silin Li | Yuhang Guo | Jian Luan | Bin Wang
Proceedings of the Third Workshop on Automatic Simultaneous Translation
This system paper describes the BIT-Xiaomi simultaneous translation system for the AutoSimTrans 2022 simultaneous translation challenge. We participated in three tracks: the Zh-En text-to-text track, the Zh-En audio-to-text track, and the En-Es text-to-text track. In our system, wait-k is employed to train prefix-to-prefix translation models. We integrate streaming chunking to detect boundaries as the source stream is read in. We further improve our system with data selection, data augmentation, and R-Drop training. Results show that our wait-k implementation outperforms the organizers’ baseline by up to 8 BLEU, and our proposed streaming chunking method further improves performance by about 2 BLEU in the low-latency regime.
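The wait-k policy mentioned above is a standard prefix-to-prefix schedule: the decoder first reads k source tokens, then alternates writing one target token and reading one more source token until the source is exhausted. A minimal sketch of the resulting visibility schedule (illustrative only; the function name is an assumption, not the BIT-Xiaomi implementation):

```python
def waitk_schedule(k, src_len, tgt_len):
    """For each target position t (0-indexed), return how many source
    tokens are visible when the t-th target token is generated under
    wait-k: g(t) = min(k + t, src_len)."""
    return [min(k + t, src_len) for t in range(tgt_len)]

# Wait-3 over a 6-token source, emitting 5 target tokens: the decoder
# sees 3, 4, 5, 6, 6 source tokens at successive decoding steps.
schedule = waitk_schedule(3, src_len=6, tgt_len=5)
```

Smaller k lowers latency but gives the model less context per step, which is the trade-off the low-latency BLEU results above are measuring.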
Ancient Chinese Word Segmentation and Part-of-Speech Tagging Using Data Augmentation
Yanzhi Tian | Yuhang Guo
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
We participated in the EvaHan2022 ancient Chinese word segmentation and Part-of-Speech (POS) tagging evaluation. We regard Chinese word segmentation and POS tagging as sequence tagging tasks. Our system is based on a BERT-BiLSTM-CRF model trained on the data provided by the EvaHan2022 evaluation. In addition, we employ data augmentation techniques to enhance the performance of our model. On Test A and Test B of the evaluation, our system achieves F1 scores of 94.73% and 90.93% for word segmentation, and 89.19% and 83.48% for POS tagging, respectively.
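Casting word segmentation as sequence tagging, as the abstract describes, typically means labeling each character with its position inside a word. A minimal sketch using a B/I scheme (the actual EvaHan tag set and the example words are assumptions for illustration):

```python
def to_bi_tags(words):
    """Convert a segmented sentence (a list of words) into per-character
    B/I tags: B marks the first character of a word, I marks the rest.
    Predicting these tags per character recovers the segmentation."""
    tags = []
    for word in words:
        tags.append("B")
        tags.extend("I" for _ in word[1:])
    return tags

# The two-word segmentation ["天下", "大勢"] becomes ["B", "I", "B", "I"].
tags = to_bi_tags(["天下", "大勢"])
```

A BERT-BiLSTM-CRF tagger then predicts one such tag per character, with the CRF layer enforcing valid transitions (e.g. a sentence cannot begin with I).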