2025
Efficient On-Device Text Simplification for Firefox with Synthetic Data Fine-Tuning
Pablo Romero | Zihao Li | Matthew Shardlow
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)
This work presents a system for on-device text simplification that enables users to process sensitive text without relying on cloud-based services. Through quantization techniques and a novel approach to controllable text simplification, we reduce model size by up to 75 percent with minimal performance degradation. Our models achieve state-of-the-art results using a synthetic dataset of only 2909 examples, outperforming prior work trained on 300K examples. This efficiency stems from (1) a single control token strategy that precisely targets specific reading levels, (2) a contrastive training approach that enriches model understanding through exposure to multiple simplification levels, and (3) individual models that dedicate their full parameter capacity to a specific reading-level transformation. Our best models achieve up to 82.18 BLEU at the Advanced level and 46.12 SARI at the Elementary level on standard benchmarks, with performance preserved even after aggressive quantization. This work is implemented in collaboration with the Mozilla AI team to process text entirely locally, ensuring sensitive information never leaves the user's device. A demonstration video is available at https://youtu.be/TzmaxnARMzg and a web demo at https://pablorom2004.github.io/Simplification-Web-Demo
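As an illustration of the single control token strategy and the quantization step described in the abstract, the sketch below prepends a reading-level token to the input of a generic seq2seq checkpoint and then applies post-training dynamic quantization. The token names, the `t5-small` checkpoint, and the `simplify` helper are illustrative assumptions, not the released Firefox models.

```python
# Minimal sketch of single-control-token simplification, assuming a
# fine-tuned seq2seq model; token names and checkpoint are placeholders.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "t5-small"  # placeholder; the paper trains dedicated per-level models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def simplify(text: str, level: str = "ELEMENTARY") -> str:
    # A single control token prepended to the source selects the target reading level.
    prompt = f"<{level}> {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(simplify("The municipality promulgated an ordinance prohibiting loitering."))

# Post-training dynamic quantization of the linear layers roughly illustrates
# the size reduction discussed in the abstract (CPU inference only).
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```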
2024
Efficient Sparse Attention needs Adaptive Token Release
Chaoran Zhang | Lixin Zou | Dan Luo | Xiangyang Luo | Zihao Li | Min Tang | Chenliang Li
Findings of the Association for Computational Linguistics: ACL 2024
2023
Comparing Generic and Expert Models for Genre-Specific Text Simplification
Zihao Li | Matthew Shardlow | Fernando Alva-Manchego
Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability
We investigate how text genre influences the performance of models for controlled text simplification. Treating datasets from Wikipedia and PubMed as two different genres, we compare genre-specific models trained by transfer learning with prompt-only GPT-like large language models. Our experiments show that: (1) the performance loss of genre-specific models on general tasks can be limited to 2%; (2) transfer learning can improve performance on genre-specific datasets by up to 10% in SARI score over the base model without transfer learning; and (3) simplifications generated by the smaller but more customized models match the larger generic models in simplicity and preserve meaning better, in both automatic and human evaluations.
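The SARI comparisons reported in this abstract can be reproduced in spirit with the Hugging Face `evaluate` package; the call below is a minimal sketch, and the source sentence, system output, and references are invented rather than drawn from the Wikipedia or PubMed data used in the paper.

```python
# Hedged sketch of computing SARI for a single simplification output.
import evaluate

sari = evaluate.load("sari")

sources = ["About 95 species are currently accepted by taxonomists."]
predictions = ["About 95 species are accepted."]  # output of either system under comparison
references = [[
    "About 95 species are currently known.",
    "About 95 species are accepted today.",
]]

score = sari.compute(sources=sources, predictions=predictions, references=references)
print(score)  # e.g. {'sari': ...}
```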
2022
An Investigation into the Effect of Control Tokens on Text Simplification
Zihao Li | Matthew Shardlow | Saeed Hassan
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)
Recent work on text simplification has focused on the use of control tokens to advance the state of the art. However, further improvement is difficult without an in-depth understanding of the mechanisms underlying control tokens. One underexplored factor is the tokenisation strategy, which we also examine. In this paper, we (1) reimplement ACCESS, (2) explore the effects of varying control token values, (3) test the influence of different tokenisation strategies, and (4) demonstrate how each control token affects performance. We report performance variations for the four control tokens separately, show how the design of control tokens can influence performance, and offer suggestions for designing control tokens that may also apply to other controllable text generation tasks.
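ACCESS-style control tokens prefix the source sentence with ratio-valued tokens computed from each training pair. The snippet below is a minimal sketch of that preprocessing for the character-length ratio only, assuming a `NbChars_<ratio>` token format and bucketing in 0.05 steps; both are common conventions rather than the exact configuration studied in the paper.

```python
# Illustrative preprocessing in the style of ACCESS-like control tokens.
# Token format and rounding scheme are assumptions, not the paper's exact setup.

def nbchars_ratio(source: str, target: str) -> float:
    # Character-length ratio between the simplified target and the complex source.
    return len(target) / max(len(source), 1)

def add_control_token(source: str, target: str) -> str:
    # Bucket the ratio to 0.05 steps and prepend it as a control token.
    ratio = round(nbchars_ratio(source, target) * 20) / 20
    return f"NbChars_{ratio:.2f} {source}"

src = "The committee deliberated extensively before reaching a verdict."
tgt = "The committee talked a lot before deciding."
print(add_control_token(src, tgt))
# At inference time the ratio is chosen by the user to control output length.
```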