Mohamed Nabih
2025
The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence
Marco Gaido | Sara Papi | Luisa Bentivogli | Alessio Brutti | Mauro Cettolo | Roberto Gretter | Marco Matassoni | Mohamed Nabih | Matteo Negri
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
Training large-scale models presents challenges not only in terms of resource requirements but also in terms of their convergence. For this reason, the learning rate (LR) is often decreased when the size of a model is increased. Such a simple solution is not enough in the case of speech-to-text (S2T) training, where evolved and more complex variants of the Transformer architecture – e.g., Conformer or Branchformer – are used in light of their better performance. As a workaround, OWSM designed a double linear warmup of the LR, increasing it to a very small value in the first phase before updating it to a higher value in the second phase. While this solution worked well in practice, it was neither compared with alternative solutions nor was the impact of different LR warmup schedules on final performance studied. This paper fills this gap, revealing that i) large-scale S2T training demands a sub-exponential LR warmup, and ii) a higher LR in the warmup phase accelerates initial convergence but does not boost final performance.
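The double linear warmup described in the abstract can be sketched as a simple piecewise schedule. This is a minimal illustration, not the paper's actual implementation: the phase lengths and the two LR targets (`lr_low`, `lr_peak`) are hypothetical placeholder values, and post-warmup decay is omitted for brevity.

```python
def double_linear_warmup(step, phase1_steps=1000, phase2_steps=9000,
                         lr_low=2.5e-5, lr_peak=2.5e-4):
    """LR at a given step under a two-phase (double) linear warmup.

    Phase 1: LR rises linearly from 0 to a very small value (lr_low).
    Phase 2: LR rises linearly from lr_low to the target peak (lr_peak).
    After warmup the peak is held; any decay schedule is omitted here.
    All parameter values are illustrative, not those used by OWSM.
    """
    if step < phase1_steps:
        # First phase: climb slowly to a deliberately small LR.
        return lr_low * step / phase1_steps
    if step < phase1_steps + phase2_steps:
        # Second phase: continue linearly up to the peak LR.
        frac = (step - phase1_steps) / phase2_steps
        return lr_low + (lr_peak - lr_low) * frac
    return lr_peak
```

In a real training loop, a function like this would typically be wrapped in a scheduler (e.g. PyTorch's `LambdaLR`) that rescales the optimizer's LR every step.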
2024
MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Marco Gaido | Sara Papi | Luisa Bentivogli | Alessio Brutti | Mauro Cettolo | Roberto Gretter | Marco Matassoni | Mohamed Nabih | Matteo Negri
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.