Punctuation Restoration using Transformer Models for High-and Low-Resource Languages

Tanvirul Alam, Akib Khan, Firoj Alam


Abstract
Punctuation restoration is a common post-processing problem for Automatic Speech Recognition (ASR) systems. It is important for improving the readability of the transcribed text for human readers and for facilitating NLP tasks. Current state-of-the-art approaches address this problem using different deep learning models. Recently, transformer models have proven successful in downstream NLP tasks, but they have been explored very little for the punctuation restoration problem. In this work, we explore different transformer-based models and propose an augmentation strategy for this task, focusing on high-resource (English) and low-resource (Bangla) languages. For English, we obtain results comparable to the state of the art, while for Bangla, this is the first reported work and can serve as a strong baseline for future work. We have made our Bangla dataset publicly available to the research community.
Anthology ID:
2020.wnut-1.18
Volume:
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Month:
November
Year:
2020
Address:
Online
Venues:
EMNLP | WNUT
Publisher:
Association for Computational Linguistics
Pages:
132–142
URL:
https://aclanthology.org/2020.wnut-1.18
DOI:
10.18653/v1/2020.wnut-1.18
Cite (ACL):
Tanvirul Alam, Akib Khan, and Firoj Alam. 2020. Punctuation Restoration using Transformer Models for High-and Low-Resource Languages. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 132–142, Online. Association for Computational Linguistics.
Cite (Informal):
Punctuation Restoration using Transformer Models for High-and Low-Resource Languages (Alam et al., WNUT 2020)
PDF:
https://preview.aclanthology.org/update-css-js/2020.wnut-1.18.pdf
Code:
xashru/punctuation-restoration