Data Augmentation for Low-Resource Italian NLP: Enhancing Semantic Processing with DRS

Muhammad Saad Amin; Luca Anselma; Alessandro Mazzei

Abstract

Discourse Representation Structure (DRS), a formal meaning representation, has shown promising results in semantic parsing and natural language generation tasks for high-resource languages like English. This paper investigates enhancing the application of DRS to low-resource Italian Natural Language Processing (NLP), in both semantic parsing (Text-to-DRS) and natural language generation (DRS-to-Text). To address the scarcity of annotated corpora for Italian DRS, we propose a novel data augmentation technique that draws on external linguistic resources: (i) WordNet for common nouns, adjectives, adverbs, and verbs; (ii) LLM-generated named entities for proper nouns; and (iii) rule-based algorithms for tense augmentation. This approach not only increases the quantity of training data but also introduces linguistic diversity, which is crucial for improving model performance and robustness. Using this augmented dataset, we developed neural semantic parser and generator models that demonstrated enhanced generalization ability compared to models trained on non-augmented data. We evaluated the effect of semantic data augmentation using two state-of-the-art transformer-based neural sequence-to-sequence models, i.e., byT5 and IT5. Our implementation shows promising results for Italian semantic processing. Data augmentation significantly increased the performance of semantic parsing from 76.10 to 90.56 (+14.46%) F1-SMATCH score, and of generation from 37.79 to 57.48 (+19.69%) BLEU, 30.83 to 40.95 (+10.12%) METEOR, 81.66 to 90.97 (+9.31%) COMET, 54.84 to 70.88 (+16.04%) chrF, and 88.86 to 92.97 (+4.11%) BERT scores. These results demonstrate the effectiveness of our novel augmentation approach in enhancing semantic processing capabilities for low-resource languages like Italian.
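The lexical side of the augmentation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the substitution table `LEXICON`, the function `augment`, and the simplified one-line DRS format are all hypothetical, and a small hand-written table stands in for WordNet and the LLM-generated named entities. The point it shows is that each substitution has to be applied to the sentence and to its DRS together, so that the pair stays consistent. Tense augmentation is not covered.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (tokenized Italian sentence, simplified linearized DRS)

# Hypothetical toy table: (surface word, DRS symbol) -> same-category alternatives.
# In the paper these alternatives come from WordNet and LLM-generated names.
LEXICON: Dict[Tuple[str, str], List[Tuple[str, str]]] = {
    ("cane", "dog.n.01"): [("gatto", "cat.n.01"), ("cavallo", "horse.n.01")],
    ("Maria", '"Maria"'): [("Giulia", '"Giulia"')],
}


def augment(pair: Pair) -> List[Pair]:
    """Return new (text, DRS) pairs, one per applicable substitution."""
    text, drs = pair
    tokens = text.split()
    augmented: List[Pair] = []
    for (word, symbol), alternatives in LEXICON.items():
        # Only substitute when the word and its DRS symbol both occur.
        if word not in tokens or symbol not in drs:
            continue
        for new_word, new_symbol in alternatives:
            new_text = " ".join(new_word if t == word else t for t in tokens)
            new_drs = drs.replace(symbol, new_symbol)
            augmented.append((new_text, new_drs))
    return augmented


# Assumed example pair; the DRS string is a simplification for illustration.
example: Pair = (
    "Maria vede il cane .",
    'female.n.02 Name "Maria" see.v.01 dog.n.01',
)
new_pairs = augment(example)
```

The toy alternatives share grammatical gender with the word they replace, so the article "il" remains valid; a real Italian pipeline would need to handle agreement explicitly.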

Anthology ID: 2024.clicit-1.5
Volume: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Month: December
Year: 2024
Address: Pisa, Italy
Editors: Felice Dell'Orletta, Alessandro Lenci, Simonetta Montemagni, Rachele Sprugnoli
Venue: CLiC-it
Publisher: CEUR Workshop Proceedings
Pages: 29–38
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.clicit-1.5/
Cite (ACL): Muhammad Saad Amin, Luca Anselma, and Alessandro Mazzei. 2024. Data Augmentation for Low-Resource Italian NLP: Enhancing Semantic Processing with DRS. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 29–38, Pisa, Italy. CEUR Workshop Proceedings.
Cite (Informal): Data Augmentation for Low-Resource Italian NLP: Enhancing Semantic Processing with DRS (Amin et al., CLiC-it 2024)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.clicit-1.5.pdf