2023
Development of Urdu-English Religious Domain Parallel Corpus
Sadaf Abdul Rauf | Noor e Hira
Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation
Despite the abundance of monolingual corpora accessible online, there remains a scarcity of domain-specific parallel corpora. This scarcity poses a challenge in the development of robust translation systems tailored for such specialized domains. Addressing this gap, we have developed a parallel religious domain corpus for Urdu-English. This corpus consists of 18,426 parallel sentences from Sunan Daud, carefully curated to capture the unique linguistic and contextual aspects of religious texts. The developed corpus is then used to train Urdu-English religious domain Neural Machine Translation (NMT) systems; the best system scored 27.9 BLEU points.
Biomedical Parallel Sentence Retrieval Using Large Language Models
Sheema Firdous | Sadaf Abdul Rauf
Proceedings of the Eighth Conference on Machine Translation
We have explored the effect of in-domain knowledge during parallel sentence filtering from in-domain corpora. Models built with sentences mined from in-domain corpora without domain knowledge performed poorly, whereas model performance improved by more than 2.3 BLEU points on average with further domain-centric filtering. We have used Large Language Models for selecting similar and domain-aligned sentences. Our experiments show the importance of including domain knowledge in sentence selection methodologies, even if the initial comparable corpora are in-domain.
2022
Exploring Transfer Learning for Urdu Speech Synthesis
Sahar Jamal | Sadaf Abdul Rauf | Quratulain Majid
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference
Neural methods in Text to Speech synthesis (TTS) have demonstrated major advances in the naturalness and intelligibility of synthesized speech. In this paper we present a neural speech synthesis system for Urdu, a low-resource language. The main challenge faced in this study was the lack of any publicly available Urdu speech synthesis corpus. An Urdu speech corpus was created using audio books and synthetic speech generation. To cope with the low-resource scenario, we adopted transfer learning for our experiments, where the extracted knowledge is further used to train the model on a relatively small Urdu training data set. This model gives satisfactory results, though a good margin for improvement exists and we are working to improve it further.
2021
Automatic Sentence Simplification in Low Resource Settings for Urdu
Yusra Anees | Sadaf Abdul Rauf
Proceedings of the 1st Workshop on NLP for Positive Impact
To build automated simplification systems, corpora of complex sentences and their simplified versions are the first step toward understanding sentence complexity and enabling the development of automatic text simplification systems. We present a lexically and syntactically simplified Urdu simplification corpus with a detailed analysis of the various simplification operations and a human evaluation of corpus quality. We further analyze our corpora using text readability measures and present a comparison of the original, lexically simplified and syntactically simplified corpora. In addition, we compare our corpus with other existing simplification corpora by building simplification systems and evaluating these systems using BLEU and SARI scores. Our system achieves the highest BLEU score and a comparable SARI score in comparison to other systems. We release our simplification corpora for the benefit of the research community.
LISN @ WMT 2021
Jitao Xu | Minh Quang Pham | Sadaf Abdul Rauf | François Yvon
Proceedings of the Sixth Conference on Machine Translation
This paper describes LISN’s submissions to two shared tasks at WMT’21. For the biomedical translation task, we have developed resource-heavy systems for the English-French language pair, using both out-of-domain and in-domain corpora. The target genre for this task (scientific abstracts) corresponds to texts that often have a standardized structure. Our systems attempt to take this structure into account using a hierarchical system of sentence-level tags. Translation systems were also prepared for the News task for the French-German language pair. The challenge was to perform unsupervised adaptation to the target domain (financial news). For this, we explored the potential of retrieval-based strategies, where sentences that are similar to test instances are used to prime the decoder.
FJWU Participation for the WMT21 Biomedical Translation Task
Sumbal Naz | Sadaf Abdul Rauf | Sami Ul Haq
Proceedings of the Sixth Conference on Machine Translation
In this paper we present FJWU’s system submitted to the biomedical shared task at WMT21. We prepared state-of-the-art multilingual neural machine translation systems for three languages (i.e. German, Spanish and French) with English as the target language. Our NMT systems, based on the Transformer architecture, were trained on a combination of in-domain and out-of-domain parallel corpora developed using Information Retrieval (IR) and domain adaptation techniques.
2020
Document Level NMT of Low-Resource Languages with Backtranslation
Sami Ul Haq | Sadaf Abdul Rauf | Arsalan Shaukat | Abdullah Saeed
Proceedings of the Fifth Conference on Machine Translation
This paper describes our system submission to the WMT20 shared task on similar language translation. We examined the use of document-level neural machine translation (NMT) systems for the low-resource, similar language pair Marathi-Hindi. Our system is an extension of the state-of-the-art Transformer architecture with hierarchical attention networks to incorporate contextual information. Since NMT requires a large amount of parallel data, which is not available for this task, our approach focuses on utilizing monolingual data with back translation to train our models. Our experiments reveal that document-level NMT can be a reasonable alternative to sentence-level NMT for improving the translation quality of low-resource languages, even when used with synthetic data.
LIMSI @ WMT 2020
Sadaf Abdul Rauf | José Carlos Rosales Núñez | Minh Quang Pham | François Yvon
Proceedings of the Fifth Conference on Machine Translation
This paper describes LIMSI’s submissions to the translation shared tasks at WMT’20. This year we have focused our efforts on the biomedical translation task, developing a resource-heavy system for the translation of medical abstracts from English into French, using back-translated texts, terminological resources as well as multiple pre-processing pipelines, including pre-trained representations. Systems were also prepared for the robustness task for translating from English into German; for this large-scale task we developed multi-domain, noise-robust translation systems that aim to handle the two test conditions: zero-shot and few-shot domain adaptation.
FJWU participation for the WMT20 Biomedical Translation Task
Sumbal Naz | Sadaf Abdul Rauf | Noor-e- Hira | Sami Ul Haq
Proceedings of the Fifth Conference on Machine Translation
This paper describes the systems of the FJWU-NRPU team for the WMT20 Biomedical shared translation task. We focused our submission on exploring the effects of adding in-domain corpora extracted from various out-of-domain sources. Systems were built for French to English using in-domain corpora through fine-tuning and selective data training. We further explored BERT-based models, with a specific focus on the effect of domain-adaptive subword units.
Developing a Monolingual Sentence Simplification Corpus for Urdu
Yusra Anees | Sadaf Abdul Rauf | Nauman Iqbal | Abdul Basit Siddiqi
Proceedings of the Fourth Widening Natural Language Processing Workshop
Complex sentences are a hurdle in the learning process of language learners. Sentence simplification aims to convert a complex sentence into a simpler form that is easily comprehensible. To build such automated simplification systems, corpora of complex sentences and their simplified versions are the first step toward understanding sentence complexity and enabling the development of automatic text simplification systems. No such corpus has yet been developed for Urdu, and we fill this gap by developing one to help start readability and automatic sentence simplification research. We present a lexically and syntactically simplified Urdu simplification corpus and a detailed analysis of the various simplification operations. We further analyze our corpora using text readability measures and present a comparison of the original, lexically simplified, and syntactically simplified corpora.
Improving Document-Level Neural Machine Translation with Domain Adaptation
Sami Ul Haq | Sadaf Abdul Rauf | Arslan Shoukat | Noor-e- Hira
Proceedings of the Fourth Workshop on Neural Generation and Translation
Recent studies have shown that the translation quality of NMT systems can be improved by providing document-level contextual information. In general, sentence-based NMT models are extended to capture contextual information from large-scale document-level corpora, which are difficult to acquire. Domain adaptation, on the other hand, promises to adapt components of already developed systems by exploiting limited in-domain data. This paper presents FJWU’s system submission at WNGT; we participated specifically in the document-level MT task for German-English translation. Our system is based on a context-aware Transformer model built on top of the original NMT architecture by integrating contextual information using attention networks. Our experimental results show that providing previous sentences as context significantly improves the BLEU score compared to a strong NMT baseline. We also studied the impact of domain adaptation on document-level translation and were able to improve results by adapting the systems to the testing domain.
Simplification automatique de texte dans un contexte de faibles ressources (Automatic Text Simplification : Approaching the Problem in Low Resource Settings for French)
Sadaf Abdul Rauf | Anne-Laure Ligozat | Francois Yvon | Gabriel Illouz | Thierry Hamon
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles
Text simplification has emerged as an active subfield of natural language processing, owing to the practical and theoretical problems it makes it possible to address, as well as its many practical applications. Simplification corpora are needed to train automatic simplification systems; these resources are, however, scarce and exist for only a small number of languages. We show here that, in a context where resources for simplification are scarce, it nevertheless remains possible to build simplification systems by relying on synthetic corpora, for example obtained through machine translation, and we evaluate various ways of constructing them.
On the Exploration of English to Urdu Machine Translation
Sadaf Abdul Rauf | Syeda Abida | Noor-e- Hira | Syeda Zahra | Dania Parvez | Javeria Bashir | Qurat-ul-ain Majid
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Machine Translation is an indispensable technology for reducing communication barriers in today’s world. It has made substantial progress in recent years and is widely used in commercial as well as non-profit sectors. This is the case only for European and other high-resource languages, however. For the English-Urdu language pair, the technology is still in its infancy due to the scarcity of resources. The present research is an important milestone in English-Urdu machine translation, as we present results for four major domains (Biomedical, Religious, Technological and General) using Statistical and Neural Machine Translation. We performed a series of experiments to optimize the performance of each system and to study the impact of data sources on the systems. Finally, we compared the data sources and the effect of language model size on statistical machine translation performance.
2019
Exploring Transfer Learning and Domain Data Selection for the Biomedical Translation
Noor-e- Hira | Sadaf Abdul Rauf | Kiran Kiani | Ammara Zafar | Raheel Nawaz
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
Transfer Learning and Selective data training are two of the many approaches being extensively investigated to improve the quality of Neural Machine Translation systems. This paper presents a series of experiments applying transfer learning and selective data training for participation in the Bio-medical shared task of WMT19. We used Information Retrieval to select related sentences from out-of-domain data and used them as additional training data with transfer learning. We also report the effect of tokenization on translation model performance.
2011
Investigations on Translation Model Adaptation Using Monolingual Data
Patrik Lambert | Holger Schwenk | Christophe Servan | Sadaf Abdul-Rauf
Proceedings of the Sixth Workshop on Statistical Machine Translation
LIUM’s SMT Machine Translation Systems for WMT 2011
Holger Schwenk | Patrik Lambert | Loïc Barrault | Christophe Servan | Sadaf Abdul-Rauf | Haithem Afli | Kashif Shah
Proceedings of the Sixth Workshop on Statistical Machine Translation
2010
LIUM SMT Machine Translation System for WMT 2010
Patrik Lambert | Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
2009
On the Use of Comparable Corpora to Improve SMT performance
Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
SMT and SPE Machine Translation Systems for WMT’09
Holger Schwenk | Sadaf Abdul-Rauf | Loïc Barrault | Jean Senellart
Proceedings of the Fourth Workshop on Statistical Machine Translation
Exploiting Comparable Corpora with TER and TERp
Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)
2008
The LIUM Arabic/English statistical machine translation system for IWSLT 2008.
Holger Schwenk | Yannick Estève | Sadaf Abdul Rauf
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign
This paper describes the system developed by the LIUM laboratory for the 2008 IWSLT evaluation. We only participated in the Arabic/English BTEC task. We developed a statistical phrase-based system using the Moses toolkit and SYSTRAN’s rule-based translation system to perform a morphological decomposition of the Arabic words. A continuous space language model was deployed to improve the modeling of the target language. Both approaches achieved significant improvements in the BLEU score. The system achieves a score of 49.4 on the test set of the 2008 IWSLT evaluation.