Vincent Nguyen


2021

pdf
Combining Shallow and Deep Representations for Text-Pair Classification
Vincent Nguyen | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the The 19th Annual Workshop of the Australasian Language Technology Association

Text-pair classification is the task of determining the class relationship between two sentences. It is embedded in several tasks such as paraphrase identification and duplicate question detection. Contemporary methods use fine-tuned transformer encoder semantic representations of the classification token in the text-pair sequence from the transformer’s final layer for class prediction. However, research has shown that earlier parts of the network learn shallow features, such as syntax and structure, which existing methods do not directly exploit. We propose a novel convolution-based decoder for transformer-based architecture that maximizes the use of encoder hidden features for text-pair classification. Our model exploits hidden representations within transformer-based architecture. It outperforms a transformer encoder baseline on average by 50% (relative F1-score) on six datasets from the medical, software engineering, and open-domains. Our work shows that transformer-based models can improve text-pair classification by modifying the fine-tuning step to exploit shallow features while improving model generalization, with only a slight reduction in efficiency.

pdf
Cross-Domain Language Modeling: An Empirical Investigation
Vincent Nguyen | Sarvnaz Karimi | Maciej Rybinski | Zhenchang Xing
Proceedings of the The 19th Annual Workshop of the Australasian Language Technology Association

Transformer encoder models exhibit strong performance in single-domain applications. However, in a cross-domain situation, using a sub-word vocabulary model results in sub-word overlap. This is an issue when there is an overlap between sub-words that share no semantic similarity between domains. We hypothesize that alleviating this overlap allows for a more effective modeling of multi-domain tasks; we consider the biomedical and general domains in this paper. We present a study on reducing sub-word overlap by scaling the vocabulary size in a Transformer encoder model while pretraining with multiple domains. We observe a significant increase in downstream performance in the general-biomedical cross-domain from a reduction in sub-word overlap.

2020

pdf
The OpenNMT Neural Machine Translation Toolkit: 2020 Edition
Guillaume Klein | François Hernandez | Vincent Nguyen | Jean Senellart
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf
The Ubiqus English-Inuktitut System for WMT20
François Hernandez | Vincent Nguyen
Proceedings of the Fifth Conference on Machine Translation

This paper describes Ubiqus’ submission to the WMT20 English-Inuktitut shared news translation task. Our main system, and only submission, is based on a multilingual approach, jointly training a Transformer model on several agglutinative languages. The English-Inuktitut translation task is challenging at every step, from data selection, preparation and tokenization to quality evaluation down the line. Difficulties emerge both because of the peculiarities of the Inuktitut language as well as the low-resource context.

pdf
Pandemic Literature Search: Finding Information on COVID-19
Vincent Nguyen | Maciek Rybinski | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the The 18th Annual Workshop of the Australasian Language Technology Association

Finding information related to a pandemic of a novel disease raises new challenges for information seeking and retrieval, as the new information becomes available gradually. We investigate how to better rank information for pandemic information retrieval. We experiment with different ranking algorithms and propose a novel end-to-end method for neural retrieval, and demonstrate its effectiveness on the TREC COVID search. This work could lead to a search system that aids scientists, clinicians, policymakers and others in finding reliable answers from the scientific literature.

pdf
Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation
Paul Tardy | David Janiszek | Yannick Estève | Vincent Nguyen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Summarizing texts is not a straightforward task. Before even considering text summarization, one should determine what kind of summary is expected. How much should the information be compressed? Is it relevant to reformulate or should the summary stick to the original phrasing? State-of-the-art on automatic text summarization mostly revolves around news articles. We suggest that considering a wider variety of tasks would lead to an improvement in the field, in terms of generalization and robustness. We explore meeting summarization: generating reports from automatic transcriptions. Our work consists in segmenting and aligning transcriptions with respect to reports, to get a suitable dataset for neural summarization. Using a bootstrapping approach, we provide pre-alignments that are corrected by human annotators, making a validation set against which we evaluate automatic models. This consistently reduces annotators’ efforts by providing iteratively better pre-alignment and maximizes the corpus size by using annotations from our automatic alignment models. Evaluation is conducted on publicmeetings, a novel corpus of aligned public meetings. We report automatic alignment and summarization performances on this corpus and show that automatic alignment is relevant for data annotation since it leads to large improvement of almost +4 on all ROUGE scores on the summarization task.

2019

pdf
ANU-CSIRO at MEDIQA 2019: Question Answering Using Deep Contextual Knowledge
Vincent Nguyen | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the 18th BioNLP Workshop and Shared Task

We report on our system for textual inference and question entailment in the medical domain for the ACL BioNLP 2019 Shared Task, MEDIQA. Textual inference is the task of finding the semantic relationships between pairs of text. Question entailment involves identifying pairs of questions which have similar semantic content. To improve upon medical natural language inference and question entailment approaches to further medical question answering, we propose a system that incorporates open-domain and biomedical domain approaches to improve semantic understanding and ambiguity resolution. Our models achieve 80% accuracy on medical natural language inference (6.5% absolute improvement over the original baseline), 48.9% accuracy on recognising medical question entailment, 0.248 Spearman’s rho for question answering ranking and 68.6% accuracy for question answering classification.

pdf
Question Answering in the Biomedical Domain
Vincent Nguyen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Question answering techniques have mainly been investigated in open domains. However, there are particular challenges in extending these open-domain techniques to extend into the biomedical domain. Question answering focusing on patients is less studied. We find that there are some challenges in patient question answering such as limited annotated data, lexical gap and quality of answer spans. We aim to address some of these gaps by extending and developing upon the literature to design a question answering system that can decide on the most appropriate answers for patients attempting to self-diagnose while including the ability to abstain from answering when confidence is low.

pdf
Investigating the Effect of Lexical Segmentation in Transformer-based Models on Medical Datasets
Vincent Nguyen | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the The 17th Annual Workshop of the Australasian Language Technology Association

2018

pdf
OpenNMT: Neural Machine Translation Toolkit
Guillaume Klein | Yoon Kim | Yuntian Deng | Vincent Nguyen | Jean Senellart | Alexander Rush
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)