Suzanna Sia


2022

Prefix Embeddings for In-context Machine Translation
Suzanna Sia | Kevin Duh
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Very large language models have been shown to translate with few-shot in-context examples. However, they have not achieved state-of-the-art results for translating out of English. In this work, we investigate an extremely lightweight fixed-parameter method for conditioning a large language model to better translate into the target language. Our method introduces additional embeddings, known as prefix embeddings, which do not interfere with the existing weights of the model. Using unsupervised and weakly semi-supervised methods that train only 0.0001% of the model parameters, this simple method improves translation by ~0.2-1.3 BLEU points across 3 domains and 3 languages. We analyze the training dynamics of the resulting embeddings and where they lie in the embedding space, and show that our trained embeddings can be used both for in-context translation and for diverse generation of the target sentence.
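As a rough illustration of the prefix-embedding idea described in this abstract, the sketch below prepends a few trainable vectors to a sequence of frozen token embeddings and computes what share of the parameters would be trained. It is not the authors' implementation: the function names (`prepend_prefix`, `trainable_fraction`) and all sizes are hypothetical, chosen only to make the arithmetic visible.

```python
from typing import List

Vector = List[float]


def prepend_prefix(prefix: List[Vector], token_embeddings: List[Vector]) -> List[Vector]:
    """Return the model input: trainable prefix vectors followed by frozen token embeddings."""
    if not token_embeddings:
        raise ValueError("token_embeddings must not be empty")
    dim = len(token_embeddings[0])
    if any(len(v) != dim for v in prefix):
        raise ValueError("prefix vectors must match the embedding dimension")
    # The existing embeddings are left untouched; only the prefix would receive gradients.
    return prefix + token_embeddings


def trainable_fraction(num_prefix: int, dim: int, total_model_params: int) -> float:
    """Share of all parameters that is trained when only the prefix is updated."""
    return (num_prefix * dim) / total_model_params


# Hypothetical toy sizes, for illustration only.
prefix = [[0.0] * 4 for _ in range(2)]               # 2 trainable prefix vectors
tokens = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]    # 3 frozen token embeddings
model_input = prepend_prefix(prefix, tokens)
fraction = trainable_fraction(num_prefix=2, dim=4, total_model_params=800)
```

With a real model the denominator would be in the billions, which is how a handful of prefix vectors can amount to a tiny fraction of the parameters.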

2021

Adaptive Mixed Component LDA for Low Resource Topic Modeling
Suzanna Sia | Kevin Duh
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Probabilistic topic models in low-resource scenarios are faced with less reliable estimates due to the sparsity of discrete word co-occurrence counts, and do not have the luxury of retraining word or topic embeddings using neural methods. In this challenging resource-constrained setting, we explore mixture models that utilise pre-trained embeddings and interpolate between the discrete and continuous topic-word distributions to improve topic coherence. We introduce an automatic trade-off between the discrete and continuous representations via an adaptive mixture coefficient, which places greater weight on the discrete representation when the corpus statistics are more reliable. The adaptive mixture coefficient takes into account global corpus statistics and the uncertainty in each topic’s continuous distribution. Our approach outperforms the fully discrete, fully continuous, and static mixture models on topic coherence in low-resource settings. We additionally demonstrate the generalisability of our method by extending it to handle multilingual document collections.
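The interpolation between discrete and continuous topic-word distributions can be sketched as a simple convex combination. This is a minimal sketch, not the method from the paper: the abstract does not give the rule for the adaptive coefficient, so the mixture weight `lam` is passed in by hand here, and the function name `mix_topic_word` and the toy distributions are hypothetical.

```python
from typing import Dict


def mix_topic_word(
    discrete: Dict[str, float],
    continuous: Dict[str, float],
    lam: float,
) -> Dict[str, float]:
    """Interpolate two topic-word distributions: lam * discrete + (1 - lam) * continuous."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    vocab = set(discrete) | set(continuous)
    # Words missing from one distribution are treated as having probability 0 there.
    return {
        w: lam * discrete.get(w, 0.0) + (1.0 - lam) * continuous.get(w, 0.0)
        for w in vocab
    }


# Hypothetical distributions for one topic over a three-word vocabulary.
discrete_topic = {"goal": 0.7, "match": 0.3, "team": 0.0}
continuous_topic = {"goal": 0.4, "match": 0.3, "team": 0.3}

# A larger lam trusts the count-based (discrete) estimate more.
mixed = mix_topic_word(discrete_topic, continuous_topic, lam=0.5)
```

Because both inputs sum to one and the weights sum to one, the mixed distribution also sums to one. An adaptive scheme would choose `lam` per topic from corpus statistics rather than fixing it.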

2020

CLIReval: Evaluating Machine Translation as a Cross-Lingual Information Retrieval Task
Shuo Sun | Suzanna Sia | Kevin Duh
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We present CLIReval, an easy-to-use toolkit for evaluating machine translation (MT) with the proxy task of cross-lingual information retrieval (CLIR). Contrary to what the project name might suggest, CLIReval does not actually require any annotated CLIR dataset. Instead, it automatically transforms translations and references used in MT evaluations into a synthetic CLIR dataset; it then sets up a standard search engine (Elasticsearch) and computes various information retrieval metrics (e.g., mean average precision) by treating the translations as documents to be retrieved. The idea is to gauge the quality of MT by its impact on the document translation approach to CLIR. As a case study, we run CLIReval on the “metrics shared task” of WMT2019; while this extrinsic metric is not intended to replace popular intrinsic metrics such as BLEU, results suggest CLIReval is competitive in many language pairs in terms of correlation to human judgments of quality. CLIReval is publicly available at https://github.com/ssun32/CLIReval.
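Mean average precision, the retrieval metric named in this abstract, can be computed as in the sketch below. This is the standard textbook definition written out in plain Python, not code from the CLIReval toolkit; the function names and the toy rankings are hypothetical, and each ranked list is assumed to contain no duplicate document IDs.

```python
from typing import Dict, List, Sequence, Tuple


def average_precision(ranked_ids: Sequence[str], relevant_ids: Sequence[str]) -> float:
    """Average of precision@k taken at every rank k where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, Tuple[List[str], List[str]]]) -> float:
    """Mean of average precision over queries; runs maps query -> (ranking, relevant IDs)."""
    if not runs:
        return 0.0
    scores = [average_precision(ranking, relevant) for ranking, relevant in runs.values()]
    return sum(scores) / len(scores)


# Hypothetical search results for two queries.
runs = {
    "q1": (["d1", "d2", "d3"], ["d1", "d3"]),
    "q2": (["d2", "d1"], ["d1"]),
}
map_score = mean_average_precision(runs)
```

In the document-translation setting the abstract describes, the ranked documents would be machine-translated texts, so a better translation system should tend to yield a higher score.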

Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!
Suzanna Sia | Ayush Dalmia | Sabrina J. Mielke
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Topic models are a useful analysis tool to uncover the underlying themes within document collections. The dominant approach is to use probabilistic topic models that posit a generative story, but in this paper we propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words. We provide benchmarks for the combination of different word embeddings and clustering algorithms, and analyse their performance under dimensionality reduction with PCA. The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.
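The idea of clustering word embeddings with weights and then reranking the top words can be sketched with a small weighted k-means. This is an illustrative sketch under stated assumptions, not the authors' implementation: the 2-dimensional vectors, the use of word frequency as the weight, the fixed initial centroids, and the names `weighted_kmeans` and `top_words` are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(
    vectors: List[List[float]],
    weights: List[float],
    init_centroids: List[List[float]],
    iterations: int = 10,
) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm in which each point pulls its centroid in proportion to its weight."""
    centroids = [list(c) for c in init_centroids]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each word goes to its nearest centroid.
        for i, v in enumerate(vectors):
            assignments[i] = min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
        # Update step: each centroid becomes the weighted mean of its members.
        for k in range(len(centroids)):
            members = [i for i, a in enumerate(assignments) if a == k]
            total = sum(weights[i] for i in members)
            if total == 0:
                continue  # keep the old centroid for an empty cluster
            dim = len(centroids[k])
            centroids[k] = [
                sum(weights[i] * vectors[i][d] for i in members) / total for d in range(dim)
            ]
    return assignments, centroids


def top_words(words: List[str], weights: List[float], assignments: List[int]) -> Dict[int, List[str]]:
    """Group words by cluster and rerank each cluster by weight, highest first."""
    topics: Dict[int, List[Tuple[float, str]]] = {}
    for word, weight, cluster in zip(words, weights, assignments):
        topics.setdefault(cluster, []).append((weight, word))
    return {k: [w for _, w in sorted(v, key=lambda p: -p[0])] for k, v in topics.items()}


# Hypothetical toy embeddings and corpus frequencies.
words = ["goal", "match", "team", "vote", "party", "election"]
vectors = [[1.0, 0.1], [0.9, 0.0], [1.0, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]
weights = [5.0, 3.0, 4.0, 6.0, 2.0, 1.0]
assignments, centroids = weighted_kmeans(vectors, weights, [vectors[0], vectors[3]])
topics = top_words(words, weights, assignments)
```

Each resulting cluster plays the role of a topic, and the reranked word list is what would be shown as that topic's top words.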