%
% File emnlp2020.tex
%
%% Based on the style files for ACL 2020, which were
%% Based on the style files for ACL 2018, NAACL 2018/19, which were
%% Based on the style files for ACL-2015, with some improvements
%%  taken from the NAACL-2016 style
%% Based on the style files for ACL-2014, which were, in turn,
%% based on ACL-2013, ACL-2012, ACL-2011, ACL-2010, ACL-IJCNLP-2009,
%% EACL-2009, IJCNLP-2008...
%% Based on the style files for EACL 2006 by 
%%e.agirre@ehu.es or Sergi.Balari@uab.es
%% and that of ACL 08 by Joakim Nivre and Noah Smith

\documentclass[11pt,a4paper]{article}
\usepackage[hyperref]{emnlp2020}
\usepackage{times}
\usepackage{latexsym}
\renewcommand{\UrlFont}{\ttfamily\small}

% This is not strictly necessary, and may be commented out,
% but it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage[ruled,vlined]{algorithm2e}

%\usepackage{algorithmic}
%\usepackage{hyperref}       % hyperlinks
\usepackage{amsmath}
\usepackage{cleveref}       
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{mathtools}
\usepackage{amssymb}
\usepackage{bm}
\usepackage{bbm}
\usepackage{multirow}
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}

\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{enumitem}
%\usepackage[usenames,dvipsnames]{xcolor}

\newcommand{\todo}{{\color{red}TODO}}
\newcommand{\x}{\bm{\mathrm{x}}}
\newcommand{\y}{\bm{\mathrm{y}}}
\newcommand{\z}{\bm{\mathrm{z}}}
\newcommand{\MU}{\bm{\mathrm{\mu}}}
\newcommand{\SIGMA}{\bm{\mathrm{\sigma}}}
\newcommand{\dec}{p_\theta (\y|\z,\x)}
\newcommand{\pri}{p_\theta (\z|\x)}
\newcommand{\post}{q_\phi (\z|\y,\x)}
\newcommand{\elbobig}{ \E_{\z \sim q_\phi} \Big[ \log \dec \Big] - \text{KL}\Big[ \post \, \Big|\Big| \, \pri \Big]}
\newcommand{\elbosmall}{ \E_{\z \sim q_\phi} \big[ \log \dec \big] - \text{KL}\big[ \post \, \big|\big| \, \pri \big]}
\DeclareMathOperator*{\E}{\mathbb{E}}
\newcommand{\wmtende}{WMT'14 En$\rightarrow$De}
\newcommand{\wmtdeen}{WMT'14 De$\rightarrow$En}
\newcommand{\wmtendeboth}{WMT'14 En$\leftrightarrow$De}
\newcommand{\wmtenro}{WMT'16 En$\rightarrow$Ro}
\newcommand{\wmtroen}{WMT'16 Ro$\rightarrow$En}
\newcommand{\wmtenroboth}{WMT'16 En$\leftrightarrow$Ro}
\newcommand{\iwsltdeen}{IWSLT'16 De$\rightarrow$En}
\newcommand{\argmax}{\text{argmax}}% # COLOR
\newcommand{\mygeq}{{}}% # COLOR
\newcommand{\modelts}{{Tr-S}}
\newcommand{\modeltb}{{Tr-B}}
\newcommand{\modeltl}{{Tr-L}}
\newcommand{\modelgb}{{Ga-B}}
\newcommand{\modelgl}{{Ga-L}}
\newcommand{\modelfs}{{Fl-S}}
\newcommand{\modelfb}{{Fl-B}}
\newcommand{\modelfl}{{Fl-L}}

\crefformat{section}{\S#2#1#3} % see manual of cleveref, section 8.2.1
\crefformat{subsection}{\S#2#1#3}
\crefformat{subsubsection}{\S#2#1#3}

\definecolor{pinegreen}{rgb}{0.0, 0.47, 0.44}
\definecolor{olive}{rgb}{0.5, 0.5, 0.0}
\definecolor{ao}{rgb}{0.0, 0.5, 0.0}
\definecolor{darkpastelgreen}{rgb}{0.01, 0.75, 0.24}
\definecolor{forestgreen}{rgb}{0.13, 0.55, 0.13}
\definecolor{htmlgreen}{rgb}{0.0, 0.5, 0.0}

\aclfinalcopy % Uncomment this line for the final submission
\def\aclpaperid{23} %  Enter the acl Paper ID here

%\setlength\titlebox{5cm}
% You can expand the titlebox if you need extra space
% to show all the authors. Please do not make the titlebox
% smaller than 5cm (the original size); we will check this
% in the camera-ready version and ask you to change it back.

\newcommand\BibTeX{B\textsc{ib}\TeX}

\title{On the Discrepancy between Density Estimation and Sequence Generation}

\author{Jason Lee \\
        New York University \\
      {\small \texttt{jason@cs.nyu.edu} } \\\And

      Dustin Tran \\
      Google AI \\
      {\small \texttt{trandustin@google.com} } \\\And
      
      Orhan Firat \\
      Google AI \\
      {\small \texttt{orhanf@google.com} } \\\And
  
  Kyunghyun Cho \\
        New York University \\
      {\small \texttt{kyunghyun.cho@nyu.edu} } \\
}

\date{}

\begin{document}
\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{abstract}
Many sequence-to-sequence generation tasks, including machine translation and text-to-speech, can be posed as estimating the density of the output $y$ given the input $x$: $p(y|x).$ Given this interpretation, it is natural to evaluate sequence-to-sequence models using conditional log-likelihood on a test set. 
However, the goal of sequence-to-sequence generation (or structured prediction) is to find the best output $\hat{y}$ given an input $x$, and each task has its own downstream metric $R$ that scores a model output by comparing against a set of references $y^*$: $R(\hat{y}, y^* | x).$
While we hope that a model that excels in density estimation also performs well on the downstream metric, the exact correlation has not been studied for sequence generation tasks.
In this paper, by comparing several density estimators on five machine translation tasks, we find that the correlation between rankings of models based on log-likelihood and BLEU varies significantly depending on the range of the model families being compared.
First, log-likelihood is highly correlated with BLEU when we consider models within the same family (e.g. autoregressive models, or latent variable models with the same parameterization of the prior).
However, we observe no correlation between rankings of models across different families:
(1) among non-autoregressive latent variable models, a flexible prior distribution is better at density estimation but gives worse generation quality than a simple prior, and (2) autoregressive models offer the best translation performance overall, while latent variable models with a normalizing flow prior give the highest held-out log-likelihood across all datasets.
\end{abstract}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}
\label{intro}

Sequence-to-sequence generation tasks can be cast as conditional density estimation $p(y|x)$ where $x$ and $y$ are input and output sequences. In this framework, density estimators are trained to maximize the conditional log-likelihood, and also evaluated using log-likelihood on a test set. 
However, many sequence generation tasks require finding the best output $\hat{y}$ given an input $x$ at test time, and the output is evaluated against a set of references $y^*$ on a task-specific metric: $R(\hat{y},y^*|x).$
For example, machine translation systems are evaluated using BLEU scores~\citep{papieni02bleu}, image captioning systems use METEOR~\citep{banerjee05meteor} and text-to-speech systems use MOS (mean opinion scores). 
As density estimators are optimized on log-likelihood, we want models with higher held-out log-likelihoods to give better generation quality, but the correlation has not been well studied for sequence generation tasks. 
In this work, we investigate the correlation between rankings of density estimators based on (1) test log-likelihood and (2) the downstream metric for machine translation.
%\footnote{We open source our code at \url{https://github.com/tensorflow/tensor2tensor}}.

On five language pairs from three machine translation datasets ({\wmtendeboth}, {\wmtenroboth}, {\iwsltdeen}), we compare the held-out log-likelihood and BLEU scores of several density estimators: (1) autoregressive models~\citep{vaswani17attention}, (2) latent variable models with a non-autoregressive decoder and a simple (diagonal Gaussian) prior~\citep{shu19latent}, and (3) latent variable models with a non-autoregressive decoder and a flexible (normalizing flow) prior~\citep{ma19flowseq}.

We present two key observations. First, among models within the same family, we find that log-likelihood is strongly correlated with BLEU. The correlation is almost perfect for autoregressive models and high for latent variable models with the same prior. Between models of different families, however, log-likelihood and BLEU are not correlated. Latent variable models with a flow prior are in fact the best density estimators (even better than autoregressive models), but they give the worst generation quality. Gaussian prior models offer comparable or better BLEU scores, while autoregressive models give the best BLEU scores overall.
From these findings, we conclude that the correlation between log-likelihood and BLEU scores varies significantly depending on the range of model families considered.

Second, we find that knowledge distillation drastically hurts density estimation performance across different models and datasets, but consistently improves translation quality of non-autoregressive models. 
For autoregressive models, distillation slightly hurts translation quality. Among latent-variable models, iterative inference with a delta posterior~\citep{shu19latent} significantly improves the translation quality of latent variable models with a Gaussian prior, whereas the improvement is relatively small for the flow prior.
Overall, for fast generation, we recommend a latent variable non-autoregressive model with a simple prior (rather than a flexible one), trained with knowledge distillation and decoded with iterative inference. This setup is 5--7x faster than the autoregressive baseline at the expense of roughly 2 BLEU points on average, and it improves upon latent variable models with a flexible prior in generation speed, BLEU, and parameter count.

\section{Background}
\label{sec:background}
Sequence-to-sequence generation is a supervised learning problem of generating an output sequence given an input sequence. For many such tasks, conditional density estimators have been very successful~\citep{sutskever14sequence,bahdanau15neural,vinyals15show,vinyals15neural}. 

To learn the distribution of an output sequence, it is crucial to give the model enough capacity to capture the dependencies among the output variables. We explore two ways to achieve this: (1) directly modeling the dependencies with an autoregressive factorization of the variables, and (2) letting latent variables capture the dependencies, so that the output distribution factorizes given the latent variables and the output can be generated in parallel.
We discuss both classes of density estimators in depth below. We denote the training set as a set of tuples $\{(\x_n, \y_n)\}_{n=1}^N$ and each input and output example as sequences of random variables $\x=\{x_1, \dots, x_{T'}\}$ and $\y=\{y_1, \dots, y_{T}\}$ (where we drop the subscript $n$ for notational simplicity). 
We use $\theta$ to denote the model parameters.

\subsection{Autoregressive Models}
\paragraph{Learning}
Autoregressive models factorize the joint distribution of the sequence of output variables $\y=\{y_1,\dots,y_T\}$ as a product of conditional distributions:
$$\log p_{\text{AR}}(\y|\x) = \sum_{t=1}^{T} \log p_\theta(y_t|y_{<t},\x).$$
They are trained to maximize the log-likelihood of the training data: $L_\text{AR}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log p_{\text{AR}}(\y_n|\x_n).$
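As a minimal illustration of the factorization above, the sketch below computes $\log p_{\text{AR}}(\y|\x)$ by summing per-step conditional log-probabilities. Here \texttt{cond\_logprob} is a hypothetical stand-in for a trained model's per-step distribution (a uniform distribution over a 4-token vocabulary, purely for the demo):

```python
import math

# Hypothetical per-step conditional distribution log p(y_t | y_<t, x).
# A uniform distribution over a 4-token vocabulary stands in for a trained model.
def cond_logprob(y_t, y_prev, x):
    return math.log(1.0 / 4.0)

# Autoregressive log-likelihood: log p(y|x) = sum_t log p(y_t | y_<t, x).
def ar_log_likelihood(y, x):
    total = 0.0
    for t in range(len(y)):
        total += cond_logprob(y[t], y[:t], x)
    return total

ll = ar_log_likelihood([1, 2, 3], x=[0])
```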

\paragraph{Parameterization}
Recurrent neural networks and their gated variants are natural parameterizations of autoregressive models~\citep{elman90finding,hochreiter97long,chung14empirical}. 
By ensuring that no future information $y_{\geq t}$ is used in predicting the current timestep $y_t$, non-recurrent architectures can also parameterize autoregressive models, such as convolutions~\citep{oord16wavenet,gehring17convolutional} and Transformers~\citep{vaswani17attention}, which are feedforward networks with self-attention.

\paragraph{Inference} Finding the most likely output sequence given an input sequence under an autoregressive model amounts to solving a search problem:
%\begin{align*}
%\argmax_{\y} \log p_\theta (\y|\x) = \argmax_{y_{1:T}} \sum_{t=1}^{T}{\log p_\theta (y_t|y_{<t}, \x)}.
$\argmax_{y_{1:T}} \sum_{t=1}^{T}{\log p_\theta (y_t|y_{<t}, \x)}.$
%\end{align*}
As the size of the search space grows exponentially with the length of the output sequence $T$, solving this exactly is intractable. Therefore, approximate search algorithms such as greedy search or beam search are often used.
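A minimal sketch of beam search over a toy conditional distribution follows. \texttt{step\_logprobs} is a hypothetical stand-in for the model's $\log p_\theta(y_t|y_{<t},\x)$; here it scores a 3-token vocabulary and depends only on prefix length, for illustration:

```python
import math

# Hypothetical per-step distribution over a 3-token vocabulary {0, 1, 2};
# it depends only on the prefix length (an assumption for this demo).
def step_logprobs(prefix):
    scores = [0.6, 0.3, 0.1] if len(prefix) % 2 == 0 else [0.2, 0.5, 0.3]
    return [math.log(s) for s in scores]

def beam_search(length, beam_width):
    beams = [((), 0.0)]  # each beam: (prefix, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok, lp in enumerate(step_logprobs(prefix)):
                candidates.append((prefix + (tok,), score + lp))
        # keep only the top-k candidates by cumulative log-prob
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

best = beam_search(length=4, beam_width=2)
```

Greedy search is the special case \texttt{beam\_width=1}.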

\subsection{Latent Variable Models}
\paragraph{Learning}
Latent variable models posit a joint distribution of observed variables ($\y$) and unobserved variables ($\z$). They are trained to maximize the marginal log-likelihood of the training data:
\begin{equation}
%\label{eq:lvm}
\log p_{\text{LVM}} (\y|\x) = \log \int_{\z} \dec \: \pri d\z.
\end{equation}
As the marginalization over $\z$ makes computing the marginal log-likelihood and posterior inference intractable, variational inference proposes to use a parameterized family of distributions $q_\phi(\z|\y,\x)$ to approximate the true posterior $p(\z|\y,\x).$ Then, we have the evidence lowerbound (ELBO)~\citep{wainwright08graphical,kingma14auto}:
\begin{align}
\label{eq:elbo}
%\begin{split}
\log & \: p_{\text{LVM}} (\y|\x) \geq \text{ELBO}(\y,\x;\theta,\phi) \\
%- \text{KL}\Big[q_{\phi}(\z|\y,\x) \Big|\Big| p_{\theta}(\z|\y,\x)\Big] \\
    %&= \elbobig , \notag
    &= \E_{\z \sim q_\phi} \big[ \log p_\theta(\y,\z|\x) - \log q_\phi(\z|\y,\x) \big], \notag
%\end{split}
\end{align}
where $\dec$ is the decoder, $\post$ is the variational posterior and $\pri$ is the prior. Both the model and variational parameters $\theta, \phi$ are estimated to maximize ELBO over the training set: $L_{\text{LVM}}(\theta,\phi) = \frac{1}{N}\sum_{n=1}^{N} \text{ELBO}(\y_n,\x_n;\theta,\phi).$
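For intuition, the sketch below estimates the ELBO for a single latent dimension with a single reparameterized sample, assuming Gaussian posterior and prior (so the KL term is available in closed form). The decoder term \texttt{log\_dec} is a hypothetical placeholder, not the actual architecture used in this paper:

```python
import math
import random

# Closed-form KL[ N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ] for scalars.
def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    return (math.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2) - 0.5)

# Single-sample ELBO estimate: E_q[log p(y|z,x)] - KL[q || p],
# with the expectation approximated by one reparameterized sample.
def elbo_estimate(log_dec, mu_q, sigma_q, mu_p, sigma_p, rng):
    z = mu_q + sigma_q * rng.gauss(0.0, 1.0)  # z ~ q via reparameterization
    return log_dec(z) - kl_gaussians(mu_q, sigma_q, mu_p, sigma_p)

rng = random.Random(0)
# Hypothetical decoder log-likelihood (always <= 0 here).
val = elbo_estimate(lambda z: -0.5 * z * z, 0.0, 1.0, 0.0, 1.0, rng)
```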

\paragraph{Parameterization}

As latent variables can capture the dependencies between the output variables, the decoding distribution can be factorized: $\dec = \prod_{t=1}^{T} p_\theta(y_t|\z,\x)$. 
%The approximate posterior distribution is often parameterized with a mean-field factorization, which can be parameterized by any neural network that outputs mean and variance for each output position:
The approximate posterior distribution is also typically factorized; it can be parameterized by any neural network that outputs a mean and standard deviation for each output position:
$q_{\phi}(z_{1:T}|\y,\x) = \prod_{t=1}^{T} \mathcal{N}\Big(z_{t} \Big|\mu_{\phi,t}(\y,\x), \sigma_{\phi,t}(\y, \x)\Big).$
We discuss prior distributions in \cref{sec:lvmprior}.

\paragraph{Inference}
Generating the most likely output given an input with a latent variable model requires optimizing ELBO with respect to the output: $\argmax_{\y}{\text{ELBO}(\y,\x;\theta,\phi)}.$
%$$\argmax_{\y} {\E_{\z \sim q_\phi} \Big[ \log p_{\theta} (\y|\z,\x) \Big] - \text{KL}\Big[ \post \Big|\Big| \pri \Big] }.$$
As computing the expectation in Eq.~\ref{eq:elbo} is intractable, we instead optimize a proxy lowerbound using a delta posterior~\citep{shu19latent}: 
\begin{equation*}
  \delta(\z|\MU) =
    \begin{cases}
      1, & \text{if} \:\:\z = \MU\\
      0, & \text{otherwise}
    \end{cases}       
\end{equation*}
Then, the ELBO reduces to:
\vskip -0.3in
\begin{align}
& \E_{\z\sim \delta(\z|\MU)}{\Big[ \log \dec + \log \pri \Big]} + \overbrace{\mathcal{H}(\delta)}^{=0}, \notag \\
&= \log p_{\theta} (\y|\MU, \x) + \log p_{\theta} (\MU|\x).
\label{eq:proxy}
\end{align}

We maximize Eq.~\ref{eq:proxy} with iterative refinement: the EM algorithm alternates between (1) matching the proxy to the original lowerbound by setting ${\MU} = \E_{q_\phi}[\z]$, and (2) maximizing the proxy lowerbound with respect to $\y$ by: $\hat{\y} = \text{argmax}_{\y} (\log p_\theta(\y|{\MU},\x)).$ 
The delta posterior is initialized using the prior (e.g. $\MU=\E_{\z \sim \pri}[\z]$ in case of a Gaussian prior) so that the inference algorithm is fully deterministic, a desirable property for sequence generation tasks. We study the effect of iterative refinement on BLEU score in detail.
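The alternation above can be sketched as a short loop. The helpers \texttt{prior\_mean}, \texttt{posterior\_mean} and \texttt{decode} are hypothetical stand-ins for the trained prior, approximate posterior, and decoder networks; the toy instantiation at the bottom exists only so the loop runs:

```python
# Hedged sketch of deterministic iterative inference with a delta posterior.
def iterative_refinement(x, prior_mean, posterior_mean, decode, steps=4):
    mu = prior_mean(x)              # initialize delta posterior from the prior
    y = decode(mu, x)               # initial hypothesis: argmax_y log p(y|mu, x)
    for _ in range(steps):
        mu = posterior_mean(y, x)   # step (1): mu = E_{q_phi}[z]
        y_new = decode(mu, x)       # step (2): re-decode given the new mu
        if y_new == y:              # stop at a fixed point
            break
        y = y_new
    return y

# Toy instantiation (purely illustrative, not the paper's networks):
result = iterative_refinement(
    x=None,
    prior_mean=lambda x: 0.0,
    posterior_mean=lambda y, x: float(min(y + 1, 3)),
    decode=lambda mu, x: int(round(mu)),
)
```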

\subsection{Prior for Latent Variable Models}
\label{sec:lvmprior}
Several works have found that the prior distribution plays a critical role in balancing the variational posterior and the decoder, and that a standard Gaussian prior may be too rigid for the aggregate posterior to match~\citep{hoffman16elbo,rosca18distribution}. Indeed, follow-up work has shown that more flexible prior distributions outperform simple priors on several density estimation tasks~\citep{tomczak18vae,bauer19resampled}.
Therefore, we explore two choices for the prior distribution: a factorized Gaussian and a normalizing flow.
%Therefore, we explore two choices for the prior distribution: a diagonal Gaussian distribution and a normalizing flow.

\paragraph{Diagonal Gaussian}
A simple model of the conditional prior is a factorized Gaussian distribution:
$$\log p_{\theta}(z_{1:T}|\x) = \sum_{t=1}^{T} {\log \mathcal{N}\Big(z_{t} \Big|\mu_{\theta,t}(\x), \sigma_{\theta,t}(\x)\Big),}$$
where each latent variable $z_t$ is modeled as a diagonal Gaussian with mean and standard deviation computed from a learned function.

\paragraph{Normalizing Flow}
Normalizing flows~\citep{tabak13family,rezende15variational,papa19normalizing} offer a general method to construct complex probability distributions over continuous random variables. A normalizing flow consists of (1) a base distribution $p_{b}(\epsilon)$ (often chosen to be a standard Gaussian) and (2) an invertible transformation $f$ with inverse $f^{-1}$, such that $f(\z)=\epsilon$ and $f^{-1}(\epsilon)=\z.$
As our prior is conditioned on $\x$, so are the transformations: $f(\z; \x)=\epsilon,\:\:f^{-1}(\epsilon; \x)=\z.$
Then, by change-of-variables, we can evaluate the exact density of the latent variable ${\z}$ under the flow prior:
$$\log p_\theta(\z|\x) = \log p_b\Big(f(\z; \x)\Big) + \log \bigg|\text{det} \frac{\partial f(\z; \x)}{\partial \z}\bigg|.$$
Affine coupling flows~\citep{dinh17density} enable efficient generation and computation of the Jacobian determinant by constructing each transformation such that only a subset of the random variables undergoes affine transformation, using parameters computed from the remaining variables:
\begin{align}
\label{eq:glow}
    &\z_\text{id}, \z_\text{tr} = \text{split}(\z) \nonumber \\
    &\bm{\mathrm{s}}, \bm{\mathrm{b}} = g_\text{param}(\z_\text{id}) \\
    %&\epsilon_\text{tr} = \bm{\mathrm{s}} \cdot \z_\text{tr} + \bm{\mathrm{b}} \nonumber \\
    %&f(\z) = \text{concat}(\z_\text{id}; \epsilon_\text{tr}) \nonumber
    %&\epsilon_\text{tr} = \bm{\mathrm{s}} \cdot \z_\text{tr} + \bm{\mathrm{b}} \nonumber \\
    &f(\z) = \text{concat}(\z_\text{id}; \:\: \bm{\mathrm{s}} \cdot \z_\text{tr} + \bm{\mathrm{b}}), \nonumber
\end{align}
where $g_\text{param}$ can be arbitrarily complex, as it need not be invertible. 
%but $g_\text{transform}$ is an invertible function given $\theta$. Affine transformations $\epsilon_\text{tr} = \bm{\mathrm{s}}(\theta) \cdot \z_\text{tr} + \bm{\mathrm{b}}(\theta)$ have been successfully used as $g_\text{transform}$ in generative modeling of images and speech~\citep{kingma18glow,prenger19waveglow}.
As invertibility is closed under function composition and the Jacobian determinant is multiplicative, increasingly flexible coupling flows can be constructed by stacking multiple flow layers and reordering such that all the variables are transformed. 
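A minimal sketch of one affine coupling layer follows, matching the split/transform structure of Eq.~\ref{eq:glow}. Here \texttt{g\_param} is a hypothetical stand-in for the Transformer layer that produces the scale and shift (its exact form below is an assumption for the demo); the key properties are exact invertibility and a cheap log-determinant:

```python
import math

# Hypothetical parameter network: maps the identity half to positive
# scales and shifts for the transformed half (illustrative choice only).
def g_param(z_id):
    s = [math.exp(0.1 * v) for v in z_id]  # exp keeps scales positive
    b = [0.5 * v for v in z_id]
    return s, b

def coupling_forward(z):
    half = len(z) // 2
    z_id, z_tr = z[:half], z[half:]        # split(z)
    s, b = g_param(z_id)
    eps_tr = [si * zi + bi for si, zi, bi in zip(s, z_tr, b)]
    # Jacobian is triangular, so log|det| is just the sum of log-scales.
    log_det = sum(math.log(si) for si in s)
    return z_id + eps_tr, log_det          # concat(z_id; s * z_tr + b)

def coupling_inverse(eps):
    half = len(eps) // 2
    z_id, eps_tr = eps[:half], eps[half:]
    s, b = g_param(z_id)                   # recompute from untouched half
    z_tr = [(ei - bi) / si for si, ei, bi in zip(s, eps_tr, b)]
    return z_id + z_tr

z = [0.3, -1.2, 0.7, 2.0]
eps, log_det = coupling_forward(z)
recovered = coupling_inverse(eps)
```

The inverse only needs the untransformed half, which is why $g_\text{param}$ itself never has to be inverted.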
%Deep affine coupling flows have been successfully used in generative models of images and speech~\citep{kingma18glow,prenger19waveglow}.

\iffalse
\subsection{Flow-based Density Estimators}
As normalizing flows apply continuous transformations to continuous distributions, they are not directly applicable to discrete data such as text. 
Recently proposed discrete normalizing flows (without the determinant Jacobian term) give promising performance on character-level language modeling and image compression~\citep{tran19discrete,hoogeboom19integer}. However, bias from straight-through gradient estimators hinders scalability in terms of flow depth and the number of classes. As this reduces their chance of success in large-scale sequence generation tasks such as machine translation, we do not include discrete flow models in our experiments and leave it as future work.
\fi

\subsection{Knowledge Distillation}
While most density estimators for sequence generation tasks are trained to maximize the log-likelihood of the training data, recent work has shown that the performance of non-autoregressive models can be improved significantly by training them on the predictions of a pre-trained autoregressive model~\citep{gu18non,oord18parallel}. 
\citet{zhou19understanding} recently found that distillation reduces the complexity of the training data, but its effect on density estimation performance has not been studied.

\section{Problem Definition}
\label{sec:probdef}
On a sequence generation task, a conditional density estimator $F \in \mathcal{H}$ (where $\mathcal{H}$ is a hypothesis set of density estimators in \cref{sec:background}) is trained to maximize the log-likelihood (or its approximation) of the training set $\{(x_n, y_n)\}_{n=1}^N$:
$$L(F) = \frac{1}{N} \sum_{n=1}^{N} \log p_{F}(y_n|x_n).$$

Once training converges, the model $F$ is evaluated on the test set $\{(x_m, y_m)\}_{m=1}^M$ using a downstream metric $R$:
%$$R(F) = R((y_1, \dots, y_M), (\hat{y}_1, \dots, \hat{y}_M), (x_1, \dots, x_M)),$$
\begin{equation*}
%R(F) = R\big(\{y_m\}_{m=1}^M,\,\{\hat{y}_m\}_{m=1}^M,\,\{x_m\}_{m=1}^M\big),
R(F) = R\big(\{(x_m, y_m, \hat{y}_m)\}_{m=1}^M\big),
\end{equation*}
where $\hat{y}_m = \argmax_{y} \log p_{F} (y|x_m).$

To perform model selection, we can rank a set of density estimators $\{F_1, \dots, F_K\}$ based on either the held-out log-likelihood or the downstream metric. We measure the correlation between the rankings given by the log-likelihood $L(F)$ and the downstream metric $R(F)$.
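One concrete way to quantify agreement between the two rankings is a rank correlation such as Kendall's $\tau$, sketched below from first principles. The scores are made-up illustrative values, not results from this paper:

```python
from itertools import combinations

# Kendall's tau: (concordant pairs - discordant pairs) / total pairs.
# tau = 1 means the two score lists rank the models identically.
def kendall_tau(a, b):
    pairs = list(combinations(range(len(a)), 2))
    concordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) > 0)
    discordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return (concordant - discordant) / len(pairs)

log_likelihoods = [-1.35, -1.44, -1.77]  # hypothetical L(F) per model
bleu_scores = [29.4, 28.2, 24.5]         # hypothetical R(F) per model
tau = kendall_tau(log_likelihoods, bleu_scores)
```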

\section{Experimental Setup}
On machine translation, we train several autoregressive models and latent variable models and analyze the correlation between their rankings based on log-likelihood and BLEU.

\subsection{Datasets and Preprocessing}
We use five language pairs from three translation datasets: {\iwsltdeen}\footnote{\url{https://wit3.fbk.eu/}} (containing 197K training, 2K development and 2K test sentence pairs), {\wmtenroboth}\footnote{\url{www.statmt.org/wmt16/translation-task.html}} (612K, 2K, 2K pairs) and {\wmtendeboth}\footnote{\url{www.statmt.org/wmt14/translation-task.html}} (4.5M, 3K, 3K pairs). For {\wmtendeboth} and {\wmtenroboth}, both directions are used.

We use the preprocessing scripts with default hyperparameters from the \texttt{tensor2tensor} framework.\footnote{\url{https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/bin/t2t-datagen}} Namely, we use wordpiece tokenization~\citep{schuster12japanese} with 32K wordpieces on all datasets. For {\wmtenroboth}, we follow \citet{sennrich16edinburgh} and normalize Romanian and remove diacritics before applying wordpiece tokenization. For training, we discard sentence pairs if either the source or the target length exceeds 64 tokens. 
As splitting along the time dimension~\citep{ma19flowseq} in the coupling flow layer requires that the length of the output sequence is a multiple of 2 at each level, \texttt{<EOS>} tokens are appended to the target sentence until its length is a multiple of 4.

%\subsection{Model Details}
\subsection{Autoregressive Models}
We use three Transformer~\citep{vaswani17attention} models of different sizes: Transformer-big (\modeltl), Transformer-base (\modeltb) and Transformer-small (\modelts). The first two models have the same hyperparameters as in \citet{vaswani17attention}. Transformer-small has 2 attention heads, 5 encoder and decoder layers, $d_{\text{model}}=256$ and $d_\text{filter}=1024$.

\subsection{Latent Variable Models}
The latent variable models in our experiments are composed of the source sentence encoder, length predictor, prior, decoder and posterior. The source sentence encoder is implemented with a standard Transformer encoder. 
Given the hidden states of the source sentence, the length predictor (a 2-layer MLP) predicts the length difference between the source and target sentences as a categorical distribution over the integers in $[-30, 30]$.
We implement the decoder $\dec$ with a standard Transformer decoder that outputs the logits of all target tokens in parallel. The approximate posterior $\post$ is implemented as a Transformer decoder with a final Linear layer with weight normalization~\citep{salimans16weight} to output the mean and standard deviation (having dimensionality $d_\text{latent}$). Both the decoder and the approximate posterior attend to the source hidden states.

\paragraph{Diagonal Gaussian Prior}
The diagonal Gaussian prior is implemented with a Transformer decoder which receives a sequence of positional encodings of length $T$ as input, and outputs the mean and standard deviation of each target token (of dimensionality $d_\text{latent}$). We train two models of different sizes: Gauss-base (\modelgb) and Gauss-large (\modelgl). Gauss-base has 4 attention heads, 3 posterior layers, 3 decoder layers and 6 encoder layers, whereas Gauss-large has 8 attention heads, 4 posterior layers, 6 decoder layers and 6 encoder layers.
$(d_\text{model}, d_\text{latent}, d_\text{filter})$ is (512, 512, 2048) for WMT experiments and (256, 256, 1024) for IWSLT experiments.

\paragraph{Normalizing Flow Prior}
The flow prior is implemented with Glow~\citep{kingma18glow}. We use a single Transformer decoder layer with a final Linear layer with weight normalization to parameterize $g_\text{param}$ in Eq.~\ref{eq:glow}. This produces the shift and scale parameters for the affine transformation.
Our flow prior has the multi-scale architecture with three levels~\citep{dinh17density}: at the end of each level, half of the latent variables are modeled with a standard Gaussian distribution.
We use three split patterns and multi-headed 1x1 convolution from \citet{ma19flowseq}. We experiment with the following hyperparameter settings: Flow-small (\modelfs) with 12/12/8 flow layers in each level and Flow-base (\modelfb) with 12/24/16 flow layers in each level.
The first level corresponds to the latent distribution and the last level corresponds to the base distribution. $(d_\text{model}, d_\text{latent}, d_\text{filter})$ is (320, 320, 640) for all experiments. For the Transformer decoder in $g_\text{param}$, we use 4 attention heads for Flow-small and 8 attention heads for Flow-base.


\subsection{Training and Optimization}
We use the Adam optimizer~\citep{kingma15adam} with the learning rate schedule used by \citet{vaswani17attention}. The norm of the gradients is clipped at 1.0. 
We perform early stopping and choose the learning rate warmup steps and dropout rate based on the BLEU score on the development set.
To train non-autoregressive models, the loss from the length predictor is minimized jointly with the negative ELBO.

\paragraph{Knowledge Distillation} Following previous work~\citep{kim16sequence,gu18non,lee18deterministic}, we construct a distilled dataset by decoding the training set using Transformer-base with beam width 4. For {\iwsltdeen}, we use Transformer-small.

\paragraph{Latent Variable Models} To ease optimization of latent variable models~\citep{bowman16generating,higgins17beta}, we set the weight of the KL term to 0 for the first 5,000 SGD steps and linearly increase it to 1 over the next 20,000 steps. Similarly to \citet{mansimov19molecular}, we find it helpful to add a small regularization term to the training objective that matches the approximate posterior with a standard Gaussian distribution: $\alpha \cdot \text{KL}\big[ q_\phi (\z|\y,\x) \: || \:\mathcal{N}(0, \bm{\mathrm{I}}) \big]$, as the original KL term $\text{KL}\big[ \post \, \big|\big| \, \pri \big]$ does not have a local point minimum but a valley of minima.
We find $\alpha=10^{-4}$ to work best.
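The KL-weight schedule above can be sketched as a small helper (the step counts are those stated in the text; the function name is ours):

```python
# KL annealing: weight 0 for the first 5,000 steps, then increased
# linearly to 1 over the next 20,000 steps, and held at 1 afterwards.
def kl_weight(step, warmup=5000, ramp=20000):
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / ramp)

w = kl_weight(15000)
```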

\paragraph{Flow Prior Models} We perform data-dependent initialization of actnorm parameters for the flow prior~\citep{kingma18glow} at the 5,000-th step, which is at the beginning of KL scheduling.

\subsection{Evaluation Metrics}

\paragraph{Log-likelihood} is the main metric for measuring density estimation (data modeling) performance. We compute exact log-likelihood for autoregressive models. 
For latent variable models, we estimate the marginal log-likelihood by importance sampling with 1K samples from the approximate posterior and using the ground truth target length.
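Concretely, with $K=1000$ samples the standard importance-sampling estimator is
$$\log p_{\text{LVM}} (\y|\x) \approx \log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(\y|\z^{(k)},\x)\, p_\theta(\z^{(k)}|\x)}{q_\phi(\z^{(k)}|\y,\x)}, \quad \z^{(k)} \sim \post,$$
which lower-bounds the marginal log-likelihood in expectation and tightens as $K$ grows.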

\paragraph{BLEU} measures the similarity (in terms of n-gram overlap) between a generated output and a set of references, regardless of the model. It is a standard metric for generation quality of machine translation systems. 
%We also compute \textbf{Pairwise BLEU} to measure the diversity among a set of outputs generated from a model~\citep{shen19mixture}.

\paragraph{Generation Speed} In addition to the quality-driven metrics, we measure the generation speed of each model as the number of sentences generated per second on a single V100 GPU.

\section{Results}
\label{sec:results}

\begin{table}[!t]
\small
\centering
\begin{sc}
\begin{tabular}{llrrrr} \toprule
 & & \multicolumn{2}{c}{BLEU ($\uparrow$)} & \multicolumn{2}{c}{LL ($\uparrow$)} \\ 
 & & \multicolumn{1}{c}{Raw} & \multicolumn{1}{c}{Dist.} & \multicolumn{1}{c}{Raw} & \multicolumn{1}{c}{Dist.} \\ 
 
 % WMT'14 En->De
\midrule
\multirow{11}{*}{\rotatebox{90}{$\:\:\:$\wmtende}} 
& \modelts & 24.54 & 24.94 & -1.77 & -2.36 \\
& \modeltb & 28.18 & 27.86 & -1.44 & -2.19 \\
& \modeltl & \underline{29.39} & {28.29} & -1.35 & -2.23 \\
\cmidrule{2-6}
& \modelgb & 15.74 & 24.54 &  \mygeq -1.51 & \mygeq -2.44 \\
& \modelgl & 17.33 & \textbf{25.53} &  \mygeq -1.47 & \mygeq -2.24 \\
& \modelfs & 18.17 & 21.98 &  \mygeq -1.41 & \mygeq -2.13 \\
& \modelfb & 18.57 & 21.82 &  \mygeq \textbf{-1.23} & \mygeq -2.05 \\
\cmidrule{2-6}
& \modelfb$^{(*)}$ & 18.55  & 21.45 & & \\
& \modelfl$^{(*)}$ & 20.85  & 23.72 & & \\ 

 % WMT'14 De->en
\midrule
\multirow{11}{*}{\rotatebox{90}{$\:\:\:$\wmtdeen}} 
& \modelts & 29.15 & 28.40 & -1.66 & -2.24 \\
& \modeltb & 32.21 & {32.24} & -1.42 & -2.12 \\
& \modeltl & \underline{33.16} & {32.24} & -1.35 & -2.05 \\
\cmidrule{2-6}
& \modelgb & 21.64 & 29.29 &  \mygeq -1.41 & \mygeq -2.17 \\
& \modelgl & 23.03 & \textbf{30.30} &  \mygeq -1.31 & \mygeq -2.04  \\
& \modelfs & 23.17 & 27.14 &  \mygeq -1.28 & \mygeq -1.73  \\
& \modelfb & 23.12 & 26.72 &  \mygeq \textbf{-1.20} & \mygeq -1.71  \\
\cmidrule{2-6}
& \modelfb$^{(*)}$ & 23.36  & 26.16 & &   \\
& \modelfl$^{(*)}$ & 25.40  & 28.39 & &   \\ 

 % WMT'16 En->Ro
\midrule
\multirow{10}{*}{\rotatebox{90}{$\:\:\:$\wmtenro}} 
& \modelts & 30.12 & 29.57 & -1.72 & -1.95 \\
& \modeltb & \underline{33.46} & {33.28} & -1.63 & -2.52  \\
\cmidrule{2-6}
& \modelgb & 28.03 & 29.71 &  \mygeq -2.38 & \mygeq -3.48  \\
& \modelgl & 28.16 & \textbf{30.91} &  \mygeq -2.44 & \mygeq -3.54  \\
& \modelfs & 26.85 & 28.63 &  \mygeq -1.53 & \mygeq -2.42  \\
& \modelfb & 27.49 & 29.09 &  \mygeq \textbf{-1.50} & \mygeq -2.31  \\
\cmidrule{2-6}
& \modelfb$^{(*)}$ & 29.26 & 29.34 & &   \\
& \modelfl$^{(*)}$ & 29.86 & 29.73 & &   \\ 

 % WMT'16 Ro->en
\midrule
\multirow{10}{*}{\rotatebox{90}{$\:\:\:$\wmtroen}} 
& \modelts & 29.33 & 28.87 & -1.84 & -1.93 \\
& \modeltb & \underline{32.19} & {31.15} & -1.79 & -2.28  \\
\cmidrule{2-6}
& \modelgb & 26.48 & 27.81 &  \mygeq -2.41 & \mygeq -2.92  \\
& \modelgl & 27.35 & \textbf{28.02} &  \mygeq -2.32 & \mygeq -3.01  \\
& \modelfs & 26.03 & 26.12 &  \mygeq -1.65 & \mygeq -2.05  \\
& \modelfb & 27.14 & 27.33 &  \mygeq \textbf{-1.64} & \mygeq -2.01  \\
\cmidrule{2-6}
& \modelfb$^{(*)}$ & 30.16 & 30.44 & &   \\
& \modelfl$^{(*)}$ & 30.69 & 30.72 & &   \\ 

% IWSLT'16 En->De
\midrule
\multirow{9}{*}{\rotatebox{90}{$\quad\quad\quad$IWSLT}} 
& \modelts & 31.54 & \underline{31.72} & -1.84 & -2.56  \\
\cmidrule{2-6}
& \modelgb & 24.36 & 26.80 &  \mygeq -1.98 & \mygeq -2.70  \\
& \modelfs & 23.64 & 26.69 &  \mygeq -1.66 & \mygeq -2.28  \\
& \modelfb & 24.89 & \textbf{27.00} &  \mygeq \textbf{-1.57} & \mygeq -2.46  \\
\cmidrule{2-6}
& \modelfb$^{(*)}$ & 24.75 & 27.75 & &   \\

\bottomrule
\end{tabular}
\caption{Test BLEU score and log-likelihood of each model. Raw: models trained on raw data. Dist.: models trained on distilled data. \modelts: Transformer-small. \modeltb: Transformer-base. \modeltl: Transformer-big. \modelgb: Gauss-base. \modelgl: Gauss-large. \modelfs: Flow-small. \modelfb: Flow-base. \modelfl: Flow-large.
We use beam search with width 4 for inference with autoregressive models, and one step of iterative inference~\citep{shu19latent} for latent variable models.
%We estimate the marginal log-likelihood for latent variable models by importance sampling with 1,000 samples from $\post$.
% \note{dt: rm details (in 4.4)?}
On most datasets, our Flow-base model gives comparable results to those from \citet{ma19flowseq}, which are denoted with ($*$).
We boldface the best log-likelihood overall and the best BLEU score among the latent variable models. We underline the best BLEU score among the autoregressive models.
}
\label{tab:mainresult}
\end{sc}
\vskip -0.20in
\end{table}

\begin{table}[!t]
\small
\centering
\begin{sc}
\begin{tabular}{lrrr} \toprule
 & \modeltb & \modelgb & \modelfb \\ \midrule
Raw & 0.926 & 0.831 & 0.678 \\
Dist. & -0.758 & -0.897 & -0.873 \\
\bottomrule
\end{tabular}
\caption{Pearson's correlation between log-likelihood and BLEU across the training checkpoints of Transformer-base, Gauss-base and Flow-base on {\wmtende}.}
\label{tab:corr}
\end{sc}
\end{table}

\iffalse
\begin{table}[!t]
\small
\centering
\vskip -0.1in
\caption{BLEU scores and log-likelihoods on out-of-distribution test sets. Models trained on {\wmtdeen} are evaluated on {\iwsltdeen}, and vice versa.}
\vskip 0.15in
\begin{sc}
\begin{tabular}{llrrrr} \toprule
 & & \multicolumn{2}{c}{BLEU ($\uparrow$)} & \multicolumn{2}{c}{LL ($\uparrow$)} \\ 
 & & \multicolumn{1}{c}{Raw} & \multicolumn{1}{c}{Dist.} & \multicolumn{1}{c}{Raw} & \multicolumn{1}{c}{Dist.} \\ 
 
\midrule
\multirow{8}{*}{\rotatebox{90}{\parbox{1.2cm}{\centering WMT'14 \\ $\rightarrow$IWSLT}}} 
    & \modelts & 29.15 & 28.40 & -1.65 & -2.25 \\
    & \modeltb & 32.29 & 31.75 & -1.42 & -2.12 \\
    & \modeltl & \underline{33.16} & 32.24 & -1.35 & -2.06 \\
\cmidrule{2-6}
& \modelgb & 24.26 & 28.77 & \mygeq -1.37 & \mygeq -2.10 \\
& \modelgl & 25.46 & \textbf{29.60} & \mygeq -1.28 & \mygeq -2.01 \\
& \modelfs & 24.35 & 26.79 & \mygeq -1.26 & \mygeq -1.76 \\
& \modelfb & 24.25 & 27.12 & \mygeq \textbf{-1.19} & \mygeq -1.73 \\

\midrule
\multirow{6}{*}{\rotatebox{90}{\parbox{0.8cm}{\centering IWSLT$\rightarrow$ \\ WMT'14}}} 
    & \modelts & 18.50 & \underline{18.94} & -2.79 & -3.41 \\
\cmidrule{2-6}
& \modelgb & 12.12 & 13.78 & \mygeq -3.10 & \mygeq -3.83 \\
& \modelfs & 11.78 & \textbf{14.35} & \mygeq -2.81 & \mygeq -3.22 \\
& \modelfb & 12.56 & 14.30 & \mygeq \textbf{-2.62} & \mygeq -3.43 \\

\bottomrule
\end{tabular}
\end{sc}
\label{tab:oodresult}
\end{table}
\fi

\subsection{Correlation between rankings of models}
\label{sec:mainresult}

Table~\ref{tab:mainresult} compares three model families (Transformer, Gauss, Flow) on five language pairs in terms of generation quality (BLEU) and log-likelihood (LL). We present two sets of results: one from models trained on raw data (Raw) and another from models trained on distilled data (Dist.), which we discuss mostly in \cref{sec:dist}. 
We compute the log-likelihood and BLEU scores of the distilled models on the original test set, so the results are comparable with those of the undistilled models.
We make two main observations:

\begin{enumerate}[leftmargin=*]
    \vskip -0.1in
    \setlength\itemsep{0.40em}
    \setlength\parskip{0.0em}
    \item Log-likelihood is highly correlated with BLEU when considering models within the same family.
    \begin{enumerate}[
      align=left,
      leftmargin=1.0em,
      itemindent=0pt,
      labelsep=0pt,
      labelwidth=2em
    ]
    \setlength\itemsep{0.40em}
    \setlength\parskip{0.0em}
        \item Among autoregressive models (\modelts, {\modeltb} and {\modeltl}), there is a perfect correlation between log-likelihood and BLEU. On all five language pairs (undistilled), the rankings of autoregressive models based on log-likelihood and BLEU are identical.
        \item Among non-autoregressive latent variable models with the same prior distribution, there is a strong but not perfect correlation. 
        Between Gauss-large and Gauss-base, the model with higher held-out log-likelihood also gives higher BLEU on four out of five datasets.
        Similarly, Flow-base gives higher log-likelihood and BLEU score than Flow-small on all datasets except {\wmtdeen}.
    \end{enumerate}
    
    \item Log-likelihood is not correlated with BLEU when comparing models from different families.
    \begin{enumerate}[
      align=left,
      leftmargin=1.0em,
      itemindent=0pt,
      labelsep=0pt,
      labelwidth=2em
    ]
    \setlength\itemsep{0.40em}
    \setlength\parskip{0.0em}
        \item Between latent variable models with different prior distributions, we observe no correlation between log-likelihood and BLEU.
        %Between the Gaussian prior and flow prior models, there is no correlation between log-likelihood and BLEU. 
On four out of five language pairs (undistilled), Flow-base gives much higher log-likelihood but a similar or worse BLEU score than Gauss-base. 
        %Only on {\wmtende}, Flow-base is 1 BLEU score higher.
        %worse BLEU than Gauss-large. On {\iwsltdeen}, they give similar BLEU scores.
        With distillation, Gauss-large considerably outperforms Flow-base in BLEU on all datasets, while Flow-base gives better log-likelihood.
        %Increasing the capacity of the flow prior model does not lead to better generation quality than the Gaussian prior model, as Gauss-large (95M parameters) gives higher BLEU score than Flow-large (258M parameters). 
        \item Overall, autoregressive models offer the best translation quality but not the best modeling performance. In fact, the Flow-base model, with a non-autoregressive decoder, gives the highest held-out log-likelihood on all datasets. 
    \end{enumerate}
    %To check if the flow prior model gives higher log probability than the autoregressive model \emph{for every sentence}, we compute the log probability of every test sentence under Transformer-base and Flow-base, and run a Wilcoxon signed rank test~\citep{wilcoxon45individual}. With the threshold of $p < 1\mathrm{e}{-3}$, the flow prior model is found to give significantly higher log probability than the autoregressive model.
    \vskip -0.1in
\end{enumerate}

\paragraph{Correlation between log-likelihood and BLEU across checkpoints} Table~\ref{tab:corr} presents the correlation between log-likelihood and BLEU across the training checkpoints of several models. 
The findings are similar to Table~\ref{tab:mainresult}: for Transformer-base, there is almost perfect correlation ($0.926$) across the checkpoints. 
For Gauss-base and Flow-base, we observe strong but not perfect correlation ($0.831$ and $0.678$).
Overall, these findings suggest that there is a high correlation between log-likelihood and BLEU when comparing models within the same family. We discuss the correlation for models trained with distillation below in \cref{sec:dist}.
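The checkpoint-level statistic in Table~\ref{tab:corr} is the standard Pearson sample correlation; as a minimal sketch (function name is ours), given per-checkpoint log-likelihoods and BLEU scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences,
    e.g. per-checkpoint log-likelihoods and BLEU scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```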

\iffalse
\paragraph{Out-of-distribution experiments} We run additional experiments to validate our findings on data outside the training distribution. Using {\wmtdeen} (which is collected from news commentary and parliament proceedings) and {\iwsltdeen} (a collection of transcriptions of TED talks), we evaluate models that are trained on one dataset on the other's test set. The results, presented in Table~\ref{tab:oodresult}, are consistent with the in-distribution data.
Within the same model family (autoregressive or latent variable with the same prior distribution), the correlation between log-likelihood and BLEU is high.
Across different families of models, however, we again find no correlation. While the flow prior models are the best density estimators overall (even better than the autoregressive models), their translation quality is the poorest.
%(1) perfect correlation among autoregressive models, (2) flow prior models give higher log-likelihood but worse BLEU scores than the Gaussian prior, (3) among latent variable models of the same prior distribution, the correlation between log-likelihood and BLEU is strong but not perfect, and (4) the autoregressive models give the best BLEU scores overall, but the flow prior models give the higher log-likelihood.
These findings show that the correlation between log-likelihood and BLEU varies significantly depending on the range of model families being compared, on both in-domain and out-of-domain data.
%These findings show that the correlation between rankings of models with respect to log-likelihood and BLEU varies significantly across different density estimators and decoders.
\fi

\subsection{Knowledge Distillation}
\label{sec:dist}
In Table~\ref{tab:corr}, we observe a strong negative correlation between log-likelihood and BLEU across the training checkpoints of several density estimators trained with distillation.
Indeed, distillation severely hurts density estimation performance on all datasets
%(see Tables~\ref{tab:mainresult} and \ref{tab:oodresult}).
(see Table~\ref{tab:mainresult}).
In terms of generation quality, it consistently improves non-autoregressive models, yet the amount of improvement varies across models and datasets. 
On {\wmtende} and {\wmtdeen}, distillation gives a significant 7--9 BLEU increase for diagonal Gaussian prior models, but the improvement is smaller on the other datasets. Flow prior models benefit less from distillation, gaining only 3--4 BLEU points on {\wmtendeboth} and less on the other datasets.
%, however it significantly improves generation quality for non-autoregressive models. 
For autoregressive models, distillation results in a slight decrease in generation performance.

\iffalse
\begin{figure}[!t]
  \centering
  \begin{minipage}[b]{0.23\textwidth}
    \includegraphics[width=\textwidth]{fig/bleu.png}
  \end{minipage}
  %\hfill
  \begin{minipage}[b]{0.23\textwidth}
    \includegraphics[width=\textwidth]{fig/ll.png}
  \end{minipage}
\caption{Test BLEU and ELBO curves of Flow-Base on (1) raw and (2) distilled {\wmtenro}. For the distilled model, test ELBO keeps decreasing but the BLEU score improves.}
\vskip -0.1in
\end{figure}
\fi

\iffalse
\begin{figure}[!t]
\vskip -0.05in
\begin{center}
\centerline{\includegraphics[width=0.90\columnwidth]{fig/logmarginal.png}}
\caption{Histogram of differences of log marginal likelihood of all test sentences between the prior model and the autoregressive model.}
\label{fig:logmarginal}
\end{center}
\vskip -0.40in
\end{figure}
\fi

\subsection{Iterative inference on Gaussian vs. flow prior}
We analyze the effect of iterative inference on the Gaussian and flow prior models. Table~\ref{tab:refinement} shows that iterative refinement improves BLEU and ELBO for both the Gaussian prior and flow prior models, but the gain is smaller for the flow prior model.
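The iterative inference procedure alternates between decoding under the current latent and moving the latent to the approximate-posterior mean of the decoded output. A minimal sketch under our reading of \citet{shu19latent} (the `decode` and `posterior_mean` callables stand in for the trained decoder and inference networks; the interface is ours):

```python
def iterative_inference(prior_mean, decode, posterior_mean, steps):
    """Deterministic iterative inference with a delta posterior: start from the
    prior mean, then alternately (i) decode the most likely target given the
    current latent and (ii) move the latent to the approximate-posterior mean
    of that decoded target."""
    z = prior_mean
    y = decode(z)
    for _ in range(steps):
        z = posterior_mean(y)  # delta posterior collapses q to a point
        y = decode(z)          # re-decode under the refined latent
    return y
```

With `steps=0` this reduces to decoding directly from the prior mean, roughly corresponding to the $k{=}0$ column of Table~\ref{tab:refinement}.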

\begin{table}[!h]
\small
\centering
\begin{sc}
\begin{tabular}{llrrrr} \toprule
& & \multicolumn{4}{c}{Number of refinement steps} \\
& & \multicolumn{1}{c}{0} & \multicolumn{1}{c}{1} & \multicolumn{1}{c}{2} & \multicolumn{1}{c}{4} \\ \midrule
\multirow{2}{*}{{\footnotesize BLEU}}
& Ga-B & 22.88 & 24.36 & 24.60 & 24.69 \\
& Fl-B & 24.57 & 24.89 & 24.81 & 24.92 \\ \midrule

\multirow{2}{*}{{\footnotesize ELBO}}
& Ga-B & -1.11 & -0.93 & -0.90 & -0.89 \\
& Fl-B & -1.22 & -1.17 & -1.16 & -1.15 \\ \bottomrule
\end{tabular}
\end{sc}
\caption{Iterative inference with a delta posterior improves BLEU and ELBO for Gauss-base and Flow-base on {\iwsltdeen} (without distillation).}
\label{tab:refinement}
\end{table}

\iffalse
\paragraph{Iterative refinement improves generation quality at the cost of diversity} On the subset of {\wmtende} test set that contains 10 German references for each English sentence~\citep{ott18analyzing} we decode 10 distinct candidates from each model and compute the overall quality (BLEU) and diversity (pairwise BLEU) (see Figure~\ref{fig:diversity}). Ground-truth references, having both high quality and diversity, are in the top left. Beam candidates from an autoregressive model are of high quality but not diverse (top right). 
For the latent variable models, we compute the quality and diversity of the output after performing $k$ steps of iterative inference (where $k\in\{0,1,2,4,8\}$ and 0 indicates no refinement).
%\sidenote{dt: also add observation that distillation  hurts diversity}
The first observation is that distillation significantly improves the quality but reduces diversity (towards top right) for non-autoregressive latent variable models. 
Iterative refinement has a similar effect for Gaussian prior models, improving quality at the cost of diversity (towards top right). For flow prior models, however, it leads to little improvement in quality and a drop in diversity (towards right).

\begin{figure}[!t]
\begin{center}
\centerline{\includegraphics[width=0.80\columnwidth]{fig/diversity.png}}
%\vskip 0.1in
%\includegraphics[width=0.85\columnwidth]{fig/diversity.png}
\caption{Quality vs. diversity analysis. we plot overall BLEU (higher means better quality overall) and pairwise-BLEU (lower means more diverse). 
For Gaussian prior and flow prior models, we perform $k$ steps of iterative inference ($k \in \{0,1,2,4,8\}$).
}
\label{fig:diversity}
\end{center}
\vskip -0.20in
\end{figure}
\fi

\begin{figure}[!t]
  \centering
  \begin{minipage}[b]{0.35\textwidth}
    \includegraphics[width=\textwidth]{fig/gauss.png}
  \end{minipage}
  %\hfill
  \begin{minipage}[b]{0.35\textwidth}
    \includegraphics[width=\textwidth]{fig/flow.png}
  \end{minipage}
%\vskip 0.1in
\caption{Visualization of the latent space with 1K samples from the prior ({\color{htmlgreen}green plus sign}), the approximate posterior ({\color{blue}blue circle}) and the delta posterior ({\color{red}red cross}) of Gauss-base (top) and Flow-small (bottom) on a {\iwsltdeen} test example.}
\label{fig:vis}
\vskip -0.1in
\end{figure}

\begin{table*}[!t]
\small
\centering
\begin{sc}
\begin{tabular}{lrrrrrrrrrrrr} \toprule

& \multicolumn{5}{c}{BLEU} & & \multicolumn{5}{c}{Speed} & Size  \\
\cmidrule{2-6} \cmidrule{8-12}
$k=$ & \multicolumn{1}{c}{0} & \multicolumn{1}{c}{1} & \multicolumn{1}{c}{2} & \multicolumn{1}{c}{4} & \multicolumn{1}{c}{8} & & \multicolumn{1}{c}{0} & \multicolumn{1}{c}{1} & \multicolumn{1}{c}{2} & \multicolumn{1}{c}{4} & \multicolumn{1}{c}{8} & \\ \midrule

\modelts & 24.54 &  & & & & & 2.69 & & & & & 17M \\
\modeltb & 28.18 &  & & & & & 2.58 & & & & & 60M \\
\modeltl & 29.39 &  & & & & & 1.93 & & & & & 208M \\ \midrule
\modelgb & 23.15 & 24.54 & 24.87 & 24.94 & 24.92 & & 28.77 & 20.52 & 16.51 & 12.00 & 8.11 & 75M \\
\modelgl & 24.31 & 25.53 & 25.69 & 25.68 & 25.68 & & 19.83 & 14.72 & 10.25 & 7.88 & 4.91 & 95M \\ 
\modelfb & 21.57 & 21.82 & 21.79 & 21.81 & 21.80 & & 5.82 & 5.60 & 4.84 & 3.60 & 3.37 & 75M \\
\modelfl$^{(*)}$ & 23.72 & & & & & & & & & & & 258M \\
\bottomrule

\end{tabular}
\end{sc}
\caption{BLEU score, generation speed and size of various models on {\wmtende} test set. We measure generation speed in sentence/s on a single V100 GPU with batch size 1. 
We perform inference of autoregressive models using beam search with width 4.
For latent variable models, we perform $k$ steps of iterative inference~\citep{shu19latent} (where $k\in\{0,1,2,4,8\}$) and report results from models trained with distillation.
%\note{dt: does "distillation" here mean "trained with distillation"?}
$(*)$ results are from \citet{ma19flowseq}.}
\label{tab:speed}
\end{table*}

\paragraph{Visualization of latent space}
In Figure~\ref{fig:vis}, we visualize the latent space of the prior, the approximate posterior and the delta posterior of the latent variable models using t-SNE~\citep{maaten14accelerating}. 
It is clear from the figures that the delta posterior of Gauss-base has high overlap with the approximate posterior, while the overlap is relatively low for Flow-small.
%while the delta-posterior of Flow-small has relatively low overlap with the approximate posterior.
We conjecture that while the ELBO loss surface contains many local optima reachable via iterative refinement, not all of them lie within the support of the approximate posterior density (and hence correspond to data). This is particularly pronounced for the flow prior model.
%As iterative inference improves ELBO for both models, we conclude that the loss surface of ELBO has many local optima (red points in Figure~\ref{fig:vis}), but not all of them correspond to data (blue points). For the flow prior model, ELBO contains a lot of local minima that do not correspond well to data. 
%Might be better optimization algorithm for the flow prior, but leave it as out of scope.
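The visualization embeds all three sample sets into a single 2-D map so that their overlap is comparable. A minimal sketch using scikit-learn's t-SNE (the function and its interface are ours, and the hyperparameters here are illustrative, not those used for Figure~\ref{fig:vis}):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_latents(prior_z, posterior_z, delta_z, seed=0):
    """Jointly embed samples from the prior, the approximate posterior and the
    delta posterior into 2-D with t-SNE, so the three sets share one map.
    Returns the 2-D embeddings and a label (0/1/2) marking each point's set."""
    latents = np.concatenate([prior_z, posterior_z, delta_z], axis=0)
    labels = np.array([0] * len(prior_z)
                      + [1] * len(posterior_z)
                      + [2] * len(delta_z))
    tsne = TSNE(n_components=2, perplexity=5.0, random_state=seed)
    return tsne.fit_transform(latents), labels
```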

\subsection{Generation speed and model size}
We compare performance, generation speed and size of various models in Table~\ref{tab:speed}. 
While autoregressive models offer the best translation quality, inference is inherently sequential and slow.
Decoding from non-autoregressive latent variable models is much more efficient, and requires constant time with respect to sequence length given parallel computation. Compared to Transformer-base, Gauss-large with 1 step of iterative inference improves generation speed by 6x, at the cost of 2.6 BLEU. On {\wmtdeen}, the performance degradation is 1.9 BLEU.
%Compared to Transformer-base, Gauss-large with 1 refinement step improves generation speed by 6x at the cost of 2.65 BLEU. 
Flow prior models perform much worse than the Gaussian prior models despite having more parameters and slower generation speed.
%Flow prior models give worse performance than the Gaussian prior models, while being much slower to decode (due to their depth).

\section{Related Work}
\label{related}

%For generative models of images, \citet{theis16note,grover18flowgan} found that log-likelihood is uninformative about the visual quality of the samples.
%In addition, recent work~\citep{nalisnick19do,fetaya19conditional} showed that density estimators can assign higher likelihood to out-of-distribution data than their training data. These findings agree with our observations in machine translation.

For sequence generation, the gap between log-likelihood and the downstream metric has long been recognized. To address this discrepancy between density estimation and approximate inference (generation), there have largely been two lines of prior work: (1) structured perceptron training for conditional random fields~\citep{lafferty01conditional,collins02discriminative,liang06end} and (2) empirical risk minimization with approximate inference~\citep{valtchev97mmie,povey02minimum,och03minimum,fu07automatic,stoyanov11empirical,hopkins11tuning,shen16minimum}. More recent work proposed to train neural sequence models directly on task-specific losses using reinforcement learning~\citep{ranzato16sequence,bahdanau17actor,jaques17sequence} or adversarial training~\citep{goyal16professor}.

%There is a long line of research on discriminative training methods that directly minimize the task-specific error, from early 2000s to recent approaches using neural networks. 
%To list a few, discriminative training methods have been proposed for machine translation~\citep{och03minimum,hopkins11tuning,shen16minimum,bahdanau17actor}, automatic speech recognition~\citep{valtchev97mmie,povey02minimum,fu07automatic,sabour19optimal} and traditional structured prediction tasks~\citep{collins02discriminative}. 

Despite this plethora of work on bridging the gap between log-likelihood and the downstream task, the exact correlation between the two has not been well established. 
Our work investigates the correlation for neural sequence models (autoregressive models and latent variable models) in machine translation.
%This work conducts a study of neural sequence models in machine translation, and finds that the correlation between log-likelihood and BLEU varies depending on the range of model families being considered.
%Our work quantifies the correlation between log-likelihood and BLEU for several neural sequence models on machine translation.
Among autoregressive models for open-domain dialogue, concurrent work~\citep{adiwardana20towards} found a strong correlation between perplexity and a human evaluation metric that rewards sensibleness and specificity. 
This corroborates part of our finding: log-likelihood is highly correlated with the downstream metric when models within the same family are compared.

%For sequence generation, several previous work acknowledged the discrepancy between the training objective (log-likelihood) and the task-specific evaluation objective. 
%They proposed methods to either incorporate or directly train on downstream losses from various sequence generation tasks including machine translation~\citep{shen16minimum,ranzato16sequence,bahdanau17actor}, music synthesis~\citep{goyal16professor,jaques17sequence} and speech recognition~\citep{sabour19optimal}.
%To our best knowledge, however, the correlation between log-likelihood and the downstream metric was not well studied for any sequence generation task.

Our work is inspired by recent work on latent variable models for non-autoregressive neural machine translation~\citep{gu18non,lee18deterministic,kaiser18fast}. 
Specifically, we compare continuous latent variable models with a diagonal Gaussian prior~\citep{shu19latent} and a normalizing flow prior~\citep{ma19flowseq}. We find that while having an expressive prior is beneficial for density estimation, a simple prior delivers better generation quality while being smaller and faster.
%Specifically, our work compares the density estimation performance and translation quality between continuous latent variable models with a simple diagonal Gaussian prior~\citep{shu19latent} and a flexible normalizing flow prior~\citep{ma19flowseq}.

%\paragraph{Iterative inference}

%Recently, \citet{hoffman16elbo} decomposed the the variational evidence lower bound objective (ELBO) that contains the term $\text{KL}[q(z)||p(z)]$ (marginal KL), and argued that a prior distribution that is too simple (e.g. a standard gaussian) may be unable to match the aggregate posterior. Indeed, \citet{rosca18distribution} showed that the marginal KL term is non-zero for flexible posterior distribution (such as RealNVP~\citep{dinh17density}).

%Recently, \citet{ma19flowseq} proposed to use a flow prior in latent variable machine translation. Also, \citet{shu19latent} showed that a simple diagonal gaussian prior, combined with iterative refinement, can yield competitive results. This work is inspired by these two recent results.
%Meanwhile, recent work in latent variable Machine Translation have also applied expressive prior distributions (e.g. mixture of Gaussians~\citep{shen19mixture} and normalizing flow~\citep{ma19flowseq,przystupa19investigating}) to capture a complex distribution of target sentence given the source sentence. Our work is thus concerned with the following question: ``Do we need complex priors for Machine Translation?''.
%Several recent work have discovered that the prior distribution plays a critical role in balancing the variational posterior and the decoder, and a standard normal distribution may be too rigid for the aggregate posterior to match~\citep{hoffman16elbo,rosca18distribution,xu19necessity}. Indeed, follow-up work found that more flexible prior distribution outperform simple priors on several density estimation tasks and argued for a choice of prior that reflect the high level factors of variation in data~\citep{bauer19resampled,tomczak18vae,vikram19loracs}.

%Recent work discovered that an overly simplistic prior distribution such as standard normal can be difficult for the aggregate posterior to match, thereby hindering learning~\citep{hoffman16elbo,rosca18distribution}. Indeed, flexible prior distributions were found to outperform simple priors on several density estimation tasks~\citep{bauer19resampled,tomczak18vae,vikram19loracs}. Therefore, we compare a flexible prior (normalizing flow) with a simple one (diagonal Gaussian), while fixing the posterior distribution as diagonal Gausian, on density estimation and sequence generation.

%As normalizing flows construct continuous distributions of continuous random variables, they are not directly applicable to discrete sequences such as text. While \citet{tran19discrete,hoogeboom19integer} showed that discrete normalizing flows can give competitive performance on character-level language modeling and image compression, bias from straight-through gradient estimators hinders scalability in terms of flow depth and the number of classes. As this reduces their chance of success in large-scale sequence modeling tasks such as machine translation, we do not consider discrete flow models in our analysis and leave it as future work. Instead, we follow \citet{ziegler19latent,ma19flowseq} and incorporate continuous flows as a prior distribution in a VAE.

\section{Conclusion}
\label{sec:conclusion}
%\sidenote{dt: is this copy-pasted from intro? i think we acn rm the openers and just focus on highlights/contributions}
%Many sequence-to-sequence generation tasks can be naturally framed as conditional density estimation, where models are trained to maximize the conditional log-likelihood $p(y|x)$ on the training data, and evaluated on the log-likelihood on the held out test set. 
%However, sequence generation tasks require finding the best output $\hat{y}$ given an input $x$ at test time, which is evaluated on a task-specific metric against a set of ground truth references $y^*$: $R(\hat{y},y^*|x).$ This calls into question how well the log-likelihood is correlated with the downstream metric.
In this work, we investigate the correlation between log-likelihood and the downstream evaluation metric for machine translation.
We train several autoregressive models and latent variable models on five language pairs from three machine translation datasets ({\wmtendeboth}, {\wmtenroboth} and {\iwsltdeen}), and find that the correlation between log-likelihood and BLEU changes drastically depending on the range of model families being compared:
Among the models within the same family, log-likelihood is highly correlated with BLEU.
Between models of different families, however, we observe no correlation:
%Between latent variable models with different priors, however, we observe no correlation: 
the flow prior model gives higher held-out log-likelihood but similar or worse BLEU score than the Gaussian prior model.
Furthermore, autoregressive models give the highest BLEU scores overall but the latent variable model with a flow prior gives the highest test log-likelihoods on all datasets.

In the future, we will investigate the factors behind this discrepancy. One possibility is the inherent difficulty of inference for latent variable models, which might be resolved by designing better inference algorithms. We will also explore whether the discrepancy stems mainly from the difference in the decoding distribution (autoregressive vs. factorized) or in the training objective (maximum likelihood vs. ELBO).

%For future work, we plan to investigate the factors behind this discrepancy between the log-likelihood and downstream metric, which we conjecture is due to the inherent difficulty of inference in latent variable models, which could be resolved by designing better inference algorithms.
%For future work, we plan to investigate the correlation on other sequence generation tasks.

% In the unusual situation where you want a paper to appear in the
% references without citing it in the main text, use \nocite

% Acknowledgements should only appear in the accepted version.

\section*{Acknowledgements}
We thank our colleagues at the Google Translate and Brain teams, particularly Durk Kingma, Yu Zhang, Yuan Cao and Julia Kreutzer for their feedback on the draft. JL thanks Chunting Zhou, Manoj Kumar and William Chan for helpful discussions. 

KC is supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI), Samsung Research (Improving Deep Learning using Latent Structure) and NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. KC thanks CIFAR, eBay, Naver and NVIDIA for their support.



\bibliography{anthology,emnlp2020}
\bibliographystyle{acl_natbib}

\end{document}