%
% File acl2021-camera-ready.tex
%
%% Based on the style files for EMNLP 2020, which were
%% Based on the style files for ACL 2020, which were
%% Based on the style files for ACL 2018, NAACL 2018/19, which were
%% Based on the style files for ACL-2015, with some improvements
%%  taken from the NAACL-2016 style
%% Based on the style files for ACL-2014, which were, in turn,
%% based on ACL-2013, ACL-2012, ACL-2011, ACL-2010, ACL-IJCNLP-2009,
%% EACL-2009, IJCNLP-2008...
%% Based on the style files for EACL 2006 by 
%%e.agirre@ehu.es or Sergi.Balari@uab.es
%% and that of ACL 08 by Joakim Nivre and Noah Smith

\documentclass[11pt,a4paper]{article}
\usepackage[hyperref]{acl2021}
\usepackage{fontspec}
\usepackage{latexsym}
\renewcommand{\UrlFont}{\ttfamily\small}
\usepackage[utf8]{inputenc}
\usepackage{enumitem}

\setmainfont{Charis SIL}

% This is not strictly necessary, and may be commented out,
% but it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}

\aclfinalcopy % Uncomment this line for the final submission
\def\aclpaperid{***} %  Enter the acl Paper ID here

%\setlength\titlebox{5cm}
% You can expand the titlebox if you need extra space
% to show all the authors. Please do not make the titlebox
% smaller than 5cm (the original size); we will check this
% in the camera-ready version and ask you to change it back.

\newcommand\BibTeX{B\textsc{ib}\TeX}

\title{Avengers, Ensemble! Benefits of ensembling in grapheme-to-phoneme prediction}

\author{Vasundhara Gautam, Wang Yau Li, Zafarullah Mahmood, Fred Mailhot\thanks{Corresponding author. Contributing authors are listed alphabetically.},\\
        \textbf{Shreekantha Nadig$^\dagger$, Riqiang Wang, Nathan Zhang} \\ 
        \\
  Dialpad Canada $^\dagger$Dialpad India \\
  \texttt{\{vasundhara,wangyau.li,zafar,fred.mailhot,}\\
  \texttt{shree,riqiang.wang,nzhang\}@dialpad.com}}

\date{}

\begin{document}
\maketitle
\begin{abstract}
We describe three baseline-beating systems for the high-resource 
English-only sub-task of SIGMORPHON 2021 Shared Task 1: a small ensemble that Dialpad's\footnote{https://www.dialpad.com/} speech recognition team uses internally, a well-known off-the-shelf model,
and a larger ensemble model comprising these and others. We additionally discuss
the challenges related to the provided data, along with the processing
steps we took.
\end{abstract}


\section{Introduction}

The transduction of sequences of \textit{graphemes} to \textit{phones} or \textit{phonemes},\footnote{We use these terms interchangeably here to refer to graphical representations of minimal speech sounds, remaining agnostic as to their theoretical or ontological status.} that is, from characters used in orthographic representations to characters used to represent minimal units of speech, is a core component of many tasks in speech science \& technology. This \textit{grapheme-to-phoneme} conversion (or \textit{g2p}) may be used, e.g., to automate or scale the creation of digital lexicons or pronunciation dictionaries, which are crucial to FST-based approaches to automatic speech recognition (ASR) and synthesis \citep{mohri2002wfst}. 

The SIGMORPHON 2021 Workshop included a Shared Task on g2p conversion, comprising 3 sub-tasks.\footnote{https://github.com/sigmorphon/2021-task1} The low- and medium-resource tasks were multilingual, while the high-resource task was English-only. This paper provides an overview of the three baseline-beating systems submitted by the Dialpad team for the high-resource sub-task, along with discussion of the challenges posed by the data that was provided.


\section{Sub-task 1: high-resource, English-only}

The organizers provided 41,680 lines of data in total; 33,344 for training, and 4,168 each for development and test. The data consists of word/pronunciation pairs (\textit{word-pron pairs}, henceforth), where words are sequences of graphemes and pronunciations are sequences of characters from the International Phonetic Alphabet \citep{IPA:99}. The data was derived from the English portion of the WikiPron database \citep{lee2020wikipron}, a massively multilingual resource of word-pron pairs extracted from Wiktionary\footnote{https://en.wiktionary.org/} and subject to some manual QA and post-processing.\footnote{See https://github.com/sigmorphon/2021-task1 for fuller details on data formatting and processing.}

The baseline model provided was the 2nd place finisher from the 2020 g2p shared task \citep{gorman-etal-2020-sigmorphon}. It is an ensembled neural transition model that operates over edit actions and is trained via imitation learning \citep{makarov-clematide-2020-cluzh}.

Evaluation scripts were provided to compute \textit{word error rate} (WER), the percentage of words for which the output sequence does not match the gold label.
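The metric itself is straightforward; a minimal sketch (not the organizers' actual evaluation script) is:

```python
def wer(preds, golds):
    """Word error rate: percentage of words whose predicted
    pronunciation does not exactly match the gold label."""
    assert len(preds) == len(golds)
    errors = sum(p != g for p, g in zip(preds, golds))
    return 100.0 * errors / len(golds)

# One mismatch out of four word-pron pairs -> 25.0
print(wer(["k æ t", "d ɔ g", "b ɝ d", "f ɪ ʃ"],
          ["k æ t", "d ɔ g", "b ɝ d", "f i ʃ"]))
```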

Notwithstanding the baseline's strong prior performance and the amount of data available, the task proved to be challenging; the baseline system achieved development and test set WERs of \textbf{45.13} and \textbf{41.94}, respectively. We discuss possible reasons for this below.

\subsection{Data-related challenges}\label{data-related-challenges}

Wiktionary is an open, collaborative, public effort to create a free dictionary in multiple languages. Anyone can create an account and add or amend words, pronunciations, etymological information, etc. As with most user-generated content, this is a noisy method of data creation and annotation.

Even setting aside the theory-laden question of when or whether a given word should be counted as English,\footnote{E.g., the training data included the arguably French word-pronunciation pair: \textit{embonpoint} /ɑ̃ b ɔ̃ p w ɛ̃/} the open nature of Wiktionary means that speakers of different variants or dialects of English may submit varying or conflicting pronunciations for sets of words. For example, some transcriptions indicate that the users who input them had the \textit{cot/caught} merger while others did not; in the training data ``cot'' is transcribed /k ɑ t/ and ``caught'' is transcribed /k ɔ t/, indicating a split, but ``aughts'' is transcribed as /ɑ t s/, indicating merger. There is also variation in the narrowness of transcription; for example, some transcriptions include aspiration on stressed-syllable-initial stops while others do not, cf. ``kill'' /kʰ ɪ l/ and ``killer'' /k ɪ l ɚ/.

The set of English phonemes is typically taken to number between 38 and 45, depending on variant/dialect \citep{mcmahon2002engphon}. In exploring the training data, we found a total of 124 symbols in the training set transcriptions, many of which appeared in only a small number (1--5) of transcriptions. To reduce the effect of this long tail of infrequent symbols, we normalized the training set.

The main source of symbols in the long tail was the variation in the broadness of transcription---vowels were sometimes but not always transcribed with nasalization before a nasal consonant, aspiration on word-initial voiceless stops was inconsistently indicated, phonetic length was occasionally indicated, etc. There were also some cases of erroneous transcription that we uncovered by looking at the lowest frequency phones and the word-pronunciation pairs where they appeared. For instance, the IPA /j/ was transcribed as /y/ twice, the voiced alveolar approximant /ɹ/ was mistranscribed as the trill /r/ over 200 times, and we found a handful of issues where a phone was transcribed with a Unicode symbol not used in the IPA at all.
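Concretely, the normalization amounted to a symbol-mapping pass over transcriptions. The sketch below uses a hypothetical mapping table with a few illustrative entries, not our full table:

```python
# Hypothetical normalization map: long-tail variants -> common symbols.
# Entries are illustrative; the actual table covered many more cases.
NORMALIZE = {
    "r": "ɹ",    # alveolar trill mistranscribed for the approximant
    "y": "j",    # non-IPA symbol for the palatal glide
    "kʰ": "k",   # strip aspiration on voiceless stops
    "ɑ̃": "ɑ",    # strip pre-nasal vowel nasalization
    "ɑː": "ɑ",   # drop phonetic length marks
}

def normalize_pron(pron):
    """Map each space-separated phone onto its common variant."""
    return " ".join(NORMALIZE.get(ph, ph) for ph in pron.split())

print(normalize_pron("kʰ ɪ l"))  # -> "k ɪ l"
```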

Most of these were cases where the rare variant was at least two orders of magnitude less frequent than the common variant of the symbol. There was, however, one class of sounds where the variation was less dramatically skewed; the consonants /m/, /n/, and /l/ appeared in unstressed syllables following schwa (/əm/, /ən/, /əl/) roughly one order of magnitude more frequently than their syllabic counterparts (/m̩/, /n̩/, /l̩/),  and we opted not to normalize these. If we had normalized the syllabic variants, it would have resulted in more consistent g2p output but it would likely also have penalized our performance on the uncleaned test set.\footnote{Although the possibility also exists that one or more of our models would have found and exploited contextual cues that weren't obvious to us by inspection.} In the end, our training data contained 47 phones (plus end-of-sequence and UNK symbols for some models).

\section{Models}

We trained and evaluated several models for this task, including publicly available, in-house, and custom-developed systems, along with various ensembling permutations. In the end, we submitted three sets of baseline-beating results. The organizers assigned sequential identifiers to multiple submissions (e.g., \textit{Dialpad-N}); we include these in the discussion of our entries below for ease of subsequent reference.

\subsection{The Dialpad model (Dialpad-2)}

Dialpad uses a g2p system internally for scalable generation of novel lexicon additions. We were motivated to enter this shared task as a means of assessing potential areas of improvement for our system; in order to do so we needed to assess our own performance as a baseline.

This model is a simple majority-vote ensemble of 3 existing publicly available g2p systems: \textit{Phonetisaurus} \citep{novak2012phonetisaurus}, a WFST-based model, \textit{Sequitur} \citep{bisani2008sequitur}, a joint-sequence model trained via EM, and a neural sequence-to-sequence model developed at CMU as part of the CMUSphinx\footnote{https://cmusphinx.github.io} toolkit (see subsection \ref{cmu}). As Dialpad uses a proprietary lexicon and phoneset internally, we retrained all three models on the cleaned version of the shared task training data, retaining default hyperparameters and architectures.

In the end, this ensemble achieved a test set WER of \textbf{41.72}, narrowly beating the baseline (results are discussed in more depth in Section \ref{results}).

\subsection{A strong standalone model: CMUSphinx \texttt{g2p-seq2seq} (Dialpad-3)}\label{cmu}

CMUSphinx is a set of open systems and tools for speech science developed at Carnegie Mellon University, including a g2p system.\footnote{https://github.com/cmusphinx/g2p-seq2seq} It is a neural sequence-to-sequence model \citep{sutskever2014seq2seq} that is Transformer-based \citep{vaswani2017attn}, written in Tensorflow \citep{tensorflow2015whitepaper}. A pre-trained 3-layer model is available for download, but it is trained on a dictionary that uses ARPABET, a substantially different phoneset from the IPA used in this challenge. For this reason we re-trained a model from scratch on the cleaned version of the training data.

This model achieved a test set WER of \textbf{41.58}, again narrowly beating the baseline. Interestingly, it outperformed the Dialpad model that incorporates it, suggesting that Phonetisaurus and Sequitur add more noise than signal to the predicted outputs, while also increasing computational cost and training time. More generally, this points to the CMUSphinx seq2seq model as a simple and strong baseline against which future g2p research should be assessed. 

\subsection{A large ensemble (Dialpad-1)}

In the interest of seeing what results could be achieved via further naive ensembling, our final submission was a large ensemble, comprising two variations on the baseline model, the Dialpad-2 ensemble discussed above, and two additional seq2seq models, one using LSTMs and the other Transformer-based. The latter additionally incorporated a sub-word extraction method designed to bias a model's input-output mapping toward ``good'' grapheme-phoneme correspondences.

The method of ensembling for this model is word-level majority voting: we select the most common prediction when there is a strict majority (i.e., one prediction has more votes than any other). If there is a tie, we pick the prediction generated by the best standalone model, as measured by each model's performance on the development set.
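A minimal sketch of this voting scheme (model names and predictions are hypothetical):

```python
from collections import Counter

def ensemble_vote(predictions, model_ranking):
    """Word-level majority vote over component model outputs.

    predictions: {model_name: predicted_pron} for one input word.
    model_ranking: model names sorted by dev-set WER (best first),
    consulted only to break ties.
    """
    counts = Counter(predictions.values())
    top = max(counts.values())
    winners = [p for p, c in counts.items() if c == top]
    if len(winners) == 1:          # strict majority
        return winners[0]
    for model in model_ranking:    # tie: defer to best standalone model
        if predictions[model] in winners:
            return predictions[model]

preds = {"m1": "k æ t", "m2": "k æ t", "m3": "k ɑ t"}
print(ensemble_vote(preds, ["m3", "m1", "m2"]))  # -> "k æ t"
```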

This collection of models achieved a test set WER of \textbf{37.43}, a 10.75\% relative reduction in WER over the baseline model. As shown in Table~\ref{table:wer-results}, although a majority of the component models did not outperform the baseline, there was sufficient agreement across different examples that a simple majority voting scheme was able to leverage the models' varying strengths effectively. We discuss the components and their individual performance below and in Section~\ref{results}.

\subsubsection{Baseline variations}

The ``foundation'' of our ensemble was the default baseline model \citep{makarov-clematide-2018-imitation}, which we trained using the raw data and default settings in order to reproduce the baseline performance published by the organizers. We included it so we could individually assess the effect of each additional model on overall performance.

In addition to this default base, we added a larger version of the same model, increasing the number of encoder and decoder layers from 1 to 3 and the hidden dimensions from 200 to 400.

\subsubsection{biLSTM+attention seq2seq}

We conducted experiments with an RNN seq2seq model comprising a biLSTM encoder, an LSTM decoder, and dot-product attention.\footnote{We used the DyNet toolkit \citep{neubig2017dynet} for these experiments.} We ran several rounds of hyperparameter optimization over layer sizes, optimizer, and learning rate. Although none of these models outperformed the baseline, a small network (16-d embeddings, 128-d LSTM layers) proved efficiently trainable (2 CPU-hours) and improved the ensemble results, so it was included.

\subsubsection{PAS2P: Pronunciation-assisted sub-words to phonemes}

Sub-word segmentation is widely used in ASR and neural machine translation, as it reduces the cardinality of the search space relative to word-based models and mitigates the issue of OOVs. The use of sub-words for g2p has also been explored; e.g., \citet{reddy2010mdl} developed an MDL-based approach to extracting sub-word units for g2p. Recently, a pronunciation-assisted sub-word model (PASM) \citep{xu2019pasm} was shown to improve the performance of ASR models. We experimented with pronunciation-assisted sub-words to phonemes (PAS2P), leveraging the training data and a reparameterization of the IBM Model 2 aligner \citep{brown1993ibm} dubbed \textit{fast\_align} \citep{dyer2013fastalign}.\footnote{https://github.com/clab/fast\_align}

The alignment model is used to align sequences of graphemes to their corresponding phonemes. We follow a process similar to that of \citet{xu2019pasm} to find consistent grapheme-phoneme pairs and to refine them for the PASM model. We also collect grapheme-sequence statistics and marginalize them by summing the counts of each grapheme sequence over all possible phoneme sequences. These counts serve as the weights of each sub-word sequence.

Given a word and the weights for each sub-word, the segmentation process is a search problem over all possible sub-word segmentation of that word. We solve this search problem by building weighted FSTs\footnote{We use Pynini \citep{gorman2016pynini} for this.} of a given word and the sub-word vocabulary, and finding the best path through this lattice. For example, the word ``thoughtfulness" would be segmented by PASM as ``th\_ough\_t\_f\_u\_l\_n\_e\_ss", and this would be used as the input in the PAS2P model rather than the full sequence of individual graphemes.
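In place of the FST composition, the same best-path search can be sketched as a dynamic program over weighted sub-words (the counts below are made up for illustration, not real shared-task statistics):

```python
import math

# Hypothetical sub-word counts from alignment statistics;
# the values here are purely illustrative.
COUNTS = {"th": 500, "ough": 120, "t": 900, "o": 800, "u": 700,
          "g": 600, "h": 650}
TOTAL = sum(COUNTS.values())

def segment(word):
    """Most probable segmentation of `word` into weighted sub-words,
    found by dynamic programming (a stand-in for the weighted-FST
    best-path search)."""
    # best[i] = (log-probability, segmentation) of word[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in COUNTS and best[start][1] is not None:
                score = best[start][0] + math.log(COUNTS[piece] / TOTAL)
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print("_".join(segment("thought")))  # -> "th_ough_t"
```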

Finally, the PAS2P transducer is a Transformer-based sequence-to-sequence model trained with the ESPnet end-to-end speech processing toolkit \citep{Watanabe2018Espnet}, with pronunciation-assisted sub-words as inputs and phones as outputs. The model has 6 encoder and 6 decoder layers with 2048 units each, and 4 attention heads with 256 units. We use dropout with probability 0.1 and label smoothing with weight 0.1 to regularize the model. This model achieved WERs of \textbf{44.84} and \textbf{43.40} on the development and test sets, respectively.

\section{Results}\label{results}

Our main results are shown in Table~\ref{table:wer-results}, where we show both dev and test set WER for each individual model in addition to the submitted ensembles. In particular, we can see that many of the ensemble components do not beat the baseline WER, but nonetheless serve to improve the ensembled models.

\begin{table}[htp]
\centering
\begin{tabular}{lrr}
\hline \textbf{Model} & \textbf{dev} & \textbf{test} \\ \hline
\textbf{Dialpad-3} & \textbf{43.30} & \textbf{41.58} \\
PAS2P & 44.84 & 43.40 \\
Baseline (large) & 44.99 & 41.65 \\
Baseline (organizer) &	45.13 & 41.94 \\
Phonetisaurus &	45.44 & 43.88 \\
Baseline (raw data) & 45.92 & 41.70 \\
Sequitur & 46.69 & 43.86 \\
biLSTM seq2seq & 47.89 & 44.05 \\
\hline
\textbf{Dialpad-2} & \textbf{43.83} & \textbf{41.72} \\
\textbf{Dialpad-1} & \textbf{40.12} & \textbf{37.43} \\
\hline
\end{tabular}
\caption{\label{table:wer-results} Results for components of ensembles, and submitted models/ensembles (bolded).}
\end{table}


\section{Additional experiments}\label{ensembling-experiments}

We experimented with different ensembles and found that incorporating models with different architectures generally improves overall performance. In the standalone results, only the top three models beat the baseline WER, but adding additional models with higher WER than the baseline continues to reduce overall WER. Table~\ref{table:top-model-ensembles} shows the effect of this progressive ensembling, from our top-3 models to our top-7 (i.e. the ensemble for the \textbf{Dialpad-1} model).

\begin{table}[htp]
\centering
\begin{tabular}{lrr}
\hline
\textbf{Model} & \textbf{dev} & \textbf{test} \\
\hline
Ensemble-top3 & 41.10 & 39.71 \\
Ensemble-top4 & 40.74 & 38.89 \\
Ensemble-top5 & 40.50 & 38.12 \\
Ensemble-top6 & 40.31 & 37.69 \\
Ensemble-top7 (Dialpad-1) & 40.12 & 37.43 \\
\hline
\end{tabular}
\caption{\label{table:top-model-ensembles} Progressive ensembling results, with top-performing components}
\end{table}

\subsection{Edit distance-based voting}

In addition to varying our ensemble sizes and components, we investigated a different ensemble voting scheme, in which ties are broken using edit distance when there is no 1-best majority option. That is, in the event of a tie, instead of selecting the prediction made by the best standalone model (our usual tie-breaking method), we select the prediction that minimizes the edit distance to all other predictions with the same number of votes. The idea is to maximize sub-word-level agreement. Although this method did not show clear improvements on the development set, we found after submission that it narrowly but consistently outperformed the top-N ensembles on the test set (see Table~\ref{table:edit-dist-ensembles}).
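A sketch of this tie-breaker, using a standard Levenshtein distance over phone sequences (the tied predictions are hypothetical):

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (x != y))) # substitution
        prev = cur
    return prev[-1]

def break_tie(tied_preds):
    """Pick the tied prediction that minimizes total edit distance
    to the other tied predictions."""
    return min(tied_preds,
               key=lambda p: sum(edit_distance(p.split(), q.split())
                                 for q in tied_preds))

print(break_tie(["k æ t", "k ɑ t", "k æ d"]))  # -> "k æ t"
```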

\begin{table}[htp]
\centering
\begin{tabular}{lrr}
\hline
\textbf{Model} & \textbf{dev} & \textbf{test} \\
\hline
ED-Dialpad-3 & 43.76 & 41.70 \\
ED-top3 & 41.24 & 39.40 \\
ED-top4 & 40.62 & 38.48 \\
ED-top5 & 40.50 & 37.69 \\
ED-top6 & 40.28 & 37.50 \\
ED-top7 & 40.21 & 37.31 \\
\hline
\end{tabular}
\caption{\label{table:edit-dist-ensembles} Results for ensembling with edit-distance tie-breaking}
\end{table}

\section{Error analysis}

We conducted some basic analyses of the \textbf{Dialpad-1} submission's patterns of errors, to better understand its performance and identify potential areas of improvement.\footnote{We are grateful to an anonymous reviewer for suggesting that this would strengthen the paper.}

\subsection{Oracle WER}

We began by calculating the \textit{oracle WER}, i.e. the theoretical best WER the ensemble could have achieved had it selected the correct/gold prediction whenever it was present in the pool of component model predictions for a given input. The Dialpad-1 system's oracle WERs on the dev and test sets were \textbf{25.12} and \textbf{23.27}, respectively (cf. 40.12 and 37.43 actual).
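Oracle WER can be computed as follows (a sketch; the predictions and gold labels are hypothetical):

```python
def oracle_wer(component_preds, golds):
    """Oracle WER: a word counts as correct if ANY component model
    produced the gold pronunciation -- an upper bound on what a
    perfect selection scheme could achieve.

    component_preds: one list of predictions per component model.
    """
    errors = sum(all(preds[i] != gold for preds in component_preds)
                 for i, gold in enumerate(golds))
    return 100.0 * errors / len(golds)

m1 = ["k æ t", "d ɑ g"]   # model 1: right on word 0, wrong on word 1
m2 = ["k ɑ t", "d ɔ g"]   # model 2: wrong on word 0, right on word 1
gold = ["k æ t", "d ɔ g"]
print(oracle_wer([m1, m2], gold))  # -> 0.0: every gold is in the pool
```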

These represent massive potential performance improvements (approx. 15\% absolute, or 37\% relative, WER reduction), and suggest that refining our output selection/voting method (perhaps via some kind of confidence weighting) could lead to much-improved results.

\subsection{Data-related errors}

We also investigated outputs for which none of our component models predicted the correct pronunciation, in hopes of finding some patterns of interest.

Many of the training data-related issues raised in section~\ref{data-related-challenges} appeared in the dev and test labels as well. In some cases this led to high cross-component agreement, even on incorrect predictions. Our hope that subtle contextual cues might reveal patterns in the distribution of syllabic versus schwa-following liquids and nasals was not borne out, e.g. our ensemble was led astray on words like  ``warble'', which had a labelled pronunciation of /w ɔ ɹ b l̩/, while all 7 of our models predicted /w ɔ ɹ b ə l/, a functionally non-distinct pronunciation. In addition, the previously mentioned issue of /ɹ/ being mistranscribed as /r/ affected our performance, e.g. with the word ``unilateral'', whose labelled pronunciation was /j u n ɪ l æ t ə r ə l/, instead of /j u n ɪ l æ t ə ɹ ə l/, which was again the pronunciation predicted by all 7 models. Finally, narrowness of transcription was also an issue that affected our performance on the dev and test sets, e.g., for words like ``cloudy'' /k ɫ a ʊ d i/ and ``cry'' /k ɹ a ɪ̯/, for which we predicted /k l a ʊ d i/ and /k ɹ a ɪ/, respectively. In the end, it seems that noisiness in the data was a major source of errors for our submissions.\footnote{We nonetheless acknowledge the magnitude and challenge of the task of cleaning/normalizing a large quantity of user-generated data, and thank the organizers for the work that they did in this area.}

Aside from issues arising due to label noise, our systems also made some genuine errors typical of g2p models, mostly related to data distribution or sparsity. For example, our component models overwhelmingly predicted that ``irreparate'' (/ɪ ɹ ɛ p ə ɹ ə t/) should instead rhyme with ``rate'' (this ``-ate-''  /e ɪ t/ correspondence was overwhelmingly present in the training data), that ``backache'' (/b æ k e ɪ k/) must contain the affricate /t͡ʃ/, that ``acres'' (/e ɪ k ɚ z/) rhymes with ``degrees'', and that ``beret'' has a /t/ sound in it. In each of these cases, there were either too few samples in the training set to reliably learn the relevant grapheme-phoneme correspondence, or else a conflicting (but correct) correspondence was over-represented in the training data.

\section{Conclusion}

We presented and discussed three g2p systems submitted for the SIGMORPHON 2021 English-only shared sub-task. In addition to identifying a strong off-the-shelf contender, we showed that naive ensembling remains a strong strategy in supervised learning, and that simple majority-voting schemes can often leverage the respective strengths of sub-optimal component models, especially when diverse architectures are combined. Additionally, we provided further evidence for the usefulness of linguistically informed sub-word modeling as an input transformation for speech-related tasks.

We also discussed additional experiments whose results were not submitted, indicating the benefit of exploring top-N model vs ensemble trade-offs, and demonstrating the potential benefit of an edit-distance based tie-breaking method for ensemble voting.

Future work includes further search for the optimal trade-off between ensemble size and performance, as well as additional exploration of the edit-distance voting scheme, and more sophisticated ensembling/voting methods, e.g. majority voting at the phone level on aligned outputs.

\section*{Acknowledgments}

We are grateful to Dialpad Inc. for providing the resources, both temporal and computational, to work on this project.

\bibliographystyle{acl_natbib}
\bibliography{acl2021}


\end{document}
