Cyril Allauzen


2024

pdf
A* shortest string decoding for non-idempotent semirings
Kyle Gorman | Cyril Allauzen
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Abstract: The single shortest path algorithm is undefined for weighted finite-state automata over non-idempotent semirings because such semirings do not guarantee the existence of a shortest path. However, in non-idempotent semirings admitting an order satisfying a monotonicity condition (such as the plus-times or log semirings), the shortest string is well-defined. We describe an algorithm which finds the shortest string for a weighted non-deterministic automaton over such semirings using the backwards shortest distance of an equivalent deterministic automaton (DFA) as a heuristic for A* search performed over a companion idempotent semiring, which is proven to return the shortest string. There may be exponentially more states in the DFA, but the proposed algorithm needs to visit only a small fraction of them if determinization is performed “on the fly”.

2019

pdf
Federated Learning of N-Gram Language Models
Mingqing Chen | Ananda Theertha Suresh | Rajiv Mathews | Adeline Wong | Cyril Allauzen | Françoise Beaufays | Michael Riley
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

We propose algorithms to train production-quality n-gram language models using federated learning. Federated learning is a distributed computation platform that can be used to train global models for portable devices such as smart phones. Federated learning is especially relevant for applications handling privacy-sensitive data, such as virtual keyboards, because training is performed without the users’ data ever leaving their devices. While the principles of federated learning are fairly generic, its methodology assumes that the underlying models are neural networks. However, virtual keyboards are typically powered by n-gram language models for latency reasons. We propose to train a recurrent neural network language model using the decentralized FederatedAveraging algorithm and to approximate this federated model server-side with an n-gram model that can be deployed to devices for fast inference. Our technical contributions include ways of handling large vocabularies, algorithms to correct capitalization errors in user data, and efficient finite state transducer algorithms to convert word language models to word-piece language models and vice versa. The n-gram language models trained with federated learning are compared to n-grams trained with traditional server-based algorithms using A/B tests on tens of millions of users of a virtual keyboard. Results are presented for two languages, American English and Brazilian Portuguese. This work demonstrates that high-quality n-gram language models can be trained directly on client mobile devices without sensitive training data ever leaving the devices.

pdf
On the Compression of Lexicon Transducers
Marco Cognetta | Cyril Allauzen | Michael Riley
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

In finite-state language processing pipelines, a lexicon is often a key component. It needs to be comprehensive to ensure accuracy, reducing out-of-vocabulary misses. However, in memory-constrained environments (e.g., mobile phones), the size of the component automata must be kept small. Indeed, a delicate balance between comprehensiveness, speed, and memory must be struck to conform to device requirements while providing a good user experience.In this paper, we describe a compression scheme for lexicons when represented as finite-state transducers. We efficiently encode the graph of the transducer while storing transition labels separately. The graph encoding scheme is based on the LOUDS (Level Order Unary Degree Sequence) tree representation, which has constant time tree traversal for queries while being information-theoretically optimal in space. We find that our encoding is near the theoretical lower bound for such graphs and substantially outperforms more traditional representations in space while remaining competitive in latency benchmarks.

2017

pdf bib
Transliterated Mobile Keyboard Input via Weighted Finite-State Transducers
Lars Hellsten | Brian Roark | Prasoon Goyal | Cyril Allauzen | Françoise Beaufays | Tom Ouyang | Michael Riley | David Rybach
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)

2016

pdf
Distributed representation and estimation of WFST-based n-gram models
Cyril Allauzen | Michael Riley | Brian Roark
Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata

2014

pdf
Pushdown Automata in Statistical Machine Translation
Cyril Allauzen | Bill Byrne | Adrià de Gispert | Gonzalo Iglesias | Michael Riley
Computational Linguistics, Volume 40, Issue 3 - September 2014

2013

pdf
Smoothed marginal distribution constraints for language modeling
Brian Roark | Cyril Allauzen | Michael Riley
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

pdf
The OpenGrm open-source finite-state grammar software libraries
Brian Roark | Richard Sproat | Cyril Allauzen | Michael Riley | Jeffrey Sorensen | Terry Tai
Proceedings of the ACL 2012 System Demonstrations

2011

pdf
Hierarchical Phrase-based Translation Representations
Gonzalo Iglesias | Cyril Allauzen | William Byrne | Adrià de Gispert | Michael Riley
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

pdf
Expected Sequence Similarity Maximization
Cyril Allauzen | Shankar Kumar | Wolfgang Macherey | Mehryar Mohri | Michael Riley
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2009

pdf
OpenFst: An Open-Source, Weighted Finite-State Transducer Library and its Applications to Speech and Language
Michael Riley | Cyril Allauzen | Martin Jansche
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts

2004

pdf
Statistical Modeling for Unit Selection in Speech Synthesis
Mehryar Mohri | Cyril Allauzen | Michael Riley
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

pdf
General Indexation of Weighted Automata - Application to Spoken Utterance Retrieval
Cyril Allauzen | Mehryar Mohri | Murat Saraclar
Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT-NAACL 2004

2003

pdf
Generalized Algorithms for Constructing Statistical Language Models
Cyril Allauzen | Mehryar Mohri | Brian Roark
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics