Samy Bengio


2025

TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
Jeffrey Li | Mohammadreza Armandpour | Seyed Iman Mirzadeh | Sachin Mehta | Vaishaal Shankar | Raviteja Vemulapalli | Samy Bengio | Oncel Tuzel | Mehrdad Farajtabar | Hadi Pouransari | Fartash Faghri
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) – orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs, as replay is crucial to avoid forgetting on generic web data but less so on specific domains.

2018

Tensor2Tensor for Neural Machine Translation
Ashish Vaswani | Samy Bengio | Eugene Brevdo | Francois Chollet | Aidan Gomez | Stephan Gouws | Llion Jones | Łukasz Kaiser | Nal Kalchbrenner | Niki Parmar | Ryan Sepassi | Noam Shazeer | Jakob Uszkoreit
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

2016

Generating Sentences from a Continuous Space
Samuel R. Bowman | Luke Vilnis | Oriol Vinyals | Andrew Dai | Rafal Jozefowicz | Samy Bengio
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

2006

Investigating Lexical Substitution Scoring for Subtitle Generation
Oren Glickman | Ido Dagan | Walter Daelemans | Mikaela Keller | Samy Bengio
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)