Jakob Foerster


2024

pdf
HelloFresh: LLM Evalutions on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits
Tim Franzmeyer | Aleksandar Shtedritski | Samuel Albanie | Philip Torr | Joao F. Henriques | Jakob Foerster
Findings of the Association for Computational Linguistics ACL 2024

Benchmarks have been essential for driving progress in machine learning. A better understanding of LLM capabilities on real world tasks is vital for safe development.Designing adequate LLM benchmarks is challenging: Data from real-world tasks is hard to collect, public availability of static evaluation data results in test data contamination and benchmark overfitting, and periodically generating new evaluation data is tedious and may result in temporally inconsistent results. We introduce HelloFresh, based on continuous streams of real-world data generated by intrinsically motivated human labelers. It covers recent events from X (formerly Twitter) community notes and edits of Wikipedia pages, mitigating the risk of test data contamination and benchmark overfitting.Any X user can propose an X note to add additional context to a misleading post (formerly tweet); if the community classifies it as helpful, it is shown with the post. Similarly, Wikipedia relies on community-based consensus, allowing users to edit articles or revert edits made by other users.Verifying whether an X note is helpful or whether a Wikipedia edit should be accepted are hard tasks that require grounding by querying the web.We backtest state-of-the-art LLMs supplemented with simple web search access and find that HelloFresh yields a temporally consistent ranking.To enable continuous evaluation on Hellofresh, we host a public leaderboard and periodically updated evaluation data at https://tinyurl.com/hello-fresh-LLM.

2020

pdf
Compositionality and Capacity in Emergent Languages
Abhinav Gupta | Cinjon Resnick | Jakob Foerster | Andrew Dai | Kyunghyun Cho
Proceedings of the 5th Workshop on Representation Learning for NLP

Recent works have discussed the extent to which emergent languages can exhibit properties of natural languages particularly learning compositionality. In this paper, we investigate the learning biases that affect the efficacy and compositionality in multi-agent communication in addition to the communicative bandwidth. Our foremost contribution is to explore how the capacity of a neural network impacts its ability to learn a compositional language. We additionally introduce a set of evaluation metrics with which we analyze the learned languages. Our hypothesis is that there should be a specific range of model capacity and channel bandwidth that induces compositional structure in the resulting language and consequently encourages systematic generalization. While we empirically see evidence for the bottom of this range, we curiously do not find evidence for the top part of the range and believe that this is an open question for the community.

2019

pdf
Seeded self-play for language learning
Abhinav Gupta | Ryan Lowe | Jakob Foerster | Douwe Kiela | Joelle Pineau
Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

How can we teach artificial agents to use human language flexibly to solve problems in real-world environments? We have an example of this in nature: human babies eventually learn to use human language to solve problems, and they are taught with an adult human-in-the-loop. Unfortunately, current machine learning methods (e.g. from deep reinforcement learning) are too data inefficient to learn language in this way. An outstanding goal is finding an algorithm with a suitable ‘language learning prior’ that allows it to learn human language, while minimizing the number of on-policy human interactions. In this paper, we propose to learn such a prior in simulation using an approach we call, Learning to Learn to Communicate (L2C). Specifically, in L2C we train a meta-learning agent in simulation to interact with populations of pre-trained agents, each with their own distinct communication protocol. Once the meta-learning agent is able to quickly adapt to each population of agents, it can be deployed in new populations, including populations speaking human language. Our key insight is that such populations can be obtained via self-play, after pre-training agents with imitation learning on a small amount of off-policy human language data. We call this latter technique Seeded Self-Play (S2P). Our preliminary experiments show that agents trained with L2C and S2P need fewer on-policy samples to learn a compositional language in a Lewis signaling game.