Daniel Truhn
2025
LLM Agents Making Agent Tools
Georg Wölflein | Dyke Ferber | Daniel Truhn | Ognjen Arandjelovic | Jakob Nikolas Kather
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains demanding large numbers of highly specialised tools, such as the life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a GitHub URL and a short task description, ToolMaker autonomously installs dependencies and generates code to perform the task, using a closed-loop self-correction mechanism for debugging. To evaluate our approach, we introduce a benchmark comprising 15 complex computational tasks spanning various domains with over 100 unit tests to assess correctness and robustness. Our method correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker is therefore a step towards fully autonomous agent-based scientific workflows.
2023
On the Impact of Cross-Domain Data on German Language Models
Amin Dada | Aokun Chen | Cheng Peng | Kaleb Smith | Ahmad Idrissi-Yaghir | Constantin Seibold | Jianning Li | Lars Heiliger | Christoph Friedrich | Daniel Truhn | Jan Egger | Jiang Bian | Jens Kleesiek | Yonghui Wu
Findings of the Association for Computational Linguistics: EMNLP 2023
Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state-of-the-art.