Dimitris Roussis
2025
Krikri: Advancing Open Large Language Models for Greek
Dimitris Roussis | Leon Voukoutis | Georgios Paraskevopoulos | Sokratis Sofianopoulos | Prokopis Prokopidis | Vassilis Papavassileiou | Athanasios Katsamanis | Stelios Piperidis | Vassilis Katsouros
Findings of the Association for Computational Linguistics: EMNLP 2025
We introduce Llama-Krikri-8B, a cutting-edge Large Language Model tailored for the Greek language, built on Meta’s Llama 3.1-8B. Llama-Krikri-8B has been extensively trained on high-quality Greek data to ensure superior adaptation to linguistic nuances. With 8 billion parameters, it offers advanced capabilities while maintaining efficient computational performance. Llama-Krikri-8B supports both Modern Greek and English, and is also equipped to handle polytonic text and Ancient Greek. The chat version of Llama-Krikri-8B features a multi-stage post-training pipeline that utilizes both human and synthetic instruction and preference data, applying techniques such as MAGPIE. In addition, for evaluation, we propose three novel public benchmarks for Greek. Our evaluation on existing as well as the proposed benchmarks shows notable improvements over comparable Greek and multilingual LLMs in natural language understanding and generation, as well as in code generation.
2024
Enhancing Scientific Discourse: Machine Translation for the Scientific Domain
Dimitris Roussis | Sokratis Sofianopoulos | Stelios Piperidis
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora from the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the research domains of Energy Research, Neuroscience, Cancer, and Transportation. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process and the fine-tuning strategies employed, and we conclude with the evaluation results.