Cristian-George Craciun
2025
GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering
Cristian-George Craciun | Răzvan-Alexandru Smădu | Dumitru-Clementin Cercel | Mihaela-Claudia Cercel
Findings of the Association for Computational Linguistics: ACL 2025
Pre-trained language models have shown remarkable performance in recent years, setting a new paradigm for natural language processing (NLP) research. The legal domain has received some attention from the NLP community, in part due to its textual nature. Question answering (QA) is one of the tasks in this domain. This work explores legal multiple-choice QA (MCQA) for Romanian. The contribution of this work is multi-fold. We introduce JuRO, the first openly available Romanian legal MCQA dataset, comprising 10,836 questions from three examinations. Along with this dataset, we introduce CROL, an organized corpus of laws comprising a total of 93 distinct documents with their modifications over 763 time spans, which we use for information retrieval in this work. Additionally, we construct Law-RoG, the first graph of legal knowledge for the Romanian language, derived from the aforementioned corpus. Lastly, we propose a novel approach for MCQA, namely Graph Retrieval Augmented by Facts (GRAF), which achieves results competitive with generally accepted state-of-the-art methods and even exceeds them in most settings.
German4All – A Dataset and Model for Readability-Controlled Paraphrasing in German
Miriam Anschütz | Thanh Mai Pham | Eslam Nasrallah | Maximilian Müller | Cristian-George Craciun | Georg Groh
Proceedings of the 18th International Natural Language Generation Conference
The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.
2024
RoQLlama: A Lightweight Romanian Adapted Language Model
George-Andrei Dima | Andrei-Marius Avram | Cristian-George Craciun | Dumitru-Clementin Cercel
Findings of the Association for Computational Linguistics: EMNLP 2024
The remarkable achievements obtained by open-source large language models (LLMs) in recent years have predominantly been concentrated on tasks involving the English language. In this paper, we aim to advance the performance of Llama2 models on Romanian tasks. We tackle the problem of reduced computing resources by using QLoRA for training. We release RoQLlama-7b, a quantized LLM, which shows equal or improved results compared to its full-sized counterpart when tested on seven Romanian downstream tasks in the zero-shot setup. It also consistently achieves higher average scores across all few-shot prompts. Additionally, we introduce a novel Romanian dataset, namely RoMedQA, which contains single-choice medical questions in Romanian.