With the advent of large language models, machine translation (MT) has become a widely used, but little understood, tool for accessing historical and multilingual texts. While models like GPT, Claude, and Deepseek increasingly enable translation of low-resource and ancient languages, critical questions remain about their evaluation, optimal model selection, and the value of domain-specific training and retrieval-augmented generation setups.While AI models like GPT, Claude, and Deepseek are improving translation capabilities for low-resource and ancient languages, researchers still face important questions about how to evaluate their performance, which models work best, and whether specialized training approaches provide meaningful improvements in translation quality.This study introduces a comprehensive evaluation dataset for Buddhist Chinese to English translation, comprising 2,662 bilingual data points from 32 texts that have been selected to represent the full breadth of the Chinese Buddhist canon.We evaluate various computational metrics of translation quality (BLEU, chrF, BLEURT, GEMBA) against expert annotations from five domain specialists who rated 182 machine-generated translations. Our analysis reveals that LLM-based GEMBA scoring shows the strongest correlation with human judgment, significantly outperforming traditional metrics. We then benchmark commercial models (GPT-4 Turbo, Claude 3.5, Gemini), open-source models (Gemma 2, Deepseek-r1), and a domain-specialized model (Gemma 2 Mitra) using GEMBA. Our results demonstrate that domain-specific training enables open-weights models to achieve competitive performance with commercial systems, while also showing that retrieval-augmented generation (RAG) significantly improves translation quality for the best performing commercial models.
Accurate representation of procedures in restricted scenarios, such as non-standardized scientific experiments, requires precise depiction of constraints. Unfortunately, Domain-specific Language (DSL), as an effective tool to express constraints structurally, often requires case-by-case hand-crafting, necessitating customized, labor-intensive efforts. To overcome this challenge, we introduce the AutoDSL framework to automate DSL-based constraint design across various domains. Utilizing domain specified experimental protocol corpora, AutoDSL optimizes syntactic constraints and abstracts semantic constraints. Quantitative and qualitative analyses of the DSLs designed by AutoDSL across five distinct domains highlight its potential as an auxiliary module for language models, aiming to improve procedural planning and execution.
In order to construct or extend entity-centric and event-centric knowledge graphs (KG and EKG), the information extraction (IE) annotation toolkit is essential. However, existing IE toolkits have several non-trivial problems, such as not supporting multi-tasks, and not supporting automatic updates. In this work, we present CollabKG, a learnable human-machine-cooperative IE toolkit for KG and EKG construction. Specifically, for the multi-task issue, CollabKG unifies different IE subtasks, including named entity recognition (NER), entity-relation triple extraction (RE), and event extraction (EE), and supports both KG and EKG. Then, combining advanced prompting-based IE technology, the human-machine-cooperation mechanism with Large Language Models (LLMs) as the assistant machine is presented which can provide a lower cost as well as a higher performance. Lastly, owing to the two-way interaction between the human and machine, CollabKG with learning ability allows self-renewal. Besides, CollabKG has several appealing features (e.g., customization, training-free, and label propagation) that make the system powerful and high-productivity. We holistically compare our toolkit with other existing tools on these features. Human evaluation quantitatively illustrates that CollabKG significantly improves annotation quality, efficiency, and stability simultaneously.