Large Language Models (LLMs) are increasingly used as efficient judges of the quality of answers generated by candidate models. However, their judgments may be influenced by complex scenarios and inherent biases, raising concerns about their reliability. This study addresses these concerns by examining how four previously unexplored factors affect the performance of LLMs as judges: answer quantity, inducing statements, judging strategy, and judging style. Additionally, we introduce a new dimension of question difficulty to provide a more comprehensive understanding of LLMs’ judgments across questions of varying intricacy. We employ ChatGPT, GPT-4, Gemini, and Claude-2 as judges and conduct experiments on the Vicuna Benchmark and MT-bench. Our study reveals that LLMs’ judging abilities are susceptible to all four factors, and that analysis along the newly proposed dimension of question difficulty is necessary. We also provide insights into optimizing LLMs’ performance as judges, enhancing their reliability and adaptability across diverse evaluation scenarios.
Recent research has documented that results reported in frequently cited authorship attribution papers are difficult to reproduce. Inaccessible code and data are often proposed as factors that block successful reproductions. Even when original materials are available, problems remain that prevent researchers from comparing the effectiveness of different methods. To solve the remaining problems (the lack of fixed test sets and the use of inappropriately homogeneous corpora), our paper contributes materials for five closed-set authorship identification experiments. The five experiments feature texts from 106 distinct authors and involve a range of contemporary non-fiction American English prose. These experiments provide the foundation for comparable and reproducible authorship attribution research involving contemporary writing.
The success of authorship attribution relies on the presence of linguistic features specific to individual authors. There is, however, limited research assessing to what extent authorial style remains constant when individuals switch from one writing modality to another. We measure the effect of writing mode on writing style in the context of authorship attribution research using a corpus of documents composed online (in a web browser) and documents composed offline using a traditional word processor. The results confirm the existence of a “mode effect” on authorial style. Online writing differs systematically from offline writing in terms of sentence length, word use, readability, and certain part-of-speech ratios. These findings have implications for research design and feature engineering in authorship attribution studies.