Bryan E. Tuck


2026

Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-family evaluation remains limited. We evaluate 39 configurations spanning three model families (Qwen3, Claude Haiku 4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Cross-family differences produce substantially larger performance gaps (2.0–2.2×, F1 = 0.761 vs. 0.343) than parameter scaling within families (an 83% gain from 4B to 32B), and a partial-correlation analysis rules out tokenizer design as a confound for within-family scaling. Thinking-budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade, indicating inconsistent benefits from added compute. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (ρ = 0.28–0.42) across all families, yet identify systematic failures on common words with unusual orthography ("data", "loll", "acai": 83–91% human success, 94–98% model miss rate). These failures point to an over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns.
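For readers unfamiliar with the calibration statistic quoted above: Spearman's ρ measures how well one ranking tracks another, here whether puzzles that are harder for humans are also harder for models. A minimal self-contained sketch with made-up success rates (the paper's actual data and code are not reproduced here; all numbers below are hypothetical):

```python
def ranks(xs):
    # Assign rank 1 to the smallest value, rank n to the largest
    # (assumes no ties, which keeps the classic formula exact).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    # Classic Spearman formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-puzzle success fractions for humans and a model.
human_success = [0.91, 0.83, 0.88, 0.45, 0.30, 0.62]
model_success = [0.80, 0.65, 0.60, 0.30, 0.20, 0.45]
print(f"rho = {spearman(human_success, model_success):.2f}")  # → rho = 0.94
```

A positive ρ indicates the model's difficulty ordering roughly agrees with the human one, which is the sense in which the abstract calls ρ = 0.28–0.42 "modest but consistent" calibration.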

2025

The rapid development of large language models (LLMs) has significantly improved the generation of fluent and convincing text, raising concerns about their potential misuse on social media platforms. We present a comprehensive methodology for creating nine Twitter datasets to examine the generative capabilities of four prominent LLMs: Llama 3, Mistral, Qwen2, and GPT-4o. These datasets encompass four censored and five uncensored model configurations, including 7B- and 8B-parameter instruction-tuned models of the three open-source LLMs. Additionally, we perform a data quality analysis to assess the characteristics of textual outputs from human, “censored,” and “uncensored” models, employing semantic meaning, lexical richness, structural patterns, content characteristics, and detector performance metrics to identify differences and similarities. Our evaluation demonstrates that “uncensored” models significantly undermine the effectiveness of automated detection methods. This study addresses a critical gap by exploring smaller open-source models and the ramifications of “uncensoring,” providing valuable insights into how domain adaptation and content moderation strategies influence both the detectability and structural characteristics of machine-generated text.
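The abstract lists lexical richness among the comparison metrics. One widely used lexical-richness measure is the type-token ratio (TTR); the sketch below is illustrative only, assuming simple lowercased whitespace tokenization, and is not the paper's actual metric implementation:

```python
def type_token_ratio(text: str) -> float:
    # TTR = unique word types / total word tokens.
    # Lowercased whitespace tokenization is an assumption here;
    # the paper's exact tokenization is not specified in the abstract.
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# "the" repeats, so 5 types over 6 tokens.
print(round(type_token_ratio("the cat sat on the mat"), 2))  # → 0.83
```

Higher TTR suggests more varied vocabulary; comparing such scores across human, censored, and uncensored outputs is one way the abstract's "lexical richness" comparison could be operationalized.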