How Yu


2026

We present our systems for SemEval-2026 Task 13, built on the Droid resource suite and benchmark setting. For Subtask A (binary classification of human-written vs. machine-generated code), lexical baselines such as TF–IDF and character n-grams transferred poorly from the LeetCode training distribution to the production-code evaluation split. After correcting pipeline errors that had obscured true performance and selecting AST features that remained stable under domain shift, our final system uses five uncorrelated features and achieves 0.57 macro F1 on the public test set. For Subtask C (4-way authorship classification into human, AI, hybrid, and adversarial code), lexical baselines again performed poorly under a significant vocabulary shift. Deep semantic models proved more promising, and a per-class weighted ensemble that included these models achieved 0.57 macro F1 on the public test set.
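To make the terms "per-class weighted ensemble" and "macro F1" concrete, the following is a minimal sketch under stated assumptions, not the system evaluated here. It assumes each model outputs a probability per class and that a weight is assigned to every (model, class) pair; the function names `ensemble_predict` and `macro_f1`, the weighting scheme, and the argmax decision rule are all hypothetical choices made for illustration.

```python
def ensemble_predict(model_probs, class_weights):
    """Combine model outputs with one weight per (model, class) pair.

    model_probs[m][i][c]: probability that model m assigns to class c for sample i.
    class_weights[m][c]: illustrative weight for model m on class c (an assumption).
    Returns the predicted class index for each sample.
    """
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    n_classes = len(class_weights[0])
    predictions = []
    for i in range(n_samples):
        # Weighted sum over models, computed separately for every class.
        scores = [
            sum(class_weights[m][c] * model_probs[m][i][c] for m in range(n_models))
            for c in range(n_classes)
        ]
        # Pick the class with the highest combined score.
        predictions.append(max(range(n_classes), key=scores.__getitem__))
    return predictions


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    f1_scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0 by convention here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_classes
```

Because macro F1 averages over classes rather than samples, weighting models per class lets the ensemble defer to whichever model is most reliable on each class, including the smaller ones.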