Gian Seifert


2026

The rapid evolution of the labor market requires skill ontologies to be continuously updated, but manually identifying emerging skills in job advertisements is highly labor-intensive. This paper presents a scalable, multi-stage pipeline for automated novelty detection in skill extraction. The system combines Large Language Models (LLMs) for candidate generation, a re-matching and threshold-based filtering module ("Turbo") that compares candidates against the existing ontology, and a two-step aggregation process that merges string-based and embedding-based clustering. Experiments on Swiss job advertisement datasets using GPT-4o, Gemini-2.0-flash, and DeepSeek-V3 show that the pipeline effectively reduces noise and manual curation effort: Turbo filtering lowered false positives by 82%, and aggregation reduced the number of items requiring review by 97%. Among the tested models, Gemini-2.0-flash achieved the highest precision, reaching a novelty detection ratio of up to 73% in the qualitative evaluation. These findings demonstrate the pipeline’s potential as an efficient tool for maintaining dynamic skill ontologies.
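
To make the filtering and aggregation stages more concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes that candidate skills have already been produced by an LLM, uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure for the re-matching step, picks an arbitrary threshold of 0.8, and covers only the string-based half of the aggregation (the embedding-based clustering is omitted). All function names are hypothetical.

```python
from difflib import SequenceMatcher


def best_match_score(candidate, ontology):
    """Highest string similarity between a candidate and any ontology label.

    SequenceMatcher is an assumed stand-in; the actual re-matching method
    may differ.
    """
    return max(
        (SequenceMatcher(None, candidate.lower(), label.lower()).ratio()
         for label in ontology),
        default=0.0,
    )


def filter_novel(candidates, ontology, threshold=0.8):
    """Keep candidates whose best ontology match is below the threshold.

    The threshold value is illustrative, not taken from the paper.
    """
    return [c for c in candidates if best_match_score(c, ontology) < threshold]


def aggregate_by_string(candidates):
    """Group surface variants under a normalized key (string-based step only)."""
    groups = {}
    for c in candidates:
        # Lowercase, treat hyphens as spaces, and collapse repeated whitespace.
        key = " ".join(c.lower().replace("-", " ").split())
        groups.setdefault(key, []).append(c)
    return groups


ontology = ["Python programming", "Project management"]
candidates = [
    "python programming",   # already covered by the ontology
    "Prompt engineering",
    "prompt-engineering",
    "Prompt  Engineering",
]

novel = filter_novel(candidates, ontology)
groups = aggregate_by_string(novel)
```

In this toy input, the candidate that matches an existing ontology label is dropped by the filter, and the three spelling variants of the remaining skill collapse into a single group, so a reviewer would see one item instead of four.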