TY - GEN
T1 - LLM-Supervised Multilingual Skill Extraction and Classification from Job Ads
AU - Wang, Jakob Mørup
AU - Sun, Zhiru
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - This paper presents a pipeline for extracting, classifying, and representing skills requested in job advertisements to enable demand-side labor market analysis. Our contributions include: (1) addressing the annotation bottleneck by leveraging scalable, taxonomy-aligned LLM supervision for training a lightweight sentence encoder, (2) expanding skill extraction to include implicit skill requirements as well as the explicit mentions typically targeted in prior work, and (3) representing skills as distributions to robustly support downstream tasks despite the fluid, overlapping nature of skill definitions. Concretely, we compile 3M+ postings from 10k+ sources and sample 500k+ sentences to fine-tune paraphrase-multilingual-mpnet-base-v2 for identifying skill requests and mapping them to the 13,896-skill ESCO taxonomy, supervised by GPT-4o mini. The outcome is normalized per-ad skill distributions, aggregated from sentence-level distributions weighted by request probability.
KW - Labor market analysis
KW - LLM supervision
KW - Skill classification
KW - Skill extraction
KW - Taxonomy alignment
U2 - 10.1007/978-3-031-97144-0_9
DO - 10.1007/978-3-031-97144-0_9
M3 - Article in proceedings
AN - SCOPUS:105011032093
SN - 9783031971433
T3 - Lecture Notes in Computer Science
SP - 94
EP - 104
BT - Natural Language Processing and Information Systems
A2 - Ichise, Ryutaro
PB - Springer
T2 - 30th International Conference on Natural Language and Information Systems, NLDB 2025
Y2 - 4 July 2025 through 6 July 2025
ER -