The Problem with Skills in Job Postings
A job description for a "Barista" at a coffee chain mentions Java seventeen times. A posting for a "Project Manager" lists Microsoft Office, Agile, Scrum, stakeholder management, budget forecasting, and "strong communication skills" in a single paragraph of prose. A three-sentence listing for a warehouse associate says only "Must be able to lift 50 lbs. Apply today. EOE."
These three postings represent the full spectrum of the skills extraction problem. The first is a trap: "Java" is a programming language to a skills parser, but irrelevant noise in the context of a coffee shop. The second is rich with signal, but buried in unstructured text. The third barely contains enough text to extract anything at all.
Across 907.5 million job records, 84.6% have at least one extracted skill, 73.9% have identified soft skills, and 9.1% include recognized certifications. But those topline numbers obscure the real story, which is about how extraction quality varies with the quality of the input. You can see exactly which of our 82 fields we extract in the data schema.
The Two-Stage Pipeline: Match First, Then Filter
Most skills extraction systems use one of two approaches: keyword matching against a dictionary, or NLP-based extraction from context. Each has a fundamental weakness. Dictionary matching is fast and comprehensive but produces false positives. NLP extraction understands context but is slow, expensive at scale, and tends to miss unusually phrased skills.
Our pipeline uses both, in sequence.
Stage one: Aho-Corasick dictionary matching. We maintain a taxonomy of 37,000+ technical skills, 3,000+ certifications, and 400+ soft skills. The Aho-Corasick algorithm scans each description against this entire taxonomy in a single pass. This stays computationally tractable even at 907 million records because, once the automaton is built, scan time is linear in the length of the text (plus the number of matches), independent of how many terms the dictionary contains.
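The stage-one matcher can be sketched as a compact pure-Python Aho-Corasick automaton. This is an illustrative toy, not our production code: the real matcher also respects token boundaries (so "Java" inside "JavaScript" is handled), while this sketch reports raw substring hits.

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton for multi-pattern matching.

    Builds a trie over the patterns, then computes failure links breadth-first
    so one left-to-right scan of the text surfaces every dictionary hit.
    """

    def __init__(self, patterns):
        self.goto = [{}]   # per-node character transitions (node 0 is the root)
        self.fail = [0]    # failure links
        self.out = [[]]    # patterns ending at (or via suffix links of) each node
        for p in patterns:
            self._insert(p.lower())
        self._build_failure_links()

    def _insert(self, pattern):
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pattern)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 nodes keep fail = 0
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                # walk failure links until a state with a ch-transition exists
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # inherit outputs of the suffix state so overlapping hits surface
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find(self, text):
        """Return the set of taxonomy terms occurring in text, in one pass."""
        node, hits = 0, set()
        for ch in text.lower():
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            hits.update(self.out[node])
        return hits
```

The key property: `find` never re-reads a character, so a 37,000-term dictionary costs no more per scan than a 500-term one.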
Stage two: NLP relevance filtering. The candidate set from stage one passes through a title-skill relevance model that evaluates whether each matched skill is actually relevant to the job. This is where "Java" gets filtered out of the Barista posting. The model considers the job title, surrounding context, and overall semantic content to separate genuine requirements from incidental text matches.
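The filtering stage can be illustrated with a toy stand-in for the relevance model. Everything here is hypothetical: the production system is a learned NLP classifier over title, context, and semantics, whereas this sketch fakes its scores with a hand-written prior table to show the interface, not the model.

```python
# Hypothetical stand-in for learned title-skill relevance scores.
RELEVANCE_PRIOR = {
    ("barista", "java"): 0.02,              # coffee context: "Java" is noise
    ("barista", "customer service"): 0.95,
    ("software engineer", "java"): 0.90,
}

def filter_skills(title, candidates, threshold=0.5):
    """Keep only stage-one candidates the (toy) relevance score passes.

    Unknown (title, skill) pairs fall back to a neutral prior here; the real
    model scores every pair from context rather than using a lookup table.
    """
    kept = []
    for skill in candidates:
        score = RELEVANCE_PRIOR.get((title.lower(), skill.lower()), 0.5)
        if score >= threshold:
            kept.append(skill)
    return kept
```

Run against the opening example, `filter_skills("Barista", ["Java", "Customer Service"])` drops "Java" and keeps "Customer Service".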
This two-stage approach gives us the recall of dictionary matching combined with the precision of NLP. For a closer look at how raw postings become structured records, you can compare raw vs enriched output side by side.
Description Length Drives Everything
The relationship between description length and skills extraction coverage is the defining characteristic of skills data quality. Our production data makes the pattern clear:
| Description Length | Records | Avg Skills per Posting | Coverage (≥1 skill) |
|---|---|---|---|
| Under 50 words | 19.1M | 0.7 | 46.3% |
| 50 to 100 words | 28.7M | 1.2 | 60.4% |
| 100 to 200 words | 90.6M | 2.2 | 79.7% |
| 200 to 500 words | 379.2M | 4.5 | 94.3% |
| 500+ words | 314.0M | 10.2 | 99.3% |
At under 50 words, fewer than half of postings yield any extractable skills. At 500+ words, coverage reaches 99.3% with an average of 10.2 skills per posting.
This is not a pipeline limitation. It is an informational constraint: you cannot extract skills that are not mentioned. A three-line posting that says "Hiring nurses. Apply now." contains exactly one extractable signal, and no amount of NLP sophistication will find more.
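The length-bucket table above can be reproduced from raw records with a simple word-count bucketing pass. The `postings` shape here (description text paired with its extracted-skill list) is illustrative, not our internal record format.

```python
BUCKETS = [(0, 50), (50, 100), (100, 200), (200, 500), (500, float("inf"))]

def coverage_by_length(postings):
    """postings: iterable of (description, extracted_skills) pairs.

    Returns per-bucket record count, average skills per posting, and the
    share of records with at least one extracted skill (coverage).
    """
    stats = {b: [0, 0, 0] for b in BUCKETS}  # records, total skills, covered
    for description, skills in postings:
        words = len(description.split())
        for lo, hi in BUCKETS:
            if lo <= words < hi:
                s = stats[(lo, hi)]
                s[0] += 1
                s[1] += len(skills)
                s[2] += bool(skills)  # counts 1 if any skill was extracted
                break
    return {
        b: {"records": n,
            "avg_skills": k / n if n else 0.0,
            "coverage": c / n if n else 0.0}
        for b, (n, k, c) in stats.items()
    }
```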
The practical implication is that skills coverage varies by source. ATS feeds from Greenhouse and Lever produce longer descriptions (often 300 to 800 words) with correspondingly higher coverage. Aggregator sources like Google Jobs frequently carry truncated descriptions under 200 words. We discuss source quality differences in more detail in our ATS and job board comparison. Any analysis of skills trends must account for the description length distribution of the underlying source mix.
What the Top Skills Reveal
The most frequently extracted skills across 907.5 million records paint a picture that may surprise analysts accustomed to tech-centric discussions.
Nursing dominates by a wide margin. With 101 million mentions, Nursing is the single most frequently identified skill. Travel Nursing adds another 48.8 million. Healthcare is, by volume, the largest skills category in the job market.
Universal skills function as baseline expectations, not differentiators. Microsoft Office appears in 44.5 million postings. Project Management shows up in 25.5 million. These are table stakes across white-collar roles.
The top 15 skills span at least eight industries. Construction (36.1M), Finance (31.3M), Accounting (29.3M), Mental Health (26.9M), Housekeeping (24.4M), Merchandising (24.1M), and Acute Care (23.4M) represent healthcare, hospitality, retail, construction, and financial services.
The job market is not a technology monoculture, and skills data should reflect that breadth. For context on how these skills map to occupational classifications, see our glossary of job market terms.
Certifications Tell a Different Story
While skills describe what a worker can do, certifications describe what a worker is authorized to do. The distinction matters for workforce planning.
| Certification | Postings |
|---|---|
| Security Clearance | 8.2M |
| Licensed Practical Nurse | 7.2M |
| Nurse Practitioner | 5.4M |
| CNA | 5.4M |
| CDL | 2.8M |
| Forklift Certification | 1.4M |
Security Clearance leads at 8.2 million postings, reflecting the scale of the US defense contractor workforce. The healthcare certifications (LPN, NP, CNA) reinforce that industry's dominance in the structured credentials space. CDL at 2.8 million speaks to logistics and transportation. Forklift Certification at 1.4 million reflects warehousing and distribution.
The certification landscape is notably more concentrated than the skills landscape. While our taxonomy includes over 3,000 certifications, the top six account for a disproportionate share of mentions. Certain industries (healthcare, defense, transportation) have hard certification requirements, while most knowledge-worker roles rely on skills and experience rather than formal credentials. This concentration pattern is something to keep in mind when building occupation-level datasets or comparing providers.
Why Taxonomy Scale Matters
A taxonomy with 500 skills can tell you that a posting requires "programming." A taxonomy with 37,000 skills can distinguish between Python, Go, Rust, TypeScript, and COBOL. It can separate "machine learning" from "deep learning" from "reinforcement learning." It can identify "Kubernetes" as distinct from "Docker" as distinct from "cloud infrastructure."
But taxonomy size alone is not sufficient. A larger dictionary without relevance filtering would simply produce more false positives. The combination of comprehensive matching and contextual filtering is what allows us to maintain both breadth (capturing niche skills like "BCBA" or "PALS" certification) and precision (not tagging every coffee shop job with a Java requirement).
The NLP filtering stage is what makes the taxonomy usable at scale. Without it, a 37,000-term dictionary would generate so many false positives that downstream consumers would need to build their own filtering layer, which defeats the purpose of pre-enriched data. Our methodology page covers how we validate extraction accuracy across skill categories and description types.
How Skills Enrichment Connects to the Broader Pipeline
Skills extraction does not happen in isolation. It depends on clean job descriptions from the ingestion pipeline and feeds into the broader enrichment layer alongside salary normalization, SOC classification, seniority detection, and remote work tagging. A model serving outage or description truncation issue upstream can cause skills coverage to drop without any change to the extraction logic itself.
This is why monitoring extraction rates by description length, by source, and over time matters more than a single aggregate coverage number. The 84.6% overall coverage rate is a blend of 99.3% on long descriptions and 46.3% on short ones. The useful question is always: what does coverage look like for the specific segment you care about?
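The blend can be checked with a few lines of arithmetic. The per-bucket figures come from the length table above; the one assumption (labeled in the code) is that the roughly 75.9M records outside those buckets, 907.5M total minus the 831.6M bucketed, lack usable descriptions and contribute zero coverage.

```python
def blended_coverage(segments):
    """Weighted-average coverage across (record_count, coverage_rate) segments."""
    total = sum(n for n, _ in segments)
    return sum(n * r for n, r in segments) / total

# Per-bucket figures from the length table (records in millions).
buckets = [(19.1, 0.463), (28.7, 0.604), (90.6, 0.797), (379.2, 0.943), (314.0, 0.993)]

# Assumption: the residual ~75.9M records (907.5M total minus the bucketed
# 831.6M) have no usable description and therefore zero skills coverage.
all_segments = buckets + [(907.5 - sum(n for n, _ in buckets), 0.0)]
```

Under that assumption, `blended_coverage(all_segments)` lands at roughly 0.846, matching the headline 84.6% figure.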
Evaluating Skills Data
When comparing job market data providers, the question that matters is not "Do you have skills data?" Instead, ask:
- What is the taxonomy size, and how is it maintained?
- How does the system handle false positives from dictionary matching?
- What is the coverage rate by description length, and how does the source mix affect that rate?
- Are certifications, soft skills, and technical skills distinguished or lumped together?
Across 907.5 million records, 84.6% carry extracted skills. For postings with descriptions over 200 words (693 million records, or 76% of the dataset), coverage exceeds 94%. We extract across three distinct categories using a taxonomy of 40,000+ terms, filtered through NLP relevance models that prevent dictionary-match noise from contaminating the output.
If you want to see skills extraction results on your target occupations or geographies, request a sample.
Want to see the data for yourself?
Get a free sample of 5,000 enriched job records, delivered within 24 hours.