Anatomy of 900 Million Job Postings: A Dataset Guide

The Dataset at a Glance

907,539,631 records. Each one represents a unique job posting observation after multi-signal semantic deduplication across sources. Before deduplication, the raw observation count is substantially higher, measured in the billions. How deduplication works is a story in itself, covered in Deduplication Is Harder Than You Think.

22 sources. These range from the two largest global job boards (Indeed and LinkedIn) to employer ATS platforms (Workday, Greenhouse, Lever, iCIMS, SmartRecruiter) to regional aggregators and specialized feeds.

82 fields per record. Coverage varies by field, source, and data vintage, but the schema is consistent. Every record passes through the same enrichment pipeline. Full field definitions are in the data schema.

Four years of continuous coverage: 2022 through 2025, with volume growing each year.

Source Composition

Not all job data is created equal. The top sources by volume:

Source	Records	Share
Indeed	225.9M	24.9%
PJF (aggregator)	215.7M	23.8%
LinkedIn	176.4M	19.4%
SimplyHired	105.1M	11.6%
Jora	74.7M	8.2%
CareerBuilder	58.1M	6.4%
MyWorkdayJobs	11.8M	1.3%
Google Jobs	6.5M	0.7%
SmartRecruiter	2.2M	0.2%
iCIMS	2.0M	0.2%
Greenhouse	278K	<0.1%

The dataset is not dependent on any single source. Indeed is the largest contributor at 24.9%, with PJF (23.8%) and LinkedIn (19.4%) close behind, so no single source accounts for even a quarter of the records. If any one source degrades or restricts access, the dataset does not collapse.
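The share column can be reproduced from the record counts. The short Python sketch below recomputes the percentages for the six largest sources and the combined share of the top three. The counts are copied from the table; the variable and function names are illustrative and not part of the dataset.

```python
# Record counts in millions, copied from the source table above.
TOTAL_RECORDS_M = 907.5  # all 22 sources combined

source_records_m = {
    "Indeed": 225.9,
    "PJF": 215.7,
    "LinkedIn": 176.4,
    "SimplyHired": 105.1,
    "Jora": 74.7,
    "CareerBuilder": 58.1,
}


def share_pct(records_m, total_m=TOTAL_RECORDS_M):
    """Return a source's share of the dataset as a percentage (one decimal)."""
    return round(100 * records_m / total_m, 1)


shares = {name: share_pct(count) for name, count in source_records_m.items()}

# Combined share of the three largest sources.
top_three_m = sum(sorted(source_records_m.values(), reverse=True)[:3])
top_three_share = share_pct(top_three_m)

for name, pct in shares.items():
    print(f"{name:<14}{pct:>5}%")
print(f"Top three combined: {top_three_share}%")
```

The largest source stays under a quarter of the total, while the top three together account for about 68% of records.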

The mix includes both aggregator platforms (Indeed, LinkedIn, SimplyHired) and direct employer ATS feeds (Greenhouse, Lever, Workday/MyWorkdayJobs, iCIMS). ATS feeds are smaller in volume but higher in data quality, representing first-party employer data rather than re-aggregated scrapes. For a detailed comparison of what that quality difference looks like in practice, see ATS vs. Job Boards: A Data Quality Comparison.

Year-Over-Year Growth

Year	Records	YoY Growth
2022	179.9M	baseline
2023	198.4M	+10.3%
2024	214.6M	+8.2%
2025	272.9M	partial year (through late March), already above full-year 2024

The 2025 partial-year number (272.9M through late March) already exceeds full-year 2024. This reflects both expanding source coverage and increasing labor market activity.
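The growth percentages in the table follow from the yearly record counts. A minimal Python sketch of that arithmetic, using the table's figures (the names are illustrative); 2025 is left out because a partial year is not comparable to a full one:

```python
# Full-year record counts in millions, copied from the table above.
yearly_records_m = {2022: 179.9, 2023: 198.4, 2024: 214.6}


def yoy_growth_pct(current, previous):
    """Year-over-year growth as a percentage, rounded to one decimal."""
    return round(100 * (current - previous) / previous, 1)


growth = {
    year: yoy_growth_pct(yearly_records_m[year], yearly_records_m[year - 1])
    for year in (2023, 2024)
}
print(growth)  # {2023: 10.3, 2024: 8.2}
```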

Field Coverage: The 82-Field Anatomy

Identity and Core Fields

Field	Coverage	Notes
Normalized title	99.97%	Raw titles canonicalized via ML model
Company name	96.6%	Gaps concentrated in two ATS sources
Country	92.5%	Parsed from raw location strings
State	95.1%	Highest coverage among location fields
City	92.5%	Metro areas and multi-location strings are the main gap
Zipcode	91.6%	Often inferred when not stated

The 99.97% title normalization rate means only about 270,000 records out of 907.5 million lack a normalized title. This is the single most impactful enrichment for downstream analysis. Without it, "Sr. Software Eng II," "Senior Software Engineer," and "Snr SW Developer" are three different things. With it, they are one. The methodology page covers how normalization works, and Inside Our NLP Enrichment System walks through the full stack of models behind classification and extraction.
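To make the effect concrete, the Python sketch below counts the three example titles before and after normalization. The lookup table is a hypothetical stand-in for the ML normalizer, which is not described here, and the canonical label is assumed for illustration.

```python
from collections import Counter

# Hypothetical lookup standing in for the ML title normalizer.
CANONICAL_TITLES = {
    "sr. software eng ii": "Senior Software Engineer",
    "senior software engineer": "Senior Software Engineer",
    "snr sw developer": "Senior Software Engineer",
}


def normalize_title(raw_title):
    """Map a raw title to its canonical form, or None if it is unknown."""
    return CANONICAL_TITLES.get(raw_title.strip().lower())


raw_titles = ["Sr. Software Eng II", "Senior Software Engineer", "Snr SW Developer"]

raw_counts = Counter(raw_titles)
normalized_counts = Counter(normalize_title(title) for title in raw_titles)

print(len(raw_counts))         # three distinct raw titles
print(dict(normalized_counts))  # one canonical title with a count of 3

# Records without a normalized title: 0.03% of the full dataset.
missing_titles = round(907_539_631 * 0.0003)
```

Grouping on the raw strings would split one role into three buckets; grouping on the normalized title keeps it in one.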

Classification Fields

Field	Coverage	What It Does
SOC code	79.6%	6-digit occupational classification (BLS standard)
Seniority	71.2%	Entry / Mid / Senior / Lead / Executive
Employment type	70.2%	Full-time / Part-time / Contract / Temporary
Remote/work mode	79.6%	Remote / Hybrid / On-site

Only 18.9% of records arrive with seniority information from the source; the NLP pipeline brings that to 71.2%, a 3.8x multiplier. The pipeline assigns a classification only when confidence is sufficient. It does not guess.
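The "only when confidence is sufficient" behavior can be sketched as a threshold with abstention. This is an illustration under assumptions, not the pipeline's actual logic: the threshold value, function name, and score format are all hypothetical.

```python
from typing import Mapping, Optional

CONFIDENCE_THRESHOLD = 0.80  # hypothetical cutoff; the real value is not stated


def assign_seniority(scores: Mapping[str, float],
                     threshold: float = CONFIDENCE_THRESHOLD) -> Optional[str]:
    """Return the top-scoring label, or None when confidence is too low."""
    if not scores:
        return None
    label, confidence = max(scores.items(), key=lambda item: item[1])
    return label if confidence >= threshold else None


confident = assign_seniority({"Senior": 0.93, "Mid": 0.05, "Lead": 0.02})
uncertain = assign_seniority({"Mid": 0.45, "Senior": 0.40, "Lead": 0.15})

# Coverage lift quoted in the text: 18.9% from the source, 71.2% after enrichment.
multiplier = round(71.2 / 18.9, 1)
```

Leaving the field null for the uncertain case is what keeps coverage at 71.2% rather than 100%: a missing value is preferred over a low-confidence label.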

Remote/work mode coverage at 79.6% reflects a structural shift that is worth understanding historically. Before 2020, virtually no postings carried a structured remote field because the concept barely existed. The 2020-2022 period was a messy transition in which "remote" appeared in descriptions but not as structured data. From 2023 onward, work mode is a standard field. The full timeline is in Remote Work by the Numbers: 2020-2025.

Extraction Fields

Field	Coverage	Scale
Skills (technical)	84.6%	37,000+ skills in taxonomy
Soft skills	73.9%	400+ categories
Benefits	51.4%	Health, dental, 401k, PTO
Certifications	9.1%	3,000+ certifications tracked

The 84.6% skills coverage means roughly 768 million records have structured, taxonomy-mapped skill arrays. This is high-speed dictionary matching against a 37,000+ taxonomy followed by contextual relevance filtering, not keyword search. "Java" in a barista posting gets filtered out. "Java" in a backend engineer posting gets kept. Certification coverage at 9.1% reflects reality: most postings do not require specific certifications. Among roles that do (nursing, IT security, finance, trades), extraction rates are substantially higher. The full extraction pipeline is described in 37,000 Skills: Extracting Signal from Noise.
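The two stages described above, dictionary matching followed by contextual filtering, can be sketched in a few lines of Python. Everything here is a simplified assumption: the two-entry taxonomy, the context-term sets, and the rule "keep a skill only if a related term appears in the title or description" are stand-ins for the real 37,000+ taxonomy and relevance filter.

```python
import re

# Hypothetical miniature taxonomy: skill -> terms that signal a relevant context.
SKILL_CONTEXT = {
    "Java": {"backend", "software", "engineer", "developer", "jvm"},
    "SQL": {"database", "backend", "analyst", "engineer", "data"},
}


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def dictionary_match(description):
    """Stage 1: find taxonomy skills mentioned in the description."""
    tokens = tokenize(description)
    return [skill for skill in SKILL_CONTEXT if skill.lower() in tokens]


def extract_skills(title, description):
    """Stage 2: keep a matched skill only if supporting context is present."""
    context_tokens = tokenize(f"{title} {description}")
    candidates = dictionary_match(description)
    return [skill for skill in candidates if SKILL_CONTEXT[skill] & context_tokens]


barista = extract_skills("Barista", "Serve java, espresso, and tea to morning customers.")
backend = extract_skills("Backend Engineer", "Build services in Java and tune SQL queries.")
print(barista)  # []
print(backend)  # ['Java', 'SQL']
```

A plain keyword search would stop after stage 1 and tag the barista posting with Java. This sketch handles single-word skills only; multi-word skills would need phrase matching.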

Salary

Stated salary coverage sits at 22.5% (203.9 million records) across the full 2022-2025 range. This number is rising rapidly due to US salary transparency laws: Colorado (2021), NYC (2022), California and Washington (2023), with continued expansion through 2025. The year-by-year impact of these laws on data availability is tracked in The Salary Transparency Data Shift.

For records without stated salary, a prediction model trained on 50M+ observations with MAPE under 15% provides estimates where sufficient context exists (valid state, zipcode, and SOC code required). The model returns no prediction rather than a bad one when prerequisites are missing. Details on how the model works are in Salary Prediction from 50 Million Observations.
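Two ideas in this paragraph lend themselves to a short Python sketch: how MAPE is computed, and how a prediction can be withheld when prerequisites are missing. The field names, the function names, and the stand-in model below are assumptions for illustration, and "valid" is simplified to "present and non-empty".

```python
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical field names for the three prerequisites named in the text.
REQUIRED_FIELDS = ("state", "zipcode", "soc_code")


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, expressed as a percentage."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


def predict_salary(record: Mapping[str, object],
                   model: Callable[[Mapping[str, object]], float]) -> Optional[float]:
    """Return a prediction only when every prerequisite field is present."""
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return None  # no prediction rather than a bad one
    return model(record)


def stand_in_model(record):
    """Placeholder for the real model; always returns the same value."""
    return 85_000.0


complete = {"state": "CO", "zipcode": "80202", "soc_code": "15-1252"}
incomplete = {"state": "CO", "zipcode": None, "soc_code": "15-1252"}

with_prediction = predict_salary(complete, stand_in_model)
without_prediction = predict_salary(incomplete, stand_in_model)
example_mape = mape([100_000, 80_000], [90_000, 88_000])  # two 10% misses
```

A MAPE under 15% means predictions miss the stated salary by less than 15% on average, measured on records where the true salary is known.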

What the Data Reveals

Healthcare Dominates Posting Volume

The top normalized titles are overwhelmingly healthcare:

Rank	Title	Records
1	Registered Nurse / Travel Nurse	7.5M
2	Registered Nurse	6.5M
3	Telemetry Travel RN	6.1M
4	Travel Physical Therapist	4.3M
5	L&D Travel RN	4.1M
6	Assistant Manager	3.9M
7	Customer Service Representative	3.8M

The table shows the top seven titles; across the top ten, six are nursing or allied health roles. This reflects the structural healthcare labor shortage since 2020: aging population, pandemic-accelerated burnout, and geographic maldistribution of providers. Travel nursing dominates because facilities compete for a mobile workforce, generating high posting volumes as contracts turn over every 8-13 weeks.

Source Quality Is Not Uniform

Source	Avg Description Words	Seniority Coverage	Skills Coverage
Greenhouse	739	83.6%	89.5%
MyWorkdayJobs	698	68.0%	85.5%
Indeed	497	90.0%	90.6%
LinkedIn	463	81.6%	89.8%
PJF	417	73.6%	81.6%
CareerBuilder	575	4.2%	97.5%

Greenhouse postings average 739 words, about 1.8 times the length of PJF aggregator listings at 417 words. Longer descriptions give NLP models more context, which generally supports higher extraction quality, although length is not the only factor: Indeed averages 497 words yet shows higher seniority (90.0%) and skills (90.6%) coverage than Greenhouse. CareerBuilder presents an interesting case: only 4.2% seniority coverage but 97.5% skills coverage, suggesting descriptions rich in technical requirements but lacking the title patterns seniority classifiers rely on.

Source provenance is preserved on every record, so source-aware analysis is straightforward. The data schema documents how source metadata is structured.

Coverage by Data Vintage

Coverage rates are not static. They shift with the age of the data, and understanding why prevents false conclusions.

Field	Pre-2022	2022-2023	2024+
Salary (stated)	mostly null	improving (transparency laws)	30-40% in covered states
Work mode	mostly null	40-70%	80%+
Seniority	lower coverage	improving	70%+
SOC code	lower coverage	improving	80%+
Skills	depends on desc length	improving	85%+

A null work_mode field in a 2019 record is not a data quality issue. A null work_mode in a 2024 record warrants investigation. The glossary defines terms like "data vintage" and "structural break" that are relevant here.
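This vintage-aware reading of nulls can be written down as a simple rule. The Python sketch below uses the year bands from the coverage table; the status labels and function name are hypothetical, and `work_mode` is the field name used in the text.

```python
def null_work_mode_status(posting_year, work_mode):
    """Classify a work_mode value by how surprising a null is for its vintage."""
    if work_mode is not None:
        return "populated"
    if posting_year < 2022:
        return "expected_null"      # field mostly null before 2022
    if posting_year < 2024:
        return "review_in_context"  # 40-70% coverage in 2022-2023
    return "investigate"            # 80%+ coverage from 2024 onward


print(null_work_mode_status(2019, None))  # expected_null
print(null_work_mode_status(2024, None))  # investigate
```

Applying a rule like this before computing trends helps avoid reading a schema-era gap as a change in employer behavior.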

The Quality Signal

Job market data quality is not a single number. It is a matrix: field by source by vintage by geography. A dataset that provides 95.1% state coverage also has corners where seniority is 4.2% (CareerBuilder) or salary coverage is minimal (PJF). A record from 2022 has different expected coverage than one from 2025.
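One way to look at a slice of that matrix is to compute coverage per (source, year) group rather than as a single global rate. The Python sketch below does this for one field. The five sample records are invented toy data, not real dataset rows, and the field names (`source`, `year`, `seniority`) are assumptions.

```python
from collections import defaultdict


def coverage_by(records, field, keys=("source", "year")):
    """Return the share of non-null values of `field` for each key group."""
    totals = defaultdict(int)
    filled = defaultdict(int)
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group] += 1
        if record.get(field) is not None:
            filled[group] += 1
    return {group: filled[group] / totals[group] for group in totals}


# Toy records for illustration only; values are made up.
sample = [
    {"source": "Greenhouse", "year": 2024, "seniority": "Senior"},
    {"source": "Greenhouse", "year": 2024, "seniority": "Mid"},
    {"source": "CareerBuilder", "year": 2024, "seniority": None},
    {"source": "CareerBuilder", "year": 2024, "seniority": "Entry"},
    {"source": "CareerBuilder", "year": 2022, "seniority": None},
]

seniority_coverage = coverage_by(sample, "seniority")
for group, rate in sorted(seniority_coverage.items()):
    print(group, f"{rate:.0%}")
```

Adding a geography key to `keys` extends the same function to the fourth dimension mentioned above.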

The value of a deeply enriched dataset is that coverage is measured, source provenance is preserved, confidence scores are available, and the enrichment pipeline applies consistently across all 907.5 million records. That consistency is what makes the data usable for rigorous analysis. You can explore the available datasets and compare what is included on the datasets page, or see pricing on the pricing page.

907.5 million records. 22 sources. 82 fields. Measured coverage on every one.

To see how this looks on actual records, request a sample.

Want to see the data for yourself?

Get a free sample of 5,000 enriched job records, delivered within 24 hours.