The Dataset at a Glance
907,539,631 records. Each one represents a unique job posting observation after multi-signal semantic deduplication across sources. The raw observation count, measured in the billions, is substantially higher before dedup. How that dedup works is a story in itself, covered in Deduplication Is Harder Than You Think.
22 sources. These range from the two largest global job boards (Indeed and LinkedIn) to employer ATS platforms (Workday, Greenhouse, Lever, iCIMS, SmartRecruiter) to regional aggregators and specialized feeds.
82 fields per record. Coverage varies by field, source, and data vintage, but the schema is consistent. Every record passes through the same enrichment pipeline. Full field definitions are in the data schema.
Four years of continuous coverage: 2022 through 2025, with volume growing each year.
Source Composition
Not all job data is created equal. The top sources by volume:
| Source | Records | Share |
|---|---|---|
| Indeed | 225.9M | 24.9% |
| PJF (aggregator) | 215.7M | 23.8% |
| LinkedIn | 176.4M | 19.4% |
| SimplyHired | 105.1M | 11.6% |
| Jora | 74.7M | 8.2% |
| CareerBuilder | 58.1M | 6.4% |
| MyWorkdayJobs | 11.8M | 1.3% |
| Google Jobs | 6.5M | 0.7% |
| SmartRecruiter | 2.2M | 0.2% |
| iCIMS | 2.0M | 0.2% |
| Greenhouse | 278K | <0.1% |
The dataset is not dependent on any single source. Indeed is the largest contributor at 24.9%, but the three largest sources each contribute between 19% and 25% of records and together account for 68.1%. If any single source degrades or restricts access, the dataset loses volume but does not collapse.
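The share figures can be reproduced from the record counts. A minimal sketch, using the total from the opening section and the three largest counts from the table; the variable and function names are illustrative only:

```python
TOTAL_RECORDS = 907_539_631

# Record counts for the three largest sources, in table order.
top_three = [225.9e6, 215.7e6, 176.4e6]


def share(records: float, total: int = TOTAL_RECORDS) -> float:
    """Return a source's share of all records as a percentage."""
    return 100.0 * records / total


shares = [round(share(r), 1) for r in top_three]
combined = round(share(sum(top_three)), 1)

print(shares)    # per-source share, in percent
print(combined)  # combined share of the three largest sources
```

No single entry in `shares` exceeds a quarter of the dataset, which is the concentration property the paragraph above describes.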
The mix includes both aggregator platforms (Indeed, LinkedIn, SimplyHired) and direct employer ATS feeds (Greenhouse, Lever, Workday/MyWorkdayJobs, iCIMS). ATS feeds are smaller in volume but higher in data quality, representing first-party employer data rather than re-aggregated scrapes. For a detailed comparison of what that quality difference looks like in practice, see ATS vs. Job Boards: A Data Quality Comparison.
Year-Over-Year Growth
| Year | Records | YoY Growth |
|---|---|---|
| 2022 | 179.9M | baseline |
| 2023 | 198.4M | +10.3% |
| 2024 | 214.6M | +8.2% |
| 2025 | 272.9M | partial year, tracking above 2024 |
The 2025 partial-year number (272.9M through late March) already exceeds full-year 2024. This reflects both expanding source coverage and increasing labor market activity.
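The growth column follows from the yearly record counts. A short sketch of that arithmetic, with counts in millions taken from the table; the names are illustrative:

```python
# Yearly record counts in millions, from the table above.
records_m = {2022: 179.9, 2023: 198.4, 2024: 214.6, 2025: 272.9}


def yoy_growth(current: float, previous: float) -> float:
    """Year-over-year growth as a percentage of the previous year."""
    return 100.0 * (current - previous) / previous


# Only full years get a growth rate; 2025 is a partial year.
growth = {
    year: round(yoy_growth(records_m[year], records_m[year - 1]), 1)
    for year in (2023, 2024)
}
partial_2025_exceeds_2024 = records_m[2025] > records_m[2024]

print(growth)
print(partial_2025_exceeds_2024)
```

The 2025 figure is compared directly against 2024 rather than annualized, since extrapolating a partial year would require assumptions about seasonality that the table does not support.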
Field Coverage: The 82-Field Anatomy
Identity and Core Fields
| Field | Coverage | Notes |
|---|---|---|
| Normalized title | 99.97% | Raw titles canonicalized via ML model |
| Company name | 96.6% | Gaps concentrated in two ATS sources |
| Country | 92.5% | Parsed from raw location strings |
| State | 95.1% | Highest coverage among location fields |
| City | 92.5% | Metro areas and multi-location strings are the main gap |
| Zipcode | 91.6% | Often inferred when not stated |
The 99.97% title normalization rate means only about 270,000 records out of 907.5 million lack a normalized title. This is the single most impactful enrichment for downstream analysis. Without it, "Sr. Software Eng II," "Senior Software Engineer," and "Snr SW Developer" are three different things. With it, they are one. The methodology page covers how normalization works, and Inside the Model Garden walks through the full stack of models behind classification and extraction.
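To make the effect concrete, here is a toy illustration of counting postings before and after normalization. The `CANONICAL` lookup is a hypothetical stand-in for the ML model, and the canonical label it maps to is an assumption chosen for the example:

```python
from collections import Counter

# Hypothetical lookup standing in for the ML normalization model.
# The canonical label is an assumption made for this example.
CANONICAL = {
    "Sr. Software Eng II": "Senior Software Engineer",
    "Senior Software Engineer": "Senior Software Engineer",
    "Snr SW Developer": "Senior Software Engineer",
}

raw_titles = [
    "Sr. Software Eng II",
    "Senior Software Engineer",
    "Snr SW Developer",
]

# Without normalization: three distinct groups of one posting each.
raw_counts = Counter(raw_titles)

# With normalization: one group of three postings.
normalized_counts = Counter(CANONICAL[title] for title in raw_titles)

print(len(raw_counts), len(normalized_counts))
```

Any aggregate built on titles (counts, salary medians, skill frequencies) inherits this difference, which is why the normalized field matters more than its size suggests.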
Classification Fields
| Field | Coverage | What It Does |
|---|---|---|
| SOC code | 79.6% | 6-digit occupational classification (BLS standard) |
| Seniority | 71.2% | Entry / Mid / Senior / Lead / Executive |
| Employment type | 70.2% | Full-time / Part-time / Contract / Temporary |
| Remote/work mode | 79.6% | Remote / Hybrid / On-site |
Only 18.9% of records arrive with seniority information from the source; the NLP pipeline brings that to 71.2%, a 3.8x multiplier. The pipeline assigns a classification only when confidence is sufficient. It does not guess.
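The "does not guess" behavior amounts to abstaining below a confidence threshold. A minimal sketch of that pattern, assuming the classifier emits a score per label; the threshold value and function name are hypothetical and not taken from the actual pipeline:

```python
from typing import Optional

SENIORITY_LEVELS = ("Entry", "Mid", "Senior", "Lead", "Executive")

# Assumed threshold for illustration; the pipeline's real value is not published here.
MIN_CONFIDENCE = 0.8


def assign_seniority(
    scores: dict[str, float], min_confidence: float = MIN_CONFIDENCE
) -> Optional[str]:
    """Return the top-scoring label, or None when confidence is too low."""
    if not scores:
        return None
    label, confidence = max(scores.items(), key=lambda item: item[1])
    return label if confidence >= min_confidence else None


confident = assign_seniority({"Senior": 0.93, "Lead": 0.05})
ambiguous = assign_seniority({"Mid": 0.45, "Senior": 0.40})
print(confident, ambiguous)
```

Abstaining trades coverage for precision: raising the threshold lowers the 71.2% figure but reduces the share of wrong labels among the records that do get one.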
Remote/work mode coverage at 79.6% reflects a structural shift in the data itself. Before 2020, virtually no postings carried a structured remote field because the concept barely existed. The 2020-2022 period was a messy transition in which "remote" appeared in descriptions but not as structured data. From 2023 onward, work mode is a standard field. The full timeline is in Remote Work by the Numbers: 2020-2025.
Extraction Fields
| Field | Coverage | Scale |
|---|---|---|
| Skills (technical) | 84.6% | 37,000+ skills in taxonomy |
| Soft skills | 73.9% | 400+ categories |
| Benefits | 51.4% | Health, dental, 401k, PTO |
| Certifications | 9.1% | 3,000+ certifications tracked |
The 84.6% skills coverage means roughly 768 million records have structured, taxonomy-mapped skill arrays. This is dictionary matching (Aho-Corasick against a taxonomy of 37,000+ skills) followed by contextual relevance filtering, not keyword search. "Java" in a barista posting gets filtered out. "Java" in a backend engineer posting gets kept. Certification coverage at 9.1% reflects reality: most postings do not require specific certifications. Among roles that do (nursing, IT security, finance, trades), extraction rates are substantially higher. The full extraction pipeline is described in 37,000 Skills: Extracting Signal from Noise.
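The two-stage idea can be sketched in a few dozen lines. The matcher below is a compact, textbook Aho-Corasick automaton; the four-entry `TAXONOMY` and the `CONTEXT_REQUIRED` rule (keep an ambiguous skill only when the title contains a related word) are hypothetical stand-ins for the real taxonomy and the real contextual filter:

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton for multi-pattern matching."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link for each state
        self.out = [[]]    # patterns that end at each state
        for pattern in patterns:
            self._insert(pattern.lower())
        self._build_failure_links()

    def _insert(self, pattern):
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first: depth-1 states keep the root as their failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, text):
        """Yield (start_index, pattern) for whole-word matches in text."""
        text = text.lower()
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pattern in self.out[state]:
                start = i - len(pattern) + 1
                before_ok = start == 0 or not text[start - 1].isalnum()
                after_ok = i + 1 == len(text) or not text[i + 1].isalnum()
                if before_ok and after_ok:
                    yield start, pattern


# Hypothetical four-entry taxonomy; the real one has 37,000+ skills.
TAXONOMY = ["java", "python", "sql", "customer service"]

# Hypothetical context rule: ambiguous skills need a related word in the title.
CONTEXT_REQUIRED = {"java": {"engineer", "developer", "programmer"}}


def extract_skills(title, description, matcher):
    """Match taxonomy terms, then drop ambiguous ones lacking title context."""
    title_words = set(title.lower().split())
    kept = []
    for _, skill in matcher.find(description):
        required = CONTEXT_REQUIRED.get(skill)
        if required is not None and not (required & title_words):
            continue  # ambiguous term with no supporting context
        if skill not in kept:
            kept.append(skill)
    return kept


matcher = AhoCorasick(TAXONOMY)
engineer = extract_skills(
    "Backend Engineer", "We build services in Java and SQL.", matcher
)
barista = extract_skills(
    "Barista", "Serve java and espresso with great customer service.", matcher
)
print(engineer, barista)
```

The automaton scans each description once regardless of taxonomy size, which is what makes dictionary matching practical at this scale; the filtering step is where the real pipeline presumably does something richer than a title-word check.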
Salary
Stated salary coverage sits at 22.5% (203.9 million records) across the full 2022-2025 range. This number is rising rapidly due to US salary transparency laws: Colorado (2021), NYC (2022), California and Washington (2023), with continued expansion through 2025. The year-by-year impact of these laws on data availability is tracked in The Salary Transparency Data Shift.
For records without stated salary, a prediction model trained on 50M+ observations, with a mean absolute percentage error (MAPE) under 15%, provides estimates where sufficient context exists (valid state, zipcode, and SOC code required). The model returns no prediction rather than a bad one when prerequisites are missing. Details on how the model works are in Salary Prediction from 50 Million Observations.
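Two pieces of that paragraph are easy to make concrete: the prerequisite gate and the error metric. The sketch below assumes field names `state`, `zipcode`, and `soc_code` and treats the model as an opaque callable; both are illustrative assumptions, and the sample values are made up:

```python
from typing import Callable, Optional, Sequence

# Assumed field names for the three prerequisites named in the text.
REQUIRED_FIELDS = ("state", "zipcode", "soc_code")


def predict_salary(
    record: dict, model: Callable[[dict], float]
) -> Optional[float]:
    """Return a prediction only when every prerequisite field is present."""
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return None  # no prediction rather than a bad one
    return model(record)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def toy_model(record: dict) -> float:
    """Placeholder for the real model; returns a constant."""
    return 120_000.0


complete = {"state": "CO", "zipcode": "80202", "soc_code": "15-1252"}
missing_zip = {"state": "CO", "zipcode": None, "soc_code": "15-1252"}

print(predict_salary(complete, toy_model))
print(predict_salary(missing_zip, toy_model))
print(mape([100_000, 80_000], [90_000, 88_000]))
```

In the last line each prediction is off by 10% of the true value, so the MAPE is 10%, which would sit inside the "under 15%" bound quoted above.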
What the Data Reveals
Healthcare Dominates Posting Volume
The top normalized titles are overwhelmingly healthcare. The seven largest by record count:
| Rank | Title | Records |
|---|---|---|
| 1 | Registered Nurse / Travel Nurse | 7.5M |
| 2 | Registered Nurse | 6.5M |
| 3 | Telemetry Travel RN | 6.1M |
| 4 | Travel Physical Therapist | 4.3M |
| 5 | L&D Travel RN | 4.1M |
| 6 | Assistant Manager | 3.9M |
| 7 | Customer Service Representative | 3.8M |
Six of the top ten titles are nursing or allied health roles. This reflects the structural healthcare labor shortage since 2020: aging population, pandemic-accelerated burnout, and geographic maldistribution of providers. Travel nursing dominates because facilities compete for a mobile workforce, generating high posting volumes as contracts turn over every 8-13 weeks.
Source Quality Is Not Uniform
| Source | Avg Description Words | Seniority Coverage | Skills Coverage |
|---|---|---|---|
| Greenhouse | 739 | 83.6% | 89.5% |
| MyWorkdayJobs | 698 | 68.0% | 85.5% |
| Indeed | 497 | 90.0% | 90.6% |
| LinkedIn | 463 | 81.6% | 89.8% |
| PJF | 417 | 73.6% | 81.6% |
| CareerBuilder | 575 | 4.2% | 97.5% |
Greenhouse postings average 739 words, about 1.8 times the length of PJF aggregator listings at 417 words. Longer descriptions give NLP models more context, which generally helps extraction, but length is not the only factor: Indeed reaches 90.0% seniority coverage at an average of 497 words. CareerBuilder presents an interesting case: only 4.2% seniority coverage but 97.5% skills coverage, suggesting descriptions rich in technical requirements but lacking the title patterns seniority classifiers rely on.
Source provenance is preserved on every record, so source-aware analysis is straightforward. The data schema documents how source metadata is structured.
Coverage by Data Vintage
Coverage rates are not static. They shift with the age of the data, and understanding why prevents false conclusions.
| Field | Pre-2022 | 2022-2023 | 2024+ |
|---|---|---|---|
| Salary (stated) | mostly null | improving (transparency laws) | 30-40% in covered states |
| Work mode | mostly null | 40-70% | 80%+ |
| Seniority | lower coverage | improving | 70%+ |
| SOC code | lower coverage | improving | 80%+ |
| Skills | depends on desc length | improving | 85%+ |
A null work_mode field in a 2019 record is not a data quality issue. A null work_mode in a 2024 record warrants investigation. The glossary defines terms like "data vintage" and "structural break" that are relevant here.
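That rule of thumb can be written as a small triage function. A sketch under stated assumptions: `posted_year` is a hypothetical field name, the status labels are invented for the example, and the year boundaries follow the vintage columns in the table above:

```python
def work_mode_null_status(record: dict) -> str:
    """Classify a record's work_mode value by how surprising a null is."""
    if record.get("work_mode") is not None:
        return "populated"
    year = record["posted_year"]  # hypothetical field name
    if year < 2022:
        return "expected"      # structured work mode was mostly absent
    if year < 2024:
        return "plausible"     # transition period, coverage 40-70%
    return "investigate"       # coverage is 80%+, so a null is unusual


old_null = work_mode_null_status({"posted_year": 2019, "work_mode": None})
new_null = work_mode_null_status({"posted_year": 2024, "work_mode": None})
print(old_null, new_null)
```

The same pattern extends to salary, seniority, and SOC code by swapping in each field's own vintage thresholds.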
The Quality Signal
Job market data quality is not a single number. It is a matrix: field by source by vintage by geography. A dataset that provides 95.1% state coverage also has corners where seniority is 4.2% (CareerBuilder) or salary coverage is minimal (PJF). A record from 2022 has different expected coverage than one from 2025.
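One way to work with that matrix is to compute coverage per cell rather than per dataset. A minimal sketch using only the standard library; the record layout (`source`, `year`, and the field being measured) and the toy records are assumptions for illustration, not actual data:

```python
from collections import defaultdict


def coverage_matrix(records, field, keys=("source", "year")):
    """Percent of records with a non-null `field`, grouped by `keys`."""
    totals = defaultdict(int)
    filled = defaultdict(int)
    for record in records:
        cell = tuple(record[key] for key in keys)
        totals[cell] += 1
        if record.get(field) is not None:
            filled[cell] += 1
    return {cell: 100.0 * filled[cell] / totals[cell] for cell in totals}


# Toy records with placeholder source names.
toy_records = [
    {"source": "source_a", "year": 2022, "seniority": "Senior"},
    {"source": "source_a", "year": 2022, "seniority": None},
    {"source": "source_b", "year": 2025, "seniority": "Mid"},
]

matrix = coverage_matrix(toy_records, "seniority")
print(matrix)
```

Adding a geography key to `keys` gives the fourth dimension mentioned above without changing the function.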
The value of a deeply enriched dataset is that coverage is measured, source provenance is preserved, confidence scores are available, and the enrichment pipeline applies consistently across all 907.5 million records. That consistency is what makes the data usable for rigorous analysis. You can explore the available datasets and compare what is included on the datasets page, or see pricing on the pricing page.
907.5 million records. 22 sources. 82 fields. Measured coverage on every one.
To see how this looks on actual records, request a sample.