Not All Job Data Sources Are Created Equal
If you are building an HR tech product, training a labor market model, or analyzing workforce trends, you need job market data. But most vendors will not tell you this: the quality gap between the best and worst sources is enormous, and aggregate metrics hide the problem.
We ingest data from over 200,000 employer career portals and every major job board. Our pipeline processes 4.5 billion raw observations into 907.5 million deduplicated records across 82 fields. Along the way, we have learned exactly how data quality varies across source types. That knowledge shapes everything from our deduplication logic to our NLP enrichment priorities.
The Source Landscape: Three Tiers
Our pipeline ingests from three broad categories. Here is how volume breaks down in stage050, the final delivery table:
Tier 1: Major Job Boards. Indeed contributes approximately 225.9 million unique jobs, making it the single largest source. LinkedIn follows with 176.4 million. A major professional job feed (PJF) contributes 215.7 million.
Tier 2: Direct ATS Feeds. Career portals hosted on Workday, Greenhouse, Lever, and SmartRecruiters. Workday alone accounts for over 11.8 million records. Across 200,000+ employer portals, ATS feeds collectively represent a substantial and growing share.
Tier 3: Aggregators. Google Jobs, Jobs2Careers, Jora, and similar services re-aggregate data from other sources. The volume hierarchy is clear: major boards dominate by raw record count. But volume is not quality. In fact, there is often an inverse relationship. The sources with the highest raw counts also tend to have the highest duplication rates and the lowest per-record enrichment quality.
Company Name Coverage: Where Extraction Gaps Hide
Company name is a foundational field. Without it, you cannot do company-level analytics, match to firmographic databases, or build employer-specific trend data.
The quality gap across sources is stark. Indeed, LinkedIn, and most major boards deliver company names on the vast majority of records. Greenhouse, Lever, and established ATS platforms reliably include employer identity.
Then there are the gaps:
| Source | Company Name Null Rate | Records Affected | Root Cause |
|---|---|---|---|
| gem | 99.2% | 1.26M | Company name in URL path (e.g., jobs.gem.com/twitch/) but not extracted |
| myworkdayjobs | 95.9% | 256.5M observations | Company name in URL subdomain but not extracted |
| Overall delivery | 3.4% null | 30.8M records | Mix of extraction gaps and genuinely missing data |
These are ETL extraction gaps, not inherent data quality issues. URL-based company extraction will recover a significant portion of those 30.8 million records. The lesson for data buyers: always ask whether company names are extracted from all available signals, not just the structured field in the raw feed.
Description Length Drives Enrichment Quality
Job description length is the single best predictor of downstream enrichment quality. Longer descriptions give our NLP models more signal for SOC classification, skills extraction, and seniority detection.
| Description Length | Skills Extraction Rate |
|---|---|
| Under 50 words | 47.5% |
| 50 to 100 words | ~65% |
| 100 to 200 words | 81.6% |
| 200 to 500 words | ~95% |
| 500+ words | 99.5% |
Our overall average across 599.3 million processed descriptions is 486 words, comfortably above the 200-word threshold where NLP enrichment becomes reliable. But this average masks significant variation by source.
ATS feeds, particularly Greenhouse and Lever postings, tend to include detailed descriptions with bulleted requirements and qualifications. These structured descriptions are optimal for NLP extraction. Job board descriptions are more variable. Indeed reformats employer descriptions, sometimes truncating them. Aggregators often serve shortened versions.
The practical implication: a dataset of 100 million Greenhouse postings with 500-word average descriptions will yield better enrichment than 200 million aggregator postings averaging 150 words. For details on how our Model Garden handles varying input quality, see the linked post.
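The word-count thresholds above can be expressed as a simple lookup that estimates expected skills-extraction reliability for a given description. The bucket rates come straight from the table; the function itself is a sketch, not part of our pipeline.

```python
# (upper word-count bound, observed skills-extraction rate) from the table above.
BUCKETS = [(50, 0.475), (100, 0.65), (200, 0.816), (500, 0.95), (float("inf"), 0.995)]

def expected_extraction_rate(description: str) -> float:
    """Map a raw job description to its expected skills-extraction rate."""
    words = len(description.split())
    for upper, rate in BUCKETS:
        if words < upper:
            return rate
    return BUCKETS[-1][1]
```

A gate like this is useful for triage: descriptions landing in the lowest bucket can be routed to re-scraping or flagged as low-confidence before enrichment runs.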
Deduplication Rates Tell the Real Story
Our pipeline uses semantic deduplication combining vector similarity, MinHash company matching, title similarity modeling, and geo-clustering with graph-based transitive resolution. You can read more about the approach in our methodology. The dedup rates by source tell you a lot about what you are actually buying:
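The graph-based transitive resolution step mentioned above can be sketched with a union-find structure: if the similarity models flag A~B and B~C as duplicate pairs, all three collapse into one cluster even though A and C were never directly compared. This is a minimal illustration of the technique, not our production implementation.

```python
class UnionFind:
    """Union-find with path halving for transitive duplicate resolution."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def resolve_clusters(edges):
    """Collapse pairwise duplicate edges into transitive clusters."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    clusters = {}
    for node in uf.parent:
        clusters.setdefault(uf.find(node), set()).add(node)
    return list(clusters.values())
```

Each resulting cluster becomes one unique job; the cluster size distribution is where per-source "scrapes per job" numbers like the ones below come from.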
| Source | Raw Observations | Unique Jobs | Dedup Rate | Avg Scrapes per Job |
|---|---|---|---|---|
| Indeed | 2.06B | 225.9M | 89.1% | 9.1 |
| LinkedIn | 987.8M | 176.4M | 82.1% | 5.6 |
| Direct ATS | Varies | Varies | <2% | ~1.0 |
| Jobs2Careers | - | - | 0% | N/A (volatile URL identifiers) |
Our overall pipeline dedup ratio is 79.7%. If your vendor counts "job postings" without disclosing dedup methodology, you have no idea how many unique jobs you are actually getting. A dataset of 2 billion Indeed observations sounds impressive until you realize it represents 226 million unique jobs.
The Jobs2Careers anomaly is worth noting: a 0% dedup rate means every observation generates a unique key, likely because URL-based identifiers change on every scrape. This kind of source-specific behavior is invisible in aggregate metrics. Understanding it requires source-level quality tracking.
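One standard mitigation for volatile URL identifiers is canonicalizing the URL before deriving a dedup key: strip tracking and session parameters, sort what remains, and hash. The parameter list below is hypothetical; a real pipeline would learn volatile parameters per source from observed churn.

```python
import hashlib
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical volatile parameters that change between scrapes.
VOLATILE_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sid", "t"}

def stable_job_key(url: str) -> str:
    """Hash a canonicalized URL so re-scrapes of the same posting collide."""
    parsed = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parsed.query) if k not in VOLATILE_PARAMS]
    canonical = urlunparse(parsed._replace(query=urlencode(sorted(kept)), fragment=""))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

When even canonicalization fails, because the path itself churns, content-based keys (company, title, location, description hash) are the fallback, which is exactly what semantic dedup provides.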
NLP Enrichment Is Not Uniform Across Sources
Across 493.9 million NLP-processed records, our coverage numbers are strong: 100% SOC code coverage, 100% seniority on model-garden-v1.0.0, and effectively 100% remote work status. But not all records have been processed.
Our NLP pipeline has covered 493.9 million of 717.9 million extracted records, leaving approximately 224 million still awaiting full enrichment. The gap is not uniform:
- SimplyHired has a known processing stall affecting 12.2 million records from June through December 2025
- Lever shows a 31.6% gap on records from the past 180 days
These are pipeline throughput issues, not model quality issues. The distinction matters: a model that achieves 0% nulls on processed records but has reached only 69% of the corpus reflects a scheduling problem, not a quality problem.
The question to ask your vendor: what percentage of records have SOC codes, and how does that vary by source, vintage, and region? Aggregate coverage numbers are misleading when processing is unevenly distributed. For a detailed breakdown, see our datasets page.
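The per-source breakdown that question demands can be computed with a few lines. Field names (`source`, `soc_code`) are illustrative; the same grouping extends to vintage and region.

```python
from collections import defaultdict

def soc_coverage_by_source(records):
    """SOC-code coverage broken down per source rather than in aggregate."""
    totals, covered = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["source"]] += 1
        covered[r["source"]] += r.get("soc_code") is not None
    return {s: covered[s] / totals[s] for s in totals}
```

Running this on a vendor sample is a quick sanity check: an aggregate "100% coverage" claim should survive being sliced by source and month.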
Salary Coverage: A Structural Transformation
Salary data availability has shifted dramatically since 2021, driven by US state transparency laws:
| Year | Salary Coverage (stated) | Key Events |
|---|---|---|
| 2022 | 31.4% | Colorado law in effect, NYC law takes effect Nov 2022 |
| 2023 | 20.4% | CA, WA, NY State laws take effect |
| 2024 | 20.3% | Hawaii law takes effect |
| 2025 | 23.0% | Illinois, Minnesota laws take effect |
The counterintuitive decline from 2022 to 2023 reflects a source composition shift, not a data quality failure. One of our largest feeds had near-100% salary coverage in 2022 but experienced a format change that sharply reduced its stated-salary rate, dragging down the aggregate. When we isolate Indeed specifically, salary coverage improved from 51.1% in 2022 to 88.8% in 2023.
This is a critical lesson: aggregate metrics can obscure source-level trends. A single source's format change can swing overall coverage by 10+ percentage points. For sources without stated salary, our prediction model fills the gap with under 15% MAPE when state, zip code, and SOC code are available.
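The under-15% MAPE figure uses the standard mean absolute percentage error; a minimal version for checking predicted against stated salaries:

```python
def mape(actual, predicted):
    """Mean absolute percentage error over paired salary values."""
    assert len(actual) == len(predicted) and actual
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)
```

Note that MAPE weights errors relative to the true salary, so a $10k miss on a $100k role counts twice as heavily as the same miss on a $200k role.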
The ATS Advantage
Across every quality metric, direct ATS feeds consistently outperform aggregated sources:
- Employer-verified content. No reformatting, no truncation, no re-aggregation errors.
- Near-zero internal duplication. Compare to Indeed's 89% or Google Jobs' pathological duplication.
- Structured metadata. ATS platforms enforce fields like location, department, and salary range.
- Freshness guarantees. ATS feeds update in near-real-time. Board scrapers may lag by days or weeks.
This is why coverage of 200,000+ employer ATS portals is a strategic asset, not just a volume play. These feeds provide ground truth that calibrates NLP models and validates deduplication logic.
What Data Buyers Should Ask
If you are evaluating job market data providers, these five questions will separate the serious vendors from the ones selling volume:
- Source composition. 500 million records from Google Jobs is not comparable to 500 million from ATS feeds.
- Source-level metrics. Null rates, description lengths, and dedup rates should be available per source, not just in aggregate.
- Dedup methodology. Hash-based dedup catches exact duplicates. Semantic dedup catches near-duplicates. The difference can be 20 to 30 percentage points. See our glossary for definitions.
- Freshness per source. A provider might have fresh Indeed data but stale ATS feeds.
- Enrichment depth. A fully enriched record with normalized title, SOC code, seniority, skills, and predicted salary is worth more than ten raw records with just title and location.
To see how this looks in practice, explore source-level quality metrics with a sample dataset, or compare raw and enriched records side by side in our comparison tool.
Want to see the data for yourself?
Get a free sample of 5,000 enriched job records, delivered within 24 hours.