Not All Job Data Sources Are Created Equal
If you are building an HR tech product, training a labor market model, or analyzing workforce trends, you need job market data. But most vendors will not tell you this: the quality gap between the best and worst sources is enormous, and aggregate metrics hide the problem.
We ingest data from over 200,000 employer career portals and every major job board. Our pipeline processes 4.5 billion raw observations into 907.5 million deduplicated records across 82 fields. Along the way, we have learned exactly how data quality varies across source types. That knowledge shapes everything from our deduplication logic to our NLP enrichment priorities.
The Source Landscape: Three Tiers
Our pipeline ingests from three broad categories. Here is how volume breaks down in stage050, the final delivery table:
Tier 1: Major Job Boards. Indeed contributes approximately 225.9 million unique jobs, making it the single largest source. LinkedIn follows with 176.4 million. A major professional job feed (PJF) contributes 215.7 million.
Tier 2: Direct ATS Feeds. Career portals hosted on Workday, Greenhouse, Lever, and SmartRecruiters. Workday alone accounts for over 11.8 million records. Across 200,000+ employer portals, ATS feeds collectively represent a substantial and growing share.
Tier 3: Aggregators. Google Jobs, Jobs2Careers, Jora, and similar services re-aggregate data from other sources. The volume hierarchy is clear: major boards dominate by raw record count. But volume is not quality. In fact, there is often an inverse relationship. The sources with the highest raw counts also tend to have the highest duplication rates and the lowest per-record enrichment quality.
Company Name Coverage: Where Extraction Gaps Hide
Company name is a foundational field. Without it, you cannot do company-level analytics, match to firmographic databases, or build employer-specific trend data.
The quality gap across sources is stark. Indeed, LinkedIn, and most major boards deliver company names on the vast majority of records. Greenhouse, Lever, and established ATS platforms reliably include employer identity.
Then there are the gaps:
| Source | Company Name Null Rate | Records Affected | Root Cause |
|---|---|---|---|
| gem | 99.2% | 1.26M | Company name in URL path (e.g., jobs.gem.com/twitch/) but not extracted |
| myworkdayjobs | 95.9% | 256.5M observations | Company name in URL subdomain but not extracted |
| Overall delivery | 3.4% null | 30.8M records | Mix of extraction gaps and genuinely missing data |
These are ETL extraction gaps, not inherent data quality issues. URL-based company extraction will recover a significant portion of those 30.8 million records. The lesson for data buyers: always ask whether company names are extracted from all available signals, not just the structured field in the raw feed.
Description Length Drives Enrichment Quality
Job description length is the single best predictor of downstream enrichment quality. Longer descriptions give our NLP models more signal for SOC classification, skills extraction, and seniority detection.
| Description Length | Skills Extraction Rate |
|---|---|
| Under 50 words | 47.5% |
| 50 to 100 words | ~65% |
| 100 to 200 words | 81.6% |
| 200 to 500 words | ~95% |
| 500+ words | 99.5% |
Our overall average across 599.3 million processed descriptions is 486 words, comfortably above the 200-word threshold where NLP enrichment becomes reliable. But this average masks significant variation by source.
ATS feeds, particularly Greenhouse and Lever postings, tend to include detailed descriptions with bulleted requirements and qualifications. These structured descriptions are optimal for NLP extraction. Job board descriptions are more variable. Indeed reformats employer descriptions, sometimes truncating them. Aggregators often serve shortened versions.
The practical implication: a dataset of 100 million Greenhouse postings with 500-word average descriptions will yield better enrichment than 200 million aggregator postings averaging 150 words. For details on how our Model Garden handles varying input quality, see the linked post.
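The word-count thresholds above can be expressed as a simple lookup that estimates expected skills-extraction reliability for a given description. The bucket rates come straight from the table; the function itself is a sketch, not part of our pipeline.

```python
# (upper word-count bound, observed skills-extraction rate) from the table above.
BUCKETS = [(50, 0.475), (100, 0.65), (200, 0.816), (500, 0.95), (float("inf"), 0.995)]

def expected_extraction_rate(description: str) -> float:
    """Map a raw job description to its expected skills-extraction rate."""
    words = len(description.split())
    for upper, rate in BUCKETS:
        if words < upper:
            return rate
    return BUCKETS[-1][1]
```

A gate like this is useful for triage: descriptions landing in the lowest bucket can be routed to re-scraping or flagged as low-confidence before enrichment runs.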
Deduplication Rates Tell the Real Story
Our pipeline uses semantic deduplication combining vector similarity, MinHash company matching, title similarity modeling, and geo-clustering with graph-based transitive resolution. You can read more about the approach in our methodology. The dedup rates by source tell you a lot about what you are actually buying:
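The graph-based transitive resolution step mentioned above can be sketched with a union-find structure: if the similarity models flag A~B and B~C as duplicate pairs, all three collapse into one cluster even though A and C were never directly compared. This is a minimal illustration of the technique, not our production implementation.

```python
class UnionFind:
    """Union-find with path halving for transitive duplicate resolution."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def resolve_clusters(edges):
    """Collapse pairwise duplicate edges into transitive clusters."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    clusters = {}
    for node in uf.parent:
        clusters.setdefault(uf.find(node), set()).add(node)
    return list(clusters.values())
```

Each resulting cluster becomes one unique job; the cluster size distribution is where per-source "scrapes per job" numbers like the ones below come from.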
| Source | Raw Observations | Unique Jobs | Dedup Rate | Avg Scrapes per Job |
|---|---|---|---|---|
| Indeed | 2.06B | 225.9M | 89.1% | 9.1 |
| LinkedIn | 987.8M | 176.4M | 82.1% | 5.6 |
| Direct ATS | Varies | Varies | <2% | ~1.0 |
| Jobs2Careers | - | - | 0% | N/A (volatile URL identifiers) |
Our overall pipeline dedup ratio is 79.7%. If your vendor counts "job postings" without disclosing dedup methodology, you have no idea how many unique jobs you are actually getting. A dataset of 2 billion Indeed observations sounds impressive until you realize it represents 226 million unique jobs.
The Jobs2Careers anomaly is worth noting: a 0% dedup rate means every observation generates a unique key, likely because URL-based identifiers change on every scrape. This kind of source-specific behavior is invisible in aggregate metrics. Understanding it requires source-level quality tracking.
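One standard mitigation for volatile URL identifiers is canonicalizing the URL before deriving a dedup key: strip tracking and session parameters, sort what remains, and hash. The parameter list below is hypothetical; a real pipeline would learn volatile parameters per source from observed churn.

```python
import hashlib
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical volatile parameters that change between scrapes.
VOLATILE_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sid", "t"}

def stable_job_key(url: str) -> str:
    """Hash a canonicalized URL so re-scrapes of the same posting collide."""
    parsed = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parsed.query) if k not in VOLATILE_PARAMS]
    canonical = urlunparse(parsed._replace(query=urlencode(sorted(kept)), fragment=""))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

When even canonicalization fails, because the path itself churns, content-based keys (company, title, location, description hash) are the fallback, which is exactly what semantic dedup provides.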
NLP Enrichment Is Not Uniform Across Sources
Across 493.9 million NLP-processed records, our coverage numbers are strong: 100% SOC code coverage, 100% seniority on model-garden-v1.0.0, and effectively 100% remote work status. But not all records have been processed.
Our NLP pipeline has covered 493.9 million of 717.9 million extracted records, leaving approximately 224 million still awaiting full enrichment. The gap is not uniform:
- SimplyHired has a known processing stall affecting 12.2 million records from June through December 2025
- Lever shows a 31.6% gap on records from the past 180 days
These are pipeline throughput issues, not model quality issues. The distinction matters: a model that achieves 0% nulls on processed records but has reached only 69% of the corpus reflects a scheduling problem, not a quality problem.
The question to ask your vendor: what percentage of records have SOC codes, and how does that vary by source, vintage, and region? Aggregate coverage numbers are misleading when processing is unevenly distributed. For a detailed breakdown, see our datasets page.
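The per-source breakdown that question demands can be computed with a few lines. Field names (`source`, `soc_code`) are illustrative; the same grouping extends to vintage and region.

```python
from collections import defaultdict

def soc_coverage_by_source(records):
    """SOC-code coverage broken down per source rather than in aggregate."""
    totals, covered = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["source"]] += 1
        covered[r["source"]] += r.get("soc_code") is not None
    return {s: covered[s] / totals[s] for s in totals}
```

Running this on a vendor sample is a quick sanity check: an aggregate "100% coverage" claim should survive being sliced by source and month.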
Salary Coverage: A Structural Transformation
Salary data availability has shifted dramatically since 2021, driven by US state transparency laws:
| Year | Salary Coverage (stated) | Key Events |
|---|---|---|
| 2022 | 31.4% | Colorado law in effect, NYC law takes effect Nov 2022 |
| 2023 | 20.4% | CA, WA, NY State laws take effect |
| 2024 | 20.3% | Hawaii law takes effect |
| 2025 | 23.0% | Illinois, Minnesota laws take effect |
The counterintuitive decline from 2022 to 2023 reflects a source composition shift, not a data quality failure. One of our largest feeds had near-100% salary coverage in 2022 but experienced a format change that sharply reduced its stated-salary rate, dragging down the aggregate. When we isolate Indeed specifically, salary coverage improved from 51.1% in 2022 to 88.8% in 2023.
This is a critical lesson: aggregate metrics can obscure source-level trends. A single source's format change can swing overall coverage by 10+ percentage points. For sources without stated salary, our prediction model fills the gap with under 15% MAPE when state, zip code, and SOC code are available.
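The under-15% MAPE figure uses the standard mean absolute percentage error; a minimal version for checking predicted against stated salaries:

```python
def mape(actual, predicted):
    """Mean absolute percentage error over paired salary values."""
    assert len(actual) == len(predicted) and actual
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)
```

Note that MAPE weights errors relative to the true salary, so a $10k miss on a $100k role counts twice as heavily as the same miss on a $200k role.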
The ATS Advantage
Across every quality metric, direct ATS feeds consistently outperform aggregated sources:
- Employer-verified content. No reformatting, no truncation, no re-aggregation errors.
- Near-zero internal duplication. Compare to Indeed's 89% or Google Jobs' pathological duplication.
- Structured metadata. ATS platforms enforce fields like location, department, and salary range.
- Freshness guarantees. ATS feeds update in near-real-time. Board scrapers may lag by days or weeks.
This is why coverage of 200,000+ employer ATS portals is a strategic asset, not just a volume play. These feeds provide ground truth that calibrates NLP models and validates deduplication logic.
What Data Buyers Should Ask
If you are evaluating job market data providers, these five questions will separate the serious vendors from the ones selling volume:
- Source composition. 500 million records from Google Jobs is not comparable to 500 million from ATS feeds.
- Source-level metrics. Null rates, description lengths, and dedup rates should be available per source, not just in aggregate.
- Dedup methodology. Hash-based dedup catches exact duplicates. Semantic dedup catches near-duplicates. The difference can be 20 to 30 percentage points. See our glossary for definitions.
- Freshness per source. A provider might have fresh Indeed data but stale ATS feeds.
- Enrichment depth. A fully enriched record with normalized title, SOC code, seniority, skills, and predicted salary is worth more than ten raw records with just title and location.
To see how this looks in practice, explore source-level quality metrics with a sample dataset, or compare raw and enriched records side by side in our comparison tool.
Want to see the data for yourself?
Get a free sample of 5,000 enriched job records, delivered within 24 hours.