Canaria

Research-Grade Data Quality

Every record in Canaria's dataset passes through a multi-stage enrichment and validation pipeline. This page documents the coverage benchmarks, accuracy metrics, and known limitations that quant funds, enterprise teams, and academic researchers need to evaluate before using the data.

At a Glance

  • 900M+: unique postings after deduplication

  • 56%: average dedup rate across all sources

  • Daily to hourly: update frequency

  • 82: enriched fields per record

Enrichment Field Accuracy

Benchmarks are measured on production data. Coverage rates are reported for 2023+ data unless otherwise noted. Pre-2020 coverage is lower, which is expected because older postings are shorter and lack many of the fields that later became standard.

Field | Coverage | Accuracy | Notes
SOC code (6-digit) | >90% (2023+) | 85-92% | Uses title plus description context. Accuracy is for descriptions over 200 words.
SOC code (2-digit) | >90% (2023+) | >95% | Major group level. Aligned with BLS 2018 SOC taxonomy.
Predicted salary | >85% non-null (2023+) | MAPE <15% | Trained on 50M+ salary observations. Requires valid state, zip code, and SOC. Industry average MAPE is 15-25%.
Skills extraction | >80% coverage | F1 75-87% | 37,000+ skills taxonomy. Two-step extraction: dictionary matching followed by NLP relevance filtering.
Seniority level | ~75% non-null in delivery | 100% complete | Model always returns a classification (Entry, Mid, Senior, Lead, Manager, Director, Executive). Delivery null rates reflect pre-2020 vintage data.
Work mode (remote/hybrid/on-site) | >85% (2022+) | 92-97% | NULL is expected and correct for pre-2020 data. Structured work mode fields did not exist before 2021.
Location (city/state) | >95% parse success | City: 85-93%, State: 92-97% | Geocoded to lat/lng. Parse failures are tracked and reported separately.
Title normalization | >90% mapped | >90% | Canonicalizes abbreviations, Roman numerals, and level indicators. Confidence scores included for downstream filtering.
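
For readers who want to reproduce the two headline metrics in the table on their own labeled sample, the following is a minimal sketch of MAPE (used for predicted salary) and per-posting F1 (used for skills extraction). The function names and the sample values are illustrative assumptions, not part of Canaria's tooling, and the exact evaluation protocol behind the published figures is not specified on this page.

```python
from typing import Iterable, Set, Tuple


def mape(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean absolute percentage error over (actual, predicted) pairs, in percent."""
    # Pairs with a zero actual value are skipped to avoid division by zero.
    errors = [abs(actual - predicted) / abs(actual)
              for actual, predicted in pairs if actual != 0]
    if not errors:
        raise ValueError("no usable (actual, predicted) pairs")
    return 100.0 * sum(errors) / len(errors)


def f1_score(expected: Set[str], extracted: Set[str]) -> float:
    """F1 for one posting: harmonic mean of precision and recall over skill sets."""
    true_positives = len(expected & extracted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample: (stated salary, predicted salary) for three postings.
salary_pairs = [(100_000, 90_000), (80_000, 88_000), (50_000, 50_000)]
print(round(mape(salary_pairs), 2))

# Hypothetical sample: hand-labeled skills vs. extracted skills for one posting.
print(f1_score({"python", "sql", "spark"}, {"python", "sql", "excel"}))
```

Averaging the per-posting F1 over a sample gives a macro-averaged score. Whether the published 75-87% range is macro- or micro-averaged is not stated, so treat a comparison as approximate.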

Deduplication Methodology

We ingest 1B+ raw postings and apply a three-stage deduplication pipeline: vector similarity matching on job descriptions, near-duplicate detection for company name variants, and graph-based transitive closure to merge job families across sources and reposting cycles. The result is 900M+ semantically unique postings.
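
As a rough illustration of how similarity matching and transitive closure combine, the sketch below links postings whose description vectors exceed a cosine-similarity threshold and then merges linked postings into clusters with a union-find structure. It is a toy under stated assumptions: the vectors, IDs, and function names are hypothetical, the company-name matching stage is omitted, and the all-pairs comparison would not scale to a production corpus.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_duplicates(vectors: Dict[str, Sequence[float]],
                       threshold: float = 0.92) -> List[List[str]]:
    """Group posting IDs into clusters via pairwise similarity plus transitive closure."""
    parent = {pid: pid for pid in vectors}

    def find(x: str) -> str:
        # Walk to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair at or above the threshold (quadratic; fine for a toy example).
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for pid in vectors:
        groups.setdefault(find(pid), []).append(pid)
    return sorted(sorted(group) for group in groups.values())


def unit(degrees: float) -> tuple:
    """Hypothetical 2-D 'embedding' at a given angle, for demonstration only."""
    rad = math.radians(degrees)
    return (math.cos(rad), math.sin(rad))


vectors = {"a": unit(0), "b": unit(20), "c": unit(40), "d": unit(90)}
print(cluster_duplicates(vectors, threshold=0.92))
```

Here "a" and "c" are not similar enough to link directly, but both link to "b", so all three land in one cluster while "d" stays on its own. That is the effect transitive closure has on chains of reposts.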

Deduplication is configurable per delivery: clients can adjust similarity thresholds (0.90 to 0.95), posting windows (1 to 6 months), and geographic radius (10 to 50 miles). No duplicate job ID appears in any delivery file.
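
One way to picture these three settings is as a small validated configuration object plus a pairwise check. The sketch below is an assumption for illustration only: the class, field names, and defaults are hypothetical and do not describe Canaria's delivery interface. Only the allowed ranges come from the text above.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class DedupConfig:
    """Hypothetical per-delivery settings; ranges mirror those stated above."""
    similarity_threshold: float = 0.92
    window_months: int = 3
    radius_miles: float = 25.0

    def __post_init__(self) -> None:
        if not 0.90 <= self.similarity_threshold <= 0.95:
            raise ValueError("similarity_threshold must be between 0.90 and 0.95")
        if not 1 <= self.window_months <= 6:
            raise ValueError("window_months must be between 1 and 6")
        if not 10 <= self.radius_miles <= 50:
            raise ValueError("radius_miles must be between 10 and 50")


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two lat/lng points."""
    earth_radius_miles = 3958.8
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(h))


def is_candidate_pair(cfg: DedupConfig, similarity: float,
                      posted_a: date, posted_b: date,
                      loc_a: Tuple[float, float], loc_b: Tuple[float, float]) -> bool:
    """True when two postings fall inside all three configured limits."""
    # Approximate month length; a production system may define the window differently.
    months_apart = abs((posted_a - posted_b).days) / 30.44
    return (similarity >= cfg.similarity_threshold
            and months_apart <= cfg.window_months
            and haversine_miles(*loc_a, *loc_b) <= cfg.radius_miles)


cfg = DedupConfig()
new_york, newark, philadelphia = (40.7128, -74.0060), (40.7357, -74.1724), (39.9526, -75.1652)
print(is_candidate_pair(cfg, 0.93, date(2024, 1, 1), date(2024, 2, 15), new_york, newark))
print(is_candidate_pair(cfg, 0.93, date(2024, 1, 1), date(2024, 2, 15), new_york, philadelphia))
```

Tightening the threshold or shrinking the window and radius yields fewer merges, so more postings are kept as distinct. Loosening them merges more aggressively.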

Source | Dedup Rate
Overall (all sources) | 40-60%
ATS portals (single-source) | <2%
LinkedIn | <10%
Indeed | <15%
Aggregators (e.g., Google Jobs) | 60-70%
Duplicate job ID in delivery | 0% (enforced)

High dedup rates on aggregator sources (60-70%) reflect legitimate cross-posting behavior. ATS portal feeds show near-zero internal duplication (<2%) because they are primary sources.
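
The 0% duplicate-ID guarantee is straightforward to confirm on a received file. The following is a minimal sketch that assumes each delivered row exposes an identifier under a key named `job_id`. That key name is hypothetical, so substitute whatever the delivery schema actually uses.

```python
from collections import Counter
from typing import Iterable, List, Mapping


def duplicate_job_ids(rows: Iterable[Mapping[str, str]],
                      id_field: str = "job_id") -> List[str]:
    """Return the sorted IDs that appear more than once in a delivery."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(job_id for job_id, n in counts.items() if n > 1)


# Hypothetical rows; a delivery that honors the guarantee would yield an empty list.
delivery = [{"job_id": "a1"}, {"job_id": "b2"}, {"job_id": "a1"}]
print(duplicate_job_ids(delivery))
```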

Update Frequency

Data is refreshed daily to hourly depending on the source. ATS portal postings are captured within 24 hours of going live on the employer career page. Major job board sources refresh on a daily cadence. Snapshots used in the interactive explorer are refreshed weekly. Full-history flat file deliveries are available on request.

Data Vintage and Coverage Notes

Coverage improves significantly for 2022+ data. Many of the posting attributes that enrichment depends on did not exist, or were rarely stated, before 2021-2022:

  • Work mode (remote / hybrid / on-site)

    NULL values in pre-2020 data are expected and correct. Fewer than 1% of postings mentioned remote work before 2020. Structured work mode fields did not emerge until 2021. Coverage reaches 85%+ for 2022+ data.

  • Stated salary

    Salary transparency laws (Colorado 2021, NYC 2022, California and Washington 2023) drove a step-change improvement in stated salary coverage. Pre-2022 null rates of 70-95% are expected, not a data gap. Predicted salary is available for all records with valid state, zip code, and SOC code.

  • SOC code and skills

    Enrichment model accuracy is highest for 2023+ data, where job descriptions are longer and more structured. Pre-2020 SOC coverage is 70-85% and skills coverage is 50-70%, due to shorter descriptions and format differences in older postings.

Researcher note: Null rates for pre-2020 data are expected and correct. Comparing null rates across vintages without accounting for structural breaks (remote work adoption, salary transparency laws) will produce misleading conclusions.
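
To make vintage effects visible before drawing conclusions, null rates can be tabulated per posting year rather than pooled. The sketch below assumes records are dictionaries with a `posted_year` key and a `work_mode` key. Both names and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Optional


def null_rate_by_year(records: Iterable[Mapping[str, Optional[object]]],
                      field: str,
                      year_field: str = "posted_year") -> Dict[int, float]:
    """Share of records per year where `field` is missing or None."""
    totals: Dict[int, int] = defaultdict(int)
    nulls: Dict[int, int] = defaultdict(int)
    for rec in records:
        year = rec[year_field]
        totals[year] += 1
        if rec.get(field) is None:
            nulls[year] += 1
    return {year: nulls[year] / totals[year] for year in sorted(totals)}


# Hypothetical sample: work mode is absent in 2019 rows and mostly present in 2023 rows.
sample = (
    [{"posted_year": 2019, "work_mode": None}] * 4
    + [{"posted_year": 2023, "work_mode": "remote"}] * 3
    + [{"posted_year": 2023, "work_mode": None}]
)
print(null_rate_by_year(sample, "work_mode"))
```

Reading the per-year rates side by side makes a structural break show up as a step between vintages, instead of being averaged into a single misleading figure.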

Evaluate quality firsthand

Request a sample dataset to run your own validation before committing. We provide sample files with full enrichment fields so you can test against your own benchmarks.