Research-Grade Data Quality
Every record in Canaria's dataset passes through a multi-stage enrichment and validation pipeline. This page documents the coverage benchmarks, accuracy metrics, and known limitations that quant funds, enterprise teams, and academic researchers need to evaluate before using the data.
At a Glance
900M+ unique postings after deduplication
56% average dedup rate across all sources
Daily to hourly update frequency
82 enriched fields per record
Enrichment Field Accuracy
Benchmarks are measured on production data. Coverage rates are reported for 2023+ data unless otherwise noted. Pre-2020 coverage is expected to be lower; see Data Vintage and Coverage Notes.
| Field | Coverage | Accuracy | Notes |
|---|---|---|---|
| SOC code (6-digit) | >90% (2023+) | 85-92% | Uses title plus description context. Accuracy is for descriptions over 200 words. |
| SOC code (2-digit) | >90% (2023+) | >95% | Major group level. Aligned with BLS 2018 SOC taxonomy. |
| Predicted salary | >85% non-null (2023+) | MAPE <15% | Trained on 50M+ salary observations. Requires valid state, zip code, and SOC code. Industry average MAPE is 15-25%. |
| Skills extraction | >80% coverage | F1 75-87% | 37,000+ skills taxonomy. Two-step extraction: dictionary matching followed by NLP relevance filtering. |
| Seniority level | ~75% non-null in delivery | 100% complete | Model always returns a classification (Entry, Mid, Senior, Lead, Manager, Director, Executive), so this row reports output completeness rather than classification accuracy. Nulls in delivery reflect pre-2020 vintage data. |
| Work mode (remote/hybrid/on-site) | >85% (2022+) | 92-97% | NULL is expected and correct for pre-2020 data. Structured work mode fields did not exist before 2021. |
| Location (city/state) | >95% parse success | City: 85-93%, State: 92-97% | Geocoded to lat/lng. Parse failures are tracked and reported separately. |
| Title normalization | >90% mapped | >90% | Canonicalizes abbreviations, Roman numerals, and level indicators. Confidence scores included for downstream filtering. |
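One way to check the predicted-salary MAPE figure on your own sample is to compare predicted values against stated salaries wherever both are present. The following is a minimal sketch, not a description of how the benchmark above was produced; the field names `predicted_salary` and `stated_salary` are hypothetical and may not match the delivery schema.

```python
from typing import Iterable, Mapping, Optional


def mape(records: Iterable[Mapping[str, Optional[float]]],
         predicted_key: str = "predicted_salary",   # hypothetical field name
         actual_key: str = "stated_salary") -> float:  # hypothetical field name
    """Mean absolute percentage error over records where both values exist."""
    errors = []
    for rec in records:
        predicted = rec.get(predicted_key)
        actual = rec.get(actual_key)
        # Skip rows with a missing value or a non-positive stated salary.
        if predicted is None or actual is None or actual <= 0:
            continue
        errors.append(abs(predicted - actual) / actual)
    if not errors:
        raise ValueError("no records with both predicted and stated salary")
    return 100.0 * sum(errors) / len(errors)


# Toy records for illustration only; these are not Canaria data.
sample = [
    {"predicted_salary": 90_000, "stated_salary": 100_000},
    {"predicted_salary": 55_000, "stated_salary": 50_000},
    {"predicted_salary": 70_000, "stated_salary": None},  # ignored: no stated salary
]
sample_mape = mape(sample)
print(f"MAPE: {sample_mape:.1f}%")
```

Because stated salaries are sparse before 2022, a check like this is most meaningful on recent postings.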
Deduplication Methodology
We ingest 1B+ raw postings and apply a three-stage deduplication pipeline: vector similarity matching on job descriptions, near-duplicate detection for company name variants, and graph-based transitive closure to merge job families across sources and reposting cycles. The result is 900M+ semantically unique postings.
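The first and third stages can be illustrated with a toy example: cosine similarity over description embeddings, followed by union-find to take the transitive closure of the matches. This is a sketch under stated assumptions, not Canaria's production pipeline. The embeddings, the 0.92 threshold, and all function names are hypothetical, and the brute-force pairwise comparison would not scale to real volumes.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_duplicates(embeddings, threshold=0.92):
    """Group posting IDs connected by a chain of pairs with similarity >= threshold."""
    parent = {pid: pid for pid in embeddings}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for pid in embeddings:
        clusters.setdefault(find(pid), []).append(pid)
    return sorted(sorted(group) for group in clusters.values())


# Toy 3-dimensional "embeddings"; real description vectors would be much larger.
toy = {
    "a1": [1.0, 0.0, 0.0],
    "a2": [0.98, 0.2, 0.0],
    "a3": [0.9, 0.45, 0.0],
    "b1": [0.0, 0.0, 1.0],
}
clusters = cluster_duplicates(toy)
print(clusters)
```

In this toy data, `a1` and `a3` fall below the threshold when compared directly, but both match `a2`, so the transitive closure places all three in one group.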
Deduplication is configurable per delivery: clients can adjust similarity thresholds (0.90 to 0.95), posting windows (1 to 6 months), and geographic radius (10 to 50 miles). No duplicate job ID appears in any delivery file.
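The documented ranges can be captured in a small validated settings object on the client side. This is a sketch only: the class name, field names, and default values are hypothetical and do not correspond to an actual Canaria API. Only the ranges come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupSettings:
    """Hypothetical per-delivery dedup settings; only the ranges are documented."""
    similarity_threshold: float = 0.92  # assumed default; documented range 0.90-0.95
    posting_window_months: int = 3      # assumed default; documented range 1-6
    geo_radius_miles: float = 25.0      # assumed default; documented range 10-50

    def __post_init__(self):
        checks = [
            ("similarity_threshold", self.similarity_threshold, 0.90, 0.95),
            ("posting_window_months", self.posting_window_months, 1, 6),
            ("geo_radius_miles", self.geo_radius_miles, 10, 50),
        ]
        for name, value, low, high in checks:
            if not low <= value <= high:
                raise ValueError(
                    f"{name}={value} is outside the documented range {low}-{high}"
                )


# Strictest combination allowed by the documented ranges.
strict = DedupSettings(
    similarity_threshold=0.95, posting_window_months=1, geo_radius_miles=10
)
print(strict)
```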
| Source | Dedup Rate |
|---|---|
| Overall (all sources) | 40-60% |
| ATS portals (single-source) | <2% |
|  | <10% |
| Indeed | <15% |
| Aggregators (e.g., Google Jobs) | 60-70% |
| Duplicate job ID in delivery | 0% (enforced) |
High dedup rates on aggregator sources (60-70%) reflect legitimate cross-posting behavior. ATS portal feeds show near-zero internal duplication (<2%) because they are primary sources.
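When validating a delivery, two simple checks are a dedup rate calculation and a scan for repeated job IDs. The sketch below assumes that "dedup rate" means the share of raw postings removed as duplicates, which this page does not define explicitly; the function names are hypothetical.

```python
from collections import Counter


def dedup_rate(raw_count: int, unique_count: int) -> float:
    """Share of raw postings removed as duplicates (assumed definition)."""
    if raw_count <= 0 or not 0 <= unique_count <= raw_count:
        raise ValueError("expected 0 <= unique_count <= raw_count and raw_count > 0")
    return 1.0 - unique_count / raw_count


def duplicate_ids(job_ids):
    """Return the job IDs that appear more than once, sorted."""
    counts = Counter(job_ids)
    return sorted(jid for jid, n in counts.items() if n > 1)


# Toy numbers for illustration only.
aggregator_rate = dedup_rate(raw_count=1000, unique_count=350)
repeated = duplicate_ids(["j1", "j2", "j3", "j1"])
print(f"dedup rate: {aggregator_rate:.0%}, repeated IDs: {repeated}")
```

An empty result from `duplicate_ids` on a delivery file is what the 0% (enforced) row in the table implies.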
Update Frequency
Data is refreshed daily to hourly depending on the source. ATS portal postings are captured within 24 hours of going live on the employer career page. Major job board sources refresh on a daily cadence. Snapshots used in the interactive explorer are refreshed weekly. Full-history flat file deliveries are available on request.
Data Vintage and Coverage Notes
Coverage improves significantly for 2022+ data. Many enrichment fields did not exist before 2021-2022:
Work mode (remote / hybrid / on-site)
NULL values in pre-2020 data are expected and correct. Fewer than 1% of postings mentioned remote work before 2020. Structured work mode fields did not emerge until 2021. Coverage reaches 85%+ for 2022+ data.
Stated salary
Salary transparency laws (Colorado 2021, NYC 2022, California and Washington 2023) drove a step-change improvement in stated salary coverage. Pre-2022 null rates of 70-95% are expected, not a data gap. Predicted salary is available for all records with valid state, zip code, and SOC code.
SOC code and skills
Enrichment model accuracy is highest for 2023+ data where job descriptions are longer and more structured. Pre-2020 SOC coverage is 70-85% and skills coverage 50-70% due to shorter descriptions and format differences in older postings.
Researcher note: Null rates for pre-2020 data are expected and correct. Comparing null rates across vintages without accounting for structural breaks (remote work adoption, salary transparency laws) will produce misleading conclusions.
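One way to follow this advice is to compute null rates separately per vintage bucket rather than pooling all years. A minimal sketch follows; the bucket boundaries, the `posted_year` and `work_mode` field names, and the function names are assumptions chosen for illustration, not part of the delivery schema.

```python
from collections import defaultdict


def vintage_bucket(year: int) -> str:
    """Assign a posting year to an illustrative vintage bucket (assumed boundaries)."""
    if year < 2020:
        return "pre-2020"
    if year < 2022:
        return "2020-2021"
    return "2022+"


def null_rate_by_vintage(records, field, year_key="posted_year"):
    """Fraction of records per vintage bucket where `field` is missing or None."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for rec in records:
        bucket = vintage_bucket(rec[year_key])
        totals[bucket] += 1
        if rec.get(field) is None:
            nulls[bucket] += 1
    return {bucket: nulls[bucket] / totals[bucket] for bucket in totals}


# Toy records for illustration only; these are not Canaria data.
postings = [
    {"posted_year": 2018, "work_mode": None},
    {"posted_year": 2019, "work_mode": None},
    {"posted_year": 2021, "work_mode": "remote"},
    {"posted_year": 2021, "work_mode": None},
    {"posted_year": 2023, "work_mode": "hybrid"},
    {"posted_year": 2024, "work_mode": "on-site"},
]
rates = null_rate_by_vintage(postings, "work_mode")
print(rates)
```

Reporting the per-bucket rates side by side makes the structural break visible instead of hiding it in a single pooled figure.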
Evaluate quality firsthand
Request a sample dataset to run your own validation before committing. We provide sample files with full enrichment fields so you can test against your own benchmarks.