The flagship Canaria dataset: 907M deduplicated job postings aggregated from 15+ sources and enriched with 82 structured fields through the Model Garden pipeline. Sources include Indeed (226M), LinkedIn (176M), PJF (216M), SimplyHired (105M), 200,000+ ATS employer career portals, CareerBuilder, and more. Semantic deduplication removes 40-60% of cross-source duplicates using vector similarity, MinHash/Jaccard, and graph-based transitive matching. Every record includes SOC classification, seniority (100% complete), salary prediction (MAPE <15%), work mode detection, and skills extraction from a taxonomy of 37,000+ skills.
All records are fully enriched through the Model Garden NLP pipeline. Raw and enriched fields are delivered together for full transparency.
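As a rough illustration of the deduplication idea described above, the sketch below combines Jaccard similarity with graph-based transitive matching. It is not the Canaria pipeline: the function names (`shingles`, `jaccard`, `dedupe`), the word-shingle size, and the threshold are all assumptions chosen for the example. It computes exact Jaccard similarity on word shingles, which MinHash would approximate at scale, and it leaves out the vector-similarity step entirely.

```python
from itertools import combinations


def shingles(text, k=1):
    """Return the set of k-word shingles for a posting (hypothetical helper)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def dedupe(postings, threshold=0.55, k=1):
    """Group postings into duplicate clusters.

    Two postings are linked when their Jaccard similarity reaches the
    threshold. Links are merged transitively, so A and C land in the same
    cluster when both match B, even if A and C do not match each other.
    """
    sets = [shingles(p, k) for p in postings]
    parent = list(range(len(postings)))
    # All-pairs comparison is O(n^2); acceptable for a sketch only.
    for i, j in combinations(range(len(postings)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)
    groups = {}
    for i in range(len(postings)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


# Illustrative data (invented for this example, not from the dataset).
postings = [
    "senior python engineer remote berlin",
    "senior python engineer remote",
    "python engineer remote hybrid",
    "registered nurse night shift",
]
clusters = dedupe(postings)
print(clusters)
```

In this example the first and third postings fall below the threshold when compared directly, but both match the second, so the transitive step places all three in one cluster. At the scale of hundreds of millions of records, the all-pairs loop would need to be replaced by a candidate-generation step such as MinHash with locality-sensitive hashing.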
jobTitle, companyName, location, description, datePosted, normTitle, soc, socTitle, seniority, salaryAvgAnnual, nlpSkills, nlpSoftSkills, remote, employment, srcBase
Labor Market Data for Investment Professionals
Hiring velocity as an economic leading indicator with SOC-level granularity
Job Market Data for Competitive Intelligence
Competitor hiring patterns, skills trends, and geographic expansion signals
Job Market Data for HR Tech Platforms
Add salary benchmarking and skills intelligence to your platform without building ML
Job Market Training Data for AI & ML Teams
Pre-enriched, deduplicated job market training data. Skip 6 months of pipeline building.
Job Market Data for Workforce Planning
Lightcast intelligence at a fraction of the cost, with full data transparency
Job Market Data for Academic Research
Longitudinal dataset for labor economics: wage dynamics, skill demand, remote work adoption
Job Market Data for Recruiting & Staffing
Fresh daily job market data for content, outreach, and market positioning.