Data Collection and Enrichment Methodology
How we collect, standardize, enrich, and deduplicate job market data at scale.
Platform Overview
| Metric | Value |
|---|---|
| URLs Scraped | 8B+ |
| URLs Ingested (Unique) | 1B+ |
| Unique Job Postings (After Semantic Dedup) | 900M+ |
| Fields per Record | 82 |
| Historical Coverage | 2022-present |
| Update Frequency | Daily (incremental) |
| Primary Sources | Indeed, LinkedIn Jobs, 200,000+ employer ATS portals (Greenhouse, Lever, Workday, iCIMS) |
| Geographic Scope | United States (primary), expanding internationally |
| Delivery Formats | CSV, Parquet via S3, GCS, Google Drive, Dropbox, SFTP |
Data Collection Philosophy
Canaria treats job postings as observable signals of employer labor demand, not as direct measures of hiring outcomes. The platform focuses on making what is observable reliable, standardized, and analytically usable, while explicitly preserving uncertainty and avoiding inference beyond the data's scope. Raw job postings are ingested through a distributed scraping infrastructure and processed through a multi-stage pipeline combining deterministic parsing, machine learning-based enrichment, and job-level entity resolution. All raw, parsed, and enriched fields are retained for transparency and auditability. The system prioritizes methodological rigor, reproducibility, and clarity over opaque or black-box data products.
Pipeline Architecture
Every job posting passes through an 8-stage pipeline: ingestion, deduplication, location parsing, salary normalization, description extraction, ML classification, title normalization, and company enrichment. The stage outputs are then merged into a single delivery record.
| Stage | Details |
|---|---|
| Raw Scraped Data | 8B+ URLs, 1B+ ingested |
| Dedup & Aggregation | 907M unique records, 40-60% dedup rate |
| Location Parsing | City / state / country / zip / coordinates |
| Salary Normalization | Annual USD min / avg / max |
| Description Extraction | Skills, benefits, clearance, contact |
| ML Classification | SOC, seniority, employment, remote |
| Title Normalization | Standardized titles + confidence scores |
| Company Enrichment | 28.5M companies, industry, size, HQ |
| Final Merged Delivery | 907M records, 82 fields, 7 LEFT JOINs + priority logic |
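The final merge step ("7 LEFT JOINs + priority logic") can be pictured with a small sketch. This is an illustration under stated assumptions, not Canaria's implementation: the field names (`salary_stated`, `salary_predicted`, `salary_final`), the two enrichment tables, and the rule that a stated value outranks a predicted one are all hypothetical.

```python
from typing import Any, Optional


def left_join(base: list[dict], enrichment: dict[str, dict]) -> list[dict]:
    """LEFT JOIN on jobId: every base row survives, enrichment is attached when present."""
    return [{**row, **enrichment.get(row["jobId"], {})} for row in base]


def coalesce(*candidates: Optional[Any]) -> Optional[Any]:
    """Priority logic: return the first candidate that is not None."""
    for value in candidates:
        if value is not None:
            return value
    return None


# Hypothetical stage outputs, each keyed by jobId.
base = [{"jobId": "a1", "title": "SW Eng II"}, {"jobId": "b2", "title": "Nurse"}]
salary_stated = {"a1": {"salary_stated": 110000}}
salary_predicted = {"a1": {"salary_predicted": 105000}, "b2": {"salary_predicted": 82000}}

merged = left_join(left_join(base, salary_stated), salary_predicted)
for row in merged:
    # Assumed priority: a salary stated in the posting wins over a model prediction.
    row["salary_final"] = coalesce(row.get("salary_stated"), row.get("salary_predicted"))
```

Because each join is a LEFT JOIN, a posting with no match in an enrichment table is kept and simply lacks those fields, which is why the coalescing step reads them with `row.get`.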
Machine Learning Enrichment (Model Garden)
The Model Garden is a microservice-based NLP pipeline that transforms raw job text into structured, analytics-ready fields. Each model operates independently, allowing separate scaling and iteration.
Title Normalization
Canonicalizes noisy titles (abbreviations, typos) to standardized forms. Outputs confidence scores so researchers can filter or weight results by precision.
Coverage: >90% of titles mapped
Example: SW Eng II → Software Engineer
SOC Classification
Assigns 6-digit Standard Occupational Classification codes using title + description context. Enables occupational analysis aligned with the BLS 2018 SOC taxonomy.
Accuracy: 2-digit >95%, 6-digit 85-92%
Example: 15-1252 — Software Developers
Seniority Detection
Infers career level (Entry, Mid, Senior, Lead, Manager, Director, Executive) from title patterns and responsibility descriptions. Always returns a classification.
Coverage: 100% (always returns a classification)
Example: Senior
Employment Type
Classifies postings as Full-time, Part-time, Contract, or Temporary based on description text patterns and explicit mentions.
Coverage: ~90%
Example: Full-time
Remote Work Status
Determines if a position is Remote, Hybrid, On-site, or Flexible through keyword detection and location analysis.
Accuracy: >85% (2023+)
Example: Hybrid
Salary Prediction
Regression model trained on 50M+ Glassdoor/Indeed observations. Requires valid State, ZipCode, and SOC code.
Accuracy: MAPE <15%
Example: $95,000 – $125,000
Skills Extraction (NER)
Matches posting text against a dictionary of 37,000+ skills, 3,000+ certifications, and 400+ soft skills using Aho-Corasick matching, then applies NLP relevance filtering to remove spurious matches.
Accuracy: F1: 85-92% (structured), 65-78% (narrative)
Example: ["Python", "AWS", "Docker"]
NAICS Prediction
Industry classification for company records using company name and job text signals. Enables industry-level labor market analysis.
Accuracy: In progress
Example: 511210 — Software Publishers
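The "MAPE <15%" figure quoted for Salary Prediction refers to mean absolute percentage error. As a reminder of how that metric is computed, here is a minimal sketch; the salary values are invented for illustration and are not drawn from the dataset.

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical stated salaries and the corresponding model predictions.
stated = [100_000.0, 80_000.0, 120_000.0]
predicted = [90_000.0, 88_000.0, 126_000.0]
error_pct = mape(stated, predicted)  # per-row errors are 10%, 10%, and 5%
```

Because each error is divided by the actual salary, a $10,000 miss counts for more on a low-wage posting than on a high-wage one.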
Named Entity Recognition
A fast, dictionary-based keyword processor runs in-pipeline to extract structured entities from job descriptions. Extracted skills are filtered through a title-skill relevance model to remove spurious matches. Multi-valued attributes are represented as arrays; when no signal is present, fields return empty arrays rather than null.
| Entity Type | What Is Extracted |
|---|---|
| Technical Skills | Programming languages, tools, frameworks |
| Soft Skills | Leadership, communication, problem-solving |
| Certifications | Professional licenses and credentials (PMP, AWS, CPA) |
| Qualifications | Education requirements, years of experience |
| Benefits | Health insurance, 401(k), PTO mentions |
| Contact Signals | Emails and phone numbers extracted from posting text |
| Work Requirements | Visa sponsorship indicators, citizenship or residency requirements, security clearance |
| Work Conditions | Travel requirements, shift work and shift type mentions, language requirements |
| Role Characteristics | Urgent hiring cues, manager or lead indicators, number of openings, team size, start date |
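To make the dictionary-matching step concrete, the following is a compact Aho-Corasick sketch in pure Python. It is an illustration only: the six-term dictionary is made up, and the word-boundary check is a simple stand-in for the title-skill relevance model described above, which is not reproduced here.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern matcher: finds all dictionary terms in one pass over the text."""

    def __init__(self, terms):
        self.goto = [{}]   # trie transitions for each node
        self.fail = [0]    # failure link for each node
        self.out = [[]]    # terms recognized on reaching each node
        for term in terms:
            self._add(term)
        self._build_failure_links()

    def _add(self, term):
        node = 0
        for ch in term.lower():
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(term)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 nodes fall back to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit terms that end at the fallback state (suffix matches).
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find(self, text):
        """Return matched terms in order of first appearance; [] when nothing matches."""
        lowered, node, hits = text.lower(), 0, []
        for i, ch in enumerate(lowered):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for term in self.out[node]:
                start = i - len(term) + 1
                before_ok = start == 0 or not lowered[start - 1].isalnum()
                after_ok = i + 1 == len(lowered) or not lowered[i + 1].isalnum()
                if before_ok and after_ok and term not in hits:
                    hits.append(term)
        return hits


# Hypothetical mini-dictionary; the production dictionary is far larger.
matcher = AhoCorasick(["Python", "AWS", "Docker", "R", "Java", "JavaScript"])
skills = matcher.find("Experience with Python, AWS and Docker required; JavaScript a plus.")
```

The boundary check keeps "R" from matching inside "required" and "Java" from matching inside "JavaScript". A text with no matches yields an empty list, in line with the empty-array convention described above.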
Extraction Accuracy (F1 Scores)
| Context | F1 Score |
|---|---|
| Bulleted / structured sections | 85-92% |
| Narrative / prose text | 65-78% |
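For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it for a single posting by comparing extracted skills with a hand-labeled set. The source does not state how scores are aggregated across postings, so the per-posting, set-based form shown here is an assumption, and the skill sets are invented.

```python
def f1_score(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1 for one posting: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical extraction output versus a hand-labeled reference.
extracted = {"Python", "AWS", "Docker", "Excel"}
labeled = {"Python", "AWS", "Docker", "Kubernetes"}
score = f1_score(extracted, labeled)  # precision and recall are both 3/4
```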
Skills Coverage by Description Length
| Description Length | Skills Coverage |
|---|---|
| <50 words | 47.5% |
| 50–199 words | ~80% |
| 200–499 words | >85% |
| 500+ words | 99.5% |
Semantic Deduplication
A central challenge in online job postings data is severe duplication, driven by reposting behavior, aggregator syndication, and minor text changes over time. Canaria addresses this through a configurable, multi-signal deduplication framework that resolves repeated observations of the same hiring intent into a single canonical job entity. Across all sources combined, the deduplication rate averages 40-60%.
Vector Similarity
Captures semantic similarity between descriptions even when text is slightly altered.
MinHash / Jaccard
Handles variations in company names like "Macy's," "Macys Inc," or "Macy's LLC."
Title Similarity
Normalizes title variants such as "Junior Mechanical Engineer" and "Mechanical Eng I."
Geo-location Clustering
Groups postings within an adjustable radius (e.g. 10–50 miles).
Configurable Posting Window
Defines whether two postings within a given time range (e.g. 1–6 months) represent the same job.
Graph-based Transitive Matching
Captures transitive similarity (A ≈ B, B ≈ C → unify A, B, C) and supports custom deduplication policies per client.
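The signals above can be combined in many ways. The sketch below shows one possibility under loud assumptions: company names are compared with exact Jaccard similarity on character trigrams (MinHash would approximate the same quantity at scale), titles are assumed to be already normalized, vector similarity is omitted, and all thresholds, field names, and records are hypothetical. Union-find supplies the transitive step.

```python
import math
from datetime import date

LEGAL_SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}  # assumed list, for illustration


def shingles(name: str, k: int = 3) -> set[str]:
    """Character k-grams of a company name with punctuation and legal suffixes removed."""
    tokens = ["".join(ch for ch in tok if ch.isalnum()) for tok in name.lower().split()]
    cleaned = "".join(t for t in tokens if t and t not in LEGAL_SUFFIXES)
    return {cleaned[i:i + k] for i in range(max(len(cleaned) - k + 1, 1))}


def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def haversine_miles(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 3958.8 * math.asin(math.sqrt(a))


def same_job(a: dict, b: dict, name_min=0.5, radius_miles=25, window_days=90) -> bool:
    """All signals must agree for two postings to count as a direct match."""
    return (
        jaccard(a["company"], b["company"]) >= name_min
        and a["title"] == b["title"]
        and haversine_miles(a["lat"], a["lon"], b["lat"], b["lon"]) <= radius_miles
        and abs((a["posted"] - b["posted"]).days) <= window_days
    )


def cluster(postings: list[dict]) -> list[list[str]]:
    """Union-find over pairwise matches, so A ≈ B and B ≈ C puts A, B, C together."""
    parent = list(range(len(postings)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(postings)):
        for j in range(i + 1, len(postings)):
            if same_job(postings[i], postings[j]):
                parent[find(i)] = find(j)
    groups: dict[int, list[str]] = {}
    for i, posting in enumerate(postings):
        groups.setdefault(find(i), []).append(posting["id"])
    return sorted(groups.values())


nyc = {"title": "Mechanical Engineer", "lat": 40.75, "lon": -73.99}
postings = [
    {"id": "A", "company": "Macy's", "posted": date(2024, 1, 1), **nyc},
    {"id": "B", "company": "Macys Inc", "posted": date(2024, 3, 1), **nyc},
    {"id": "C", "company": "Macy's LLC", "posted": date(2024, 5, 1), **nyc},
    {"id": "D", "company": "Nordstrom", "posted": date(2024, 3, 1), **nyc},
]
clusters = cluster(postings)
```

In this example A and C are 121 days apart, outside the assumed 90-day window, so they never match directly; they end up in the same cluster only because each matches B. The all-pairs loop is quadratic and would need blocking or an index at real data volumes.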
Deduplication Benchmarks
| Source / Metric | Dedup Rate |
|---|---|
| Overall (multi-source) | 40-60% |
| ATS (internal) | <2% |
| Google Jobs / Aggregators | 60-70% |
| Indeed (internal) | ~15% |
| LinkedIn (internal) | <10% |
| Duplicate jobId in delivery | 0% (enforced) |
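The table does not define "dedup rate" formally. A natural reading, assumed here, is the share of raw observations removed when duplicates are collapsed. The sketch below computes that quantity and shows one way a 0% duplicate-`jobId` guarantee could be checked before delivery; the counts and function names are hypothetical.

```python
def dedup_rate(total_observations: int, unique_jobs: int) -> float:
    """Share of observations removed by deduplication (assumed definition)."""
    if total_observations <= 0:
        raise ValueError("total_observations must be positive")
    return 1 - unique_jobs / total_observations


def duplicate_job_ids(records: list[dict]) -> set[str]:
    """Return every jobId that appears more than once; an empty set means the check passes."""
    seen: set[str] = set()
    duplicates: set[str] = set()
    for record in records:
        job_id = record["jobId"]
        if job_id in seen:
            duplicates.add(job_id)
        seen.add(job_id)
    return duplicates


rate = dedup_rate(total_observations=1_000, unique_jobs=500)
clean = duplicate_job_ids([{"jobId": "a1"}, {"jobId": "b2"}])
dirty = duplicate_job_ids([{"jobId": "a1"}, {"jobId": "a1"}])
```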
Data Quality by Source
Not all job posting sources are created equal. Data completeness, duplication rates, and field availability vary significantly by source type. ATS direct feeds from employer career portals provide the highest quality data, while aggregator sources like Google Jobs carry substantially higher duplication and null rates.
| Metric | ATS Direct | LinkedIn | Indeed | Glassdoor | Google Jobs |
|---|---|---|---|---|---|
| Internal dedup rate | <2% | <10% | <15% | <15% | <30% |
| Description null | <3% | <5% | <5% | <10% | <15% |
| Location null | <2% | <3% | <5% | <5% | <10% |
| Date posted null | <5% | <5% | <5% | <10% | <15% |
| Salary stated (2023+) | <10% | 15-30% | 40-60% | 20-40% | varies |
| Seniority available | <5% | 60-80% | <10% | <10% | <10% |
| Data quality | Highest | High | Good | Good | Lowest |
Coverage by Data Vintage
Field coverage (the share of postings with a non-null value) varies by time period due to schema evolution, source availability changes, and the rollout of salary transparency laws. Researchers should account for data vintage when interpreting coverage.
| Field | Pre-2020 | 2020-2022 | 2023+ |
|---|---|---|---|
| Salary (stated) | 5-15% | 15-30% | 40-60% |
| Salary (predicted) | 50-70% | 70-85% | 85-95% |
| Work mode (remote/hybrid/onsite) | <5% | 40-70% | 85%+ |
| Seniority | 60-80% | 75-85% | 85-95% |
| SOC code | 70-85% | 80-90% | 85-95% |
| Skills | 50-70% | 70-85% | 80-93% |
| Location (parsed) | 85-95% | 92-97% | 95%+ |
| Company name | 97%+ | 98%+ | 99%+ |
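Researchers who want to reproduce a coverage-by-vintage breakdown on a delivered sample could proceed along these lines. This is a sketch: the `year` field, the record layout, and the rule that `None`, empty strings, and empty arrays all count as missing are assumptions.

```python
from collections import defaultdict


def vintage(year: int) -> str:
    """Map a posting year to the vintage buckets used in the table above."""
    if year < 2020:
        return "Pre-2020"
    if year <= 2022:
        return "2020-2022"
    return "2023+"


def coverage_by_vintage(records: list[dict], field: str) -> dict[str, float]:
    """Share of records per vintage where `field` holds a non-empty value."""
    filled: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for record in records:
        bucket = vintage(record["year"])
        total[bucket] += 1
        if record.get(field) not in (None, "", []):
            filled[bucket] += 1
    return {bucket: filled[bucket] / total[bucket] for bucket in total}


# Hypothetical sample records.
sample = [
    {"year": 2019, "salary": None},
    {"year": 2019, "salary": 50_000},
    {"year": 2021, "salary": None},
    {"year": 2024, "salary": 90_000},
    {"year": 2024},
]
salary_coverage = coverage_by_vintage(sample, "salary")
```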
Salary Transparency Law Timeline
US salary disclosure laws are the primary driver of improving salary coverage in job posting data. Each new law causes a measurable increase in the share of postings that include compensation information.
| Year | Law | Impact |
|---|---|---|
| 2021 | Colorado | First US state; salary field begins populating |
| 2022 | NYC (Nov) | Salary null rates drop 15-25pp in NYC |
| 2023 | CA, WA, NY State | ~25% of US workforce covered. Major salary coverage jump. |
| 2024 | Hawaii | Continued improvement |
| 2025 | Illinois, Minnesota | Further expansion. NJ, MA, VT pending. |
| 2025+ | EU Pay Transparency Directive | European salary data improving |
Company Enrichment (CDP)
Canaria maintains a Company Data Platform (CDP) of 28.5M companies, enriching each job posting with industry, employee count, headquarters location, revenue, and founding year. Company matching uses fuzzy name resolution with strict quality controls.
| Metric | Value |
|---|---|
| Companies in CDP | 28.5M |
| Match rate (job postings to canonical company) | >90% |
| False merge rate | <1% |
| False split rate | <5% |
| Orphan companies | <10% |
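Fuzzy name resolution against a canonical company list might look like the sketch below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever matcher the CDP actually uses; the suffix list, the 0.85 threshold, and the company names are assumptions. A raw name with no candidate above the threshold is left unmatched rather than force-merged, which is the trade-off behind keeping false merges rarer than false splits.

```python
from difflib import SequenceMatcher
from typing import Optional

LEGAL_SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}  # assumed list, for illustration


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and remove legal-form suffixes."""
    tokens = ["".join(ch for ch in tok if ch.isalnum()) for tok in name.lower().split()]
    return "".join(t for t in tokens if t and t not in LEGAL_SUFFIXES)


def match_company(raw: str, canonical: list[str], threshold: float = 0.85) -> Optional[str]:
    """Return the best-scoring canonical name, or None when no score reaches the threshold."""
    best_name, best_score = None, 0.0
    for candidate in canonical:
        score = SequenceMatcher(None, normalize(raw), normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


canonical_companies = ["Macy's", "Nordstrom", "Target"]
suffix_variant = match_company("Macys Inc", canonical_companies)
typo_variant = match_company("Nordstorm", canonical_companies)
unknown = match_company("Acme Robotics", canonical_companies)
```

Raising the threshold trades fewer false merges for more false splits and unmatched names; the right setting depends on how costly each error type is for the analysis at hand.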
Data Quality Monitoring
Canaria maintains comprehensive monitoring and quality assurance systems across all ingestion and processing stages. A pipeline component tracks field-level completeness for every scraped batch. Non-empty field counts are aggregated by source and time period, enabling detection of source degradation or scraping failures. Completeness metrics inform prioritization of enrichment efforts. Source URLs are monitored for consecutive failure streaks and flagged for manual review when potentially defunct. Monitoring dashboards visualize trends and trigger alerts for anomalies including queue depth spikes, error rate thresholds, and crawler failures.
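Two of the checks described above, consecutive failure streaks and completeness drops, are simple enough to sketch. The function names, log format, and thresholds below are hypothetical and only illustrate the idea.

```python
from collections import Counter


def urls_needing_review(fetch_log: list[tuple[str, bool]], max_streak: int = 3) -> set[str]:
    """Flag URLs whose most recent fetches failed at least `max_streak` times in a row.

    `fetch_log` holds (url, succeeded) pairs in chronological order.
    """
    streak: Counter = Counter()
    for url, succeeded in fetch_log:
        streak[url] = 0 if succeeded else streak[url] + 1
    return {url for url, count in streak.items() if count >= max_streak}


def degraded_fields(baseline: dict[str, float], current: dict[str, float],
                    max_drop: float = 0.10) -> list[str]:
    """Fields whose completeness fell by more than `max_drop` relative to the baseline."""
    return sorted(
        field for field, base in baseline.items()
        if base - current.get(field, 0.0) > max_drop
    )


# Hypothetical fetch history: u1 fails three times running, u2 recovers in between.
log = [("u1", False), ("u1", False), ("u1", False),
       ("u2", False), ("u2", True), ("u2", False)]
flagged = urls_needing_review(log)

# Hypothetical completeness for one source: last month's baseline versus today's batch.
alerts = degraded_fields({"salary": 0.50, "location": 0.97},
                         {"salary": 0.30, "location": 0.96})
```

A success resets the streak, so a source that fails intermittently is not flagged as defunct.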
Research Applications
The Canaria dataset enables empirical analysis across multiple labor economics domains.
Wage Dynamics & Compensation
- Geographic wage differentials (metro vs rural, coastal vs inland)
- Salary transparency effects (pre/post state disclosure laws)
- Compensation trends by occupation, industry, and company size
Labor Demand & Skill Requirements
- Skill demand evolution over time (e.g., AI/ML skill growth)
- Education-occupation mismatch (degree requirements vs SOC norms)
- Occupational mobility pathways via skill overlap analysis
Remote Work & Geographic Flexibility
- Remote work adoption rates by occupation and industry
- Remote work wage premia and discounts
- Geographic concentration trends and migration patterns
Firm-level Hiring Behavior
- Posting persistence and time-to-fill estimation
- Hiring velocity as a leading economic indicator
- Expansion signals via new-role detection and headcount proxies