Data Collection and Enrichment Methodology
How we collect, standardize, enrich, and deduplicate job market data at scale.
Platform Overview
| Metric | Value |
|---|---|
| URLs Scraped | 8B+ |
| URLs Ingested (Unique) | 1B+ |
| Unique Job Postings (After Semantic Dedup) | 900M+ |
| Fields per Record | 82 |
| Historical Coverage | 2022-present |
| Update Frequency | Daily |
| Primary Sources | Indeed, LinkedIn Jobs, 200,000+ employer ATS portals (Greenhouse, Lever, Workday, iCIMS) |
| Geographic Scope | United States (primary), expanding internationally |
| Delivery Formats | CSV, Parquet via S3, GCS, Google Drive, Dropbox, SFTP |
Data Collection Philosophy
Canaria treats job postings as observable signals of employer labor demand, not as direct measures of hiring outcomes. A posted job reflects stated intent to hire; it does not confirm a hire occurred, a position was filled, or that a single headcount was added. Researchers should apply this distinction when using posting counts as economic indicators.
The platform focuses on making what is observable reliable, standardized, and analytically usable, while explicitly preserving uncertainty and avoiding inference beyond the data's scope. Raw job postings are ingested through a distributed scraping infrastructure spanning Indeed, LinkedIn Jobs, and 200,000+ employer career portals (Greenhouse, Lever, Workday, iCIMS, and others). Coverage is primarily the United States, with data beginning in 2022 and updates ranging from daily to hourly.
Each posting passes through a multi-stage enrichment pipeline combining deterministic parsing, machine learning classification, and job-level entity resolution. All raw, parsed, and enriched fields are retained for transparency and auditability. When a field is absent or unparseable, the record carries a NULL rather than a synthetic fill-in, and coverage rates are tracked by source and data vintage. The field priority rule throughout the pipeline is: parsed structured data takes precedence over scraped text, which takes precedence over calculated or model-predicted values.
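The field priority rule behaves like a coalescing function over the three value sources. The sketch below is illustrative only, not Canaria's implementation; the function name and signature are hypothetical:

```python
from typing import Optional

def resolve_field(parsed: Optional[str], scraped: Optional[str],
                  predicted: Optional[str]) -> Optional[str]:
    """Return the highest-priority non-NULL value:
    parsed structured data > scraped text > model-predicted."""
    for value in (parsed, scraped, predicted):
        if value is not None:
            return value
    return None  # absent fields stay NULL, never synthetically filled

# A scraped value beats a prediction; an all-NULL input stays NULL.
assert resolve_field(None, "Full-time", "Part-time") == "Full-time"
assert resolve_field(None, None, None) is None
```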
Pipeline Architecture
Every job posting passes through an 8-stage pipeline: ingestion, deduplication, NLP classification, location parsing, salary normalization, skills extraction, title normalization, and company enrichment.
1. Raw Scraped Data: 8B+ URLs, 1B+ ingested
2. Dedup & Aggregation: 907M unique records, 40-60% dedup rate
3. Location Parsing: city / state / country / zip / coordinates
4. Salary Normalization: annual USD min / avg / max
5. Description Extraction: skills, benefits, clearance, contact
6. ML Classification: SOC, seniority, employment, remote
7. Title Normalization: standardized titles + confidence scores
8. Company Enrichment: 28.5M companies, industry, size, HQ
9. Final Merged Delivery: 907M records, 82 fields, all enrichment stages joined with field priority logic
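Conceptually, the stages above compose as a chain of record transforms applied in order. The sketch below is a toy illustration of that composition; the stage lambdas and field values are invented for the example, and the real pipeline runs as distributed microservices:

```python
from functools import reduce

def apply_pipeline(record, stages):
    """Run a record through every enrichment stage in order;
    later stages see the output of earlier ones."""
    return reduce(lambda rec, stage: stage(rec), stages, record)

# Toy stage stubs standing in for the real microservices.
stages = [
    lambda r: {**r, "deduped": True},                            # dedup & aggregation
    lambda r: {**r, "city": "Austin", "state": "TX"},            # location parsing
    lambda r: {**r, "salary_min": 95000, "salary_max": 125000},  # salary normalization
]

record = apply_pipeline({"url": "https://example.com/job/1"}, stages)
assert record["deduped"] and record["state"] == "TX"
```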
Machine Learning Enrichment
Our enrichment pipeline is a microservice-based NLP system that transforms raw job text into structured, analytics-ready fields. Each model operates independently, allowing separate scaling and iteration.
Title Normalization
Canonicalizes noisy titles (abbreviations, typos) to standardized forms. Outputs confidence scores so researchers can filter or weight results by precision.
Coverage: >90% of titles mapped to a canonical form
Example: SW Eng II → Software Engineer
SOC Classification
Assigns 6-digit Standard Occupational Classification codes using title + description context. Enables occupational analysis aligned with the BLS 2018 SOC taxonomy.
Accuracy: 2-digit >95%, 6-digit 85-92%
Example: 15-1252 (Software Developers)
Seniority Detection
Infers career level (Entry, Mid, Senior, Lead, Manager, Director, Executive) from title patterns and responsibility descriptions. Always returns a classification.
Coverage: 100% (every record receives a classification)
Example: Senior
Employment Type
Classifies postings as Full-time, Part-time, Contract, or Temporary based on description text patterns and explicit mentions.
Coverage: ~90%
Example: Full-time
Remote Work Status
Determines if a position is Remote, Hybrid, On-site, or Flexible through keyword detection and location analysis.
Accuracy: >85% (2023+)
Example: Hybrid
Salary Prediction
Regression model trained on 50M+ Glassdoor/Indeed observations. Requires valid State, ZipCode, and SOC code.
Accuracy: MAPE <15%
Example: $95,000 – $125,000
Skills Extraction (NER)
Extracts 37,000+ skills, 3,000+ certifications, and 400+ soft skills using dictionary matching plus NLP relevance filtering to remove spurious matches.
Accuracy: F1: 85-92% (structured), 65-78% (narrative)
Example: ["Python", "AWS", "Docker"]
NAICS Prediction
Industry classification for company records using company name and job text signals. Enables industry-level labor market analysis.
Accuracy: In progress
Example: 511210 (Software Publishers)
SOC Classification: Accuracy and Coverage
Canaria assigns 6-digit Standard Occupational Classification (SOC) codes using both the job title and the full description text. Title-only classifiers commonly misclassify roles with generic titles such as "Manager" or "Analyst." Using description context brings 6-digit accuracy to 85-92% on postings with sufficient description length (more than 200 words). The taxonomy is aligned with the BLS 2018 SOC standard, enabling direct comparison with government labor statistics.
2-digit major group accuracy exceeds 95%, which is at or above the industry benchmark range of 93-97%. Coverage improves with data vintage as both source quality and model iterations improve.
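The accuracy hierarchy reflects SOC's nested structure: the 2-digit major group is simply the prefix of the 6-digit code, so aggregating to major groups recovers accuracy. A minimal helper (hypothetical name) illustrates the rollup:

```python
def soc_major_group(soc_code: str) -> str:
    """Collapse a 6-digit SOC code ('XX-XXXX') to its 2-digit major group.
    '15-1252' (Software Developers) -> '15' (Computer and Mathematical)."""
    return soc_code.split("-")[0]

assert soc_major_group("15-1252") == "15"
```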
Accuracy and Coverage Thresholds
| Metric | Production target | Acceptable | Investigate |
|---|---|---|---|
| 2-digit accuracy (major group) | >95% | >90% | <90% |
| 6-digit accuracy (descriptions >200 words) | 85-92% | >80% | <80% |
| Coverage (2023+) | >90% | >80% | <80% |
| Coverage (2020-2022) | 80-90% | >70% | <70% |
| Coverage (pre-2020) | >75% | >65% | <65% |
Known Classification Challenges
| Failure mode | Example |
|---|---|
| Ambiguous titles | "Manager", "Associate", "Specialist" without description context |
| New-economy roles | "Prompt Engineer" and "Growth Hacker" have no clean mapping in BLS taxonomy |
| Multi-function roles | "Software Engineer / Data Scientist" requires a primary SOC assignment |
Salary Prediction: Model and Coverage
Only about 2.5% of job postings across all sources include stated compensation. To address this gap, Canaria runs a regression model trained on 50M+ Glassdoor and Indeed salary observations to predict minimum, average, and maximum annual USD compensation for every record where State, Zip code, and SOC code are available. Predictions are annualized to USD; the original stated period (hourly, monthly, annual) is preserved in a separate field.
The dataset includes three complementary salary signals: posted ranges (where disclosed), model-predicted ranges (on all records meeting prerequisites), and employee-reported data (from Glassdoor, 10.5M+ individual reports). Posted ranges reflect what employers say they will pay; predicted ranges reflect market rates for the SOC, state, and zip combination; employee-reported data reflects what workers in those roles actually earned, including years-of-experience and bonus breakdowns not available from postings.
Prediction Accuracy
| Metric | Value | Context |
|---|---|---|
| MAPE (mean absolute percentage error) | <15% | Industry average is 15-25% |
| Within ±20% of actual | 65-80% | Industry benchmark range |
| Within ±30% of actual | 80-90% | Industry benchmark range |
| Coverage (2023+) | 85-95% | Requires valid State, Zip, SOC |
| Coverage (2020-2022) | 70-85% | |
| Coverage (pre-2020) | 50-70% | |
| Training observations | 50M+ | Glassdoor + Indeed salary disclosures |
Accuracy Degradation Factors
Prediction accuracy degrades in the following conditions. Researchers should apply appropriate filters or widen confidence intervals when working with these subsets:
- Very high compensation (above $300K/year)
- Hourly and gig-economy work
- International postings (model trained primarily on US data)
- Older data vintages (salary norms shift over time)
Salary invariants enforced on all records: salary_min ≤ salary_avg ≤ salary_max; all populated values are positive; all values are normalized to annual USD.
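A validation pass over these invariants might look like the following sketch; `check_salary_invariants` is a hypothetical name, not part of the delivered schema:

```python
def check_salary_invariants(salary_min, salary_avg, salary_max):
    """Validate annualized USD values: all positive and min <= avg <= max."""
    values = (salary_min, salary_avg, salary_max)
    if any(v is None for v in values):
        return False  # incomplete records are handled separately
    return all(v > 0 for v in values) and salary_min <= salary_avg <= salary_max

assert check_salary_invariants(95000, 110000, 125000)
assert not check_salary_invariants(125000, 110000, 95000)  # ordering violated
```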
Work Mode Detection: Remote, Hybrid, On-site
Work mode classification uses keyword detection combined with location analysis to determine whether a role is Remote, Hybrid, On-site, or Flexible. Researchers must account for a structural break in this field at 2020: before the pandemic, fewer than 1% of postings mentioned remote work, so NULL values in pre-2020 data are correct and expected. The field does not exist as a concept for that era.
As of 2023+, work mode is a standard field across major sources with greater than 85% coverage. A NULL in 2023+ data indicates a parsing gap rather than an implicitly on-site role. Remote postings peaked around 2022 at approximately 15% of all postings and have declined to approximately 10% as of 2024-2025. The term "hybrid" remains poorly standardized: it can mean one day per week or four days per week depending on the employer.
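When analyzing this field, NULL handling should therefore be era-aware. The sketch below encodes the interpretation rules just described; the function and its returned labels are illustrative assumptions, not delivered fields:

```python
from datetime import date

def interpret_work_mode(work_mode, posted: date):
    """Distinguish an expected NULL (pre-2020: the field did not exist)
    from a parsing gap (2023+: the field is standard)."""
    if work_mode is not None:
        return work_mode
    if posted.year < 2020:
        return "not_applicable"  # NULL is correct for this era
    if posted.year < 2023:
        return "ambiguous"       # coverage was only 40-70% in 2020-2022
    return "parsing_gap"         # 2023+ NULL means extraction failed

assert interpret_work_mode(None, date(2018, 6, 1)) == "not_applicable"
assert interpret_work_mode(None, date(2024, 6, 1)) == "parsing_gap"
assert interpret_work_mode("Hybrid", date(2024, 6, 1)) == "Hybrid"
```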
Coverage by Era
| Period | Coverage | Notes |
|---|---|---|
| Pre-2020 | NULL (correct) | Fewer than 1% of postings mentioned remote work. NULL is expected, not a gap. |
| 2020 | ~15-20% coverage | Remote mentions spiked but appeared only in description text, with no structured field yet. |
| 2021-2022 | 40-70% coverage | Structured remote/hybrid/on-site fields emerged as employers standardized. |
| 2023+ | 85%+ coverage | Work mode field is standard across major sources. NULL in this range indicates a parsing gap. |
| 2024-2025 | 85%+ coverage | "Hybrid" remains poorly defined: it can mean 1 day/week or 4 days/week depending on employer. Remote postings are declining from the 2022 peak (~15%) toward ~10%. |
Accuracy Metrics
| Metric | Value | Notes |
|---|---|---|
| Accuracy (2023+ postings) | 85%+ | State of the art: 92-97% for 2023+ data |
| Coverage (2023+) | 85%+ | |
| Coverage (2020-2022) | 40-70% | |
| Coverage (pre-2020) | <5% | NULL is correct, not a data gap |
Named Entity Recognition
A fast, dictionary-based keyword processor runs in-pipeline to extract structured entities from job descriptions. Extracted skills are filtered through a title-skill relevance model to remove spurious matches. Multi-valued attributes are represented as arrays; when no signal is present, fields return empty arrays rather than null.
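A minimal version of the dictionary-matching pass can be sketched as follows. The tiny skill dictionary and function name are invented for illustration, and the title-skill relevance model that removes spurious matches is omitted:

```python
import re

# Tiny illustrative dictionary; the production taxonomy holds 37,000+ terms.
SKILL_DICT = {"python": "Python", "aws": "AWS", "docker": "Docker",
              "problem solving": "Problem Solving"}

def extract_skills(description: str) -> list:
    """Dictionary-matching pass: find known skill terms on word boundaries.
    Returns an empty list (never None) when no signal is present."""
    text = description.lower()
    found = [canon for term, canon in SKILL_DICT.items()
             if re.search(r"\b" + re.escape(term) + r"\b", text)]
    return sorted(set(found))

assert extract_skills("Experience with Python and AWS required.") == ["AWS", "Python"]
assert extract_skills("No technical content here.") == []
```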
Technical Skills
Programming languages, tools, frameworks
Soft Skills
Leadership, communication, problem-solving
Certifications
Professional licenses and credentials (PMP, AWS, CPA)
Qualifications
Education requirements, years of experience
Benefits
Health insurance, 401(k), PTO mentions
Contact Signals
Emails and phone numbers extracted from posting text
Work Requirements
Visa sponsorship indicators, citizenship or residency requirements, security clearance
Work Conditions
Travel requirements, shift work and shift type mentions, language requirements
Role Characteristics
Urgent hiring cues, manager or lead indicators, number of openings, team size, start date
Extraction Accuracy (F1 Scores)
| Context | F1 Score |
|---|---|
| Bulleted / structured sections | 85-92% |
| Narrative / prose text | 65-78% |
Skills Coverage by Description Length
| Description Length | Skills Coverage |
|---|---|
| <50 words | 47.5% |
| 50–199 words | ~80% |
| 200–499 words | >85% |
| 500+ words | 99.5% |
Skills Extraction QA Thresholds
| Metric | Production target | Acceptable | Investigate |
|---|---|---|---|
| Skills coverage (description >200 chars) | >85% | >75% | <75% |
| Average skills per posting (2023+) | 5-15 | 3-20 | <3 or >25 |
| Taxonomy match rate | >90% | >85% | <85% |
Skills Taxonomy: Context and Scale
Canaria's 37,000+ skills taxonomy is built from O*NET mappings, public sources, and empirical signals derived from live job postings. It expands continuously as new skills emerge in the data. The table below compares major standards for context.
| Standard | Scale | Source | Notes |
|---|---|---|---|
| O*NET | ~35 broad categories | US Dept. of Labor | Free, public domain. Too coarse for most applications. |
| ESCO | 13,890 skills | European Commission | Multilingual (27 EU languages). Free, open. |
| Canaria | 37,000+ skills, 3,000+ certs, 400+ soft skills | Proprietary | Built from O*NET plus public taxonomies plus empirical signals. Captures emerging skills from live postings. |
Semantic Deduplication
A central challenge in online job postings data is severe duplication, driven by reposting behavior, aggregator syndication, and minor text changes over time. Canaria addresses this through a configurable, multi-signal deduplication framework that resolves repeated observations of the same hiring intent into a single canonical job entity. Across all sources combined, the average deduplication rate is 40-60%.
Vector Similarity
Captures semantic similarity between descriptions even when text is slightly altered.
MinHash / Jaccard
Handles variations in company names like "Macy's," "Macys Inc," or "Macy's LLC."
Title Similarity
Normalizes title variants such as "Junior Mechanical Engineer" and "Mechanical Eng I."
Geo-location Clustering
Groups postings within an adjustable radius (e.g. 10–50 miles).
Configurable Posting Window
Defines whether two postings within a given time range (e.g. 1–6 months) represent the same job.
Graph-based Transitive Matching
Captures transitive similarity (A ≈ B, B ≈ C → unify A, B, C) and supports custom deduplication policies per client.
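Transitive matching of this kind is typically implemented with a union-find (disjoint-set) structure: pairwise matches from the similarity signals above are unioned, and each connected component becomes one canonical job entity. A minimal sketch with invented job IDs:

```python
class UnionFind:
    """Minimal union-find to unify transitively similar postings."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
# Pairwise matches from the similarity signals: A ≈ B and B ≈ C.
uf.union("job_A", "job_B")
uf.union("job_B", "job_C")
# Transitivity unifies all three into one canonical cluster.
assert uf.find("job_A") == uf.find("job_C")
```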
Deduplication Benchmarks
| Source / Metric | Dedup Rate |
|---|---|
| Overall (multi-source) | 40-60% |
| ATS (internal) | <2% |
| Google Jobs / Aggregators | 60-70% |
| Indeed (internal) | ~15% |
| LinkedIn (internal) | <10% |
| Duplicate jobId in delivery | 0% (enforced) |
Location Parsing
Raw location strings from job postings are parsed into structured geographic components and geocoded to latitude/longitude. Country-level parsing exceeds 98% accuracy; city-level accuracy is 85-93% due to ambiguous city names, metro area strings, and multi-location postings. The is_remote_global flag identifies postings where location is not a physical place.
| Level | Accuracy | Notes |
|---|---|---|
| Country | >98% | Rarely ambiguous |
| State / Province | 92-97% | Fails on ambiguous city names and international postings |
| City | 85-93% | Metro area strings and multi-location postings reduce accuracy |
| Zip / Postal code | 70-85% | Often inferred rather than directly stated |
| Lat / Lng (geocoded) | 80-90% within 25 miles | Dependent on city-level accuracy |
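The "within 25 miles" geocoding metric implies a great-circle distance check between predicted and ground-truth coordinates. A standard haversine sketch (the QA harness itself is not part of this document):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lng points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(a))  # 3958.8 = Earth radius in miles

# QA check in the spirit of the table: geocode within 25 miles of truth.
dist = haversine_miles(30.2672, -97.7431, 30.5083, -97.6789)  # Austin vs. Round Rock
assert dist < 25
```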
Data Quality by Source
Not all job posting sources are created equal. Data completeness, duplication rates, and field availability vary significantly by source type. ATS direct feeds from employer career portals provide the highest quality data, while aggregator sources like Google Jobs carry substantially higher duplication and null rates.
| Metric | ATS Direct | LinkedIn | Indeed | Glassdoor | Google Jobs |
|---|---|---|---|---|---|
| Internal dedup rate | <2% | <10% | <15% | <15% | <30% |
| Description null | <3% | <5% | <5% | <10% | <15% |
| Location null | <2% | <3% | <5% | <5% | <10% |
| Date posted null | <5% | <5% | <5% | <10% | <15% |
| Salary stated (2023+) | <10% | 15-30% | 40-60% | 20-40% | varies |
| Seniority available | <5% | 60-80% | <10% | <10% | <10% |
| Data quality | Highest | High | Good | Good | Lowest |
Coverage by Data Vintage
Null rates vary by time period due to schema evolution, source availability changes, and the rollout of salary transparency laws. Researchers should account for data vintage when interpreting coverage.
| Field | Pre-2020 | 2020-2022 | 2023+ |
|---|---|---|---|
| Salary (stated) | 5-15% | 15-30% | 40-60% |
| Salary (predicted) | 50-70% | 70-85% | 85-95% |
| Work mode (remote/hybrid/onsite) | <5% | 40-70% | 85%+ |
| Seniority | 60-80% | 75-85% | 85-95% |
| SOC code | 70-85% | 80-90% | 85-95% |
| Skills | 50-70% | 70-85% | 80-93% |
| Location (parsed) | 85-95% | 92-97% | 95%+ |
| Company name | 97%+ | 98%+ | 99%+ |
Salary Transparency Law Timeline
US salary disclosure laws are the primary driver of improving salary coverage in job posting data. Each new law causes a measurable increase in the share of postings that include compensation information.
| Year | Law | Impact |
|---|---|---|
| 2021 | Colorado | First US state; salary field begins populating |
| 2022 | NYC (Nov) | Salary null rates drop 15-25pp in NYC |
| 2023 | CA, WA, NY State | ~25% of US workforce covered. Major salary coverage jump. |
| 2024 | Hawaii | Continued improvement |
| 2025 | Illinois, Minnesota | Further expansion. NJ, MA, VT pending. |
| 2025+ | EU Pay Transparency Directive | European salary data improving |
Company Enrichment
Canaria maintains a company database of 28.5M companies, enriching each job posting with industry, employee count, headquarters location, revenue, and founding year. Company entity resolution uses a hybrid approach combining fuzzy name matching, domain/URL matching, LinkedIn company ID joins, and ML-based resolution to achieve 90-95% match accuracy while keeping false merge rates below 1%.
Company Matching Methods
| Method | Accuracy | Notes |
|---|---|---|
| Exact string match | 50-60% | Baseline, insufficient alone |
| Fuzzy matching (Levenshtein, Jaro-Winkler) | 70-80% | |
| Token similarity (TF-IDF, Jaccard) | 75-85% | |
| Domain / URL matching | 85-90% | |
| LinkedIn company ID join | 90-95% | Highest precision signal |
| ML entity resolution | 85-92% | |
| Hybrid (fuzzy + URL + ML) | 90-95% | Current production approach |
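As a taste of the fuzzy-matching layer, the sketch below strips legal suffixes and compares names with a character-level similarity from the Python standard library. It is a simplified stand-in: production combines Levenshtein/Jaro-Winkler distances with URL and ML signals, and the helper names here are invented:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Strip punctuation and common legal suffixes before comparison."""
    name = name.lower().replace("'", "").replace(",", "").replace(".", "")
    for suffix in (" inc", " llc", " ltd", " corp"):
        name = name.removesuffix(suffix)
    return name.strip()

def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] on normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# The "Macy's" variants from the MinHash/Jaccard example collapse together.
assert name_similarity("Macy's", "Macys Inc") > 0.9
assert name_similarity("Macy's", "Nordstrom") < 0.5
```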
Quality Metrics
| Metric | Value |
|---|---|
| Companies in database | 28.5M |
| Match rate (job postings to canonical company) | >90% |
| False merge rate (distinct companies incorrectly merged) | <1% |
| False split rate (same company as separate entities) | <5% |
| Orphan companies (no canonical match) | <10% |
Data Quality Monitoring
Canaria maintains comprehensive monitoring and quality assurance systems across all ingestion and processing stages. A pipeline component tracks field-level completeness for every scraped batch. Non-empty field counts are aggregated by source and time period, enabling detection of source degradation or scraping failures. Completeness metrics inform prioritization of enrichment efforts. Source URLs are monitored for consecutive failure streaks and flagged for manual review when potentially defunct. Monitoring dashboards visualize trends and trigger alerts for anomalies including queue depth spikes, error rate thresholds, and crawler failures.
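Field-level completeness tracking reduces to counting non-NULL values per field per batch. A minimal sketch (function name hypothetical; missing keys count as NULL):

```python
def field_completeness(batch: list) -> dict:
    """Share of non-NULL values per field across a scraped batch.
    Fields absent from a record count as NULL for that record."""
    if not batch:
        return {}
    fields = set().union(*(r.keys() for r in batch))
    return {f: sum(r.get(f) is not None for r in batch) / len(batch)
            for f in sorted(fields)}

batch = [
    {"title": "SW Eng II", "salary_min": 95000, "work_mode": None},
    {"title": "Data Analyst", "salary_min": None, "work_mode": "Hybrid"},
]
stats = field_completeness(batch)
assert stats["title"] == 1.0
assert stats["salary_min"] == 0.5
```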
Research Applications
The Canaria dataset enables empirical analysis across multiple labor economics domains.
Wage Dynamics & Compensation
- Geographic wage differentials (metro vs rural, coastal vs inland)
- Salary transparency effects (pre/post state disclosure laws)
- Compensation trends by occupation, industry, and company size
Labor Demand & Skill Requirements
- Skill demand evolution over time (e.g., AI/ML skill growth)
- Education-occupation mismatch (degree requirements vs SOC norms)
- Occupational mobility pathways via skill overlap analysis
Remote Work & Geographic Flexibility
- Remote work adoption rates by occupation and industry
- Remote work wage premia and discounts
- Geographic concentration trends and migration patterns
Firm-level Hiring Behavior
- Posting persistence and time-to-fill estimation
- Hiring velocity as a leading economic indicator
- Expansion signals via new-role detection and headcount proxies