Canaria

Data Collection and Enrichment Methodology

How we collect, standardize, enrich, and deduplicate job market data at scale.

Platform Overview

URLs Scraped: 8B+
URLs Ingested (Unique): 1B+
Unique Job Postings (After Semantic Dedup): 900M+
Fields per Record: 82
Historical Coverage: 2022-present
Update Frequency: Daily (incremental)
Primary Sources: Indeed, LinkedIn Jobs, 200,000+ employer ATS portals (Greenhouse, Lever, Workday, iCIMS)
Geographic Scope: United States (primary), expanding internationally
Delivery Formats: CSV, Parquet via S3, GCS, Google Drive, Dropbox, SFTP

Data Collection Philosophy

Canaria treats job postings as observable signals of employer labor demand, not as direct measures of hiring outcomes. The platform focuses on making what is observable reliable, standardized, and analytically usable, while explicitly preserving uncertainty and avoiding inference beyond the data's scope. Raw job postings are ingested through a distributed scraping infrastructure and processed through a multi-stage pipeline combining deterministic parsing, machine learning-based enrichment, and job-level entity resolution. All raw, parsed, and enriched fields are retained for transparency and auditability. The system prioritizes methodological rigor, reproducibility, and clarity over opaque or black-box data products.

Pipeline Architecture

Every job posting passes through an 8-stage pipeline: ingestion, deduplication, NLP classification, location parsing, salary normalization, skills extraction, title normalization, and company enrichment. The stage outputs are then combined in a final merge step that produces the delivered record.

stage000 Raw Scraped Data: 8B+ URLs, 1B+ ingested
stage001 Dedup & Aggregation: 907M unique records, 40-60% dedup rate
stage010 Location Parsing: City / state / country / zip / coordinates
stage011 Salary Normalization: Annual USD min / avg / max
stage012 Description Extraction: Skills, benefits, clearance, contact
stage002 ML Classification: SOC, seniority, employment, remote
stage020 Title Normalization: Standardized titles + confidence scores
stage100 Company Enrichment: 28.5M companies, industry, size, HQ
stage050 Final Merged Delivery: 907M records, 82 fields, 7 LEFT JOINs + priority logic
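The final merge can be pictured as a chain of left joins keyed on a job ID, followed by a priority rule that decides which value wins when more than one stage supplies a candidate for the same field. The sketch below uses plain dictionaries. The field names (`salary_stated`, `salary_predicted`, `salary_final`), the stage tables, and the "stated before predicted" rule are illustrative assumptions, not Canaria's actual schema or priority order.

```python
from typing import Any

Table = dict[str, dict[str, Any]]  # job ID -> record

def left_join(base: Table, enrichment: Table) -> Table:
    """LEFT JOIN on job ID: every base record is kept, matched or not."""
    merged = {}
    for job_id, record in base.items():
        extra = enrichment.get(job_id, {})  # no match -> no new fields
        merged[job_id] = {**record, **extra}
    return merged

def first_non_null(*candidates: Any) -> Any:
    """Priority logic: return the first candidate that is not None."""
    for value in candidates:
        if value is not None:
            return value
    return None

# Hypothetical stage outputs, each keyed by job ID.
deduped = {"j1": {"title": "SW Eng II"}, "j2": {"title": "Nurse"}}
salary = {"j1": {"salary_stated": None, "salary_predicted": 110000}}
classified = {"j1": {"soc": "15-1252"}, "j2": {"soc": "29-1141"}}

delivery = deduped
for stage_output in (salary, classified):
    delivery = left_join(delivery, stage_output)

# Assumed rule: prefer a stated salary, fall back to the predicted one.
for record in delivery.values():
    record["salary_final"] = first_non_null(
        record.get("salary_stated"), record.get("salary_predicted"))
```

Because each join is a left join, a posting with no salary row (`j2` here) still reaches delivery, with the salary field left empty rather than the record being dropped.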

Machine Learning Enrichment (Model Garden)

The Model Garden is a microservice-based NLP pipeline that transforms raw job text into structured, analytics-ready fields. Each model operates independently, allowing separate scaling and iteration.

1. Title Normalization

Canonicalizes noisy titles (abbreviations, typos) to standardized forms. Outputs confidence scores so researchers can filter or weight results by precision.

Coverage: >90% of titles mapped

Example: SW Eng II → Software Engineer

2. SOC Classification

Assigns 6-digit Standard Occupational Classification codes using title + description context. Enables occupational analysis aligned with the BLS 2018 SOC taxonomy.

Accuracy: 2-digit >95%, 6-digit 85-92%

Example: 15-1252 — Software Developers

3. Seniority Detection

Infers career level (Entry, Mid, Senior, Lead, Manager, Director, Executive) from title patterns and responsibility descriptions. Always returns a classification.

Coverage: 100% (a classification is always returned)

Example: Senior

4. Employment Type

Classifies postings as Full-time, Part-time, Contract, or Temporary based on description text patterns and explicit mentions.

Accuracy: ~90% coverage

Example: Full-time

5. Remote Work Status

Determines if a position is Remote, Hybrid, On-site, or Flexible through keyword detection and location analysis.

Accuracy: >85% (2023+)

Example: Hybrid

6. Salary Prediction

Regression model trained on 50M+ Glassdoor/Indeed observations. Requires valid State, ZipCode, and SOC code.

Accuracy: MAPE <15%

Example: $95,000 – $125,000
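For readers unfamiliar with the metric quoted for salary prediction, MAPE (mean absolute percentage error) is the average of the absolute prediction errors, each expressed as a share of the actual value. A minimal sketch follows; the salary figures are made up for illustration and are not Canaria results.

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

# Illustrative values only.
actual_salaries = [100_000, 80_000, 120_000]
predicted_salaries = [110_000, 72_000, 120_000]
error = mape(actual_salaries, predicted_salaries)  # errors of 10%, 10%, 0%
```

In this toy case the three errors average to about 6.7%, which would fall under the stated 15% threshold.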

7. Skills Extraction (NER)

Extracts 37,000+ skills, 3,000+ certifications, and 400+ soft skills using Aho-Corasick dictionary matching plus NLP relevance filtering to remove spurious matches.

Accuracy: F1: 85-92% (structured), 65-78% (narrative)

Example: ["Python", "AWS", "Docker"]

8. NAICS Prediction

Industry classification for company records using company name and job text signals. Enables industry-level labor market analysis.

Accuracy: In progress

Example: 511210 — Software Publishers

Named Entity Recognition

A fast, dictionary-based keyword processor runs in-pipeline to extract structured entities from job descriptions. Extracted skills are filtered through a title-skill relevance model to remove spurious matches. Multi-valued attributes are represented as arrays; when no signal is present, fields return empty arrays rather than null.

Technical Skills

Programming languages, tools, frameworks

Soft Skills

Leadership, communication, problem-solving

Certifications

Professional licenses and credentials (PMP, AWS, CPA)

Qualifications

Education requirements, years of experience

Benefits

Health insurance, 401(k), PTO mentions

Contact Signals

Emails and phone numbers extracted from posting text

Work Requirements

Visa sponsorship indicators, citizenship or residency requirements, security clearance

Work Conditions

Travel requirements, shift work and shift type mentions, language requirements

Role Characteristics

Urgent hiring cues, manager or lead indicators, number of openings, team size, start date
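Dictionary-based extraction of this kind can be sketched with a small Aho-Corasick automaton, which finds every dictionary term in a single pass over the text. The code below is a minimal illustration, not Canaria's implementation: the four-term dictionary is hypothetical, and the word-boundary check is only a simple stand-in for one class of spurious match. The title-skill relevance model is not reproduced here.

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton for case-insensitive multi-pattern search."""

    def __init__(self, patterns: list[str]):
        self.goto: list[dict[str, int]] = [{}]  # trie edges per node
        self.fail: list[int] = [0]              # failure links
        self.out: list[list[str]] = [[]]        # patterns recognized at node
        for pattern in patterns:
            self._add(pattern)
        self._build_failure_links()

    def _add(self, pattern: str) -> None:
        node = 0
        for ch in pattern.lower():
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pattern)

    def _build_failure_links(self) -> None:
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                fallback = self.fail[node]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[child] = self.goto[fallback].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find(self, text: str) -> list[tuple[int, str]]:
        """Return (start index in the lowercased text, pattern) per match."""
        matches, node = [], 0
        for i, ch in enumerate(text.lower()):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                matches.append((i - len(pattern) + 1, pattern))
        return matches

def extract_skills(text: str, automaton: AhoCorasick) -> list[str]:
    """Keep matches on word boundaries; return an empty list when none are found."""
    lowered = text.lower()
    found: list[str] = []
    for start, pattern in automaton.find(text):
        end = start + len(pattern)
        before_ok = start == 0 or not lowered[start - 1].isalnum()
        after_ok = end == len(lowered) or not lowered[end].isalnum()
        if before_ok and after_ok and pattern not in found:
            found.append(pattern)
    return found

# Hypothetical dictionary and posting text.
automaton = AhoCorasick(["Python", "AWS", "Docker", "Java"])
skills = extract_skills(
    "Experience with Python, AWS and Docker; JavaScript a plus.", automaton)
```

Here `skills` is `["Python", "AWS", "Docker"]`: the "Java" hit inside "JavaScript" is rejected by the boundary check, and a posting with no dictionary terms yields `[]` rather than a null.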

Extraction Accuracy (F1 Scores)

Context | F1 Score
Bulleted / structured sections | 85-92%
Narrative / prose text | 65-78%
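F1 is the harmonic mean of precision and recall. For a single posting it can be computed by comparing the extracted set with a hand-labeled set, as in the sketch below. The skill sets are invented for illustration, and the source does not say how scores are aggregated across postings.

```python
def precision_recall_f1(predicted: set[str],
                        gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one posting."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative example: 3 of 4 extracted skills are correct, 3 of 5 true skills found.
extracted = {"Python", "AWS", "Docker", "Excel"}
labeled = {"Python", "AWS", "Docker", "Kubernetes", "SQL"}
precision, recall, f1 = precision_recall_f1(extracted, labeled)
```

With precision 0.75 and recall 0.60, F1 comes out at roughly 0.67.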

Skills Coverage by Description Length

Description Length | Skills Coverage
<50 words | 47.5%
50–199 words | ~80%
200–499 words | >85%
500+ words | 99.5%

Semantic Deduplication

A central challenge in online job postings data is severe duplication, driven by reposting behavior, aggregator syndication, and minor text changes over time. Canaria addresses this through a configurable, multi-signal deduplication framework that resolves repeated observations of the same hiring intent into a single canonical job entity. Across all sources combined, the average deduplication rate is 40-60%.

Vector Similarity

Captures semantic similarity between descriptions even when text is slightly altered.
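The idea can be illustrated with cosine similarity between two description vectors. The source does not say which embedding is used, so the sketch below substitutes simple bag-of-words counts as a stand-in; the example texts are invented.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Token counts, used here as a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

original = bag_of_words(
    "Seeking a mechanical engineer to design HVAC systems in Denver")
repost = bag_of_words(
    "Seeking a mechanical engineer to design HVAC systems in Denver. Apply today!")
unrelated = bag_of_words("Registered nurse needed for night shifts")

repost_score = cosine_similarity(original, repost)        # about 0.91
unrelated_score = cosine_similarity(original, unrelated)  # 0.0, no shared tokens
```

A repost with a short appended call to action stays close to the original, while an unrelated posting scores zero under this toy representation.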

MinHash / Jaccard

Handles variations in company names like "Macy's," "Macys Inc," or "Macy's LLC."
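A minimal sketch of this signal, assuming character trigrams as the shingle unit (the source does not specify the shingling or the number of hash functions): Jaccard similarity compares the shingle sets directly, and a MinHash signature gives a fixed-size estimate of the same quantity.

```python
import hashlib

def shingles(name: str, k: int = 3) -> set[str]:
    """Character k-grams of the name with case and punctuation removed."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum())
    if len(cleaned) < k:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + k] for i in range(len(cleaned) - k + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _hash64(item: str, seed: int) -> int:
    """Deterministic 64-bit hash of item, varied by seed."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """Minimum hash value per seeded hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash64(item, seed) for item in items)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of signature positions that agree, an estimate of Jaccard."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)

macys = shingles("Macy's")
macys_inc = shingles("Macys Inc")
exact = jaccard(macys, macys_inc)  # 3 shared trigrams out of 6 -> 0.5
approx = estimated_jaccard(minhash_signature(macys),
                           minhash_signature(macys_inc))
```

Punctuation differences alone ("Macy's" vs "Macys") give a Jaccard of 1.0 here, while a legal suffix lowers the score to 0.5. A production system would likely strip suffixes such as "Inc" or "LLC" first, though that step is an assumption and is not described in the source.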

Title Similarity

Normalizes title variants such as "Junior Mechanical Engineer" and "Mechanical Eng I."

Geo-location Clustering

Groups postings within an adjustable radius (e.g. 10–50 miles).
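One way to implement a radius test is the haversine great-circle distance between two parsed coordinates. The sketch below is illustrative: the 25-mile default is an arbitrary value inside the range quoted above, and the coordinates are approximate city centers.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1: float, lon1: float,
                    lat2: float, lon2: float) -> float:
    """Great-circle distance between two latitude/longitude points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def within_radius(a: tuple[float, float], b: tuple[float, float],
                  radius_miles: float = 25.0) -> bool:
    """True when the two points are no farther apart than the radius."""
    return haversine_miles(*a, *b) <= radius_miles

manhattan = (40.7831, -73.9712)
newark = (40.7357, -74.1724)
philadelphia = (39.9526, -75.1652)

same_area = within_radius(manhattan, newark)                 # roughly 11 miles apart
different_area = within_radius(manhattan, philadelphia, 50)  # well over 50 miles
```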

Configurable Posting Window

Defines whether two postings within a given time range (e.g. 1–6 months) represent the same job.
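The effect of the window setting can be shown with a simple date comparison. The function name and the two window lengths below are hypothetical; they only illustrate how the same pair of postings can be treated as one job under a long window and as two jobs under a short one.

```python
from datetime import date

def within_posting_window(first_posted: date, second_posted: date,
                          window_days: int) -> bool:
    """True when the two posting dates are at most window_days apart."""
    return abs((second_posted - first_posted).days) <= window_days

original = date(2024, 1, 15)
repost = date(2024, 4, 20)  # 96 days after the original

same_job_short_window = within_posting_window(original, repost, window_days=30)
same_job_long_window = within_posting_window(original, repost, window_days=180)
```

With a 30-day window the repost counts as a new job; with a 180-day window it is a candidate duplicate of the original, subject to the other signals.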

Graph-based Transitive Matching

Captures transitive similarity (A ≈ B, B ≈ C → unify A, B, C) and supports custom deduplication policies per client.
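Transitive unification is commonly implemented with a union-find (disjoint-set) structure over pairwise matches. The sketch below assumes the pairwise matches have already been decided by the signals above; the job IDs are placeholders, and the source does not state which graph algorithm Canaria uses.

```python
class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}

    def find(self, item: str) -> str:
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a: str, b: str) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def cluster(job_ids: list[str],
            matched_pairs: list[tuple[str, str]]) -> list[list[str]]:
    """Group job IDs into connected components of the match graph."""
    uf = UnionFind()
    for job_id in job_ids:
        uf.find(job_id)  # register singletons too
    for a, b in matched_pairs:
        uf.union(a, b)
    groups: dict[str, list[str]] = {}
    for job_id in job_ids:
        groups.setdefault(uf.find(job_id), []).append(job_id)
    return sorted(sorted(group) for group in groups.values())

# A matches B and B matches C, so A, B, and C collapse into one entity.
clusters = cluster(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

The result is `[["A", "B", "C"], ["D"]]`: A and C are unified even though they were never compared directly, and D remains its own job.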

Deduplication Benchmarks

Source / Metric | Dedup Rate
Overall (multi-source) | 40-60%
ATS (internal) | <2%
Google Jobs / Aggregators | 60-70%
Indeed (internal) | ~15%
LinkedIn (internal) | <10%
Duplicate jobId in delivery | 0% (enforced)

Data Quality by Source

Not all job posting sources are created equal. Data completeness, duplication rates, and field availability vary significantly by source type. ATS direct feeds from employer career portals provide the highest quality data, while aggregator sources like Google Jobs carry substantially higher duplication and null rates.

Metric | ATS Direct | LinkedIn | Indeed | Glassdoor | Google Jobs
Internal dedup rate | <2% | <10% | <15% | <15% | <30%
Description null | <3% | <5% | <5% | <10% | <15%
Location null | <2% | <3% | <5% | <5% | <10%
Date posted null | <5% | <5% | <5% | <10% | <15%
Salary stated (2023+) | <10% | 15-30% | 40-60% | 20-40% | varies
Seniority available | <5% | 60-80% | <10% | <10% | <10%
Data quality | Highest | High | Good | Good | Lowest

Coverage by Data Vintage

Field coverage (the share of records with a non-null value) varies by time period due to schema evolution, source availability changes, and the rollout of salary transparency laws. Researchers should account for data vintage when interpreting coverage.

Field | Pre-2020 | 2020-2022 | 2023+
Salary (stated) | 5-15% | 15-30% | 40-60%
Salary (predicted) | 50-70% | 70-85% | 85-95%
Work mode (remote/hybrid/onsite) | <5% | 40-70% | 85%+
Seniority | 60-80% | 75-85% | 85-95%
SOC code | 70-85% | 80-90% | 85-95%
Skills | 50-70% | 70-85% | 80-93%
Location (parsed) | 85-95% | 92-97% | 95%+
Company name | 97%+ | 98%+ | 99%+

Salary Transparency Law Timeline

US salary disclosure laws are the primary driver of improving salary coverage in job posting data. Each new law causes a measurable increase in the share of postings that include compensation information.

Year | Law | Impact
2021 | Colorado | First US state; salary field begins populating
2022 | NYC (Nov) | Salary null rates drop 15-25pp in NYC
2023 | CA, WA, NY State | ~25% of US workforce covered. Major salary coverage jump.
2024 | Hawaii | Continued improvement
2025 | Illinois, Minnesota | Further expansion. NJ, MA, VT pending.
2025+ | EU Pay Transparency Directive | European salary data improving

Company Enrichment (CDP)

Canaria maintains a Company Data Platform (CDP) of 28.5M companies, enriching each job posting with industry, employee count, headquarters location, revenue, and founding year. Company matching uses fuzzy name resolution with strict quality controls.

Metric | Value
Companies in CDP | 28.5M
Match rate (job postings to canonical company) | >90%
False merge rate | <1%
False split rate | <5%
Orphan companies | <10%

Data Quality Monitoring

Canaria maintains comprehensive monitoring and quality assurance systems across all ingestion and processing stages. A pipeline component tracks field-level completeness for every scraped batch. Non-empty field counts are aggregated by source and time period, enabling detection of source degradation or scraping failures. Completeness metrics inform prioritization of enrichment efforts. Source URLs are monitored for consecutive failure streaks and flagged for manual review when potentially defunct. Monitoring dashboards visualize trends and trigger alerts for anomalies including queue depth spikes, error rate thresholds, and crawler failures.
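Two of the checks described above, field-level completeness by source and consecutive failure streaks, can be sketched in a few lines. The record layout, the `source` key, and the definition of "empty" (None, empty string, or empty list) are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any

def completeness_by_source(records: list[dict[str, Any]],
                           fields: list[str]) -> dict[str, dict[str, float]]:
    """Share of records with a non-empty value, per source and field."""
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    totals: dict[str, int] = defaultdict(int)
    for record in records:
        source = record.get("source", "unknown")
        totals[source] += 1
        for field in fields:
            if record.get(field) not in (None, "", []):
                counts[source][field] += 1
    return {source: {field: counts[source][field] / totals[source]
                     for field in fields}
            for source in totals}

def trailing_failure_streak(outcomes: list[bool]) -> int:
    """Number of consecutive failed fetches at the end of the history.

    Each entry is True for a successful fetch and False for a failure.
    """
    streak = 0
    for succeeded in reversed(outcomes):
        if succeeded:
            break
        streak += 1
    return streak

# Hypothetical scraped batch.
batch = [
    {"source": "ats", "title": "Nurse", "salary": 70000, "skills": ["CPR"]},
    {"source": "ats", "title": "Clerk", "salary": None, "skills": []},
    {"source": "aggregator", "title": "", "salary": None, "skills": []},
]
report = completeness_by_source(batch, ["title", "salary", "skills"])
streak = trailing_failure_streak([True, False, False, False])  # 3 failures in a row
```

A sudden drop in one source's completeness, or a streak above some chosen threshold, is the kind of signal that would be surfaced for review.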

Research Applications

The Canaria dataset enables empirical analysis across multiple labor economics domains.

Wage Dynamics & Compensation

  • Geographic wage differentials (metro vs rural, coastal vs inland)
  • Salary transparency effects (pre/post state disclosure laws)
  • Compensation trends by occupation, industry, and company size

Labor Demand & Skill Requirements

  • Skill demand evolution over time (e.g., AI/ML skill growth)
  • Education-occupation mismatch (degree requirements vs SOC norms)
  • Occupational mobility pathways via skill overlap analysis

Remote Work & Geographic Flexibility

  • Remote work adoption rates by occupation and industry
  • Remote work wage premia and discounts
  • Geographic concentration trends and migration patterns

Firm-level Hiring Behavior

  • Posting persistence and time-to-fill estimation
  • Hiring velocity as a leading economic indicator
  • Expansion signals via new-role detection and headcount proxies
