Data Collection and Enrichment Methodology
How we collect, standardize, enrich, and deduplicate job market data at scale.
Platform Overview
| Metric | Value |
|---|---|
| URLs Scraped | 8B+ |
| URLs Ingested (Unique) | 1B+ |
| Unique Job Postings (After Semantic Dedup) | 900M+ |
| Fields per Record | 82 |
| Historical Coverage | 2022-present |
| Update Frequency | Daily |
| Primary Sources | Indeed, LinkedIn Jobs, 200,000+ employer ATS portals (Greenhouse, Lever, Workday, iCIMS) |
| Geographic Scope | United States (primary), expanding internationally |
| Delivery Formats | CSV, Parquet via S3, GCS, Google Drive, Dropbox, SFTP |
Data Collection Philosophy
Canaria treats job postings as observable signals of employer labor demand, not as direct measures of hiring outcomes. A posted job reflects stated intent to hire; it does not confirm a hire occurred, a position was filled, or that a single headcount was added. Researchers should apply this distinction when using posting counts as economic indicators.
The platform focuses on making what is observable reliable, standardized, and analytically usable, while explicitly preserving uncertainty and avoiding inference beyond the data's scope. Raw job postings are ingested through a distributed scraping infrastructure spanning Indeed, LinkedIn Jobs, and 200,000+ employer career portals (Greenhouse, Lever, Workday, iCIMS, and others). Coverage is primarily the United States, with data beginning in 2022 and updates ranging from daily to hourly.
Each posting passes through a multi-stage enrichment pipeline combining deterministic parsing, machine learning classification, and job-level entity resolution. All raw, parsed, and enriched fields are retained for transparency and auditability. When a field is absent or unparseable, the record carries a NULL rather than a synthetic fill-in, and coverage rates are tracked by source and data vintage. The field priority rule throughout the pipeline is: parsed structured data takes precedence over scraped text, which takes precedence over calculated or model-predicted values.
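The field priority rule behaves like a coalescing function over the three value sources. The sketch below is illustrative only, not Canaria's implementation; the function name and signature are hypothetical:

```python
from typing import Optional

def resolve_field(parsed: Optional[str], scraped: Optional[str],
                  predicted: Optional[str]) -> Optional[str]:
    """Return the highest-priority non-NULL value:
    parsed structured data > scraped text > model-predicted."""
    for value in (parsed, scraped, predicted):
        if value is not None:
            return value
    return None  # absent fields stay NULL, never synthetically filled

# A scraped value beats a prediction; an all-NULL input stays NULL.
assert resolve_field(None, "Full-time", "Part-time") == "Full-time"
assert resolve_field(None, None, None) is None
```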
Pipeline Architecture
Every job posting passes through an 8-stage pipeline: ingestion, deduplication, NLP classification, location parsing, salary normalization, skills extraction, title normalization, and company enrichment.
1. Raw Scraped Data: 8B+ URLs, 1B+ ingested
2. Dedup & Aggregation: 907M unique records, 40-60% dedup rate
3. Location Parsing: city / state / country / zip / coordinates
4. Salary Normalization: annual USD min / avg / max
5. Description Extraction: skills, benefits, clearance, contact
6. ML Classification: SOC, seniority, employment, remote
7. Title Normalization: standardized titles + confidence scores
8. Company Enrichment: 28.5M companies, industry, size, HQ
9. Final Merged Delivery: 907M records, 82 fields, all enrichment stages joined with field priority logic
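Conceptually, the stages above compose as a chain of record transforms applied in order. The sketch below is a toy illustration of that composition; the stage lambdas and field values are invented for the example, and the real pipeline runs as distributed microservices:

```python
from functools import reduce

def apply_pipeline(record, stages):
    """Run a record through every enrichment stage in order;
    later stages see the output of earlier ones."""
    return reduce(lambda rec, stage: stage(rec), stages, record)

# Toy stage stubs standing in for the real microservices.
stages = [
    lambda r: {**r, "deduped": True},                            # dedup & aggregation
    lambda r: {**r, "city": "Austin", "state": "TX"},            # location parsing
    lambda r: {**r, "salary_min": 95000, "salary_max": 125000},  # salary normalization
]

record = apply_pipeline({"url": "https://example.com/job/1"}, stages)
assert record["deduped"] and record["state"] == "TX"
```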
Machine Learning Enrichment
Our enrichment pipeline is a microservice-based NLP system that transforms raw job text into structured, analytics-ready fields. Each model operates independently, allowing separate scaling and iteration.
Title Normalization
Canonicalizes noisy titles (abbreviations, typos) to standardized forms. Outputs confidence scores so researchers can filter or weight results by precision.
Coverage: >90% of titles mapped to a canonical form
Example: SW Eng II → Software Engineer
SOC Classification
Assigns 6-digit Standard Occupational Classification codes using title + description context. Enables occupational analysis aligned with the BLS 2018 SOC taxonomy.
Accuracy: 2-digit >95%, 6-digit 85-92%
Example: 15-1252 (Software Developers)
Seniority Detection
Infers career level (Entry, Mid, Senior, Lead, Manager, Director, Executive) from title patterns and responsibility descriptions. Always returns a classification.
Coverage: 100% (every record receives a classification)
Example: Senior
Employment Type
Classifies postings as Full-time, Part-time, Contract, or Temporary based on description text patterns and explicit mentions.
Coverage: ~90%
Example: Full-time
Remote Work Status
Determines if a position is Remote, Hybrid, On-site, or Flexible through keyword detection and location analysis.
Accuracy: >85% (2023+)
Example: Hybrid
Salary Prediction
Regression model trained on 50M+ Glassdoor/Indeed observations. Requires valid State, ZipCode, and SOC code.
Accuracy: MAPE <15%
Example: $95,000 – $125,000
Skills Extraction (NER)
Extracts 37,000+ skills, 3,000+ certifications, and 400+ soft skills using dictionary matching plus NLP relevance filtering to remove spurious matches.
Accuracy: F1: 85-92% (structured), 65-78% (narrative)
Example: ["Python", "AWS", "Docker"]
NAICS Prediction
Industry classification for company records using company name and job text signals. Enables industry-level labor market analysis.
Accuracy: In progress
Example: 511210 (Software Publishers)
SOC Classification: Accuracy and Coverage
Canaria assigns 6-digit Standard Occupational Classification (SOC) codes using both the job title and the full description text. Title-only classifiers commonly misclassify roles with generic titles such as "Manager" or "Analyst." Using description context brings 6-digit accuracy to 85-92% on postings with sufficient description length (more than 200 words). The taxonomy is aligned with the BLS 2018 SOC standard, enabling direct comparison with government labor statistics.
2-digit major group accuracy exceeds 95%, which is at or above the industry benchmark range of 93-97%. Coverage improves with data vintage as both source quality and model iterations improve.
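The accuracy hierarchy reflects SOC's nested structure: the 2-digit major group is simply the prefix of the 6-digit code, so aggregating to major groups recovers accuracy. A minimal helper (hypothetical name) illustrates the rollup:

```python
def soc_major_group(soc_code: str) -> str:
    """Collapse a 6-digit SOC code ('XX-XXXX') to its 2-digit major group.
    '15-1252' (Software Developers) -> '15' (Computer and Mathematical)."""
    return soc_code.split("-")[0]

assert soc_major_group("15-1252") == "15"
```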
Accuracy and Coverage Thresholds
| Metric | Production target | Acceptable | Investigate |
|---|---|---|---|
| 2-digit accuracy (major group) | >95% | >90% | <90% |
| 6-digit accuracy (descriptions >200 words) | 85-92% | >80% | <80% |
| Coverage (2023+) | >90% | >80% | <80% |
| Coverage (2020-2022) | 80-90% | >70% | <70% |
| Coverage (pre-2020) | >75% | >65% | <65% |
Known Classification Challenges
| Failure mode | Example |
|---|---|
| Ambiguous titles | "Manager", "Associate", "Specialist" without description context |
| New-economy roles | "Prompt Engineer" and "Growth Hacker" have no clean mapping in BLS taxonomy |
| Multi-function roles | "Software Engineer / Data Scientist" requires a primary SOC assignment |
Salary Prediction: Model and Coverage
Only about 2.5% of job postings across all sources include stated compensation. To address this gap, Canaria runs a regression model trained on 50M+ Glassdoor and Indeed salary observations to predict minimum, average, and maximum annual USD compensation for every record where State, Zip code, and SOC code are available. Predictions are annualized to USD; the original stated period (hourly, monthly, annual) is preserved in a separate field.
The dataset includes three complementary salary signals: posted ranges (where disclosed), model-predicted ranges (on all records meeting prerequisites), and employee-reported data (from Glassdoor, 10.5M+ individual reports). Posted ranges reflect what employers say they will pay; predicted ranges reflect market rates for the SOC, state, and zip combination; employee-reported data reflects what workers in those roles actually earned, including years-of-experience and bonus breakdowns not available from postings.
Prediction Accuracy
| Metric | Value | Context |
|---|---|---|
| MAPE (mean absolute percentage error) | <15% | Industry average is 15-25% |
| Within ±20% of actual | 65-80% | Industry benchmark range |
| Within ±30% of actual | 80-90% | Industry benchmark range |
| Coverage (2023+) | 85-95% | Requires valid State, Zip, SOC |
| Coverage (2020-2022) | 70-85% | |
| Coverage (pre-2020) | 50-70% | |
| Training observations | 50M+ | Glassdoor + Indeed salary disclosures |
Accuracy Degradation Factors
Prediction accuracy degrades in the following conditions. Researchers should apply appropriate filters or widen confidence intervals when working with these subsets:
- Very high compensation (above $300K/year)
- Hourly and gig-economy work
- International postings (model trained primarily on US data)
- Older data vintages (salary norms shift over time)
Salary invariants enforced on all records: salary_min ≤ salary_avg ≤ salary_max; all populated values are positive; all values are normalized to annual USD.
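A validation pass over these invariants might look like the following sketch; `check_salary_invariants` is a hypothetical name, not part of the delivered schema:

```python
def check_salary_invariants(salary_min, salary_avg, salary_max):
    """Validate annualized USD values: all positive and min <= avg <= max."""
    values = (salary_min, salary_avg, salary_max)
    if any(v is None for v in values):
        return False  # incomplete records are handled separately
    return all(v > 0 for v in values) and salary_min <= salary_avg <= salary_max

assert check_salary_invariants(95000, 110000, 125000)
assert not check_salary_invariants(125000, 110000, 95000)  # ordering violated
```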
Work Mode Detection: Remote, Hybrid, On-site
Work mode classification uses keyword detection combined with location analysis to determine whether a role is Remote, Hybrid, On-site, or Flexible. Researchers must account for a structural break in this field at 2020: before the pandemic, fewer than 1% of postings mentioned remote work, so NULL values in pre-2020 data are correct and expected. The field does not exist as a concept for that era.
As of 2023+, work mode is a standard field across major sources with greater than 85% coverage. A NULL in 2023+ data indicates a parsing gap rather than an implicitly on-site role. Remote postings peaked around 2022 at approximately 15% of all postings and have declined to approximately 10% as of 2024-2025. The term "hybrid" remains poorly standardized: it can mean one day per week or four days per week depending on the employer.
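When analyzing this field, NULL handling should therefore be era-aware. The sketch below encodes the interpretation rules just described; the function and its returned labels are illustrative assumptions, not delivered fields:

```python
from datetime import date

def interpret_work_mode(work_mode, posted: date):
    """Distinguish an expected NULL (pre-2020: the field did not exist)
    from a parsing gap (2023+: the field is standard)."""
    if work_mode is not None:
        return work_mode
    if posted.year < 2020:
        return "not_applicable"  # NULL is correct for this era
    if posted.year < 2023:
        return "ambiguous"       # coverage was only 40-70% in 2020-2022
    return "parsing_gap"         # 2023+ NULL means extraction failed

assert interpret_work_mode(None, date(2018, 6, 1)) == "not_applicable"
assert interpret_work_mode(None, date(2024, 6, 1)) == "parsing_gap"
assert interpret_work_mode("Hybrid", date(2024, 6, 1)) == "Hybrid"
```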
Coverage by Era
| Period | Coverage | Notes |
|---|---|---|
| Pre-2020 | NULL (correct) | Fewer than 1% of postings mentioned remote work. NULL is expected, not a gap. |
| 2020 | ~15-20% coverage | Remote mentions spiked but appeared only in description text, with no structured field yet. |
| 2021-2022 | 40-70% coverage | Structured remote/hybrid/on-site fields emerged as employers standardized. |
| 2023+ | 85%+ coverage | Work mode field is standard across major sources. NULL in this range indicates a parsing gap. |
| 2024-2025 | 85%+ coverage | "Hybrid" remains poorly defined: it can mean 1 day/week or 4 days/week depending on employer. Remote postings are declining from the 2022 peak (~15%) toward ~10%. |
Accuracy Metrics
| Metric | Value | Notes |
|---|---|---|
| Accuracy (2023+ postings) | 85%+ | State of the art: 92-97% for 2023+ data |
| Coverage (2023+) | 85%+ | |
| Coverage (2020-2022) | 40-70% | |
| Coverage (pre-2020) | <5% | NULL is correct, not a data gap |
Named Entity Recognition
A fast, dictionary-based keyword processor runs in-pipeline to extract structured entities from job descriptions. Extracted skills are filtered through a title-skill relevance model to remove spurious matches. Multi-valued attributes are represented as arrays; when no signal is present, fields return empty arrays rather than null.
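A minimal version of the dictionary-matching pass can be sketched as follows. The tiny skill dictionary and function name are invented for illustration, and the title-skill relevance model that removes spurious matches is omitted:

```python
import re

# Tiny illustrative dictionary; the production taxonomy holds 37,000+ terms.
SKILL_DICT = {"python": "Python", "aws": "AWS", "docker": "Docker",
              "problem solving": "Problem Solving"}

def extract_skills(description: str) -> list:
    """Dictionary-matching pass: find known skill terms on word boundaries.
    Returns an empty list (never None) when no signal is present."""
    text = description.lower()
    found = [canon for term, canon in SKILL_DICT.items()
             if re.search(r"\b" + re.escape(term) + r"\b", text)]
    return sorted(set(found))

assert extract_skills("Experience with Python and AWS required.") == ["AWS", "Python"]
assert extract_skills("No technical content here.") == []
```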
Technical Skills
Programming languages, tools, frameworks
Soft Skills
Leadership, communication, problem-solving
Certifications
Professional licenses and credentials (PMP, AWS, CPA)
Qualifications
Education requirements, years of experience
Benefits
Health insurance, 401(k), PTO mentions
Contact Signals
Emails and phone numbers extracted from posting text
Work Requirements
Visa sponsorship indicators, citizenship or residency requirements, security clearance
Work Conditions
Travel requirements, shift work and shift type mentions, language requirements
Role Characteristics
Urgent hiring cues, manager or lead indicators, number of openings, team size, start date
Extraction Accuracy (F1 Scores)
| Context | F1 Score |
|---|---|
| Bulleted / structured sections | 85-92% |
| Narrative / prose text | 65-78% |
Skills Coverage by Description Length
| Description Length | Skills Coverage |
|---|---|
| <50 words | 47.5% |
| 50–199 words | ~80% |
| 200–499 words | >85% |
| 500+ words | 99.5% |
Skills Extraction QA Thresholds
| Metric | Production target | Acceptable | Investigate |
|---|---|---|---|
| Skills coverage (description >200 chars) | >85% | >75% | <75% |
| Average skills per posting (2023+) | 5-15 | 3-20 | <3 or >25 |
| Taxonomy match rate | >90% | >85% | <85% |
Skills Taxonomy: Context and Scale
Canaria's 37,000+ skills taxonomy is built from O*NET mappings, public sources, and empirical signals derived from live job postings. It expands continuously as new skills emerge in the data. The table below compares major standards for context.
| Standard | Scale | Source | Notes |
|---|---|---|---|
| O*NET | ~35 broad categories | US Dept. of Labor | Free, public domain. Too coarse for most applications. |
| ESCO | 13,890 skills | European Commission | Multilingual (27 EU languages). Free, open. |
| Canaria | 37,000+ skills, 3,000+ certs, 400+ soft skills | Proprietary | Built from O*NET plus public taxonomies plus empirical signals. Captures emerging skills from live postings. |
Semantic Deduplication
A central challenge in online job postings data is severe duplication, driven by reposting behavior, aggregator syndication, and minor text changes over time. Canaria addresses this through a configurable, multi-signal deduplication framework that resolves repeated observations of the same hiring intent into a single canonical job entity. Across all sources combined, the average deduplication rate is 40-60%.
Vector Similarity
Captures semantic similarity between descriptions even when text is slightly altered.
MinHash / Jaccard
Handles variations in company names like "Macy's," "Macys Inc," or "Macy's LLC."
Title Similarity
Normalizes title variants such as "Junior Mechanical Engineer" and "Mechanical Eng I."
Geo-location Clustering
Groups postings within an adjustable radius (e.g. 10–50 miles).
Configurable Posting Window
Defines whether two postings within a given time range (e.g. 1–6 months) represent the same job.
Graph-based Transitive Matching
Captures transitive similarity (A ≈ B, B ≈ C → unify A, B, C) and supports custom deduplication policies per client.
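Transitive matching of this kind is typically implemented with a union-find (disjoint-set) structure: pairwise matches from the similarity signals above are unioned, and each connected component becomes one canonical job entity. A minimal sketch with invented job IDs:

```python
class UnionFind:
    """Minimal union-find to unify transitively similar postings."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
# Pairwise matches from the similarity signals: A ≈ B and B ≈ C.
uf.union("job_A", "job_B")
uf.union("job_B", "job_C")
# Transitivity unifies all three into one canonical cluster.
assert uf.find("job_A") == uf.find("job_C")
```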
Deduplication Benchmarks
| Source / Metric | Dedup Rate |
|---|---|
| Overall (multi-source) | 40-60% |
| ATS (internal) | <2% |
| Google Jobs / Aggregators | 60-70% |
| Indeed (internal) | ~15% |
| LinkedIn (internal) | <10% |
| Duplicate jobId in delivery | 0% (enforced) |
Location Parsing
Raw location strings from job postings are parsed into structured geographic components and geocoded to latitude/longitude. Country-level parsing exceeds 98% accuracy; city-level accuracy is 85-93% due to ambiguous city names, metro area strings, and multi-location postings. The is_remote_global flag identifies postings where location is not a physical place.
| Level | Accuracy | Notes |
|---|---|---|
| Country | >98% | Rarely ambiguous |
| State / Province | 92-97% | Fails on ambiguous city names and international postings |
| City | 85-93% | Metro area strings and multi-location postings reduce accuracy |
| Zip / Postal code | 70-85% | Often inferred rather than directly stated |
| Lat / Lng (geocoded) | 80-90% within 25 miles | Dependent on city-level accuracy |
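The "within 25 miles" geocoding metric implies a great-circle distance check between predicted and ground-truth coordinates. A standard haversine sketch (the QA harness itself is not part of this document):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lng points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(a))  # 3958.8 = Earth radius in miles

# QA check in the spirit of the table: geocode within 25 miles of truth.
dist = haversine_miles(30.2672, -97.7431, 30.5083, -97.6789)  # Austin vs. Round Rock
assert dist < 25
```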
Data Quality by Source
Not all job posting sources are created equal. Data completeness, duplication rates, and field availability vary significantly by source type. ATS direct feeds from employer career portals provide the highest quality data, while aggregator sources like Google Jobs carry substantially higher duplication and null rates.
| Metric | ATS Direct | LinkedIn | Indeed | Glassdoor | Google Jobs |
|---|---|---|---|---|---|
| Internal dedup rate | <2% | <10% | <15% | <15% | <30% |
| Description null | <3% | <5% | <5% | <10% | <15% |
| Location null | <2% | <3% | <5% | <5% | <10% |
| Date posted null | <5% | <5% | <5% | <10% | <15% |
| Salary stated (2023+) | <10% | 15-30% | 40-60% | 20-40% | varies |
| Seniority available | <5% | 60-80% | <10% | <10% | <10% |
| Data quality | Highest | High | Good | Good | Lowest |
Coverage by Data Vintage
Null rates vary by time period due to schema evolution, source availability changes, and the rollout of salary transparency laws. Researchers should account for data vintage when interpreting coverage.
| Field | Pre-2020 | 2020-2022 | 2023+ |
|---|---|---|---|
| Salary (stated) | 5-15% | 15-30% | 40-60% |
| Salary (predicted) | 50-70% | 70-85% | 85-95% |
| Work mode (remote/hybrid/onsite) | <5% | 40-70% | 85%+ |
| Seniority | 60-80% | 75-85% | 85-95% |
| SOC code | 70-85% | 80-90% | 85-95% |
| Skills | 50-70% | 70-85% | 80-93% |
| Location (parsed) | 85-95% | 92-97% | 95%+ |
| Company name | 97%+ | 98%+ | 99%+ |
Salary Transparency Law Timeline
US salary disclosure laws are the primary driver of improving salary coverage in job posting data. Each new law causes a measurable increase in the share of postings that include compensation information.
| Year | Law | Impact |
|---|---|---|
| 2021 | Colorado | First US state; salary field begins populating |
| 2022 | NYC (Nov) | Salary null rates drop 15-25pp in NYC |
| 2023 | CA, WA, NY State | ~25% of US workforce covered. Major salary coverage jump. |
| 2024 | Hawaii | Continued improvement |
| 2025 | Illinois, Minnesota | Further expansion. NJ, MA, VT pending. |
| 2025+ | EU Pay Transparency Directive | European salary data improving |
Company Enrichment
Canaria maintains a company database of 28.5M companies, enriching each job posting with industry, employee count, headquarters location, revenue, and founding year. Company entity resolution uses a hybrid approach combining fuzzy name matching, domain/URL matching, LinkedIn company ID joins, and ML-based resolution to achieve 90-95% match accuracy while keeping false merge rates below 1%.
Company Matching Methods
| Method | Accuracy | Notes |
|---|---|---|
| Exact string match | 50-60% | Baseline, insufficient alone |
| Fuzzy matching (Levenshtein, Jaro-Winkler) | 70-80% | |
| Token similarity (TF-IDF, Jaccard) | 75-85% | |
| Domain / URL matching | 85-90% | |
| LinkedIn company ID join | 90-95% | Highest precision signal |
| ML entity resolution | 85-92% | |
| Hybrid (fuzzy + URL + ML) | 90-95% | Current production approach |
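As a taste of the fuzzy-matching layer, the sketch below strips legal suffixes and compares names with a character-level similarity from the Python standard library. It is a simplified stand-in: production combines Levenshtein/Jaro-Winkler distances with URL and ML signals, and the helper names here are invented:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Strip punctuation and common legal suffixes before comparison."""
    name = name.lower().replace("'", "").replace(",", "").replace(".", "")
    for suffix in (" inc", " llc", " ltd", " corp"):
        name = name.removesuffix(suffix)
    return name.strip()

def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] on normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# The "Macy's" variants from the MinHash/Jaccard example collapse together.
assert name_similarity("Macy's", "Macys Inc") > 0.9
assert name_similarity("Macy's", "Nordstrom") < 0.5
```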
Quality Metrics
| Metric | Value |
|---|---|
| Companies in database | 28.5M |
| Match rate (job postings to canonical company) | >90% |
| False merge rate (distinct companies incorrectly merged) | <1% |
| False split rate (same company as separate entities) | <5% |
| Orphan companies (no canonical match) | <10% |
Data Quality Monitoring
Canaria maintains comprehensive monitoring and quality assurance systems across all ingestion and processing stages. A pipeline component tracks field-level completeness for every scraped batch. Non-empty field counts are aggregated by source and time period, enabling detection of source degradation or scraping failures. Completeness metrics inform prioritization of enrichment efforts. Source URLs are monitored for consecutive failure streaks and flagged for manual review when potentially defunct. Monitoring dashboards visualize trends and trigger alerts for anomalies including queue depth spikes, error rate thresholds, and crawler failures.
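Field-level completeness tracking reduces to counting non-NULL values per field per batch. A minimal sketch (function name hypothetical; missing keys count as NULL):

```python
def field_completeness(batch: list) -> dict:
    """Share of non-NULL values per field across a scraped batch.
    Fields absent from a record count as NULL for that record."""
    if not batch:
        return {}
    fields = set().union(*(r.keys() for r in batch))
    return {f: sum(r.get(f) is not None for r in batch) / len(batch)
            for f in sorted(fields)}

batch = [
    {"title": "SW Eng II", "salary_min": 95000, "work_mode": None},
    {"title": "Data Analyst", "salary_min": None, "work_mode": "Hybrid"},
]
stats = field_completeness(batch)
assert stats["title"] == 1.0
assert stats["salary_min"] == 0.5
```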
Research Applications
The Canaria dataset enables empirical analysis across multiple labor economics domains.
Wage Dynamics & Compensation
- Geographic wage differentials (metro vs rural, coastal vs inland)
- Salary transparency effects (pre/post state disclosure laws)
- Compensation trends by occupation, industry, and company size
Labor Demand & Skill Requirements
- Skill demand evolution over time (e.g., AI/ML skill growth)
- Education-occupation mismatch (degree requirements vs SOC norms)
- Occupational mobility pathways via skill overlap analysis
Remote Work & Geographic Flexibility
- Remote work adoption rates by occupation and industry
- Remote work wage premia and discounts
- Geographic concentration trends and migration patterns
Firm-level Hiring Behavior
- Posting persistence and time-to-fill estimation
- Hiring velocity as a leading economic indicator
- Expansion signals via new-role detection and headcount proxies