Job Market Data Glossary

Definitions of key terms and standards used in labor market data.

Alternative Data

Non-traditional datasets used by investment firms, hedge funds, and research organizations to gain insights beyond conventional financial data. Job postings data is a leading category of alternative data because hiring patterns serve as a real-time signal for company growth, industry shifts, and macroeconomic trends. Alternative data buyers typically include quant funds, market research firms, and corporate strategy teams.

ATS (Applicant Tracking System)

Applicant Tracking System: software used by employers to manage their recruiting workflow. Major platforms include Greenhouse, Lever, Workday, and iCIMS. Canaria collects job postings directly from 200,000+ employer ATS career portals, the highest-quality data source with <2% internal duplication, <3% description null rate, and employer-verified content.

Related: Job Postings (ATS), Data Schema

CBSA

A Core-Based Statistical Area (CBSA) is a geographic designation defined by the US Office of Management and Budget that identifies metropolitan and micropolitan statistical areas based on commuting patterns. CBSA codes (e.g., 35620 for the New York metro area) enable standardized geographic analysis of labor markets. Canaria geocodes job postings and assigns CBSA codes to support metropolitan-area-level workforce analytics.

Related: Data Schema, Methodology

CCPA

The California Consumer Privacy Act is a state-level privacy law granting California residents rights over their personal data, including the right to know what data is collected, the right to delete it, and the right to opt out of its sale. Canaria is fully CCPA compliant and does not collect or process personally identifiable information in its job postings datasets.

Related: Privacy Policy, GDPR

Company Master Database

Canaria's reference database of 28.5M canonical company records. It enriches job postings with industry, employee count, headquarters location, revenue, and founding year, achieving a >90% match rate from job postings to canonical company records with a <1% false merge rate.

Related: Map & Business Data, Methodology

Data Vintage

The time period when data was originally collected. Data vintage is critical for interpreting null rates and coverage in job posting datasets because schema evolution, source availability changes, and regulatory shifts (such as salary transparency laws) cause field coverage to vary significantly across time periods. For example, salary coverage in Canaria's data jumps from 5-15% pre-2020 to 40-60% for 2023+ data due to state disclosure laws.

Related: Methodology, Data Schema

Deduplication

The process of identifying and removing duplicate records from a dataset. In job market data, the same position is often posted across multiple sources (Indeed, LinkedIn, the employer's own career page), creating redundant records. Effective deduplication is essential for accurate labor market analytics. Canaria uses a multi-stage deduplication pipeline combining exact matching, MinHash/Jaccard similarity, vector embeddings, and graph-based transitive matching to reduce 1B+ ingested postings to 900M+ unique records.
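
The graph-based transitive matching step can be illustrated with a union-find structure: if record A matches B and B matches C, all three land in one cluster even when A and C were never compared directly. This is a minimal sketch, not Canaria's pipeline; the function name `cluster_duplicates` and the example pairs are hypothetical.

```python
from collections import defaultdict


def cluster_duplicates(n_records, matched_pairs):
    """Group record indices into duplicate clusters using union-find.

    matched_pairs holds (i, j) index pairs that an earlier stage
    (exact match, MinHash, embeddings) flagged as duplicates.
    """
    parent = list(range(n_records))

    def find(i):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for i in range(n_records):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


# Hypothetical input: 0 matches 1 and 1 matches 2, so 0, 1, 2 form one cluster.
clusters = cluster_duplicates(5, [(0, 1), (1, 2)])
print(clusters)
```

Each resulting cluster would then be collapsed to a single canonical record.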

Employment Type

A classification field that categorizes jobs by their work arrangement: full-time, part-time, contract, temporary, or internship. Accurate employment type classification is important for labor market analysis because it affects compensation benchmarking, workforce planning, and regulatory compliance. Canaria extracts and normalizes employment type from both structured fields and unstructured job descriptions using NLP.

Related: Data Schema, Raw vs Enriched

Entity Resolution

The process of determining that different records refer to the same real-world entity. In job market data, entity resolution operates at multiple levels: job-level (the same posting on Indeed and LinkedIn), company-level ("AWS" = "Amazon Web Services" = "Amazon.com Services LLC"), and title-level ("SW Eng II" = "Software Engineer 2"). Canaria uses multi-signal entity resolution combining vector similarity, MinHash, geo-clustering, and graph-based transitive matching.

F1 Score

The harmonic mean of precision and recall, used to measure extraction accuracy. An F1 score balances the trade-off between finding all relevant items (recall) and avoiding false positives (precision). A score of 1.0 (100%) is perfect. Canaria's NER pipeline achieves F1 scores of 85-92% on bulleted/structured job description sections and 65-78% on narrative prose text, where entity boundaries are more ambiguous.
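
As an illustration, precision, recall, and F1 can be computed from a set of extracted items and a set of expected items. The sketch below uses made-up skill sets; the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(predicted, expected):
    """Return (precision, recall, F1) for two collections of extracted items."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: 2 of 4 predictions are correct, 2 of 3 expected skills are found.
predicted = {"python", "sql", "excel", "leadership"}
expected = {"python", "sql", "tableau"}
precision, recall, f1 = precision_recall_f1(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```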

Related: Methodology, NER

Job Entity Resolution

The process of resolving multiple observations of the same real-world job opening into a single canonical record. A single position may appear on Indeed, LinkedIn, and the employer's career page with slightly different titles, descriptions, or formatting. Entity resolution uses a combination of deterministic rules (same URL, same employer + title + location) and probabilistic matching (text similarity, embedding distance) to link these observations. This is a prerequisite for accurate job counts and trend analysis.
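
The combination of deterministic rules and probabilistic matching might look like the sketch below. It assumes postings are dictionaries with hypothetical keys (`url`, `employer`, `title`, `location`, `description`), substitutes token-level Jaccard similarity for the text-similarity signal, and uses an illustrative 0.8 threshold. Restricting the fuzzy comparison to postings from the same employer is also an assumption made here to keep the example conservative.

```python
import re


def _norm(text):
    """Lowercase and collapse whitespace so trivial formatting differences vanish."""
    return re.sub(r"\s+", " ", text.strip().lower())


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _key(posting):
    return (_norm(posting["employer"]), _norm(posting["title"]), _norm(posting["location"]))


def same_job(a, b, threshold=0.8):
    """Decide whether two postings describe the same opening."""
    # Deterministic rules: same URL, or same employer + title + location.
    if a["url"] == b["url"] or _key(a) == _key(b):
        return True
    # Probabilistic fallback (assumption: only compare within one employer).
    if _norm(a["employer"]) != _norm(b["employer"]):
        return False
    ta, tb = _tokens(a["description"]), _tokens(b["description"])
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


career_page = {"url": "https://example.com/jobs/1", "employer": "Acme Corp",
               "title": "Software Engineer II", "location": "Austin, TX",
               "description": "Build and maintain backend services in Python"}
aggregator = {"url": "https://jobs.example.org/acme/42", "employer": "acme corp",
              "title": "SWE 2", "location": "Austin, TX",
              "description": "Build and maintain backend services in Python and Go"}
print(same_job(career_page, aggregator))
```

In this example the titles differ, so the deterministic key does not match, and the decision falls through to the description similarity.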

MAPE

Mean Absolute Percentage Error is a standard metric for measuring prediction accuracy, expressed as a percentage. It calculates the average of absolute percentage differences between predicted and actual values. Lower MAPE indicates higher accuracy. Canaria's salary prediction model achieves a MAPE of <15%, trained on 50M+ salary observations from Glassdoor, Indeed, and employer-disclosed compensation data.
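
A minimal sketch of the MAPE calculation, using made-up salary figures. The function name `mape` is hypothetical, and the check for zero actual values reflects that the metric is undefined in that case.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned as a percentage."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)


# Illustrative data: errors of 10%, 10%, and 0% average to about 6.7%.
actual_salaries = [100_000, 80_000, 120_000]
predicted_salaries = [90_000, 88_000, 120_000]
score = mape(actual_salaries, predicted_salaries)
print(f"MAPE: {score:.1f}%")
```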

Related: Methodology, Salary Prediction

MinHash

A locality-sensitive hashing technique used to efficiently estimate the Jaccard similarity between two sets. In job market data, MinHash enables rapid approximate comparison of job descriptions at scale, identifying near-duplicate postings without performing expensive pairwise text comparisons. Canaria uses MinHash as part of its multi-stage deduplication pipeline alongside vector embeddings and graph-based transitive matching.
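
The idea can be sketched in a few lines of standard-library Python. Each description is turned into a set of word shingles, and the signature keeps the minimum hash value under each of several salted hash functions. The fraction of signature positions that agree estimates the Jaccard similarity. The shingle size, the number of hashes, and the use of salted BLAKE2b are illustrative choices, not a description of Canaria's implementation.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word shingles from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=128):
    """One minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = "Senior software engineer to build data pipelines in Python and SQL for our analytics team"
b = "Senior software engineer to build data pipelines in Python and SQL for our platform team"
sig_a, sig_b = minhash_signature(shingles(a)), minhash_signature(shingles(b))
print(estimated_jaccard(sig_a, sig_b))
```

At scale, signatures are typically bucketed with locality-sensitive hashing so that only postings sharing a bucket are compared.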

NLP Enrichment Pipeline

Canaria's suite of proprietary ML models for job data enrichment. The enrichment pipeline is a microservice-based NLP system where each model operates independently, allowing separate scaling and iteration. It includes title normalization (>90% mapped), SOC classification (85-92% at 6-digit), seniority detection (100% coverage), salary prediction (MAPE <15%), remote work status (>85% for 2023+), skills extraction (37,000+ skills), and NAICS prediction (in progress).

Related: Methodology, NLP, NER

NAICS

The North American Industry Classification System is the standard framework used by the United States, Canada, and Mexico to classify business establishments by their primary economic activity. NAICS codes are hierarchical, ranging from 2-digit sector codes (e.g., 51 for Information) to 6-digit industry codes (e.g., 511210 for Software Publishers in the 2017 revision, renumbered 513210 in 2022). Canaria's NAICS prediction model (currently in progress) will map employer records to NAICS codes to enable industry-level labor market analysis.
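
Because the hierarchy is encoded in code prefixes, rolling a 6-digit code up to coarser levels is a matter of truncation. The sketch below is illustrative; the function name `naics_ancestors` is hypothetical, and it returns codes only, not official titles.

```python
# Standard NAICS level names keyed by code length.
NAICS_LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "NAICS industry",
    6: "national industry",
}


def naics_ancestors(code):
    """Return (level name, code prefix) pairs from sector down to the given code.

    Note: a few sectors span a range of 2-digit codes (e.g., 31-33 for
    Manufacturing), so a sector label lookup needs to handle those ranges.
    """
    if not (code.isdigit() and len(code) in NAICS_LEVELS):
        raise ValueError(f"expected a 2- to 6-digit NAICS code, got {code!r}")
    return [(NAICS_LEVELS[n], code[:n]) for n in sorted(NAICS_LEVELS) if n <= len(code)]


for level, prefix in naics_ancestors("511210"):
    print(f"{level}: {prefix}")
```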

NER

Named Entity Recognition is an NLP technique that identifies and classifies named entities in unstructured text into predefined categories such as skills, certifications, benefits, and work requirements. Canaria's NER pipeline uses dictionary matching plus NLP relevance filtering and achieves F1 scores of 85-92% on bulleted/structured sections and 65-78% on narrative text. Skills coverage ranges from 47.5% for descriptions under 50 words to 99.5% for descriptions of 500+ words.

Related: Methodology, NLP, F1 Score

NLP

Natural Language Processing is a branch of artificial intelligence focused on enabling computers to understand, interpret, and generate human language. In job market data, NLP powers enrichment tasks such as skills extraction, salary prediction, seniority classification, SOC code assignment, and work mode detection. Canaria applies NLP models across its entire pipeline to transform raw job postings into research-grade structured data with 82 enriched fields.

Related: Methodology, Raw vs Enriched, NER

O*NET

The Occupational Information Network is the United States' primary source of occupational information, maintained by the Department of Labor. O*NET provides detailed descriptions of over 1,000 occupations including required skills, knowledge, abilities, and typical activities. O*NET-SOC codes extend the standard SOC taxonomy with finer-grained occupation classifications. Canaria's SOC classification leverages O*NET occupation definitions to improve mapping accuracy.

Related: SOC Code, Skills Taxonomy

Salary Prediction

A machine learning model that estimates compensation for job postings that do not disclose salary information. Canaria's salary prediction model is trained on 50M+ salary observations from sources including Glassdoor, Indeed, and employer-disclosed pay data. The model requires a valid US state, ZIP code, and SOC code as inputs and achieves a Mean Absolute Percentage Error (MAPE) of <15%. Coverage ranges from 50-70% for pre-2020 data to 85-95% for 2023+ data. The model returns -1 when any of these prerequisites is missing.
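
Because -1 is a sentinel rather than a real estimate, it should be filtered out before any aggregation. The sketch below assumes records are dictionaries with a hypothetical `predicted_salary` field; the actual field name in the schema may differ.

```python
from statistics import median

MISSING_SALARY = -1  # sentinel for "prerequisites missing", per the definition above


def valid_predictions(records, field="predicted_salary"):
    """Return only real salary estimates, dropping the -1 sentinel and nulls."""
    return [
        r[field]
        for r in records
        if r.get(field) is not None and r[field] != MISSING_SALARY
    ]


# Illustrative records; the field name is an assumption.
records = [
    {"predicted_salary": 95_000},
    {"predicted_salary": -1},      # no prediction available
    {"predicted_salary": 105_000},
]
usable = valid_predictions(records)
print(median(usable))
```

Leaving the sentinel in place would pull averages and medians toward zero.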

Related: Methodology, MAPE, Data Schema

Seniority Classification

The process of categorizing job postings by experience level: Entry, Mid, Senior, Lead, Manager, Director, or Executive. Accurate seniority classification is critical for compensation benchmarking and talent market analysis. Canaria's seniority model achieves 100% coverage, always returning a classification for every record, using both title patterns and description context.

Related: Data Schema, Raw vs Enriched

Skills Taxonomy

A structured, hierarchical classification of professional competencies extracted from job postings. Canaria maintains a taxonomy of 37,000+ skills including 3,000+ professional certifications and 400+ soft skills. Skills are extracted using a two-step process: dictionary matching identifies candidate skills, then an NLP relevance model filters spurious matches. Average skills per posting (2023+): 5-15. Taxonomy match rate: >90%.
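
The two-step process can be sketched as follows. The tiny dictionary, the list of ambiguous terms, and the cue-word heuristic are all hypothetical. In particular, `is_relevant` is a simple rule-based stand-in for the NLP relevance model, which is not described in enough detail here to reproduce.

```python
import re

SKILL_DICTIONARY = {"python", "sql", "excel", "go"}  # hypothetical, tiny
AMBIGUOUS = {"go", "excel"}                          # terms that are also common words
CUE_WORDS = {"experience", "proficient", "proficiency", "skills", "knowledge"}


def candidate_skills(text):
    """Step 1: dictionary matching on word boundaries, case-insensitive."""
    found = []
    for skill in sorted(SKILL_DICTIONARY):
        pattern = rf"\b{re.escape(skill)}\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((skill, match.start()))
    return found


def is_relevant(skill, text, position, window=40):
    """Step 2 stand-in: keep ambiguous terms only near a skill-related cue word."""
    if skill not in AMBIGUOUS:
        return True
    context = text[max(0, position - window): position + len(skill) + window].lower()
    return any(cue in context for cue in CUE_WORDS)


def extract_skills(text):
    return sorted({s for s, pos in candidate_skills(text) if is_relevant(s, text, pos)})


posting = "We go the extra mile for customers. Requirements: 3+ years of Python and SQL."
print(extract_skills(posting))
```

Here the verb "go" is matched by the dictionary in step 1 but dropped in step 2, which is the kind of spurious match the relevance filter exists to remove.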

SOC Code

The Standard Occupational Classification is a 6-digit coding system used by US federal agencies to classify workers into occupational categories. SOC codes are hierarchical: the first two digits identify the major group (e.g., 15 for Computer and Mathematical), the third digit generally identifies the minor group, the fourth and fifth digits the broad occupation, and the sixth digit the detailed occupation. Canaria assigns SOC codes using both the job title and the full description context, achieving >95% accuracy at 2-digit level and 85-92% at 6-digit level. Codes are aligned with the BLS 2018 SOC.
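
The hierarchy makes it straightforward to roll a detailed code up to coarser levels for aggregation. The sketch below assumes codes are formatted as `NN-NNNN` and derives only the major group and broad occupation. Minor groups are left out because a few of them do not follow a single digit rule in the 2018 SOC. The function name `soc_rollup` is hypothetical.

```python
import re


def soc_rollup(code):
    """Derive coarser SOC levels from a detailed code such as '15-1252'."""
    match = re.fullmatch(r"(\d{2})-(\d{4})", code)
    if not match:
        raise ValueError(f"expected a code like '15-1252', got {code!r}")
    major, rest = match.groups()
    return {
        "major_group": f"{major}-0000",            # major groups end in 0000
        "broad_occupation": f"{major}-{rest[:3]}0",  # broad occupations end in 0
        "detailed_occupation": code,
    }


# 15-1252 is Software Developers in the 2018 SOC.
levels = soc_rollup("15-1252")
print(levels)
```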

Related: Data Schema, Methodology, O*NET

Semantic Deduplication

A deduplication approach that uses meaning-based similarity rather than exact string matching to identify duplicate records. Two job postings may have different wording but describe the same position. Semantic deduplication uses vector embeddings to represent job descriptions as high-dimensional points, then identifies clusters of near-identical postings using cosine similarity thresholds (configurable 0.9-0.95). Canaria combines semantic deduplication with MinHash and graph-based transitive matching to achieve 40-60% dedup rate across multi-source data.
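
The thresholding step can be illustrated with plain Python. The three-dimensional vectors below are toy stand-ins for real embeddings, the 0.92 threshold is one value from the configurable range above, and the pairwise loop is only practical for small inputs. A production system would more likely use an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicate_pairs(embeddings, threshold=0.92):
    """Return index pairs whose cosine similarity meets the threshold (brute force)."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy vectors: the first two point in nearly the same direction, the third does not.
embeddings = [[1.0, 0.0, 0.0], [0.98, 0.1, 0.0], [0.0, 1.0, 0.0]]
pairs = near_duplicate_pairs(embeddings)
print(pairs)
```

The resulting pairs are the kind of input a graph-based transitive matching step would then merge into clusters.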

Title Normalization

The process of standardizing job titles to canonical forms to enable consistent analysis. Employers use widely varying titles for equivalent roles (e.g., "Software Engineer II", "SWE 2", "Software Developer - Mid Level" may all describe the same position). Canaria's title normalization model maps >90% of titles to standardized forms with confidence scores (0-1), enabling accurate trend analysis, compensation benchmarking, and cross-employer comparisons.

Work Mode

A classification field indicating whether a position is Remote, Hybrid, On-site, or Flexible. Work mode has become one of the most important job attributes since 2020, driving both candidate preferences and employer strategy. Canaria classifies work mode using NLP applied to job descriptions, achieving >85% accuracy for 2023+ data. Coverage varies significantly by vintage: <5% pre-2020, 40-70% for 2020-2022, and 85%+ for 2023+.
