Canaria

Job Market Data Glossary

Definitions of key terms and standards used in labor market data.

Alternative Data

Non-traditional datasets used by investment firms, hedge funds, and research organizations to gain insights beyond conventional financial data. Job postings data is a leading category of alternative data because hiring patterns serve as a real-time signal for company growth, industry shifts, and macroeconomic trends. Alternative data buyers typically include quant funds, market research firms, and corporate strategy teams.

ATS (Applicant Tracking System)

Applicant Tracking System: software used by employers to manage their recruiting workflow. Major platforms include Greenhouse, Lever, Workday, and iCIMS. Canaria collects job postings directly from 200,000+ employer ATS career portals, the highest-quality data source with <2% internal duplication, <3% description null rate, and employer-verified content.

CBSA

A Core-Based Statistical Area is a geographic unit defined by the US Office of Management and Budget: a metropolitan or micropolitan statistical area built around an urban core and the adjacent counties tied to it by commuting patterns. Each area has a five-digit CBSA code (e.g., 35620 for the New York metro area), which enables standardized geographic analysis of labor markets. Canaria geocodes job postings and assigns CBSA codes to support metropolitan-area-level workforce analytics.

CCPA

The California Consumer Privacy Act is a state-level privacy law granting California residents rights over their personal data, including the right to know what data is collected, the right to delete it, and the right to opt out of its sale. Canaria is fully CCPA compliant and does not collect or process personally identifiable information in its job postings datasets.

CDP (Customer Data Platform)

Customer Data Platform: Canaria's company master database containing 28.5M companies. The CDP enriches job postings with industry, employee count, headquarters location, revenue, and founding year. It achieves >90% match rate from job postings to canonical company records with <1% false merge rate.

Data Vintage

The time period when data was originally collected. Data vintage is critical for interpreting null rates and coverage in job posting datasets because schema evolution, source availability changes, and regulatory shifts (such as salary transparency laws) cause field coverage to vary significantly across time periods. For example, salary coverage in Canaria's data jumps from 5-15% pre-2020 to 40-60% for 2023+ data due to state disclosure laws.
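
The vintage effect described above can be checked by grouping records by collection year and computing the share of non-null values per field. The following is a minimal sketch; the field names `collected_year` and `salary` are illustrative assumptions, not Canaria's actual schema.

```python
from collections import defaultdict

def coverage_by_vintage(records, field, year_key="collected_year"):
    """Return {year: share of records where `field` is not None}."""
    totals = defaultdict(int)
    filled = defaultdict(int)
    for rec in records:
        year = rec[year_key]
        totals[year] += 1
        if rec.get(field) is not None:
            filled[year] += 1
    return {year: filled[year] / totals[year] for year in sorted(totals)}

# Toy records; field names and values are hypothetical.
sample = [
    {"collected_year": 2019, "salary": None},
    {"collected_year": 2019, "salary": None},
    {"collected_year": 2019, "salary": 52000},
    {"collected_year": 2019, "salary": None},
    {"collected_year": 2023, "salary": 88000},
    {"collected_year": 2023, "salary": None},
]
coverage = coverage_by_vintage(sample, "salary")
```

Reporting coverage per vintage rather than as one overall figure keeps older, sparser data from masking the coverage of recent records.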

Deduplication

The process of identifying and removing duplicate records from a dataset. In job market data, the same position is often posted across multiple sources (Indeed, LinkedIn, the employer's own career page), creating redundant records. Effective deduplication is essential for accurate labor market analytics. Canaria uses a multi-stage deduplication pipeline combining exact matching, MinHash/Jaccard similarity, vector embeddings, and graph-based transitive matching to reduce 1B+ ingested postings to 900M+ unique records.

Employment Type

A classification field that categorizes jobs by their work arrangement: full-time, part-time, contract, temporary, or internship. Accurate employment type classification is important for labor market analysis because it affects compensation benchmarking, workforce planning, and regulatory compliance. Canaria extracts and normalizes employment type from both structured fields and unstructured job descriptions using NLP.

Entity Resolution

The process of determining that different records refer to the same real-world entity. In job market data, entity resolution operates at multiple levels: job-level (the same posting on Indeed and LinkedIn), company-level ("AWS" = "Amazon Web Services" = "Amazon.com Services LLC"), and title-level ("SW Eng II" = "Software Engineer 2"). Canaria uses multi-signal entity resolution combining vector similarity, MinHash, geo-clustering, and graph-based transitive matching.
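
Graph-based transitive matching can be illustrated with a union-find structure: if record A matches B and B matches C, all three land in one cluster even though A and C were never compared directly. This is a minimal sketch of the general technique, not Canaria's implementation; the function name `cluster_matches` is hypothetical.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group records into clusters, applying pairwise matches transitively."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())

ids = ["AWS", "Amazon Web Services", "Amazon.com Services LLC", "Acme Corp"]
pairs = [
    ("AWS", "Amazon Web Services"),
    ("Amazon Web Services", "Amazon.com Services LLC"),
]
clusters = cluster_matches(ids, pairs)
```

Transitive closure also propagates mistakes: one false match can merge two otherwise unrelated clusters, so the pairwise matching step needs to be conservative.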

F1 Score

The harmonic mean of precision and recall, used to measure extraction accuracy. An F1 score balances the trade-off between finding all relevant items (recall) and avoiding false positives (precision). A score of 1.0 (100%) is perfect. Canaria's NER pipeline achieves F1 scores of 85-92% on bulleted/structured job description sections and 65-78% on narrative prose text, where entity boundaries are more ambiguous.
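
As a worked illustration, F1 can be computed by comparing a set of extracted items against a hand-labeled reference set. The skill names below are toy data, not an actual evaluation set.

```python
def f1_score(predicted, expected):
    """Harmonic mean of precision and recall over two sets of items."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(expected)      # share of expected items that were found
    return 2 * precision * recall / (precision + recall)

expected_skills = {"python", "sql", "aws", "docker"}
predicted_skills = {"python", "sql", "aws", "excel"}
score = f1_score(predicted_skills, expected_skills)
```

Here three of four predictions are correct and three of four expected skills are found, so precision, recall, and F1 are all 0.75.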

GDPR

The General Data Protection Regulation is the European Union's comprehensive privacy framework governing how organizations collect, process, and store personal data of EU residents. It mandates lawful basis for processing, data minimization, and grants individuals rights including access, rectification, and erasure. Canaria is fully GDPR compliant and processes only publicly available job posting data without personally identifiable information.

Job Entity Resolution

The process of resolving multiple observations of the same real-world job opening into a single canonical record. A single position may appear on Indeed, LinkedIn, and the employer's career page with slightly different titles, descriptions, or formatting. Entity resolution uses a combination of deterministic rules (same URL, same employer + title + location) and probabilistic matching (text similarity, embedding distance) to link these observations. This is a prerequisite for accurate job counts and trend analysis.
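
The deterministic rules mentioned above (same URL, or same employer + title + location) might look like the following sketch. The record fields and the normalization steps are assumptions for illustration; a production pipeline would likely normalize more aggressively.

```python
import re

def match_key(employer, title, location):
    """Build a normalized employer + title + location key."""
    def norm(text):
        # Lowercase, replace punctuation with spaces, collapse whitespace.
        text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
        return " ".join(text.split())
    return (norm(employer), norm(title), norm(location))

def same_opening(a, b):
    """Deterministic rule: same URL, or same employer + title + location."""
    if a.get("url") and a.get("url") == b.get("url"):
        return True
    key_a = match_key(a["employer"], a["title"], a["location"])
    key_b = match_key(b["employer"], b["title"], b["location"])
    return key_a == key_b

# Hypothetical observations of one opening from two sources.
career_page = {"url": "https://example.com/jobs/1", "employer": "Acme, Inc.",
               "title": "Data Analyst", "location": "Austin, TX"}
job_board = {"url": None, "employer": "ACME Inc",
             "title": "data  analyst", "location": "Austin TX"}
is_same = same_opening(career_page, job_board)
```

Pairs that fail these exact rules would then go to the probabilistic stage (text similarity, embedding distance).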

MAPE

Mean Absolute Percentage Error is a standard metric for measuring prediction accuracy, expressed as a percentage. It calculates the average of absolute percentage differences between predicted and actual values. Lower MAPE indicates higher accuracy. Canaria's salary prediction model achieves a MAPE of <15%, trained on 50M+ salary observations from Glassdoor, Indeed, and employer-disclosed compensation data.
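
The calculation can be written out in a few lines. The salary figures below are made up for illustration.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent. Actual values must be nonzero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)

actual_salaries = [100_000, 80_000, 50_000]
predicted_salaries = [110_000, 72_000, 50_000]
error_pct = mape(actual_salaries, predicted_salaries)
```

The three absolute percentage errors are 10%, 10%, and 0%, giving a MAPE of about 6.7%.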

MinHash

A locality-sensitive hashing technique used to efficiently estimate the Jaccard similarity between two sets. In job market data, MinHash enables rapid approximate comparison of job descriptions at scale, identifying near-duplicate postings without performing expensive pairwise text comparisons. Canaria uses MinHash as part of its multi-stage deduplication pipeline alongside vector embeddings and graph-based transitive matching.
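
A minimal MinHash sketch follows, using seeded SHA-1 hashes over 3-word shingles. The shingle size, the 128-hash signature length, and the hashing scheme are illustrative assumptions rather than Canaria's settings.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """The share of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

posting_a = "senior software engineer building data pipelines in python and sql for analytics teams"
posting_b = "senior software engineer building data pipelines in python and sql for product teams"
sig_a = minhash_signature(shingles(posting_a))
sig_b = minhash_signature(shingles(posting_b))
estimate = estimate_jaccard(sig_a, sig_b)
```

The two postings share 9 of 13 distinct shingles (exact Jaccard of about 0.69), and the estimate should land near that value. Signatures are fixed-length, so they can be compared or bucketed without revisiting the full text.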

Model Garden

Canaria's suite of proprietary ML models for job data enrichment. The Model Garden is a microservice-based NLP pipeline where each model operates independently, allowing separate scaling and iteration. It includes title normalization (>90% mapped), SOC classification (85-92% at 6-digit), seniority detection (100% coverage), salary prediction (MAPE <15%), remote work status (>85% for 2023+), skills extraction (37,000+ skills), and NAICS prediction (in progress).

NAICS

The North American Industry Classification System is the standard framework used by the United States, Canada, and Mexico to classify business establishments by their primary economic activity. NAICS codes are hierarchical, ranging from 2-digit sector codes (e.g., 51 for Information) to 6-digit industry codes (e.g., 511210 for Software Publishers in the 2017 revision, renumbered 513210 in 2022). Canaria's NAICS prediction model (currently in progress) will map employer records to NAICS codes to enable industry-level labor market analysis.
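
Because the hierarchy is encoded in the code's prefixes, a 6-digit code can be rolled up to coarser levels by truncation. The sketch below assumes a well-formed 6-digit input; the function name is hypothetical.

```python
# Number of leading digits at each NAICS level.
NAICS_LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry_group",
    5: "naics_industry",
    6: "national_industry",
}

def naics_hierarchy(code):
    """Return the prefix of a 6-digit NAICS code at each hierarchy level."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a 6-digit NAICS code, got {code!r}")
    return {name: code[:digits] for digits, name in NAICS_LEVELS.items()}

levels = naics_hierarchy("511210")
```

A few sectors span a range of 2-digit prefixes (Manufacturing covers 31 through 33, for example), so a sector-level roll-up needs a small mapping on top of plain truncation.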

NER

Named Entity Recognition is an NLP technique that identifies and classifies named entities in unstructured text into predefined categories such as skills, certifications, benefits, and work requirements. Canaria's NER pipeline uses Aho-Corasick dictionary matching plus NLP relevance filtering and achieves F1 scores of 85-92% on bulleted/structured sections and 65-78% on narrative text. Skills coverage ranges from 47.5% for descriptions under 50 words to 99.5% for descriptions of 500+ words.
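
Aho-Corasick matching finds every dictionary term in a text in a single pass, no matter how many terms the dictionary holds. The following is a compact, self-contained sketch of the standard algorithm with a toy dictionary; it is not Canaria's implementation.

```python
from collections import deque

def build_automaton(terms):
    """Build Aho-Corasick goto, failure, and output tables for the terms."""
    goto, fail, out = [{}], [0], [[]]
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(term)
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out

def find_terms(text, automaton):
    """Return (start_index, term) for every dictionary term found in the text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in out[state]:
            hits.append((i - len(term) + 1, term))
    return hits

automaton = build_automaton(["python", "sql", "machine learning"])
hits = find_terms("experience with python, sql and machine learning", automaton)
```

This matcher works on raw substrings, so a term such as "java" would also fire inside "javascript". Spurious hits of that kind are what a later relevance-filtering step has to remove.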

NLP

Natural Language Processing is a branch of artificial intelligence focused on enabling computers to understand, interpret, and generate human language. In job market data, NLP powers enrichment tasks such as skills extraction, salary prediction, seniority classification, SOC code assignment, and work mode detection. Canaria applies NLP models across its entire pipeline to transform raw job postings into research-grade structured data with 82 enriched fields.

O*NET

The Occupational Information Network is the United States' primary source of occupational information, maintained by the Department of Labor. O*NET provides detailed descriptions of over 1,000 occupations including required skills, knowledge, abilities, and typical activities. O*NET-SOC codes extend the standard SOC taxonomy with finer-grained occupation classifications. Canaria's SOC classification leverages O*NET occupation definitions to improve mapping accuracy.

Salary Prediction

A machine learning model that estimates compensation for job postings that do not disclose salary information. Canaria's salary prediction model is trained on 50M+ salary observations from sources including Glassdoor, Indeed, and employer-disclosed pay data. The model requires valid US State, ZipCode, and SOC code as inputs, achieving a Mean Absolute Percentage Error (MAPE) of <15%. Coverage ranges from 50-70% for pre-2020 data to 85-95% for 2023+ data. The model returns -1 when these prerequisites are missing.
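
The prerequisite check and the -1 sentinel can be sketched as a thin wrapper around a model call. The field names, the format checks, and the `model` callable are all assumptions for illustration. The checks validate format only and do not confirm that a state, ZIP code, or SOC code actually exists.

```python
import re

MISSING = -1  # sentinel for records that lack the required inputs

def has_prerequisites(record):
    """Check that state, ZIP code, and SOC code are present and well-formed."""
    state_ok = bool(re.fullmatch(r"[A-Z]{2}", record.get("state") or ""))
    zip_ok = bool(re.fullmatch(r"\d{5}", record.get("zip_code") or ""))
    soc_ok = bool(re.fullmatch(r"\d{2}-\d{4}", record.get("soc_code") or ""))
    return state_ok and zip_ok and soc_ok

def predict_salary(record, model):
    """Return the model's estimate, or the sentinel when inputs are missing."""
    if not has_prerequisites(record):
        return MISSING
    return model(record)

def toy_model(record):
    # Stand-in for a trained model; always returns a fixed value.
    return 95_000

complete = {"state": "NY", "zip_code": "10001", "soc_code": "15-1252"}
incomplete = {"state": "NY", "zip_code": None, "soc_code": "15-1252"}
results = [predict_salary(complete, toy_model), predict_salary(incomplete, toy_model)]
```

Downstream consumers should filter out the sentinel before aggregating, since averaging -1 into salary statistics would bias them downward.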

Seniority Classification

The process of categorizing job postings by experience level: Entry, Mid, Senior, Lead, Manager, Director, or Executive. Accurate seniority classification is critical for compensation benchmarking and talent market analysis. Canaria's seniority model achieves 100% coverage, always returning a classification for every record, using both title patterns and description context.

Skills Taxonomy

A structured, hierarchical classification of professional competencies extracted from job postings. Canaria maintains a taxonomy of 37,000+ skills including 3,000+ professional certifications and 400+ soft skills. Skills are extracted using a two-step process: Aho-Corasick dictionary matching identifies candidate skills, then an NLP relevance model filters spurious matches. For 2023+ data, postings average 5-15 extracted skills, and the taxonomy match rate exceeds 90%.

SOC Code

The Standard Occupational Classification is a 6-digit coding system used by US federal agencies to classify workers into occupational categories. SOC codes are hierarchical: the first two digits identify the major group (e.g., 15 for Computer and Mathematical), the third digit the minor group, the fourth and fifth digits the broad occupation, and the sixth digit the detailed occupation. Canaria assigns SOC codes using both the job title and the full description context, achieving >95% accuracy at the 2-digit level and 85-92% at the 6-digit level. Codes are aligned with the BLS 2018 SOC.
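
The gap between 2-digit and 6-digit accuracy follows from the hierarchy: a prediction can land in the right major group while missing the detailed occupation. A small sketch with toy labels (the predicted and actual codes below are invented for illustration):

```python
def major_group(soc_code):
    """Return the 2-digit major group of a SOC code such as '15-1252'."""
    return soc_code[:2]

def accuracy(predicted, actual, level=6):
    """Share of predictions that match at the 2-digit or 6-digit level."""
    if level == 2:
        pairs = [(major_group(p), major_group(a)) for p, a in zip(predicted, actual)]
    else:
        pairs = list(zip(predicted, actual))
    return sum(p == a for p, a in pairs) / len(pairs)

actual_codes = ["15-1252", "15-1211", "29-1141", "13-2011"]
predicted_codes = ["15-1252", "15-1252", "29-1141", "11-3031"]

accuracy_2_digit = accuracy(predicted_codes, actual_codes, level=2)
accuracy_6_digit = accuracy(predicted_codes, actual_codes, level=6)
```

In this toy set the second prediction is in the right major group (15) but names the wrong detailed occupation, so it counts as correct at 2 digits and wrong at 6.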

Semantic Deduplication

A deduplication approach that uses meaning-based similarity rather than exact string matching to identify duplicate records. Two job postings may have different wording but describe the same position. Semantic deduplication uses vector embeddings to represent job descriptions as high-dimensional points, then identifies clusters of near-identical postings using cosine similarity thresholds (configurable 0.9-0.95). Canaria combines semantic deduplication with MinHash and graph-based transitive matching to achieve 40-60% dedup rate across multi-source data.
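
The thresholding step can be sketched with plain cosine similarity over toy vectors. Real embeddings would come from a text-embedding model and have hundreds of dimensions; the 3-dimensional vectors and the 0.92 threshold below are illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def near_duplicate_pairs(embeddings, threshold=0.92):
    """Return all pairs of IDs whose embeddings meet the similarity threshold."""
    ids = sorted(embeddings)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if cosine_similarity(embeddings[a], embeddings[b]) >= threshold
    ]

# Hypothetical embeddings for three postings.
toy_embeddings = {
    "job_a": [1.0, 0.0, 0.2],
    "job_b": [0.9, 0.1, 0.2],
    "job_c": [0.0, 1.0, 0.0],
}
duplicate_pairs = near_duplicate_pairs(toy_embeddings)
```

Comparing every pair is quadratic in the number of postings, so at scale this step is usually restricted to candidates pre-selected by a cheaper method such as MinHash bucketing or an approximate nearest-neighbor index.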

Title Normalization

The process of standardizing job titles to canonical forms to enable consistent analysis. Employers use widely varying titles for equivalent roles (e.g., "Software Engineer II", "SWE 2", "Software Developer - Mid Level" may all describe the same position). Canaria's title normalization model maps >90% of titles to standardized forms with confidence scores (0-1), enabling accurate trend analysis, compensation benchmarking, and cross-employer comparisons.
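
A toy version of the idea: expand known abbreviations, then pick the closest canonical title and report the string similarity as a confidence score between 0 and 1. The abbreviation map, the canonical list, and the use of `difflib` are assumptions for illustration; Canaria's model is not described at this level of detail.

```python
import difflib
import re

# Hypothetical abbreviation map and canonical title list.
ABBREVIATIONS = {"sw": "software", "swe": "software engineer",
                 "eng": "engineer", "sr": "senior", "2": "ii"}
CANONICAL_TITLES = ["Software Engineer II", "Senior Software Engineer", "Data Analyst"]

def normalize_title(raw):
    """Return (best canonical title, confidence between 0 and 1)."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    expanded = " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)
    best, best_score = None, 0.0
    for title in CANONICAL_TITLES:
        score = difflib.SequenceMatcher(None, expanded, title.lower()).ratio()
        if score > best_score:
            best, best_score = title, score
    return best, round(best_score, 2)

normalized = [normalize_title("SW Eng II"), normalize_title("SWE 2")]
```

Both variants expand to the same string and map to "Software Engineer II" with confidence 1.0. Titles with no close canonical form would score lower, and a cutoff on the confidence score can route them to manual review.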

Work Mode

A classification field indicating whether a position is Remote, Hybrid, On-site, or Flexible. Work mode has become one of the most important job attributes since 2020, driving both candidate preferences and employer strategy. Canaria classifies work mode using NLP applied to job descriptions, achieving >85% accuracy for 2023+ data. Coverage varies significantly by vintage: <5% pre-2020, 40-70% for 2020-2022, and 85%+ for 2023+.
