# Canaria

> Canaria provides research-grade job market intelligence.
> 900M+ deduplicated job postings, 82 enriched fields per record.
> Products: Job Postings, Company Profiles, Salary Data, Skills & Taxonomy, Google Maps.
> Coverage: United States, 2022-present, daily updates.
> Pricing: API from $149/mo, Flat files from $2,500/mo.

## Products
- [Job Postings](https://www.decanaria.com/datasets/job-postings): 900M+ postings, 82 fields
- [Company Profiles](https://www.decanaria.com/datasets/company-profiles): 28.5M companies
- [Salary Data](https://www.decanaria.com/datasets/salary-data): AI-predicted (MAPE <15%) + Glassdoor
- [Skills & Taxonomy](https://www.decanaria.com/datasets/skills-taxonomy): 37,000+ skills
- [Google Maps](https://www.decanaria.com/datasets/google-maps): 193M+ business records (52M detailed with reviews)

## Data Insights
- [Remote Work Trends](https://www.decanaria.com/data/remote-work-trends)
- [Top Hiring Companies](https://www.decanaria.com/data/top-hiring-companies)
- [Most In-Demand Skills](https://www.decanaria.com/data/most-in-demand-skills)
- [Software Engineer Salaries](https://www.decanaria.com/data/software-engineer-salaries)
- [AI & ML Job Market](https://www.decanaria.com/data/ai-ml-job-market)
- [Hiring Trends by Industry](https://www.decanaria.com/data/hiring-trends-by-industry)

## Resources
- [Methodology](https://www.decanaria.com/methodology)
- [Data Schema](https://www.decanaria.com/schema)
- [Interactive Explorer](https://www.decanaria.com/explore)
- [API Stats](https://www.decanaria.com/api/public/stats)

## Contact
- Website: https://www.decanaria.com
- Email: contact@decanaria.com
- Free sample: https://www.decanaria.com/sample

---

# Detailed Product Descriptions

## 1. Job Postings Dataset
The core dataset. 900M+ unique job postings after semantic deduplication from 1B+ ingested records. Every record is enriched with 82 fields covering classification, salary, skills, location, company, and work requirements.

- **Volume**: 900M+ unique postings (907M in production pipeline)
- **Sources**: Indeed (226M), LinkedIn Jobs (176M), 200K+ employer ATS career portals (Greenhouse, Lever, Workday, iCIMS, etc.)
- **Coverage**: Primarily United States, 2022-present
- **Updates**: Daily incremental, ~25M active postings/month
- **Dedup rate**: 40-60% of raw scraped records are duplicates, removed via semantic deduplication
- **Fields per record**: 82 enriched fields across 10 categories
- **Delivery**: CSV/Parquet via S3, GCS, Snowflake, SFTP, Google Drive, Dropbox

## 2. Company Profiles Dataset
Canonical company records with firmographics, hiring activity, and industry classification. Built from LinkedIn profiles, job posting metadata, and third-party business registries.

- **Volume**: 28.5M company profiles
- **Company match rate**: >90% of job postings linked to a canonical company
- **LinkedIn profiles**: 70M+ indexed
- **Freshness**: Daily updates
- **Key fields**: Company name, industry, size, HQ location, founding year, revenue range, type, office locations, hiring volume, job mix

## 3. Salary Data
Three salary signals per job: parsed posted salary, AI-predicted salary, and Glassdoor benchmarks.

- **AI-predicted salary MAPE**: <15% (trained on 50M+ Glassdoor/Indeed observations)
- **Predicted salary coverage**: 85-95% of postings (2023+)
- **Parsed posted salary coverage**: ~35% of postings
- **Glassdoor records**: ~11M salary records across 854K companies
- **Output**: Annual USD, min/avg/max ranges
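
For readers less familiar with the metric, MAPE is the mean of |predicted - actual| / actual over a set of observations. The sketch below computes it for a few made-up salary pairs. The function name and the numbers are illustrative assumptions, not Canaria's evaluation code or data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.15 means 15%)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical posted vs. predicted annual salaries in USD (not real data).
posted = [100_000.0, 80_000.0, 200_000.0]
predicted = [110_000.0, 72_000.0, 190_000.0]

error = mape(posted, predicted)
print(f"MAPE: {error:.1%}")
```

A model meeting the stated target would keep this value below 0.15 on a held-out set of postings with known salaries.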

## 4. Skills & Occupation Taxonomy
NLP-extracted skills, certifications, and soft skills from job descriptions. SOC code classification and normalized job titles.

- **Skills taxonomy**: 37,000+ unique skills
- **Certifications**: 3,000+ recognized certifications
- **Soft skills**: 400+ behavioral/interpersonal skills
- **SOC accuracy (2-digit)**: >95%
- **SOC accuracy (6-digit)**: 85-92%
- **Title normalization accuracy**: >90%
- **Average skills per posting**: 5-15
- **Taxonomy match rate**: >90%

## 5. Google Maps / Business Locations
US business records from Google Maps with reviews, ratings, categories, and contact info.

- **Detailed records**: 52M (includes reviews, user content, ratings)
- **Basic records**: 193M (core fields without review text)
- **Use cases**: Enrich job data with employer location intelligence, commercial real estate, local market analysis

---

# Methodology Summary

## NLP Enrichment System Architecture
Canaria uses an NLP enrichment system: a collection of specialized ML models, each tuned for a specific enrichment task. Rather than a single general-purpose model, each field is produced by the best-fit approach:

- **Title normalization**: Encoder model mapping raw titles to canonical forms (>90% accuracy)
- **SOC classification**: Uses both title AND description context, not title-only keyword matching (>95% at 2-digit, 85-92% at 6-digit)
- **Seniority classification**: Multi-signal model achieving 100% completeness (always returns a value)
- **Salary prediction**: Gradient-boosted model trained on 50M+ salary observations, MAPE <15%
- **Skills extraction**: NER + taxonomy matching, F1 85-92% on structured postings, 65-78% on narrative descriptions
- **Remote/hybrid classification**: Context-aware model (not keyword-only), 85%+ accuracy on 2023+ data
- **Benefits extraction**: NLP extraction from unstructured description text, ~65% coverage
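
The F1 figures quoted for skills extraction combine precision and recall. The source does not say how the score is aggregated or how matches are counted, so the sketch below assumes a simple per-posting, case-insensitive exact match between a hand-labeled skill set and an extracted one. The function name and sample skills are hypothetical.

```python
def skill_f1(expected, extracted):
    """F1 between a labeled skill set and an extracted one (case-insensitive)."""
    gold = {s.lower() for s in expected}
    found = {s.lower() for s in extracted}
    true_positives = len(gold & found)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(found)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels and extractor output for one posting.
gold_skills = ["Python", "Go", "AWS", "Kubernetes"]
extracted_skills = ["python", "AWS", "Kubernetes", "Docker"]

score = skill_f1(gold_skills, extracted_skills)
print(f"F1: {score:.2f}")
```

Here three of four extracted skills are correct and three of four labeled skills are found, so precision and recall are both 0.75.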

## Semantic Deduplication
Job postings are heavily duplicated across boards. Canaria removes 40-60% of raw volume using a three-stage dedup process:

1. **Vector similarity**: Sentence embeddings to detect near-identical descriptions
2. **MinHash / Jaccard**: Shingling-based fuzzy matching for paraphrased postings
3. **Graph-based transitive matching**: If A=B and B=C, then A=C (catches indirect duplicates)

Result: 900M+ unique postings from 1B+ ingested, each with a stable `jobId` (SHA256 content hash) and `contentId` for cross-stage joins.
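
Stages 2 and 3 can be illustrated with a small, self-contained sketch. It is not Canaria's pipeline: it skips the embedding stage, uses exact Jaccard similarity on word shingles where a production system would use MinHash to approximate it at scale, and merges matches with a union-find structure. The shingle size, the 0.5 threshold, and the sample texts are all assumptions chosen for illustration.

```python
from itertools import combinations


def shingles(text, k=2):
    """Set of k-word shingles from lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find(parent, x):
    """Union-find root lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def dedup_clusters(postings, threshold=0.5):
    """Cluster posting ids; pairwise matches are merged transitively."""
    sets = {pid: shingles(text) for pid, text in postings.items()}
    parent = {pid: pid for pid in postings}
    for a, b in combinations(postings, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(parent, a)] = find(parent, b)
    clusters = {}
    for pid in postings:
        clusters.setdefault(find(parent, pid), set()).add(pid)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical postings: "a" matches "b", "b" matches "c", but "a" and "c"
# fall below the threshold when compared directly.
postings = {
    "a": "senior backend engineer python aws remote",
    "b": "senior backend engineer python aws hybrid team",
    "c": "backend engineer python aws hybrid team lead",
    "d": "registered nurse night shift icu",
}

clusters = dedup_clusters(postings)
print(clusters)
```

The all-pairs comparison is quadratic and only workable for toy inputs. At real volumes, candidate pairs would come from an index such as locality-sensitive hashing.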

## Location Enrichment
Three-source location pipeline: NLP parser, geocoding service, and raw scrape. The "final" fields (`finalCity`, `finalState`, `finalZipcode`, `finalCountry`) select the best available source per record.

- State accuracy: 92-97%
- City accuracy: 85-93%

---

# Field List for Job Postings (82 fields)

## Raw Fields (19)
- jobId — Primary key, SHA256 hash (100%)
- jobUrl — Link to posting (100%)
- sourceWebsite — Job board identifier: indeed, linkedin, greenhouse, etc. (100%)
- jobTitle — Raw title as posted (99%)
- jobDescription — Full HTML/text description (97%)
- companyName — Company name as listed (96%)
- scrapedLocation — Location string as displayed (93%)
- scrapedSalary — Salary text verbatim (~35%)
- jobDate — Date first observed on source (98%)
- jobKey — Source-specific job ID (100%)
- sourceCountry — Country code for regional board (100%)
- jobFunction — Job function category (~60%)
- department — Organizational unit (~30%)
- companyProfileUrl — Company profile link (~70%)
- scrapedSeniority — Seniority from metadata (~25%)
- scrapedEmployment — Employment type from metadata (~45%)
- scrapedBenefits — Benefits from structured fields (~20%)
- scrapedResponsibilities — Responsibilities if structured (~15%)
- scrapedQualifications — Qualifications if structured (~15%)

## Location Fields (16)
- city, state, zip_code, county, cbsa_code — Parsed/geocoded location
- latitude, longitude — Coordinates
- finalCity, finalState, finalZipcode, finalCountry — Best-available
- parsedCity, parsedState, parsedCountry — NLP parser output
- calcCity, calcState — Geocoding output

## Classification Fields (8)
- nlpNormalizedTitle — Standardized title (~100%)
- nlpNormalizedTitleScore — Confidence 0-1 (~100%)
- nlpSocCode — SOC 6-digit code (85-90%)
- nlpSocTitle — Official BLS occupation title (85-90%)
- nlpSeniority — ML-classified level (85-95%)
- nlpEmployment — ML-classified type (85-90%)
- nlpRemote — Work mode: Remote/Hybrid/On-site (85%+)
- postingLanguage — ISO 639-1 language (95%+)

## Salary Fields (5)
- parsedAnnualSalaryMin — Lower bound, annual USD (~35%)
- parsedAnnualSalaryAvg — Midpoint (~35%)
- parsedAnnualSalaryMax — Upper bound (~35%)
- nlpSalary — AI-predicted annual salary (85-95%)
- nlpDescriptionLength — Character count (97%)

## Skills & Qualifications Fields (7)
- nlpSkills — Technical/hard skills array (80-93%)
- nlpSoftSkills — Interpersonal skills array (70-85%)
- nlpCertifications — Professional certifications (~30%)
- nlpDegreeLevels — Education degrees (~60%)
- nlpDegreeLevelMin — Minimum acceptable degree (~55%)
- nlpQualifications — Other requirements (~50%)
- nlpExperienceRequirements — Years/type of experience (~45%)

## Benefits Fields (2)
- nlpBenefits — Benefits extracted from description (~65%)
- scrapedBenefits — Benefits from structured fields (~20%)

Note: `scrapedBenefits` also appears under Raw Fields, so the per-category counts sum to 83 while each record has 82 distinct fields.

## Work Requirements Fields (10)
- nlpOffersVisaSponsorship — Visa sponsorship boolean (~15%)
- nlpRequiresClearance — Security clearance boolean (~8%)
- nlpClearanceLevels — Specific clearance levels (~5%)
- nlpCitizenshipRequired — Citizenship requirements (~10%)
- nlpOffersEquity — Equity/stock compensation (~12%)
- nlpRequiresTravel — Travel required boolean (~20%)
- nlpTravelPercentages — Quantified travel (~10%)
- nlpIsShiftWork — Shift-based position (~8%)
- nlpShiftTypes — Shift types (~5%)
- nlpLanguagesRequired — Non-English languages (~8%)

## Role Classification Fields (5)
- nlpIsManagerialRole — People management boolean (~90%)
- nlpIsUrgentHiring — Time-sensitive hiring signal (~5%)
- nlpNumberOfOpenings — Positions available (~10%)
- nlpTeamSizes — Team size context (~8%)
- nlpExpectedStartDates — Start timeline (~5%)

## Company Fields (7)
- companyIndustry — Industry classification (~85%)
- companySize — Employee count range (~70%)
- companyHqLocation — HQ location (~65%)
- companyFoundedYear — Founding year (~55%)
- companyRevenue — Revenue range (~50%)
- companyType — Organization type (~60%)
- companyOfficeLocations — Office locations array (~40%)

## Dedup / Metadata Fields (4)
- contentId — Content hash for joins (92%)
- firstScrapedTime — First detection timestamp (100%)
- lastScrapedTime — Most recent observation (100%)
- firstModifyTime — Earliest enrichment timestamp (~95%)

---

# Sample Record

```json
{
  "jobId": "cj_9f8a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a",
  "jobUrl": "https://indeed.com/viewjob?jk=abc123def456",
  "sourceWebsite": "indeed",
  "jobTitle": "Sr. Software Engineer - Backend",
  "companyName": "Acme Corp",
  "scrapedLocation": "San Francisco, CA 94105",
  "scrapedSalary": "$180,000 - $240,000 a year",
  "jobDate": "2026-02-15",
  "jobKey": "jk_abc123def456",
  "sourceCountry": "us",
  "jobFunction": "Engineering",
  "scrapedSeniority": "Senior",
  "scrapedEmployment": "Full-time",

  "city": "San Francisco",
  "state": "CA",
  "zip_code": "94105",
  "county": "San Francisco County",
  "cbsa_code": "41860",
  "latitude": 37.7749,
  "longitude": -122.4194,
  "finalCity": "San Francisco",
  "finalState": "CA",
  "finalZipcode": "94105",
  "finalCountry": "US",

  "nlpNormalizedTitle": "Software Engineer",
  "nlpNormalizedTitleScore": 0.96,
  "nlpSocCode": "15-1252",
  "nlpSocTitle": "Software Developers",
  "nlpSeniority": "Senior",
  "nlpEmployment": "Full-time",
  "nlpRemote": "Hybrid",
  "postingLanguage": "en",

  "parsedAnnualSalaryMin": 180000.0,
  "parsedAnnualSalaryAvg": 210000.0,
  "parsedAnnualSalaryMax": 240000.0,
  "nlpSalary": 208500.0,
  "nlpDescriptionLength": 4250,

  "nlpSkills": ["Python", "Go", "AWS", "Kubernetes", "PostgreSQL", "Redis", "gRPC", "Terraform"],
  "nlpSoftSkills": ["Leadership", "Communication", "Mentoring"],
  "nlpCertifications": ["AWS Solutions Architect"],
  "nlpDegreeLevels": ["Bachelor's", "Master's"],
  "nlpDegreeLevelMin": "Bachelor's",
  "nlpQualifications": ["5+ years backend development", "Distributed systems experience"],
  "nlpExperienceRequirements": ["5-8 years software engineering"],

  "nlpBenefits": ["Health Insurance", "Dental", "Vision", "401k Match", "Stock Options", "Unlimited PTO", "Remote Flexibility"],

  "nlpOffersVisaSponsorship": true,
  "nlpRequiresClearance": false,
  "nlpOffersEquity": true,
  "nlpRequiresTravel": false,
  "nlpIsShiftWork": false,

  "nlpIsManagerialRole": false,
  "nlpIsUrgentHiring": false,
  "nlpNumberOfOpenings": ["2"],

  "companyIndustry": "Technology",
  "companySize": "1001-5000",
  "companyHqLocation": "San Francisco, CA",
  "companyFoundedYear": "2012",
  "companyRevenue": "$100M-$500M",
  "companyType": "Private",

  "contentId": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2",
  "firstScrapedTime": "2026-02-15T08:30:00Z",
  "lastScrapedTime": "2026-03-10T14:22:00Z",
  "firstModifyTime": "2026-02-16T03:00:00Z"
}
```
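
As a usage sketch, a consumer might pick one salary figure per record by preferring the posted midpoint and falling back to the prediction. That preference is an assumption, not a documented recommendation. The `jobId` pattern is likewise inferred from the single sample above, and the function name is hypothetical.

```python
import json
import re

# Assumption: "cj_" plus 64 hex characters, inferred from the sample record
# only; it is not a documented format guarantee.
JOB_ID_PATTERN = re.compile(r"^cj_[0-9a-f]{64}$")


def effective_salary(record):
    """Return (annual_usd, source), preferring the posted midpoint."""
    posted = record.get("parsedAnnualSalaryAvg")
    if posted is not None:
        return posted, "posted"
    predicted = record.get("nlpSalary")
    if predicted is not None:
        return predicted, "predicted"
    return None, "missing"


# A trimmed copy of the sample record; most fields are omitted for brevity.
sample = json.loads("""
{
  "jobId": "cj_9f8a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a",
  "nlpNormalizedTitle": "Software Engineer",
  "parsedAnnualSalaryMin": 180000.0,
  "parsedAnnualSalaryAvg": 210000.0,
  "parsedAnnualSalaryMax": 240000.0,
  "nlpSalary": 208500.0
}
""")

salary, source = effective_salary(sample)
id_looks_valid = bool(JOB_ID_PATTERN.match(sample["jobId"]))
print(salary, source, id_looks_valid)
```

Since only about 35% of postings carry a parsed posted salary, most records would take the `"predicted"` branch in practice.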

---

# Frequently Asked Questions

## Q: How does Canaria deduplicate job postings?
A: We use a three-stage semantic deduplication pipeline: (1) vector similarity using sentence embeddings, (2) MinHash/Jaccard shingling for fuzzy matching, and (3) graph-based transitive matching to catch indirect duplicates. This removes 40-60% of raw volume, leaving 900M+ truly unique postings from 1B+ ingested.

## Q: How accurate are the AI salary predictions?
A: Our salary prediction model achieves a Mean Absolute Percentage Error (MAPE) of less than 15%, trained on 50M+ salary observations from Glassdoor, Indeed, and other sources. Predicted salary coverage reaches 85-95% of postings from 2023 onward, compared to only ~35% of postings that include a posted salary.

## Q: What sources does Canaria collect from?
A: We collect from three source categories: Indeed (226M postings), LinkedIn Jobs (176M postings), and 200,000+ employer ATS career portals including Greenhouse, Lever, Workday, iCIMS, and others. Coverage is primarily US, from 2022 to present, with daily incremental updates.

## Q: How is SOC classification done?
A: Unlike providers that match SOC codes using job titles alone, Canaria uses both the job title AND the full job description for classification. This context-aware approach achieves >95% accuracy at the 2-digit SOC level and 85-92% at the 6-digit level.

## Q: What delivery formats and channels are available?
A: Data is available in CSV and Parquet formats. Delivery channels include S3, GCS, Snowflake, SFTP, Google Drive, and Dropbox. We also distribute through AWS Data Exchange, Databricks Marketplace, SAP Data Marketplace, and Google Analytics Hub.

## Q: Is seniority always populated?
A: Yes. Canaria's seniority classification achieves 100% completeness: every record returns a seniority value (Intern, Entry, Mid, Senior, Lead, Director, VP, C-Level). This is a multi-signal model that combines title parsing, description analysis, and salary context.

## Q: How does Canaria compare to Lightcast or Revelio Labs?
A: Canaria provides comparable NLP enrichment (SOC codes, salary predictions, skills extraction, seniority, deduplication) at a fraction of the price. Lightcast charges $200K+/year for enterprise access. Canaria offers API access from $149/month and flat file delivery from $2,500/month, making research-grade data accessible to startups, quant funds, and mid-market buyers.

## Q: How does Canaria compare to raw data providers like Coresignal or Bright Data?
A: Raw data providers sell unenriched scrapes: you get a job title, company, and location but no SOC codes, no salary predictions, no skills extraction, no deduplication. Building those enrichments in-house costs $500K-$1M in Year 1 plus $200K+/year ongoing. Canaria delivers all 82 enriched fields ready to use.

---

# Pricing Summary

## API Pricing (Credit-Based)
| Plan | Monthly Credits | Monthly Price | Annual Price | Per Credit |
|------|----------------|---------------|--------------|------------|
| Free | 200 | $0 | $0 | -- |
| PAYG | Top-up (min $25) | $0.08/credit | -- | $0.080 |
| Lite | 1,000 | $49/mo | $39/mo | $0.049 |
| Starter | 5,000 | $149/mo | $119/mo | $0.030 |
| Growth | 25,000 | $499/mo | $399/mo | $0.020 |
| Scale | 100,000 | $1,499/mo | $1,199/mo | $0.015 |
| Enterprise | Custom | Custom | Custom | $0.005-0.01 |

### Credit Costs by Operation
- Job Postings Search: 1 credit/record
- Job Postings Bulk Export: 0.8 credits/record
- Aggregate Query: 10 credits
- Full Enrichment (6 models): 3 credits
- Core Enrichment (3 models): 2 credits
- Company Profile Lookup: 1 credit
- Salary Prediction: 2 credits
- Glassdoor Record: 1 credit
- Business Location Lookup: 1 credit
- Count / Preview / Schema: FREE
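
To budget a workload, multiply each operation's credit cost by its expected volume and compare the total with the plan allowances in the table above. The sketch below does that arithmetic. It assumes fractional credits round up, that enrichment and prediction costs apply per call, and that there is no overage billing; none of these are stated in the source. The operation keys and function names are hypothetical.

```python
import math
from decimal import Decimal

# Credit costs from the list above (Decimal avoids float rounding on 0.8).
CREDIT_COSTS = {
    "search_record": Decimal("1"),
    "bulk_export_record": Decimal("0.8"),
    "aggregate_query": Decimal("10"),
    "full_enrichment": Decimal("3"),
    "core_enrichment": Decimal("2"),
    "company_lookup": Decimal("1"),
    "salary_prediction": Decimal("2"),
}

# (plan, monthly credits, monthly price in USD) from the API pricing table.
PLANS = [
    ("Free", 200, 0),
    ("Lite", 1_000, 49),
    ("Starter", 5_000, 149),
    ("Growth", 25_000, 499),
    ("Scale", 100_000, 1_499),
]


def credits_needed(workload):
    """Total credits for a {operation: count} workload, rounded up."""
    total = sum(CREDIT_COSTS[op] * count for op, count in workload.items())
    return math.ceil(total)


def smallest_plan(credits):
    """Smallest fixed plan whose monthly allowance covers the credits."""
    for name, included, price in PLANS:
        if included >= credits:
            return name, price
    return "Enterprise", None  # custom pricing, no listed price


# Hypothetical monthly workload.
workload = {
    "bulk_export_record": 3_000,
    "aggregate_query": 20,
    "salary_prediction": 500,
}

needed = credits_needed(workload)
plan, price = smallest_plan(needed)
print(needed, plan, price)
```

This workload totals 2,400 + 200 + 1,000 = 3,600 credits, which fits within the Starter allowance of 5,000.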

## Flat File Pricing
| Product | Starting At | Example Tiers |
|---------|------------|---------------|
| Job Postings | $2,500/mo | 500K/mo $2,500 - 10M/mo $20,000 |
| Company Profiles | $3,000 (one-time) | 500K $3,000 - 10M $41,500 (annual with quarterly refreshes at 1.5x) |
| Salary Data | $2,000/mo | 500K/mo $2,000 - 10M/mo $16,000 |
| Skills & Taxonomy | $3,000/mo | Taxonomy Only $3K - Taxonomy + Trends $5K |
| Google Maps (Detailed) | $1,200/mo | 500K/mo $1,200 - 5M/mo $7,000 |
| Google Maps (Fast) | $750/mo | 500K/mo $750 - 10M/mo $6,500 |

- **Delivery**: CSV/Parquet via S3, GCS, Snowflake, SFTP, Google Drive, Dropbox
- **Marketplaces**: AWS Data Exchange, Databricks Marketplace, SAP Data Marketplace, Google Analytics Hub

---

# Company Information
- **Name**: Canaria Inc.
- **Website**: https://www.decanaria.com
- **Email**: contact@decanaria.com
- **Location**: New York, NY
- **Founded**: 2022
- **Team**: ML experts from Google, Meta, Amazon; Stanford, Caltech, Columbia alumni
- **Compliance**: GDPR compliant, CCPA compliant, AI4Good Foundation founding member
- **LinkedIn**: https://www.linkedin.com/company/decanaria/