Browse field-level schemas for all five Canaria data products. Each table below lists one product's fields with their types, descriptions, example values, and coverage.

| Field Name | Type | Description | Example | Coverage |
|---|---|---|---|---|
| jobId | FixedString(64) | Primary key. SHA256 hash for deduplication. Unique per posting. | cj_9f8a2b3c4d... | 100% |
| jobUrl | String | Link to full job description page. May expire when job removed. | https://indeed.com/viewjob?jk=abc123 | 100% |
| sourceWebsite | String | Job board identifier (indeed, linkedin, glassdoor, greenhouse, lever, workday, etc.) | indeed | 100% |
| jobTitle | String | Raw job title exactly as posted by employer. | Sr. SW Eng II - Backend | 99% |
| jobDescription | String | Full job description HTML/text from the posting source. | `<p>We are looking for a...</p>` | 97% |
| companyName | String | Company name as listed in the posting, before normalization. | Macys Inc. | 96% |
| scrapedLocation | String | Location string as displayed in the original posting. | New York, NY 10001 | 93% |
| scrapedSalary | String | Salary text verbatim from the posting (when provided). | $90,000 - $120,000 a year | ~35% |
| jobDate | Date | Date the posting was first observed on the source. | 2025-11-15 | 98% |
| jobKey | String | Source-specific job ID from original board (Indeed ID, LinkedIn ID, etc.) | jk_abc123def456 | 100% |
| sourceCountry | String | Country code for regional job board version (us, ca, uk, etc.) | us | 100% |
| jobFunction | String | Job function/category assigned by job board (Engineering, Sales, etc.) | Engineering | ~60% |
| department | String | Organizational unit (Engineering, Sales, HR, etc.) | Engineering | ~30% |
| companyProfileUrl | String | Link to company profile on the job board. | https://indeed.com/cmp/Acme-Corp | ~70% |
| scrapedSeniority | String | Seniority from posting metadata (Entry Level, Senior, etc.) | Senior | ~25% |
| scrapedEmployment | String | Employment type from metadata (Full-time, Part-time, Contract). | Full-time | ~45% |
| scrapedBenefits | String | Benefits from structured posting fields. | Health, Dental, Vision, 401k | ~20% |
| scrapedResponsibilities | String | Job responsibilities if provided as structured field. | Design and implement backend services... | ~15% |
| scrapedQualifications | String | Required qualifications if provided as structured field. | Bachelor's degree in Computer Science... | ~15% |
| city | String | Parsed city name, standardized. | New York | 91% |
| state | String | Two-letter US state code. | NY | 93% |
| zip_code | String | 5-digit US ZIP code, parsed or geocoded. | 10001 | 78% |
| county | String | US county name derived from geocoding. | New York County | 76% |
| cbsa_code | String | Core-Based Statistical Area code (metro/micro area). | 35620 | 74% |
| latitude | Float64 | Latitude coordinate of the posting location. | 40.7128 | 75% |
| longitude | Float64 | Longitude coordinate of the posting location. | -74.0060 | 75% |
| finalCity | String | Best-available city (priority: parsed > scraped > calc). | New York | 91% |
| finalState | String | Best-available state. | NY | 93% |
| finalZipcode | String | Best-available zipcode. | 10001 | 78% |
| finalCountry | String | Best-available country. | US | 92.5% |
| parsedCity | String | City extracted by NLP parser. | New York | ~88% |
| parsedState | String | State extracted by NLP parser. | NY | ~90% |
| parsedCountry | String | Country extracted by NLP parser. | US | ~90% |
| calcCity | String | City from geocoding service. | New York | ~75% |
| calcState | String | State from geocoding. | NY | ~78% |
| nlpNormalizedTitle | String | Standardized job title ("SW Eng II" -> "Software Engineer"). | Software Engineer | ~100% |
| nlpNormalizedTitleScore | Float32 | Title normalization confidence (0.0-1.0). | 0.94 | ~100% |
| nlpSocCode | String | Standard Occupational Classification code (6-digit, BLS 2018). | 15-1252 | 85-90% |
| nlpSocTitle | String | Official BLS occupation title for the SOC code. | Software Developers | 85-90% |
| nlpSeniority | String | ML-classified seniority level. | Senior | 85-95% |
| nlpEmployment | String | ML-classified employment type. | Full-time | 85-90% |
| nlpRemote | String | ML-classified work mode. | Remote | 85%+ (2023+) |
| postingLanguage | String | Detected language of posting (ISO 639-1). | en | 95%+ |
| parsedAnnualSalaryMin | Float32 | Lower bound, normalized to annual USD. | 90000.0 | ~35% (stated) |
| parsedAnnualSalaryAvg | Float32 | Midpoint (min+max)/2 in annual USD. | 105000.0 | ~35% (stated) |
| parsedAnnualSalaryMax | Float32 | Upper bound, normalized to annual USD. | 120000.0 | ~35% (stated) |
| nlpSalary | Float32 | AI-predicted annual salary (MAPE <15%). | 108500.0 | 85-95% (2023+) |
| nlpDescriptionLength | UInt32 | Description character count (quality signal). | 4250 | 97% |
| nlpSkills | Array(String) | Technical/hard skills extracted from description. | ["Python", "SQL", "AWS", "Docker"] | 80-93% |
| nlpSoftSkills | Array(String) | Interpersonal and behavioral skills. | ["Communication", "Leadership", "Teamwork"] | 70-85% |
| nlpCertifications | Array(String) | Professional certifications required or preferred. | ["PMP", "AWS Solutions Architect", "CPA"] | ~30% |
| nlpDegreeLevels | Array(String) | Education degrees mentioned in posting. | ["Bachelor's", "Master's"] | ~60% |
| nlpDegreeLevelMin | String | Minimum acceptable degree for the position. | Bachelor's | ~55% |
| nlpQualifications | Array(String) | Other qualification requirements. | ["5+ years experience", "US citizen"] | ~50% |
| nlpExperienceRequirements | Array(String) | Years/type of experience extracted. | ["3-5 years software development"] | ~45% |
| nlpBenefits | Array(String) | Benefits extracted from description text. | ["401k", "Health Insurance", "PTO", "Dental"] | ~65% |
| nlpOffersVisaSponsorship | Boolean | Does employer offer visa sponsorship? Critical for international candidates. | true | ~15% |
| nlpRequiresClearance | Boolean | Does job require security clearance? Defense/gov sector filtering. | true | ~8% |
| nlpClearanceLevels | Array(String) | Specific clearance levels required. | ["Secret", "Top Secret"] | ~5% |
| nlpCitizenshipRequired | Array(String) | Citizenship/authorization requirements. | ["US Citizen", "Green Card"] | ~10% |
| nlpOffersEquity | Boolean | Does compensation include equity/stock? Startup indicator. | true | ~12% |
| nlpRequiresTravel | Boolean | Does the job require travel? | true | ~20% |
| nlpTravelPercentages | Array(String) | Quantified travel requirements. | ["25%", "50%"] | ~10% |
| nlpIsShiftWork | Boolean | Is this a shift-based position? | false | ~8% |
| nlpShiftTypes | Array(String) | Specific shift types when applicable. | ["Night", "Weekend", "Rotating"] | ~5% |
| nlpLanguagesRequired | Array(String) | Non-English language requirements. | ["Spanish", "Mandarin"] | ~8% |
| nlpIsManagerialRole | Boolean | Is this a people management position? | true | ~90% |
| nlpIsUrgentHiring | Boolean | Time-sensitive hire signal (market demand indicator). | true | ~5% |
| nlpNumberOfOpenings | Array(String) | How many positions available (hiring volume signal). | ["3"] | ~10% |
| nlpTeamSizes | Array(String) | Size of team this role joins (organizational context). | ["15-20"] | ~8% |
| nlpExpectedStartDates | Array(String) | When role is expected to start (hiring timeline). | ["2025-01-15"] | ~5% |
| companyIndustry | String | Industry classification. Company database value preferred, posting value as fallback. | Technology | ~85% |
| companySize | String | Employee count range. | 1001-5000 | ~70% |
| companyHqLocation | String | Headquarters location. | San Francisco, CA | ~65% |
| companyFoundedYear | Date | Year the company was founded. | 2010 | ~55% |
| companyRevenue | String | Revenue range. | $100M-$500M | ~50% |
| companyType | String | Organization type. | Private | ~60% |
| companyOfficeLocations | Array(String) | Office locations array. | ["SF", "NYC", "Austin"] | ~40% |
| contentId | FixedString(64) | Content hash for cross-stage joins. SHA256(norm(title) + norm(company) + raw(desc)). | a1b2c3d4e5... | 92% |
| firstScrapedTime | DateTime64 | First time posting was detected by scraper (UTC). | 2025-10-01T08:30:00Z | 100% |
| lastScrapedTime | DateTime64 | Most recent time the posting was observed as active (UTC). lastScrapedTime minus firstScrapedTime gives the active period. | 2025-11-15T14:22:00Z | 100% |
| firstModifyTime | DateTime64 | Earliest enrichment timestamp across all processing stages. | 2025-10-02T03:00:00Z | ~95% |
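
The `contentId` formula and the active-period note in the table above can be sketched in Python. This is only an illustration: the `norm` helper (lowercasing and collapsing whitespace) is an assumption, since the table does not say how titles and company names are normalized, and the function names are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta


def norm(text: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def content_id(title: str, company: str, description: str) -> str:
    """SHA256 over norm(title) + norm(company) + the raw description, as hex."""
    payload = norm(title) + norm(company) + description
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def active_period(first_scraped: str, last_scraped: str) -> timedelta:
    """Time between firstScrapedTime and lastScrapedTime (both UTC)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(last_scraped, fmt) - datetime.strptime(first_scraped, fmt)


cid = content_id("Sr. SW Eng II - Backend", "Macys Inc.", "<p>We are looking for a...</p>")
print(len(cid))  # 64 hex characters, which fits FixedString(64)
print(active_period("2025-10-01T08:30:00Z", "2025-11-15T14:22:00Z").days)  # 45
```

Because the description is hashed raw, any change to its text or markup yields a different `contentId`, while cosmetic differences in title or company casing do not (under the assumed `norm`).
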

| Field Name | Type | Description | Example | Coverage |
|---|---|---|---|---|
| companyProfileUrlId | String | Primary key. Unique identifier derived from profile URL. | linkedin_acme-corp | 100% |
| companyNameId | String | Normalized company name identifier for joins. | acme_corp | 100% |
| companyName | String | Display name of the company. | Acme Corporation | 100% |
| companyProfileUrl | String | Full URL to the company profile page. | https://linkedin.com/company/acme-corp | 100% |
| companyProfileName | String | Profile slug or vanity name from the source. | acme-corp | 93% |
| companyKey | String | Internal key used for cross-table joins. | ck_acme_corp_123 | 89% |
| companyHeadquarters | String | Headquarters city and state/country. | San Francisco, CA | 77% |
| country | String | Country code of headquarters. | US | 81% |
| companyIndustry | String | Primary industry classification. | Technology | 84% |
| companySize | String | Employee count range bucket. | 1001-5000 | 83% |
| companyFoundDate | DateTime64 | Date the company was founded. | 2005-06-01 | 33% |
| companyWebsite | String | Company website URL. | https://acme.com | 68% |
| companyShortDesc | String | Short tagline or summary description. | Enterprise cloud infrastructure | 84% |
| companyLogoUrl | String | URL to company logo image. | https://media.licdn.com/dms/image/... | 87% |
| companyLocations | Array(String) | Array of all office locations. | ["San Francisco, CA", "New York, NY", "Austin, TX"] | 100% |
| companyAffiliatedPages | String | Affiliated company pages or subsidiaries. | Acme Labs, Acme Cloud | 28% |
| companyType | String | Organization type (Public, Private, Nonprofit, etc.). | Privately Held | 49% |
| companyEmployeeCount | UInt32 | Exact employee count when available. | 3500 | 82% |
| companySpecialties | String | Comma-separated list of company specialties. | Cloud Computing, AI, DevOps | 26% |
| companyFollowerCount | UInt32 | Number of followers on the profile platform. | 85000 | 80% |
| companyEmployees | String | JSON blob of featured employee profiles. | [{"name": "Jane Smith", "title": "CEO"}] | 75% |
| companySimilarPages | String | Similar company profiles suggested by LinkedIn. | Globex Corp, Initech | 66% |
| companyUpdates | String | Recent company posts or updates from the profile. | [{"text": "We're hiring!", "date": "2025-11-01"}] | 39% |
| companyDesc | String | Full company description / about text. | Acme Corporation is a leading provider of... | 14% |
| companyCeo | String | Name of the company CEO or top executive. | Jane Smith | 1% |
| companyRevenue | String | Revenue range bucket. | $1B-$5B | 6% |
| companyIndustryUrl | String | URL slug for the industry category on source platform. | /cmp/_industry/technology | 52% |
| companyRating | Float64 | Average employer rating (1.0-5.0 scale). | 4.2 | 29% |
| companyReviewCount | UInt32 | Total number of employer reviews. | 1250 | 29% |
| dbInsertTimestamp | DateTime | Timestamp when the record was inserted into ClickHouse. | 2025-11-01T12:00:00Z | 100% |
| firstScrapedTimestamp | DateTime | First time this company profile was scraped. | 2024-06-15T08:00:00Z | 100% |
| lastScrapedTimestamp | DateTime | Most recent scrape of this company profile. | 2025-11-15T14:00:00Z | 100% |
| lastModifiedTimestamp | DateTime | Last time any field in this record was updated. | 2025-11-10T09:30:00Z | 100% |
| companyScrapeStatus | String | Current scraping status (active, archived, error). | active | >99% |
| src | String | Source platform identifier. | | 100% |
| raw | JSON | Raw JSON blob from the original scrape for debugging. | {"raw_html": "..."} | 100% |
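
Several company fields in the table above are stored as `String` but carry structured content: `companySpecialties` is a comma-separated list and `companyEmployees` is a JSON blob. A minimal sketch of unpacking them follows; the helper names are hypothetical, and the JSON shape is assumed from the example value only.

```python
import json
from typing import Any, Dict, List


def split_specialties(value: str) -> List[str]:
    """Split a comma-separated companySpecialties string into clean items."""
    return [part.strip() for part in value.split(",") if part.strip()]


def parse_employees(blob: str) -> List[Dict[str, Any]]:
    """Parse the companyEmployees JSON blob; return [] if empty or malformed."""
    if not blob:
        return []
    try:
        data = json.loads(blob)
    except json.JSONDecodeError:
        return []
    # Assumed shape: a JSON array of objects, as in the example value.
    return data if isinstance(data, list) else []


print(split_specialties("Cloud Computing, AI, DevOps"))
print(parse_employees('[{"name": "Jane Smith", "title": "CEO"}]')[0]["title"])
```

Returning an empty list on bad input keeps downstream code simple for fields with partial coverage.
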

| Field Name | Type | Description | Example | Coverage |
|---|---|---|---|---|
| id | String | Primary key. Unique place identifier. | 0x808fcb5... | 100% |
| cid | String | Google CID (customer ID) for the place. | 12345678901234567 | 100% |
| title | String | Business name as displayed on Google Maps. | Starbucks | 100% |
| unique_key | String | Deduplication key derived from name + location. | starbucks_sf_94105 | 100% |
| data_id | String | Google internal data identifier for the listing. | 0x808fcb5:0x1234abcd | 100% |
| address | String | Short-form address as shown on the listing. | 123 Market St | ~95% |
| complete_address | String | Full formatted address including city, state, zip. | 123 Market St, San Francisco, CA 94105 | ~90% |
| street | String | Street name and number. | 123 Market St | ~85% |
| city | String | City name. | San Francisco | ~95% |
| state | String | State or province abbreviation. | CA | ~95% |
| postal_code | String | ZIP or postal code. | 94105 | ~85% |
| country | String | Country code. | US | ~98% |
| latitude | Float64 | Latitude coordinate from Google Maps. | 37.7749 | ~99% |
| longitude | Float64 | Longitude coordinate from Google Maps. | -122.4194 | ~99% |
| plus_code | String | Google Plus Code for precise location. | 849VCWC8+R9 | ~80% |
| timezone | String | IANA timezone for the location. | America/Los_Angeles | ~75% |
| category | String | Primary Google Maps business category. | Coffee shop | ~95% |
| categories | Array(String) | All business categories assigned by Google. | ["Coffee shop", "Cafe", "Breakfast restaurant"] | ~90% |
| status | String | Operational status of the business. | Operational | ~85% |
| description | String | Business description from the listing. | Premium coffee and handcrafted beverages... | ~50% |
| about | String | About section with structured attributes (JSON). | {"Service options": ["Dine-in", "Takeout"]} | ~45% |
| price_range | String | Price level indicator. | $$ | ~60% |
| owner | String | Business owner or operator name. | Starbucks Corporation | ~40% |
| review_count | UInt32 | Total number of Google reviews. | 342 | ~90% |
| review_rating | Float32 | Average star rating (1.0-5.0). | 4.3 | ~90% |
| reviews_link | String | Direct link to the Google reviews page. | https://search.google.com/local/reviews?placeid=... | ~85% |
| reviews_per_rating | String | JSON breakdown of reviews by star count. | {"5": 180, "4": 90, "3": 40, "2": 20, "1": 12} | ~80% |
| popular_times | String | Hourly busyness data by day of week (JSON). | {"Monday": [{"hour": 8, "busyness": 30}, ...]} | ~35% |
| user_reviews | String | Sample of recent user reviews (JSON array). | [{"rating": 5, "text": "Great coffee!"}] | ~70% |
| user_reviews_extended | String | Extended review data with metadata (author, date, response). | [{"author": "John", "date": "2025-10", ...}] | ~50% |
| phone | String | Business phone number. | +1 (415) 555-0123 | ~75% |
| web_site | String | Business website URL. | https://www.starbucks.com/store-locator/... | ~65% |
| emails | Array(String) | Email addresses found on listing or website. | ["info@business.com"] | ~20% |
| thumbnail | String | URL to the listing thumbnail image. | https://lh5.googleusercontent.com/p/... | ~85% |
| images | String | JSON array of photo URLs from the listing. | ["https://lh5.googleusercontent.com/p/..."] | ~75% |
| url | String | Google Maps URL for this listing. | https://www.google.com/maps/place/... | ~98% |
| link | String | Short Google Maps link. | https://maps.google.com/?cid=12345... | ~98% |
| reservations | String | Reservation links or availability. | https://www.opentable.com/... | ~15% |
| order_online | String | Online ordering links. | https://order.starbucks.com/... | ~25% |
| menu | String | Menu link or structured menu data. | https://www.starbucks.com/menu | ~30% |
| open_hours | String | Operating hours by day of week (JSON). | {"Monday": "6:00 AM - 8:00 PM", ...} | ~70% |
| detailed_scraped_at | DateTime | Timestamp of the detailed scrape. | 2025-11-01T12:00:00Z | 100% |
| ch_insertion_time | DateTime64 | ClickHouse insertion timestamp. | 2025-11-01T12:05:00Z | 100% |
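
The `reviews_per_rating` field above is a JSON string keyed by star count. As a rough sketch (the function name is hypothetical, and the JSON shape is assumed from the example value), it can be reduced to a total review count and a count-weighted mean rating:

```python
import json
from typing import Optional, Tuple


def summarize_ratings(reviews_per_rating: str) -> Tuple[int, Optional[float]]:
    """Return (total reviews, weighted mean rating) from the JSON breakdown."""
    raw = json.loads(reviews_per_rating)
    counts = {int(star): int(n) for star, n in raw.items()}
    total = sum(counts.values())
    if total == 0:
        return 0, None
    mean = sum(star * n for star, n in counts.items()) / total
    return total, round(mean, 2)


total, mean = summarize_ratings('{"5": 180, "4": 90, "3": 40, "2": 20, "1": 12}')
print(total, mean)  # 342 4.19
```

A mean computed this way need not equal `review_rating` exactly, since the displayed rating comes from the listing itself.
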

| Field Name | Type | Description | Example | Coverage |
|---|---|---|---|---|
| parsedAnnualSalaryMin | Float32 | Lower bound of stated salary, normalized to annual USD. | 90000.0 | ~35% (stated) |
| parsedAnnualSalaryAvg | Float32 | Midpoint (min+max)/2 in annual USD. | 105000.0 | ~35% (stated) |
| parsedAnnualSalaryMax | Float32 | Upper bound of stated salary, normalized to annual USD. | 120000.0 | ~35% (stated) |
| nlpSalary | Float32 | AI-predicted annual salary trained on 50M+ observations (MAPE <15%). | 108500.0 | 85-95% (2023+) |
| scrapedSalary | String | Raw salary text verbatim from the posting. | $90,000 - $120,000/yr | ~35% |
| company_code | String | Glassdoor internal company identifier. | E1234 | 100% |
| company_name | String | Company name as listed on Glassdoor. | | 100% |
| job_title | String | Job title for this salary submission. | Software Engineer | 100% |
| location_raw | String | Raw location string from the salary submission. | San Francisco, CA | ~95% |
| city | String | Parsed city name. | San Francisco | ~90% |
| state | String | Parsed state abbreviation. | CA | ~90% |
| metro | String | Metro area designation. | San Francisco-Oakland-Berkeley, CA | ~85% |
| country | String | Country code. | US | ~98% |
| submitted_date | String | Date the salary was submitted by the employee. | 2025-03-15 | ~95% |
| scrape_time | DateTime64 | Timestamp when this record was scraped. | 2025-11-01T08:00:00Z | 100% |
| years_of_exp | String | Years of experience reported by the submitter. | 5-7 | ~60% |
| pay_json | String | Full pay breakdown as JSON (base, bonus, stock, etc.). | {"base": 150000, "bonus": 20000, "stock": 50000} | ~90% |
| total_pay_raw | String | Total compensation as reported (text). | $220,000 | ~85% |
| base_additional_raw | String | Base + additional pay text from Glassdoor. | $150K base + $70K additional | ~80% |
| base_pay | Float64 | Parsed base pay in USD. | 150000.0 | ~90% |
| additional_pay | Float64 | Additional pay (bonus, stock, commission) in USD. | 70000.0 | ~75% |
| anonymity_min | Float64 | Glassdoor anonymity range lower bound. | 140000.0 | ~80% |
| anonymity_max | Float64 | Glassdoor anonymity range upper bound. | 160000.0 | ~80% |
| pay_period | String | Pay frequency (Annual, Monthly, Hourly). | Annual | ~95% |
| currency_code | String | ISO currency code. | USD | ~98% |
| salary_min_annual | Float64 | Minimum annual salary (normalized from pay_period). | 140000.0 | ~85% |
| salary_max_annual | Float64 | Maximum annual salary (normalized from pay_period). | 180000.0 | ~85% |
| salary_avg_annual | Float64 | Average annual salary (midpoint of min/max). | 160000.0 | ~85% |
| source | String | Data source identifier. | glassdoor | 100% |
| source_file_type | String | File format of the source data. | json | 100% |
| source_file | String | Original source file path for traceability. | glassdoor_salaries_2025_q4.json | 100% |
| salary_detailed_url | String | URL to the Glassdoor salary detail page. | https://glassdoor.com/Salary/Google-Software-Engineer-... | ~90% |
| submitted_count | UInt32 | Total salary submissions for this job title. | 1250 | ~95% |
| submitted_count_company | UInt32 | Salary submissions for this title at this company. | 85 | ~90% |
| salary_general_url | String | URL to Glassdoor general salary page for this title. | https://glassdoor.com/Salaries/software-engineer-salary-... | ~95% |
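
The annual salary fields above are described as normalized from `pay_period`, with the average defined as the midpoint of min and max. The sketch below illustrates that arithmetic. The multipliers (12 months, 2,080 hours per year) are assumptions for illustration, not the documented conversion, and the function names are hypothetical.

```python
from typing import Optional

# Assumed conversion factors; the table does not state the actual ones.
PERIODS_PER_YEAR = {"annual": 1, "monthly": 12, "hourly": 2080}


def to_annual(amount: float, pay_period: str) -> Optional[float]:
    """Scale an amount to an annual figure; None for an unknown pay period."""
    factor = PERIODS_PER_YEAR.get(pay_period.strip().lower())
    return None if factor is None else amount * factor


def midpoint(low: float, high: float) -> float:
    """Midpoint (min + max) / 2, as used for the *Avg fields."""
    return (low + high) / 2


print(to_annual(50.0, "Hourly"))       # 104000.0
print(midpoint(90000.0, 120000.0))     # 105000.0
print(midpoint(140000.0, 180000.0))    # 160000.0
```

The two midpoints reproduce the example values shown for `parsedAnnualSalaryAvg` and `salary_avg_annual`.
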

| Field Name | Type | Description | Example | Coverage |
|---|---|---|---|---|
| nlpSkills | Array(String) | Technical/hard skills extracted from job description. | ["Python", "SQL", "AWS", "Docker"] | 80-93% |
| nlpSoftSkills | Array(String) | Interpersonal and behavioral skills. | ["Communication", "Leadership", "Teamwork"] | 70-85% |
| nlpCertifications | Array(String) | Professional certifications required or preferred. | ["PMP", "AWS Solutions Architect", "CPA"] | ~30% |
| nlpQualifications | Array(String) | Other qualification requirements extracted from text. | ["5+ years experience", "US citizen"] | ~50% |
| nlpExperienceRequirements | Array(String) | Years and type of experience requirements. | ["3-5 years software development"] | ~45% |
| nlpDegreeLevels | Array(String) | Education degrees mentioned in posting. | ["Bachelor's", "Master's"] | ~60% |
| nlpDegreeLevelMin | String | Minimum acceptable degree for the position. | Bachelor's | ~55% |
| nlpBenefits | Array(String) | Benefits extracted from description text. | ["401k", "Health Insurance", "PTO"] | ~65% |
| nlpSocCode | String | Standard Occupational Classification code (6-digit, BLS 2018). | 15-1252 | 85-90% |
| nlpSocTitle | String | Official BLS occupation title for the SOC code. | Software Developers | 85-90% |
| nlpSeniority | String | ML-classified seniority level. | Senior | 85-95% |
| nlpEmployment | String | ML-classified employment type. | Full-time | 85-90% |
| nlpRemote | String | ML-classified work mode (Remote, Hybrid, On-site). | Remote | 85%+ (2023+) |
| nlpNormalizedTitle | String | Standardized job title via NLP normalization. | Software Engineer | ~100% |
| nlpNormalizedTitleScore | Float32 | Title normalization confidence score (0.0-1.0). | 0.94 | ~100% |
| nlpIsManagerialRole | Boolean | Is this a people management position? | true | ~90% |
| nlpIsUrgentHiring | Boolean | Time-sensitive hire signal (market demand indicator). | true | ~5% |
| nlpNumberOfOpenings | Array(String) | How many positions available (hiring volume signal). | ["3"] | ~10% |
| nlpTeamSizes | Array(String) | Size of team this role joins. | ["15-20"] | ~8% |
| nlpExpectedStartDates | Array(String) | When role is expected to start. | ["2025-01-15"] | ~5% |
| nlpOffersVisaSponsorship | Boolean | Does employer offer visa sponsorship? | true | ~15% |
| nlpRequiresClearance | Boolean | Does job require security clearance? | true | ~8% |
| nlpClearanceLevels | Array(String) | Specific clearance levels required. | ["Secret", "Top Secret"] | ~5% |
| nlpCitizenshipRequired | Array(String) | Citizenship/authorization requirements. | ["US Citizen", "Green Card"] | ~10% |
| nlpOffersEquity | Boolean | Does compensation include equity/stock? | true | ~12% |
| nlpRequiresTravel | Boolean | Does the job require travel? | true | ~20% |
| nlpTravelPercentages | Array(String) | Quantified travel requirements. | ["25%", "50%"] | ~10% |
| nlpIsShiftWork | Boolean | Is this a shift-based position? | false | ~8% |
| nlpShiftTypes | Array(String) | Specific shift types when applicable. | ["Night", "Weekend", "Rotating"] | ~5% |
| nlpLanguagesRequired | Array(String) | Non-English language requirements. | ["Spanish", "Mandarin"] | ~8% |
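
Some fields in the table above hold numeric information as `Array(String)`, for example `nlpTravelPercentages` (`["25%", "50%"]`) and `nlpNumberOfOpenings` (`["3"]`). A minimal sketch of converting them to numbers follows. The helper names are hypothetical, and the value formats are assumed from the example values only.

```python
import re
from typing import List, Optional


def parse_percentages(values: List[str]) -> List[float]:
    """Convert strings such as "25%" to floats, skipping anything else."""
    result = []
    for value in values:
        match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*%\s*", value)
        if match:
            result.append(float(match.group(1)))
    return result


def first_int(values: List[str]) -> Optional[int]:
    """Return the first integer found in the array, or None if there is none."""
    for value in values:
        match = re.search(r"\d+", value)
        if match:
            return int(match.group())
    return None


print(max(parse_percentages(["25%", "50%"])))  # 50.0
print(first_int(["3"]))                        # 3
```

For range-like values such as `["15-20"]`, `first_int` returns the lower bound; whether that is the right reading depends on the use case.
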