Raw Postings Are a Commodity. Enrichment Is the Product.
Every day, tens of thousands of new job postings appear across Indeed, LinkedIn, Greenhouse, Lever, and hundreds of thousands of employer career pages. The titles alone are a mess. "SW Eng II," "Software Developer (Remote)," "Sr. Software Engineer - Backend," and "Programmer Analyst" might all describe the same role. Multiply that by 400 million unique postings, each needing standardized titles, occupation codes, seniority levels, and predicted salaries, and the scale of the problem comes into focus.
We solve this with what we call the Model Garden: a suite of independent, purpose-built ML models that each handle one enrichment task across our entire corpus of 907 million deduplicated job records. This post walks through each model, what it does, and how it performs at scale.
Why a Garden, Not a Monolith
Each model is independently trained, deployed, and versioned. When we improve our SOC classifier, we do not risk regressing our seniority model. When we add a new capability (NAICS prediction is next), we plant a new model without touching existing ones.
This matters operationally. A monolithic enrichment pipeline means one bad deployment can degrade every downstream field. With independent models, a regression in title normalization does not corrupt skills extraction or salary prediction. Each model has its own accuracy metrics, its own rollback path, and its own retraining schedule.
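The garden contract above can be sketched in a few lines. This is an illustrative sketch only, not our production code: the registry layout, model names, and versions are hypothetical, but it shows the key property that each model is an independent, versioned, stateless callable whose failure cannot corrupt any other field.

```python
# Sketch of the "garden" contract. Every name here is illustrative.
# field name -> (model version, stateless enrichment function)
GARDEN = {
    "normalized_title": (
        "title-norm-v2.1",
        lambda rec: rec.get("raw_title", "").title(),
    ),
    "seniority": (
        "seniority-v1.0",
        lambda rec: "Senior" if "sr" in rec.get("raw_title", "").lower() else "Mid",
    ),
}

def enrich(record: dict) -> dict:
    """Apply every model independently; one model's failure is isolated."""
    out = dict(record)
    for field, (version, model) in GARDEN.items():
        try:
            out[field] = model(record)
        except Exception:
            out[field] = None  # failure is recorded per field, per model version
    return out
```

Because each entry carries its own version string, a rollback is a one-line change to the registry rather than a redeploy of the whole pipeline.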
Our latest version, model-garden-v1.0.0, handles 69.2 million records with zero null rates on seniority and employment fields. The legacy from-nlp-extract pipeline covers 424.7 million records from historical backfill. Together with our NER extraction layer, they enrich 82 fields per record.
Title Normalization
Job titles are the noisiest field in any job market dataset. Employers invent titles freely: "Ninja Developer," "Customer Happiness Hero," "VP of Vibes." Our title normalization model maps every raw title to a standardized canonical form.
From 121.6 million titles processed, the confidence distribution tells the story:
| Confidence Band | Share of Titles | What It Means |
|---|---|---|
| High (0.9 - 1.0) | 6.29% | Near-exact matches to known canonical forms |
| Medium (0.7 - 0.9) | 78.05% | The bulk of mappings, where the model does real work |
| Low (0.5 - 0.7) | 15.65% | Titles requiring more interpretation |
| Very low (below 0.5) | 0.01% | 15,718 titles, almost entirely non-English inputs |
The result is 100% title normalization coverage across 907.5 million records. Every record in our final delivery table has a normalized title. One known gap: those 15,718 very-low-confidence titles are mostly non-English (e.g., Polish titles mapped to unrelated English roles at confidence 0.38 to 0.42). We are adding language detection as a pre-filter to suppress these false mappings.
If you are working with normalized titles downstream, the confidence score lets you set your own quality threshold. A labor economist might filter to 0.7+; a search engine might accept 0.5+. See our glossary for details on how normalization relates to SOC classification.
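A minimal sketch of that downstream filtering, assuming hypothetical field names (`raw_title`, `normalized_title`, `confidence`); the actual delivery schema carries 82 fields per record:

```python
# Hypothetical records; field names are illustrative, not the delivery schema.
records = [
    {"raw_title": "SW Eng II", "normalized_title": "Software Engineer", "confidence": 0.94},
    {"raw_title": "Customer Happiness Hero", "normalized_title": "Customer Support Specialist", "confidence": 0.61},
    {"raw_title": "Kierownik Projektu", "normalized_title": "Project Manager", "confidence": 0.39},
]

def filter_by_confidence(rows, threshold):
    """Keep only records whose normalized title meets the caller's quality bar."""
    return [r for r in rows if r["confidence"] >= threshold]

strict = filter_by_confidence(records, 0.7)   # e.g. a labor economist's bar
lenient = filter_by_confidence(records, 0.5)  # e.g. a search engine's bar
```

The point is that coverage is never sacrificed upstream: every record keeps its mapping and its score, and each consumer draws the quality line where their use case demands.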
SOC Classification
The Standard Occupational Classification system is the lingua franca of labor economics. Every government statistic and compensation benchmark uses 6-digit SOC codes. Our classifier assigns codes using both the job title and description as context, not title alone.
Across 493.9 million NLP-processed records, we achieve a 0% null rate on SOC codes. The top codes span software developers (15-1252), registered nurses (29-1141), retail salespersons (41-2031), general managers (11-1021), and truck drivers (53-3032).
Industry benchmarks for 6-digit SOC accuracy range from 85% to 92% for descriptions longer than 200 words. Our model performs within this range. At the 2-digit major group level, accuracy exceeds 95%. Description length matters enormously: short postings under 100 words produce lower-confidence classifications, while detailed postings with bulleted requirements give the model rich signal. This is one reason why ATS feeds tend to produce better enrichment than aggregator sources, since their descriptions are longer and more structured.
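The accuracy difference between the 6-digit and 2-digit levels falls out of the SOC code's structure: the two digits before the hyphen are the major group. A small sketch of that rollup, using the top codes listed above ("15" is Computer and Mathematical Occupations, "29" is Healthcare Practitioners and Technical Occupations):

```python
def soc_major_group(soc_code: str) -> str:
    """Roll a 6-digit SOC code (e.g. '15-1252') up to its 2-digit major group."""
    return soc_code.split("-")[0]

# The top detailed occupations from our corpus.
codes = ["15-1252", "29-1141", "41-2031", "11-1021", "53-3032"]
groups = {code: soc_major_group(code) for code in codes}
```

This is why a consumer who only needs broad occupational buckets can work at 95%+ accuracy even when the full 6-digit assignment is less certain.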
Seniority and Employment Type
Our seniority model classifies every job into Entry, Mid, Senior, Lead, and Executive levels. What distinguishes it: it always returns a classification. The model-garden-v1.0.0 achieves 0% null on seniority across all 69.2 million records processed. The legacy pipeline shows 15.6% null on its 424.7 million records, a gap we are actively backfilling.
Employment type classification (full-time, part-time, contract, temporary, internship) follows the same pattern. The current model achieves 0% null. The legacy pipeline has a 16.9% null rate. The gap between 0% and 16.9% illustrates why model versioning matters: you can measure improvement precisely and track regressions at the source level.
Remote Work Detection
Remote work classification is our most historically complex enrichment. The field did not exist before 2020. Coverage by vintage:
- Pre-2020: Null is expected and correct. Virtually all jobs were implicitly on-site.
- 2020 to 2022: Structured fields emerged. Coverage climbed from near-zero to between 40% and 70%.
- 2023 and beyond: Work mode is standard. Our current NLP pipeline shows only 4 null values out of 493.9 million processed records.
This vintage sensitivity applies across the pipeline. Fields that look broken in 2018 data are perfectly normal. Fields that are null in 2024 data represent real bugs. We document these patterns in detail in our methodology.
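A vintage-aware data-quality check captures this rule. The sketch below is illustrative (the 2023 cutoff follows the vintages above; a production check would be per-field and per-source): a null remote flag is only flagged as a defect for postings from the era when work-mode fields were standard.

```python
def remote_null_is_expected(posting_year: int, remote_value) -> bool:
    """Vintage-aware check: a null remote-work flag is only a data-quality
    problem for postings from the structured-field era (2023+)."""
    if remote_value is not None:
        return True  # a populated field is always acceptable
    # Nulls are expected before work-mode fields became standard.
    return posting_year < 2023

assert remote_null_is_expected(2018, None)       # pre-2020: null is correct
assert not remote_null_is_expected(2024, None)   # 2024 null: a real bug
```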
Salary Prediction
Stated salary coverage has improved since US transparency laws took effect, but even in 2023+ data, only 20 to 30% of postings include a stated salary. Our prediction model fills the gap. Trained on 50+ million salary observations from Glassdoor and Indeed, it predicts annual compensation given a job's state, zip code, and SOC code.
Observed MAPE is under 15%, compared to the 15 to 25% range reported in most published academic models. The model returns -1 when prerequisites are missing. We would rather return nothing than return noise.
The prerequisite requirement (valid state, zip code, and SOC code) means salary prediction coverage varies by how well upstream location parsing and SOC classification perform. This is another argument for the garden architecture: improvements in SOC classification directly improve salary prediction coverage. For a full breakdown of salary coverage by year and state, see our datasets page.
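The prerequisite-or-sentinel behavior can be sketched as follows. The function and the toy model are hypothetical stand-ins, not our actual interface; the point is the contract: when any of state, zip code, or SOC code is missing, the caller gets -1 instead of a low-quality guess.

```python
SENTINEL = -1  # the model's "prerequisites missing" signal

def predict_salary(state, zip_code, soc_code, model):
    """Return a predicted annual salary, or -1 if any prerequisite is missing."""
    if not (state and zip_code and soc_code):
        return SENTINEL  # refuse to guess: nothing is better than noise
    return model(state, zip_code, soc_code)

# Toy stand-in for the trained model.
def toy_model(state, zip_code, soc_code):
    return 110_000 if soc_code == "15-1252" else 60_000
```

Downstream consumers should treat -1 as "no prediction", never as a dollar amount.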
The NER Layer: Skills, Certifications, and Benefits
Alongside classification models, we run dictionary-based Named Entity Recognition using the Aho-Corasick algorithm, followed by an NLP relevance filter. From 599.3 million processed records:
- 93.3% have at least one extracted skill
- Coverage scales with description length: 47.5% for postings under 50 words, 99.5% for 500+ words
- Average word count across processed descriptions is 486 words
- Our taxonomy covers 37,000+ technical skills, 3,000+ certifications, and 400+ soft skills
The relevance filter is critical. Without it, a barista posting that mentions "experience with Java" (the coffee) would incorrectly receive "Java" (the programming language) as an extracted skill. The filter uses a title-skill relevance model to suppress these spurious matches, which is especially important for high-frequency terms that appear in non-technical contexts.
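A minimal sketch of the two-stage design. In production the matching stage uses Aho-Corasick for linear-time multi-pattern search; here a simple substring scan stands in, and the tiny dictionary and relevance table are illustrative, not our taxonomy or our relevance model:

```python
# Stage 1 dictionary (illustrative subset of a 37,000+ skill taxonomy).
SKILL_DICTIONARY = {"java", "python", "sql"}

# Stage 2: toy title-skill relevance table. Unlisted pairs default to relevant.
RELEVANT = {
    ("Software Engineer", "java"): True,
    ("Barista", "java"): False,  # "Java" in a barista posting means the coffee
}

def extract_skills(normalized_title: str, description: str):
    """Dictionary match, then suppress matches irrelevant to the job title."""
    text = description.lower()
    matched = [s for s in sorted(SKILL_DICTIONARY) if s in text]
    return [s for s in matched if RELEVANT.get((normalized_title, s), True)]
```

Separating the cheap exhaustive matcher from the relevance model keeps the first stage fast and the second stage retrainable on its own schedule, consistent with the garden architecture.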
For more on how skills extraction performs across different source types and description lengths, see our comparison tool.
How It All Comes Together
The Model Garden's output flows through our 10-stage pipeline and culminates in 907.5 million records with 82 fields each. Every record carries a normalized title, a 6-digit SOC code, seniority, employment type, remote work status, predicted salary (when prerequisites are available), and extracted skills.
We are currently developing NAICS (industry classification) prediction as the next model in the garden. The modular architecture makes this straightforward: each model is independent, stateless, and horizontally scalable.
To see how enriched records compare to raw postings, try our comparison tool. If you want to evaluate the data for a specific use case, request a sample filtered by source, region, or occupation.
Want to see the data for yourself?
Get a free sample of 5,000 enriched job records, delivered within 24 hours.