Careers

Data Scientist Intern

Help keep our datasets trustworthy at web scale. You'll write SQL to profile historical data, build Python checks for anomalies and regressions, define data quality KPIs, and surface results via lightweight reports. You'll partner with Data Engineering to validate ETL, document data contracts, and triage issues, using version-controlled QA assets and repeatable workflows. You're a BS/MS student with solid SQL and Python, Git fluency, and a systematic, detail-oriented mindset.

About Canaria

We are a technology product startup transforming the job market and career personalization space. Our advanced data mining, computing optimizations, and state-of-the-art Natural Language Processing (NLP) techniques, including transformer- and LLM-based architectures, enable large-scale processing of job market data*. We help job seekers find the jobs that best fit their skills and experience, and identify the skill and credential gaps between them and their dream jobs.

*Our database is already more than 100 times the size of the entire Wikipedia corpus, and we are targeting an ambitious further scaling of 10 to 100 times this year.

📍 Location: Remote

Responsibilities

  • Write and maintain SQL, MongoDB, and ClickHouse queries to analyze large datasets for anomalies, inconsistencies, duplicates, and nonstandard values.
  • Investigate and document data issues across ingestion and transformation layers; create JIRA tickets with clear evidence and reproduction steps.
  • Collaborate with NLP/ML and Data Engineering teams to help design parsers, pre/post-processors, or rules to fix identified issues.
  • Develop unit tests and regression checks to ensure data quality and consistency after updates or pipeline changes.
  • Build lightweight Python scripts or notebooks to automate data validation and summarize findings (a minimal sketch of one such check follows this list).
  • Track and report on key data quality metrics (null rates, drift, outliers, schema mismatches).
  • Maintain structured documentation of data issues, investigation workflows, and QA playbooks.
  • Support large-scale reprocessing QA (e.g. schema changes, backfills, or version migrations).
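To give a flavor of the work, here is a minimal sketch of the kind of validation script the bullets above describe, assuming a pandas DataFrame of postings; the column names, expected schema, and thresholds are illustrative, not our actual data contracts.

```python
import pandas as pd

# Hypothetical expected schema; real expectations come from documented data contracts.
EXPECTED_SCHEMA = {"job_id": "object", "title": "object", "salary_min": "float64"}

def profile_quality(df: pd.DataFrame) -> dict:
    """Summarize null rates, duplicate keys, schema mismatches, and crude outliers."""
    report = {}
    # Null rate per column (fraction of missing values).
    report["null_rates"] = df.isna().mean().round(4).to_dict()
    # Duplicates on the assumed primary key.
    report["duplicate_job_ids"] = int(df["job_id"].duplicated().sum())
    # Columns whose dtype deviates from the expected schema.
    report["schema_mismatches"] = {
        col: str(df[col].dtype)
        for col, expected in EXPECTED_SCHEMA.items()
        if col in df.columns and str(df[col].dtype) != expected
    }
    # Crude outlier count: values more than 3 standard deviations from the mean.
    s = df["salary_min"].dropna()
    report["salary_outliers"] = int(((s - s.mean()).abs() > 3 * s.std()).sum())
    return report
```

A summary like this can feed the lightweight reports and quality-metric tracking mentioned above, for example by dumping it to JSON after each pipeline run.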

Qualifications

  • Pursuing a Bachelor’s or Master’s in Computer Science, Data Science, Engineering, Statistics, or a related field.
  • Strong proficiency with SQL (joins, window functions, CTEs, aggregations) and basic experience with MongoDB or ClickHouse (see the query sketch after this list).
  • Familiarity with Python (pandas, typing, modular scripting) for analysis and data quality checks.
  • Solid understanding of data validation concepts: completeness, deduplication, anomaly detection, and schema enforcement.
  • Comfortable working with Git and collaborative workflows (PRs, reviews).
  • Excellent written communication skills and a detail-oriented, investigative mindset.
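For a sense of the SQL level we mean, here is a sketch of a windowed dedup check, run against an in-memory SQLite table so it is self-contained; the `postings` table and its columns are made up for illustration.

```python
import sqlite3

# CTE + window function: keep the newest copy of each job_id, flag the rest.
DEDUP_SQL = """
WITH ranked AS (
    SELECT job_id, source, scraped_at,
           ROW_NUMBER() OVER (
               PARTITION BY job_id ORDER BY scraped_at DESC
           ) AS rn
    FROM postings
)
SELECT job_id, source, scraped_at
FROM ranked
WHERE rn > 1;  -- every row past the newest copy is a duplicate
"""

conn = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for window functions
conn.executescript("""
    CREATE TABLE postings (job_id TEXT, source TEXT, scraped_at TEXT);
    INSERT INTO postings VALUES
        ('a1', 'board_x', '2024-01-01'),
        ('a1', 'board_y', '2024-01-03'),
        ('b2', 'board_x', '2024-01-02');
""")
print(conn.execute(DEDUP_SQL).fetchall())  # -> [('a1', 'board_x', '2024-01-01')]
```

The same pattern carries over to ClickHouse's SQL dialect; MongoDB equivalents would use aggregation pipelines instead.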

Nice to Have

  • Experience with data warehouses (BigQuery, Snowflake, Redshift) or relational DBs (PostgreSQL/MySQL).
  • Exposure to Airflow, DBT, or ETL/ELT frameworks.
  • Familiarity with data QA/testing tools or statistical drift detection (a drift-metric sketch follows this list).
  • Basic understanding of NLP pipelines or large-scale text data processing.
  • Experience building dashboards with open-source tools (e.g. Grafana, Plotly, or Streamlit) for pipeline monitoring, anomaly tracking, or lightweight reporting.
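On the drift-detection point, one widely used statistic is the Population Stability Index (PSI). The sketch below is a generic numpy implementation; the bin count and the informal 0.2 "investigate" threshold are common conventions, not a house standard.

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two numeric samples."""
    # Quantile bin edges computed on the baseline sample.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))

    def proportions(x: np.ndarray) -> np.ndarray:
        # Bin values against the baseline's quantile edges; anything outside
        # the baseline range is clipped into the first or last bin.
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        return np.bincount(idx, minlength=bins) / len(x)

    p = np.clip(proportions(baseline), 1e-6, None)  # avoid log(0)
    q = np.clip(proportions(current), 1e-6, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
shifted = rng.normal(1.0, 1.0, 10_000)  # distribution has drifted
print(psi(baseline, shifted))           # well above 0.2, flagging drift
```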

What You Will Learn

  • How to perform data QA at web scale, investigating billions of rows efficiently.
  • How to collaborate across Data Engineering, ML, and Product teams to improve data quality for downstream AI systems.
  • How to translate raw findings into actionable tickets and fixes that improve real-world model performance.
  • How large-scale NLP and ML systems depend on structured, validated, and traceable data flows.

📩 Ready to make an impact?

Apply now by sending your resume to recruitment@decanaria.com.