Data engineering is where bad decisions compound silently. These questions check whether a candidate designs reliable, scalable pipelines and thinks about data quality.
Hiring a data engineer is easy. Telling a real one from a convincing résumé is the hard part, and it’s most of what we do. The questions below are grouped by level, because the same question that stretches a junior is a warm-up for a senior.
Junior Data Engineering interview questions
0–2 years
Fundamentals.
What is the difference between ETL and ELT?
ETL transforms data before loading it; ELT loads raw data first and then transforms it inside the warehouse, using the warehouse’s own compute. ELT is the common pattern with modern cloud warehouses.
Cannot explain why ELT became popular.
What is the difference between a data warehouse and a data lake?
A warehouse stores structured, query-optimised data; a lake stores raw data of any format cheaply for later processing.
Uses the terms interchangeably.
What is a data pipeline?
An automated flow moving and transforming data from sources to destinations, ideally reliable and idempotent.
Copies data manually with ad-hoc scripts.
What is the difference between batch and streaming processing?
Batch processes data in scheduled chunks; streaming processes events continuously with low latency.
Picks one without considering latency needs.
What is a primary key and why does it matter in data?
A unique identifier enabling deduplication, joins and incremental processing.
Loads data with no unique key and creates duplicates.
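To make the deduplication point concrete, here is a minimal sketch in Python. The `order_id` key and the row shape are assumptions chosen for illustration; the idea is only that a unique key lets a loader decide which copy of a record wins.

```python
def dedupe_by_key(rows, key):
    """Keep the last-seen row for each key value."""
    latest = {}
    for row in rows:
        latest[row[key]] = row  # a later row with the same key overwrites the earlier one
    return list(latest.values())


# Hypothetical input: order 1 appears twice because its status changed.
rows = [
    {"order_id": 1, "status": "placed"},
    {"order_id": 2, "status": "placed"},
    {"order_id": 1, "status": "shipped"},
]
deduped = dedupe_by_key(rows, "order_id")
```

Without a key there is no principled way to tell the two versions of order 1 apart from two genuinely different orders.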
What is schema-on-read vs schema-on-write?
Schema-on-write enforces structure at load time; schema-on-read applies it when querying raw data.
Doesn’t know the difference or its implications.
What is data normalisation vs denormalisation in analytics?
Normalised models reduce redundancy; analytical models often denormalise (star schema) for query performance.
Applies OLTP normalisation to an analytics workload.
Why is idempotency important in pipelines?
So that re-running a job, for example after a failure, doesn’t duplicate or corrupt data.
Reruns a job and doubles the data.
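One common way to get idempotent loads is to write by primary key so a retry replaces rows instead of appending them. The sketch below uses Python's built-in `sqlite3` as a stand-in for a warehouse; the `events` table and its columns are assumptions for illustration.

```python
import sqlite3


def load_batch(conn, rows):
    """Load rows keyed by event_id; re-running with the same rows changes nothing."""
    conn.executemany(
        "INSERT OR REPLACE INTO events (event_id, amount) VALUES (?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, amount REAL)")

batch = [("e1", 10.0), ("e2", 25.5)]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry after a failure

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

After the retry the table still holds two rows. A plain `INSERT` against a table with no key would have produced four.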
Mid-level Data Engineering interview questions
2–5 years
Modelling and orchestration.
How do you design an incremental data pipeline?
Process only new/changed data using watermarks or change data capture, with idempotent, restartable steps.
Reprocesses the entire dataset every run.
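A watermark-based incremental run might look roughly like the following. This is a sketch under stated assumptions: rows carry an ISO-8601 `updated_at` string (so string comparison matches time order), and `state` and `sink` are plain dictionaries standing in for a metadata store and a target table.

```python
def run_incremental(source_rows, state, sink):
    """Process only rows newer than the stored watermark, then advance it."""
    watermark = state.get("watermark", "")
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    for r in new_rows:
        sink[r["id"]] = r  # keyed write, so a restarted run cannot duplicate rows
    if new_rows:
        state["watermark"] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)


source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00"},
]
state, sink = {}, {}

first = run_incremental(source, state, sink)   # picks up both rows
source.append({"id": 1, "updated_at": "2024-01-03T00:00:00"})  # row 1 changed
second = run_incremental(source, state, sink)  # picks up only the change
third = run_incremental(source, state, sink)   # nothing new
```

The strict `>` comparison can miss a row that arrives later with exactly the same timestamp as the watermark; real pipelines often overlap the window slightly and rely on the keyed write to absorb the repeats.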
What is partitioning and why does it matter?
Splitting large datasets (often by date) so queries scan less and processing parallelises.
Queries scan entire unpartitioned tables.
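The effect of partition pruning can be illustrated with a toy in-memory model. The `event_date` partition key and the helper names are assumptions; a real engine does the same thing with directories or table metadata.

```python
from collections import defaultdict


def partition_by_date(rows):
    """Group rows into one bucket per event_date value."""
    parts = defaultdict(list)
    for r in rows:
        parts[r["event_date"]].append(r)
    return parts


def scan(parts, start, end):
    """Read only the partitions whose date falls in [start, end]."""
    selected = [d for d in parts if start <= d <= end]
    rows = [r for d in selected for r in parts[d]]
    return rows, len(selected)


rows = [
    {"event_date": "2024-01-01", "value": 1},
    {"event_date": "2024-01-01", "value": 2},
    {"event_date": "2024-01-02", "value": 3},
    {"event_date": "2024-01-03", "value": 4},
]
parts = partition_by_date(rows)
matched, partitions_read = scan(parts, "2024-01-02", "2024-01-02")
```

A query for a single day touches one of the three partitions instead of every row.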
How do you handle late-arriving or out-of-order data?
Watermarks, reprocessing windows, and upserts so corrections update prior results correctly.
Assumes data always arrives in order and on time.
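One possible shape for handling late or corrected events is to upsert by event id and recompute only the days a batch touches, keyed on event time rather than arrival time. The field names (`id`, `event_time`, `amount`) are assumptions for this sketch.

```python
def ingest(events_by_id, daily_totals, batch):
    """Upsert events by id, then recompute totals for every day the batch touched."""
    touched_days = set()
    for e in batch:
        previous = events_by_id.get(e["id"])
        if previous is not None:
            # A correction may move an event to another day; refresh the old day too.
            touched_days.add(previous["event_time"][:10])
        events_by_id[e["id"]] = e
        touched_days.add(e["event_time"][:10])
    for day in touched_days:
        daily_totals[day] = sum(
            e["amount"] for e in events_by_id.values() if e["event_time"][:10] == day
        )
    return touched_days


events_by_id, daily_totals = {}, {}
ingest(events_by_id, daily_totals, [
    {"id": "a", "event_time": "2024-03-02T09:00:00", "amount": 5},
])
# This event happened on 1 March but arrives after the 2 March data.
late_days = ingest(events_by_id, daily_totals, [
    {"id": "b", "event_time": "2024-03-01T23:50:00", "amount": 7},
])
```

The late event updates the 1 March total in place, and replaying the same batch leaves the totals unchanged.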
What is a slowly changing dimension?
A modelling pattern for tracking how dimension attributes change over time, e.g. overwriting the old value (Type 1) versus keeping history as versioned rows (Type 2).
Overwrites history and loses point-in-time accuracy.
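The history-keeping variant (commonly called Type 2) can be sketched as versioned rows with validity dates. The row layout (`valid_from`, `valid_to`, `attrs`) and the helper names are assumptions for illustration, and dates are ISO strings so they compare in time order.

```python
def apply_scd2(history, key, new_attrs, effective_date):
    """Close the current row for `key` if its attributes changed, then append a new version."""
    current = next(
        (row for row in history if row["key"] == key and row["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return history  # nothing changed, so no new version is recorded
        current["valid_to"] = effective_date
    history.append(
        {"key": key, "attrs": dict(new_attrs), "valid_from": effective_date, "valid_to": None}
    )
    return history


def as_of(history, key, date):
    """Return the attributes that were valid for `key` on `date`, or None."""
    for row in history:
        if row["key"] == key and row["valid_from"] <= date and (
            row["valid_to"] is None or date < row["valid_to"]
        ):
            return row["attrs"]
    return None


history = []
apply_scd2(history, "cust-1", {"city": "Leeds"}, "2023-01-01")
apply_scd2(history, "cust-1", {"city": "York"}, "2024-06-01")

before_move = as_of(history, "cust-1", "2023-12-31")
after_move = as_of(history, "cust-1", "2024-06-01")
```

A report run for December 2023 still sees the customer in Leeds, which is exactly the point-in-time accuracy that overwriting destroys.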
How do orchestration tools help?
They schedule, sequence, retry and monitor pipeline tasks with dependency management (e.g. DAGs).
Chains cron jobs with no dependency handling.
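What an orchestrator does at its core can be sketched with Python's standard `graphlib` module (Python 3.9+): resolve a dependency graph into a run order and retry failed tasks. The `run_dag` helper and the task names are hypothetical; real tools add scheduling, logging, backoff and monitoring on top.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies, max_retries=2):
    """Run callables in dependency order, retrying a failed task up to max_retries times."""
    order = list(TopologicalSorter(dependencies).static_order())
    completed = []
    for name in order:
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up so downstream tasks do not run on bad input


    return completed


attempts = {"transform": 0}


def flaky_transform():
    """Fails once with a transient error, then succeeds."""
    attempts["transform"] += 1
    if attempts["transform"] < 2:
        raise RuntimeError("transient failure")


tasks = {"extract": lambda: None, "transform": flaky_transform, "load": lambda: None}
# Each key maps a task to the set of tasks that must finish before it.
dependencies = {"transform": {"extract"}, "load": {"transform"}}
completed = run_dag(tasks, dependencies)
```

Chained cron jobs have neither property: `load` would start on schedule whether or not `transform` had succeeded.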
How do you ensure data quality?
Validation and tests on schema, nulls, uniqueness and ranges, with alerting when checks fail.
Ships data with no quality checks.
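The checks listed above can be sketched as a small validation function. The column names and the range bounds are assumptions; in practice the returned failures would feed an alert or block the load.

```python
def run_checks(rows, key, required, ranges):
    """Return a list of human-readable failures; an empty list means every check passed."""
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null")
        k = row.get(key)
        if k in seen:
            failures.append(f"row {i}: duplicate {key}={k}")
        seen.add(k)
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is not None and not (low <= value <= high):
                failures.append(f"row {i}: {col}={value} outside [{low}, {high}]")
    return failures


rows = [
    {"id": 1, "amount": 10},
    {"id": 1, "amount": -5},    # duplicate id and out-of-range amount
    {"id": 2, "amount": None},  # missing amount
]
failures = run_checks(rows, "id", ["id", "amount"], {"amount": (0, 1000)})
```

Here `failures` contains three entries, one each for the uniqueness, range and null checks.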
What is the difference between a fact and a dimension table?
Facts hold measurable events (numbers to aggregate); dimensions hold descriptive context to slice by.
Mixes measures and descriptors arbitrarily.
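A tiny star schema makes the split visible. The example uses `sqlite3` from the standard library, and the table and column names (`fact_sales`, `dim_product`) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO fact_sales VALUES (100, 1, 12.0), (101, 1, 8.0), (102, 2, 30.0);
""")

# Aggregate the measure from the fact table, sliced by an attribute of the dimension.
totals = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
```

The fact table carries the number to aggregate (`amount`); the dimension carries the label to group by (`category`).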
How do you handle backfills?
Idempotent, partition-aware reprocessing of historical data without disrupting ongoing loads.
Manually re-runs and creates duplicates or gaps.
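A partition-aware backfill can be as simple as rebuilding whole date partitions and replacing what was there. In this sketch `store` is a dictionary standing in for a partitioned table, and the field names are assumptions.

```python
def backfill(store, source_rows, dates):
    """Rebuild each requested date partition in full, replacing whatever was there."""
    for day in dates:
        store[day] = [r for r in source_rows if r["event_date"] == day]
    return store


source = [
    {"event_date": "2024-02-01", "amount": 5},
    {"event_date": "2024-02-02", "amount": 7},
]
store = {
    "2024-02-01": [{"event_date": "2024-02-01", "amount": 999}],  # known-bad historical data
    "2024-02-03": [{"event_date": "2024-02-03", "amount": 1}],    # current load, must stay intact
}

backfill(store, source, ["2024-02-01", "2024-02-02"])
after_first_run = {day: list(rows) for day, rows in store.items()}
backfill(store, source, ["2024-02-01", "2024-02-02"])  # running it again is harmless
```

Replacing a partition wholesale avoids both duplicates and gaps, and partitions outside the requested range are never touched.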
Senior Data Engineering interview questions
5+ years
Scale and reliability.
How do you design a pipeline for scale and cost?
Partitioning, columnar formats, pushing compute to the warehouse, incremental processing, and monitoring cost drivers.
Scans everything and ignores cost.
How do you make pipelines reliable and observable?
Idempotent steps, retries, data-quality checks, lineage, alerting and SLAs on freshness.
No monitoring; failures found by users.
How do you handle schema evolution in data pipelines?
Backward-compatible changes, schema registries/contracts, and tolerant consumers so upstream changes don’t break downstream.
An upstream column change silently breaks everything.
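A tolerant consumer, in its simplest form, reads only the fields it knows, fills defaults for fields that are missing, and ignores anything new. The `EXPECTED` contract and its field names are hypothetical.

```python
# Hypothetical contract: field name -> default used when upstream omits it.
EXPECTED = {"user_id": None, "email": None, "plan": "free"}


def read_record(raw):
    """Keep known fields, fill defaults for missing ones, ignore unknown extras."""
    return {field: raw.get(field, default) for field, default in EXPECTED.items()}


# Upstream added `referrer` and has not yet started sending `plan`.
record = read_record({"user_id": 1, "email": "a@example.com", "referrer": "newsletter"})
```

Adding a column upstream does not break this reader. Removing or renaming a required field still needs a contract change, which is where a schema registry helps.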
What are the tradeoffs of streaming vs batch at scale?
Streaming gives low latency at higher operational complexity; batch is simpler and cheaper where latency allows.
Builds streaming everywhere with no need.
How do you approach data governance and lineage?
Track where data comes from and how it’s transformed, with cataloguing, access controls and quality ownership.
No idea how a metric was derived.
How do you optimise warehouse query performance and cost?
Partitioning/clustering, materialised or pre-aggregated tables, avoiding full-table scans, and monitoring expensive queries.
Runs full-table scans repeatedly.
How do you handle personally identifiable information in data?
Classification, minimisation, masking/encryption, access controls, and compliance with applicable privacy regulations (e.g. GDPR).
Copies raw PII everywhere with no controls.
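Masking can take many forms; one option is keyed hashing, which replaces a PII value with a stable token so joins still work downstream. This is a sketch only: the field names are assumptions, and the key shown is a placeholder that would come from a secret store in practice.

```python
import hashlib
import hmac


def pseudonymise(value, secret_key):
    """Return a keyed SHA-256 digest so the raw value is not stored downstream."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record, pii_fields, secret_key):
    """Replace the classified PII fields with tokens and pass other fields through."""
    masked = {}
    for field, value in record.items():
        if field in pii_fields and value is not None:
            masked[field] = pseudonymise(str(value), secret_key)
        else:
            masked[field] = value
    return masked


SECRET_KEY = b"placeholder-key-for-illustration-only"  # assumption: loaded from a secret store
record = {"user_id": 42, "email": "ada@example.com", "plan": "pro"}
masked = mask_record(record, {"email"}, SECRET_KEY)
```

The same email always maps to the same token under a given key, so the masked column can still be used for joins and distinct counts. Pseudonymised data is generally still treated as personal data, so access controls continue to apply.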
How do you choose the right storage format?
Columnar formats (Parquet) for analytics, with compression and partitioning; row formats for transactional access.
Stores analytics data as raw CSV/JSON and pays for it.