Data Engineering Interview Questions (2026): By Level, With Model Answers

How to use this

Data engineering is where bad decisions compound silently. These questions check whether a candidate designs reliable, scalable pipelines and thinks about data quality.

Hiring a data engineer is easy. Telling a real one from a convincing résumé is the hard part, and it’s most of what we do. The questions are grouped by level, because the same question that stretches a junior is a warm-up for a senior.

Junior Data Engineering interview questions

0–2 years

Fundamentals.

What is the difference between ETL and ELT?

What a strong answer covers

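ETL transforms data before loading it; ELT loads raw data first and then transforms it inside the warehouse, using the warehouse's own compute. ELT became common with modern cloud warehouses because storage is cheap and compute scales on demand, and keeping the raw data means transformations can be re-run when logic changes.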

Red flag

Cannot explain why ELT became popular.
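
A minimal sketch of the ELT order of operations, assuming an in-memory SQLite database as a stand-in for a cloud warehouse. The table names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
raw_rows = [
    ("1", "2026-01-01", "10.50"),
    ("2", "2026-01-01", "4.25"),
    ("3", "2026-01-02", "7.00"),
]

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse

# Load: land the data untouched, everything still text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: cast and aggregate inside the warehouse, using its SQL engine.
conn.execute(
    """
    CREATE TABLE daily_revenue AS
    SELECT order_date, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_orders
    GROUP BY order_date
    """
)

daily_revenue = conn.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

Because raw_orders is kept, the transform can be dropped and rebuilt with new logic without going back to the source.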

What is the difference between a data warehouse and a data lake?

What a strong answer covers

A warehouse stores structured, query-optimised data; a lake stores raw data of any format cheaply for later processing.

Red flag

Uses the terms interchangeably.

What is a data pipeline?

What a strong answer covers

An automated flow moving and transforming data from sources to destinations, ideally reliable and idempotent.

Red flag

Copies data manually with ad-hoc scripts.

What is the difference between batch and streaming processing?

What a strong answer covers

Batch processes data in scheduled chunks; streaming processes events continuously with low latency.

Red flag

Picks one without considering latency needs.

What is a primary key and why does it matter in data?

What a strong answer covers

A unique identifier enabling deduplication, joins and incremental processing.

Red flag

Loads data with no unique key and creates duplicates.
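
As a rough illustration of why the key matters, the sketch below deduplicates rows by a key and keeps the most recent version of each. The column names id and updated_at are assumptions made for the example.

```python
def dedupe_latest(rows, key="id", version="updated_at"):
    """Keep one row per key: the one with the greatest version value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] > current[version]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical extract in which the same customer appears twice.
rows = [
    {"id": 1, "updated_at": "2026-01-01", "email": "a@old.example"},
    {"id": 2, "updated_at": "2026-01-01", "email": "b@example.com"},
    {"id": 1, "updated_at": "2026-01-03", "email": "a@new.example"},
]

deduped = dedupe_latest(rows)
print(deduped)
```

Without a key there is nothing to group on, so the duplicate of customer 1 cannot be told apart from a genuinely new record.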

What is schema-on-read vs schema-on-write?

What a strong answer covers

Schema-on-write enforces structure at load time; schema-on-read applies it when querying raw data.

Red flag

Doesn’t know the difference or its implications.

What is data normalisation vs denormalisation in analytics?

What a strong answer covers

Normalised models reduce redundancy; analytical models often denormalise (star schema) for query performance.

Red flag

Applies OLTP normalisation to an analytics workload.

Why is idempotency important in pipelines?

What a strong answer covers

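Re-running a job, for example after a failure, must not duplicate or corrupt data: the same input should leave the target in the same state however many times the job runs.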

Red flag

Reruns a job and doubles the data.
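
One common way to get this property is to overwrite a whole partition inside a transaction instead of appending to it. The sketch below assumes SQLite and a hypothetical sales table; it illustrates the pattern and is not the only way to achieve idempotency.

```python
import sqlite3


def load_partition(conn, day, amounts):
    """Replace one day's rows atomically: delete the partition, then insert."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO sales (day, amount) VALUES (?, ?)",
            [(day, amount) for amount in amounts],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")

load_partition(conn, "2026-01-01", [10.0, 20.0])
load_partition(conn, "2026-01-01", [10.0, 20.0])  # simulated rerun after a failure

row_count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(row_count, total)
```

A plain INSERT in place of the delete-then-insert would have left four rows after the rerun, which is exactly the red flag above.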

Mid-level Data Engineering interview questions

2–5 years

Modelling and orchestration.

How do you design an incremental data pipeline?

What a strong answer covers

Process only new/changed data using watermarks or change data capture, with idempotent, restartable steps.

Red flag

Reprocesses the entire dataset every run.
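
A watermark-based incremental load might look like the sketch below. It assumes each source row carries an ISO-8601 updated_at string (so string comparison matches time order) and that the watermark is persisted between runs; the state dictionary stands in for that storage.

```python
def run_incremental(source_rows, target, state):
    """Copy only rows newer than the stored watermark, then advance it."""
    watermark = state.get("watermark", "")
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    for row in new_rows:
        target[row["id"]] = row  # upsert by key, so a retry cannot duplicate
    if new_rows:
        state["watermark"] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)


source = [
    {"id": 1, "updated_at": "2026-01-01T00:00:00", "status": "new"},
    {"id": 2, "updated_at": "2026-01-02T00:00:00", "status": "new"},
]
target, state = {}, {}

first = run_incremental(source, target, state)   # initial load
source.append({"id": 1, "updated_at": "2026-01-03T00:00:00", "status": "paid"})
second = run_incremental(source, target, state)  # picks up only the change
third = run_incremental(source, target, state)   # nothing new to do
print(first, second, third)
```

The strict greater-than comparison can miss rows that share the watermark timestamp but are committed later, which is one reason candidates may bring up overlap windows or change data capture.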

What is partitioning and why does it matter?

What a strong answer covers

Splitting large datasets (often by date) so queries scan less and processing parallelises.

Red flag

Queries scan entire unpartitioned tables.
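
The effect of partition pruning can be shown with a toy in-memory layout. This is only an illustration of the idea: the events and the ts and amount fields are hypothetical, and real engines prune at the file or block level.

```python
from collections import defaultdict


def partition_by_day(events):
    """Group events into daily partitions keyed by the date part of ts."""
    parts = defaultdict(list)
    for event in events:
        parts[event["ts"][:10]].append(event)
    return parts


def query_total(parts, start_day, end_day):
    """Sum amounts for a date range, reading only the matching partitions."""
    total = 0
    scanned = 0
    for day, rows in parts.items():
        if start_day <= day <= end_day:  # partition pruning
            scanned += len(rows)
            total += sum(r["amount"] for r in rows)
    return total, scanned


events = [
    {"ts": "2026-01-01T09:00:00", "amount": 5},
    {"ts": "2026-01-02T10:00:00", "amount": 7},
    {"ts": "2026-01-02T11:30:00", "amount": 3},
    {"ts": "2026-01-03T08:15:00", "amount": 9},
]

parts = partition_by_day(events)
total, scanned = query_total(parts, "2026-01-02", "2026-01-02")
print(total, scanned)
```

Here the query for one day touches two of the four rows; on an unpartitioned table it would have to read all of them.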

How do you handle late-arriving or out-of-order data?

What a strong answer covers

Watermarks, reprocessing windows, and upserts so corrections update prior results correctly.

Red flag

Assumes data always arrives in order and on time.
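
One way to make corrections land in the right place is to upsert events by id and recompute only the days they touch, keyed on event time and not on arrival time. The sketch below assumes hypothetical id, event_time and amount fields and keeps everything in memory.

```python
def apply_batch(events_by_id, daily_totals, batch):
    """Upsert a batch by event id, then recompute only the days it touched."""
    touched_days = set()
    for event in batch:
        previous = events_by_id.get(event["id"])
        if previous is not None:
            touched_days.add(previous["event_time"][:10])
        events_by_id[event["id"]] = event
        touched_days.add(event["event_time"][:10])
    for day in touched_days:
        daily_totals[day] = sum(
            e["amount"] for e in events_by_id.values() if e["event_time"][:10] == day
        )
    return touched_days


events_by_id, daily_totals = {}, {}
apply_batch(events_by_id, daily_totals, [
    {"id": "a", "event_time": "2026-01-01T23:50:00", "amount": 10},
    {"id": "b", "event_time": "2026-01-02T00:05:00", "amount": 5},
])

# A late event: it happened on 1 January but only arrives in a later batch.
touched = apply_batch(events_by_id, daily_totals, [
    {"id": "c", "event_time": "2026-01-01T22:00:00", "amount": 4},
])
print(daily_totals, touched)
```

The late event updates the total for 1 January and leaves 2 January alone. In practice the reprocessing window would be bounded so that very old partitions are not reopened indefinitely.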

What is a slowly changing dimension?

What a strong answer covers

A modelling pattern for tracking how dimension attributes change over time (e.g. keeping history vs overwriting).

Red flag

Overwrites history and loses point-in-time accuracy.
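
The history-keeping variant is usually called Type 2. A minimal sketch of it, assuming half-open validity intervals with ISO date strings and hypothetical field names (key, attrs, valid_from, valid_to):

```python
def apply_scd2(history, key, attrs, as_of):
    """Close the current row for key if attrs changed, then append a new one."""
    current = next(
        (r for r in history if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None:
        if current["attrs"] == attrs:
            return history  # nothing changed, keep the current row open
        current["valid_to"] = as_of
    history.append({"key": key, "attrs": attrs, "valid_from": as_of, "valid_to": None})
    return history


def attrs_as_of(history, key, day):
    """Return the attributes that were valid for key on the given day."""
    for r in history:
        if r["key"] == key and r["valid_from"] <= day and (
            r["valid_to"] is None or day < r["valid_to"]
        ):
            return r["attrs"]
    return None


history = []
apply_scd2(history, "cust-1", {"city": "Leeds"}, "2025-01-01")
apply_scd2(history, "cust-1", {"city": "York"}, "2026-01-01")
apply_scd2(history, "cust-1", {"city": "York"}, "2026-02-01")  # no change

print(attrs_as_of(history, "cust-1", "2025-06-15"))
print(attrs_as_of(history, "cust-1", "2026-03-01"))
```

Overwriting in place (Type 1) would answer both lookups with York, which is the loss of point-in-time accuracy the red flag describes.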

How do orchestration tools help?

What a strong answer covers

They schedule, sequence, retry and monitor pipeline tasks with dependency management (e.g. DAGs).

Red flag

Chains cron jobs with no dependency handling.
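
To make the idea concrete, here is a toy runner that sequences tasks by dependency and retries failures. It uses graphlib from the Python standard library (3.9+); the task names and the simulated transient failure are hypothetical, and a real orchestrator adds scheduling, logging and alerting on top.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps, max_retries=2):
    """Run callables in dependency order, retrying each failed task."""
    completed = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up so downstream tasks never run
    return completed


attempts = {"transform": 0}


def flaky_transform():
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("simulated transient failure")


tasks = {"extract": lambda: None, "transform": flaky_transform, "load": lambda: None}
deps = {"transform": {"extract"}, "load": {"transform"}}  # task -> its upstream tasks

order = run_dag(tasks, deps)
print(order, attempts)
```

With chained cron jobs, load would have started at its scheduled time whether or not transform had succeeded.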

How do you ensure data quality?

What a strong answer covers

Validation and tests on schema, nulls, uniqueness and ranges, with alerting when checks fail.

Red flag

Ships data with no quality checks.
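
The checks named above can be sketched in a few lines. The column names, the amount range and the return shape are assumptions for the example; dedicated data-quality frameworks cover the same ground with more features.

```python
def check_quality(rows, required=("id", "amount"), unique_key="id",
                  amount_range=(0, 10_000)):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    seen = set()
    low, high = amount_range
    for i, row in enumerate(rows):
        missing = [c for c in required if row.get(c) is None]
        if missing:
            failures.append(f"row {i}: null or missing {missing}")
            continue
        if row[unique_key] in seen:
            failures.append(f"row {i}: duplicate {unique_key}={row[unique_key]}")
        seen.add(row[unique_key])
        if not low <= row["amount"] <= high:
            failures.append(f"row {i}: amount {row['amount']} outside [{low}, {high}]")
    return failures


rows = [
    {"id": 1, "amount": 50},
    {"id": 1, "amount": 20},    # duplicate key
    {"id": 2, "amount": None},  # null in a required column
    {"id": 3, "amount": -5},    # out of range
]

failures = check_quality(rows)
for failure in failures:
    print(failure)
```

In a pipeline, a non-empty result would typically fail the run or raise an alert before the data is published.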

What is the difference between a fact and a dimension table?

What a strong answer covers

Facts hold measurable events (numbers to aggregate); dimensions hold descriptive context to slice by.

Red flag

Mixes measures and descriptors arbitrarily.

How do you handle backfills?

What a strong answer covers

Idempotent, partition-aware reprocessing of historical data without disrupting ongoing loads.

Red flag

Manually re-runs and creates duplicates or gaps.
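
A partition-aware backfill can be sketched as a loop that rebuilds one daily partition at a time and overwrites it. The store dictionary and the compute_partition callable are hypothetical stand-ins for the target table and the transformation.

```python
from datetime import date, timedelta


def backfill(store, compute_partition, start, end):
    """Rebuild each daily partition in [start, end], overwriting what is there."""
    rebuilt = []
    day = start
    while day <= end:
        store[day.isoformat()] = compute_partition(day)  # overwrite, never append
        rebuilt.append(day.isoformat())
        day += timedelta(days=1)
    return rebuilt


# Existing state: one bad partition (2 Jan) and one missing (3 Jan).
store = {"2026-01-01": [1], "2026-01-02": [999], "2026-01-04": [4]}

rebuilt = backfill(store, lambda d: [d.day], date(2026, 1, 2), date(2026, 1, 3))
backfill(store, lambda d: [d.day], date(2026, 1, 2), date(2026, 1, 3))  # safe to repeat
print(rebuilt, store)
```

Partitions outside the requested range are untouched, so ongoing loads for other days are not disturbed.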

Senior Data Engineering interview questions

5+ years

Scale and reliability.

How do you design a pipeline for scale and cost?

What a strong answer covers

Partitioning, columnar formats, pushing compute to the warehouse, incremental processing, and monitoring cost drivers.

Red flag

Scans everything and ignores cost.

How do you make pipelines reliable and observable?

What a strong answer covers

Idempotent steps, retries, data-quality checks, lineage, alerting and SLAs on freshness.

Red flag

No monitoring; failures found by users.
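
A freshness SLA check is one of the simpler pieces to illustrate. The sketch assumes the last successful load time per table is recorded somewhere; the table names and thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def freshness_breaches(last_loaded, slas, now):
    """Return tables whose latest load is older than their SLA, or missing."""
    breaches = []
    for table, max_age in slas.items():
        loaded_at = last_loaded.get(table)
        if loaded_at is None or now - loaded_at > max_age:
            breaches.append(table)
    return sorted(breaches)


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
last_loaded = {
    "orders": now - timedelta(minutes=30),
    "customers": now - timedelta(hours=26),
    # "payments" has never loaded successfully
}
slas = {
    "orders": timedelta(hours=1),
    "customers": timedelta(hours=24),
    "payments": timedelta(hours=1),
}

breaches = freshness_breaches(last_loaded, slas, now)
print(breaches)
```

Wiring the result into alerting is what turns a failure found by users into a failure found by the team.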

How do you handle schema evolution in data pipelines?

What a strong answer covers

Backward-compatible changes, schema registries/contracts, and tolerant consumers so upstream changes don’t break downstream.

Red flag

An upstream column change silently breaks everything.
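
A tolerant consumer might be sketched as below: it enforces a small contract on required fields, fills defaults for optional ones, and ignores fields it does not know. The contract itself (id, email, country) is hypothetical.

```python
REQUIRED = {"id": int, "email": str}        # hypothetical contract
OPTIONAL_DEFAULTS = {"country": "unknown"}  # added later, so it needs a default


def read_tolerantly(record):
    """Validate required fields, default optional ones, ignore unknown ones."""
    for field, expected_type in REQUIRED.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"contract violation on required field {field!r}")
    out = {field: record[field] for field in REQUIRED}
    for field, default in OPTIONAL_DEFAULTS.items():
        out[field] = record.get(field, default)
    return out


old_record = {"id": 1, "email": "a@example.com"}
new_record = {"id": 2, "email": "b@example.com", "country": "DE", "loyalty_tier": "gold"}

print(read_tolerantly(old_record))  # missing optional field gets its default
print(read_tolerantly(new_record))  # unknown upstream field is ignored

try:
    read_tolerantly({"id": "3", "email": "c@example.com"})  # type changed upstream
except ValueError as err:
    print("rejected:", err)
```

Additive changes pass through quietly, while a breaking change to a required field fails loudly at the boundary and does not corrupt downstream tables.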

What are the tradeoffs of streaming vs batch at scale?

What a strong answer covers

Streaming gives low latency at higher operational complexity; batch is simpler and cheaper where latency allows.

Red flag

Builds streaming everywhere with no need.

How do you approach data governance and lineage?

What a strong answer covers

Track where data comes from and how it’s transformed, with cataloguing, access controls and quality ownership.

Red flag

No idea how a metric was derived.

How do you optimise warehouse query performance and cost?

What a strong answer covers

Partitioning/clustering, materialised or pre-aggregated tables, avoiding full-table scans, and monitoring expensive queries.

Red flag

Runs full-table scans repeatedly.

How do you handle personally identifiable information in data?

What a strong answer covers

Classification, minimisation, masking/encryption, access controls, and compliance with the regulations that apply (for example GDPR).

Red flag

Copies raw PII everywhere with no controls.
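
Two of these controls, pseudonymisation and masking, can be illustrated with the standard library. The key below is a placeholder; in practice it would come from a secrets manager, and the choice of keyed hashing (HMAC-SHA256) is an assumption for the example, not a compliance recommendation.

```python
import hashlib
import hmac


def pseudonymise(value, secret_key):
    """Replace a value with a stable keyed hash so joins still work."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email):
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


key = b"placeholder-key-from-a-secrets-manager"  # hypothetical, never hard-code

token = pseudonymise("jane@example.com", key)
masked = mask_email("jane@example.com")
print(token[:12], masked)
```

The same input always maps to the same token, so downstream tables can still join on it without holding the raw address.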

How do you choose the right storage format?

What a strong answer covers

Columnar formats (Parquet) for analytics, with compression and partitioning; row formats for transactional access.

Red flag

Stores analytics data as raw CSV/JSON and pays for it.

