Spark makes distributed processing accessible and easy to misuse. These questions check whether a candidate understands lazy evaluation, shuffles and skew.
Hiring an Apache Spark developer is easy. Telling a real one from a convincing résumé is the hard part, and it’s most of what we do. The questions are grouped by level, because the same question that stretches a junior is a warm-up for a senior.
Junior Apache Spark interview questions
0–2 years
Core concepts.
What is Apache Spark?
A distributed engine for large-scale data processing that keeps working data in memory where it can, with APIs for batch, SQL, streaming and ML.
Thinks it’s just “faster Hadoop” with no detail.
What is an RDD?
A Resilient Distributed Dataset: an immutable, partitioned collection processed in parallel, the low-level abstraction beneath DataFrames.
Cannot explain what makes it resilient (lineage).
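A candidate who understands lineage should be able to describe something like the toy model below. It is plain Python, not Spark: `ToyRDD` and its methods are hypothetical names, and the only point is that each dataset remembers how to rebuild a partition from its parent, so a lost partition can be recomputed rather than restored from a copy.

```python
class ToyRDD:
    """Toy stand-in for an RDD: each instance knows how to rebuild any partition."""

    def __init__(self, compute_partition, num_partitions):
        self.compute_partition = compute_partition  # the lineage: recipe for partition i
        self.num_partitions = num_partitions
        self.materialised = {}                      # partitions currently held in memory

    def map(self, fn):
        parent = self
        # The child records its dependency on the parent instead of copying data.
        return ToyRDD(lambda i: [fn(x) for x in parent.get(i)], self.num_partitions)

    def get(self, i):
        if i not in self.materialised:
            self.materialised[i] = self.compute_partition(i)
        return self.materialised[i]


source = [[1, 2], [3, 4], [5, 6]]
base = ToyRDD(lambda i: list(source[i]), 3)
squared = base.map(lambda x: x * x)

first = squared.get(1)        # computed by following the lineage
del squared.materialised[1]   # simulate losing the partition along with its executor
rebuilt = squared.get(1)      # recomputed from the same lineage
```

Both `first` and `rebuilt` hold the same values, which is the behaviour the word "resilient" refers to.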
What is the difference between a transformation and an action?
Transformations (map, filter) are lazy and build a plan; actions (count, collect) trigger execution.
Expects a transformation to run immediately.
What is lazy evaluation in Spark?
Transformations aren’t executed until an action needs a result, letting Spark optimise the whole plan.
Assumes each line runs as written.
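If you want something concrete to discuss, the sketch below models laziness in plain Python. It is not Spark code: `LazyDataset` is a hypothetical class that only records `map` and `filter` steps and runs them when an action such as `collect` is called.

```python
class LazyDataset:
    """Toy model: transformations record steps, actions execute them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps                  # the recorded plan; nothing has run yet

    def map(self, fn):                       # transformation: returns a new plan
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):                  # transformation: returns a new plan
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):                       # action: runs the whole plan
        return list(self._run())

    def count(self):                         # action: runs the whole plan
        return sum(1 for _ in self._run())


calls = []

def noisy_double(x):
    calls.append(x)                          # side effect so we can see when work happens
    return x * 2

ds = LazyDataset(range(5)).map(noisy_double).filter(lambda x: x > 4)
calls_before_action = len(calls)             # no function has been called yet
result = ds.collect()                        # the action triggers execution
```

A good follow-up is to ask what `calls_before_action` is and why. A candidate who expects each line to run as written will get it wrong.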
What is a DataFrame vs an RDD?
A DataFrame is a higher-level, schema-aware, optimised (Catalyst) API; prefer it over raw RDDs for most work.
Writes low-level RDD code where DataFrames would be faster.
What is a partition in Spark?
A chunk of the data processed by one task; partitioning determines parallelism.
No idea how parallelism is controlled.
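The link between keys, partitions and tasks can be sketched in a few lines of plain Python. This is an illustration only: `partition_for` uses CRC32 as a stand-in hash, which is an assumption for the example and not the function Spark's own partitioner uses.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a string key to a partition index (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_records(records, num_partitions):
    """Split (key, value) records into partitions; one task would process each."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions

records = [(f"user{i}", i) for i in range(12)]
parts = partition_records(records, 4)   # at most 4 tasks can work in parallel
```

The number of partitions caps how many tasks can run at once, and every record with the same key lands in the same partition.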
What does collect() do and why is it risky?
It pulls all data to the driver, which can run it out of memory on large datasets.
Calls collect() on a huge dataset and crashes the driver.
What is the driver vs an executor?
The driver coordinates the job and plan; executors run tasks on the cluster and hold data partitions.
Confuses their roles.
Mid-level Apache Spark interview questions
2–5 years
Execution and performance.
What is a shuffle and why is it expensive?
Redistributing data across partitions (for joins, groupBy), which involves disk and network I/O and is usually the main performance cost.
Triggers unnecessary shuffles and wonders why jobs are slow.
What is data skew and how do you handle it?
Uneven key distribution overloading some partitions/tasks; mitigations include salting keys, repartitioning and broadcast joins.
Ignores skew and one task runs forever.
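Salting is easier to judge when the candidate can walk through it. The sketch below is a plain-Python model of two-stage aggregation with salted keys, not Spark code; `NUM_SALTS` and the event data are made up for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 8                      # hypothetical number of salt buckets
rng = random.Random(42)

# One hot key with 10,000 events and ten ordinary keys with 100 events each.
events = [("hot", 1)] * 10_000 + [(f"k{i}", 1) for i in range(10) for _ in range(100)]

# Without salting, one task has to handle every record for the hot key.
largest_unsalted = max(Counter(key for key, _ in events).values())

# Stage 1: aggregate by (key, salt) so the hot key is spread over NUM_SALTS groups.
salted = Counter()
for key, value in events:
    salted[(key, rng.randrange(NUM_SALTS))] += value
largest_salted = max(salted.values())

# Stage 2: drop the salt and combine the partial results per original key.
totals = Counter()
for (key, _salt), partial in salted.items():
    totals[key] += partial
```

The totals are unchanged, but the largest group any single task must process is several times smaller. The cost is a second aggregation step, which is why salting is a targeted fix for hot keys and not a default.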
What is a broadcast join and when do you use it?
Sending a small table to all executors to join without shuffling the large one, avoiding an expensive shuffle.
Shuffles a huge table to join a tiny lookup table.
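The mechanics can be modelled in plain Python as below. This is a sketch of the idea, not Spark's implementation: `broadcast_join` is a hypothetical function in which the small table becomes a lookup that every partition of the large table uses locally. In PySpark the equivalent hint is `pyspark.sql.functions.broadcast`.

```python
def broadcast_join(large_partitions, small_table):
    """Inner-join each partition of the large side against a copied lookup."""
    lookup = dict(small_table)              # the copy each task would receive
    joined = []
    for partition in large_partitions:      # the large side never moves
        joined.append([(k, v, lookup[k]) for k, v in partition if k in lookup])
    return joined

orders = [
    [("gb", 10), ("fr", 20)],               # partition 0 of the large table
    [("gb", 5), ("xx", 1)],                 # partition 1 of the large table
]
countries = [("gb", "United Kingdom"), ("fr", "France")]

result = broadcast_join(orders, countries)
```

Each partition is joined where it already lives, so the partition boundaries of the large table are preserved and nothing is shuffled.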
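What is the difference between cache() and persist(), and when do you use them?
cache() is persist() with the default storage level, while persist() lets you choose one (memory, disk or both). Both keep a reused dataset around to avoid recomputation; use them when a result is reused multiple times, not blindly.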
Caches everything and runs out of memory.
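To make the recomputation cost visible, here is a toy model in plain Python. `ToyDataset` is a hypothetical class, not a Spark API: it counts how often the underlying computation runs with and without caching.

```python
class ToyDataset:
    """Toy model: every action recomputes the data unless it has been cached."""

    def __init__(self, compute):
        self._compute = compute
        self._use_cache = False
        self._cached = None
        self.compute_count = 0

    def cache(self):
        self._use_cache = True        # only marks the dataset; nothing is stored yet
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached       # served from the cache, no recomputation
        self.compute_count += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data       # stored on the first action
        return data


def expensive():
    return [x * x for x in range(1000)]

uncached = ToyDataset(expensive)
uncached.collect()
uncached.collect()                    # computed a second time

cached = ToyDataset(expensive).cache()
cached.collect()
cached.collect()                      # reused from the first action
```

The uncached dataset is computed twice and the cached one once. The saving only exists when there is a second use, which is the judgment the question is probing.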
What is the difference between narrow and wide transformations?
Narrow (map, filter) need no shuffle; wide (groupByKey, join) shuffle data across partitions.
Cannot predict which operations shuffle.
Why prefer reduceByKey over groupByKey?
reduceByKey combines values map-side before shuffling, moving far less data than groupByKey.
Uses groupByKey and shuffles everything.
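The difference shows up as soon as you count the records that cross the shuffle. The plain-Python sketch below models both approaches on two small partitions. The function names are hypothetical and the "shuffle" is just a list, so treat it as an illustration of map-side combining and not as Spark's implementation.

```python
from collections import Counter, defaultdict

partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("c", 1)],
]

def group_by_key_sum(partitions):
    """groupByKey style: ship every record, then sum on the receiving side."""
    shuffled = [record for part in partitions for record in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return {k: sum(vs) for k, vs in groups.items()}, len(shuffled)

def reduce_by_key_sum(partitions):
    """reduceByKey style: combine within each partition before shipping."""
    shuffled = []
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value              # map-side combine
        shuffled.extend(local.items())       # one record per key per partition
    totals = Counter()
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

grouped_totals, grouped_shuffled = group_by_key_sum(partitions)
reduced_totals, reduced_shuffled = reduce_by_key_sum(partitions)
```

Both return the same totals, but the first version moves 8 records and the second moves 5. On real data with many repeated keys the gap is far larger.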
How does partitioning affect performance?
Too few partitions limit parallelism; too many add overhead; repartition/coalesce tune it to the cluster and data size.
Runs with default partitioning regardless of scale.
What is the Catalyst optimizer?
The engine that optimises DataFrame/SQL query plans (predicate pushdown, etc.), a reason to prefer DataFrames over RDDs.
Unaware DataFrames are optimised.
Senior Apache Spark interview questions
5+ years
Scale and tuning.
How do you diagnose a slow Spark job?
Use the Spark UI to find expensive stages, shuffle sizes, skew and spills, then address the specific bottleneck.
Randomly increases resources without diagnosis.
How do you tune executor memory and cores?
Size executors and cores to the workload and cluster, avoiding tiny or giant executors, and account for shuffle/overhead memory.
Guesses at memory settings and hits OOM.
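A strong candidate can show the arithmetic behind a sizing decision. The sketch below is one way to do it, under stated assumptions: five cores per executor is a common rule of thumb and not an official recommendation, one core and 1 GB per node are reserved for the OS and daemons, and the 10% overhead fraction should be checked against the configuration docs for your Spark version. `size_executors` is a hypothetical helper.

```python
def size_executors(node_cores, node_mem_gb, cores_per_executor=5,
                   reserved_cores=1, reserved_mem_gb=1, overhead_fraction=0.10):
    """Return (executors per node, heap GB per executor) for one worker node."""
    executors_per_node = (node_cores - reserved_cores) // cores_per_executor
    mem_per_executor = (node_mem_gb - reserved_mem_gb) / executors_per_node
    # Leave room for off-heap overhead on top of the heap setting.
    heap_gb = int(mem_per_executor / (1 + overhead_fraction))
    return executors_per_node, heap_gb

executors, heap_gb = size_executors(node_cores=16, node_mem_gb=64)
```

For a 16-core, 64 GB node this gives 3 executors with a 19 GB heap each. The numbers matter less than whether the candidate can explain each term and adjust it for the workload.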
How do you handle out-of-memory errors?
Reduce shuffle/skew, increase partitions, avoid collect(), tune memory, and let data spill to disk where appropriate rather than just adding RAM.
Only ever raises the memory setting.
What is the difference between batch processing and Structured Streaming?
Structured Streaming treats a stream as an unbounded table processed incrementally with similar APIs, adding watermarks and state handling.
Cannot articulate streaming vs batch tradeoffs.
How do you handle late data in streaming?
Watermarks define how late data is accepted, bounding state and enabling correct windowed aggregations.
Ignores late/out-of-order data.
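The watermark rule can be modelled in plain Python as below. This is a simplification and not Spark's exact behaviour: the watermark here moves after every event instead of once per trigger, and `windowed_counts` with its parameters is hypothetical. In Structured Streaming the corresponding call is `withWatermark` on the event-time column.

```python
from collections import Counter

def windowed_counts(event_times, window=10, allowed_lateness=5):
    """Count events per tumbling window, dropping those behind the watermark."""
    max_event_time = 0
    counts = Counter()
    dropped = []
    for event_time in event_times:                   # arrival order, not event order
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            dropped.append(event_time)               # too late: its state is gone
        else:
            counts[event_time // window * window] += 1   # keyed by window start
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped

counts, dropped = windowed_counts([1, 4, 12, 9, 21, 3, 18])
```

The event at time 9 arrives after 12 but is still counted, while the event at time 3 arrives after 21 and is dropped. That trade-off between completeness and bounded state is what the candidate should be able to explain.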
How do you optimise joins at scale?
Broadcast small sides, mitigate skew, pre-partition on join keys, and choose appropriate file formats and pruning.
Shuffles both huge sides with no strategy.
When is Spark the wrong tool?
For small data (single-machine tools are simpler/faster) or low-latency per-record needs; its overhead only pays off at scale.
Uses a Spark cluster to process a few megabytes.
How do you choose file formats and layout for Spark?
Use columnar formats like Parquet, partition the data on commonly filtered columns, and rely on predicate pushdown to minimise I/O.
Reads huge uncompressed CSVs repeatedly.
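Why layout matters can be shown with a toy table in plain Python. The sketch below is not how Parquet or Spark work internally: the dictionary of `date=...` keys is a hypothetical stand-in for partition directories, and `scan` simply counts how many rows it has to touch.

```python
# Hypothetical layout: one "directory" per date, as with a date-partitioned table.
table = {
    "date=2024-01-01": [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}],
    "date=2024-01-02": [{"user": "a", "amount": 7}],
    "date=2024-01-03": [{"user": "c", "amount": 3}, {"user": "a", "amount": 1}],
}

def scan(table, date=None, columns=None):
    """Return (rows, rows_read), skipping directories the filter rules out."""
    rows_read = 0
    out = []
    for directory, rows in table.items():
        if date is not None and directory != f"date={date}":
            continue                                   # partition pruning
        for row in rows:
            rows_read += 1
            out.append({c: row[c] for c in (columns or row)})  # column projection
    return out, rows_read

full, full_read = scan(table)
pruned, pruned_read = scan(table, date="2024-01-02", columns=["amount"])
```

The filtered query touches 1 row instead of 5 and returns only the column it needs. A row-oriented, unpartitioned CSV gives the engine no way to skip anything.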