ML interviews are full of people who can call a library but can’t reason about a model. These questions check whether a candidate understands the fundamentals and the pitfalls.
Hiring a Machine Learning developer is easy. Telling a real one from a convincing résumé is the hard part, and it’s most of what we do. The questions below are grouped by level, because the same question that stretches a junior is a warm-up for a senior.
Junior Machine Learning interview questions
0–2 years
Core concepts.
What is the difference between supervised and unsupervised learning?
Supervised learns from labelled data to predict; unsupervised finds structure in unlabelled data (clustering, dimensionality reduction).
Confuses the two or can’t give examples.
What is overfitting and how do you spot it?
A model memorising its training data and failing to generalise; it shows up as high training performance alongside much lower validation performance.
Judges a model only on training accuracy.
What is the difference between classification and regression?
Classification predicts categories; regression predicts continuous values.
Uses the wrong metric for the task type.
What is a training, validation and test set?
Data split to fit, tune and finally evaluate a model without leaking information from evaluation into training.
Evaluates on the training set or tunes on the test set.
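As a possible follow-up to the split question above, a candidate could be asked to sketch a three-way split by hand. The version below is a minimal illustration using only the standard library; the function name `three_way_split`, the 60/20/20 proportions and the seed are assumptions chosen for the example, not a recommended default.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # local RNG, so global state is untouched
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

The point to listen for is that the three parts are disjoint and that the test part is set aside before any fitting or tuning happens.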
What is feature engineering?
Transforming raw data into informative inputs (scaling, encoding, deriving features) that improve model performance.
Feeds raw, unscaled data with no thought.
Why do you split data before preprocessing?
To avoid data leakage — fitting scalers/encoders on the whole dataset leaks test information into training.
Scales the whole dataset before splitting.
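The leakage point above can be made concrete with a small sketch: the scaler's statistics are computed from the training rows only and then reused on the test rows. The helper names (`fit_standardizer`, `standardize`) and the toy numbers are hypothetical, and standardisation is just one example of a fitted preprocessing step.

```python
import statistics

def fit_standardizer(values):
    """Learn mean and standard deviation from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, (std if std > 0 else 1.0)  # guard against constant features

def standardize(values, mean, std):
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical feature values
test = [10.0, 12.0]

# Correct: statistics come from the training split alone.
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

# Leaky: statistics computed on train + test already "know" the test rows.
leaky_mean, leaky_std = fit_standardizer(train + test)
print(mean, leaky_mean)
```

Here the leaky mean differs from the training mean, which is exactly the information that should not reach the model before evaluation.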
What is the bias–variance tradeoff?
Simple models underfit (high bias); complex ones overfit (high variance); the goal is the balance that generalises.
Cannot explain why more complexity isn’t always better.
What is cross-validation?
Splitting data into folds to evaluate a model across multiple train/validation partitions for a robust estimate.
Trusts a single train/test split for everything.
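To check that a candidate understands the mechanics behind cross-validation rather than just the library call, a sketch along these lines can help. It is a minimal k-fold index generator, assuming the data has already been shuffled; `k_fold_indices` is a hypothetical name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

A model would be fitted on each `train_idx` and scored on the matching `val_idx`, with the scores averaged for the final estimate.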
Mid-level Machine Learning interview questions
2–5 years
Evaluation and modelling.
Why can accuracy be misleading?
On imbalanced data a naive majority-class model scores high accuracy while being useless; precision, recall and F1 tell the real story.
Reports accuracy on a heavily imbalanced problem.
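The accuracy trap described above is easy to demonstrate. The sketch below uses made-up labels with 5% positives and a "model" that always predicts the majority class.

```python
from collections import Counter

y_true = [1] * 5 + [0] * 95  # hypothetical labels: 5% positives

# A "model" that always predicts the most common class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = positives_found / sum(y_true)

print(accuracy, recall)  # 0.95 0.0
```

The baseline scores 95% accuracy while finding none of the positives, which is why a strong candidate asks about class balance before quoting accuracy.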
What are precision, recall and F1?
Precision is the share of positive predictions that are correct, recall is the share of actual positives that are found, and F1 is their harmonic mean; which one you prioritise depends on the cost of each type of error.
Can define them but not choose which matters for the problem.
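One way to probe the "which metric matters" part of the question above is to show how the decision threshold moves precision and recall in opposite directions. The sketch below is illustrative only; the scores, labels and thresholds are invented, and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0]                        # hypothetical labels
scores = [0.9, 0.6, 0.4, 0.7, 0.45, 0.38, 0.1, 0.05]     # hypothetical model scores

results = {}
for threshold in (0.5, 0.35):
    y_pred = [int(s >= threshold) for s in scores]
    results[threshold] = precision_recall_f1(y_true, y_pred)
    print(threshold, results[threshold])
```

In this toy data, lowering the threshold from 0.5 to 0.35 catches every positive but at the price of more false alarms; whether that trade is worth it depends on what a missed positive costs compared with a false one.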
How do you handle imbalanced datasets?
Resampling, class weights, appropriate metrics and thresholds, and sometimes anomaly-detection framing.
Ignores imbalance and optimises accuracy.
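Of the options listed above, class weights are the simplest to show in a few lines. The sketch below uses the common inverse-frequency heuristic, n_samples / (n_classes * class_count); the function name and the 90/10 label split are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = [0] * 90 + [1] * 10  # hypothetical 90/10 imbalance
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight to the loss (90 × weight of class 0 equals 10 × weight of class 1), so errors on the rare class are no longer drowned out.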
What is regularisation?
Constraining model complexity (L1/L2 weight penalties, dropout) to reduce overfitting and improve generalisation.
No strategy to combat overfitting.
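For the regularisation question above, a one-parameter example is often enough to see whether a candidate understands what the penalty does. The sketch below fits y ≈ w·x with no intercept; minimising the squared error plus `l2 * w**2` gives the closed form w = Σxy / (Σx² + l2). The function name and data are hypothetical.

```python
def ridge_slope(xs, ys, l2):
    """Slope of y = w * x minimising sum((y - w*x)**2) + l2 * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

for l2 in (0.0, 1.0, 14.0):
    print(l2, ridge_slope(xs, ys, l2))
```

As the penalty grows, the fitted slope shrinks toward zero: the model trades a little training fit for a smaller, more conservative weight.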
How do you tune hyperparameters?
Systematic search (grid/random/Bayesian) with cross-validation, avoiding tuning on the test set.
Tweaks by hand and evaluates on the test set.
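The tuning discipline described above can be sketched as a tiny grid search. To keep it short, a single validation split stands in for cross-validation, and the model is a one-parameter ridge fit; all data, the grid values and the helper names are invented for the example.

```python
def fit_slope(xs, ys, l2):
    """Ridge slope for y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)

def mse(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: noisy training rows, clean validation and test rows.
train_x, train_y = [1.0, 2.0, 3.0], [2.5, 3.5, 6.5]
val_x, val_y = [4.0, 5.0], [8.0, 10.0]
test_x, test_y = [6.0], [12.0]

# Search: every candidate is fitted on train and scored on validation only.
grid = [0.0, 0.1, 0.5, 1.0, 5.0]
val_errors = {l2: mse(fit_slope(train_x, train_y, l2), val_x, val_y) for l2 in grid}
best_l2 = min(val_errors, key=val_errors.get)

# The test set is touched exactly once, after the choice is made.
final_w = fit_slope(train_x, train_y, best_l2)
test_error = mse(final_w, test_x, test_y)
print(best_l2, final_w, test_error)
```

The structure matters more than the model here: the hyperparameter is chosen on validation data, and the test score is computed only for the final choice.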
What is the difference between a parameter and a hyperparameter?
Parameters are learned from data; hyperparameters (learning rate, depth) are set before training and tuned.
Conflates the two.
What is data leakage and how do you prevent it?
Information that won’t be available at prediction time influencing the model (target leakage, preprocessing fitted on all data); prevented by careful pipelines and splits.
Includes future or target-derived features unknowingly.
How do you choose a model for a problem?
By data size and type, interpretability needs, and baseline performance — starting simple before reaching for complex models.
Jumps to a deep model for a tiny tabular dataset.
Senior Machine Learning interview questions
5+ years
Deployment and MLOps.
What is model/data drift and how do you handle it?
A deployed model degrades when the input distribution shifts (data drift) or the relationship between inputs and target changes (concept drift); you monitor inputs and performance, and alert or retrain when either drifts.
Deploys once and never monitors.
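A senior candidate should be able to name at least one concrete way to monitor inputs for drift. The sketch below uses the Population Stability Index (PSI) as one common option; it is an illustration chosen here, not the only approach. The bin edges, the sample data and the 0.25 alert level (a widely quoted rule of thumb) are all assumptions.

```python
import math

def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            idx = sum(v >= edge for edge in bin_edges)  # index of the bin v falls in
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = list(range(100))          # feature values seen at training time
live = [v + 30 for v in reference]    # hypothetical shifted production values
edges = [25, 50, 75]

psi_same = population_stability_index(reference, reference, edges)
psi_shifted = population_stability_index(reference, live, edges)
print(psi_same, psi_shifted)
if psi_shifted > 0.25:  # rule-of-thumb threshold, tune per feature
    print("drift alert: investigate or retrain")
```

Identical distributions give a PSI of zero; the shifted sample scores far above the alert level, which is the signal that would trigger investigation or retraining.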
How do you deploy and serve models in production?
Versioned models behind an API or batch pipeline, with monitoring, rollback and reproducible training.
Ships a notebook artifact with no reproducibility.
How do you evaluate a model beyond offline metrics?
Online experiments (A/B tests), business-metric impact, and monitoring, since offline scores don’t guarantee real-world value.
Assumes a good validation score means production success.
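For the online-experiment part of the answer above, it is reasonable to ask how the candidate would decide whether an A/B difference is real. The sketch below shows a standard two-proportion z-test with a pooled standard error; the conversion counts are invented and the function name is hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical experiment: control model vs. new model, 1000 users each.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(round(z, 2), round(p_value, 3))
```

A good candidate will add that statistical significance is only part of the picture: the size of the lift, the experiment duration and guardrail metrics matter as well.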
What is the difference between batch and online inference?
Batch scores data periodically; online serves low-latency predictions per request — each with different infrastructure and freshness tradeoffs.
Picks one without considering latency/freshness needs.
How do you ensure reproducibility in ML?
Version data, code and models, fix seeds where sensible, and track experiments so results can be reproduced and compared.
Cannot reproduce a past model or result.
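Two of the habits mentioned above, fixed seeds and tracked experiments, can be sketched in a few lines. In the example below `train` is a stand-in that only draws seeded random numbers, and `run_fingerprint` is a hypothetical helper that derives a short, stable identifier from the configuration; a real setup would also version the data and code.

```python
import hashlib
import json
import random

def run_fingerprint(config):
    """Stable short ID for a run, independent of key order in the config."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def train(config):
    """Stand-in for training: all randomness flows from the configured seed."""
    rng = random.Random(config["seed"])
    return [rng.gauss(0, 1) for _ in range(3)]  # pretend these are learned weights

config = {"seed": 42, "learning_rate": 0.1}  # hypothetical run configuration
first = train(config)
second = train(config)
print(run_fingerprint(config), first == second)
```

Storing the fingerprint alongside the resulting model and metrics makes it possible to tell later which configuration produced which result.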
How do you think about fairness and bias in models?
Examine data representativeness and disparate impact, choose appropriate metrics, and monitor outcomes across groups.
Ignores bias in data and outcomes.
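To move the fairness discussion above from principle to practice, a candidate can be asked how they would measure outcomes across groups. The sketch below compares selection rates per group and reports their ratio, one simple disparate-impact check among several; the group labels and predictions are made up, and the 0.8 cut-off is the commonly cited "four-fifths" heuristic rather than a universal standard.

```python
from collections import defaultdict

def selection_rates(groups, predictions):
    """Share of positive predictions within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for g, p in zip(groups, predictions):
        totals[g] += 1
        positives[g] += p
    return {g: positives[g] / totals[g] for g in totals}

def disparate_impact_ratio(rates):
    """Lowest selection rate divided by the highest (1.0 means parity)."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0

groups = ["A"] * 10 + ["B"] * 10                     # hypothetical group labels
predictions = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7  # hypothetical model outputs

rates = selection_rates(groups, predictions)
ratio = disparate_impact_ratio(rates)
print(rates, ratio)
if ratio < 0.8:  # heuristic threshold, context dependent
    print("selection rates differ substantially between groups")
```

A single ratio is not a verdict on fairness; it is a prompt to look at the data, the error rates per group and the downstream consequences.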
When is machine learning the wrong solution?
When rules or heuristics suffice, data is insufficient or poor quality, or the cost of errors is unacceptable; ML isn’t always the answer.
Reaches for ML where simple logic would do.
How do you build an ML pipeline that scales?
Automated, reproducible stages for data, training, evaluation and deployment (MLOps) with monitoring and retraining.
Runs everything manually in notebooks.