The Apple Machine Learning Interview: On-Device ML, Core ML, and Applied Research
Apple's ML loop is not OpenAI's. They grade for model-compression craft, privacy-preserving training, and shipping models that run on a phone in your pocket. Here's the actual bar in 2026.
Apple's ML hiring is a distinct track from the general software engineering loop. In 2026, with Apple Intelligence shipping across iOS 18 and 19, the Foundation Models group pulling headcount, and on-device LLMs becoming a default expectation, the ML interview at Apple has gotten harder, narrower, and much more focused on what the rest of the industry treats as a nice-to-have: running models efficiently on constrained hardware.
If you're coming from Google DeepMind, Meta FAIR, or an OpenAI-adjacent startup, the Apple ML interview will surprise you. The questions are less about training a 70B model from scratch and more about how you'd shrink one to fit in 3GB of RAM on an iPhone 15 without tanking quality. The teams hiring are primarily Apple Intelligence Foundation Models (AIFM), the On-Device ML group, Core ML framework, Siri understanding, Vision (the framework and the product), Health signals, and Maps (routing and ETA models).
This guide covers the structure, the rubrics, the questions, and the prep path that actually works. Sources are Blind, Levels.fyi writeups, conversations with recruited candidates, and the public traces Apple leaves at WWDC and in their ML research blog.
The loop structure
The ML-IC loop at Apple has three shapes depending on the team. Applied scientist roles on AIFM look different from production ML engineer roles on Core ML, which look different from research scientist roles on the ML Research group.
For the applied ML engineer track (the majority of headcount):
- Recruiter screen. 30 minutes. Resume, motivation, why Apple specifically. They screen hard for candidates who can articulate a reason beyond "big tech pays."
- Technical phone screen. 60 minutes. Usually one ML-theory question (derive something from scratch, explain a loss function) plus a small coding problem — often implementing a transformer attention head, a beam search, or a data-loader piece in PyTorch.
- Onsite coding round. One or two rounds. Algorithms plus applied. The applied round is frequently "implement this piece of a pipeline from a paper" — batched inference, a custom CUDA-esque kernel in simplified form, or a quantization loop.
- Onsite ML system design. 60 minutes. "Design the on-device model that autocompletes text in Mail." "Design the image-understanding pipeline for Visual Look Up." "Design how you'd train and ship a small model for keyboard prediction without sending keystrokes to a server."
- ML depth / research round. 60 minutes. Deep dive on your strongest area — could be NLP, CV, recsys, speech, or classical ML. They want to see one area mastered, not five skimmed.
- Behavioral / values round. Same as the SWE track. Covered in the sibling guide.
- Hiring-manager or director round. Often a second design conversation or a deep-dive on a past project you led.
Research scientist loops add a paper presentation round and weight paper output more heavily. Core ML framework roles add a systems round closer to a compiler or kernel interview than a model interview.
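The phone-screen coding ask above is concrete, not hand-wavy. As a rough illustration of the scope, here is single-head scaled dot-product attention with causal masking in plain Python. An interview answer would use PyTorch tensors; this dependency-free version just shows the math being tested:

```python
import math

def attention(Q, K, V, causal=True):
    """Single-head scaled dot-product attention on plain lists.
    Q, K, V: one d-dimensional vector per token."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # raw scores, with future positions masked to -inf when causal
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                scores.append(float("-inf"))
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        # numerically stable softmax
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        # weighted sum of value vectors
        out.append([sum(wj * V[j][t] for j, wj in enumerate(w))
                    for t in range(len(V[0]))])
    return out

# with a causal mask, the first token can only attend to itself,
# so its output equals V[0]
out = attention([[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 2.0], [3.0, 4.0]])
```

The grouped-query-attention extension interviewers ask for amounts to sharing one K/V pair across a group of query heads, which is also why it shrinks the KV-cache.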
What Apple actually grades on
The rubric dimensions that cluster into strong hires on the ML track:
- Model efficiency literacy. Do you know the numerical cost of a forward pass? Can you estimate FLOPs for a 3B model with 32 layers at 2K context? Do you know why int4 quantization works and what it breaks? At Apple, every model ships on hardware, and hardware has a memory bandwidth budget. Candidates who can only discuss models at the architecture level, without caring about bytes and cycles, do not clear the bar.
- On-device training and inference. Personalization, federated learning, differential privacy, on-device fine-tuning with LoRA or similar. Apple invests heavily here because they can't train on user data the way Google can. If you've never thought about federated averaging, start.
- Core ML and Metal Performance Shaders knowledge (for framework and on-device roles). You are not expected to be an expert, but you should know that Core ML exists, that it compiles to a graph that runs on CPU, GPU, or the Neural Engine, and that the Neural Engine has specific opcodes it prefers.
- Privacy-aware training. Differential privacy with specific epsilon budgets, secure aggregation, and the difference between user-level and example-level DP. Apple's internal review genuinely asks "what's our privacy budget on this training run."
- Evaluation discipline. Can you design an offline eval that correlates with online success? Do you know when to use paired bootstrap, permutation tests, or BLEU-vs-human evals? Apple spends real effort on eval because they can't iterate on user data the way competitors can.
- Shipping bias. Research-only candidates who have never pushed a model to production struggle. Apple wants people who have debugged a quantization regression on a specific device, not just people who have read the paper.
- Taste on research directions. For ICT5+ and research scientist roles, the interviewer wants to see that you have informed opinions. Is the future of on-device LLMs MoE, speculative decoding, or SSMs like Mamba? What do you think and why?
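If federated averaging is new to you, the core server-side step is a single weighted average. A toy sketch, with flat lists standing in for model weights (real systems layer secure aggregation and differential privacy on top of this):

```python
def fedavg(client_weights, client_sizes):
    """One round of federated averaging: each client trains locally,
    and the server averages the returned weights, weighted by each
    client's local dataset size. No raw data leaves the device."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# two clients, the first with twice the data, so it pulls the
# average toward its weights
merged = fedavg([[1.0, 1.0], [4.0, 4.0]], [2, 1])  # [2.0, 2.0]
```

The interview follow-ups usually probe what this sketch omits: stragglers, non-IID client data, and how much noise DP adds before the average stops converging.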
What does not score: naming every open-source model you've tried, claiming you "built an LLM," or hand-waving numerical questions with "depends on the context."
Example questions
From 2024-2026 Apple ML loops, reported on Blind and in candidate debriefs:
- Derive the gradient of softmax cross-entropy from scratch, then describe what changes under label smoothing.
- Implement attention in 40 lines of PyTorch, including masking. Then extend it to grouped-query attention.
- Explain the memory cost of KV-cache for a 7B model at 32-layer depth, 4K context, fp16. Now at int4. Now with paged attention.
- Design an on-device summarization feature for Mail. Scope: runs on an iPhone, 3GB RAM budget, one user's inbox. What model, what quantization, what evaluation, what privacy story?
- You have a 7B base model that scores 60 on your benchmark. Quantizing to int4 drops it to 55. What techniques do you try, in what order, and what do you expect each to recover?
- Design the training pipeline for a keyboard autocorrect model that improves over time without sending keystrokes to Apple.
- Walk me through how you'd evaluate a new summarization model against the shipped one. What do you measure offline? What do you ship behind a flag for online eval?
- Given two checkpoints A and B that disagree on 5% of examples, how do you decide which is better?
- Explain speculative decoding. Why does it speed up inference, and when does it not?
- Design the system that detects 'hallucination' in a model-generated summary of the user's email.
- Your model's quality regresses on iPhone 13 but not iPhone 15. What's your debugging strategy?
Notice the pattern. Numeracy is tested in every round. Privacy shows up in half the questions. Hardware awareness is assumed at senior level.
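The KV-cache question above is pure arithmetic, and it is worth having the formula cold. A sketch using an assumed generic 7B-class shape (32 layers, 32 KV heads of dimension 128, i.e. full multi-head attention; GQA shrinks n_kv_heads and the cache with it):

```python
def kv_cache_bytes(n_layers, ctx, n_kv_heads, head_dim, bits_per_elem):
    # K and V each store n_layers * ctx * n_kv_heads * head_dim elements
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bits_per_elem // 8

fp16 = kv_cache_bytes(32, 4096, 32, 128, 16)  # 2.0 GiB
int4 = kv_cache_bytes(32, 4096, 32, 128, 4)   # 0.5 GiB
print(fp16 / 2**30, int4 / 2**30)
```

Paged attention does not change this ceiling; it changes how much of it you actually allocate at any moment, by chunking the cache into blocks sized to the live sequence rather than reserving the full context up front.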
Strong vs passing answers
A passing answer to "design on-device summarization for Mail" picks a small model, mentions quantization, and draws boxes. It scores a lean hire if the candidate is otherwise strong.
A strong answer does:
- Scopes the problem numerically first. "What's the email length distribution? Let's assume p95 is 2K tokens. Target output 100 tokens. We have a ~3GB RAM budget with iOS overhead, realistically 2GB for the model. That puts us at a 3B-parameter model in int4 (about 1.5GB of weights) or roughly 1.5B in int8."
- Picks a specific architecture and justifies. "I'd start from a distilled 3B-parameter decoder, probably a Llama-3-family shape, because the quantization tooling is most mature there. I'd run a GPTQ-style post-training quantization pass to int4 weights and int8 activations, targeting the Neural Engine."
- Designs the data pipeline with privacy in mind. "Training data is synthetic — we generate email-summary pairs using a larger server-side teacher. We never train on user emails. For personalization, I'd add a LoRA adapter that we fine-tune on-device using a small batch of the user's last 100 emails with DP-SGD at epsilon 3."
- Names the eval. "Offline: ROUGE-L plus a learned eval model trained on human judgments. Online: a silent A/B with a user-facing thumbs up/down, reported as a privacy-preserving aggregate."
- Anticipates the device-level failure. "On iPhone 13 with 4GB RAM and older A-series, we'll hit memory pressure if Mail is in the background. I'd add a model tier: the full 3B for newer devices, a 1.5B fallback for older."
- Names what would go wrong. "The biggest risk is hallucination on forwarded emails with long quoted threads. I'd add a specific eval slice for that and a UI affordance that lets the user see the summarized range."
That's an ICT5-quality answer. Candidates who hit all six points crisply get pulled up.
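The numeric scoping in the first bullet is worth rehearsing until it is reflexive. A sketch of the Mail budget, using an assumed generic 3B-decoder shape (28 layers, hidden size 3072; not any specific Apple model):

```python
def resident_gib(params, weight_bits, n_layers, ctx, d_model, kv_bits=16):
    """Rough resident memory for one inference: weights plus KV-cache."""
    weights = params * weight_bits / 8
    kv_cache = 2 * n_layers * ctx * d_model * kv_bits / 8  # K and V, full MHA
    return (weights + kv_cache) / 2**30

# 3B at int4, a 2K-token email in context
print(round(resident_gib(3e9, 4, 28, 2048, 3072), 2))  # ~2.05
```

Note that this already brushes the 2GB ceiling before counting activations, which is exactly the kind of observation interviewers reward: the headline "a 3B int4 model fits" claim survives only if you also quantize or trim the KV-cache.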
Common failure modes
The ways ML candidates reliably lose an Apple loop:
- Server-centric thinking. Designing a system where the model runs on a GPU cluster and the phone is a thin client. Apple's ML org exists to avoid that pattern.
- No numeracy. Being unable to estimate FLOPs, memory, or latency to order-of-magnitude accuracy. Guessing is fine; confident wrong numbers are not.
- Ignoring quantization regressions. Pretending int4 is free. It isn't. Strong candidates know to check the specific regressions on long-context reasoning and code generation.
- Weak evaluation stories. "We'll use BLEU." OK but why, and what does BLEU miss for this task, and what's the correlation with human preference?
- Overselling LLM familiarity. Claiming deep expertise on models you haven't actually trained end-to-end. Apple interviewers push on specifics — what optimizer, what learning rate, what warmup, what scheduler, what evaluation cadence.
- No privacy reflex. Describing a training pipeline that sends user data to a server without flagging it. The interviewer will wait for you to notice. Beat them to it.
- Neural-Engine ignorance for on-device roles. If you're interviewing for Core ML or on-device ML, not knowing that the Neural Engine is int8-first with specific opcode limitations is a real gap.
Prep strategy
20-40 hours over three to four weeks, assuming you already have ML fundamentals:
- Read Apple's ML research blog end-to-end. It's small and mostly high-signal. Pay attention to the on-device papers (OpenELM, Ferret, and the 2024-2025 Apple Intelligence technical report).
- Read the Apple Intelligence Foundation Models technical report. Public. The architecture, quantization recipe, and adapter strategy are described in useful detail. You should be able to discuss it in interview.
- Drill numerical estimation. Practice estimating FLOPs, memory, and latency for transformer models under different quantizations. Use back-of-envelope math. Chinchilla scaling, activation memory, KV-cache size — memorize the formulas.
- Build a tiny on-device model, end to end. Train a small classifier or seq2seq model in PyTorch, convert to Core ML, run on an iPhone simulator or device, measure latency and memory. The experience is worth 20 papers.
- Read one distillation paper and one quantization paper deeply. GPTQ and any of the recent distillation surveys will do. Be able to derive the algorithm from scratch.
- Drill ML system design. Mock with an ex-Apple coach if you can. Apple's ML design rubric is specific, and generic FAANG ML mocks undershoot it.
- Prepare your research pitch. For senior and above, you will be asked what you think the field should work on. Have a thesis. Defend it. Be willing to be wrong.
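The formulas behind the estimation drill fit in a few lines. A sketch of the two most-used ones (the 2x-params rule of thumb counts matmul multiply-adds and ignores attention-score FLOPs, which is fine at short context):

```python
def weight_gib(params, bits):
    """Resident weight memory in GiB."""
    return params * bits / 8 / 2**30

def decode_flops_per_token(params):
    """Rule of thumb: ~2 FLOPs per parameter per generated token."""
    return 2 * params

print(round(weight_gib(7e9, 4), 2))       # ~3.26: an int4 7B model in RAM
print(decode_flops_per_token(3e9) / 1e9)  # 6.0 GFLOPs per token for a 3B
```

The reason memory numbers matter more than FLOPs on-device: at batch size 1, decode streams every weight from memory once per token, so generation speed is bound by memory bandwidth, not compute.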
Next-day follow-up
Apple ML loops close faster than SWE loops — often within two weeks — because the teams are smaller and the hiring managers move urgently. What to do:
- Send a one-line thank-you to the recruiter. Not to individual interviewers.
- Write your debrief within 24 hours: every question, your answer, the follow-up probes. Apple ML teams talk to each other; the note is gold if you re-interview for a different team later.
- If you get a no, ask specifically whether the feedback was on the depth round or the design round. Recruiters will often give directional feedback for ML, more than for SWE, because ML hiring managers push for it.
- If you get a yes, negotiate on the sign-on and RSU refresh, not base. Apple's ML bands are higher than SWE bands at the same ICT level but still rigid on base.
The candidates who clear an Apple ML loop are the ones who can look at a product question, instantly frame it in terms of a budget (memory, latency, privacy), and walk the interviewer through the tradeoff space like they've shipped it before. If you can do that — and if you have real numbers in your head — you will be in the hire pile. If you are smart about LLMs but can't say how big an int4 7B model is in RAM, you will not.
Sources and further reading
When evaluating any company's interview process, hiring bar, or compensation, cross-reference what you read here against multiple primary sources before making decisions.
- Levels.fyi — Crowdsourced compensation data with real recent offers across tech employers
- Glassdoor — Self-reported interviews, salaries, and employee reviews searchable by company
- Blind by Teamblind — Anonymous discussions about specific companies, often the freshest signal on layoffs, comp, culture, and team-level reputation
- LinkedIn People Search — Find current employees by company, role, and location for warm-network outreach and informational interviews
These are starting points, not the last word. Combine multiple sources, weight recent data over older, and treat anonymous reports as signal that needs corroboration.
Related guides
- The Nvidia Machine Learning Interview — GPU Systems, CUDA Optimization, and Applied Research — Nvidia's ML loop doesn't look like Meta's or OpenAI's. They grade for GPU literacy, kernel-level intuition, and a working mental model of memory bandwidth. Here's the 2026 bar.
- The Databricks Machine Learning Interview in 2026 — MLflow, Lakehouse, and Applied Modeling — Databricks ML interviews test whether you can build and ship models on top of Spark, MLflow, and the lakehouse — not just whether you can tune XGBoost on a Jupyter notebook. Here's how the loop actually grades.
- Anthropic Research Engineer Interview in 2026 — Alignment, Evals, and the Research Take-Home — A focused guide to Anthropic research engineer interviews: what to expect, how to prepare for coding, research taste, evaluations, alignment thinking, and the research take-home without relying on hype.
- ML Scientist Interview Questions — Research Depth, Papers, and Applied Modeling Rounds — ML scientist interviews blend research taste with applied engineering judgment. This guide covers paper deep dives, modeling questions, evaluation, experimentation, and how to show research depth without losing product relevance.
- Nvidia Interview Process 2026: CUDA, Systems & Applied ML — A no-fluff breakdown of Nvidia's 2026 interview process for engineers—covering CUDA, distributed systems, and applied ML rounds with concrete prep advice.
