The Nvidia System Design Interview — Accelerated Computing and ML Infrastructure

9 min read · April 25, 2026

Nvidia's system design loop grades for accelerated-compute intuition, not generic distributed systems trivia. Here's how the 2026 rubric actually works, with questions, bar, and prep.

Nvidia's system design interview is not a FAANG system design interview with a GPU layer bolted on. It's a separate rubric. Where Google grades you on whether you can reason about consistent hashing, latency budgets, and leader election, Nvidia grades you on whether you can reason about HBM bandwidth, NVLink topology, pipeline stalls, and the cost of moving a 70B-parameter model's weights and activations through a training or inference system. If you prep with a copy of 'Designing Data-Intensive Applications' and nothing else, you will not clear the bar.

This guide covers the 2026 structure of the Nvidia system design loop, what the rubric actually rewards, the canonical question shapes, and the prep path. Sources: Blind, Levels.fyi, GTC sessions, and debriefs from loops across the DGX Cloud, TensorRT-LLM, NeMo, Megatron, and Inference Microservices (NIM) teams.

Where system design fits in the loop

For senior ML and systems roles, Nvidia runs one or two system design rounds in an onsite loop of five or six rounds. The two shapes:

  • ML system design round. 60 minutes. 'Design the inference server for a 70B model with 10K concurrent users.' 'Design the training pipeline for a 400B mixture-of-experts on 1024 GPUs.' 'Design an A/B serving system for two different quantizations of the same model.'
  • Accelerated-compute systems round. 60 minutes. Lower-level. 'Design a batched inference scheduler that maximizes GPU utilization across heterogeneous sequence lengths.' 'Walk me through the memory layout of an attention operator under tensor parallelism.' 'Design a KV-cache sharing system across concurrent requests.'

At IC5 and above (Nvidia's Staff and Principal levels), both rounds show up. At IC4, you get one or the other depending on the team. Applied ML teams lean ML design; CUDA-adjacent teams lean accelerated compute.

What the Nvidia rubric actually rewards

The non-obvious weights:

  • Bandwidth-first framing. Strong candidates open by naming the limiting resource. 'For batch-1 autoregressive decoding on a 70B model, we're memory-bound on HBM, not compute-bound. The question is how fast we can stream weights and KV-cache through the SMs, not how fast the matmul goes.' This opener buys you five minutes of credibility.
  • Communication cost awareness. Every design should carry a communication budget: NVLink intra-node (~900 GB/s per GPU on H100/H200), InfiniBand inter-node (~400 Gbps per HCA), and when each dominates. Candidates who treat collectives as free fail hard (a worked budget follows this list).
  • Numerical honesty. Nvidia interviewers will interrupt with 'what's the number?' Designs with named memory footprints, bandwidth utilization, and latency estimates score a tier higher than designs at the box-and-arrow level.
  • Parallelism selection. DP, TP, PP, SP, EP (expert parallelism for MoEs), context parallelism. Strong candidates pick one, justify it, and name the tradeoff. Weak candidates list them all without choosing.
  • Precision strategy. What precision for weights, activations, master copy, KV-cache, gradients? Nvidia has shipped FP8 training and expects candidates to reason about it.
  • Batching strategy. Continuous batching vs static batching vs chunked-prefill for inference. Why, when, and what it costs.
  • Hardware topology awareness. 8-GPU DGX node, multi-node pod, SuperPod. You should be able to draw the topology and explain where the bandwidth bottlenecks are.
  • Observability bias. How do you profile this? nsys, ncu, torch profiler, MLPerf traces. Candidates who design without mentioning observability get dinged.

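To make the first two bullets concrete, here is the kind of back-of-envelope budget the rubric rewards. The hardware figures are published H100 numbers; the layer count, hidden size, and TP degree are illustrative assumptions, not a specific checkpoint. Treat this as a sketch of the reasoning, not a benchmark.

```python
# Back-of-envelope budget for batch-1 decode of a 70B dense model on H100s.
# Hardware numbers are datasheet figures; model shape (layers, hidden size)
# and TP degree are illustrative assumptions, not a specific checkpoint.

HBM_BW = 3.35e12          # H100 SXM HBM3 bandwidth, bytes/s
NVLINK_BW = 900e9         # NVLink bandwidth per GPU (aggregate), bytes/s

params = 70e9             # parameter count
bytes_per_param = 1       # FP8 weights
layers, hidden = 80, 8192 # assumed transformer shape
tp = 4                    # tensor-parallel degree

# 1. Memory-bound decode ceiling: each output token streams every GPU's
#    weight shard through HBM once.
weight_bytes_per_gpu = params * bytes_per_param / tp
t_weights = weight_bytes_per_gpu / HBM_BW            # seconds per token per GPU

# 2. Communication budget: Megatron-style TP does ~2 all-reduces per layer.
#    A ring all-reduce moves ~2*(tp-1)/tp of the message per GPU.
act_bytes = hidden * 2                               # FP16 activations, batch 1
ar_bytes = 2 * layers * act_bytes * 2 * (tp - 1) / tp
t_comm = ar_bytes / NVLINK_BW                        # seconds per token, bytes only
# At batch 1 the per-collective launch latency (microseconds each, ~160
# collectives per token) matters at least as much as the bytes moved.

print(f"weight streaming : {t_weights*1e3:.2f} ms/token "
      f"-> ~{1/t_weights:.0f} tok/s ceiling per TP group")
print(f"TP all-reduce    : {t_comm*1e6:.1f} us/token in bytes alone")
```

The point of quoting this aloud is not precision; it is showing that you know which resource to divide by which.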
What does not score: generic distributed-systems patter (Raft, Kafka, load balancers) without tying it to the GPU bottleneck. CAP theorem references. Quoting blog posts without numbers.

The canonical questions

From 2024-2026 Nvidia system design loops, reported by candidates across levels:

  • Design an LLM inference service for a 70B model. 10K QPS, p99 latency budget of 200ms TTFT and 50ms per token. How many GPUs, what parallelism, what batching?
  • Design the training system for a 400B-parameter MoE on 1024 H100s. 32 experts, top-2 routing.
  • Design a multi-tenant inference platform where customers bring their own LoRA adapters and share a base model. What does scheduling look like?
  • Design the KV-cache management layer for a vLLM-style server. How do you page, evict, and share?
  • Design a checkpointing system for a training job that's using 4K GPUs. Target RTO under 5 minutes.
  • Design a data pipeline that feeds 10 TB/day of tokenized text to a distributed trainer without the GPUs ever starving.
  • Design a speculative decoding system that uses a small draft model plus a large target model. What's the throughput calculation? (A worked version of that calculation follows below.)
  • Design an A/B serving system for two checkpoints of the same model, with production-grade observability.
  • Design a differentiable-simulation backend for robotics training.
  • Design the system that generates synthetic video at 30fps for a training loop.
  • Design an evaluation harness that can run 200 checkpoints per day through a 5000-prompt benchmark without becoming the bottleneck.

Notice the pattern. Every question has a clear 'right' limiting resource, and the interviewer wants you to find it in the first three minutes.
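For the speculative-decoding question, interviewers generally want the expected-tokens-per-verification-step formula before anything else. The sketch below uses the standard i.i.d.-acceptance approximation from the speculative sampling papers; the acceptance rate and per-step latencies are illustrative assumptions.

```python
# Expected throughput gain from speculative decoding, using the standard
# i.i.d.-acceptance model from the speculative sampling literature.
# The acceptance rate and per-step latencies below are illustrative.

def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification step:
    (1 - alpha^(gamma+1)) / (1 - alpha), assuming each of gamma draft
    tokens is accepted independently with probability alpha (the +1 is
    the bonus token sampled from the target itself)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, draft_ms: float, target_ms: float) -> float:
    """Throughput vs. plain autoregressive decoding on the target model.
    The verification pass scores gamma+1 tokens at once, which costs about
    one decode step because decode is memory-bandwidth bound."""
    tokens = expected_tokens_per_step(alpha, gamma)
    step_ms = gamma * draft_ms + target_ms   # draft gamma tokens, verify once
    return tokens * target_ms / step_ms

# Example: 80% acceptance, draft length 4, draft model 10x cheaper per token.
print(expected_tokens_per_step(0.8, 4))                # ~3.36 tokens per step
print(speedup(0.8, 4, draft_ms=2.0, target_ms=20.0))   # ~2.4x
```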

A strong answer walkthrough

Take the 70B-model inference design. A passing answer picks vLLM-style serving and TP=4 on an 8-GPU node, then moves on.

A strong answer walks this path:

  1. Scope the request. '10K QPS with 200ms TTFT means we need to produce first tokens at scale. Average output length? Let's assume 300 tokens. Average input length? 1K tokens. Total tokens per second = 10K × 300 = 3M output tokens/sec globally, plus 10M input tokens/sec prefill.'
  2. Bound the bottleneck. 'For decode, a 70B model at FP8 is 70GB of weights. On an H100 at 3.35 TB/s HBM, batch-1 decode is ~48 tokens/sec per GPU before activations and KV reads. Batching amortizes each weight read across requests, so batch 64 lifts throughput nearly linearly until KV-cache memory or the compute roof binds. So decode is memory-bandwidth bound and batching is the lever.'
  3. Pick parallelism. 'TP=4 on a single H100 node fits 70GB weights (after FP8 quant) with headroom for KV-cache. TP=8 hurts because we'd pay more all-reduce for no capacity gain.'
  4. Pick batching. 'Continuous batching with chunked prefill. I'd target prefill chunks of 8K tokens to keep prefill latency predictable, and I'd run decode batches of 64-128 depending on KV-cache pressure.'
  5. KV-cache math. '70B model: 80 layers, GQA with 8 KV heads, head dim 128, FP16 KV cache = ~4 KB per token per layer, ~320 KB per token. With 10K concurrent users averaging 1K context, that's over 3 TB of KV cache globally — way more than any node holds. So we need paging (vLLM-style) or aggressive eviction.' (The arithmetic is reproduced in the sketch after this walkthrough.)
  6. Count the fleet. 'At batch 128 per TP=4 group, each 8-GPU node serves roughly 256 decode slots. For 10K concurrent we need ~40 nodes, or ~320 H100s. Double that for prefill separation. Round up 20% for headroom. Call it 750 H100s in the steady state.'
  7. Disaggregate prefill from decode. 'Prefill is compute-bound. Decode is memory-bound. Running them on the same GPU wastes either the tensor cores or the bandwidth. 2026 best practice is disaggregated prefill/decode, with the KV-cache shipped from prefill nodes to decode nodes over NVLink or RDMA.'
  8. Observability. 'nsys traces across the request lifecycle, per-request TTFT and inter-token latency histograms, KV-cache utilization gauges, and per-GPU memory-bandwidth utilization.'
  9. Failure modes. 'Worst case is a long-context request starving everything else on a GPU. Solution: admission control, max-context limits, preemption.'

That's staff-quality. The candidate produced numbers, chose specific strategies, and named both the current SOTA (disaggregated prefill) and the failure modes.
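The arithmetic in steps 2, 5, and 6 is the part most candidates fumble under pressure, so it is worth being able to reproduce it cold. Below is a minimal sketch using the same assumptions as the walkthrough (H100 datasheet figures, FP8 weights, the stated model shape and batch sizes); the outputs are order-of-magnitude estimates, not a capacity plan.

```python
# Back-of-envelope sizing for the 70B inference design above. Hardware numbers
# are H100 datasheet figures; model shape, concurrency, and batch sizes are
# the walkthrough's assumptions, not measurements.

HBM_BW = 3.35e12   # bytes/s per H100

# Step 2: decode is memory-bound -> tokens/s ~= HBM bandwidth / bytes streamed.
weights = 70e9 * 1                      # FP8 weights, bytes
tok_s_batch1 = HBM_BW / weights
print(f"batch-1 decode ceiling: {tok_s_batch1:.0f} tok/s per GPU")   # ~48

# Step 5: KV cache per token (80 layers, GQA, 8 KV heads, head dim 128, FP16).
kv_per_token = 80 * 2 * 8 * 128 * 2     # layers * (K+V) * heads * dim * bytes
print(f"KV cache: {kv_per_token/1024:.0f} KB/token")                 # ~320 KB

concurrent, avg_ctx = 10_000, 1_000
total_kv = concurrent * avg_ctx * kv_per_token
print(f"global KV cache: {total_kv/1e12:.1f} TB")                    # ~3.3 TB -> page it

# Step 6: fleet count from decode slots per node (TP=4, two groups per node).
slots_per_tp_group, tp_groups_per_node = 128, 2
nodes = concurrent / (slots_per_tp_group * tp_groups_per_node)
gpus_total = nodes * 8 * 2 * 1.2        # x2 prefill/decode split, +20% headroom
print(f"~{nodes:.0f} decode nodes, ~{gpus_total:.0f} H100s total")   # ~39 nodes, ~750 GPUs
```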

Common failure modes

  • Skipping the numbers. Designs without HBM bandwidth, FLOPs, and communication estimates fail.
  • Treating the GPU as a black box. 'It's a GPU, it's fast.' No. Where is the bottleneck?
  • Confusing DP and FSDP. They are not the same. FSDP is ZeRO-3-style sharding of weights, gradients, and optimizer state, not pure replicated data parallelism (a memory-accounting sketch follows this list).
  • Ignoring the KV-cache. Inference designs that don't budget KV memory fail instantly for senior roles.
  • Single-GPU reasoning. At senior levels, you should default to multi-GPU and justify scaling down, not up.
  • Forgetting prefill vs decode. These have different bottlenecks. Strong candidates disaggregate them by the third minute.
  • Naming technologies without explaining them. 'I'd use TensorRT-LLM and vLLM.' OK, but what does that buy me? What's the tradeoff between them?

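The DP-versus-FSDP confusion is cheap to dispel with per-GPU memory arithmetic. The sketch below uses the standard mixed-precision Adam accounting from the ZeRO paper (16 bytes of model state per parameter); the model size and GPU count are illustrative assumptions, and it counts model state only, so activations and FSDP's transient unsharded layer buffers come on top.

```python
# Per-GPU memory for model state under replicated data parallelism (DDP)
# vs ZeRO-3-style full sharding (FSDP FULL_SHARD). Uses the mixed-precision
# Adam accounting from the ZeRO paper: 2 B FP16 params + 2 B FP16 grads
# + 12 B FP32 master/momentum/variance = 16 B per parameter.
# Model size and GPU count are illustrative assumptions.

def model_state_gb(params: float, n_gpus: int, sharded: bool) -> float:
    bytes_per_param = 16
    total = params * bytes_per_param
    per_gpu = total / n_gpus if sharded else total   # DDP replicates everything
    return per_gpu / 1e9

params, n_gpus = 70e9, 64
print(f"DDP  : {model_state_gb(params, n_gpus, sharded=False):.0f} GB/GPU")  # ~1120 GB, does not fit
print(f"FSDP : {model_state_gb(params, n_gpus, sharded=True):.0f} GB/GPU")   # ~18 GB, plus activations
```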
Prep strategy

20-40 hours over three to five weeks:

  • Read the vLLM paper, the Orca paper, and the FlashAttention papers. These are the canonical references for modern inference system design. Be able to derive PagedAttention from scratch (a toy sketch of the block-table idea follows after this list).
  • Read Megatron-LM's parallelism paper. Understand TP and PP communication patterns.
  • Read the H100, H200, and B200 whitepapers. Memorize: HBM size, HBM bandwidth, TFLOPs by precision, NVLink bandwidth per link, NVLink bandwidth per GPU total, InfiniBand HCA bandwidth.
  • Drill roofline analysis. For each of five common models (7B, 13B, 34B, 70B, 400B), estimate decode throughput at batch 1 and batch 128 on H100, and check your work against published numbers.
  • Run one inference server on an H100 or A100. Not in production — a one-week personal project with vLLM or TensorRT-LLM. Profile it with nsys. Watch the numbers move when you change batch or precision.
  • Practice design aloud. Mock with an ex-Nvidia or MLSys-adjacent engineer if you can. Your thinking needs to be vocalized at the speed Nvidia interviewers expect — hedging kills.
  • Prepare two or three favorite designs you want to draw. If the question gives you room, steer toward a design you've rehearsed. Strong candidates do this all the time; it's not cheating, it's good preparation.

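If 'derive PagedAttention from scratch' sounds abstract, the core of it is bookkeeping: fixed-size KV blocks drawn from a shared pool, a per-sequence block table mapping logical token positions to physical blocks, and refcounts if you want prefix sharing. The toy sketch below illustrates that idea only; it is not vLLM's implementation.

```python
# Toy PagedAttention-style KV-cache bookkeeping: fixed-size blocks from a
# shared pool, per-sequence block tables, refcounts for prefix sharing.
# Illustrative sketch of the idea, not vLLM's implementation.

BLOCK_TOKENS = 16

class KVBlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))       # physical block ids
        self.refcount = [0] * num_blocks

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV pool exhausted: evict or preempt a sequence")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1                 # e.g. a shared system-prompt prefix
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

class Sequence:
    """Maps logical token positions to physical KV blocks."""
    def __init__(self, pool: KVBlockPool):
        self.pool, self.block_table, self.num_tokens = pool, [], 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_TOKENS == 0:   # current block full, or first token
            self.block_table.append(self.pool.alloc())
        self.num_tokens += 1

    def free_all(self) -> None:
        for b in self.block_table:
            self.pool.release(b)
        self.block_table.clear()

pool = KVBlockPool(num_blocks=8)
seq = Sequence(pool)
for _ in range(40):                               # 40 tokens -> 3 blocks of 16
    seq.append_token()
print(len(seq.block_table), "blocks in use;", len(pool.free), "free")
seq.free_all()
```

Being able to write this in five minutes, and then explain why the indirection kills fragmentation and enables prefix sharing, is exactly what 'derive it from scratch' means here.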
The 10-minute opener rule

Nvidia system design rounds are won or lost in the first 10 minutes. The pattern that consistently lands:

  1. Clarify the workload numerically. (2 min)
  2. Identify the bottleneck resource. (2 min)
  3. Pick the parallelism and precision strategy. (3 min)
  4. Estimate the fleet size in GPUs. (3 min)

If you nail those four in the first 10 minutes, the rest of the round is the interviewer pressure-testing your choices. If you haven't landed those by minute 15, you're not in the hire pile, no matter how good the second half is.

Comp and hiring-manager read

Comp ranges match the ML interview guide. Negotiation on system design roles specifically: Nvidia pays for distributed systems skill, and candidates with a track record of shipping inference or training infrastructure at scale (vLLM contributors, Megatron committers, MLPerf submitters) can push the initial RSU grant 25-40% above band at IC4/IC5.

The hiring-manager synthesis: 'Can this person walk in and own our 100-GPU problem on day one?' What they want is someone who has done the work — trained or served a real model at real scale. They do not want a candidate who has only architected on a whiteboard. If your background is pure ML research, expect to be filtered toward Nvidia Research rather than Applied. If your background is pure distributed systems without GPUs, expect to be routed to the inference platform teams and to ramp on GPU-specific concerns quickly. If you have both — and if you can think in HBM bandwidth and NVLink topology — Nvidia will move fast on you. Faster than any other FAANG-tier company in 2026.

Sources and further reading

When evaluating any company's interview process, hiring bar, or compensation, cross-reference what you read here against multiple primary sources before making decisions.

  • Levels.fyi — Crowdsourced compensation data with real recent offers across tech employers
  • Glassdoor — Self-reported interviews, salaries, and employee reviews searchable by company
  • Blind by Teamblind — Anonymous discussions about specific companies, often the freshest signal on layoffs, comp, culture, and team-level reputation
  • LinkedIn People Search — Find current employees by company, role, and location for warm-network outreach and informational interviews

These are starting points, not the last word. Combine multiple sources, weight recent data over older, and treat anonymous reports as signal that needs corroboration.