The OpenAI Research Engineer Interview — Paper Deep-Dives, Scaling, and Applied Work
OpenAI's Research Engineer loop grades for the ability to take a paper from PDF to a running cluster at scale. Here's the 2026 bar, the questions, and the prep path that actually works.
The OpenAI Research Engineer role is one of the most competitive hires in tech. Thousands of candidates apply per year, a small fraction reach the technical phone screen, and the onsite bar is calibrated to the median engineer at the frontier lab that has defined the decade in AI. The loop is not a FAANG loop. It does not grade you on Leetcode. It grades you on whether you can read a recent paper, ask three non-obvious questions about it, and then implement a variant on a 1000-GPU cluster without breaking anything.
This guide is the 2026 structure, rubric, and prep path. Sources are candidate debriefs on Blind and X, public OpenAI hiring posts, conversations with former and current OpenAI REs, and the patterns visible across the papers and systems OpenAI has shipped.
The Research Engineer role, defined
At OpenAI, 'Research Engineer' is not a junior version of 'Research Scientist.' It's a parallel track. Research Scientists own research directions; Research Engineers own the ability to make research run. In practice, strong REs write more production-grade code than RSs, own more infrastructure, and often co-author papers as first or co-first authors — particularly on the applied side. The implication for the interview: they want someone who reads papers fluently and writes systems fluently, and the loop tests both.
The primary teams hiring REs in 2026:
- Pretraining. The core team training base models. Research engineering here looks like owning data pipelines, training-loop performance, eval infrastructure, and specific loss/architecture experiments.
- Post-training. RLHF, RLAIF, instruction tuning, preference modeling. More experimental than pretraining and closer to the paper-reading surface.
- Multimodal. Vision, audio, video. Heavy on data and eval engineering.
- Safety and alignment. Red-teaming infra, interpretability tooling, eval-at-scale. Increasingly large in 2025-2026.
- Applied (the API-facing side). Latency, throughput, batching, tools, code execution sandboxes. Ship-oriented.
- Agents and reasoning. Chain-of-thought training, tool-use, reasoning-model-specific infra. Hot bucket in 2026.
The loop structure is largely the same across teams; the questions vary by team focus.
The loop
Typical 2026 loop:
- Recruiter screen. 30 min. Motivation, background, prior work at scale. OpenAI recruiters filter hard for candidates who can articulate specific technical interests.
- Technical phone screen 1 — coding. 60 min. Not Leetcode. Usually a numerical or ML-adjacent problem: implement attention, implement beam search with constraints, write a sampling function with nucleus plus repetition penalty, or debug a provided training loop.
- Technical phone screen 2 — ML depth. 60 min. Walk through a recent paper, deriving equations, or a domain deep-dive on your strongest area.
- Onsite or virtual onsite. 5 rounds, usually in one day.
- Coding round. More involved than phone screen. Implement a mini-training loop, an eval harness, or a specific algorithm from a paper.
- Paper deep-dive / research round. You present one of your projects, or the interviewer hands you a recent paper and asks you to explain, critique, and extend it.
- ML system design. Design a training or inference system at OpenAI scale.
- Applied ML problem-solving. Given a failure mode ('our model is hallucinating on code completion in a specific language'), walk through a debugging and fix strategy.
- Behavioral / values. OpenAI culture, team fit, motivation, alignment with mission.
- Hiring committee. Standard across OpenAI, similar to Google's model but faster. Decisions within 1-2 weeks of onsite.
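One phone-screen staple from the coding round above, a sampling function combining nucleus (top-p) filtering with a repetition penalty, can be sketched in plain NumPy. The penalty convention here (dividing positive logits, multiplying negative ones, in the style of the CTRL paper) is one common choice among several; function names and defaults are illustrative.

```python
import numpy as np

def sample_token(logits, prev_tokens, top_p=0.9, penalty=1.2, rng=None):
    """Nucleus (top-p) sampling with a repetition penalty.

    logits: 1-D array of unnormalized scores over the vocabulary.
    prev_tokens: iterable of token ids already generated.
    """
    rng = rng if rng is not None else np.random.default_rng()
    logits = logits.astype(np.float64).copy()

    # Repetition penalty: shrink logits of already-seen tokens. Positive
    # logits are divided, negative ones multiplied, so the penalty always
    # reduces the token's probability.
    for t in set(prev_tokens):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty

    # Softmax, shifted by the max for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Nucleus filter: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize over that set.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()

    return int(rng.choice(keep, p=kept))
```

With `top_p=1.0` and `penalty=1.0` this reduces to plain sampling from the softmax, which is a useful sanity check to mention in the interview.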
The loop compresses well when OpenAI wants you — strong candidates have gone from recruiter to offer in three weeks. It can also stretch if there's team-matching uncertainty.
What OpenAI actually grades on
- Paper fluency. You should read 2-5 papers a week. You should have opinions. You should be able to critique methodology, reproduce results, and propose extensions. At the interview, expect questions like 'you mentioned you read the Llama 3 paper, walk me through their data mixture and what you'd change.'
- Numerical discipline. FLOPs, tokens, memory, communication cost. Everyone at OpenAI thinks in Chinchilla-era scaling laws. If you can't do 6 × params × tokens in your head, you are not at bar.
- Systems chops. Python/PyTorch fluency is table stakes. You should have written CUDA or Triton at least once. You should know what happens inside torch.compile, what a custom backward pass looks like, and how to profile a training run.
- Scale intuition. OpenAI thinks at 10K-GPU-cluster scale. What works on 1 GPU breaks at 100. What works at 100 breaks at 10K. Candidates who reason only at the single-node level get downgraded.
- Research taste. For more senior REs, interviewers want to see opinions on what's important. 'What are you excited about in post-training in 2026?' A strong answer is specific, defensible, and somewhat contrarian. A weak answer lists three buzzwords.
- Execution speed. OpenAI ships fast, and the culture rewards bias for action. Candidates who describe 3-month planning cycles for experiments feel wrong to OpenAI interviewers. Candidates who describe a 3-day iteration loop feel right.
- Communication. Tight, dense, unhedged. OpenAI interviewers push hard, and candidates who hedge every statement lose the room.
- Safety reflex. Not an alignment-team-only thing. Every RE is expected to think about capability risk, misuse, and evaluation gaps. A candidate who dismisses safety as 'someone else's problem' gets marked down across teams.
What doesn't score: Leetcode grinding, lengthy recitations of architectures without critique, or name-dropping papers you've skimmed.
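The 6 × params × tokens rule of thumb above is the standard dense-transformer estimate for training compute: roughly 2ND FLOPs for the forward pass and 4ND for the backward. A minimal sketch of the mental arithmetic, with an assumed per-GPU throughput and MFU (both are assumptions, not OpenAI figures):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Back-of-envelope training FLOPs for a dense transformer:
    ~2*N*D for the forward pass plus ~4*N*D for the backward,
    i.e. the standard 6*N*D estimate."""
    return 6.0 * n_params * n_tokens

def gpu_days(flops: float, gpus: int,
             flops_per_gpu: float = 1e15, mfu: float = 0.4) -> float:
    """Wall-clock days at a given cluster size, assuming a peak per-GPU
    throughput (1e15 FLOP/s is a rough BF16 figure for a modern
    accelerator, an assumption) and 40% model FLOPs utilization."""
    seconds = flops / (gpus * flops_per_gpu * mfu)
    return seconds / 86_400

# A 400B-parameter model trained on 13T tokens:
flops = train_flops(400e9, 13e12)   # 3.12e25 FLOPs
```

This is exactly the 30-second calculation the interviewer expects: 6 × 4e11 × 1.3e13 = 3.12e25 FLOPs, then divide by cluster throughput to get wall-clock time.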
Example questions
From 2024-2026 loops:
- Walk through the Chinchilla paper. What's the ratio? Why? Where does it break?
- Implement attention in PyTorch. Now add grouped-query attention. Now make it numerically stable for BF16.
- Here's a training loss curve. The loss dips and then spikes at step 42K. What are the usual suspects, ranked?
- Given an eval that is noisy (different temperature seeds give different scores), design a procedure for comparing two checkpoints that controls false positives.
- Walk me through one recent paper you read that you think is wrong. Why?
- Design the data pipeline for a 13T-token pretraining run.
- Your RLHF run is reward-hacking — scores are going up but human eval is flat. Debugging path?
- Implement token-level speculative decoding with a draft model.
- Design an eval that would have caught a specific failure mode (say, models refusing reasonable requests).
- You have 1024 H100s for two weeks. Propose the most valuable experiment to run on them.
- Walk me through what happens in a forward pass of a transformer, layer by layer, with specific attention to where activation memory lives.
- Explain the difference between DPO, PPO, and the various online RL variants. When would you pick each?
Notice the pattern. Every question has both a theoretical and practical half. The interviewer wants both.
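For the attention question in the list above, a NumPy sketch of grouped-query attention with a numerically stable softmax (the max-subtraction trick that matters in BF16) might look like the following. Shapes and naming are illustrative, not any particular codebase's API, and a real interview answer would vectorize over heads rather than loop.

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Grouped-query attention.

    q: (n_heads, seq, d)    k, v: (n_kv_heads, seq, d)
    Each group of n_heads // n_kv_heads query heads shares one KV head.
    """
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                        # KV head this query head uses
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (seq, seq)
        # Stable softmax: subtract the row max before exponentiating,
        # which is what keeps this well-behaved in low precision.
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

With `n_kv_heads == n_heads` this reduces to standard multi-head attention; with `n_kv_heads == 1` it is multi-query attention. Being able to state that spectrum is part of the expected answer.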
What a strong answer looks like
For 'the eval is noisy — design a procedure for comparing two checkpoints,' a merely passing answer runs more samples and reports a mean. A strong answer:
- Names the right statistical tool. 'Paired bootstrap over the same prompts, not independent samples. The variance is per-prompt, so pairing gives much tighter intervals.'
- Picks a seed protocol. 'Five seeds per checkpoint per prompt. I'd set seeds per-prompt not globally so the comparison is truly paired.'
- Chooses the right test. 'Wilcoxon signed-rank on the per-prompt deltas, because the score distribution is not Gaussian for most LLM evals.'
- Controls for multiple testing. 'If we're evaluating 50 subtasks, Bonferroni or BH correction depending on how correlated the subtasks are.'
- Names the failure mode. 'The biggest risk is a subset of prompts being unstable — one prompt flipping decides the result. I'd add leave-one-out sensitivity.'
- Decides what signal to trust. 'If the delta is within 1 standard error of the pooled estimate, I don't ship. I either run more, or I decide it's a tie.'
That's a confident answer. Most candidates stop at step 1 or 2.
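The paired bootstrap named in step 1 takes only a few lines. The key is resampling prompts, not individual samples, so the per-prompt A/B pairing is preserved; the 95% interval and the bootstrap count here are illustrative choices.

```python
import numpy as np

def paired_bootstrap_delta(scores_a, scores_b, n_boot=10_000, seed=0):
    """Paired bootstrap CI for the mean per-prompt score difference.

    scores_a, scores_b: arrays of shape (n_prompts,), each entry the
    (seed-averaged) score of one checkpoint on one prompt. Resampling
    is over prompts with replacement, keeping the A/B pairing intact,
    which is what gives the tighter intervals.
    """
    deltas = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(deltas)
    idx = rng.integers(0, n, size=(n_boot, n))   # resampled prompt indices
    boot_means = deltas[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return deltas.mean(), (lo, hi)
```

If the interval excludes zero, the checkpoints differ at that confidence level; if it straddles zero, you run more prompts or call it a tie, exactly as in step 6.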
Paper deep-dive round: the actual mechanics
This round is OpenAI's signature and deserves its own section.
The interviewer hands you a recent paper (usually something in the last 6 months, often from a peer lab) and asks you to walk through it. They're grading:
- Do you read the paper or just the abstract?
- Can you identify the one or two key technical contributions vs the window-dressing?
- Do you notice methodology weaknesses the authors buried in the appendix?
- Can you propose a meaningful extension that would actually run?
- Do you correctly calibrate the importance of the work?
Prep strategy: pick 5-8 papers in your interest area from the last 12 months. For each, spend 3 hours writing notes: what the authors claim, what they actually proved, what's brittle, what you'd change, and what the follow-up would look like. Practice presenting each to a friend or LLM. In the interview, the goal is to pattern-match to one of your rehearsed deep-dives.
Common failure modes
- Leetcode prep. The loop is not Leetcode. Grinding array problems is low ROI.
- Surface-level paper reading. Candidates who've 'read' papers but can't derive the core equation or identify the ablations fail hard.
- No scale intuition. Reasoning only at the single-GPU level.
- Generic answers to 'what would you work on.' Saying 'I want to work on agents' without a specific, defensible thesis.
- No opinions. Candidates who hedge every claim. OpenAI interviewers read it as insufficient depth.
- Mission incoherence. Candidates who can't articulate why OpenAI specifically. The interviewer will ask, and 'I like AI' is a flunk.
Comp and the OpenAI PPU structure
OpenAI comp in 2026 is extraordinary for senior engineers. Typical bands:
- L3 RE (new grad or near): $200K-$240K base, $300K-$600K PPU per year (4-year grant amortized). TC $500K-$840K.
- L4 RE (mid-senior): $230K-$290K base, $700K-$1.5M PPU/yr. TC $930K-$1.8M.
- L5 RE (senior/staff): $290K-$360K base, $1.5M-$3M PPU/yr. TC $1.8M-$3.4M.
- L6 RE (principal): $340K-$420K base, $3M-$8M+ PPU/yr. TC $3.3M-$8.4M+.
PPUs (Profit Participation Units) are not equity in the traditional sense — they're a claim on a share of future profits, subject to OpenAI's capped-profit structure. They have traded on secondary markets at meaningful valuations during the 2023-2025 tender events, which is why the implied TC numbers are plausible. Do your diligence on vesting, cliff, and the current secondary market before signing.
Sign-on bonus: $50K-$300K depending on level. Always negotiable.
The negotiation lever at OpenAI is almost entirely PPU. Base is tightly banded. Senior candidates with competing Anthropic, DeepMind, or xAI offers routinely move OpenAI PPU grants 30-60%. Without a competing offer at a peer lab, the lever is mostly closed.
Prep strategy — 6 weeks
- Weeks 1-2: Read 12 papers in your area. Write structured notes on each. Practice presenting five.
- Weeks 3-4: Drill numerical estimation, scaling laws, and systems fundamentals. Implement one nontrivial thing from scratch (attention with Triton, a speculative-decoding loop, a DPO training loop).
- Week 5: Mock the paper deep-dive round 5 times. Mock the ML design round 3 times.
- Week 6: Rest the brain. Review notes. Sleep well.
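One of the 'implement something nontrivial' candidates from weeks 3-4, the DPO objective, takes only a few lines once you have per-sequence log-probs. This is a sketch of the loss from the DPO paper for a single preference pair, assuming you can obtain summed log-probs of the chosen and rejected completions under both the policy and the frozen reference model (getting those log-probs is the actual engineering work).

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-prob of a completion under the
    policy (pi_*) or the frozen reference model (ref_*). The loss is
    -log sigmoid(beta * (policy margin - reference margin)).
    """
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), written via log1p for
    # stability; for very negative x the loss is approximately -x.
    return math.log1p(math.exp(-x)) if x > -30 else -x
```

At zero margin the loss is log 2, and it decays toward zero as the policy separates chosen from rejected faster than the reference does, which is the behavior to verify before scaling the run up.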
The candidates who clear the OpenAI RE bar have three things in common: they read relentlessly, they have shipped something real at scale, and they have opinions they're willing to defend. If you can walk in, present a paper with the confidence of a first author, and estimate training FLOPs for a 400B model in under 30 seconds, you'll be in the hire pile. If you're a strong generalist without the paper reflex, you'll be filtered toward the Applied Engineer track, which is covered in a separate guide.
Sources and further reading
When evaluating any company's interview process, hiring bar, or compensation, cross-reference what you read here against multiple primary sources before making decisions.
- Levels.fyi — Crowdsourced compensation data with real recent offers across tech employers
- Glassdoor — Self-reported interviews, salaries, and employee reviews searchable by company
- Blind by Teamblind — Anonymous discussions about specific companies, often the freshest signal on layoffs, comp, culture, and team-level reputation
- LinkedIn People Search — Find current employees by company, role, and location for warm-network outreach and informational interviews
These are starting points, not the last word. Combine multiple sources, weight recent data over older, and treat anonymous reports as signal that needs corroboration.
Related guides
- Anthropic Research Engineer Interview in 2026 — Alignment, Evals, and the Research Take-Home — A focused guide to Anthropic research engineer interviews: what to expect, how to prepare for coding, research taste, evaluations, alignment thinking, and the research take-home without relying on hype.
- The OpenAI Applied Engineer Interview — Product Surfaces, Evals, and Shipping Fast — OpenAI's Applied Engineer loop grades for product velocity, eval discipline, and the judgment to ship an LLM feature that actually works. Here's the 2026 bar and prep path.
- OpenAI Interview Preparation — Research, Engineering, and Applied Roles in 2026 — A comprehensive, honest guide to navigating OpenAI's interview process across research, engineering, and applied roles — covering what to expect, how to prepare, and how to stand out in one of the most competitive hiring pipelines in tech.
- The Apple Machine Learning Interview: On-Device ML, Core ML, and Applied Research — Apple's ML loop is not OpenAI's. They grade for model-compression craft, privacy-preserving training, and shipping models that run on a phone in your pocket. Here's the actual bar in 2026.
- The Nvidia Machine Learning Interview — GPU Systems, CUDA Optimization, and Applied Research — Nvidia's ML loop doesn't look like Meta's or OpenAI's. They grade for GPU literacy, kernel-level intuition, and a working mental model of memory bandwidth. Here's the 2026 bar.
