[ INFO ] category: Coding | difficulty: medium | freq: Must | first seen: 2026-03-13
[MEDIUM][CODING][MUST]
$ cat problem.md
The Transformer Debug interview question at OpenAI is a production-oriented problem-solving task designed to evaluate your ability to diagnose and fix real-world issues in a transformer training pipeline. You are typically provided with a small implementation of a Transformer (often in PyTorch) and a training script that contains multiple "bugs"—some that prevent execution and others that prevent the model from learning.
Problem Statement Overview
The core task is to identify and resolve defects across the entire model stack so that the code successfully trains a classifier or language model to pass specific unit tests and performance metrics.
Commonly Included Bugs
The "broken" code usually includes a mix of these defects:
Architecture & Masking:
Causal Masking: Missing or incorrectly implemented causal masks, allowing the model to "see" future tokens during training.
Positional Encoding: Missing, improperly added, or incorrectly scaled positional embeddings.
KV Caching: Inefficiencies or errors in implementing Key-Value (KV) cache logic for inference.
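The masking bug is easiest to see in isolation. Below is a minimal sketch of scaled dot-product attention with a causal mask; the function name and shapes are illustrative, not taken from the actual interview code. Deleting the `masked_fill` line reproduces the classic look-ahead bug: every position can attend to future tokens.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, seq_len, seq_len)
    seq_len = scores.size(-1)
    # True above the diagonal: position i must not attend to j > i.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # the buggy version omits this
    return F.softmax(scores, dim=-1) @ v
```

A quick sanity check: perturbing the value vector at a later position must leave all earlier outputs unchanged.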
Numerical Stability:
Layer Norm Placement: Pre-LN vs. Post-LN issues causing gradient instability.
Softmax Overflow: Computing softmax without subtracting the row maximum, so large logits overflow under mixed precision (FP16/AMP) and produce NaN values.
Learning Rate Warmup: Absence of a warmup period causing early training collapse.
Training Hygiene:
Target Shifting: Failing to shift labels for next-token prediction (predicting the input itself).
Optimizer Errors: Missing zero_grad() calls or incorrectly configured weight decay (AdamW).
Data Leakage: Features leaking into labels or incorrect handling of padding tokens (ignore_index).
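The target-shifting and padding bugs often appear together in the loss function. Here is a minimal sketch of a correct next-token loss; `PAD_ID` and the tensor shapes are assumptions for illustration, not details from the interview code.

```python
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumed padding token id

def next_token_loss(logits, tokens):
    # logits: (batch, seq_len, vocab); tokens: (batch, seq_len)
    # Position t predicts token t+1. The buggy version compares
    # logits[:, t] against tokens[:, t], i.e. the model predicts its own input.
    shift_logits = logits[:, :-1, :]
    shift_labels = tokens[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=PAD_ID,  # padding positions contribute no loss
    )
```

With the shift in place, logits that perfectly predict the next token drive the loss to zero; without it, the same logits would be scored against the wrong targets.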
Expected Deliverables
Candidates are generally required to provide:
Corrected Code: A runnable PyTorch implementation where the training loss decreases and tests pass.
Defect Analysis: A list or explanation of each bug found, its root cause, and how the fix addresses it.
Unit Tests: Writing new PyTest-style tests to verify components like attention shapes, mask correctness, and gradient flow.
Metric Reporting: Calculation of standard metrics such as Accuracy, F1-score, or ROC-AUC for classification tasks.
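Deliverable-style unit tests might look like the sketch below. The `make_causal_mask` helper is hypothetical, standing in for whatever mask-building code the exercise provides; the two tests cover mask correctness and basic gradient flow.

```python
import torch

def make_causal_mask(seq_len):
    # Hypothetical helper under test: True where attention must be blocked.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def test_mask_blocks_future_positions():
    mask = make_causal_mask(5)
    for i in range(5):
        for j in range(5):
            assert mask[i, j] == (j > i)

def test_gradient_flows_through_masked_attention():
    # The loss at the last position should yield finite gradients on the input.
    x = torch.randn(1, 5, 4, requires_grad=True)
    scores = x @ x.transpose(-2, -1)
    scores = scores.masked_fill(make_causal_mask(5), float("-inf"))
    out = torch.softmax(scores, dim=-1) @ x
    out[0, -1].sum().backward()
    assert x.grad is not None and torch.isfinite(x.grad).all()
```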
Typical "Symptoms" to Diagnose
Plateauing Metrics: Validation accuracy stays near chance (e.g., 50% for binary classification) while training loss goes to zero.
CUDA Errors: Indexing or shape mismatches specifically around attention head manipulation or broadcasting.
Nondeterminism: Results that vary wildly across runs despite setting a seed.
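The nondeterminism symptom usually traces to seeding only one RNG. A minimal sketch of a fuller seeding routine, assuming a typical PyTorch pipeline where Python's `random` and NumPy drive shuffling and augmentation:

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0):
    # Seeding torch alone is not enough: data shuffling and augmentation
    # often draw from Python's `random` and NumPy as well.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Fail loudly if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
```

After this, two runs with the same seed should produce identical random draws; on GPU, some CUDA kernels additionally need `CUBLAS_WORKSPACE_CONFIG` set.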