[ INFO ] category: Coding | difficulty: medium | freq: Must | first seen: 2026-03-13
[MEDIUM][CODING][MUST]
$ cat problem.md
The Transformer Debug interview question at OpenAI is a production-oriented problem-solving task designed to evaluate your ability to diagnose and fix real-world issues in a transformer training pipeline. You are typically provided with a small implementation of a Transformer (often in PyTorch) and a training script that contains multiple "bugs"—some that prevent execution and others that prevent the model from learning.
Problem Statement Overview
The core task is to identify and resolve defects across the entire model stack so that the code successfully trains a classifier or language model to pass specific unit tests and performance metrics.
Commonly Included Bugs
The "broken" code usually includes a mix of these defects:
Architecture & Masking:
Causal Masking: Missing or incorrectly implemented causal masks, allowing the model to "see" future tokens during training.
Positional Encoding: Missing, improperly added, or incorrectly scaled positional embeddings.
KV Caching: Inefficiencies or errors in implementing Key-Value (KV) cache logic for inference.
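The masking bug is easiest to see in isolation. Below is a minimal sketch of scaled dot-product attention with a causal mask; the function name and shapes are illustrative, not taken from the actual interview code. Deleting the `masked_fill` line reproduces the classic look-ahead bug: every position can attend to future tokens.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, seq_len, seq_len)
    seq_len = scores.size(-1)
    # True above the diagonal: position i must not attend to j > i.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # the buggy version omits this
    return F.softmax(scores, dim=-1) @ v
```

A quick sanity check: perturbing the value vector at a later position must leave all earlier outputs unchanged.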
Numerical Stability:
Layer Norm Placement: Pre-LN vs. Post-LN issues causing gradient instability.
Softmax Overflow: Computing softmax without subtracting the row maximum, so large logits overflow under mixed precision (FP16/AMP) and produce NaN values.
Learning Rate Warmup: Absence of a warmup period causing early training collapse.
Training Hygiene:
Target Shifting: Failing to shift labels for next-token prediction (predicting the input itself).
Optimizer Errors: Missing zero_grad() calls or incorrectly configured weight decay (AdamW).
Data Leakage: Features leaking into labels or incorrect handling of padding tokens (ignore_index).
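The target-shifting and padding bugs often appear together in the loss function. Here is a minimal sketch of a correct next-token loss; `PAD_ID` and the tensor shapes are assumptions for illustration, not details from the interview code.

```python
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumed padding token id

def next_token_loss(logits, tokens):
    # logits: (batch, seq_len, vocab); tokens: (batch, seq_len)
    # Position t predicts token t+1. The buggy version compares
    # logits[:, t] against tokens[:, t], i.e. the model predicts its own input.
    shift_logits = logits[:, :-1, :]
    shift_labels = tokens[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=PAD_ID,  # padding positions contribute no loss
    )
```

With the shift in place, logits that perfectly predict the next token drive the loss to zero; without it, the same logits would be scored against the wrong targets.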
Expected Deliverables
Candidates are generally required to provide:
Corrected Code: A runnable PyTorch implementation where the training loss decreases and tests pass.
Defect Analysis: A list or explanation of each bug found, its root cause, and how the fix addresses it.
Unit Tests: Writing new PyTest-style tests to verify components like attention shapes, mask correctness, and gradient flow.
Metric Reporting: Calculation of standard metrics such as Accuracy, F1-score, or ROC-AUC for classification tasks.
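Deliverable-style unit tests might look like the sketch below. The `make_causal_mask` helper is hypothetical, standing in for whatever mask-building code the exercise provides; the two tests cover mask correctness and basic gradient flow.

```python
import torch

def make_causal_mask(seq_len):
    # Hypothetical helper under test: True where attention must be blocked.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def test_mask_blocks_future_positions():
    mask = make_causal_mask(5)
    for i in range(5):
        for j in range(5):
            assert mask[i, j] == (j > i)

def test_gradient_flows_through_masked_attention():
    # The loss at the last position should yield finite gradients on the input.
    x = torch.randn(1, 5, 4, requires_grad=True)
    scores = x @ x.transpose(-2, -1)
    scores = scores.masked_fill(make_causal_mask(5), float("-inf"))
    out = torch.softmax(scores, dim=-1) @ x
    out[0, -1].sum().backward()
    assert x.grad is not None and torch.isfinite(x.grad).all()
```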
Typical "Symptoms" to Diagnose
Plateauing Metrics: Validation accuracy stays near chance (e.g., 50% for binary classification) while training loss goes to zero.
CUDA Errors: Indexing or shape mismatches specifically around attention head manipulation or broadcasting.
Nondeterminism: Results that vary wildly across runs despite setting a seed.
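The nondeterminism symptom usually traces to seeding only one RNG. A minimal sketch of a fuller seeding routine, assuming a typical PyTorch pipeline where Python's `random` and NumPy drive shuffling and augmentation:

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0):
    # Seeding torch alone is not enough: data shuffling and augmentation
    # often draw from Python's `random` and NumPy as well.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Fail loudly if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
```

After this, two runs with the same seed should produce identical random draws; on GPU, some CUDA kernels additionally need `CUBLAS_WORKSPACE_CONFIG` set.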