You are asked to implement the forward pass of a Transformer encoder block from scratch in NumPy (or in PyTorch without using nn.TransformerEncoderLayer). The block must include:

1) Multi-head self-attention with scaled dot-product attention,
2) a position-wise feed-forward network with GELU activation,
3) residual connections and layer normalization (pre-norm style),
4) dropout on the attention weights, the FFN hidden layer, and the residual paths.

The input is a tensor X ∈ ℝ^{b×n×d}, where b is the batch size, n the sequence length, and d the model dimension. You will also receive a mask (causal or padding) that must be applied to the attention scores before the softmax. Return a tensor of the same shape. You must not use any high-level transformer utilities; only basic linear algebra ops are allowed.
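One way the pieces above fit together is sketched below in NumPy. This is a minimal reference shape, not the required answer: the parameter names, the boolean mask convention (True = may attend), and the tanh approximation of GELU are all assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last (feature) axis, then scale and shift.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu(x):
    # Common tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dropout(x, p, rng, training):
    # Inverted dropout; identity at inference time.
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h, mask=None,
                         p=0.0, rng=None, training=False):
    b, n, d = x.shape
    dh = d // h
    # Project, then split d into h heads: (b, h, n, dh).
    q = (x @ Wq).reshape(b, n, h, dh).transpose(0, 2, 1, 3)
    k = (x @ Wk).reshape(b, n, h, dh).transpose(0, 2, 1, 3)
    v = (x @ Wv).reshape(b, n, h, dh).transpose(0, 2, 1, 3)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(dh)  # (b, h, n, n)
    if mask is not None:
        # mask broadcasts against (b, h, n, n); True means "may attend".
        scores = np.where(mask, scores, -1e9)
    attn = softmax(scores)
    attn = dropout(attn, p, rng, training)       # dropout on attention weights
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(b, n, d)
    return out @ Wo

def encoder_block(x, params, h, mask=None, p=0.0, rng=None, training=False):
    # Pre-norm: x + Drop(MHA(LN(x))), then x + Drop(FFN(LN(x))).
    a = multi_head_attention(
        layer_norm(x, params["g1"], params["b1"]),
        params["Wq"], params["Wk"], params["Wv"], params["Wo"],
        h, mask, p, rng, training)
    x = x + dropout(a, p, rng, training)          # residual path 1
    hid = gelu(layer_norm(x, params["g2"], params["b2"]) @ params["W1"] + params["c1"])
    hid = dropout(hid, p, rng, training)          # dropout on FFN hidden layer
    f = hid @ params["W2"] + params["c2"]
    return x + dropout(f, p, rng, training)       # residual path 2
```

A causal mask can be built with `np.tril(np.ones((n, n), dtype=bool))`; with it applied, the output at position t depends only on inputs up to t, which is a useful property to verify when testing.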