You are given a mini-batch of activations X ∈ ℝ^{B×T×d} produced by a transformer layer, where B is the batch size, T is the sequence length, and d is the model dimension. Implement two normalization routines:
Batch Normalization (BN): for every feature index j ∈ {0…d−1} compute mean μ_j and variance σ_j² across the entire mini-batch and all time steps (i.e. over the B·T samples), then normalize each feature column and apply learnable scale γ_j and shift β_j.
Layer Normalization (LN): for every sample (b,t) compute mean μ_{b,t} and variance σ_{b,t}² across the feature dimension d, then normalize that token vector and apply the same learned scale γ ∈ ℝ^d and shift β ∈ ℝ^d.
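The key difference between the two definitions above is the reduction axes. A minimal NumPy illustration (the array shapes here are arbitrary examples, not part of the task):

```python
import numpy as np

# Toy activations with B=2, T=3, d=4
X = np.random.default_rng(0).normal(size=(2, 3, 4))

# BN: statistics per feature j, pooled over all B*T samples -> shape (d,)
bn_mean = X.mean(axis=(0, 1))

# LN: statistics per token (b, t), over the d features -> shape (B, T, 1)
ln_mean = X.mean(axis=-1, keepdims=True)

print(bn_mean.shape)  # (4,)
print(ln_mean.shape)  # (2, 3, 1)
```

So BN's statistics couple every sample in the batch, while LN's are computed independently per token.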
Write the two functions

    batch_norm(X, γ, β, eps=1e-5, momentum=0.1, training=True, running_mean=None, running_var=None)
    layer_norm(X, γ, β, eps=1e-5)
that return the normalized tensor and, for batch_norm when training=True, the updated running statistics. You may use NumPy or PyTorch-style tensor operations. Explain in comments why transformers prefer layer norm over batch norm for variable-length sequences and small batches.
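One possible NumPy solution is sketched below. It follows the signatures given above (ASCII names gamma/beta stand in for γ/β); the running-statistics update uses the standard exponential-moving-average convention, which is an assumption since the task does not pin down the exact update rule.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5, momentum=0.1, training=True,
               running_mean=None, running_var=None):
    """BN over a (B, T, d) tensor: one mean/var per feature, pooled over B*T samples."""
    B, T, d = X.shape
    if running_mean is None:
        running_mean = np.zeros(d)
    if running_var is None:
        running_var = np.ones(d)
    if training:
        mu = X.mean(axis=(0, 1))   # shape (d,)
        var = X.var(axis=(0, 1))   # shape (d,)
        # Exponential moving average of batch statistics (assumed convention)
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # At inference, use the accumulated running statistics
        mu, var = running_mean, running_var
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta, running_mean, running_var

def layer_norm(X, gamma, beta, eps=1e-5):
    """LN over a (B, T, d) tensor: one mean/var per token (b, t), over the d features.

    Transformers prefer LN because its statistics are computed per token:
    they do not depend on batch size, on other sequences in the batch, or on
    padding in variable-length sequences, so LN behaves identically at train
    and test time and remains stable even at batch size 1. BN statistics are
    noisy for small batches and contaminated by padded positions.
    """
    mu = X.mean(axis=-1, keepdims=True)    # shape (B, T, 1)
    var = X.var(axis=-1, keepdims=True)    # shape (B, T, 1)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta
```

With gamma = 1 and beta = 0, each normalized feature column (BN) or token vector (LN) has mean ≈ 0 and standard deviation ≈ 1, which is a convenient sanity check.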