Implement a decaying attention mechanism in PyTorch. Given query (Q), key (K), and value (V) tensors of shape (batch_size, num_heads, seq_len, d_k), compute an attention matrix that down-weights tokens as their distance from the query position increases. Specifically, before applying softmax, add a bias matrix B to the scaled dot-product attention scores, where B[i][j] = -|i - j|. Because softmax exponentiates its inputs, this linear bias multiplies each unnormalized weight by exp(-|i - j|), so nearby tokens receive higher attention and distant tokens receive exponentially lower attention. Your function should return the final attention output of shape (batch_size, num_heads, seq_len, d_k). Make sure to scale the QK^T product by 1/sqrt(d_k) before adding the bias. The bias addition must be numerically stable, and the entire operation should use PyTorch broadcasting rather than explicit loops.
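One possible sketch of such a function is below. The name `decaying_attention` and the absence of masking/dropout are assumptions, not part of the specification; PyTorch's `torch.softmax` subtracts the row maximum internally, which covers the numerical-stability requirement.

```python
import math
import torch

def decaying_attention(Q, K, V):
    # Q, K, V: (batch_size, num_heads, seq_len, d_k); name is illustrative.
    d_k = Q.size(-1)
    seq_len = Q.size(-2)

    # Scaled dot-product scores: (batch, heads, seq_len, seq_len).
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Distance bias B[i][j] = -|i - j|, shape (seq_len, seq_len).
    pos = torch.arange(seq_len, device=Q.device)
    bias = -(pos.unsqueeze(0) - pos.unsqueeze(1)).abs().to(scores.dtype)

    # Broadcasting: (seq_len, seq_len) expands to (1, 1, seq_len, seq_len).
    scores = scores + bias

    # torch.softmax is numerically stable (max-subtraction is done internally).
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, V)
```

A quick sanity check: because each softmax row sums to 1, a constant V must pass through unchanged, and the output shape must match the input shape.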