Design and implement a high-performance transformer decoder layer that supports:

1. Flash-Attention-style tiled self-attention, so that the full N×N attention matrix is never materialized;
2. an explicit KV cache that is updated incrementally during autoregressive generation;
3. Rotary Position Embedding (RoPE) applied to queries and keys, so that attention scores depend only on relative position.

Your module will be called repeatedly during generation: the first call receives the full prefix sequence; every subsequent call receives a single new token, which must be appended to the KV cache. Every call must return the next-token logits.

The implementation must run entirely on a single GPU, and each attention tile must fit within a streaming multiprocessor's SRAM (assume 192 KB). Expose a clear API that lets the caller control the tile size, the RoPE base frequency, and whether the KV cache is used. Only efficient inference is required; you do not need to implement the training backward pass.
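To make the required API concrete, here is a minimal NumPy sketch of the three pieces (RoPE, an incrementally updated KV cache, and online-softmax tiled attention that only ever holds one key tile of scores in memory). All names (`DecoderLayer`, `tile_size`, `rope_base`, `use_cache`) are illustrative assumptions, not prescribed by the task; a real solution would implement the inner loop as a fused GPU kernel sized to the 192 KB SRAM budget, but the tiling and cache logic are the same.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # x: (T, H, D) queries or keys; pos: (T,) absolute token positions.
    # Rotating each (x1, x2) pair by an angle proportional to position makes
    # q·k depend only on the relative position difference.
    T, H, D = x.shape
    half = D // 2
    inv_freq = base ** (-np.arange(half) / half)          # (half,)
    ang = pos[:, None] * inv_freq[None, :]                # (T, half)
    cos, sin = np.cos(ang)[:, None, :], np.sin(ang)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

class DecoderLayer:
    def __init__(self, d_model, n_heads, vocab_size,
                 tile_size=64, rope_base=10000.0, use_cache=True, seed=0):
        rng = np.random.default_rng(seed)
        self.h, self.dh = n_heads, d_model // n_heads
        s = d_model ** -0.5
        self.wq = rng.normal(0, s, (d_model, d_model))
        self.wk = rng.normal(0, s, (d_model, d_model))
        self.wv = rng.normal(0, s, (d_model, d_model))
        self.wo = rng.normal(0, s, (d_model, d_model))
        self.w_out = rng.normal(0, s, (d_model, vocab_size))
        self.tile, self.base, self.use_cache = tile_size, rope_base, use_cache
        self.reset()

    def reset(self):
        self.k_cache = np.zeros((0, self.h, self.dh))
        self.v_cache = np.zeros((0, self.h, self.dh))

    def _tiled_attention(self, q, k, v, q_pos):
        # Streaming softmax over key tiles (Flash-Attention style): keep a
        # running max m and normalizer l per query; never form the full
        # (Tq, Tk) score matrix, only one (Tq, tile) block at a time.
        Tq = q.shape[0]
        out = np.zeros((self.h, Tq, self.dh))
        m = np.full((self.h, Tq), -np.inf)
        l = np.zeros((self.h, Tq))
        scale = self.dh ** -0.5
        qh = q.transpose(1, 0, 2)                          # (H, Tq, D)
        for s0 in range(0, k.shape[0], self.tile):
            kt = k[s0:s0 + self.tile].transpose(1, 0, 2)   # (H, S, D)
            vt = v[s0:s0 + self.tile].transpose(1, 0, 2)
            scores = qh @ kt.transpose(0, 2, 1) * scale    # (H, Tq, S)
            key_pos = np.arange(s0, s0 + kt.shape[1])
            scores = np.where(key_pos[None, None, :] > q_pos[None, :, None],
                              -1e30, scores)               # causal mask
            m_new = np.maximum(m, scores.max(-1))
            p = np.exp(scores - m_new[..., None])
            alpha = np.exp(m - m_new)                      # rescale old stats
            l = l * alpha + p.sum(-1)
            out = out * alpha[..., None] + p @ vt
            m = m_new
        return (out / l[..., None]).transpose(1, 0, 2)     # (Tq, H, D)

    def __call__(self, x):
        # x: (T, d_model). T is the full prefix length on the first call and
        # 1 per call afterwards when the cache is enabled.
        T = x.shape[0]
        offset = self.k_cache.shape[0] if self.use_cache else 0
        pos = np.arange(offset, offset + T)
        q = rope((x @ self.wq).reshape(T, self.h, self.dh), pos, self.base)
        k = rope((x @ self.wk).reshape(T, self.h, self.dh), pos, self.base)
        v = (x @ self.wv).reshape(T, self.h, self.dh)
        if self.use_cache:
            self.k_cache = np.concatenate([self.k_cache, k])
            self.v_cache = np.concatenate([self.v_cache, v])
            k, v = self.k_cache, self.v_cache
        attn = self._tiled_attention(q, k, v, pos)
        h = attn.reshape(T, -1) @ self.wo
        return h[-1] @ self.w_out                          # next-token logits
```

A useful sanity check for the cache logic: prefilling a prefix and then feeding one new token should yield the same logits as resetting the cache and recomputing the whole sequence at once, up to floating-point tolerance.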