You are given a partially implemented GRPO (Group Relative Policy Optimization) training loop for a small language model. The code is supposed to perform one policy-update step on a batch of prompt/response pairs, but it currently fails to improve the reward metric and sometimes crashes with NaNs. Your task is to debug and fix exactly four hidden bugs that violate RL best-practice rules.

The skeleton code loads a batch of 64 prompts, samples 4 responses per prompt with the current policy, computes token-level log-probabilities, calculates rewards with a provided scalar reward model, forms advantages, and finally updates the policy with a clipped surrogate loss. You must locate and minimally patch the four bugs so that:

(a) sampling follows the correct categorical distribution,
(b) advantages are numerically stable when their standard deviation is zero,
(c) policy ratios are computed in log-space for numerical safety, and
(d) the PPO-style clipping actually constrains the policy change.

You are not allowed to change the overall algorithmic structure, hyper-parameters, or model architecture; only fix the four specific lines that violate the hints provided. After patching, the training loop should run without NaNs and achieve monotonically increasing reward on a provided unit-test tracker.
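The skeleton itself is not shown here, so the following is a minimal NumPy sketch of the four *correct* patterns the patched lines should follow; all variable names (`logits`, `rewards`, `logp_new`, `logp_old`, `eps`) are hypothetical stand-ins, not identifiers from the actual skeleton:

```python
import numpy as np

rng = np.random.default_rng(0)

# (a) Sample from the categorical distribution over probabilities, not raw
# logits: apply a (numerically stabilized) softmax first. A common bug is
# passing unnormalized logits directly as sampling weights.
logits = rng.normal(size=8)                 # hypothetical next-token logits
probs = np.exp(logits - logits.max())       # subtract max for stability
probs /= probs.sum()
token = rng.choice(len(probs), p=probs)

# (b) Group-relative advantages with an epsilon guard, so a group in which
# all responses receive identical rewards (std == 0) cannot divide by zero.
rewards = np.array([1.0, 1.0, 1.0, 1.0])    # degenerate group of 4 responses
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# (c) Policy ratio computed in log-space: exp(logp_new - logp_old) stays
# finite, whereas exp(logp_new) / exp(logp_old) underflows to 0/0 for long
# sequences with very negative log-probabilities.
logp_new, logp_old = -250.0, -251.0
ratio = np.exp(logp_new - logp_old)

# (d) PPO-style clipped surrogate: take the MINIMUM of the unclipped and
# clipped objectives; using max (or dropping the min) lets the update
# escape the trust region and the clip constrains nothing.
eps = 0.2
A = 1.5                                     # hypothetical advantage value
surrogate = np.minimum(ratio * A, np.clip(ratio, 1 - eps, 1 + eps) * A)
loss = -surrogate                           # maximize surrogate via gradient descent
```

In a real PyTorch skeleton, (a) would typically be `torch.distributions.Categorical(logits=...)` (which normalizes internally) and (c)–(d) would operate on autograd tensors, but the arithmetic being checked is the same.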