You are given a partially implemented GRPO (Group Relative Policy Optimization) training loop for a small language model. The code is supposed to perform one policy-update step on a batch of prompt/response pairs, but it currently fails to improve the reward metric and sometimes crashes with NaNs. Your task is to debug and fix exactly four hidden bugs that violate RL best-practice rules.

The skeleton code loads a batch of 64 prompts, samples 4 responses per prompt with the current policy, computes token-level log-probabilities, calculates rewards with a provided scalar reward model, forms advantages, and finally updates the policy with a clipped surrogate loss. You must locate and minimally patch the four bugs so that:

(a) sampling follows the correct categorical distribution,
(b) advantages are numerically stable when their standard deviation is zero,
(c) policy ratios are computed in log-space for numerical safety, and
(d) the PPO-style clipping actually constrains the policy change.

You are not allowed to change the overall algorithmic structure, hyper-parameters, or model architecture; only fix the four specific lines that violate the hints provided. After patching, the training loop should run without NaNs and achieve monotonically increasing reward on a provided unit-test tracker.
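The skeleton itself is not shown here, so the following is a minimal NumPy sketch of the four *correct* patterns the patched lines should follow; all variable names (`logits`, `rewards`, `logp_new`, `logp_old`, `eps`) are hypothetical stand-ins, not identifiers from the actual skeleton:

```python
import numpy as np

rng = np.random.default_rng(0)

# (a) Sample from the categorical distribution over probabilities, not raw
# logits: apply a (numerically stabilized) softmax first. A common bug is
# passing unnormalized logits directly as sampling weights.
logits = rng.normal(size=8)                 # hypothetical next-token logits
probs = np.exp(logits - logits.max())       # subtract max for stability
probs /= probs.sum()
token = rng.choice(len(probs), p=probs)

# (b) Group-relative advantages with an epsilon guard, so a group in which
# all responses receive identical rewards (std == 0) cannot divide by zero.
rewards = np.array([1.0, 1.0, 1.0, 1.0])    # degenerate group of 4 responses
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# (c) Policy ratio computed in log-space: exp(logp_new - logp_old) stays
# finite, whereas exp(logp_new) / exp(logp_old) underflows to 0/0 for long
# sequences with very negative log-probabilities.
logp_new, logp_old = -250.0, -251.0
ratio = np.exp(logp_new - logp_old)

# (d) PPO-style clipped surrogate: take the MINIMUM of the unclipped and
# clipped objectives; using max (or dropping the min) lets the update
# escape the trust region and the clip constrains nothing.
eps = 0.2
A = 1.5                                     # hypothetical advantage value
surrogate = np.minimum(ratio * A, np.clip(ratio, 1 - eps, 1 + eps) * A)
loss = -surrogate                           # maximize surrogate via gradient descent
```

In a real PyTorch skeleton, (a) would typically be `torch.distributions.Categorical(logits=...)` (which normalizes internally) and (c)–(d) would operate on autograd tensors, but the arithmetic being checked is the same.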