drpo-llm-rl
Skill by hiyenwong
Divergence Regularized Policy Optimization (DRPO): a smooth, advantage-weighted quadratic regularizer that replaces hard trust-region masks in LLM reinforcement learning. Use when optimizing LLM post-training with RL, or when improving on the stability of GRPO, PPO, or DPPO.
Details
- Path: collection/skills/drpo-llm-rl/SKILL.md