drpo-llm-rl

Divergence Regularized Policy Optimization (DRPO): a smooth, advantage-weighted quadratic regularizer that replaces hard trust-region masks in LLM reinforcement learning. Use it when optimizing LLM post-training with RL, or when you need better stability than GRPO, PPO, or DPPO.

Details

Path: collection/skills/drpo-llm-rl/SKILL.md

FAQ