evaluate

Run benchy evaluations against models or systems. Covers the canonical smoke→full workflow, config selection, task filtering, exit policies, and reading run_outcome.json. Use when asked to evaluate, benchmark, or run benchy against a model or system config.

Details

Path
.agent/skills/evaluate/SKILL.md
