evaluate
Skill by surus-lat
Run benchy evaluations against models or systems. Covers the canonical smoke→full workflow, config selection, task filtering, exit policies, and reading run_outcome.json. Use when asked to evaluate, benchmark, or run benchy against a model or system config.
Details
- Path: .agent/skills/evaluate/SKILL.md