benchmark-validation
Skill by concordance-co
Use when deciding whether a benchmark merits deeper, benchmark-first mechanistic interpretability work. Covers public availability, runnable access, label richness, product relevance, the likely richness of mechanistic questions, scale, and obvious confounds, all checked before investing in latent-label work.
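The criteria listed above could be recorded as a simple pass/fail checklist. The following is only an illustrative sketch: the `BenchmarkScreen` class, its field names, and the all-criteria-must-pass rule are assumptions made for this example, not part of the skill itself, which may weigh the criteria differently.

```python
from dataclasses import dataclass, fields


@dataclass
class BenchmarkScreen:
    """Hypothetical checklist; one field per criterion named in the description."""
    publicly_available: bool
    runnable_access: bool
    rich_labels: bool
    product_relevant: bool
    rich_mechanistic_questions: bool
    sufficient_scale: bool
    free_of_obvious_confounds: bool


def unmet_criteria(screen: BenchmarkScreen) -> list[str]:
    """Return the names of criteria that failed, in declaration order."""
    return [f.name for f in fields(screen) if not getattr(screen, f.name)]


def worth_deeper_work(screen: BenchmarkScreen) -> bool:
    # Assumption: a benchmark qualifies only if every criterion passes.
    return not unmet_criteria(screen)


# Made-up example values, not an assessment of any real benchmark.
example = BenchmarkScreen(
    publicly_available=True,
    runnable_access=True,
    rich_labels=False,
    product_relevant=True,
    rich_mechanistic_questions=True,
    sufficient_scale=True,
    free_of_obvious_confounds=False,
)
print(unmet_criteria(example))  # ['rich_labels', 'free_of_obvious_confounds']
```

Listing the unmet criteria, rather than returning a single yes/no, keeps the reason for rejecting a benchmark visible before any latent-label work begins.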
Details
- Path: .agents/skills/benchmark-validation/SKILL.md