benchmark-validation

Use when deciding whether a benchmark merits deeper, benchmark-first mechanistic interpretability work. Checks public availability, runnable access, label richness, product relevance, the likely richness of mechanistic questions, scale, and obvious confounds before you invest in latent-label work.
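
The criteria listed above could be tracked as a simple screening record. The following is only a minimal sketch, not the skill's actual procedure: the class name `BenchmarkScreen`, its field names, and the helper `open_items` are all hypothetical, and the yes/no/unknown encoding is an assumption made for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BenchmarkScreen:
    """One answer per criterion named in the description (hypothetical names).

    True = criterion satisfied, False = not satisfied, None = not assessed yet.
    """
    publicly_available: Optional[bool] = None
    runnable_access: Optional[bool] = None
    rich_labels: Optional[bool] = None
    product_relevant: Optional[bool] = None
    rich_mechanistic_questions: Optional[bool] = None
    sufficient_scale: Optional[bool] = None
    free_of_obvious_confounds: Optional[bool] = None


def open_items(screen: BenchmarkScreen) -> dict[str, list[str]]:
    """Group criteria into those that failed and those still unassessed."""
    failed = [f.name for f in fields(screen) if getattr(screen, f.name) is False]
    unassessed = [f.name for f in fields(screen) if getattr(screen, f.name) is None]
    return {"failed": failed, "unassessed": unassessed}


# Example: a benchmark that is public and runnable but has a known confound.
screen = BenchmarkScreen(
    publicly_available=True,
    runnable_access=True,
    free_of_obvious_confounds=False,
)
print(open_items(screen))
```

Keeping "not assessed" distinct from "failed" makes it visible which checks still need doing before any latent-label investment is considered.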

Details

Path
.agents/skills/benchmark-validation/SKILL.md
