Discussion about this post

Shwetank Kumar

This is a great primer. The part that keeps coming up in practice is how differently the same model scores depending on what you're actually measuring. AISLE's cybersecurity benchmark this week showed model rankings completely reshuffled across tasks — no stable best model. Evals are only as useful as the specificity of the question you're asking them!
