Jun 5 · link
Agent evals and fake precision
Saving this because it explains why agent evals often become fake precision: clean numbers on a benchmark that doesn't resemble the real workflow.
A messy index of what I’m noticing: work, cities, meals, music, films, links, clips, and half-formed thoughts.
1 entry · #ai