Log

A messy index of what I’m noticing: work, cities, meals, music, films, links, clips, and half-formed thoughts.

1 entry · #ai

Jun 5 · link

Agent evals and fake precision

Saving this because it explains why agent evals often produce fake precision: clean numbers on a benchmark that doesn't resemble the real workflow.

Anthropic

Log · Arjun Aggarwal