
What to Keep When Evaluating AI

Evaluating an AI system requires more than collecting successful examples. Failure cases, the criteria used to judge them, and the points where human review takes over need to be recorded as well.

LLM-based tools make the gap between “looks good once” and “works repeatedly” especially visible. Evaluation notes should preserve at least the following:

  • where the system works well
  • where it fails
  • whether the failure is risky or merely inconvenient
  • where human review is required
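
One way to keep these four items together is a small record per evaluated case. The following is a minimal sketch, not a prescribed format: `EvalNote`, `Severity`, `summarize`, and all field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    """How much a failure matters (hypothetical two-level scale)."""
    INCONVENIENT = "inconvenient"
    RISKY = "risky"


@dataclass
class EvalNote:
    """One evaluated case; field names are illustrative assumptions."""
    case: str                            # short description of what was tried
    worked: bool                         # True if the system handled the case well
    severity: Optional[Severity] = None  # only meaningful when worked is False
    needs_human_review: bool = False     # True if a person must check the output
    criteria: str = ""                   # the criterion the case was judged against


def summarize(notes: List[EvalNote]) -> dict:
    """Count successes, failures, risky failures, and review-required cases."""
    failures = [n for n in notes if not n.worked]
    return {
        "works": sum(1 for n in notes if n.worked),
        "fails": len(failures),
        "risky_failures": sum(1 for n in failures if n.severity is Severity.RISKY),
        "needs_human_review": sum(1 for n in notes if n.needs_human_review),
    }
```

Recording failures in the same structure as successes keeps a single "looks good once" example from standing in for the whole evaluation, and the severity field separates risky failures from merely inconvenient ones.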

This note is a public fragment that may later grow into Writing or a Lab.