Draft Outline
- Separate retrieval quality, generation quality, and agent task-completion quality
- Track source coverage, unsupported claims, tool-use errors, and recovery behavior
- Use golden sets, adversarial questions, traces, and human review rubrics
- Connect evaluation to deployment gates, monitoring, and regression testing
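
One possible way to make the bullets above concrete is a small scoring harness that keeps the retrieval, generation, and agent metrics separate and then feeds them into a release gate. This is a minimal sketch, not a prescribed design: `EvalCase`, `ToolCall`, `summarize`, `gate`, the field names, the thresholds in `GATES`, and the regression `tolerance` are all hypothetical and would need to be set from a team's own golden set and review rubric.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToolCall:
    failed: bool = False     # the call raised or returned an error
    recovered: bool = False  # the agent later retried or worked around it


@dataclass
class EvalCase:
    """One golden-set or adversarial item after a traced run (hypothetical schema)."""
    gold_sources: set[str]       # sources a correct answer should draw on
    retrieved_sources: set[str]  # sources the retriever actually returned
    claims_total: int            # claims counted in the generated answer
    claims_unsupported: int      # claims a reviewer could not tie to a source
    task_completed: bool         # whether the agent finished the task end to end
    tool_calls: list[ToolCall] = field(default_factory=list)


def _ratio(num: float, den: float, default: float) -> float:
    # Avoid division by zero when a run has no claims, calls, or failures.
    return num / den if den else default


def summarize(cases: list[EvalCase]) -> dict[str, float]:
    """Aggregate retrieval, generation, and agent metrics separately."""
    gold = sum(len(c.gold_sources) for c in cases)
    hit = sum(len(c.gold_sources & c.retrieved_sources) for c in cases)
    claims = sum(c.claims_total for c in cases)
    unsupported = sum(c.claims_unsupported for c in cases)
    calls = [t for c in cases for t in c.tool_calls]
    failed = [t for t in calls if t.failed]
    return {
        "source_coverage": _ratio(hit, gold, 1.0),                    # retrieval
        "unsupported_claim_rate": _ratio(unsupported, claims, 0.0),   # generation
        "tool_error_rate": _ratio(len(failed), len(calls), 0.0),      # agent
        "recovery_rate": _ratio(sum(t.recovered for t in failed), len(failed), 1.0),
        "task_completion": _ratio(sum(c.task_completed for c in cases), len(cases), 0.0),
    }


# Assumed thresholds for illustration only: "min" is a floor, "max" is a ceiling.
GATES = {
    "source_coverage": ("min", 0.80),
    "unsupported_claim_rate": ("max", 0.05),
    "tool_error_rate": ("max", 0.10),
    "recovery_rate": ("min", 0.70),
    "task_completion": ("min", 0.85),
}


def gate(metrics, baseline=None, tolerance=0.02):
    """Return a list of reasons to block deployment; an empty list means pass."""
    failures = []
    for name, (kind, limit) in GATES.items():
        value = metrics[name]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(f"{name}={value:.2f} violates {kind} {limit:.2f}")
        if baseline is not None:
            # Regression check: how much worse than the last accepted run.
            delta = value - baseline[name]
            worse = -delta if kind == "min" else delta
            if worse > tolerance:
                failures.append(f"{name} regressed by {worse:.2f} vs baseline")
    return failures


if __name__ == "__main__":
    cases = [
        EvalCase({"a", "b"}, {"a", "b", "c"}, 4, 0, True,
                 [ToolCall(failed=True, recovered=True), ToolCall()]),
        EvalCase({"d", "e"}, {"d"}, 6, 3, False,
                 [ToolCall(failed=True, recovered=False)]),
    ]
    metrics = summarize(cases)
    for reason in gate(metrics):
        print("BLOCK:", reason)
```

Keeping the five numbers separate is the point of the sketch: a drop in `source_coverage` points at the retriever, a rise in `unsupported_claim_rate` at generation, and the tool and completion metrics at the agent loop. The same `gate` call could run in CI against a stored baseline for regression testing and on sampled production traces for monitoring.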