dataset registry -> run orchestrator -> metric scorers -> drift analyzer
| | | |
v v v v
case stratification model/retriever quality deltas gate policy
+ gold citations variant matrix + CI trendline + release decision
|
v
report + artifacts
Why this matters: evaluations become enforceable release policy instead of passive dashboards.