We need to build a lightweight, automated loop for sanity-checking and evaluating Wizard integrations whenever the underlying model, prompts, or resources change.
model + prompt + context → Wizard output → check (by human or model) → saved run
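As a minimal sketch, one saved run could look like the record below, assuming a JSON-lines store; the `eval_runs.jsonl` path and all field names are placeholders, and the Wizard invocation and check step are stubbed out.

```python
import json
import time
from dataclasses import dataclass, asdict
from pathlib import Path

RUNS_FILE = Path("eval_runs.jsonl")  # placeholder location for saved runs


@dataclass
class WizardRun:
    """One pass through the loop: inputs, Wizard output, check result."""
    model: str
    prompt: str
    context: str      # e.g. which boilerplate example app the Wizard integrated with
    output: str       # what the Wizard produced
    verdict: str      # "pass" / "fail" / free-form notes from the checker
    checked_by: str   # "human" or the name of the judge model
    timestamp: float


def save_run(run: WizardRun) -> None:
    """Append the run as one JSON line so history is easy to query and diff."""
    with RUNS_FILE.open("a") as f:
        f.write(json.dumps(asdict(run)) + "\n")


save_run(WizardRun(
    model="model-under-test",
    prompt="Integrate the Wizard into this app.",
    context="boilerplate-app-1",
    output="<wizard output goes here>",
    verdict="pass",
    checked_by="human",
    timestamp=time.time(),
))
```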
- Create a stable of boilerplate example apps for the Wizard to integrate with
- Run nightly or on CI/CD
- Save snapshots of the Wizard's output and diffs (see the diff sketch after this list)
- Save a summary of the diffs
- Save a qualitative evaluation of the diffs, graded against benchmarks or by humans/LLMs acting as judges (see the judge sketch after this list)
- Reporting and alerting
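For the snapshot-and-diff steps, a sketch using the standard library's difflib; the `snapshots/` layout and file naming are assumptions.

```python
import difflib
from pathlib import Path

SNAPSHOT_DIR = Path("snapshots")  # assumed layout: one snapshot file per example app


def snapshot_and_diff(app_name: str, new_output: str) -> str:
    """Save the Wizard's latest output and return a unified diff against the previous snapshot."""
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    current = SNAPSHOT_DIR / f"{app_name}.txt"
    previous = current.read_text().splitlines(keepends=True) if current.exists() else []

    diff = "".join(difflib.unified_diff(
        previous,
        new_output.splitlines(keepends=True),
        fromfile=f"{app_name} (previous run)",
        tofile=f"{app_name} (this run)",
    ))
    current.write_text(new_output)                          # overwrite the snapshot for next time
    (SNAPSHOT_DIR / f"{app_name}.diff").write_text(diff)    # keep the diff alongside it
    return diff
```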
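For the LLM-as-judge step, a sketch of the shape only: the judge prompt is illustrative and `call_judge_model` is a stub, since the judge model and client API aren't decided here.

```python
JUDGE_PROMPT = """You are reviewing a diff produced by the Wizard against a boilerplate app.
Rate the integration from 1-5 and explain any regressions.

Diff:
{diff}
"""


def call_judge_model(prompt: str) -> str:
    """Stub: replace with a call to whichever judge model/API we settle on."""
    raise NotImplementedError


def judge_diff(diff: str) -> str:
    """Ask an LLM judge for a qualitative evaluation of a single diff."""
    return call_judge_model(JUDGE_PROMPT.format(diff=diff))
```

A nightly CI job would then loop over the example apps, run the Wizard, call these helpers, and feed failures or low judge scores into reporting and alerting.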