We need to build a lightweight, automated loop for sanity-checking and evaluating Wizard integrations whenever the underlying model, prompts, or resources change.
model + prompt + context → Wizard output → check (by human or model) → saved run
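As a minimal sketch, one saved run could look like the record below, assuming a JSON-lines store; the `eval_runs.jsonl` path and all field names are placeholders, and the Wizard invocation and check step are stubbed out.

```python
import json
import time
from dataclasses import dataclass, asdict
from pathlib import Path

RUNS_FILE = Path("eval_runs.jsonl")  # placeholder location for saved runs


@dataclass
class WizardRun:
    """One pass through the loop: inputs, Wizard output, check result."""
    model: str
    prompt: str
    context: str      # e.g. which boilerplate example app the Wizard integrated with
    output: str       # what the Wizard produced
    verdict: str      # "pass" / "fail" / free-form notes from the checker
    checked_by: str   # "human" or the name of the judge model
    timestamp: float


def save_run(run: WizardRun) -> None:
    """Append the run as one JSON line so history is easy to query and diff."""
    with RUNS_FILE.open("a") as f:
        f.write(json.dumps(asdict(run)) + "\n")


save_run(WizardRun(
    model="model-under-test",
    prompt="Integrate the Wizard into this app.",
    context="boilerplate-app-1",
    output="<wizard output goes here>",
    verdict="pass",
    checked_by="human",
    timestamp=time.time(),
))
```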
- Create a stable of boilerplate example apps for the Wizard to integrate with
- Run nightly or on CI/CD
- Save snapshots of the Wizard's output and diffs (see the diff sketch after this list)
- Save a summary of the diffs
- Save a qualitative evaluation of the diffs, graded against benchmarks or by humans/LLMs acting as judges (see the judge sketch after this list)
- Reporting and alerting
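For the snapshot-and-diff steps, a sketch using the standard library's difflib; the `snapshots/` layout and file naming are assumptions.

```python
import difflib
from pathlib import Path

SNAPSHOT_DIR = Path("snapshots")  # assumed layout: one snapshot file per example app


def snapshot_and_diff(app_name: str, new_output: str) -> str:
    """Save the Wizard's latest output and return a unified diff against the previous snapshot."""
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    current = SNAPSHOT_DIR / f"{app_name}.txt"
    previous = current.read_text().splitlines(keepends=True) if current.exists() else []

    diff = "".join(difflib.unified_diff(
        previous,
        new_output.splitlines(keepends=True),
        fromfile=f"{app_name} (previous run)",
        tofile=f"{app_name} (this run)",
    ))
    current.write_text(new_output)                          # overwrite the snapshot for next time
    (SNAPSHOT_DIR / f"{app_name}.diff").write_text(diff)    # keep the diff alongside it
    return diff
```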
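For the LLM-as-judge step, a sketch of the shape only: the judge prompt is illustrative and `call_judge_model` is a stub, since the judge model and client API aren't decided here.

```python
JUDGE_PROMPT = """You are reviewing a diff produced by the Wizard against a boilerplate app.
Rate the integration from 1-5 and explain any regressions.

Diff:
{diff}
"""


def call_judge_model(prompt: str) -> str:
    """Stub: replace with a call to whichever judge model/API we settle on."""
    raise NotImplementedError


def judge_diff(diff: str) -> str:
    """Ask an LLM judge for a qualitative evaluation of a single diff."""
    return call_judge_model(JUDGE_PROMPT.format(diff=diff))
```

A nightly CI job would then loop over the example apps, run the Wizard, call these helpers, and feed failures or low judge scores into reporting and alerting.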