
Commit 37fa003

Merge branch 'master' into ds-298-replace-old-tooltip-component
2 parents 467094b + f72bbaf commit 37fa003

207 files changed: +7831 −5987 lines


.github/workflows/storybook.yml

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ jobs:
           pnpm add --global wrangler

       - name: Deploy
-        uses: cloudflare/wrangler-action@da0e0dfe58b7a431659754fdf3f186c529afbe65
+        uses: cloudflare/wrangler-action@707f63750981584eb6abc365a50d441516fb04b8
         id: cloudflare_deployment
         with:
           apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}

packages/@n8n/ai-workflow-builder.ee/evaluations/README.md

Lines changed: 243 additions & 19 deletions
@@ -104,21 +104,29 @@ The Langsmith integration provides two key components:

 #### 6. Pairwise Evaluation

-Pairwise evaluation provides a simpler, criteria-based approach to workflow evaluation. Instead of using the complex multi-metric evaluation system, it evaluates workflows against a custom set of "do" and "don't" rules defined in the dataset.
+Pairwise evaluation provides a criteria-based approach to workflow evaluation with hierarchical scoring and multi-judge consensus. It evaluates workflows against a custom set of "do" and "don't" rules defined in the dataset.

 **Evaluator (`chains/pairwise-evaluator.ts`):**
 - Evaluates workflows against a checklist of criteria (dos and don'ts)
 - Uses an LLM to determine if each criterion passes or fails
 - Requires evidence-based justification for each decision
-- Calculates a simple pass/fail score (passes / total rules)
+- Returns `primaryPass` (true only if ALL criteria pass) and `diagnosticScore` (ratio of passes)

 **Runner (`langsmith/pairwise-runner.ts`):**
 - Generates workflows from prompts in the dataset
-- Applies pairwise evaluation to each generated workflow
-- Reports three metrics to Langsmith:
-  - `pairwise_score`: Overall score (0-1)
-  - `pairwise_passed_count`: Number of criteria passed
-  - `pairwise_failed_count`: Number of criteria violated
+- Runs multiple LLM judges in parallel for each evaluation (configurable via `--judges`)
+- Aggregates judge results using majority vote
+- Supports filtering by `notion_id` metadata for single-example runs
+- Reports five metrics to Langsmith:
+  - `pairwise_primary`: Majority vote result (0 or 1)
+  - `pairwise_diagnostic`: Average diagnostic score across judges
+  - `pairwise_judges_passed`: Count of judges that passed
+  - `pairwise_total_violations`: Sum of all violations
+  - `pairwise_total_passes`: Sum of all passes
+
+**Logger (`utils/logger.ts`):**
+- Simple evaluation logger with verbose mode support
+- Controls output verbosity via `--verbose` flag
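The runner's roll-up described above is a straightforward majority vote plus averaging. A minimal TypeScript sketch of that aggregation, using hypothetical type and function names rather than the actual exports of `langsmith/pairwise-runner.ts`:

```typescript
// Sketch of the judge-level aggregation described above (hypothetical names).
interface JudgeResult {
	primaryPass: boolean; // true only if ALL criteria passed for this judge
	diagnosticScore: number; // ratio of passed criteria, 0-1
	violations: Array<{ rule: string; justification: string }>;
	passes: Array<{ rule: string; justification: string }>;
}

function aggregateJudges(judges: JudgeResult[]) {
	const judgesPassed = judges.filter((j) => j.primaryPass).length;
	return {
		// Majority vote: at least half of the judges must pass.
		pairwise_primary: judgesPassed >= judges.length / 2 ? 1 : 0,
		pairwise_diagnostic:
			judges.reduce((sum, j) => sum + j.diagnosticScore, 0) / judges.length,
		pairwise_judges_passed: judgesPassed,
		pairwise_total_violations: judges.reduce((sum, j) => sum + j.violations.length, 0),
		pairwise_total_passes: judges.reduce((sum, j) => sum + j.passes.length, 0),
	};
}
```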

 **Dataset Format:**
 The pairwise evaluation expects a Langsmith dataset with examples containing:
@@ -217,6 +225,9 @@ GENERATE_TEST_CASES=true pnpm eval

 # With custom concurrency
 EVALUATION_CONCURRENCY=10 pnpm eval
+
+# With feature flags enabled
+pnpm eval --multi-agent --template-examples
 ```

 ### Langsmith Evaluation
@@ -229,11 +240,59 @@ export LANGSMITH_DATASET_NAME=your_dataset_name

 # Run evaluation
 pnpm eval:langsmith
+
+# With feature flags enabled
+pnpm eval:langsmith --multi-agent
 ```

 ### Pairwise Evaluation

-Pairwise evaluation uses a dataset with custom do/don't criteria for each prompt.
+Pairwise evaluation uses a dataset with custom do/don't criteria for each prompt. It implements a hierarchical scoring system with multiple LLM judges per evaluation.
+
+#### CLI Options
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `--prompt <text>` | Run local evaluation with this prompt (no LangSmith required) | - |
+| `--dos <rules>` | Newline-separated "do" rules for local evaluation | - |
+| `--donts <rules>` | Newline-separated "don't" rules for local evaluation | - |
+| `--notion-id <id>` | Filter to a single example by its `notion_id` metadata | (all examples) |
+| `--max-examples <n>` | Limit number of examples to evaluate (useful for testing) | (no limit) |
+| `--repetitions <n>` | Number of times to repeat the entire evaluation | 1 |
+| `--generations <n>` | Number of workflow generations per prompt (for variance reduction) | 1 |
+| `--judges <n>` | Number of LLM judges per evaluation | 3 |
+| `--concurrency <n>` | Number of prompts to evaluate in parallel | 5 |
+| `--name <name>` | Custom experiment name in LangSmith | `pairwise-evals` |
+| `--output-dir <path>` | Save generated workflows and evaluation results to this directory | - |
+| `--verbose`, `-v` | Enable verbose logging (shows judge details, violations, etc.) | false |
+| `--multi-agent` | Enable multi-agent architecture (see [Feature Flags](#feature-flags)) | false |
+| `--template-examples` | Enable template-based examples (see [Feature Flags](#feature-flags)) | false |
+
+#### Local Mode (No LangSmith Required)
+
+Run a single pairwise evaluation locally without needing a LangSmith account:
+
+```bash
+# Basic local evaluation
+pnpm eval:pairwise --prompt "Create a workflow that sends Slack messages" --dos "Use Slack node"
+
+# With don'ts and multiple judges
+pnpm eval:pairwise \
+  --prompt "Create a workflow that fetches data from an API" \
+  --dos "Use HTTP Request node\nHandle errors" \
+  --donts "Don't hardcode URLs" \
+  --judges 5 \
+  --verbose
+```
+
+Local mode is useful for:
+- Testing prompts before adding them to a dataset
+- Quick iteration on evaluation criteria
+- Running evaluations without LangSmith setup
+
+#### LangSmith Mode
+
+For dataset-based evaluation with experiment tracking:

 ```bash
 # Set required environment variables
@@ -242,14 +301,104 @@ export LANGSMITH_API_KEY=your_api_key
 # Run pairwise evaluation (uses default dataset: notion-pairwise-workflows)
 pnpm eval:pairwise

+# Run a single example by notion_id
+pnpm eval:pairwise --notion-id 30d29454-b397-4a35-8e0b-74a2302fa81a
+
+# Run with 3 repetitions and 5 judges, custom experiment name
+pnpm eval:pairwise --repetitions 3 --judges 5 --name "my-experiment"
+
+# Enable verbose logging to see all judge details
+pnpm eval:pairwise --notion-id abc123 --verbose
+
 # Use a custom dataset
 LANGSMITH_DATASET_NAME=my-pairwise-dataset pnpm eval:pairwise

 # Limit to specific number of examples (useful for testing)
-EVAL_MAX_EXAMPLES=2 pnpm eval:pairwise
+pnpm eval:pairwise --max-examples 2
+```
+
+#### Multi-Generation Evaluation
+
+The `--generations` flag enables multiple workflow generations per prompt, providing a **Generation Correctness** metric:

-# Run with multiple repetitions
-pnpm eval:pairwise --repetitions 3
+```bash
+# Run 3 generations per prompt with 3 judges each
+pnpm eval:pairwise --generations 3 --judges 3 --verbose
+
+# Example output:
+# Gen 1: 2/3 judges → ✓ PASS (diag=85%)
+# Gen 2: 1/3 judges → ✗ FAIL (diag=60%)
+# Gen 3: 3/3 judges → ✓ PASS (diag=95%)
+# 📊 [#1] 2/3 gens → PASS (gen_corr=0.67, diag=80%)
+```
+
+**Generation Correctness** = (# passing generations) / total generations:
+- With `--generations 3`: Values are 0, 0.33, 0.67, or 1
+- With `--generations 5`: Values are 0, 0.2, 0.4, 0.6, 0.8, or 1
+
+#### Hierarchical Scoring System
+
+The pairwise evaluation uses a multi-level scoring hierarchy:
+
+| Level | Primary Score | Secondary Score |
+|-------|--------------|-----------------|
+| Individual do/don't | Binary (true/false) | 0 or 1 |
+| 1 LLM judge | false if ANY criterion fails | Average of criteria scores |
+| N judges on 1 generation | Majority vote (≥50% pass) | Average diagnostic across judges |
+| N generations on 1 prompt | (# passing gens) / N | Average diagnostic across generations |
+| Full dataset | Average across prompts | Average diagnostic across all |
+
+This approach reduces variance from LLM non-determinism by using multiple judges and generations.
+
+#### Saving Artifacts with --output-dir
+
+The `--output-dir` flag saves all generated workflows and evaluation results to disk:
+
+```bash
+# Save artifacts to ./eval-output directory
+pnpm eval:pairwise --generations 3 --output-dir ./eval-output --verbose
+```
+
+**Output structure:**
+```
+eval-output/
+├── prompt-1/
+│   ├── prompt.txt          # Original prompt text
+│   ├── criteria.json       # dos/donts criteria
+│   ├── gen-1/
+│   │   ├── workflow.json   # Importable n8n workflow
+│   │   └── evaluation.json # Judge results for this generation
+│   ├── gen-2/
+│   │   ├── workflow.json
+│   │   └── evaluation.json
+│   └── gen-3/
+│       ├── workflow.json
+│       └── evaluation.json
+├── prompt-2/
+│   └── ...
+└── summary.json            # Overall results summary
+```
+
+**workflow.json**: Directly importable into n8n (File → Import from file)
+
+**evaluation.json**: Contains per-judge results including violations and passes:
+```json
+{
+  "generationIndex": 1,
+  "majorityPass": false,
+  "primaryPasses": 1,
+  "numJudges": 3,
+  "diagnosticScore": 0.35,
+  "judges": [
+    {
+      "judgeIndex": 1,
+      "primaryPass": false,
+      "diagnosticScore": 0.30,
+      "violations": [{"rule": "...", "justification": "..."}],
+      "passes": [{"rule": "...", "justification": "..."}]
+    }
+  ]
+}
 ```
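The per-prompt roll-up across generations that these artifacts feed into is simple arithmetic over the per-generation majority votes. A minimal TypeScript sketch, assuming per-generation results shaped like the `evaluation.json` above (hypothetical helper name):

```typescript
// Sketch of the per-prompt roll-up across N generations (hypothetical helper).
interface GenerationResult {
	majorityPass: boolean; // majority vote across judges for this generation
	diagnosticScore: number; // average diagnostic score across judges
}

function aggregateGenerations(generations: GenerationResult[]) {
	const passed = generations.filter((g) => g.majorityPass).length;
	return {
		// e.g. 2 of 3 generations passing -> 0.67
		generationCorrectness: passed / generations.length,
		aggregatedDiagnostic:
			generations.reduce((sum, g) => sum + g.diagnosticScore, 0) / generations.length,
		generationsPassed: passed,
	};
}

// With the example output shown earlier (diagnostics 0.85, 0.60, 0.95 and 2/3 passing):
// generationCorrectness ≈ 0.67, aggregatedDiagnostic = 0.80
```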

 ## Configuration
@@ -282,10 +431,77 @@ The evaluation will fail with a clear error message if `nodes.json` is missing.
 - `USE_LANGSMITH_EVAL` - Set to "true" to use Langsmith mode
 - `USE_PAIRWISE_EVAL` - Set to "true" to use pairwise evaluation mode
 - `LANGSMITH_DATASET_NAME` - Override default dataset name
-- `EVAL_MAX_EXAMPLES` - Limit number of examples to evaluate (useful for testing)
 - `EVALUATION_CONCURRENCY` - Number of parallel test executions (default: 5)
 - `GENERATE_TEST_CASES` - Set to "true" to generate additional test cases
 - `LLM_MODEL` - Model identifier for metadata tracking
+- `EVAL_FEATURE_MULTI_AGENT` - Set to "true" to enable multi-agent mode
+- `EVAL_FEATURE_TEMPLATE_EXAMPLES` - Set to "true" to enable template examples
+
+### Feature Flags
+
+Feature flags control experimental or optional behaviors in the AI Workflow Builder agent during evaluations. They can be set via environment variables or CLI arguments.
+
+#### Available Flags
+
+| Flag | Description | Default |
+|------|-------------|---------|
+| `multiAgent` | Enables multi-agent architecture with specialized sub-agents (supervisor, builder, configurator, discovery) | `false` |
+| `templateExamples` | Enables template-based examples in agent prompts | `false` |
+
+#### Setting Feature Flags
+
+**Via Environment Variables:**
+```bash
+# Enable multi-agent mode
+EVAL_FEATURE_MULTI_AGENT=true pnpm eval
+
+# Enable template examples
+EVAL_FEATURE_TEMPLATE_EXAMPLES=true pnpm eval:pairwise
+
+# Enable both
+EVAL_FEATURE_MULTI_AGENT=true EVAL_FEATURE_TEMPLATE_EXAMPLES=true pnpm eval:langsmith
+```
+
+**Via CLI Arguments:**
+```bash
+# Enable multi-agent mode
+pnpm eval --multi-agent
+
+# Enable template examples
+pnpm eval:pairwise --template-examples
+
+# Enable both
+pnpm eval:langsmith --multi-agent --template-examples
+```
+
+#### Usage Across Evaluation Modes
+
+Feature flags work consistently across all evaluation modes:
+
+**CLI Evaluation:**
+```bash
+pnpm eval --multi-agent --template-examples
+```
+
+**Langsmith Evaluation:**
+```bash
+pnpm eval:langsmith --multi-agent
+```
+
+**Pairwise Evaluation (LangSmith mode):**
+```bash
+pnpm eval:pairwise --multi-agent --template-examples
+```
+
+**Pairwise Evaluation (Local mode):**
+```bash
+pnpm eval:pairwise --prompt "Create a Slack workflow" --dos "Use Slack node" --multi-agent
+```
+
+When feature flags are enabled, they are logged at the start of the evaluation:
+```
+➔ Feature flags enabled: multiAgent, templateExamples
+```
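Because each flag can come from either a CLI argument or an environment variable, resolution reduces to an OR of the two sources. A minimal TypeScript sketch of that resolution (hypothetical helper, not the evaluation package's actual implementation):

```typescript
// Sketch of feature-flag resolution from CLI args and env vars (hypothetical helper).
interface FeatureFlags {
	multiAgent: boolean;
	templateExamples: boolean;
}

function resolveFeatureFlags(argv: string[], env: NodeJS.ProcessEnv): FeatureFlags {
	return {
		multiAgent:
			argv.includes('--multi-agent') || env.EVAL_FEATURE_MULTI_AGENT === 'true',
		templateExamples:
			argv.includes('--template-examples') || env.EVAL_FEATURE_TEMPLATE_EXAMPLES === 'true',
	};
}

// Example: resolveFeatureFlags(process.argv.slice(2), process.env)
// -> { multiAgent: true, templateExamples: false } for `pnpm eval --multi-agent`
```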

 ## Output

@@ -304,14 +520,22 @@ The evaluation will fail with a clear error message if `nodes.json` is missing.
 ### Pairwise Evaluation Output

 - Results are stored in Langsmith dashboard
-- Experiment name format: `pairwise-evals-[uuid]`
-- Metrics reported:
-  - `pairwise_score`: Overall pass rate (0-1)
-  - `pairwise_passed_count`: Number of criteria that passed
-  - `pairwise_failed_count`: Number of criteria that were violated
+- Experiment name format: `<name>-[uuid]` (default: `pairwise-evals-[uuid]`)
+- Metrics reported (single generation mode):
+  - `pairwise_primary`: Binary pass/fail based on majority vote (0 or 1)
+  - `pairwise_diagnostic`: Average diagnostic score across judges (0-1)
+  - `pairwise_judges_passed`: Number of judges that returned primaryPass=true
+  - `pairwise_total_violations`: Sum of violations across all judges
+  - `pairwise_total_passes`: Sum of passes across all judges
+- Additional metrics reported (multi-generation mode with `--generations N`):
+  - `pairwise_generation_correctness`: (# passing generations) / N (0, 0.33, 0.67, 1 for N=3)
+  - `pairwise_aggregated_diagnostic`: Average diagnostic score across all generations
+  - `pairwise_generations_passed`: Count of generations that passed majority vote
+  - `pairwise_total_judge_calls`: Total judge invocations (generations × judges)
 - Each result includes detailed comments with:
-  - List of violations with justifications
-  - List of passes with justifications
+  - Majority vote summary
+  - List of violations with justifications (per judge)
+  - List of passes (per judge)
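A sketch of how such a comment could be assembled from per-judge results, in TypeScript (hypothetical formatting helper; the runner's actual wording may differ):

```typescript
// Sketch: formatting the per-example comment described above (hypothetical helper).
interface JudgeFeedback {
	primaryPass: boolean;
	violations: Array<{ rule: string; justification: string }>;
	passes: Array<{ rule: string; justification: string }>;
}

function formatComment(judges: JudgeFeedback[]): string {
	const passed = judges.filter((j) => j.primaryPass).length;
	const lines = [`Majority vote: ${passed}/${judges.length} judges passed`];
	judges.forEach((judge, i) => {
		lines.push(`Judge ${i + 1}:`);
		judge.violations.forEach((v) => lines.push(`  ✗ ${v.rule}: ${v.justification}`));
		judge.passes.forEach((p) => lines.push(`  ✓ ${p.rule}`));
	});
	return lines.join('\n');
}
```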

 ## Adding New Test Cases

packages/@n8n/ai-workflow-builder.ee/evaluations/chains/pairwise-evaluator.test.ts

Lines changed: 8 additions & 5 deletions
@@ -52,7 +52,8 @@ describe('evaluateWorkflowPairwise', () => {

 		expect(result).toEqual({
 			...mockResult,
-			score: 1,
+			primaryPass: true,
+			diagnosticScore: 1,
 		});
 		expect(baseEvaluator.createEvaluatorChain).toHaveBeenCalledWith(
 			mockLlm,
@@ -69,7 +70,7 @@ describe('evaluateWorkflowPairwise', () => {
 		);
 	});

-	it('should calculate score correctly with violations', async () => {
+	it('should calculate diagnosticScore correctly with violations', async () => {
 		const mockResult = {
 			violations: [{ rule: "Don't do that", justification: 'Did it' }],
 			passes: [{ rule: 'Do this', justification: 'Done' }],
@@ -79,10 +80,11 @@ describe('evaluateWorkflowPairwise', () => {

 		const result = await evaluateWorkflowPairwise(mockLlm, input);

-		expect(result.score).toBe(0.5);
+		expect(result.primaryPass).toBe(false);
+		expect(result.diagnosticScore).toBe(0.5);
 	});

-	it('should return score 0 when no rules evaluated', async () => {
+	it('should return diagnosticScore 0 when no rules evaluated', async () => {
 		const mockResult = {
 			violations: [],
 			passes: [],
@@ -92,6 +94,7 @@ describe('evaluateWorkflowPairwise', () => {

 		const result = await evaluateWorkflowPairwise(mockLlm, input);

-		expect(result.score).toBe(0);
+		expect(result.primaryPass).toBe(true);
+		expect(result.diagnosticScore).toBe(0);
 	});
 });
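These expectations pin down the evaluator's scoring rule. A minimal TypeScript sketch of that behavior, using a hypothetical helper and the result shape asserted in the tests:

```typescript
// Sketch of the scoring the tests assert (hypothetical helper, not the module's code).
interface RuleResult {
	rule: string;
	justification: string;
}

function scorePairwiseResult(violations: RuleResult[], passes: RuleResult[]) {
	const total = violations.length + passes.length;
	return {
		// All criteria must pass; with no rules evaluated there is nothing to violate.
		primaryPass: violations.length === 0,
		// Ratio of passed criteria; defined as 0 when no rules were evaluated.
		diagnosticScore: total === 0 ? 0 : passes.length / total,
	};
}

// Matches the cases above (onePass / oneViolation are placeholder rule results):
// scorePairwiseResult([], [onePass])              -> { primaryPass: true,  diagnosticScore: 1 }
// scorePairwiseResult([oneViolation], [onePass])  -> { primaryPass: false, diagnosticScore: 0.5 }
// scorePairwiseResult([], [])                     -> { primaryPass: true,  diagnosticScore: 0 }
```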
