@@ -104,21 +104,29 @@ The Langsmith integration provides two key components:

#### 6. Pairwise Evaluation

-Pairwise evaluation provides a simpler, criteria-based approach to workflow evaluation. Instead of using the complex multi-metric evaluation system, it evaluates workflows against a custom set of "do" and "don't" rules defined in the dataset.
+Pairwise evaluation provides a criteria-based approach to workflow evaluation with hierarchical scoring and multi-judge consensus. It evaluates workflows against a custom set of "do" and "don't" rules defined in the dataset.

**Evaluator (`chains/pairwise-evaluator.ts`):**
- Evaluates workflows against a checklist of criteria (dos and don'ts)
- Uses an LLM to determine if each criterion passes or fails
- Requires evidence-based justification for each decision
-- Calculates a simple pass/fail score (passes / total rules)
+- Returns `primaryPass` (true only if ALL criteria pass) and `diagnosticScore` (ratio of passes)

**Runner (`langsmith/pairwise-runner.ts`):**
- Generates workflows from prompts in the dataset
-- Applies pairwise evaluation to each generated workflow
-- Reports three metrics to Langsmith:
-  - `pairwise_score`: Overall score (0-1)
-  - `pairwise_passed_count`: Number of criteria passed
-  - `pairwise_failed_count`: Number of criteria violated
+- Runs multiple LLM judges in parallel for each evaluation (configurable via `--judges`)
+- Aggregates judge results using majority vote
+- Supports filtering by `notion_id` metadata for single-example runs
+- Reports five metrics to Langsmith:
+  - `pairwise_primary`: Majority vote result (0 or 1)
+  - `pairwise_diagnostic`: Average diagnostic score across judges
+  - `pairwise_judges_passed`: Count of judges that passed
+  - `pairwise_total_violations`: Sum of all violations
+  - `pairwise_total_passes`: Sum of all passes
+
+**Logger (`utils/logger.ts`):**
+- Simple evaluation logger with verbose mode support
+- Controls output verbosity via `--verbose` flag

**Dataset Format:**
The pairwise evaluation expects a Langsmith dataset with examples containing:
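The hierarchical scoring and majority-vote consensus introduced in this hunk can be sketched in a few lines of TypeScript. The sketch below is illustrative only: the `CriterionResult` and `JudgeVerdict` types and the `scoreJudge`/`aggregateJudges` helpers are assumptions for the example, not the actual code in `chains/pairwise-evaluator.ts` or `langsmith/pairwise-runner.ts`.

```typescript
// Illustrative sketch only; names and shapes are assumptions, not the real implementation.

interface CriterionResult {
  criterion: string; // the "do" or "don't" rule that was checked
  passed: boolean; // whether the LLM judge considered the rule satisfied
  evidence: string; // evidence-based justification for the decision
}

interface JudgeVerdict {
  primaryPass: boolean; // true only if ALL criteria pass
  diagnosticScore: number; // ratio of passed criteria to total criteria
  passes: number;
  violations: number;
}

// Hierarchical scoring for a single judge: a hard gate plus a softer ratio.
function scoreJudge(results: CriterionResult[]): JudgeVerdict {
  const passes = results.filter((r) => r.passed).length;
  const violations = results.length - passes;
  return {
    primaryPass: violations === 0,
    diagnosticScore: results.length === 0 ? 1 : passes / results.length,
    passes,
    violations,
  };
}

// Aggregate several judges (assumes at least one) into the five reported metrics.
function aggregateJudges(verdicts: JudgeVerdict[]) {
  const judgesPassed = verdicts.filter((v) => v.primaryPass).length;
  return {
    pairwise_primary: judgesPassed > verdicts.length / 2 ? 1 : 0, // strict majority vote
    pairwise_diagnostic: verdicts.reduce((sum, v) => sum + v.diagnosticScore, 0) / verdicts.length,
    pairwise_judges_passed: judgesPassed,
    pairwise_total_violations: verdicts.reduce((sum, v) => sum + v.violations, 0),
    pairwise_total_passes: verdicts.reduce((sum, v) => sum + v.passes, 0),
  };
}
```

Splitting the result this way keeps `pairwise_primary` as a strict all-or-nothing signal, while `pairwise_diagnostic` still gives partial credit when comparing runs that both fail the hard gate.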
-Pairwise evaluation uses a dataset with custom do/don't criteria for each prompt.
+Pairwise evaluation uses a dataset with custom do/don't criteria for each prompt. It implements a hierarchical scoring system with multiple LLM judges per evaluation.
+
+#### CLI Options
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `--prompt <text>` | Run local evaluation with this prompt (no LangSmith required) | - |
+| `--dos <rules>` | Newline-separated "do" rules for local evaluation | - |
+| `--donts <rules>` | Newline-separated "don't" rules for local evaluation | - |
+| `--notion-id <id>` | Filter to a single example by its `notion_id` metadata | (all examples) |
+| `--max-examples <n>` | Limit number of examples to evaluate (useful for testing) | (no limit) |
+| `--repetitions <n>` | Number of times to repeat the entire evaluation | 1 |
+| `--generations <n>` | Number of workflow generations per prompt (for variance reduction) | 1 |
+| `--judges <n>` | Number of LLM judges per evaluation | 3 |
+| `--concurrency <n>` | Number of prompts to evaluate in parallel | 5 |
+| `--name <name>` | Custom experiment name in LangSmith | `pairwise-evals` |
+| `--output-dir <path>` | Save generated workflows and evaluation results to this directory | - |
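For local runs without LangSmith, `--dos` and `--donts` carry newline-separated rules. As a rough sketch (the `Criterion` shape and `parseCriteria` helper are assumptions, not the project's actual types), such strings could be normalized into a single checklist like this:

```typescript
// Sketch only: turning newline-separated "do"/"don't" rule strings into one checklist.
// The Criterion type and parseCriteria helper are illustrative assumptions.

interface Criterion {
  kind: 'do' | 'dont';
  rule: string;
}

function parseCriteria(dos?: string, donts?: string): Criterion[] {
  const split = (text: string | undefined, kind: Criterion['kind']): Criterion[] =>
    (text ?? '')
      .split('\n')
      .map((rule) => rule.trim())
      .filter((rule) => rule.length > 0)
      .map((rule) => ({ kind, rule }));

  return [...split(dos, 'do'), ...split(donts, 'dont')];
}

// Example: two "do" rules and one "don't" rule become a three-item checklist.
const checklist = parseCriteria(
  'Use a Schedule Trigger\nSend the summary to Slack',
  'Do not hard-code credentials',
);
console.log(checklist.length); // 3
```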
@@ -282,10 +431,77 @@ The evaluation will fail with a clear error message if `nodes.json` is missing.
- `USE_LANGSMITH_EVAL` - Set to "true" to use Langsmith mode
- `USE_PAIRWISE_EVAL` - Set to "true" to use pairwise evaluation mode
- `LANGSMITH_DATASET_NAME` - Override default dataset name
-- `EVAL_MAX_EXAMPLES` - Limit number of examples to evaluate (useful for testing)
- `EVALUATION_CONCURRENCY` - Number of parallel test executions (default: 5)
- `GENERATE_TEST_CASES` - Set to "true" to generate additional test cases
- `LLM_MODEL` - Model identifier for metadata tracking
+- `EVAL_FEATURE_MULTI_AGENT` - Set to "true" to enable multi-agent mode
+- `EVAL_FEATURE_TEMPLATE_EXAMPLES` - Set to "true" to enable template examples
+
+### Feature Flags
+
+Feature flags control experimental or optional behaviors in the AI Workflow Builder agent during evaluations. They can be set via environment variables or CLI arguments.
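As a minimal sketch of the environment-variable path (the `readFlag` helper and its default are assumptions, not the agent's actual implementation), boolean flags such as `EVAL_FEATURE_MULTI_AGENT` and `EVAL_FEATURE_TEMPLATE_EXAMPLES` could be read like this:

```typescript
// Sketch only: reading "true"/"false" feature-flag environment variables.
// The readFlag helper and the flag object are illustrative assumptions.

function readFlag(name: string, fallback = false): boolean {
  const value = process.env[name];
  return value === undefined ? fallback : value.toLowerCase() === 'true';
}

const featureFlags = {
  multiAgent: readFlag('EVAL_FEATURE_MULTI_AGENT'),
  templateExamples: readFlag('EVAL_FEATURE_TEMPLATE_EXAMPLES'),
};

console.log(featureFlags);
```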