feat(evaluation): layout-aware FormSpec generation and quality evaluation #134
Open
danielnaab wants to merge 15 commits into main
The formSpecSchema requires these fields. Also commits baseline and layout variant evaluation results showing +17.7pp overall improvement.
Documents methodology, per-fixture results, and recommendations. Key finding: +17.7pp overall improvement with largest gains in title clarity (+43.7pp), topic cohesion (+37.5pp), and page sizing (+31.3pp). Conditional page use and delivery mode identified as areas for iteration.
…ance Delivery mode: removed overly conservative "default to static" and replaced with content-complexity-based criteria (narrative fields, sensitive topics, eligibility logic → conversational). Conditional pages: added explicit instructions for deriving page-level conditions from field-level conditions, with a worked example in the schema. Modest improvement (+6.3pp) but the inference remains hard for a prompt-only approach. Results: overall 77.1% (+19.8pp vs baseline). Delivery mode regression eliminated. Conditional use improved from 37.5% to 43.8%.
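To illustrate what "deriving page-level conditions from field-level conditions" can mean, here is a minimal sketch. It assumes one simple rule (lift a condition to the page only when every field on that page shares it); that rule, the `Field` and `Page` shapes, and the condition strings are all hypothetical and are not taken from the prompt in this PR or from the approach planned in #132.

```typescript
// Hypothetical shapes: the real FormSpec types are not shown on this page.
interface Field {
  id: string;
  // Field-level visibility condition as an opaque expression string, if any.
  condition?: string;
}

interface Page {
  title: string;
  fields: Field[];
  condition?: string;
}

// Assumed rule: a page becomes conditional only when every field on it
// carries exactly the same condition. Otherwise the page is left unchanged.
function derivePageCondition(page: Page): Page {
  const first = page.fields[0]?.condition;
  if (first === undefined) return page;
  const shared = page.fields.every((f) => f.condition === first);
  return shared ? { ...page, condition: first } : page;
}

const spousePage: Page = {
  title: "Spouse details",
  fields: [
    { id: "spouseName", condition: "maritalStatus == 'married'" },
    { id: "spouseDob", condition: "maritalStatus == 'married'" },
  ],
};

const mixedPage: Page = {
  title: "Household",
  fields: [
    { id: "householdSize" },
    { id: "spouseIncome", condition: "maritalStatus == 'married'" },
  ],
};

const liftedSpousePage = derivePageCondition(spousePage);
const liftedMixedPage = derivePageCondition(mixedPage);
```

Under this rule `liftedSpousePage` gains the shared condition, while `liftedMixedPage` stays unconditional because only some of its fields are conditional.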
Final results after prompt iteration: +19.8pp overall (57.3% → 77.1%). Delivery mode regression eliminated. Conditional page use improved modestly (+6.3pp) but confirmed as a prompt-difficulty ceiling. Follow-up filed as #132 for deterministic post-processing approach.
Replace module-level mutable state (setLayoutJudge) with a factory function (createLayoutQualityKind) that takes the judge as a parameter. Consistent with the existing createLlmJudgeKind pattern.
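The difference between the two designs can be sketched as follows. The `LayoutJudge` and `EvaluationKind` types here are simplified placeholders assumed for illustration; the project's actual interfaces are likely richer.

```typescript
// Hypothetical minimal types; the project's EvaluationKind interface is not shown here.
type LayoutJudge = (formSpecJson: string) => Promise<number>;

interface EvaluationKind {
  name: string;
  score(output: string, _groundTruth?: unknown): Promise<number>;
}

// Before (sketch): a module-level variable that every caller had to set first.
//   let layoutJudge: LayoutJudge | undefined;
//   function setLayoutJudge(judge: LayoutJudge) { layoutJudge = judge; }

// After (sketch): the judge is a parameter captured in a closure, so each
// kind owns its judge and no shared state exists to reset between runs.
function createLayoutQualityKind(judge: LayoutJudge): EvaluationKind {
  return {
    name: "layout-quality",
    score: (output, _groundTruth) => judge(output),
  };
}

// Two kinds with different stub judges can coexist, which is convenient in tests.
const strictKind = createLayoutQualityKind(async () => 0.5);
const lenientKind = createLayoutQualityKind(async () => 0.9);
```

Because nothing is stored at module level, a test can build a kind around a stub judge without affecting any other test.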
- Add Zod schema validation for layout judge response (prevents NaN from malformed model output)
- Add activity tracking to generateFormSpecWithLayout (matches generateFormSpec)
- Update formSpecGenerator type to accept activity-tracking params
- Add evaluationRunSchema.parse() before writing layout evaluation results
- Fix score() signature to include _groundTruth parameter per EvaluationKind interface
- Remove erroneous groundTruth filter in layout evaluation subcommand
- Export buildLayoutPrompt from form-documents public index
- Fix test import to use public index instead of internal path
- Remove buildLayoutJudgePrompt from evaluation public index (implementation detail)
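The NaN problem that the schema validation guards against can be shown with a dependency-free sketch. This uses a hand-written check as a stand-in for the Zod schema in the PR. The `rating` field, the 1 to 5 scale, and the choice to score invalid responses as 0 are all assumptions made for illustration.

```typescript
// Hypothetical judge response shape: a single numeric rating on a 1-5 scale.
interface JudgeResponse {
  rating: number;
}

// Hand-rolled validation standing in for a Zod schema. Returns null for
// anything that is not valid JSON with a finite rating between 1 and 5.
function parseJudgeResponse(raw: string): JudgeResponse | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const rating = (data as { rating?: unknown }).rating;
  if (typeof rating !== "number" || !Number.isFinite(rating)) return null;
  if (rating < 1 || rating > 5) return null;
  return { rating };
}

// Map a 1-5 rating onto the 0-1 range.
function normalize(rating: number): number {
  return (rating - 1) / 4;
}

// Assumed policy: a malformed response scores 0 rather than propagating NaN.
function scoreFromRaw(raw: string): number {
  const parsed = parseJudgeResponse(raw);
  return parsed === null ? 0 : normalize(parsed.rating);
}
```

Without the validation step, a response such as `{"rating": "high"}` would reach `normalize` and produce NaN, which would then contaminate any average computed over the run.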
0e4af86 to b5685ad
Summary
- Layout-aware FormSpec generation (`generateFormSpecWithLayout`) that breaks forms into logical multi-page sections with adaptive sizing, topic-cohesive grouping, and conditional-page guidance
- `sonnet-hybrid-layout-v1` extraction variant that plugs in the layout generator via a new `formSpecGenerator` extension point on `createBedrockPdfExtractor`
- `layout-quality` evaluation kind with a Bedrock LLM judge, an `evaluate layout` CLI subcommand, and experiment results for 4 government PDFs
Story
Closes #121
Acceptance Criteria
- … (`catalog/experiments/layout-quality/findings.md`)
- … (`generateFormSpecWithLayout` prompt)
- … (`catalog/experiments/layout-quality/`)
Test Plan
- `bun run check` passes (1439 tests, type check clean; pre-existing lint warnings on main unchanged)
- `test/form-documents/layout-prompt.test.ts`: layout prompt content, statistics, schema
- `test/evaluation/layout-quality.test.ts`: score normalization, NaN protection via Zod schema, summarize
Review Notes
Extension point design: `formSpecGenerator` on `BedrockExtractorOptions` accepts `(model, spec, activityStore?, userId?, projectId?, modelId?) => Promise<FormSpec>`, preserving cost tracking regardless of which generator is active. Passing `generateFormSpecWithLayout` directly in the registry is compatible with this signature.
Evaluation infrastructure: The `layout-quality` kind follows the factory pattern from `createLlmJudgeKind`: the judge is injected at construction, not via mutable global state.
Scope discipline: The `sonnet-hybrid-layout-v1` variant is registered as `experimental`. Promotion to production default is deferred pending #132 (deterministic conditional page injection), which addresses the main gap identified in the findings.
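The extension point described in the review notes can be sketched roughly as below. Only the parameter names and the `Promise<FormSpec>` return type come from the notes above; the parameter types, the `FormSpec` shape, the stub generators, and the `resolveGenerator` helper are hypothetical placeholders, not the project's code.

```typescript
// Placeholder types; the real ones live in the project and are not shown here.
type Model = unknown;
type ExtractedSpec = unknown;
type ActivityStore = unknown;

interface FormSpec {
  pages: { title: string }[];
}

// Signature as described in the review notes; the string types are assumed.
type FormSpecGenerator = (
  model: Model,
  spec: ExtractedSpec,
  activityStore?: ActivityStore,
  userId?: string,
  projectId?: string,
  modelId?: string,
) => Promise<FormSpec>;

interface BedrockExtractorOptions {
  formSpecGenerator?: FormSpecGenerator;
}

// Stub generators standing in for generateFormSpec and generateFormSpecWithLayout.
const defaultGenerator: FormSpecGenerator = async () => ({
  pages: [{ title: "All fields" }],
});
const layoutGenerator: FormSpecGenerator = async () => ({
  pages: [{ title: "About you" }, { title: "Your household" }],
});

// Hypothetical helper: fall back to the default when no generator is supplied.
function resolveGenerator(options: BedrockExtractorOptions): FormSpecGenerator {
  return options.formSpecGenerator ?? defaultGenerator;
}

const chosenForLayoutVariant = resolveGenerator({ formSpecGenerator: layoutGenerator });
const chosenByDefault = resolveGenerator({});
```

Because both generators share one signature that carries the activity-tracking parameters, the caller can record cost the same way whichever generator is selected.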