feat(evaluation): layout-aware FormSpec generation and quality evaluation by danielnaab · Pull Request #134 · flexion/forms-lab

danielnaab · 2026-05-06T08:20:41Z

Summary

Adds a layout-aware FormSpec generator (generateFormSpecWithLayout) that breaks forms into logical multi-page sections with adaptive sizing, topic-cohesive grouping, and conditional-page guidance
Registers sonnet-hybrid-layout-v1 extraction variant that plugs in the layout generator via a new formSpecGenerator extension point on createBedrockPdfExtractor
Adds a layout-quality evaluation kind with a Bedrock LLM judge, an evaluate layout CLI subcommand, and experiment results for 4 government PDFs

Story

Closes #121

Acceptance Criteria

An audit of current form output identifies specific issues with examples (catalog/experiments/layout-quality/findings.md)
Forms with more than ~10 fields are broken into logical sections or multi-step flows (enforced by generateFormSpecWithLayout prompt)
Related fields are grouped with descriptive section headings (layout prompt enforces topic-cohesive grouping)
The form layout works well on mobile viewports (requires UI-layer changes; out of scope for generation pipeline)
Accessibility review confirms screen reader compatibility (out of scope for this branch)
At least one real-world form tested end-to-end (4 government PDFs evaluated; results in catalog/experiments/layout-quality/)
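
The page-size criterion above is enforced through the prompt, so a generated spec could still exceed it. As a rough illustration only, a deterministic check might look like the sketch below. The `SizedPage` shape, the `findOversizedPages` helper, and the hard threshold of 10 are hypothetical and are not part of this PR.

```typescript
// Hypothetical page shape for illustration; the real FormSpec type is not shown in this PR.
interface SizedPage {
  title: string;
  fieldIds: string[];
}

// Return the titles of pages that hold more fields than the threshold.
// The default of 10 mirrors the "~10 fields" wording in the acceptance criteria.
function findOversizedPages(pages: SizedPage[], maxFields = 10): string[] {
  return pages
    .filter((page) => page.fieldIds.length > maxFields)
    .map((page) => page.title);
}

const samplePages: SizedPage[] = [
  { title: "Applicant details", fieldIds: ["name", "dob", "ssn"] },
  { title: "Everything else", fieldIds: Array.from({ length: 14 }, (_, i) => `f${i}`) },
];

console.log(findOversizedPages(samplePages)); // [ 'Everything else' ]
```

A check like this would only flag violations; deciding how to split an oversized page is still left to the generator.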

Test Plan

bun run check passes (1439 tests, type check clean; pre-existing lint warnings on main unchanged)
test/form-documents/layout-prompt.test.ts — layout prompt content, statistics, schema
test/evaluation/layout-quality.test.ts — score normalization, NaN protection via Zod schema, summarize
Architecture dependency rule tests pass

Review Notes

Extension point design: formSpecGenerator on BedrockExtractorOptions accepts (model, spec, activityStore?, userId?, projectId?, modelId?) => Promise<FormSpec>, preserving cost tracking regardless of which generator is active. Passing generateFormSpecWithLayout directly in the registry is compatible with this signature.
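
For reviewers who want to see the shape of the extension point, here is a minimal sketch. Only the parameter order comes from the signature quoted above. The `ExtractedSpec`, `Model`, `ActivityStore`, and `ExtractorOptions` types, the default generator, and `resolveFormSpecGenerator` are hypothetical stand-ins, not the repository's actual code.

```typescript
// Hypothetical stand-ins: the repository's real types are not shown in this PR.
interface FormSpec {
  title: string;
  pages: { title: string; fieldIds: string[] }[];
}
interface ExtractedSpec {
  fieldIds: string[];
}
interface Model {
  invoke(prompt: string): Promise<string>;
}
interface ActivityEvent {
  kind: string;
  userId: string | undefined;
  projectId: string | undefined;
  modelId: string | undefined;
}
interface ActivityStore {
  record(event: ActivityEvent): void;
}

// Parameter order follows the signature quoted in the review notes.
type FormSpecGenerator = (
  model: Model,
  spec: ExtractedSpec,
  activityStore?: ActivityStore,
  userId?: string,
  projectId?: string,
  modelId?: string,
) => Promise<FormSpec>;

interface ExtractorOptions {
  formSpecGenerator?: FormSpecGenerator;
}

// Assumed default: a single page holding every field.
const defaultGenerator: FormSpecGenerator = async (_model, spec) => ({
  title: "Untitled form",
  pages: [{ title: "All fields", fieldIds: spec.fieldIds }],
});

// A custom generator still receives the tracking arguments, so usage can be
// recorded no matter which generator is plugged in.
const trackedGenerator: FormSpecGenerator = async (
  model, spec, activityStore, userId, projectId, modelId,
) => {
  const title = await model.invoke(`Title a form with ${spec.fieldIds.length} fields`);
  activityStore?.record({ kind: "form-spec-generation", userId, projectId, modelId });
  return { title, pages: [{ title: "Page 1", fieldIds: spec.fieldIds }] };
};

// Use the injected generator when present, otherwise fall back to the default.
function resolveFormSpecGenerator(options: ExtractorOptions): FormSpecGenerator {
  return options.formSpecGenerator ?? defaultGenerator;
}
```

Because every generator shares one function type, a registry entry can pass a generator by reference without an adapter, which is the compatibility claim made above.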

Evaluation infrastructure: The layout-quality kind follows the factory pattern from createLlmJudgeKind — judge injected at construction, not via mutable global state.
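
A minimal sketch of the factory approach, assuming a judge that returns per-criterion scores and a kind that averages them. The `LayoutJudge` and `EvaluationKindSketch` shapes and the averaging rule are assumptions for illustration; only the `createLayoutQualityKind` name and the `_groundTruth` parameter come from this PR.

```typescript
// Hypothetical shapes; the repository's EvaluationKind and judge types are not shown here.
type LayoutJudge = (formSpecJson: string) => Promise<Record<string, number>>;

interface EvaluationKindSketch {
  name: string;
  score(output: string, _groundTruth?: unknown): Promise<number>;
}

// The judge is captured in a closure at construction time, so there is no
// module-level setter and no shared mutable state between kinds.
function createLayoutQualityKind(judge: LayoutJudge): EvaluationKindSketch {
  return {
    name: "layout-quality",
    async score(output, _groundTruth) {
      const scores = Object.values(await judge(output));
      if (scores.length === 0) return 0;
      // Assumed aggregation: unweighted mean of the criterion scores.
      return scores.reduce((sum, s) => sum + s, 0) / scores.length;
    },
  };
}

// Two kinds built with different stub judges stay independent of each other.
const strictKind = createLayoutQualityKind(async () => ({ titleClarity: 0.5, topicCohesion: 1 }));
const lenientKind = createLayoutQualityKind(async () => ({ titleClarity: 1 }));
```

Injecting the judge this way also lets tests pass a stub judge instead of calling Bedrock.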

Scope discipline: The sonnet-hybrid-layout-v1 variant is registered as experimental. Promotion to production default is deferred pending #132 (deterministic conditional page injection), which addresses the main gap identified in the findings.

Documents methodology, per-fixture results, and recommendations. Key finding: +17.7pp overall improvement with largest gains in title clarity (+43.7pp), topic cohesion (+37.5pp), and page sizing (+31.3pp). Conditional page use and delivery mode identified as areas for iteration.

…ance

Delivery mode: removed overly conservative "default to static" and replaced with content-complexity-based criteria (narrative fields, sensitive topics, eligibility logic → conversational).

Conditional pages: added explicit instructions for deriving page-level conditions from field-level conditions, with a worked example in the schema. Modest improvement (+6.3pp) but the inference remains hard for a prompt-only approach.

Results: overall 77.1% (+19.8pp vs baseline). Delivery mode regression eliminated. Conditional use improved from 37.5% to 43.8%.
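
To make the conditional-page inference concrete, the sketch below lifts a condition to page level only when every field on the page carries the same condition. This is an illustrative guess at one deterministic rule, not the prompt's worked example and not the approach planned for #132. The `ConditionalField` and `ConditionalPage` shapes and the string-valued `showWhen` property are hypothetical.

```typescript
// Hypothetical shapes for illustration only.
interface ConditionalField {
  id: string;
  showWhen?: string; // condition expressed as an opaque string
}
interface ConditionalPage {
  title: string;
  fields: ConditionalField[];
  showWhen?: string;
}

// Lift a condition to the page only when all of its fields share that condition.
function derivePageCondition(page: ConditionalPage): ConditionalPage {
  const first = page.fields[0]?.showWhen;
  if (first === undefined) return page;
  const shared = page.fields.every((field) => field.showWhen === first);
  return shared ? { ...page, showWhen: first } : page;
}

const spousePage = derivePageCondition({
  title: "Spouse information",
  fields: [
    { id: "spouseName", showWhen: "maritalStatus == married" },
    { id: "spouseDob", showWhen: "maritalStatus == married" },
  ],
});
console.log(spousePage.showWhen); // maritalStatus == married
```

Pages whose fields have mixed or missing conditions are returned unchanged, which keeps the rule conservative.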

Final results after prompt iteration: +19.8pp overall (57.3% → 77.1%). Delivery mode regression eliminated. Conditional page use improved modestly (+6.3pp) but confirmed as a prompt-difficulty ceiling. Follow-up filed as #132 for deterministic post-processing approach.
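
The deltas here are percentage points (pp), meaning absolute differences between two percentages, not relative changes. The short sketch below reproduces the arithmetic from the figures reported above; the helper names are hypothetical.

```typescript
// Round to one decimal place to avoid floating-point noise in the output.
const round1 = (x: number): number => Math.round(x * 10) / 10;

// Absolute difference between two percentages, in percentage points.
function percentagePointDelta(before: number, after: number): number {
  return round1(after - before);
}

// Relative change, in percent of the starting value.
function relativeChangePercent(before: number, after: number): number {
  return round1(((after - before) / before) * 100);
}

console.log(percentagePointDelta(57.3, 77.1)); // 19.8 (overall score)
console.log(percentagePointDelta(37.5, 43.8)); // 6.3 (conditional page use)
console.log(relativeChangePercent(57.3, 77.1)); // 34.6
```

The same overall gain reads as +19.8pp in absolute terms and roughly +34.6% in relative terms, so the unit matters when quoting it.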

- Add Zod schema validation for layout judge response (prevents NaN from malformed model output)
- Add activity tracking to generateFormSpecWithLayout (matches generateFormSpec)
- Update formSpecGenerator type to accept activity-tracking params
- Add evaluationRunSchema.parse() before writing layout evaluation results
- Fix score() signature to include _groundTruth parameter per EvaluationKind interface
- Remove erroneous groundTruth filter in layout evaluation subcommand
- Export buildLayoutPrompt from form-documents public index
- Fix test import to use public index instead of internal path
- Remove buildLayoutJudgePrompt from evaluation public index (implementation detail)
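
To show why validating the judge response guards against NaN, here is a dependency-free stand-in for the Zod schema. It is not the schema from this PR: the criterion keys, the 0 to 1 score range, and `parseJudgeResponse` are assumptions, and plain type checks are used in place of Zod so the sketch runs without extra packages.

```typescript
// Assumed criterion names, loosely based on the dimensions named in the findings.
const CRITERIA = ["titleClarity", "topicCohesion", "pageSizing"] as const;
type Criterion = (typeof CRITERIA)[number];
type JudgeScores = Record<Criterion, number>;

// Return validated scores, or null when the model output is malformed.
function parseJudgeResponse(raw: string): JudgeScores | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const record = data as Record<string, unknown>;
  const scores = {} as JudgeScores;
  for (const key of CRITERIA) {
    const value = record[key];
    // Reject missing, non-numeric, non-finite, or out-of-range values
    // before they can reach any averaging step.
    if (typeof value !== "number" || !Number.isFinite(value) || value < 0 || value > 1) {
      return null;
    }
    scores[key] = value;
  }
  return scores;
}

console.log(parseJudgeResponse('{"titleClarity":"high","topicCohesion":1,"pageSizing":0.5}')); // null
```

Without a check like this, a string such as "high" would flow into the arithmetic and turn the aggregate score into NaN.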

danielnaab temporarily deployed to story-121-form-layout May 6, 2026 19:36 Inactive

danielnaab added 13 commits May 7, 2026 05:25

feat(form-documents): add layout-aware FormSpec generation prompt

670fec4

feat(form-documents): add formSpecGenerator option to extractor

752c95f

feat(extraction): register sonnet-hybrid-layout-v1 variant

72d0f85

feat(evaluation): add layout-quality evaluation kind

ab3a60a

feat(evaluation): add Bedrock layout judge

1f18021

feat(cli): add 'evaluate layout' subcommand

515d1bb

fix(form-documents): include createdAt/updatedAt in layout prompt schema

fa8fe36

The formSpecSchema requires these fields. Also commits baseline and layout variant evaluation results showing +17.7pp overall improvement.

style: fix biome formatting

c44a9b7

refactor(evaluation): use factory pattern for layout quality kind

db9cb4d

Replace module-level mutable state (setLayoutJudge) with a factory function (createLayoutQualityKind) that takes the judge as a parameter. Consistent with the existing createLlmJudgeKind pattern.

danielnaab force-pushed the story-121/form-layout branch from 0e4af86 to b5685ad May 7, 2026 05:36

danielnaab temporarily deployed to story-121-form-layout May 7, 2026 05:36 Inactive

docs(story-121): add code review artifact

149893b

danielnaab deployed to story-121-form-layout May 7, 2026 05:37

danielnaab changed the title ~~feat(form-documents): layout-aware FormSpec generation~~ feat(evaluation): layout-aware FormSpec generation and quality evaluation May 7, 2026

docs: update flight board and session log for story-121

067204b

danielnaab temporarily deployed to story-121-form-layout May 7, 2026 05:37 Inactive

danielnaab wants to merge 15 commits into main from story-121/form-layout
