Commit 0b46d0a

Add blog post about getting undeterministic agent into deterministic guardrails
1 parent c4ace23 commit 0b46d0a

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
created_at: 2026-02-20 12:20:24 +0100
author: Łukasz Reszke
tags: []
publish: false
---

# Getting a nondeterministic agent into deterministic guardrails

AI agents don't reliably follow your instructions. Here's how I made it hurt less.

<!-- more -->

My context:
* I currently work on a 12-year-old legacy Rails codebase.
* The codebase is undergoing modernization. Some of the large Active Record classes have been split into smaller ones, each in its own bounded context. Events are becoming first-class citizens in the code. We also pay close attention to keeping the direction of dependencies as designed in the context maps.
* The client has a GitHub Copilot subscription. I mostly use the Sonnet and Opus models.

## The basic setup
20+
21+
Initially I started with the basics. I was curious where it would get us. There's an AGENTS.md file with general rules to follow. Besides the AGENTS.md file I've added a few skills.
22+
The goal of the skills is to tell the agent about how it should write code. I am a big fan of Szymon's way of using RSpec. So I put that into a skill. I also developed a few skills that tell the agent how I want it to deal with event sourcing, ddd technical patterns, hotwire, backfilling data (especially events) and mutation testing. The mutation skill is quite essential because without it the agent goes bananas and tries to achieve 100% of mutation coverage with hacking.
23+
24+
An example of hacking is calling `send(:method)`.
25+
26+
I don't want to have such tests. Trying to achieve mutation coverage in such a way indicates that perhaps the code should be removed because it's just unnecessary noise.
27+
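To make the `send` hack concrete, here is a minimal sketch. The `PriceFormatter` class is made up for illustration and is not from the project. It only shows that `send` bypasses method visibility, which is what lets a test poke at private code instead of exercising public behaviour.

```ruby
# Hypothetical class, for illustration only; not from the project.
class PriceFormatter
  def call(cents)
    "#{whole(cents)}.#{fraction(cents)}"
  end

  private

  def whole(cents)
    cents / 100
  end

  def fraction(cents)
    (cents % 100).to_s.rjust(2, '0')
  end
end

formatter = PriceFormatter.new

# The hack: `send` ignores method visibility, so a test can call the private
# method directly and kill its mutants without touching real behaviour.
hacked = formatter.send(:fraction, 5)

# The route I prefer: assert on the public method only.
formatted = formatter.call(1205)
```

If a mutant in `fraction` can only be killed through `send`, that usually means no public behaviour depends on the mutated code.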
So now the question is: is that enough?

## It's pretty good but can be better – tackling non-determinism

More than once (a day) I've seen my agent go off the rails and ignore my instructions. It doesn't respect what I've specified in AGENTS.md and/or the skills.

It often happens when I ask it to introduce a similar-yet-slightly-different command and handler for a specific business use case.

Changes to the production code go very well, especially when the goal is to replicate well-structured code. However, once it gets to the "write the tests" part, it switches to commodity mode and most likely uses RSpec in the most popular way, which I don't like. That style makes up a large part of the existing codebase. Even when it does write the tests the way I want, it usually doesn't run mutation testing, even though I expect the coverage not to drop below a certain point and the mutants to be eliminated. They should be killed properly.
Using the `send` method is no bueno.

### Dealing with non-determinism

We can't make the agent respect AGENTS.md and the skills all the time. At least not yet. Maybe never. So we have to deal with it differently.

What I am currently testing is guardrails, aka dev workflows. The idea is to run tools that:
- Let me focus less on code structure, incorrect formatting, etc.
- Make sure tests for changed files are run
- Make sure mutation tests are run
- And, last but not least, make sure that the boundaries of the bounded contexts are not violated. I noticed that the agent, just like humans, loves to take shortcuts to achieve a goal. The difference is that I never tell the agent we're under a strict deadline, so I'm not sure where this choice is coming from.

The workflow is Ruby code that is wired to a `/verify` custom command. The command runs bash with `ruby -r ./lib/dev_workflow.rb`.

The `dev_workflow.rb` file orchestrates the full pipeline. Looking at its requires tells you everything about what it runs:

```ruby
require_relative 'dev_workflow/step_result'
require_relative 'dev_workflow/result'
require_relative 'dev_workflow/changed_files'
require_relative 'dev_workflow/steps/base'
require_relative 'dev_workflow/steps/rubocop_step'
require_relative 'dev_workflow/steps/rspec_step'
require_relative 'dev_workflow/steps/mutant_step'
require_relative 'dev_workflow/steps/eslint_step'
require_relative 'dev_workflow/steps/jest_step'
require_relative 'dev_workflow/verify_build'
```

Each step follows the same pattern: check if relevant files changed, run the tool, return a structured result. Here's the mutation testing step as an example, the one that matters most given the problems I described earlier:

```ruby
class MutantStep < Base
  # Namespaces eligible for mutation testing; presumably consulted by
  # eligible_namespace? (not shown in this excerpt).
  ALLOWED_NAMESPACES = %w[CRM Ordering Billing].freeze

  def call
    unless changed_files.any_ruby?
      return StepResult.skipped(name: name, skip_reason: 'no ruby files changed')
    end

    subjects = mutation_subjects
    if subjects.empty?
      return StepResult.skipped(name: name, skip_reason: 'no mutant-eligible files changed')
    end

    # run_mutant returns [output, success]; measure_duration adds the elapsed time.
    result, duration = measure_duration do
      run_mutant(subjects)
    end

    output, success = result

    if success
      StepResult.success(name: name, duration_seconds: duration, files_checked: subjects.size)
    else
      errors = parse_mutant_output(output)
      StepResult.failure(name: name, duration_seconds: duration, files_checked: subjects.size, errors: errors)
    end
  end

  private

  def run_mutant(subjects)
    # Quote each subject so the shell does not interpret characters like # or *.
    subject_args = subjects.map { |s| "'#{s}'" }.join(' ')
    run_command("bundle exec mutant run --since HEAD #{subject_args}")
  end

  # Maps changed non-spec Ruby files to mutant subjects and keeps the eligible ones.
  # file_to_subject, eligible_namespace? and parse_mutant_output are not shown here.
  def mutation_subjects
    changed_files.ruby_files
                 .reject { |f| f.start_with?('spec/') }
                 .filter_map { |f| file_to_subject(f) }
                 .select { |subject| eligible_namespace?(subject) }
                 .uniq
  end
end
```
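
The step above leans on a few helpers inherited from `Base`, which this post doesn't show. Below is a rough sketch of what they might look like. Everything in it is an assumption inferred from how `MutantStep` calls them: `measure_duration` returning `[result, seconds]`, `run_command` returning `[output, success]`, and the `name` derivation.

```ruby
require 'open3'

# Hypothetical base class: the real Base is not shown in this post, so every
# name and signature here is inferred from how MutantStep uses it.
class Base
  attr_reader :changed_files

  def initialize(changed_files)
    @changed_files = changed_files
  end

  # Step name used in results, e.g. MutantStep becomes "mutant" (assumed).
  def name
    self.class.name.split('::').last.sub(/Step\z/, '').downcase
  end

  private

  # Yields, then returns [block_result, elapsed_seconds].
  def measure_duration
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    [result, (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started).round(2)]
  end

  # Runs a shell command and returns [combined stdout and stderr, success flag].
  def run_command(command)
    output, status = Open3.capture2e(command)
    [output, status.success?]
  end
end
```

Capturing stdout and stderr together keeps the tool's full report in one string, which is convenient when a later method has to parse it into errors.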

The key detail is `StepResult`. Each step returns either `.skipped`, `.success`, or `.failure` with structured data. This is what the agent reads to understand what went wrong and what to fix.

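For illustration, a `StepResult` along these lines could be a small value object. This is a sketch under assumptions, not the real class. The three constructors and their keywords mirror the calls in `MutantStep`, while the `status` field, `success?` and the JSON layout are my guesses, and the error message is made up.

```ruby
require 'json'

# Hypothetical sketch: constructor names and keywords mirror the calls in
# MutantStep; the status field and the JSON layout are assumptions.
StepResult = Struct.new(:name, :status, :duration_seconds, :files_checked,
                        :errors, :skip_reason, keyword_init: true) do
  def self.skipped(name:, skip_reason:)
    new(name: name, status: :skipped, skip_reason: skip_reason, errors: [])
  end

  def self.success(name:, duration_seconds:, files_checked:)
    new(name: name, status: :success, duration_seconds: duration_seconds,
        files_checked: files_checked, errors: [])
  end

  def self.failure(name:, duration_seconds:, files_checked:, errors:)
    new(name: name, status: :failure, duration_seconds: duration_seconds,
        files_checked: files_checked, errors: errors)
  end

  # A skipped step does not block anything; only a failure does.
  def success?
    status != :failure
  end

  # Drops nil fields so the JSON the agent reads stays compact.
  def to_json(*args)
    to_h.compact.to_json(*args)
  end
end

failure = StepResult.failure(
  name: 'mutant', duration_seconds: 12.4, files_checked: 1,
  errors: ['Ordering::PlaceOrder#call: 2 alive mutants'] # made-up message
)
puts failure.to_json
```

The point of the structure is that the agent gets a step name, a status and a list of errors to act on, instead of having to scrape raw tool output.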
Last but not least, to make sure the nondeterministic agent can't decide on its own to skip this command, I attached it to a git pre-commit hook:

```ruby
#!/usr/bin/env ruby

require_relative "../lib/dev_workflow"

result = DevWorkflow::VerifyBuild.call(staged_only: true)

puts result.to_json

exit(result.success? ? 0 : 1)
```

And at this point, at least running the verification is deterministic. The agent gets feedback, fixes whatever the tool reports, reruns the verification, and only then is it able to commit the changes.

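The hook works because git aborts the commit whenever the pre-commit script exits with a non-zero status. Here is a minimal sketch of how the aggregation behind `result.success?` and `result.to_json` might work. The `DevWorkflowSketch` module, the hash-based step results and the `steps:` keyword are assumptions made to keep the example self-contained. The real `VerifyBuild` takes `staged_only:` and isn't shown here.

```ruby
require 'json'

# Hypothetical stand-in for the real DevWorkflow module.
module DevWorkflowSketch
  # Aggregates step results; the build passes only if no step failed.
  class Result
    def initialize(step_results)
      @step_results = step_results
    end

    def success?
      @step_results.none? { |step| step[:status] == :failure }
    end

    def to_json(*args)
      { success: success?, steps: @step_results }.to_json(*args)
    end
  end

  class VerifyBuild
    # `steps` are callables returning a hash with at least :name and :status.
    def self.call(steps:)
      Result.new(steps.map(&:call))
    end
  end
end

steps = [
  -> { { name: 'rubocop', status: :success } },
  -> { { name: 'mutant', status: :failure, errors: ['1 alive mutant'] } }
]
result = DevWorkflowSketch::VerifyBuild.call(steps: steps)
puts result.to_json

# A hook script would pass this value to `exit`.
exit_code = result.success? ? 0 : 1
```

In this sketch a single failing step is enough to block the commit, and the JSON printed before exiting tells the agent which step to fix.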
## Reviewing changes

Besides AGENTS.md, SKILLS.md and the workflow I described above, I still review the code. I focus on tests, architecture and security.
I take full ownership of the code that I ship. I don't trust the AI enough to cut the leash. And my current conclusion from working with it in a legacy codebase is that this won't change any time soon (for me).
