docs(library): add ATR-inspired threat detection example#1869
Conversation
Adds an example configuration under examples/configs/atr_threat_detection that wires the built-in regex_detection input rail to a small set of ATR-inspired patterns covering instruction override, system prompt exfiltration, role-play jailbreak, base64-wrapped payload hints, MCP tool override markers, and file:// SSRF references. The full open detection set lives at https://github.com/Agent-Threat-Rule/agent-threat-rules under Apache-2.0. The example uses only built-in components (no new dependencies, no LLM calls in the rail itself) and includes a README with a runnable nemoguardrails chat command.
Documentation preview |
📝 WalkthroughWalkthroughThis pull request introduces a complete ATR-inspired threat detection example for NeMo Guardrails, including configuration files, detection logic, and documentation. The example demonstrates regex-based detection of six threat categories with automatic refusal responses. ChangesATR Threat Detection Example
🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
examples/configs/atr_threat_detection/README.md (1)
37-41: ⚡ Quick winDocument how to enable the optional
atr report matchflow.At Line 37 onward, the extension section would be stronger if it explicitly shows adding
atr report matchunderrails.input.flows, so users can actually surface matched detections as described inrails.co.✍️ Suggested doc patch
## Extending To run against the live ATR YAML ruleset, parse the rule files at startup and append the `detection.regex_patterns` field of each rule to the `patterns` list under `regex_detection.input`. + +To also expose matched detections, add the optional flow from `rails.co` +to your input flows in `config/config.yml`: + +```yaml +rails: + input: + flows: + - atr report match + - regex check input +```🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/configs/atr_threat_detection/README.md` around lines 37 - 41, Update the "Extending" docs to show how to enable the optional "atr report match" flow by adding an example rails.input.flows YAML stanza that includes "atr report match" (and other flows like "regex check input"); also explicitly state to parse the live ATR YAML ruleset at startup and append each rule's detection.regex_patterns into the patterns list under regex_detection.input so matched detections surface via the rails flow (refer to rails.input.flows, atr report match, regex_detection.input, detection.regex_patterns, and patterns when making the change).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@examples/configs/atr_threat_detection/README.md`:
- Around line 37-41: Update the "Extending" docs to show how to enable the
optional "atr report match" flow by adding an example rails.input.flows YAML
stanza that includes "atr report match" (and other flows like "regex check
input"); also explicitly state to parse the live ATR YAML ruleset at startup and
append each rule's detection.regex_patterns into the patterns list under
regex_detection.input so matched detections surface via the rails flow (refer to
rails.input.flows, atr report match, regex_detection.input,
detection.regex_patterns, and patterns when making the change).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7bdd8e5b-4e1d-4b12-aa70-a31915585737
📒 Files selected for processing (3)
examples/configs/atr_threat_detection/README.mdexamples/configs/atr_threat_detection/config/config.ymlexamples/configs/atr_threat_detection/config/rails.co
Greptile SummaryThis PR adds a new
|
| Filename | Overview |
|---|---|
| examples/configs/atr_threat_detection/README.md | Documentation for the ATR threat detection example; extending snippet uses define flow instead of the library-standard define subflow, and the audit-logging guidance slightly mischaracterises what $result["detections"] returns. |
| examples/configs/atr_threat_detection/config/config.yml | Correct YAML structure matching existing test fixtures; configures gpt-4o-mini as the main model, applies case_insensitive: true, and wires regex check input via rails.input.flows. |
Sequence Diagram
sequenceDiagram
participant User
participant NeMoGuardrails
participant RegexCheckInput as regex check input (subflow)
participant DetectRegexPattern as detect_regex_pattern (action)
participant LLM as Main LLM (gpt-4o-mini)
User->>NeMoGuardrails: user message
NeMoGuardrails->>RegexCheckInput: invoke input rail
RegexCheckInput->>DetectRegexPattern: "execute(source=input, text=$user_message)"
DetectRegexPattern-->>RegexCheckInput: "{is_match, text, detections}"
alt "is_match == true"
RegexCheckInput-->>NeMoGuardrails: bot refuse to respond + stop
NeMoGuardrails-->>User: I'm sorry, I can't respond to that.
else "is_match == false"
RegexCheckInput-->>NeMoGuardrails: no match, continue
NeMoGuardrails->>LLM: forward benign message
LLM-->>NeMoGuardrails: response
NeMoGuardrails-->>User: LLM response
end
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
examples/configs/atr_threat_detection/README.md:63
Every input rail in the library and guardrails_only examples is defined with `define subflow` (`regex check input`, `dummy input rail`, etc.). Using `define flow` in the extending snippet is inconsistent with this pattern. In Colang 1.0, a bare `define flow` participates in context-based routing, which could interfere with how the engine selects flows. Using `define subflow` makes the definition an explicit subroutine that is only ever invoked directly by the rails system, matching the expected semantics.
```suggestion
define subflow atr report match
```
### Issue 2 of 2
examples/configs/atr_threat_detection/README.md:80-84
`$result["detections"]` returns the matched **regex pattern strings** (e.g., `"\\b(ignore|disregard|forget)\\s+..."`) not ATR rule IDs. A user implementing audit logging who reads "capture the matched rule list" will likely expect short identifiers like `["ATR-PI-001"]` but will instead receive the full raw pattern. The extending note should clarify that `detections` contains the pattern strings so readers know what format to expect when forwarding to a log sink or exception message.
Reviews (5): Last reviewed commit: "docs(example): fix Extending snippet to ..." | Re-trigger Greptile
| define flow atr report match | ||
| """Optional flow: log the matched ATR rule(s) when the input rail fires.""" | ||
| $result = await DetectRegexMatchAction(source="input", text=$user_message) | ||
| if $result["is_match"] | ||
| $matched_rules = $result["detections"] | ||
| bot refuse to respond | ||
| abort |
There was a problem hiding this comment.
atr report match flow is never wired as an input rail
The flow is defined but never referenced in config.yml under rails.input.flows, so it will never be executed. The file comment and PR description describe it as exposing matched rules for logging, but there is no path that invokes it. Any user who copies this example will silently get no logging, with no indication the flow is dormant. Additionally, if a developer does add it to the flows list alongside regex check input, DetectRegexMatchAction will run twice on the same message, which is redundant and could cause a double-abort.
Prompt To Fix With AI
This is a comment left during a code review.
Path: examples/configs/atr_threat_detection/config/rails.co
Line: 11-17
Comment:
**`atr report match` flow is never wired as an input rail**
The flow is defined but never referenced in `config.yml` under `rails.input.flows`, so it will never be executed. The file comment and PR description describe it as exposing matched rules for logging, but there is no path that invokes it. Any user who copies this example will silently get no logging, with no indication the flow is dormant. Additionally, if a developer does add it to the flows list alongside `regex check input`, `DetectRegexMatchAction` will run twice on the same message, which is redundant and could cause a double-abort.
How can I resolve this? If you propose a fix, please make it concise.| define flow atr report match | ||
| """Optional flow: log the matched ATR rule(s) when the input rail fires.""" | ||
| $result = await DetectRegexMatchAction(source="input", text=$user_message) | ||
| if $result["is_match"] |
There was a problem hiding this comment.
$matched_rules is assigned but never used
$matched_rules = $result["detections"] stores the list of matched patterns in a local variable, but neither a log statement nor any other reference to that variable follows. The docstring says the flow "log[s] the matched ATR rule(s)" and the PR description says it "exposes the matched rule list for logging," but the actual logging is missing, making the promise in the comment inaccurate.
Prompt To Fix With AI
This is a comment left during a code review.
Path: examples/configs/atr_threat_detection/config/rails.co
Line: 14
Comment:
**`$matched_rules` is assigned but never used**
`$matched_rules = $result["detections"]` stores the list of matched patterns in a local variable, but neither a log statement nor any other reference to that variable follows. The docstring says the flow "log[s] the matched ATR rule(s)" and the PR description says it "exposes the matched rule list for logging," but the actual logging is missing, making the promise in the comment inaccurate.
How can I resolve this? If you propose a fix, please make it concise.| models: [] | ||
|
|
There was a problem hiding this comment.
models: [] causes failure for non-matching inputs in interactive chat
The README guides users to run nemoguardrails chat interactively. If a user types any message that does not match a threat pattern, the runtime will attempt to invoke the main LLM to generate a response and fail because no model is configured. The chat example only works end-to-end for inputs that trigger the rail. Either adding a brief note to the README or providing a minimal stub model entry would prevent user confusion.
Prompt To Fix With AI
This is a comment left during a code review.
Path: examples/configs/atr_threat_detection/config/config.yml
Line: 1-2
Comment:
**`models: []` causes failure for non-matching inputs in interactive chat**
The README guides users to run `nemoguardrails chat` interactively. If a user types any message that does *not* match a threat pattern, the runtime will attempt to invoke the main LLM to generate a response and fail because no model is configured. The chat example only works end-to-end for inputs that trigger the rail. Either adding a brief note to the README or providing a minimal stub model entry would prevent user confusion.
How can I resolve this? If you propose a fix, please make it concise.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…itpick Adds a YAML stanza showing how to enable the optional 'atr report match' flow that already ships in rails.co, so users can surface matched rule identifiers instead of only the generic refusal. Order note clarifies why 'atr report match' must come before 'regex check input'.
| define bot refuse to respond | ||
| "I'm sorry, your request matched a threat detection rule and was blocked." |
There was a problem hiding this comment.
bot refuse to respond redefinition creates non-deterministic output
The library file nemoguardrails/library/regex/flows.v1.co already defines bot refuse to respond as "I'm sorry, I can't respond to that.". Because the default Colang version is 1.0 (no colang_version key in config.yml) and both files are loaded at runtime, Colang 1.0 treats both strings as alternatives for the same utterance. With models: [] there is no LLM to arbitrate, so the runtime will fall back to one of the two strings non-deterministically — likely the library's generic message rather than the threat-specific one defined here. A developer following this example will not reliably see the custom refusal message.
Prompt To Fix With AI
This is a comment left during a code review.
Path: examples/configs/atr_threat_detection/config/rails.co
Line: 8-9
Comment:
**`bot refuse to respond` redefinition creates non-deterministic output**
The library file `nemoguardrails/library/regex/flows.v1.co` already defines `bot refuse to respond` as `"I'm sorry, I can't respond to that."`. Because the default Colang version is 1.0 (no `colang_version` key in `config.yml`) and both files are loaded at runtime, Colang 1.0 treats both strings as *alternatives* for the same utterance. With `models: []` there is no LLM to arbitrate, so the runtime will fall back to one of the two strings non-deterministically — likely the library's generic message rather than the threat-specific one defined here. A developer following this example will not reliably see the custom refusal message.
How can I resolve this? If you propose a fix, please make it concise.Address Greptile/coderabbit P1 + P2 findings from @NVIDIA-NeMo bot reviewers on this PR: 1. P1: `bot refuse to respond` redefinition in rails.co collided with the library default in nemoguardrails/library/regex/flows.v1.co. Under Colang 1.0 this made the refusal utterance non-deterministic with `models: []` and no LLM to arbitrate. Fix: delete rails.co entirely. The example now uses the library default refusal message ("I'm sorry, I can't respond to that."). 2. P2: `models: []` caused a runtime error in `nemoguardrails chat` for any benign user message (the runtime needs a main model when the input rail does not abort). Fix: add a main model stub (openai/gpt-4o-mini) so chat runs end-to-end. The input rail still blocks threats before the model is invoked, so the model only sees benign inputs. 3. P2: `atr report match` flow was defined in rails.co but never wired under rails.input.flows -- it was dormant code. Fix: removed (rails.co deleted). The README's Extending section now shows the custom-flow pattern with a non-conflicting bot utterance (`bot refuse atr_threat`) and a `AtrRuleMatchedRailException` event for downstream observers, so the documented pattern is correct. 4. P2: `$matched_rules = $result["detections"]` assigned but never referenced -- comment promised "log the matched ATR rule(s)" but no logging followed. Fix: removed (the dormant flow no longer exists). The Extending section's custom-flow example uses `$matched_rules` only to gate the event emission, and emits an `AtrRuleMatchedRailException` so downstream code can subscribe to it. 5. Documentation correction: README and config.yml both cited ATR as Apache-2.0 -- the actual license is MIT. Corrected both references. Net diff: - config.yml: add main model stub, fix license comment. - rails.co: removed (used library default refusal). - README.md: fix license, update behavior description, replace stale "atr report match" wiring instructions with a correct custom-flow example in the Extending section. Tests pass locally (no test files in this PR; existing pr-tests-matrix green on 3.10-3.13). Signed-off-by: Adam Lin <adam@agentthreatrule.org>
|
@coderabbitai @greptile-apps — pushed 813f41c addressing the four findings on this PR: P1 — P2 — P2 — P2 — Documentation correction — README and config.yml both cited ATR as Apache-2.0. The actual license is MIT (LICENSE file + package.json). Fixed both references. Net diff: 3 files changed, +41 / -36 (rails.co deleted). All pr-tests-matrix checks should remain green; no test code in this PR. |
|
Tip For best results, initiate chat on the files or code changes.
P1 — P2 — P2 — P2 — License correction ✅ The net diff of +41/−36 with 🐇 ✨ |
|
|
||
| This example shows how to use the built-in `regex_detection` input rail | ||
| with a small set of patterns inspired by Agent Threat Rules, an open | ||
| detection standard for AI agent threats published under the MIT license: |
There was a problem hiding this comment.
The README states the ATR project is published under the MIT license, but the PR description says it is Apache-2.0. A user relying on this file to assess license compatibility for their project will get incorrect information.
| detection standard for AI agent threats published under the MIT license: | |
| detection standard for AI agent threats published under the Apache-2.0 license: |
Prompt To Fix With AI
This is a comment left during a code review.
Path: examples/configs/atr_threat_detection/README.md
Line: 5
Comment:
The README states the ATR project is published under the MIT license, but the PR description says it is Apache-2.0. A user relying on this file to assess license compatibility for their project will get incorrect information.
```suggestion
detection standard for AI agent threats published under the Apache-2.0 license:
```
How can I resolve this? If you propose a fix, please make it concise.| ```colang | ||
| define bot refuse atr_threat | ||
| "I'm sorry, that request was blocked by an ATR input safety rule." | ||
|
|
||
| define flow atr report match | ||
| $result = execute detect_regex_pattern(source="input", text=$user_message) | ||
| if $result["is_match"] | ||
| $matched_rules = $result["detections"] | ||
| create event AtrRuleMatchedRailException(message="ATR input rail blocked") | ||
| bot refuse atr_threat | ||
| stop | ||
| ``` |
There was a problem hiding this comment.
Extending snippet silently drops the custom refusal message
When both create event AtrRuleMatchedRailException and bot refuse atr_threat fire in the same flow, the Colang 1.0 runtime's event loop (see llmrails.py lines 946–954) appends the bot script to responses but also sets exception = event. The subsequent branch at line 986 (if exception: new_message = {"role": "exception", ...}) short-circuits and never uses responses, so the caller receives an exception object instead of the custom refusal text.
The established pattern (e.g. guardrails_only/input/config.co) guards on $config.enable_rails_exceptions and uses an if/else to emit either the exception event or the bot utterance — not both unconditionally.
Prompt To Fix With AI
This is a comment left during a code review.
Path: examples/configs/atr_threat_detection/README.md
Line: 53-64
Comment:
**Extending snippet silently drops the custom refusal message**
When both `create event AtrRuleMatchedRailException` and `bot refuse atr_threat` fire in the same flow, the Colang 1.0 runtime's event loop (see `llmrails.py` lines 946–954) appends the bot script to `responses` but also sets `exception = event`. The subsequent branch at line 986 (`if exception: new_message = {"role": "exception", ...}`) short-circuits and never uses `responses`, so the caller receives an exception object instead of the custom refusal text.
The established pattern (e.g. `guardrails_only/input/config.co`) guards on `$config.enable_rails_exceptions` and uses an `if/else` to emit **either** the exception event **or** the bot utterance — not both unconditionally.
How can I resolve this? If you propose a fix, please make it concise.Address greptile P1 follow-up on the README's Extending section: In Colang 1.0, the rails event loop appends bot utterances to `responses` but the subsequent branch that handles `exception = event` short-circuits and never emits `responses`, so combining `create event ...RailException` and `bot refuse atr_threat` in the same flow silently drops the refusal. The canonical pattern (e.g. examples/configs/guardrails_only/input/ config.co's `dummy input rail`) gates on `$config.enable_rails_exceptions` and uses an `if/else` to emit **either** the exception event **or** the bot utterance. Updated the README Extending snippet to follow that pattern, with an explicit note about the dropped-utterance behavior so future readers do not repeat the mistake. Also added a short paragraph explaining how to capture `$matched_rules` for downstream audit without breaking the canonical single-action branches. PR description: corrected from "Apache-2.0" to "MIT license" so it matches the README (ATR is MIT-licensed per LICENSE and package.json). Signed-off-by: Adam Lin <adam@agentthreatrule.org>
|
@greptile-apps — pushed 29beb3b addressing both follow-up P1 findings: P1 #1 (license mismatch): The README is correct — ATR is MIT-licensed (LICENSE + package.json in https://github.com/Agent-Threat-Rule/agent-threat-rules confirm). The PR description was the stale half. I updated the PR description from "Apache-2.0" to "MIT license" so the two now match. P1 #2 (Extending snippet dropping the bot refuse): Good catch on the Colang 1.0 event-loop behavior — combining Added an explicit note about the dropped-utterance behavior so future readers do not repeat the mistake, plus a short paragraph on how to capture |
|
Both fixes are solid. The Colang snippet now correctly mirrors the One minor nit on the snippet: the That said, the logic and the structural fix for the P1 findings are correct. The license alignment (MIT) and the event-loop behavior are both handled properly. This is good to merge once that keyword is addressed (or if you intentionally prefer Tip: You can customize Greptile's behavior for this repo with |
This adds an example configuration under examples/configs/atr_threat_detection that demonstrates how to wire the built-in regex_detection input rail to a small set of patterns inspired by Agent Threat Rules. ATR is an open detection standard for AI agent threats published under the MIT license at https://github.com/Agent-Threat-Rule/agent-threat-rules.
The example covers six common attack categories using compact regex patterns: instruction override, system prompt exfiltration, role-play jailbreak, base64-wrapped payload hints, MCP tool override markers, and file:// SSRF references. Each entry in the config carries an ATR rule id comment so users can cross-reference the open ruleset.
The example uses only built-in components. config.yml configures regex_detection.input with the patterns and case_insensitive flag, the input rail invokes the existing regex check input flow, and rails.co defines the refusal message plus an optional flow that exposes the matched rule list for logging. There are no new dependencies, no external services, and no LLM calls in the rail itself.
The README explains what each pattern targets, shows a runnable nemoguardrails chat command, and points to the upstream ATR repository for users who want to load the full ruleset at startup.
This fills a gap in the existing examples folder, which has injection_detection (YARA-based) and regex (used in tests) but no agent-specific threat sample focused on prompt injection, MCP, and skill-style attacks.
Summary by CodeRabbit
New Features
Documentation