fix(gateway): scope per-entity fault GET/DELETE by hosted-app FQNs#401
Open
bburda wants to merge 6 commits into
Open
fix(gateway): scope per-entity fault GET/DELETE by hosted-app FQNs#401bburda wants to merge 6 commits into
bburda wants to merge 6 commits into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes per-entity fault scoping so fault reads and clears are based on hosted app FQNs instead of empty or overly broad namespace prefixes.
Changes:
- Adds
HandlerContext::resolve_entity_source_fqns()and uses it in fault handlers. - Updates per-entity fault GET/list/DELETE paths to apply hosted-app source filtering.
- Adds regression tests plus API/changelog documentation for the behavior change.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp |
Applies FQN-based fault scoping before returning or clearing faults. |
src/ros2_medkit_gateway/src/http/handlers/handler_context.cpp |
Adds helper to resolve entity-owned source FQNs. |
src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/handlers/handler_context.hpp |
Declares and documents the new source-FQN resolver. |
src/ros2_medkit_gateway/test/test_handler_context.cpp |
Adds unit coverage for source-FQN resolution. |
src/ros2_medkit_integration_tests/test/features/test_faults_scope_isolation.test.py |
Adds integration regression coverage for cross-component fault leakage. |
src/ros2_medkit_gateway/CHANGELOG.rst |
Documents the fault scoping behavior change. |
docs/api/rest.rst |
Updates fault endpoint response-code documentation. |
Comments suppressed due to low confidence (2)
src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp:610
- This uses the existing prefix-based collection filter with exact app FQNs. For an app
/ns/node, a fault reported by/ns/node_extrawill be included because it starts with the owned FQN, so the collection route can still leak similarly named nodes' faults. The collection filtering should use exact FQN membership (or a path-boundary-aware match) when the filter set contains app FQNs.
const auto & cache = ctx_.node()->get_thread_safe_cache();
auto app_fqns = HandlerContext::resolve_entity_source_fqns(cache, entity_info);
auto result = fault_mgr->list_faults("", filter.include_pending, filter.include_confirmed, filter.include_cleared,
filter.include_healed, include_muted, include_clusters);
if (result.success) {
json filtered_faults = filter_faults_by_sources(result.data["faults"], app_fqns);
src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp:1029
faults_to_clearis selected by intersection, but the later clear call is still byfault_codeonly. A fault with one reporting source in this app and another source outside it will be selected here and then cleared globally, including the out-of-scope source. Exclude or reject mixed-scope faults before issuing unscoped clears.
const auto & cache = ctx_.node()->get_thread_safe_cache();
auto app_fqns = HandlerContext::resolve_entity_source_fqns(cache, entity_info);
faults_to_clear = filter_faults_by_sources(result.data["faults"], app_fqns);
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp:844
- With the new all-sources scope rule, this error also fires for mixed-source faults where one reporting source is inside the entity and another is outside. The message says the fault is not reported by any app in scope, which is inaccurate for that case and makes 404s misleading; it should describe that not all reporting sources are within the entity scope.
HandlerContext::send_error(res, 404, ERR_RESOURCE_NOT_FOUND, "Fault not found",
{{"details", "Fault is not reported by any app within this entity's scope"},
{entity_info.id_field, entity_id},
mfaferek93
reviewed
May 26, 2026
mfaferek93
reviewed
May 26, 2026
Per-entity fault routes used the addressed entity's namespace_path as a
transport-level prefix filter. When namespace_path was empty (host-derived
or synthetic components, manifest components without an explicit
`namespace` field, Areas, Functions), the filter short-circuited:
- GET /{entity}/faults/{code} returned faults reported by apps that lived
in entirely different entities.
- DELETE /{entity}/faults/{code} cleared them, since ClearFault.srv has
no scope argument and the handler did not check ownership first.
Both handlers now resolve the entity to the set of FQNs of every hosted
app via the new HandlerContext::resolve_entity_source_fqns helper and
require the fault's reporting_sources to intersect that set. Empty set
means "nothing in scope" -> 404. DELETE additionally fetches the fault
and verifies scope before issuing the clear.
The GetFault.srv / ClearFault.srv contracts are unchanged.
The first pass on #395 only fixed the per-fault GET/DELETE routes. The collection routes (handle_list_faults, handle_clear_all_faults) still passed entity_info.namespace_path as the transport-level prefix filter for App entities. Apps with a wildcard ros_binding.namespace_pattern yield an empty effective_fqn, the transport silently disabled the filter, and the list/clear-all routes leaked - or cleared - every fault in the system. Both App branches now route through resolve_entity_source_fqns plus filter_faults_by_sources, matching how Component / Area / Function are already handled. Also: - Move the #395 CHANGELOG entry to Breaking Changes (the 200 -> 404 flip is observable to clients). - Document 404 on GET /{entity}/faults/{fault_code} in docs/api/rest.rst. - Add @verifies REQ_INTEROP_013 / REQ_INTEROP_015 tags to the scope-isolation tests so the generated verification matrix picks them up. - Add a DELETE happy-path test that clears the fault on the owning component and asserts 204. - Add unit coverage for resolve_entity_source_fqns on FUNCTION entities (aggregation + skip-unbound-hosts cases). - Note the wildcard-App limitation in the helper doxygen. - Drop unused launch_testing.actions import in the integration test.
…onents Address the eight review findings from the post-push self-review pass: - Source matching now enforces a path boundary: a node is in scope iff it equals one of the entity's owned FQNs or is a strict path-child (`<fqn>/<...>`). Pre-fix, `rfind(prefix, 0) == 0` let sibling nodes like `/ns/node_extra` pass for an entity owning `/ns/node`. - `fault_in_source_scope` now requires ALL reporting sources to be in scope. The old "any match" semantic exposed two cross-entity paths: the GET response carries the fault's full `reporting_sources` and environment data; the DELETE issues an unscoped `ClearFault.srv` that clears the aggregated record across every source for the code. `filter_faults_by_sources` inherits the same semantic so per-entity collection routes and per-fault routes agree on what counts as in scope. - `resolve_entity_source_fqns` AREA branch walks `get_subareas()` recursively. The manifest model nests areas (e.g. `engine` under `powertrain`) and components attach to subareas, so a top-level area query previously returned an empty set. - `resolve_entity_source_fqns` FUNCTION branch handles host IDs that are components (Function.hosts is documented as "App or Component IDs"), expanding component hosts to their apps. The cache's `function_to_apps_` index only resolves app hosts. - `fault_in_source_scope` is now a public static method on FaultHandlers so it can be unit-tested directly without a launch test. Added test_fault_handlers cases for exact match, path-child match, prefix collision, multi-source all-in-scope, multi-source partial, empty-scope, missing field, malformed entry, and root-FQN edge. - Added test_handler_context cases for nested-subarea recursion and function-hosting-component expansion. - Integration test setUp now verifies both the owning and the unrelated component were actually discovered before asserting 404. Without it, a 404 from an undiscovered entity is indistinguishable from the 404 the scope filter produces, so the regression would silently pass even with the fix reverted. - Doxygen for `resolve_entity_source_fqns` reworded to be generic about what callers do with an empty result (per-fault routes 404, list 200 with empty `items`, clear-all 204).
N1: Per-entity App fault list response no longer emits the global muted_count / cluster_count / muted_faults / clusters correlation metadata. include_muted / include_clusters URL params are inert on per-entity routes - they remain on the global GET /api/v1/faults route where the aggregates are meaningful. N2: ClearFault.srv gains a `skip_correlation_auto_clear` request flag. fault_manager_node honors it: when true, the correlation engine is NOT asked to compute the auto-clear cascade for the cleared fault code, and auto_cleared_codes in the response is empty. Per-entity DELETE routes (both single-fault and clear-all) set the flag to true so a scoped clear cannot reach faults reported by apps in other entities via auto_clear_with_root rules. Global DELETE /api/v1/faults keeps the flag false to preserve cluster-wide clearing. N3: handle_list_faults and handle_clear_all_faults Function/Area branches now route through HandlerContext::resolve_entity_source_fqns. Previously the Function branch used cache.get_entity_configurations() which only resolves App hosts (so a Function whose `hosts` list carried a Component ID returned an empty FQN set), and the Area branch walked cache.get_components_for_area() without recursing into subareas (so a top-level area like `powertrain` whose components lived in the `engine` subarea returned an empty set). Both classes are fixed via the shared helper. N4: 404 messages on per-fault GET/DELETE now describe the actual all-sources rule: a fault must have every reporting_source in the entity's scope; mixed-source faults that include any out-of-entity reporter are rejected. The previous wording said "not reported by any app within scope" which was wrong for the mixed-source case. N5: CHANGELOG behavior-change description now matches the implementation - the rule is "every reporting_source must be in scope", not just "reporting_sources must intersect". Documents the include_muted / include_clusters drop and the new ClearFault flag for per-entity DELETE. Test changes: - MockFaultServiceTransport in test_fault_manager_routing tracks the new skip_correlation_auto_clear arg. - Two new FaultManagerRoutingTest cases pin the default (false, for global routes) and the per-entity routing (true, for scoped DELETE).
…closure Three follow-ups on the per-entity fault scope fix: * `docs/api/rest.rst`: add a per-entity-scope note under Faults Endpoints spelling out the all-sources rule and steering operators to the global `/faults` route for fault codes shared across entities. The CHANGELOG documented the rule, but the API reference is where operators look. * `CHANGELOG.rst`: split the `ClearFault.srv` hash change into its own Breaking Changes bullet. Adding the `skip_correlation_auto_clear` request field changes the service type hash, so out-of-tree callers that invoke `/fault_manager/clear_fault` directly must rebuild against the new `ros2_medkit_msgs`. The in-tree gateway client and server were already updated together. * `test_faults_scope_isolation.test.py`: the 404 tests on both GET and DELETE now also assert the response body carries none of the success-shape keys (`item`, `environment_data`, `reporting_sources`, `snapshots`) and does not echo the out-of-scope reporting app FQN. Non-disclosure is the actual security property; a regression that flipped the status code to 404 but kept serializing the fault snapshot would otherwise slip past status-only assertions.
3e86737 to
cbbd8b1
Compare
Customer-facing docs across `messages.rst`, `ros2_medkit_msgs/README.md`,
`ros2_medkit_fault_manager/README.md`, and the fault-correlation tutorial
still listed the pre-fix `ClearFault.srv` shape (only `fault_code` in the
request, `success` + `message` in the response). Update all four to:
* List `skip_correlation_auto_clear` in the request, with the semantics
spelled out (cascade clear by default; `true` opts out so the call
clears only the named fault and leaves correlated symptoms alone) and
the gateway routing rule (per-entity DELETE sets it to `true`, global
DELETE leaves it `false`).
* List `auto_cleared_codes` in the response (was already returned by
the service, only the markdown table missed it).
* Flag the service type-hash change for direct out-of-tree
`ros2 service call` clients and generated clients - they must rebuild
against the new `ros2_medkit_msgs` to keep talking to `fault_manager`.
* Note in the fault-correlation tutorial that per-entity DELETE returns
empty `auto_cleared_codes` and that operators who want the cascade
must use the global `DELETE /api/v1/faults/{fault_code}` route.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Per-entity fault routes (
GET/DELETE /{entity-path}/faults/{fault_code}) leaked faults across entities. When the addressed entity had an emptynamespace_path(host-derived or synthetic components, manifest components without anamespacefield, Areas, Functions), the transport-level prefix filter short-circuited:GETreturned faults reported by apps that lived in entirely different entities.DELETEcleared them too, becauseClearFault.srvcarries no scope argument and the handler issued the clear without checking ownership first.Both handlers now resolve the addressed entity to the set of FQNs of every hosted app via a new
HandlerContext::resolve_entity_source_fqnshelper and apply a strict all-sources scope check: a fault is in scope iff every entry in itsreporting_sourcesis owned by the entity. Per-fault routes return404for any fault that fails the check; per-entity collection responses omit it; per-entityDELETEskips it.DELETEadditionally fetches the fault and verifies scope before issuing the clear so it can never clear a fault that does not belong to the entity, and uses the newClearFault.srvskip_correlation_auto_clearrequest field to opt out of cascade-clearing correlated symptoms that may live outside the entity.GetFault.srvis unchanged.ClearFault.srvgains theskip_correlation_auto_clearrequest field - this changes the service type hash and breaks out-of-tree callers that invoke/fault_manager/clear_faultdirectly (theros2_medkit_fault_managerREADME documents that pattern). The in-tree gateway client and server are updated together. See CHANGELOG for the full client-visible behavior delta.Issue
Type
Testing
TDD red-green verified locally:
test_faults_scope_isolation(new integration test,lidar_sensor+temp_sensorin hybrid mode withdemo_nodes_manifest.yaml):test_get_fault_returns_404_on_unrelated_componentgets200with the leaked lidar fault body andtest_clear_fault_returns_404_on_unrelated_componentgets204(the unrelated component cleared the lidar fault).item,environment_data,reporting_sources,snapshots) and does not echo the owning app's FQN, so a future regression that flips the status code without sanitizing the body still fails.FaultHandlers::fault_in_source_scopecovering exact-match, strict path-child, prefix collision, multi-source all-in-scope, multi-source partially-in-scope (the DELETE escalation case), empty scope, empty / missing / malformedreporting_sources, and a root-FQN edge.HandlerContext::resolve_entity_source_fqnscovering App / Component / Area (with nested subarea recursion) / Function (with component-host expansion) / Unknown plus empty-effective-fqn and missing-from-cache edges.skip_correlation_auto_clear=trueand the global DELETE leaves itfalse.test_faults_api,test_locking_faults,test_triggers_faults,test_scenario_fault_inspection,test_scenario_fault_lifecycle) pass.colcon test --packages-select ros2_medkit_gatewaylinters (clang_format, cppcheck, flake8, lint_cmake, pep257, xmllint) andclang-tidybuild green.Reviewer steps:
colcon build --packages-up-to ros2_medkit_integration_testscolcon test --packages-select ros2_medkit_gateway ros2_medkit_integration_tests --ctest-args -R 'test_faults_scope_isolation|test_handler_context|test_fault_handlers|test_fault_manager_routing'Checklist
Backwards compatibility note
Three client-visible behavior changes are documented in the CHANGELOG:
200/204for a fault outside an entity's scope now get404. The all-sources rule also rejects mixed-source faults that include any out-of-entity reporter - use the globalGET /api/v1/faultsto see those.ClearFault.srvservice hash change: adding theskip_correlation_auto_clearrequest field changes the service type hash. Out-of-tree callers that invoke/fault_manager/clear_faultdirectly must rebuild against the newros2_medkit_msgs.muted_count/cluster_count/muted_faults/clusterscorrelation metadata; that remains on the globalGET /api/v1/faultsroute.