Skip to content

Spurious recovery action trigger on first run target activation #198

@WilliamRoebuck

Description

@WilliamRoebuck

Description

  1. Control client reports running
  2. Initial state transition is successful (see clock() at successful initial state transition log)
  3. A problem is discovered and recovery action triggered (final log line below)

This can cause flaky test failures unless mitigated with a sleep(1) after the control client reports running and before requesting the first run target

Analysis results

No response

Solution

No response

Error Occurrence Rate

Sporadic

How to reproduce

Use bazel test //tests/integration/smoke --config=x86_64-linux --runs_per_test=60 and the error should occur in one or more runs, failing the test.

The sleep(1) workaround added in #194 should be temporarily removed if present

Supporting Information

This log shows a recovery action triggered during the smoke test:

[INFO] [launch_manager] [ RUN      ] Smoke.Daemon
[2026-05-13 12:12:43.279] [INFO] [launch_manager] [ STEP     ] Control daemon report kRunning
[2026-05-13 12:12:43.280] [INFO] [launch_manager] [ END STEP ] Control daemon report kRunning
[2026-05-13 12:12:43.280] [INFO] [launch_manager] [ STEP     ] Activate RunTarget Running
[2026-05-13 12:12:43.280] [INFO] [launch_manager]  2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Request sent. Waiting for acknowledgment... ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]  2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Got kRunning for pid 67 ( control_daemon ) process 0 of group MainPG ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]     2026/5/13 12:122026:43 LCLM /LCLM 5DEBUG:  [/13  12:12startProcess for: 43MainPG  LCLM LCLM processDEBUG:  [ 0 (  ProcessGroupManager::ControlClientHandler: got requestcontrol_daemon  ) done ]kSetStateRequest ( 16
[2026-05-13 12:12:43.281] [INFO] [launch_manager]    2026/5/13 12:12:43 LCLM) re state  LCLM DEBUG:  [ Control Client handler nudged ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager] MainPG/Running  of PG 2026/ 5MainPG/ ]13 12:
[2026-05-13 12:12:43.281] [INFO] [launch_manager] 12  2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Pending state for process group MainPG changed from  to MainPG/Running ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Graph::setState changes from kInTransition to kCancelled for PG 0 ( MainPG ) ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Control Client handler nudged ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Graph::setState changes from kCancelled to kUndefinedState for PG 0 ( MainPG ) ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Request acknowledged. ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Request acknowledged. ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Pending state for process group MainPG changed from MainPG/Running to  ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Start transition to MainPG/Running for PG MainPG ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]  2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Graph::setState changes from kUndefinedState to kInTransition for PG 0 ( MainPG ) ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Stop Dependencies: 0 ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]  2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Stop Dependencies: 0 ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager]  2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Start Dependencies: 0 ]
[2026-05-13 12:12:43.281] [INFO] [launch_manager] :  202643/ LCLM LCLM5 /13 DEBUG:  [12 :clock() at successful initial state transition:12:43 LCLM LCLM DEBUG:  [ Start Dependencies: 0 ]
[2026-05-13 12:12:43.282] [INFO] [launch_manager]  18.903 ms ]
[2026-05-13 12:12:43.282] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Graph::setState changes from kInTransition to kSuccess for PG 0 ( MainPG ) ]
[2026-05-13 12:12:43.282] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM INFO:    [ Completed the request for PG MainPG to State MainPG/Running in 0 ms ]
[2026-05-13 12:12:43.282] [INFO] [launch_manager]   2026 /5/ 13 122026:/5/1312 12:12::4343 LCLM  LCLMLCLM  DEBUG:  [ LCLMGraph::setState changes from DEBUG:  [  kSuccess to kUndefinedState Control Client handler nudged ]for PG
[2026-05-13 12:12:43.282] [INFO] [launch_manager]  0 ( MainPG ) ]
[2026-05-13 12:12:43.282] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Control Client handler nudged ]
[2026-05-13 12:12:43.282] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ ProcessGroupManager::ControlClientHandler: Sending kSetStateSuccess ( 20 ) re state MainPG/Running of PG MainPG ]
[2026-05-13 12:12:43.282] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM DEBUG:  [ Response sent. ]
[2026-05-13 12:12:43.282] [INFO] [launch_manager]   2026/5/13 12:12:43 LCLM LCLM WARNING: [ Problem discovered in PG MainPG Activating Recovery state. ]

Classification

Major

First Affected Release

0.5

Last Affected Release

0.7

Expected Fixed Release

0.8

Category

  • Safety Relevant
  • Security Relevant

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No fields configured for Bug.

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions