fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard #8230

Open
djsly wants to merge 2 commits into main from djsly/37384340

djsly commented Apr 2, 2026

  • Understand the issue: the new 5-arg signature of retrycmd_get_tarball breaks VHD build callers that still use the old 4-arg signature
  • Add backward compatibility to retrycmd_get_tarball: detect the old 4-arg signature (3rd arg is a file path, not a number) and default the timeout to 60s
  • Update spec tests to cover the backward-compatible old 4-arg signature (2 new tests)
  • Run parallel code review and CodeQL validation
  • Final code review - [[ ]] and =~ are fine here (file has #!/bin/bash shebang and already uses [[ extensively)
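
The signature detection described in the checklist could look roughly like the sketch below. It is not the code from this PR: `parse_tarball_args` is a hypothetical helper, and the argument order (retries, wait_sleep, [timeout], tarball, url, [max_budget_s]) is an assumption based on the description. The PR mentions `[[` and `=~`; this sketch swaps in a plain `case` pattern so it also runs under a POSIX shell.

```shell
# Hypothetical helper, for illustration only.
# Assumed old form: retries wait_sleep tarball url
# Assumed new form: retries wait_sleep timeout tarball url [max_budget_s]
parse_tarball_args() {
    case "$3" in
        ''|*[!0-9]*)
            # 3rd arg is empty or non-numeric (a file path): old signature,
            # so fall back to the previous hardcoded 60s timeout.
            echo "timeout=60 tarball=$3 url=$4 budget=0"
            ;;
        *)
            # 3rd arg is all digits: new signature with an explicit timeout
            # and an optional budget that defaults to 0 (disabled).
            echo "timeout=$3 tarball=$4 url=$5 budget=${6:-0}"
            ;;
    esac
}

# Old-style call (4 args) and new-style call (6 args); URLs are placeholders.
parse_tarball_args 120 5 /tmp/cni.tgz https://example.invalid/cni.tgz
parse_tarball_args 120 5 30 /tmp/cni.tgz https://example.invalid/cni.tgz 300
```

One consequence of this kind of detection is that a tarball path consisting only of digits would be misread as a timeout; that seems unlikely for real call sites but is worth keeping in mind.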

…and guard

Remove hardcoded 60s timeout from retrycmd_get_tarball; caller now
passes timeout as 3rd positional arg (matches retrycmd_curl_file style).

Add optional max_budget_s (6th arg, default 0) to retrycmd_get_tarball
and retrycmd_curl_file. When > 0, _retry_file_curl_internal tracks
wall-clock time and exits early (return 2) once the budget is consumed,
bounding any single download to 5 min at provisioning time regardless
of retry count.
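
As a rough illustration of the budget mechanism, the sketch below wraps an arbitrary command in a retry loop that tracks wall-clock time. It is an assumption-laden stand-in, not `_retry_file_curl_internal` itself: the function name `retry_with_budget`, its argument order, and the choice to check the budget before each attempt are all hypothetical. Only the "budget of 0 disables the check" and "return 2 when the budget is consumed" behaviors come from the commit message.

```shell
# Hypothetical stand-in for the per-operation budget check.
# Usage: retry_with_budget <retries> <wait_sleep_s> <max_budget_s> <command...>
retry_with_budget() {
    local retries wait_sleep max_budget_s start elapsed i
    retries=$1
    wait_sleep=$2
    max_budget_s=$3
    shift 3
    start=$(date +%s)
    i=1
    while [ "$i" -le "$retries" ]; do
        # A budget of 0 means "no budget", mirroring the default in the commit message.
        if [ "$max_budget_s" -gt 0 ]; then
            elapsed=$(( $(date +%s) - start ))
            if [ "$elapsed" -ge "$max_budget_s" ]; then
                echo "budget of ${max_budget_s}s consumed after ${elapsed}s" >&2
                return 2
            fi
        fi
        # Run the wrapped command; stop retrying on the first success.
        "$@" && return 0
        sleep "$wait_sleep"
        i=$((i + 1))
    done
    # All retries used without success and without exhausting the budget.
    return 1
}
```

With this shape, a call such as `retry_with_budget 120 5 300 curl -fsSL -o "$file" "$url"` would stop scheduling new attempts once 300 seconds have passed, even though 120 retries were requested. An attempt already in flight is still bounded only by its own per-try timeout.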

Add pre-command check_cse_timeout guard at the TOP of _retrycmd_internal
loop. Previously the guard only fired after a failed command, meaning a
new 300-600s operation could START at minute 12:59 and run to ~18 min.
The guard now prevents starting any new attempt when >780s have elapsed
since CSE_STARTTIME_SECONDS.
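
To make the ordering concrete: an attempt that starts at 12:59 (779s) with a 300s per-try timeout can run until 1079s, which is roughly 18 minutes. Checking the deadline before each attempt closes that gap. The sketch below assumes `CSE_STARTTIME_SECONDS` holds an epoch timestamp; `cse_timeout_reached`, `guarded_retry`, `CSE_MAX_ELAPSED_S`, and the return code 2 are hypothetical names and choices, not the actual `check_cse_timeout` or `_retrycmd_internal`.

```shell
# Hypothetical sketch of a pre-command guard; the real helpers may differ.
CSE_MAX_ELAPSED_S=780   # threshold quoted in the commit message

cse_timeout_reached() {
    # Succeeds (returns 0) when more than CSE_MAX_ELAPSED_S seconds have
    # elapsed since CSE_STARTTIME_SECONDS (assumed to be epoch seconds).
    local now elapsed
    now=$(date +%s)
    elapsed=$(( now - CSE_STARTTIME_SECONDS ))
    [ "$elapsed" -gt "$CSE_MAX_ELAPSED_S" ]
}

# Usage: guarded_retry <retries> <wait_sleep_s> <command...>
guarded_retry() {
    local retries wait_sleep i
    retries=$1
    wait_sleep=$2
    shift 2
    i=1
    while [ "$i" -le "$retries" ]; do
        # Guard at the TOP of the loop: refuse to start a new attempt
        # once the global deadline has passed.
        if cse_timeout_reached; then
            echo "CSE timeout reached, not starting attempt $i" >&2
            return 2
        fi
        "$@" && return 0
        sleep "$wait_sleep"
        i=$((i + 1))
    done
    return 1
}
```

Placing the check only after a failed command, as before this change, lets the first attempt start unconditionally; placing it first means the command is never launched when the deadline has already passed.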

Pass 300s budget to all 11 provisioning-time download call sites:
  cse_install.sh (credential-provider, oras, secure-TLS, CNI x2, crictl, k8s)
  ubuntu/cse_install_ubuntu.sh (nvidia GPG key, containerd, runc)
  mariner/cse_install_mariner.sh (nvidia repo file)

Reduce nvidia-smi per-try timeout from 300s to 30s in cse_config.sh (3 sites).
nvidia-smi is a CLI status check, not a long-running driver operation;
the 300s value was a copy-paste from the GPU driver install command.

Update spec tests: fix parameter tables, add budget-exceeded test and
pre-command CSE timeout test in cse_retry_helpers_spec.sh; update
expected call signature in cse_install_spec.sh.

Copilot AI left a comment

Pull request overview

This PR aims to prevent Linux CSE runs from exceeding the VM provisioner’s ~16-minute client window by adding (1) a pre-attempt global timeout guard and (2) a per-download wall-clock budget, then wiring those controls into provisioning-time download call sites and updating ShellSpec coverage accordingly.

Changes:

  • Add a pre-command check_cse_timeout guard in _retrycmd_internal and a per-operation maxBudget in _retry_file_curl_internal (propagated via retrycmd_curl_file / retrycmd_get_tarball).
  • Update provisioning-time download call sites to pass a 300s per-operation budget and adjust retrycmd_get_tarball call signature to include an explicit timeout.
  • Reduce nvidia-smi per-try timeout from 300s to 30s and update/add ShellSpec expectations for the new retry helper behaviors.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh | Updates parameter tables and adds tests for pre-attempt global timeout and per-operation budget exit behavior. |
| spec/parts/linux/cloud-init/artifacts/cse_install_spec.sh | Updates expected retrycmd_get_tarball invocation signature to include timeout + budget. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Adds 300s per-operation budget to provisioning-time curl downloads (NVIDIA key, containerd, runc). |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Adds 300s per-operation budget to NVIDIA repo file download. |
| parts/linux/cloud-init/artifacts/cse_install.sh | Adds explicit timeout + 300s per-operation budgets to provisioning-time tarball/curl downloads (credential provider, oras, CNI, crictl, k8s). |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Implements pre-attempt global timeout guard and per-operation budget; changes retrycmd_get_tarball signature. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Lowers nvidia-smi retry timeout from 300s to 30s at 3 call sites. |

Copilot finished work on behalf of djsly April 2, 2026 20:33