fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard #8230
Open
…and guard

Remove hardcoded 60s timeout from retrycmd_get_tarball; caller now passes timeout as 3rd positional arg (matches retrycmd_curl_file style).

Add optional max_budget_s (6th arg, default 0) to retrycmd_get_tarball and retrycmd_curl_file. When > 0, _retry_file_curl_internal tracks wall-clock time and exits early (return 2) once the budget is consumed, bounding any single download to 5 min at provisioning time regardless of retry count.

Add pre-command check_cse_timeout guard at the TOP of _retrycmd_internal loop. Previously the guard only fired after a failed command, meaning a new 300-600s operation could START at minute 12:59 and run to ~18 min. The guard now prevents starting any new attempt when >780s have elapsed since CSE_STARTTIME_SECONDS.

Pass 300s budget to all 11 provisioning-time download call sites:
- cse_install.sh (credential-provider, oras, secure-TLS, CNI x2, crictl, k8s)
- ubuntu/cse_install_ubuntu.sh (nvidia GPG key, containerd, runc)
- mariner/cse_install_mariner.sh (nvidia repo file)

Reduce nvidia-smi per-try timeout from 300s to 30s in cse_config.sh (3 sites). nvidia-smi is a CLI status check, not a long-running driver operation; the 300s value was a copy-paste from the GPU driver install command.

Update spec tests: fix parameter tables, add budget-exceeded test and pre-command CSE timeout test in cse_retry_helpers_spec.sh; update expected call signature in cse_install_spec.sh.
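The per-operation budget described in this commit message can be pictured with a small sketch. This is not the code from cse_helpers.sh: `retry_with_budget` and its argument order are hypothetical, and checking the budget before each attempt is an assumption about placement. Only the "return 2 once the budget is consumed" and "0 disables the budget" behaviors come from the description.

```shell
# Hypothetical helper (not the AgentBaker implementation): retry a command
# until it succeeds, the retries run out, or a wall-clock budget is consumed.
# Usage: retry_with_budget <retries> <wait_sleep_s> <max_budget_s> <cmd> [args...]
retry_with_budget() {
    local retries=$1 wait_sleep=$2 max_budget_s=$3
    shift 3
    local start elapsed i=1
    start=$(date +%s)
    while [ "$i" -le "$retries" ]; do
        # A budget of 0 means "no budget", mirroring the default described above.
        if [ "$max_budget_s" -gt 0 ]; then
            elapsed=$(( $(date +%s) - start ))
            if [ "$elapsed" -ge "$max_budget_s" ]; then
                echo "budget of ${max_budget_s}s consumed after ${elapsed}s" >&2
                return 2
            fi
        fi
        "$@" && return 0
        i=$((i + 1))
        sleep "$wait_sleep"
    done
    return 1   # retries exhausted without success
}
```

With a budget in place, the worst-case duration of one call is roughly the budget plus one per-attempt timeout, rather than retries multiplied by the per-attempt timeout.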
Pull request overview
This PR aims to prevent Linux CSE runs from exceeding the VM provisioner’s ~16-minute client window by adding (1) a pre-attempt global timeout guard and (2) a per-download wall-clock budget, then wiring those controls into provisioning-time download call sites and updating ShellSpec coverage accordingly.
Changes:
- Add a pre-command `check_cse_timeout` guard in `_retrycmd_internal` and a per-operation `maxBudget` in `_retry_file_curl_internal` (propagated via `retrycmd_curl_file` / `retrycmd_get_tarball`).
- Update provisioning-time download call sites to pass a 300s per-operation budget and adjust the `retrycmd_get_tarball` call signature to include an explicit timeout.
- Reduce `nvidia-smi` per-try timeout from 300s to 30s and update/add ShellSpec expectations for the new retry helper behaviors.
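A minimal sketch of the pre-attempt guard idea follows. The function names `cse_time_remaining_ok` and `retry_with_global_guard` are hypothetical stand-ins for `check_cse_timeout` and `_retrycmd_internal`. The 780s threshold and `CSE_STARTTIME_SECONDS` come from the PR description, while the return code 2 on guard failure is an assumption.

```shell
# Illustrative only; the real check_cse_timeout in cse_helpers.sh may differ.
CSE_STARTTIME_SECONDS=${CSE_STARTTIME_SECONDS:-$(date +%s)}

# Succeeds while no more than max_elapsed_s (default 780) have passed
# since CSE_STARTTIME_SECONDS.
cse_time_remaining_ok() {
    local max_elapsed_s=${1:-780}
    local elapsed=$(( $(date +%s) - CSE_STARTTIME_SECONDS ))
    [ "$elapsed" -le "$max_elapsed_s" ]
}

# Usage: retry_with_global_guard <retries> <wait_sleep_s> <cmd> [args...]
retry_with_global_guard() {
    local retries=$1 wait_sleep=$2
    shift 2
    local i=1
    while [ "$i" -le "$retries" ]; do
        # Guard at the TOP of the loop: never start a new attempt late in the run.
        if ! cse_time_remaining_ok; then
            echo "global CSE timeout reached, not starting attempt $i" >&2
            return 2   # assumed return code for this sketch
        fi
        "$@" && return 0
        i=$((i + 1))
        sleep "$wait_sleep"
    done
    return 1
}
```

Because the check runs before the command rather than after a failure, an attempt that would begin past the threshold never starts at all.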
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh | Updates parameter tables and adds tests for pre-attempt global timeout and per-operation budget exit behavior. |
| spec/parts/linux/cloud-init/artifacts/cse_install_spec.sh | Updates expected retrycmd_get_tarball invocation signature to include timeout + budget. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Adds 300s per-operation budget to provisioning-time curl downloads (NVIDIA key, containerd, runc). |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Adds 300s per-operation budget to NVIDIA repo file download. |
| parts/linux/cloud-init/artifacts/cse_install.sh | Adds explicit timeout + 300s per-operation budgets to provisioning-time tarball/curl downloads (credential provider, oras, CNI, crictl, k8s). |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Implements pre-attempt global timeout guard and per-operation budget; changes retrycmd_get_tarball signature. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Lowers nvidia-smi retry timeout from 300s to 30s at 3 call sites. |
…allers Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/8f356829-1593-46b5-94dc-6fa4a85aa931 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>
Devinwong reviewed Apr 2, 2026
pdamianov-dev approved these changes Apr 6, 2026
retrycmd_get_tarball new 5-arg signature breaks VHD build callers using old 4-arg signature
retrycmd_get_tarball: detect old 4-arg signature (3rd arg is a file path, not numeric) and default timeout to 60s
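One way to picture the backward-compatible detection is the sketch below. It assumes, without confirmation from the PR, that the old call order is (retries, wait_sleep, tarball, url) and the new one is (retries, wait_sleep, timeout, tarball, url). `parse_tarball_args` is a hypothetical helper that only prints what it resolved.

```shell
# Hypothetical sketch; argument names and order are assumptions.
parse_tarball_args() {
    local retries=$1 wait_sleep=$2
    local timeout tarball url
    case "$3" in
        ''|*[!0-9]*)
            # $3 is empty or contains a non-digit, so treat it as a file path
            # (old call style) and fall back to the former 60s timeout.
            timeout=60; tarball=$3; url=$4 ;;
        *)
            # $3 is a plain non-negative integer: new call style.
            timeout=$3; tarball=$4; url=$5 ;;
    esac
    echo "retries=$retries wait_sleep=$wait_sleep timeout=$timeout tarball=$tarball url=$url"
}
```

A purely numeric file name in the third position would be misread as a timeout, which is a known limitation of this kind of heuristic.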