test: pin EKS E2E helm-charts ref and fix EC2 Linux integration tests #711
Open
musa-asad wants to merge 2 commits into
EKS E2E helm-charts pin
-----------------------
The EKS E2E module cloned aws-observability/helm-charts at a floating "main" ref. A dependency bump renamed the operator's registered feature-gate IDs from hyphenated to non-hyphenated forms; the chart kept passing the old hyphenated IDs on the operator's --feature-gates argument until it was re-aligned in chart PR #318 (commit f6a3940). During that window the E2E Helm path launched the post-bump operator with two unregistered gate IDs, which made the controller-manager exit at startup and CrashLoop, timing out the readiness wait and failing terraform apply.

Pinning helm_charts_branch to f6a3940 removes the cross-repo desync window so the cloned chart is always a fixed, reviewed version aligned with the operator's gate IDs.

EC2 Linux integration test fixes
--------------------------------
Three EC2 Linux integration tests were failing on recent CI runs. Two are fixed here; the third is documented as an infra/AMI issue.

1. sles-15 ca_bundle_test: the two sles-15 rows in generator/resources/ec2_linux_test_matrix.json hard-coded the RHEL CA bundle path /etc/ssl/certs/ca-bundle.crt, which does not exist on SLES. Set both sles-15 rows to the SUSE path /etc/ssl/ca-bundle.pem. RHEL / Alma / Rocky rows are unchanged.

2. alma-linux-10 ssm_document_test: two harness hardening edits.
   - util/awsservice/ssm.go WaitForCommandCompletion now returns immediately on a terminal non-success status (Failed / Cancelled / TimedOut) with the real status and status details, instead of looping until the budget expires and returning the generic "commands did not complete within 1 minute". This stops a fast, deterministic command failure from being misreported as a 60s timeout.
   - test/ssm_document/ssm_document_unix.go adds a bounded read-after-write poll (new WaitForParameterAvailable helper) after each PutStringParameter and before the dependent configure SendCommand, so the on-instance agent cannot fetch the config before the write has propagated (Parameter Store reads are eventually consistent).

3. rocky-linux-9 ssm_document_test: this is an infra/AMI issue, not a test bug. The instance never registers as an Online SSM managed instance within the readiness gate. The gate timeout was intentionally NOT raised (it would not help a never-registering instance). No code fix is forced; see the PR description for details and the recommendation to keep rocky-9 non-blocking.
Summary
This PR contains two related sets of changes to stabilize CI:
- EKS E2E helm-charts pin.
- EC2 Linux integration-test fixes (sles-15 ca_bundle_test, alma-linux-10 ssm_document_test, rocky-linux-9 ssm_document_test).

1. EKS E2E helm-charts pin
The EKS E2E Terraform module clones aws-observability/helm-charts at a floating main ref and renders the operator's controller-manager Deployment from whatever that branch happens to be at run time. A recent operator dependency bump renamed the operator's registered feature-gate IDs from hyphenated to non-hyphenated forms, but the chart kept passing the old hyphenated IDs on the operator's --feature-gates argument until the chart was re-aligned in chart PR #318 (commit f6a3940).

During that cross-repo desync window, the E2E Helm path launched the post-bump operator with two unregistered gate IDs. The operator parses --feature-gates with pflag in ExitOnError mode, so an unregistered ID makes the controller-manager exit at startup → CrashLoopBackOff → the deployment never reaches condition=available. The 4-minute readiness wait then times out and terraform apply fails (surfacing only as the generic "Terraform exited with code 1"). The Addon path renders a different, released manifest that does not pass the rejected IDs, so it was unaffected, which is why only the six Helm variants went red.

Pinning the cloned chart to the known-good SHA f6a3940 (the commit that re-aligned the gate IDs) eliminates the floating-main desync window so the operator and chart are always validated as a matched pair.
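The startup failure described above can be sketched in a few lines of Go. This is an illustration only, not the operator's code: it uses the standard library rather than pflag, the gate IDs (example.gateone, example.gate-one, and so on) and the parseFeatureGates helper are hypothetical, and the comma-separated value format is an assumption. It returns the error instead of exiting so the rejection is visible; in ExitOnError mode the same condition would terminate the process.

```go
package main

import (
	"fmt"
	"strings"
)

// registeredGates stands in for the operator's gate registry after the
// dependency bump. The IDs are hypothetical and only illustrate the
// hyphenated vs non-hyphenated mismatch.
var registeredGates = map[string]bool{
	"example.gateone": true,
	"example.gatetwo": true,
}

// parseFeatureGates validates a comma-separated list of gate IDs (an assumed
// format) and returns an error for the first ID that is not registered.
func parseFeatureGates(value string) ([]string, error) {
	var enabled []string
	for _, id := range strings.Split(value, ",") {
		id = strings.TrimSpace(id)
		if id == "" {
			continue
		}
		if !registeredGates[id] {
			return nil, fmt.Errorf("no such feature gate %q", id)
		}
		enabled = append(enabled, id)
	}
	return enabled, nil
}

func main() {
	// Chart before re-alignment: still passes the old hyphenated IDs.
	if _, err := parseFeatureGates("example.gate-one,example.gate-two"); err != nil {
		fmt.Println("startup would fail:", err)
	}

	// Chart aligned with the operator: the value parses cleanly.
	gates, err := parseFeatureGates("example.gateone,example.gatetwo")
	fmt.Println(gates, err)
}
```

With a floating chart ref, which of the two calls in main reflects a given CI run depends on the state of the chart branch at clone time; pinning the ref fixes that choice.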
Change:
- terraform/eks/e2e/variables.tf: change the helm_charts_branch variable default from "main" to "f6a3940".

2. EC2 Linux integration-test fixes
Three EC2 Linux integration tests were failing on recent CI runs. Two are
test/harness bugs fixed here and verified locally by compilation and static
analysis; the third is an infra/AMI registration issue documented below.
sles-15 ca_bundle_test: test-data fix

What: In generator/resources/ec2_linux_test_matrix.json, the two sles-15 rows set caCertPath to /etc/ssl/ca-bundle.pem (was /etc/ssl/certs/ca-bundle.crt).

Why: /etc/ssl/certs/ca-bundle.crt is the RHEL-family bundle path and does not exist on SLES; SUSE ships the bundle at /etc/ssl/ca-bundle.pem. Only the two sles-15 rows are changed; the RHEL / Alma / Rocky rows that legitimately use /etc/ssl/certs/ca-bundle.crt are untouched.

alma-linux-10 ssm_document_test: race + reporting fix

What & why (two harness edits):
- util/awsservice/ssm.go WaitForCommandCompletion: return immediately on a terminal non-success status (Failed / Cancelled / TimedOut), carrying the real status and the command's status details, instead of looping until the time budget expires and returning the generic "commands did not complete within 1 minute". A fast, deterministic command failure was previously masked as a 60s timeout; this makes failures self-diagnosing. Success and budget-exhausted-timeout paths are unchanged.
- test/ssm_document/ssm_document_unix.go: add a bounded read-after-write poll (new WaitForParameterAvailable helper in util/awsservice/ssm.go) after each PutStringParameter and before the dependent configure SendCommand. SSM Parameter Store reads are eventually consistent, so without this the on-instance agent could fetch the config before the write propagated and fail with ParameterNotFound. The poll has a small bounded budget and fails closed if the parameter is still unreadable.
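The two polling behaviours above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in util/awsservice/ssm.go: the names waitForCommand, waitForParameter, statusFunc and errNotFound are hypothetical, the AWS calls are replaced by plain function values so the sketch runs without credentials, and the budgets and intervals are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Status strings as named in the PR description.
const (
	statusSuccess   = "Success"
	statusFailed    = "Failed"
	statusCancelled = "Cancelled"
	statusTimedOut  = "TimedOut"
)

// statusFunc is a hypothetical stand-in for the SSM call that reports a
// command's current status and status details.
type statusFunc func() (status, details string, err error)

// waitForCommand polls until the command succeeds, reaches a terminal
// non-success status, or the time budget is exhausted.
func waitForCommand(get statusFunc, budget, interval time.Duration) error {
	deadline := time.Now().Add(budget)
	for {
		status, details, err := get()
		if err != nil {
			return err
		}
		switch status {
		case statusSuccess:
			return nil
		case statusFailed, statusCancelled, statusTimedOut:
			// Terminal: report the real status now instead of waiting out the budget.
			return fmt.Errorf("command ended with status %s: %s", status, details)
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("command did not complete within %s", budget)
		}
		time.Sleep(interval)
	}
}

// errNotFound is a hypothetical sentinel for a not-yet-visible parameter.
var errNotFound = errors.New("ParameterNotFound")

// waitForParameter retries read until the parameter is visible and fails
// closed once the budget is exhausted. Any other error is returned at once.
func waitForParameter(read func() (string, error), budget, interval time.Duration) error {
	deadline := time.Now().Add(budget)
	for {
		_, err := read()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errNotFound) {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("parameter still unreadable after %s: %w", budget, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// A command that fails immediately is reported as failed, not as a timeout.
	failing := func() (string, string, error) { return statusFailed, "exit status 1", nil }
	fmt.Println(waitForCommand(failing, time.Minute, time.Second))

	// A parameter that becomes readable on the third read.
	reads := 0
	eventual := func() (string, error) {
		reads++
		if reads < 3 {
			return "", errNotFound
		}
		return "{}", nil
	}
	err := waitForParameter(eventual, time.Second, 10*time.Millisecond)
	fmt.Println(err, reads)
}
```

Checking the status before the deadline means a terminal failure observed on the very first poll is returned without sleeping, which is the behaviour the first bullet describes.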
rocky-linux-9 ssm_document_test: infra/AMI issue (non-blocking)

This is not a test-code bug and is not fixed in this PR. The rocky-9 instance never registers as an Online SSM managed instance within the readiness gate, so the test fails deterministically at the full budget. That is the signature of an instance that never registers, not one that registers slowly.

- The gate timeout was intentionally not raised: it would not help an instance that never registers, and would only slow down the failure.
- A bring-up step (installing/starting amazon-ssm-agent in the instance userdata) was evaluated and intentionally not applied: the userdata and the instance IAM profile are shared verbatim across all ~20 Linux distros in the matrix, and alma-linux-10 registers fine with that same shared profile. There is no rocky-9-scoped, low-risk place to add the step without touching shared bring-up infrastructure used by currently-green distros. This points at the rocky-9 AMI / instance specifically (agent not shipped/started), which cannot be fixed from this repo.
- Recommendation: keep rocky-9 non-blocking; the fix belongs in the rocky-9 AMI / instance role, outside this repo.
Test Output (local proof)
The three integration tests run on live EC2 instances against SSM/AWS in CI
and cannot be executed end-to-end from a local workspace. Local proof is
compilation + static validation; the CI re-run on this PR is the end-to-end
gate. rocky-9, being infra, can only be confirmed by CI/infra.
Note: go build ./... / go vet ./... report one pre-existing, unrelated failure in test/metric_dimension (undefined: isAllValuesGreaterThanOrEqualToZero, a symbol defined only in a _test.go file but referenced from a non-test file). It exists on the unmodified branch, is out of scope for this PR, and none of this PR's changed files touch test/metric_dimension. Every issue surfaced by build/vet across the repo is confined to that single pre-existing package; the whole repo minus it vets clean (exit 0).