test: pin EKS E2E helm-charts ref and fix EC2 Linux integration tests#711

Open
musa-asad wants to merge 2 commits into main from fix/e2e-helm-charts-pin

@musa-asad musa-asad commented Jun 17, 2026

Contributor

Summary

This PR contains two related sets of changes to stabilize CI:

  1. EKS E2E helm-charts pin (original change on this branch).
  2. EC2 Linux integration-test fixes for three failing tests (sles-15
    ca_bundle_test, alma-linux-10 ssm_document_test, rocky-linux-9
    ssm_document_test).

1. EKS E2E helm-charts pin

The EKS E2E Terraform module clones aws-observability/helm-charts at a
floating main ref and renders the operator's controller-manager Deployment
from whatever that branch happens to be at run time. A recent operator
dependency bump renamed the operator's registered feature-gate IDs from
hyphenated to non-hyphenated forms, but the chart kept passing the old
hyphenated IDs on the operator's --feature-gates argument until the chart
was re-aligned in chart PR #318 (commit f6a3940).

During that cross-repo desync window, the E2E Helm path launched the
post-bump operator with two unregistered gate IDs. The operator parses
--feature-gates with pflag in ExitOnError mode, so an unregistered ID makes
the controller-manager exit at startup → CrashLoopBackOff → the deployment
never reaches condition=available. The 4-minute readiness wait then times
out and terraform apply fails, surfacing only as the generic
"Terraform exited with code 1". The Addon path renders a different,
released manifest that does not pass the rejected IDs, so it was unaffected,
which is why only the six Helm variants went red.
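
The failure mode above can be reproduced in miniature. The following is a minimal sketch, not the operator's code: it uses the standard library flag package in ContinueOnError mode as a stand-in for pflag in ExitOnError mode, so the rejection is visible as a returned error instead of a process exit. The gate IDs "examplegate" and "example-gate" and the parseGates helper are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strings"
)

// gateSet is a flag.Value that accepts only registered feature-gate IDs.
type gateSet struct {
	registered map[string]bool // IDs the binary knows about
	enabled    map[string]bool // state parsed from the command line
}

func (g *gateSet) String() string { return fmt.Sprint(g.enabled) }

// Set parses "id=bool,id=bool" and rejects any ID that is not registered.
func (g *gateSet) Set(v string) error {
	for _, pair := range strings.Split(v, ",") {
		name, val, ok := strings.Cut(pair, "=")
		if !ok {
			return fmt.Errorf("malformed feature gate %q", pair)
		}
		if !g.registered[name] {
			return fmt.Errorf("unrecognized feature gate: %s", name)
		}
		g.enabled[name] = val == "true"
	}
	return nil
}

// parseGates is a hypothetical helper: it registers the given IDs and parses
// args the way a controller binary would at startup.
func parseGates(registered, args []string) (map[string]bool, error) {
	g := &gateSet{registered: map[string]bool{}, enabled: map[string]bool{}}
	for _, id := range registered {
		g.registered[id] = true
	}
	// ContinueOnError returns the error; ExitOnError would exit the process.
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	fs.Var(g, "feature-gates", "comma-separated id=bool pairs")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return g.enabled, nil
}

func main() {
	registered := []string{"examplegate"} // hypothetical post-bump ID

	// A stale chart still passes the hyphenated ID and is rejected.
	_, err := parseGates(registered, []string{"--feature-gates=example-gate=true"})
	fmt.Println("stale chart:", err)

	// An aligned chart passes the registered ID and parses cleanly.
	gates, err := parseGates(registered, []string{"--feature-gates=examplegate=true"})
	fmt.Println("aligned chart:", gates, err)
}
```

With ExitOnError, the first call would terminate the process instead of returning, which is what turns a stale chart argument into a restart loop.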

Pinning the cloned chart to the known-good SHA f6a3940 (the commit that
re-aligned the gate IDs) eliminates the floating-main desync window so the
operator and chart are always validated as a matched pair.

Change: terraform/eks/e2e/variables.tf — change the helm_charts_branch
variable default from "main" to "f6a3940".


2. EC2 Linux integration-test fixes

Three EC2 Linux integration tests were failing on recent CI runs. Two are
test/harness bugs fixed here and verified locally by compilation and static
analysis; the third is an infra/AMI registration issue documented below.

sles-15 ca_bundle_test — test-data fix

What: In generator/resources/ec2_linux_test_matrix.json, the two sles-15
rows now set caCertPath to /etc/ssl/ca-bundle.pem (previously
/etc/ssl/certs/ca-bundle.crt).

Why: /etc/ssl/certs/ca-bundle.crt is the RHEL-family bundle path and does
not exist on SLES; SUSE ships the bundle at /etc/ssl/ca-bundle.pem. Only the
two sles-15 rows are changed — the RHEL / Alma / Rocky rows that legitimately
use /etc/ssl/certs/ca-bundle.crt are untouched.
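
A check of this kind could guard the matrix against the same mistake recurring. This is a rough sketch under assumptions: caCertPath is the key named above, but the "os" key, the row type, and the misconfigured helper are hypothetical and may not match the real matrix schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// row models only the fields this sketch needs. "caCertPath" is named in the
// PR description; the "os" key is an assumption about the matrix schema.
type row struct {
	OS         string `json:"os"`
	CACertPath string `json:"caCertPath"`
}

// suseBundle is the CA bundle path the PR sets for the sles-15 rows.
const suseBundle = "/etc/ssl/ca-bundle.pem"

// misconfigured returns every row whose OS starts with "sles" but whose
// caCertPath is not the SUSE bundle path. Other rows are never flagged.
func misconfigured(matrix []byte) ([]row, error) {
	var rows []row
	if err := json.Unmarshal(matrix, &rows); err != nil {
		return nil, err
	}
	var bad []row
	for _, r := range rows {
		if strings.HasPrefix(r.OS, "sles") && r.CACertPath != suseBundle {
			bad = append(bad, r)
		}
	}
	return bad, nil
}

func main() {
	// Illustrative data only, not copied from the real matrix file.
	sample := []byte(`[
		{"os": "sles-15", "caCertPath": "/etc/ssl/certs/ca-bundle.crt"},
		{"os": "rocky-linux-9", "caCertPath": "/etc/ssl/certs/ca-bundle.crt"}
	]`)
	bad, err := misconfigured(sample)
	fmt.Println(len(bad), err)
}
```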

alma-linux-10 ssm_document_test — race + reporting fix

What & why (two harness edits):

  • util/awsservice/ssm.go WaitForCommandCompletion: return immediately on a
    terminal non-success status (Failed / Cancelled / TimedOut), carrying
    the real status and the command's status details, instead of looping until
    the time budget expires and returning the generic
    "commands did not complete within 1 minute". A fast, deterministic command
    failure was previously masked as a 60s timeout; this makes failures
    self-diagnosing. Success and budget-exhausted-timeout paths are unchanged.
  • test/ssm_document/ssm_document_unix.go: add a bounded read-after-write poll
    (new WaitForParameterAvailable helper in util/awsservice/ssm.go) after
    each PutStringParameter and before the dependent configure SendCommand.
    SSM Parameter Store reads are eventually consistent, so without this the
    on-instance agent could fetch the config before the write propagated and fail
    with ParameterNotFound. The poll has a small bounded budget and fails closed
    if the parameter is still unreadable.

rocky-linux-9 ssm_document_test — infra/AMI issue (non-blocking)

This is not a test-code bug and is not fixed in this PR. The rocky-9
instance never registers as an Online SSM managed instance within the readiness
gate, so the test fails deterministically at the full budget — the signature of
an instance that never registers, not one that registers slowly.

  • The readiness-gate timeout was intentionally NOT raised. Raising it would
    not help an instance that never registers, and would only slow down the
    failure.
  • A best-effort harness mitigation (installing/enabling amazon-ssm-agent in
    the instance userdata) was evaluated and intentionally not applied: the
    userdata and the instance IAM profile are shared verbatim across all ~20 Linux
    distros in the matrix, and alma-linux-10 registers fine with that same shared
    profile. There is no rocky-9-scoped, low-risk place to add the step without
    touching shared bring-up infrastructure used by currently-green distros. This
    points at the rocky-9 AMI / instance specifically (agent not shipped/started),
    which cannot be fixed from this repo.
  • Recommendation: keep rocky-9 non-blocking for this PR. The true fix
    belongs in the rocky-9 AMI / instance role, outside this repo.

Test Output (local proof)

The three integration tests run on live EC2 instances against SSM/AWS in CI
and cannot be executed end-to-end from a local workspace. Local proof is
compilation + static validation; the CI re-run on this PR is the end-to-end
gate. rocky-9, being infra, can only be confirmed by CI/infra.

# go build (this PR's changed packages)
$ go build ./util/awsservice/... ./test/ssm_document/...
exit 0

# go vet (this PR's changed packages)
$ go vet ./util/awsservice/... ./test/ssm_document/...
exit 0

# go vet (entire repo except one pre-existing-broken package)
$ go vet $(go list ./... | grep -v '/test/metric_dimension')
exit 0

# gofmt (both changed Go files) — no output == clean
$ gofmt -l util/awsservice/ssm.go test/ssm_document/ssm_document_unix.go
(no output)

# matrix JSON still parses
$ python3 -c "import json;json.load(open('generator/resources/ec2_linux_test_matrix.json'));print('JSON_OK')"
JSON_OK

Note: go build ./... and go vet ./... report one pre-existing, unrelated
failure in test/metric_dimension: undefined: isAllValuesGreaterThanOrEqualToZero,
a symbol defined only in a _test.go file but referenced from a non-test file.
It exists on the unmodified branch, is out of scope for this PR, and none of
this PR's changed files touch test/metric_dimension. Every issue surfaced by
build/vet across the repo is confined to that single pre-existing package; the
rest of the repo vets clean (exit 0).

@musa-asad musa-asad requested a review from a team as a code owner June 17, 2026 16:28
EKS E2E helm-charts pin
-----------------------
The EKS E2E module cloned aws-observability/helm-charts at a floating
"main" ref. A dependency bump renamed the operator's registered
feature-gate IDs from hyphenated to non-hyphenated forms; the chart kept
passing the old hyphenated IDs on the operator's --feature-gates argument
until it was re-aligned in chart PR #318 (commit f6a3940). During that
window the E2E Helm path launched the post-bump operator with two
unregistered gate IDs, which made the controller-manager exit at startup
and CrashLoop, timing out the readiness wait and failing terraform apply.
Pinning helm_charts_branch to f6a3940 removes the cross-repo desync window
so the cloned chart is always a fixed, reviewed version aligned with the
operator's gate IDs.

EC2 Linux integration test fixes
--------------------------------
Three EC2 Linux integration tests were failing on recent CI runs. Two are
fixed here; the third is documented as an infra/AMI issue.

1. sles-15 ca_bundle_test: the two sles-15 rows in
   generator/resources/ec2_linux_test_matrix.json hard-coded the RHEL CA
   bundle path /etc/ssl/certs/ca-bundle.crt, which does not exist on SLES.
   Set both sles-15 rows to the SUSE path /etc/ssl/ca-bundle.pem. RHEL /
   Alma / Rocky rows are unchanged.

2. alma-linux-10 ssm_document_test: two harness hardening edits.
   - util/awsservice/ssm.go WaitForCommandCompletion now returns
     immediately on a terminal non-success status (Failed / Cancelled /
     TimedOut) with the real status and status details, instead of looping
     until the budget expires and returning the generic "commands did not
     complete within 1 minute". This stops a fast, deterministic command
     failure from being misreported as a 60s timeout.
   - test/ssm_document/ssm_document_unix.go adds a bounded read-after-write
     poll (new WaitForParameterAvailable helper) after each
     PutStringParameter and before the dependent configure SendCommand, so
     the on-instance agent cannot fetch the config before the write has
     propagated (Parameter Store reads are eventually consistent).

3. rocky-linux-9 ssm_document_test: this is an infra/AMI issue, not a test
   bug. The instance never registers as an Online SSM managed instance
   within the readiness gate. The gate timeout was intentionally NOT raised
   (it would not help a never-registering instance). No code fix is forced;
   see the PR description for details and the recommendation to keep
   rocky-9 non-blocking.
@musa-asad musa-asad force-pushed the fix/e2e-helm-charts-pin branch from 845ad0a to af8785f Compare June 19, 2026 05:03
@musa-asad musa-asad changed the title test: pin EKS E2E helm-charts ref to known-good SHA test: pin EKS E2E helm-charts ref and fix EC2 Linux integration tests Jun 19, 2026