JAX-vLLM Offloading k8s AWS EKS #5162
Triggered via pull request: December 5, 2025 18:57
Status: Failure
Total duration: 3h 57m 57s
Artifacts: 44
Workflow: ci.yaml (on: pull_request)
metadata (4s)
Matrix: amd64 / test-distribution
Matrix: arm64 / test-distribution
amd64 / ... / build-mpi-operator-compatible-base (2m 44s)
arm64 / ... / build-mpi-operator-compatible-base
Matrix: amd64 / test-jax-cutlass-h100 / jax-cutlass-test-h100
Matrix: amd64 / test-jax / run-unit-test
Matrix: amd64 / test-te-a100 / run-unit-test
Matrix: amd64 / test-te-h100 / te-test-h100
amd64 / build-torchax (7m 8s)
amd64 / ... / launch-slurm-runner (2h 24m)
amd64 / test-nsys-jax-eks (4m 20s)
amd64 / ... / launch-slurm-runner (1h 57m)
Matrix: amd64 / test-nsys-jax / run-unit-test
Matrix: amd64 / test-nccl / nccl-test
Matrix: amd64 / test-nccl / nccl-test-gke / nccl-gke
Matrix: arm64 / test-jax-cutlass-h100 / jax-cutlass-test-h100 (waiting for pending jobs)
Matrix: arm64 / test-jax / run-unit-test (waiting for pending jobs)
Matrix: arm64 / test-te-a100 / run-unit-test (waiting for pending jobs)
Matrix: arm64 / test-te-h100 / te-test-h100 (waiting for pending jobs)
arm64 / build-torchax (8m 9s)
arm64 / test-nsys-jax-eks
arm64 / ... / launch-slurm-runner
arm64 / ... / launch-slurm-runner
Matrix: arm64 / test-nsys-jax / run-unit-test (waiting for pending jobs)
Matrix: arm64 / test-nccl / nccl-test (waiting for pending jobs)
Matrix: arm64 / test-nccl / nccl-test-gke / nccl-gke (waiting for pending jobs)
Matrix: amd64 / test-maxtext / maxtext-multinode
Matrix: amd64 / test-maxtext / single-process-multi-device
amd64 / test-axlearn-eks (0s)
amd64 / test-axlearn-fuji-models-eks (0s)
Matrix: amd64 / test-nsys-jax-archive
Matrix: arm64 / test-maxtext / maxtext-multinode (waiting for pending jobs)
Matrix: arm64 / test-maxtext / single-process-multi-device (waiting for pending jobs)
arm64 / test-axlearn-eks (0s)
arm64 / test-axlearn-fuji-models-eks (0s)
Matrix: arm64 / test-nsys-jax-archive
Matrix: amd64 / test-rosetta-t5x / vit-multi-gpu-multi-node
Matrix: arm64 / test-rosetta-t5x / vit-multi-gpu-multi-node (waiting for pending jobs)
Matrix: publish-containers
finalize / publish-badge (6s)
Annotations
13 errors

amd64 / test-nccl / nccl-test-gke / nccl-gke (broadcast_perf_mpi)
Process completed with exit code 1.

amd64 / test-nccl / nccl-test-gke / nccl-gke (all_reduce_perf_mpi)
The strategy configuration was canceled because "amd64.test-nccl.nccl-test-gke.nccl-gke.broadcast_perf_mpi" failed

amd64 / test-nccl / nccl-test-gke / nccl-gke (all_gather_perf_mpi)
The strategy configuration was canceled because "amd64.test-nccl.nccl-test-gke.nccl-gke.broadcast_perf_mpi" failed

amd64 / test-nccl / nccl-test-gke / nccl-gke (reduce_scatter_perf_mpi)
The strategy configuration was canceled because "amd64.test-nccl.nccl-test-gke.nccl-gke.broadcast_perf_mpi" failed

arm64 / build-axlearn
buildx failed with: ERROR: failed to build: failed to solve: process "/bin/sh -c <<\"EOF\" bash -exu\n git config --global user.email \"${GIT_USER_EMAIL}\"\n git config --global user.name \"${GIT_USER_NAME}\"\n git-clone.sh \"${URLREF_AXLEARN}\" \"${SRC_PATH_AXLEARN}\"\n ${DEST_MANIFEST_DIR}/create-distribution.sh \\\n --manifest ${DEST_MANIFEST_DIR}/manifest.yaml \\\n --package axlearn\nEOF" did not complete successfully: exit code: 1

amd64 / build-axlearn
buildx failed with: ERROR: failed to build: failed to solve: process "/bin/sh -c <<\"EOF\" bash -exu\n git config --global user.email \"${GIT_USER_EMAIL}\"\n git config --global user.name \"${GIT_USER_NAME}\"\n git-clone.sh \"${URLREF_AXLEARN}\" \"${SRC_PATH_AXLEARN}\"\n ${DEST_MANIFEST_DIR}/create-distribution.sh \\\n --manifest ${DEST_MANIFEST_DIR}/manifest.yaml \\\n --package axlearn\nEOF" did not complete successfully: exit code: 1

amd64 / test-maxtext-gke / maxtext-gke-xpk
Process completed with exit code 1.

amd64 / test-te-h100 / te-test-h100 (unittest, 8)
Process completed with exit code 1.

amd64 / test-te-a100 / te-A100-unit-test
The self-hosted runner lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

amd64 / test-nsys-jax / nsys-jax-A100-unit-test
Process completed with exit code 1.

amd64 / test-maxtext / test-maxtext-outcome
Process completed with exit code 1.

amd64 / test-rosetta-t5x / test-t5x-rosetta-metrics
Process completed with exit code 1.

amd64 / test-rosetta-t5x / test-t5x-rosetta-outcome
Process completed with exit code 1.

Artifacts
Produced during runtime
| Name | Size | Digest |
|---|---|---|
| artifact-axlearn-build-amd64 | 472 Bytes | sha256:4f790289d1a82e8e129dfe0839e8691c33253bafc1e65e9ccf5a8dce505a9499 |
| artifact-axlearn-build-arm64 | 470 Bytes | sha256:e63984c4a6716cc036cbcda76d3e323bcd7dbe10e490f2fc03e652b40e8aead3 |
| artifact-base-build-amd64 | 567 Bytes | sha256:a4ef4599dea9703f6b06334a5ef04eef8b94ab49e699dc60568d05e142731cea |
| artifact-base-build-arm64 | 567 Bytes | sha256:e0b909889d915d91686b0d2274fb78cb62e736fe43febb6203e61e09c721cdf1 |
| artifact-equinox-build-amd64 | 571 Bytes | sha256:ac4c11875340da28e3b0a8b0e6435f1f4e4cb8e9c1f3607bdd2d765e1abf9ad7 |
| artifact-equinox-build-arm64 | 569 Bytes | sha256:57c8d315e70f60538cb9dbc2147f89bd07b6e0385489774d69030084fb7a1f3f |
| artifact-final-report | 3.75 KB | sha256:e99d2cf1f5fb99f6984b245ff60ed72a1fd7dc9e3d0607852e003a15536a7399 |
| artifact-jax-build-amd64 | 554 Bytes | sha256:bc5cae48b93ce09678c339d5619a5aa49a963248ec490d3da2dd246248d90376 |
| artifact-jax-build-arm64 | 553 Bytes | sha256:94474da54a75983b37d7c44de3d2651ef493483d95541d2a1b656a1d26803e7c |
| artifact-maxtext-build-amd64 | 568 Bytes | sha256:c771e849c9a642b3f75e16a8882d5d35a00324ba73fdba193a5122ca1cd80bbb |
| artifact-maxtext-build-arm64 | 568 Bytes | sha256:4ee3d2548ff559b72f2c3d560198f2ed9e2faa893a212de6ad597305848927bf |
| artifact-maxtext-test | 1.47 KB | sha256:08e693082fa8f2d374a9e88ad0c5894860de8642f370c65faa1c2c755b739b9f |
| artifact-mpi-operator-compatible-base-build-amd64 | 638 Bytes | sha256:fe0f10a84153b0ef9ee80ca8c2dc234d7fce3fe799c07aa924addd62a2e2aa38 |
| artifact-nccl-gke-build-amd64 | 572 Bytes | sha256:235f6c0ec15d52967f39e2119e0d856098c600fcf1e578d3b49a7979a33c38e2 |
| artifact-rosetta-build-t5x-amd64 | 584 Bytes | sha256:8e59adb0ce6a7591d0a2f622e18b2b7756c3fe4e01f3f271e5a4f3a3eb97aab4 |
| artifact-rosetta-build-t5x-arm64 | 585 Bytes | sha256:7ea08dc2a6bf7d4a34a95015769488068c20bf5de0b84136e98c077e6f587c27 |
| artifact-rosetta-t5x-mgmn-test | 624 Bytes | sha256:4427f084ec7ae290662a66c3451a1fa3baf9623691626803a0a3fa8f6e4003f5 |
| artifact-t5x-build-amd64 | 570 Bytes | sha256:872995ae3515d8707700b4d23cdcc4cf59f595fb155f41e02e2169faa17f07a5 |
| artifact-t5x-build-arm64 | 568 Bytes | sha256:ea6391646ec3bd7d24a9f9743c5980a11b01d7afe9a6e8ed0148c65187eb10fe |
| artifact-torchax-build-amd64 | 568 Bytes | sha256:b61e3b26a77be2801e95aa35f3e441c366608760e2d8aa913f4f3e90308aaa8e |
| artifact-torchax-build-arm64 | 568 Bytes | sha256:81727466f1dcbe130157a95a3ec32cd8cb16f36c9606a05f04b904345a6411eb |
| artifact-workflow-metadata | 277 Bytes | sha256:d0ff8d35d17b3962f3c51974d1129ad49fb0c827cd13593d9f65188d9c544975 |
| bumped-manifest | 51.6 KB | sha256:962b628cff857a2d1c1589efb8f2e283273f2babe26e11cec1af20ffc0991808 |
| final-base | 254 Bytes | sha256:21982c560a30ad5ffc485cd111047fd46dcdffdc35d82937af03ab79bb13ee75 |
| final-equinox | 263 Bytes | sha256:12e9dcea617e78f5644fbcff0aec5931ef1095cac30b26443153e99ed6b514cc |
| final-jax | 251 Bytes | sha256:7262c3848ac9cc2102b0899103b24d0c9700dae75720518b7d35694626bb3f81 |
| final-maxtext | 262 Bytes | sha256:8e09d1e50d8d180029ba4e32c5253301435d1d961afd854f11009fd412c074f5 |
| final-t5x | 251 Bytes | sha256:60deffa8d6ee405c0651f044f74d29d68b107b7c89981e7c27a6f8e18610e793 |
| final-upstream-t5x | 276 Bytes | sha256:9767d533db9ba071ca0f67161aa41c59f81a9d4d916d30a4a625db8573bf12c4 |
| gke-maxtext-train-sitrep | 228 Bytes | sha256:1976bb9a78d41604154e5899fa91b656301fb60e1f6bb9dc89c114bca7f5e153 |
| jax-cutlass-test-H100 | 8.59 KB | sha256:6f7421fa35b8449f8a6503b595963236027ffa9e8f3083d5e54e706966ab9135 |
| jax-unit-test-A100 | 22.4 KB | sha256:cb7e4c139a5541c1d80807c1260cfbc8cada7b3dcb2458208f5b0e8089d76ea0 |
| mealkit-equinox | 271 Bytes | sha256:be992892830ccc240381955343eed3bfde1fceb24e530c6f8b6d1711c690cbdc |
| mealkit-jax | 261 Bytes | sha256:e62cb7a9a1d551bc81a11c47f5e80601c7d96c52b4cd8d07f243c4df97d81106 |
| mealkit-maxtext | 271 Bytes | sha256:8c240028dc077a53188b0107eb1c7533f0a17462ebd0d31ffb4f08275dff87fb |
| mealkit-t5x | 261 Bytes | sha256:9df6379d9c33626e70009c4cc31bd017859aeb21b44e6a38af989899cce67f00 |
| mealkit-upstream-t5x | 285 Bytes | sha256:2db1ec95fe179219946775b0f874773c74d16dd9ac259b1b5cf4746718fb0cfc |
| nccl-gke-broadcast-sitrep | 229 Bytes | sha256:260b29db026f18d991cf4a2444ae29bb82e40c81f96b1b34924745caafd49932 |
| nsys-jax-unit-test-A100 | 139 MB | sha256:5c92134e94385cb759329d86d9f8349bcb55681aeff40ea4d15aaf1db108cfc3 |
| rosetta-t5x-vit-19973001931-VIT8G1N | 15.6 KB | sha256:c788d6958a0edf4d5f8122bbd1b960f51fdb0d79d563855808cf5481a6ac9860 |
| te-unit-test-H100 | 4.49 MB | sha256:c79a4ad027aeda48894a0731cf7d1b9a3498074a88caf1133cf782bf6accf568 |
| upstream-maxtext-19973001931-1DP2FSDP4TP1PP_single_process | 28.7 KB | sha256:e9c10a56de0ba657332936283693c46f932118f2c28245070808e85af22cd1bc |
| upstream-maxtext-19973001931-2DP2FSDP2TP1PP | 63.2 KB | sha256:b80731027c600dfb203359530ca0bf64959e916ae408edc90febd0db7a8f48b1 |
| upstream-maxtext-metrics-test-log | 2.52 KB | sha256:9d9c7043365a3b5ba19f7a42239b59d5d5274ae88d28991f7f36880afe3ba57e |
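
The Digest column lists a `sha256:` value for each artifact. As a minimal sketch of how a downloaded artifact could be checked against that column, the Python below hashes a local file and compares the result with a listed digest. It assumes the digest is computed over the downloaded archive file as a whole; the page does not confirm this. The helper names `sha256_digest` and `matches_listed_digest` are hypothetical and not part of any GitHub tooling.

```python
import hashlib


def sha256_digest(path, chunk_size=1 << 20):
    """Return the file's SHA-256 in the 'sha256:<hex>' form used by the table."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large artifacts (e.g. 139 MB) are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return "sha256:" + hasher.hexdigest()


def matches_listed_digest(path, listed_digest):
    """Compare a local file against a digest string copied from the table.

    Assumption: the listed digest covers the downloaded archive as-is.
    """
    return sha256_digest(path) == listed_digest.strip().lower()
```

A call might look like `matches_listed_digest("artifact-final-report.zip", "sha256:e99d2cf1...")` with the full digest pasted from the table; the local file name here is illustrative only.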