
#48 - Add optional AWS availability zone selection #49

Open
spacepirate0001 wants to merge 1 commit into isaac-sim:main from spacepirate0001:fix/aws-availability-zone-selection

@spacepirate0001 spacepirate0001 commented Jun 3, 2026

What

Adds an optional availability zone selection for AWS deployments so users can route around per-zone GPU capacity shortages.

Closes #48.

Why

The AWS subnet, and therefore the EC2 instance, was always placed in the first sorted availability zone that offers the requested instance type (e.g. us-west-2a):

availability_zone = try(sort(data.aws_ec2_instance_type_offerings.zones.locations)[0], "not-available")

AZ offerings indicate that a zone supports an instance type, not that it currently has capacity. As a result, deployments deterministically landed in the same zone and failed with InsufficientInstanceCapacity whenever that zone was exhausted, even when other zones in the region had capacity. Re-running did not help because the selection is deterministic.
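
To make the determinism concrete, here is a minimal Python sketch that mimics the Terraform expression above. The function name `first_offered_zone` and the sample zone list are hypothetical and are not part of this PR; the sketch only assumes that `sort` orders zone names lexicographically and that `try` falls back when the list is empty.

```python
def first_offered_zone(locations):
    """Mimic try(sort(locations)[0], "not-available") from the Terraform expression."""
    try:
        return sorted(locations)[0]
    except IndexError:
        # An empty offerings list has no element 0, so the fallback applies.
        return "not-available"


# Hypothetical offerings for one instance type; the input order does not matter.
offered = ["us-west-2c", "us-west-2a", "us-west-2b"]

# Every run picks the same zone, so a retry lands in the same exhausted zone.
picks = {first_offered_zone(offered) for _ in range(3)}
print(picks)
```

Because the result depends only on the set of offered zones, no number of retries reaches a different zone unless the offerings themselves change.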

How

  • New optional availability_zone Terraform variable on both the root aws module and the isaac-workstation module (default "").
  • The subnet now uses the requested AZ when provided, otherwise falls back to the existing auto-select (first offered zone), so default behavior is unchanged.
  • New --availability-zone / --az option on deploy-aws. It accepts a full zone name (us-west-2b) or a bare suffix letter (b, combined with the chosen region), validates that the zone belongs to the selected region, and writes the value into the tfvars passed to Terraform.
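
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names `normalize_availability_zone` and `choose_zone` are hypothetical, and the sketch assumes standard zone names of the form region plus one lowercase letter (it does not handle Local Zones or Wavelength Zones).

```python
import re


def normalize_availability_zone(region: str, value: str) -> str:
    """Return a full zone name, or "" when no zone was requested (hypothetical helper)."""
    value = value.strip().lower()
    if not value:
        return ""  # empty means auto-select, matching the variable default of ""
    if re.fullmatch(r"[a-z]", value):
        return f"{region}{value}"  # bare suffix letter: combine with the chosen region
    if re.fullmatch(re.escape(region) + r"[a-z]", value):
        return value  # full zone name that belongs to the selected region
    raise ValueError(f"availability zone {value!r} is not in region {region!r}")


def choose_zone(requested: str, offered: list[str]) -> str:
    """Use the requested zone when given, else the first sorted offered zone."""
    if requested:
        return requested
    return sorted(offered)[0] if offered else "not-available"


offered = ["us-west-2a", "us-west-2b", "us-west-2c"]  # hypothetical offerings
print(choose_zone(normalize_availability_zone("us-west-2", "b"), offered))
print(choose_zone(normalize_availability_zone("us-west-2", ""), offered))
```

With these assumptions, `b` and `us-west-2b` resolve to the same zone, an empty value keeps the existing first-offered-zone behavior, and a zone from another region such as `us-east-1a` is rejected before anything is written to the tfvars.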

Backward compatibility

Leaving the prompt empty (the default) reproduces the previous behavior exactly. Existing .tfvars files without the variable are fine because the variable defaults to "".

Testing

  • python3 -c "import ast; ast.parse(open('deploy-aws').read())" passes.
  • Verified the existing tests do not assert on CLI param counts or tfvars keys, so they are unaffected.
  • Note: terraform fmt/validate should be run in the project container.

Files changed

  • deploy-aws
  • src/terraform/aws/main.tf
  • src/terraform/aws/variables.tf
  • src/terraform/aws/isaac-workstation/main.tf
  • src/terraform/aws/isaac-workstation/variables.tf

The AWS subnet, and therefore the EC2 instance, was always placed in the
first sorted availability zone that offers the requested instance type
(e.g. us-west-2a). AZ offerings indicate that a zone supports an instance
type, not that it has live capacity, so deployments deterministically
landed in the same zone and failed with InsufficientInstanceCapacity
whenever that zone was exhausted, even when other zones in the region had
capacity.

Add an optional availability_zone variable to the root and isaac-workstation
Terraform modules and a corresponding --availability-zone/--az option on
deploy-aws. The value is passed through to the subnet; when left empty the
behavior is unchanged and the first offered zone is auto-selected. The CLI
accepts a full zone name (us-west-2b) or a bare suffix letter (b) and
validates it against the selected region.

Signed-off-by: Haytham Amin <haythamelmogazy@gmail.com>
@spacepirate0001 spacepirate0001 force-pushed the fix/aws-availability-zone-selection branch from c286335 to 55c76bc Compare June 3, 2026 22:53

Development

Successfully merging this pull request may close these issues.

AWS: deployment always uses the first ('-a') availability zone, causing avoidable InsufficientInstanceCapacity failures
