
Commit f12f70d

tgross and aimeeu authored
Nomad: recommendations for singleton deployments (#1473)
Many users have a requirement to run exactly one instance of a given allocation because it requires exclusive access to some cluster-wide resource, which we'll refer to here as a "singleton allocation". This is challenging to implement, so this document is intended to describe an accepted design to publish as a how-to/tutorial. Co-authored-by: Aimee Ukasick <[email protected]>
1 parent c5b9672 commit f12f70d

File tree: 2 files changed, +305 -0 lines changed
Lines changed: 301 additions & 0 deletions
@@ -0,0 +1,301 @@
---
layout: docs
page_title: Configure singleton deployments
description: |-
  Declare a job that guarantees only a single instance can run at a time, with
  minimal downtime.
---

# Configure singleton deployments

A singleton deployment is one where there is at most one instance of a given
allocation running on the cluster at one time. You might need this if the
workload needs exclusive access to a remote resource like a data store. Nomad
does not support singleton deployments as a built-in feature. Your workloads
continue to run even when the Nomad client agent has crashed, so ensuring
there's at most one allocation for a given workload requires some cooperation
from the job. This document describes how to implement singleton deployments.

## Design Goals

The configuration described here meets these primary design goals:

- The design prevents a specific process within a task from running if there
  is another instance of that task running anywhere else on the Nomad cluster.
- Nomad should be able to recover from failure of the task or the node on which
  the task is running with minimal downtime, where "recovery" means that Nomad
  stops the original task and schedules a replacement task.
- Nomad should minimize false positive detection of failures to avoid
  unnecessary downtime during the cutover.

There's a tradeoff between recovery speed and false positives. The faster you
make Nomad attempt to recover from failure, the more likely it is that a
transient failure causes Nomad to schedule a replacement and a subsequent
downtime.

Note that it's not possible to design a perfectly zero-downtime singleton
allocation in a distributed system. This design errs on the side of
correctness: having zero or one allocation running rather than incorrectly
having two allocations running.

## Overview

There are several options available for some details of the implementation, but
all of them include the following:

- You must have a distributed lock with a TTL that's refreshed from the
  allocation. The process that sets and refreshes the lock must have its
  lifecycle tied to the main task. It can be in-process, in-task with
  supervision, or run as a sidecar. If the allocation cannot obtain the lock,
  then it must not start whatever process or operation you intend to be a
  singleton. After a configurable window without obtaining the lock, the
  allocation must fail.
- You must set the [`group.disconnect.stop_on_client_after`][] field. This
  forces a Nomad client that's disconnected from the server to stop the
  singleton allocation, which in turn releases the lock or allows its TTL to
  expire.

Tune the lock TTL, the time it takes the allocation to give up, and the
`stop_on_client_after` duration to reduce the maximum amount of downtime the
application can have.

The Nomad [Locks API][] can support the operations needed. In pseudo-code these
operations are the following:

- To acquire the lock, `PUT /v1/var/:path?lock-acquire`
  - On success: start heartbeat every 1/2 TTL.
  - On conflict or failure: retry with backoff and timeout.
    - Once out of attempts, exit the process with an error code.
- To heartbeat, `PUT /v1/var/:path?lock-renew`
  - On success: continue.
  - On conflict: exit the process with an error code.
  - On failure: retry with backoff up to TTL.
    - If TTL expires, attempt to revoke the lock, then exit the process with an
      error code.
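
The following Go program is a minimal sketch of this control flow, not the
implementation Nomad ships. The `acquireLock`, `renewLock`, and `releaseLock`
helpers are hypothetical placeholders for the Locks API calls, and the TTL,
backoff, and attempt counts are only illustrative.

```go
// Sketch of the lock loop described above. The acquire/renew/release helpers
// are placeholders for the Locks API calls and must be wired up to the HTTP
// endpoints (directly or through the Task API socket).
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"
)

const lockTTL = 15 * time.Second // must match the TTL sent on acquire

var errConflict = errors.New("lock held by another allocation")

// Hypothetical helpers standing in for PUT /v1/var/:path?lock-acquire,
// ?lock-renew, and the release/revoke call.
func acquireLock() error { return nil }
func renewLock() error   { return nil }
func releaseLock() error { return nil }

func main() {
	// Acquire with backoff and a bounded number of attempts so the allocation
	// fails instead of waiting forever.
	acquired := false
	for attempt, backoff := 0, time.Second; attempt < 10; attempt++ {
		if err := acquireLock(); err == nil {
			acquired = true
			break
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	if !acquired {
		fmt.Fprintln(os.Stderr, "could not acquire lock")
		os.Exit(1)
	}

	// Start the singleton process only after the lock is held.
	app := exec.Command("busybox", "httpd", "-vv", "-f", "-p", "8001")
	app.Stdout, app.Stderr = os.Stdout, os.Stderr
	if err := app.Start(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Heartbeat every 1/2 TTL. Exit on conflict; on other errors keep retrying
	// until the TTL would have expired, then release the lock and exit.
	deadline := time.Now().Add(lockTTL)
	for range time.Tick(lockTTL / 2) {
		err := renewLock()
		switch {
		case err == nil:
			deadline = time.Now().Add(lockTTL)
		case errors.Is(err, errConflict):
			app.Process.Kill()
			os.Exit(1)
		case time.Now().After(deadline):
			releaseLock()
			app.Process.Kill()
			os.Exit(1)
		}
	}
}
```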

The allocation can safely use the Nomad [Task API][] socket to write to the
locks API, rather than communicating with the server directly. This reduces load
on the server and speeds up detection of failed client nodes because the
disconnected client cannot forward the Task API requests to the leader.
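
As a sketch of what that looks like from inside the task, the following Go
snippet points an HTTP client at the Task API unix socket and issues the
lock-acquire request. The socket path (`${NOMAD_SECRETS_DIR}/api.sock`), the
`X-Nomad-Token` header populated from `NOMAD_TOKEN`, and the request body shape
are assumptions to verify against the Task API and Locks API documentation.

```go
// Sketch: calling the Locks API through the Task API unix socket from inside
// the allocation. The request body is a placeholder; consult the Locks API
// documentation for the exact variable and lock fields.
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	sock := filepath.Join(os.Getenv("NOMAD_SECRETS_DIR"), "api.sock")

	// Every request is dialed over the unix socket, so the host in the URL is
	// only a placeholder.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", sock)
			},
		},
	}

	// Placeholder body: a variable spec that carries the lock TTL.
	body := strings.NewReader(`{"Lock": {"TTL": "15s"}}`)

	req, err := http.NewRequest(http.MethodPut,
		"http://localhost/v1/var/nomad/jobs/example/lock?lock-acquire", body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	req.Header.Set("X-Nomad-Token", os.Getenv("NOMAD_TOKEN"))

	resp, err := client.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("lock-acquire status:", resp.Status)
}
```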

The [`nomad var lock`][] command implements this logic, so you can use it to shim
the process being locked.

### ACLs

Allocations cannot write to Nomad variables by default. You must configure a
[workload-associated ACL policy][] that allows write access in the
[`namespace.variables`][] block. For example, the following ACL policy allows
access to write a lock on the path `nomad/jobs/example/lock` in the `prod`
namespace:

```hcl
namespace "prod" {
  variables {
    path "nomad/jobs/example/lock" {
      capabilities = ["write", "read", "list"]
    }
  }
}
```

You set this policy on the job with
`nomad acl policy apply -namespace prod -job example example-lock ./policy.hcl`.

## Implementation

### Use `nomad var lock`

We recommend implementing the locking logic with `nomad var lock` as a shim in
your task. This example jobspec assumes there's a Nomad binary in the container
image.

```hcl
job "example" {
  group "group" {

    disconnect {
      stop_on_client_after = "1m"
    }

    task "primary" {
      driver = "docker"
      config {
        image   = "example/app:1"
        command = "nomad"
        args = [
          "var", "lock", "nomad/jobs/example/lock",  # lock
          "busybox", "httpd",                        # application
          "-vv", "-f", "-p", "8001", "-h", "/local"  # application args
        ]
      }

      identity {
        env = true
      }
    }
  }
}
```

If you don't want to ship a Nomad binary in the container image, make a
read-only mount of the binary from a host volume. This only works in cases
where the Nomad binary has been statically linked or you have glibc in the
container image.

<CodeBlockConfig lineNumbers highlight="8-12,30-33">

```hcl
job "example" {
  group "group" {

    disconnect {
      stop_on_client_after = "1m"
    }

    volume "binaries" {
      type      = "host"
      source    = "binaries"
      read_only = true
    }

    task "primary" {
      driver = "docker"
      config {
        image   = "example/app:1"
        command = "/opt/bin/nomad"
        args = [
          "var", "lock", "nomad/jobs/example/lock",  # lock
          "busybox", "httpd",                        # application
          "-vv", "-f", "-p", "8001", "-h", "/local"  # application args
        ]
      }

      identity {
        env = true # make NOMAD_TOKEN available to lock command
      }

      volume_mount {
        volume      = "binaries"
        destination = "/opt/bin"
      }
    }
  }
}
```

</CodeBlockConfig>

### Sidecar lock

If you cannot implement the lock logic in your application or with a shim such
as `nomad var lock`, you need to run the task you are locking as a sidecar of
the locking task, which has [`task.leader=true`][] set.

<CodeBlockConfig lineNumbers highlight="9">

```hcl
job "example" {
  group "group" {

    disconnect {
      stop_on_client_after = "1m"
    }

    task "lock" {
      leader = true
      driver = "raw_exec"
      config {
        command  = "/opt/lock-script.sh"
        pid_mode = "host"
      }

      identity {
        env = true # make NOMAD_TOKEN available to lock command
      }
    }

    task "application" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      driver = "docker"
      config {
        image = "example/app:1"
      }
    }
  }
}
```

</CodeBlockConfig>

The locking task has the following requirements:

- Must be in the same group as the task being locked.
- Must be able to terminate the task being locked without the Nomad client being
  up. For example, they share the same PID namespace, or the locking task is
  privileged.
- Must have a way of signalling the task being locked that it is safe to start.
  For example, the locking task can write a sentinel file into the `/alloc`
  directory, which the locked task tries to read on startup and blocks until it
  exists.
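
To illustrate the third requirement, the following Go sketch shows an
entrypoint wrapper for the locked task: it blocks until the locking task has
written an agreed-upon sentinel file under the shared allocation directory and
then execs the real application. The sentinel file name, polling interval, and
application command are arbitrary example values.

```go
// Sketch of the locked task's entrypoint: wait for the sentinel file that the
// locking task writes after it acquires the lock, then exec the application.
package main

import (
	"os"
	"path/filepath"
	"syscall"
	"time"
)

func main() {
	// Hypothetical sentinel path agreed on by both tasks.
	sentinel := filepath.Join(os.Getenv("NOMAD_ALLOC_DIR"), "lock-held")

	// Block until the locking task signals that the lock is held.
	for {
		if _, err := os.Stat(sentinel); err == nil {
			break
		}
		time.Sleep(time.Second)
	}

	// Replace this process with the real application (example command only).
	app := "/usr/bin/busybox"
	args := []string{app, "httpd", "-vv", "-f", "-p", "8001"}
	if err := syscall.Exec(app, args, os.Environ()); err != nil {
		os.Exit(1)
	}
}
```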

If you cannot meet the third requirement, then you need to split the lock
acquisition and lock heartbeat into separate tasks.

<CodeBlockConfig lineNumbers highlight="8-20,22-32">

```hcl
job "example" {
  group "group" {

    disconnect {
      stop_on_client_after = "1m"
    }

    task "acquire" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }
      driver = "raw_exec"
      config {
        command = "/opt/lock-acquire-script.sh"
      }
      identity {
        env = true # make NOMAD_TOKEN available to lock command
      }
    }

    task "heartbeat" {
      leader = true
      driver = "raw_exec"
      config {
        command  = "/opt/lock-heartbeat-script.sh"
        pid_mode = "host"
      }
      identity {
        env = true # make NOMAD_TOKEN available to lock command
      }
    }

    task "application" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      driver = "docker"
      config {
        image = "example/app:1"
      }
    }
  }
}
```

</CodeBlockConfig>

[`group.disconnect.stop_on_client_after`]: /nomad/docs/job-specification/disconnect#stop_on_client_after
[Locks API]: /nomad/api-docs/variables/locks
[Task API]: /nomad/api-docs/task-api
[`nomad var lock`]: /nomad/commands/var/lock
[workload-associated ACL policy]: /nomad/docs/concepts/workload-identity#workload-associated-acl-policies
[`namespace.variables`]: /nomad/docs/other-specifications/acl-policy#variables
[`task.leader=true`]: /nomad/docs/job-specification/task#leader
[`restart`]: /nomad/docs/job-specification/restart

content/nomad/v1.11.x/data/docs-nav-data.json

Lines changed: 4 additions & 0 deletions
@@ -697,6 +697,10 @@
 {
   "title": "Configure rolling",
   "path": "job-declare/strategy/rolling"
+},
+{
+  "title": "Configure singleton",
+  "path": "job-declare/strategy/singleton"
 }
 ]
 },
