This protocol defines the communication specification for robot control inference services. It is used to connect a local robot client with a remote inference server.
The end-to-end test flow works as follows:
- You implement the inference service. Based on this sample code, wrap your model in a service that exposes the `POST /predict` API (see `server.py`). Replace `load_model()` and the inference logic inside `predict()` with your own policy.
- You run the service locally or on a reachable host. Start the server (e.g. `python server.py`) so it listens on a known host and port.
- The local ARK Aloha arm acts as the client. The on-robot (or local) control stack collects joint state and camera images, calls your `/predict` endpoint at the configured rate (up to 50 FPS), and applies the returned 14-D action vector to the dual-arm system.


Participants provide an inference API; the robot side collects observations, posts them over HTTP, and applies the action returned in the response. Here, `client.py` is a minimal stand-in for the Aloha client; in production, the ARK stack replaces it while keeping the same request/response format.

    ┌─────────────────────┐     HTTP POST /predict     ┌──────────────────────┐
    │  ARK Aloha (client) │ ─────────────────────────► │  Your inference API  │
    │  state + images     │ ◄───────────────────────── │  (this sample)       │
    │  applies action     │        action [14]         │  load_model + predict│
    └─────────────────────┘                            └──────────────────────┘


To run the sample end to end:

    # Terminal 1: start the inference server
    python server.py
    # Terminal 2: simulate the robot client (optional smoke test)
    python client.py

Point the real Aloha client at your server URL (same multipart fields as `client.py`).

Set up the environment:

    cd /path/to/sample_code
    # (Recommended) Create and activate a virtual environment
    python -m venv venv
    source venv/bin/activate  # Windows: venv\Scripts\activate
    pip install fastapi uvicorn numpy opencv-python requests

Edit `server.py`:

- Implement `load_model()`: load checkpoints and return your policy object.
- Implement inference in `predict(state, images)`: map observations to a 14-D `numpy` action vector.

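
The two functions above can be sketched as follows. This is only an illustration under stated assumptions, not the contents of the actual `server.py`: the `HoldPosePolicy` class, its `act` method, the module-level `MODEL`, and the idea that `images` is a dict keyed by camera name are all hypothetical. The placeholder policy simply echoes the current joint state back as the action.

```python
import numpy as np

ACTION_DIM = 14  # 6 joints + 1 gripper per arm, two arms


class HoldPosePolicy:
    """Hypothetical placeholder policy: command the current joint state."""

    def act(self, state, images):
        # A real policy would also use the camera images; this one ignores them.
        return np.asarray(state, dtype=np.float32)


def load_model():
    # Replace with your own checkpoint loading; here we build the placeholder.
    return HoldPosePolicy()


MODEL = load_model()  # assumption: the model is loaded once at startup


def predict(state, images):
    """Map one observation to a 14-D numpy action vector."""
    state = np.asarray(state, dtype=np.float32)
    if state.shape != (ACTION_DIM,):
        raise ValueError(f"expected {ACTION_DIM} state values, got shape {state.shape}")
    action = np.asarray(MODEL.act(state, images), dtype=np.float32)
    # Guard against malformed policy output before it reaches the robot.
    if action.shape != (ACTION_DIM,) or not np.all(np.isfinite(action)):
        raise ValueError("policy returned an invalid action")
    return action
```

Checking the shape and finiteness of the output inside `predict` keeps a misbehaving policy from sending a malformed vector to the arm.
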
Keep the HTTP contract (POST /predict, multipart fields, JSON response) unchanged so the Aloha client can connect without modification.
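
Start the service and run the sample client:

    # Start the service
    python server.py
    # Run the sample client against http://localhost:8000/predict
    python client.py

For integration with the real arm, configure the client stack to use your server host and port, and verify that request latency stays far below the 30 s timeout. At the maximum call rate of 50 FPS, each request has roughly 20 ms if requests are issued one at a time.
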
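
One way to check latency against the control rate is to time a handful of calls and compare the worst case with the per-step budget. The sketch below assumes one request per control step, issued sequentially; `measure_latency` and `summarize` are hypothetical helper names, and `call` can be any zero-argument function that performs one `/predict` round trip.

```python
import statistics
import time

TIMEOUT_S = 30.0    # request timeout from this protocol
MAX_RATE_HZ = 50.0  # maximum call rate from this protocol


def measure_latency(call, n=20):
    """Time n invocations of `call`; return per-call latencies in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies, rate_hz=MAX_RATE_HZ):
    """Compare measured latencies with the per-step budget at `rate_hz`."""
    budget = 1.0 / rate_hz  # e.g. 0.02 s at 50 Hz
    worst = max(latencies)
    return {
        "mean_s": statistics.mean(latencies),
        "max_s": worst,
        "budget_s": budget,
        "within_budget": worst <= budget,
        "within_timeout": worst < TIMEOUT_S,
    }
```

A request that passes the 30 s timeout check can still be far too slow for a 50 FPS loop, which is why the two checks are reported separately.
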
Endpoint: `POST /predict`

Request encoding: `multipart/form-data`, with the following fields:

| Field | Type | Required | Description |
|---|---|---|---|
| state | string (JSON) | Yes | Joint state array (14 floats) |
| task | string | Yes | Task name |
| cam_high | file (JPEG) | No | High camera image |
| cam_left_wrist | file (JPEG) | No | Left wrist camera image |
| cam_right_wrist | file (JPEG) | No | Right wrist camera image |
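
Because `state` travels as a JSON string inside a form field, both sides need to agree on its encoding. The following sketch shows one possible round trip using only the standard library; `build_form_fields` and `parse_state` are hypothetical helper names, and the validation rules (exactly 14 finite numbers) are an assumption derived from the table above.

```python
import json
import math

STATE_DIM = 14


def build_form_fields(state, task):
    """Client side: encode the non-file form fields of a /predict request."""
    return {"state": json.dumps([float(x) for x in state]), "task": task}


def parse_state(raw):
    """Server side: decode `state` and check it holds 14 finite numbers."""
    values = json.loads(raw)
    if not isinstance(values, list) or len(values) != STATE_DIM:
        raise ValueError(f"state must be a JSON array of {STATE_DIM} numbers")
    state = []
    for v in values:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise ValueError("state entries must be numbers")
        if not math.isfinite(v):
            raise ValueError("state entries must be finite")
        state.append(float(v))
    return state
```

With the `requests` library, the returned dict would typically be passed as `data=` and the JPEG images as `files=`, for example `requests.post(url, data=fields, files={"cam_high": ("cam_high.jpg", jpeg_bytes, "image/jpeg")}, timeout=30)`; the file name and variable names here are illustrative.
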

Response body (JSON):

    {
      "action": [j0, j1, j2, j3, j4, j5, g0, j6, j7, j8, j9, j10, j11, g1]
    }

| Index | Meaning |
|---|---|
| 0–5 | Left arm joints 1–6 |
| 6 | Left gripper |
| 7–12 | Right arm joints 1–6 |
| 13 | Right gripper |

The action vector uses the same layout as `state`.

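
The index layout can be made explicit in code. This is a minimal sketch, with hypothetical helper names and dictionary keys, that splits a 14-D vector according to the index table and serializes an action into the JSON response shape.

```python
import json

ACTION_DIM = 14


def split_action(action):
    """Split a 14-D vector into per-arm joints and grippers."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    return {
        "left_joints": list(action[0:6]),    # indices 0-5
        "left_gripper": action[6],           # index 6
        "right_joints": list(action[7:13]),  # indices 7-12
        "right_gripper": action[13],         # index 13
    }


def build_response(action):
    """Serialize an action as the JSON body returned by /predict."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    return json.dumps({"action": [float(x) for x in action]})
```

Since `state` shares the layout, `split_action` can be applied to the incoming joint state as well.
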
Camera image format:

| Property | Value |
|---|---|
| Format | JPEG |
| Size | 224 × 224 |
| Channels | RGB |

Transport parameters:

| Parameter | Value |
|---|---|
| Protocol | HTTP/1.1 |
| Encoding | multipart/form-data |
| JPEG quality | 100 |
| Timeout | 30 s |
| Call rate | up to 50 FPS |

Files in the sample:

    client.py   # Example client (robot side)
    server.py   # Server template (your inference API)

Server responsibilities:

- `load_model()`: load your model
- `predict(state, images)`: run inference and return a 14-D action

Client responsibilities (Aloha / `client.py`):

- Read joint state and cameras
- POST to `/predict`
- Apply the returned `action` to the robot
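
The client responsibilities above amount to a fixed-rate loop. The sketch below is not the real Aloha client or `client.py`; `run_control_loop` and its three callables are hypothetical, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real hardware or real delays.

```python
import time


def run_control_loop(read_observation, request_action, apply_action,
                     steps, rate_hz=50.0,
                     sleep=time.sleep, clock=time.perf_counter):
    """Run `steps` iterations of observe -> request -> apply at up to `rate_hz`."""
    period = 1.0 / rate_hz
    for _ in range(steps):
        start = clock()
        observation = read_observation()      # joint state + camera images
        action = request_action(observation)  # e.g. one POST to /predict
        if len(action) != 14:
            raise ValueError("expected a 14-D action")
        apply_action(action)
        # Sleep off the rest of the period. If the request took longer
        # than the period, continue immediately and the rate drops.
        remaining = period - (clock() - start)
        if remaining > 0:
            sleep(remaining)
```

In a smoke test, `request_action` can be a stub that returns the current state, which exercises the loop timing without a running server.
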