## If you do not have dora-rs installed
# sudo apt update && sudo apt-get install wget
cargo install dora-cli --locked
pip install dora-rs
## In development, within the dora-node-api-python folder, you can also use:
## maturin develop --release
## To test Python version of dora
cd dora-rs/py-latency
dora up
### For operators
dora start dataflow_op.yml --attach
# Ctrl + C at the end
cat benchmark_data.csv
## To test Rust version of dora
cd dora-rs/rs-latency
cargo build --release --all
dora start dataflow.yml

docker run --network=host -e DISPLAY=${DISPLAY} -v $(pwd):/dora-benchmark -it osrf/ros:humble-desktop
# Within the docker container
cd dora-benchmark/ros2/py_pubsub
colcon build
. install/setup.bash
ros2 run py_pubsub listener & ros2 run py_pubsub talker
cat time.csv

A pure-Rust ROS 2 publisher/subscriber using the official
rclrs client library, mirroring
the dora-rs Rust benchmark for a fair Rust-vs-Rust comparison.
# Build inside a Jazzy ROS 2 container. The script clones rclrs +
# message-package dependencies into a colcon workspace and builds.
docker volume create ros2-ws
docker run --rm --platform linux/amd64 \
-v ros2-ws:/ros2_ws \
-v $(pwd):/dora-benchmark \
osrf/ros:jazzy-desktop \
bash /dora-benchmark/ros2/rs_pubsub/build.sh
# Run the benchmark
docker run --rm --platform linux/amd64 \
-v ros2-ws:/ros2_ws \
-v $(pwd):/dora-benchmark \
-w /dora-benchmark/ros2/rs_pubsub \
osrf/ros:jazzy-desktop bash -c '
set +u
source /opt/ros/jazzy/setup.sh
source /ros2_ws/install/setup.sh
rm -f time.csv
ros2 run ros2_benchmark_rs listener > /tmp/listener.log 2>&1 &
sleep 2
ros2 run ros2_benchmark_rs talker
sleep 5
cat time.csv
'

Measures GPU-to-GPU data transfer latency using CUDA IPC handles. Only the
64-byte IPC handle traverses the messaging framework -- bulk data stays on
GPU. All five configurations use identical raw
`cudaIpcGetMemHandle` / `cudaIpcOpenMemHandle` calls, so the differences
reflect messaging-framework overhead.
1000 samples per size, 10 warmup iterations discarded, 5 ms inter-send delay. Dora benchmarks use the direct node-to-node TCP optimization (dora-rs/dora#1621).
Latencies in the tables below are in µs. The first table is p50; its values match the p50 summary that follows the tables.
| Size | Dora C++ | Dora Rust | Dora Python | ROS2 C++ | ROS2 Python |
|---|---|---|---|---|---|
| 8 B | 236 | 173 | 320 | 256 | 436 |
| 64 B | 201 | 174 | 219 | 289 | 425 |
| 512 B | 248 | 181 | 252 | 369 | 452 |
| 4 KB | 164 | 170 | 277 | 260 | 456 |
| 40 KB | 202 | 141 | 339 | 327 | 320 |
| 400 KB | 202 | 242 | 306 | 177 | 288 |
| 4 MB | 290 | 263 | 318 | 207 | 346 |
| 40 MB | 269 | 265 | 500 | 316 | 510 |
| Size | Dora C++ | Dora Rust | Dora Python | ROS2 C++ | ROS2 Python |
|---|---|---|---|---|---|
| 8 B | 374 | 220 | 423 | 353 | 489 |
| 64 B | 351 | 212 | 307 | 411 | 499 |
| 512 B | 305 | 306 | 325 | 413 | 533 |
| 4 KB | 280 | 226 | 402 | 363 | 515 |
| 40 KB | 362 | 211 | 436 | 398 | 411 |
| 400 KB | 303 | 353 | 365 | 252 | 339 |
| 4 MB | 371 | 399 | 401 | 276 | 395 |
| 40 MB | 310 | 308 | 648 | 516 | 692 |
| Size | Dora C++ | Dora Rust | Dora Python | ROS2 C++ | ROS2 Python |
|---|---|---|---|---|---|
| 8 B | 458 | 280 | 489 | 416 | 534 |
| 64 B | 439 | 260 | 358 | 468 | 582 |
| 512 B | 387 | 376 | 385 | 457 | 601 |
| 4 KB | 347 | 324 | 461 | 421 | 580 |
| 40 KB | 435 | 328 | 484 | 460 | 493 |
| 400 KB | 388 | 434 | 415 | 320 | 383 |
| 4 MB | 470 | 459 | 491 | 343 | 454 |
| 40 MB | 372 | 384 | 725 | 583 | 762 |
Across all 8 size brackets (p50):
| Framework | min | max | mean |
|---|---|---|---|
| Dora Rust | 141 | 265 | 201 |
| Dora C++ | 164 | 290 | 226 |
| ROS2 C++ | 177 | 369 | 275 |
| Dora Python | 219 | 500 | 316 |
| ROS2 Python | 288 | 510 | 404 |
Dora Rust has the lowest mean p50 for GPU-to-GPU CUDA IPC (201 µs, about 27% below ROS2 C++ at 275 µs), although Dora C++ is faster at 4 KB and ROS2 C++ is faster at 400 KB and 4 MB. Dora C++ comes second on the mean. Dora Python beats ROS2 Python by ~90 µs on average.
Range overlap: 4 of 5 frameworks (everything except ROS2 Python) overlap in the 219–265 µs window. ROS2 Python is the only outlier on the slow side: its minimum (288 µs) is already above Dora Rust's maximum (265 µs).

Same machine, same 8 size brackets as the CUDA table, but moving actual bulk data through the framework instead of just a 64-byte IPC handle. Values are average latency in µs per message, 1000 samples per bracket.
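The summary rows and the overlap window can be re-derived from the p50 values. The following is a minimal sketch; the numbers are copied from the first of the three tables above, and the variable names are chosen only for illustration.

```python
from statistics import mean

# p50 latencies in µs, copied from the first table (8 B ... 40 MB).
p50 = {
    "Dora Rust":   [173, 174, 181, 170, 141, 242, 263, 265],
    "Dora C++":    [236, 201, 248, 164, 202, 202, 290, 269],
    "ROS2 C++":    [256, 289, 369, 260, 327, 177, 207, 316],
    "Dora Python": [320, 219, 252, 277, 339, 306, 318, 500],
    "ROS2 Python": [436, 425, 452, 456, 320, 288, 346, 510],
}

# (min, max, mean) per framework across the 8 size brackets.
summary = {name: (min(v), max(v), mean(v)) for name, v in p50.items()}
for name, (lo, hi, avg) in summary.items():
    print(f"{name:12s} min={lo} max={hi} mean={avg:.1f}")

# Overlap window of the four frameworks other than ROS2 Python:
# the largest minimum and the smallest maximum.
others = [s for name, s in summary.items() if name != "ROS2 Python"]
window = (max(lo for lo, _, _ in others), min(hi for _, hi, _ in others))
print("overlap window:", window)
```

A non-empty window requires the largest minimum to be at or below the smallest maximum, which holds for these four frameworks.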
| Size | Dora Rust | Dora Python | ROS2 C++ | ROS2 Python |
|---|---|---|---|---|
| 8 B | 104 | 398 | 279 | 428 |
| 64 B | 116 | 385 | 303 | 386 |
| 512 B | 127 | 391 | 304 | 351 |
| 4 KB | 347 | 824 | 283 | 341 |
| 40 KB | 428 | 844 | 321 | 473 |
| 400 KB | 349 | 864 | 349 | 1201 |
| 4 MB | 889 | 1170 | 20899 | 18422 |
| 40 MB | 2523 | 3187 | DNF | DNF |
ROS2 C++ and ROS2 Python both failed to sustain the 20 ms publish rate at 40 MB: they dropped messages and the benchmark could not collect 1000 samples.

Compare to the CUDA IPC table: at 40 MB, Dora Rust on CPU takes 2523 µs versus about 265 µs with CUDA IPC, roughly 10× slower when the actual data travels. ROS2 C++ goes from ~300 µs to 21 ms at 4 MB (a 70× blow-up), and ROS2 Python hits the same wall. FastDDS + CDR serialization of variable-size arrays appears to scale roughly quadratically at large sizes.
Takeaway: framework differences are dramatic for bulk data on CPU (Dora is roughly 16–24× faster than ROS2 at 4 MB, and ROS2 did not finish at 40 MB) and small for GPU-to-GPU via CUDA IPC (mean p50 between 201 and 404 µs across all five configurations). If your data is on the GPU, use CUDA IPC; the framework choice then barely matters. If your data is on the CPU, pick Dora.
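The ratios behind this comparison follow directly from the two tables. The snippet below is a quick arithmetic check, not part of the benchmark code. Note that it compares average latency (CPU table) against p50 (CUDA IPC table), so the CPU-vs-GPU ratio is only approximate.

```python
# Latencies in µs, copied from the tables above.
cpu_avg_4mb = {"Dora Rust": 889, "Dora Python": 1170,
               "ROS2 C++": 20899, "ROS2 Python": 18422}
dora_rust_cpu_40mb = 2523   # average, CPU table
dora_rust_cuda_40mb = 265   # p50, CUDA IPC table

# Slowdown when the bulk data itself goes through the framework.
cpu_vs_cuda = dora_rust_cpu_40mb / dora_rust_cuda_40mb
print(f"Dora Rust 40 MB, CPU vs CUDA IPC: {cpu_vs_cuda:.1f}x")

# Dora vs ROS2 at 4 MB on CPU, for every pairing.
ratios = {
    (d, r): cpu_avg_4mb[r] / cpu_avg_4mb[d]
    for d in ("Dora Rust", "Dora Python")
    for r in ("ROS2 C++", "ROS2 Python")
}
for (d, r), ratio in ratios.items():
    print(f"{r} / {d} at 4 MB: {ratio:.1f}x")
```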
- Dora uses Unix domain sockets for local IPC (dora-rs/dora#1622).
- Dora benchmarks use `dora run <yaml>` (fresh coordinator/daemon per run) instead of long-lived `dora up`. We observed that leftover state in a persistent daemon systematically slowed Dora Rust by ~100 µs per size bracket. Always use `dora run` for benchmarks.
- Dora C++ receiver uses `event_as_input` (raw uint8 payload) instead of Arrow C-Data Interface to avoid FFI overhead. Sender packs `[timestamp][num_elements][ipc_handle]` into a single 80-byte uint8 array.
- ROS2 C++ p50 is lowest at 400 KB–4 MB: FastDDS has very low per-message overhead on the steady-state path, but it has higher cold-start jitter at tiny sizes.
- Noise is real: single-digit percent variance between runs; exact p50 ordering shifts between runs, especially at the 8 B–512 B boundary.
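To illustrate the 80-byte `[timestamp][num_elements][ipc_handle]` payload mentioned in the notes, here is a minimal Python sketch. The field widths (two 8-byte unsigned integers followed by the 64-byte handle) and the little-endian byte order are assumptions chosen so the total comes to 80 bytes; the function names are hypothetical and the actual C++ sender may lay the fields out differently.

```python
import struct
import time

IPC_HANDLE_SIZE = 64  # size of a CUDA IPC memory handle in bytes

# Assumed layout: little-endian uint64 timestamp, uint64 element count,
# then the raw 64-byte handle, with no padding.
LAYOUT = struct.Struct(f"<QQ{IPC_HANDLE_SIZE}s")


def pack_message(timestamp_ns: int, num_elements: int, ipc_handle: bytes) -> bytes:
    """Pack the three fields into one flat byte payload."""
    if len(ipc_handle) != IPC_HANDLE_SIZE:
        raise ValueError("IPC handle must be exactly 64 bytes")
    return LAYOUT.pack(timestamp_ns, num_elements, ipc_handle)


def unpack_message(payload: bytes):
    """Return (timestamp_ns, num_elements, ipc_handle)."""
    return LAYOUT.unpack(payload)


# Placeholder bytes, not a real CUDA handle.
handle = bytes(range(IPC_HANDLE_SIZE))
payload = pack_message(time.perf_counter_ns(), 512, handle)
print(len(payload))
```

Sending one flat byte array keeps the receiver side to a single fixed-offset read, which is consistent with the stated goal of avoiding FFI overhead.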
pip install dora-rs torch numpy tqdm
cd dora-rs/cuda-latency
dora up
dora start cuda_bench.yml --attach
cat benchmark_data.csv

cd dora-rs/cuda-latency-rust/receiver
cargo build --release
cd ..
dora up
dora start dataflow.yml --attach
cat time.csv

Uses prebuilt dora C++ libraries downloaded automatically from GitHub releases.
Requires: Arrow C++ (libarrow-dev), CUDA toolkit, CMake 3.17+.
cd dora-rs/cuda-latency-cpp
mkdir -p build && cd build
cmake ..
make -j$(nproc)
cd ..
dora up
dora start dataflow.yml --attach
cat time.csv

Measures GPU-to-GPU latency using CUDA IPC handles over ROS2, comparable to
the Dora CUDA benchmark (dora-rs/cuda-latency/). Only the 64-byte IPC handle
traverses ROS2 -- the bulk data stays on the GPU.
Requires: ROS2 Jazzy, NVIDIA GPU, CUDA toolkit.
cd ros2
# Build all three packages
source /opt/ros/jazzy/setup.bash
colcon build --packages-select cuda_interfaces cuda_cpp_pubsub cuda_py_pubsub
source install/setup.sh
# --- C++ CUDA IPC benchmark ---
install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_listener &
install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_talker
# Ctrl-C the listener after the talker finishes
cat time.csv
# --- Python CUDA IPC benchmark ---
install/cuda_py_pubsub/lib/cuda_py_pubsub/cuda_listener &
install/cuda_py_pubsub/lib/cuda_py_pubsub/cuda_talker
# Ctrl-C the listener after the talker finishes
cat time.csv
# --- CPU fallback baseline (data goes GPU→CPU→ROS2→CPU→GPU) ---
CUDA_BENCH_MODE=cpu install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_listener &
CUDA_BENCH_MODE=cpu install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_talker

Sizes match the Dora CUDA benchmark: 512, 5120, 51200, 512000, 5120000 int64 elements (4 KB -- 40 MB). 100 samples per size, 10 warmup iterations.
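The element counts map to payload sizes as follows, since each int64 element is 8 bytes. This is a small arithmetic sketch with illustrative variable names:

```python
INT64_BYTES = 8

# 512, 5120, ..., 5120000 elements: each step is 10x the previous one.
element_counts = [512 * 10**i for i in range(5)]
payload_bytes = [n * INT64_BYTES for n in element_counts]

for n, size in zip(element_counts, payload_bytes):
    print(f"{n:>8} elements -> {size:>9} bytes")
```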
Both the dora-rs Rust benchmark (dora-rs/rs-latency/timer.csv) and the ROS 2
Python benchmark (ros2/py_pubsub/time.csv) share the same columns so the
results can be compared directly:
date, language, platform, name, size_bytes, avg_us, p50_us, p90_us, p99_us, n
Each row summarises one payload size bracket: average, p50, p90, p99 latency
(in microseconds) over n samples. The benchmarks use 1000 samples per size
and match payload sizes (8 B, 40 KB, 400 KB, 4 MB, 40 MB) for a
direct comparison.
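As an illustration of how one such summary row could be produced from raw latency samples, consider the sketch below. The nearest-rank percentile method, the helper names, and the example field values are assumptions; the benchmarks may compute their percentiles differently.

```python
import csv
import io
import statistics
from datetime import date

COLUMNS = ["date", "language", "platform", "name", "size_bytes",
           "avg_us", "p50_us", "p90_us", "p99_us", "n"]


def percentile(sorted_samples, q):
    """Nearest-rank percentile (q in 1..100) of an already sorted list."""
    rank = max(1, -(-q * len(sorted_samples) // 100))  # ceil(q * n / 100)
    return sorted_samples[rank - 1]


def summarise(samples_us, size_bytes, language, platform, name):
    """Build one row with the shared column layout."""
    s = sorted(samples_us)
    return {
        "date": date.today().isoformat(),
        "language": language,
        "platform": platform,
        "name": name,
        "size_bytes": size_bytes,
        "avg_us": round(statistics.fmean(s), 1),
        "p50_us": percentile(s, 50),
        "p90_us": percentile(s, 90),
        "p99_us": percentile(s, 99),
        "n": len(s),
    }


# Synthetic samples (1..100 µs) standing in for measured latencies.
row = summarise(list(range(1, 101)), 8, "python", "example", "demo")

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Keeping the column list in one place makes it harder for two benchmarks to drift apart in their output format.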

