Robotic Dataflow Benchmark

Getting started

## If you do not have dora-rs installed
# sudo apt update && sudo apt-get install wget
cargo install dora-cli --locked
pip install dora-rs
## In development, within the dora-node-api-python folder, you can also use:
## maturin develop --release

## To test Python version of dora
cd dora-rs/py-latency
dora up

### For operators
dora start dataflow_op.yml --attach
# Ctrl + C at the end

cat benchmark_data.csv

## To test Rust version of dora
cd dora-rs/rs-latency
cargo build --release --all
dora start dataflow.yml

Getting started ROS2 Python

docker run --network=host -e DISPLAY=${DISPLAY} -v $(pwd):/dora-benchmark -it osrf/ros:humble-desktop

# Within the docker container
cd dora-benchmark/ros2/py_pubsub
colcon build
. install/setup.bash
ros2 run py_pubsub listener & ros2 run py_pubsub talker
cat time.csv

Getting started ROS2 Rust (rclrs)

A pure-Rust ROS 2 publisher/subscriber using the official rclrs client library, mirroring the dora-rs Rust benchmark for a fair Rust-vs-Rust comparison.

# Build inside a Jazzy ROS 2 container. The script clones rclrs +
# message-package dependencies into a colcon workspace and builds.
docker volume create ros2-ws
docker run --rm --platform linux/amd64 \
    -v ros2-ws:/ros2_ws \
    -v $(pwd):/dora-benchmark \
    osrf/ros:jazzy-desktop \
    bash /dora-benchmark/ros2/rs_pubsub/build.sh

# Run the benchmark
docker run --rm --platform linux/amd64 \
    -v ros2-ws:/ros2_ws \
    -v $(pwd):/dora-benchmark \
    -w /dora-benchmark/ros2/rs_pubsub \
    osrf/ros:jazzy-desktop bash -c '
        set +u
        source /opt/ros/jazzy/setup.sh
        source /ros2_ws/install/setup.sh
        rm -f time.csv
        ros2 run ros2_benchmark_rs listener > /tmp/listener.log 2>&1 &
        sleep 2
        ros2 run ros2_benchmark_rs talker
        sleep 5
        cat time.csv
    '

CUDA GPU-to-GPU Latency

Measures GPU-to-GPU data transfer latency using CUDA IPC handles. Only the 64-byte IPC handle traverses the messaging framework -- bulk data stays on GPU. All five configurations use identical raw cudaIpcGetMemHandle / cudaIpcOpenMemHandle calls, so the differences reflect messaging-framework overhead.

1000 samples per size, 10 warmup iterations discarded, 5 ms inter-send delay. Dora benchmarks use the direct node-to-node TCP optimization (dora-rs/dora#1621).
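As a rough illustration of how the per-size statistics could be derived from raw measurements, the sketch below discards the warmup iterations and then computes the average and the p50/p90/p99 values. The function name `summarize` and the nearest-rank percentile method are assumptions made for this example; the benchmark code may use a different percentile definition.

```python
def summarize(latencies_us, warmup=10):
    """Discard warmup iterations, then summarize the remaining latencies (µs)."""
    samples = sorted(latencies_us[warmup:])
    n = len(samples)
    if n == 0:
        raise ValueError("no samples left after discarding warmup")

    def percentile(p):
        # Nearest-rank (an assumption): the smallest sample such that
        # at least p% of all samples are less than or equal to it.
        rank = max(1, (p * n + 99) // 100)  # integer ceil(p * n / 100)
        return samples[rank - 1]

    return {
        "avg_us": sum(samples) / n,
        "p50_us": percentile(50),
        "p90_us": percentile(90),
        "p99_us": percentile(99),
        "n": n,
    }
```

Discarding the warmup before sorting matters: the first iterations typically include one-off setup costs that would otherwise inflate the tail percentiles.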

p50 latency (microseconds)

Size	Dora C++	Dora Rust	Dora Python	ROS2 C++	ROS2 Python
8 B	236	173	320	256	436
64 B	201	174	219	289	425
512 B	248	181	252	369	452
4 KB	164	170	277	260	456
40 KB	202	141	339	327	320
400 KB	202	242	306	177	288
4 MB	290	263	318	207	346
40 MB	269	265	500	316	510

p90 latency (microseconds)

Size	Dora C++	Dora Rust	Dora Python	ROS2 C++	ROS2 Python
8 B	374	220	423	353	489
64 B	351	212	307	411	499
512 B	305	306	325	413	533
4 KB	280	226	402	363	515
40 KB	362	211	436	398	411
400 KB	303	353	365	252	339
4 MB	371	399	401	276	395
40 MB	310	308	648	516	692

p99 latency (microseconds)

Size	Dora C++	Dora Rust	Dora Python	ROS2 C++	ROS2 Python
8 B	458	280	489	416	534
64 B	439	260	358	468	582
512 B	387	376	385	457	601
4 KB	347	324	461	421	580
40 KB	435	328	484	460	493
400 KB	388	434	415	320	383
4 MB	470	459	491	343	454
40 MB	372	384	725	583	762

Framework comparison — range analysis

Across all 8 size brackets (p50):

Framework	min	max	mean
Dora Rust	141	265	201
Dora C++	164	290	226
ROS2 C++	177	369	275
Dora Python	219	500	316
ROS2 Python	288	510	404

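Dora Rust has the lowest mean p50 for GPU-to-GPU CUDA IPC (201 µs, about 27% below ROS2 C++ at 275 µs) and the lowest p50 in 5 of the 8 size brackets; Dora C++ is fastest at 4 KB, and ROS2 C++ is fastest at 400 KB and 4 MB. Dora C++ has the second-lowest mean. Dora Python beats ROS2 Python by ~90 µs on average.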

Range overlap: 4 of 5 frameworks (everything except ROS2 Python) overlap in the [219–265] µs window. ROS2 Python is the only outlier on the slow side — its minimum (288 µs) is already above Dora Rust's maximum (265 µs).
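The min/max/mean figures and the overlap window can be rechecked directly from the p50 table. The sketch below copies the p50 values from that table; the helper names `stats` and `overlap` are made up for this example.

```python
# p50 latencies in µs, copied from the p50 table (8 B through 40 MB).
P50_US = {
    "Dora C++":    [236, 201, 248, 164, 202, 202, 290, 269],
    "Dora Rust":   [173, 174, 181, 170, 141, 242, 263, 265],
    "Dora Python": [320, 219, 252, 277, 339, 306, 318, 500],
    "ROS2 C++":    [256, 289, 369, 260, 327, 177, 207, 316],
    "ROS2 Python": [436, 425, 452, 456, 320, 288, 346, 510],
}


def stats(values):
    """Return (min, max, mean) for one framework's p50 values."""
    return min(values), max(values), sum(values) / len(values)


def overlap(names):
    """Return the window shared by the [min, max] ranges of all named frameworks."""
    lo = max(min(P50_US[name]) for name in names)  # largest minimum
    hi = min(max(P50_US[name]) for name in names)  # smallest maximum
    return (lo, hi) if lo <= hi else None  # None: no common window


if __name__ == "__main__":
    for name, values in P50_US.items():
        lo, hi, mean = stats(values)
        print(f"{name}: min={lo} max={hi} mean={mean:.0f}")
    print(overlap(["Dora C++", "Dora Rust", "Dora Python", "ROS2 C++"]))
```

The shared window is the largest of the minimums up to the smallest of the maximums; adding ROS2 Python makes that interval empty, which is why it is described as the outlier.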

CPU benchmark (same machine, same sizes)

This benchmark runs on the same machine with the same 8 size brackets as the CUDA tables, but moves the actual bulk data through the framework instead of just a 64-byte IPC handle. Values are average latency in µs per message, 1000 samples per bracket.

Size	Dora Rust	Dora Python	ROS2 C++	ROS2 Python
8 B	104	398	279	428
64 B	116	385	303	386
512 B	127	391	304	351
4 KB	347	824	283	341
40 KB	428	844	321	473
400 KB	349	864	349	1201
4 MB	889	1170	20899	18422
40 MB	2523	3187	DNF	DNF

DNF means "did not finish": ROS2 C++ and ROS2 Python both failed to sustain the 20 ms publish rate at 40 MB. They dropped messages, so the benchmark could not collect 1000 samples.

Compare to the CUDA IPC table: at 40 MB, Dora Rust CPU = 2523 µs vs Dora Rust CUDA IPC ≈ 265 µs, about 10× slower when the actual data travels. ROS2 C++ goes from ~300 µs to 21 ms at 4 MB (a 70× blow-up), and ROS2 Python hits the same wall. FastDDS+CDR serialization of variable-size arrays scales superlinearly at large sizes: going from 400 KB to 4 MB is 10× more data but about 60× more latency for ROS2 C++ (349 µs to 20899 µs).

Takeaway: framework differences are dramatic for bulk data on CPU (at 4 MB, Dora is roughly 16–24× faster than ROS2, and only Dora completed the 40 MB bracket) and much smaller for GPU-to-GPU via CUDA IPC (at every size, all five p50 values fall within about 300 µs of each other). If your data is on the GPU, use CUDA IPC; the framework choice then matters far less. If your data is on the CPU, pick dora.
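To recompute slowdown ratios from the tables rather than take them on trust, a small helper is enough. The sketch below copies the 4 MB and 40 MB rows of the CPU table and the Dora Rust 40 MB p50 from the CUDA table; the names `CPU_AVG_US` and `ratio` are made up for this example, and DNF entries are represented as `None`.

```python
# Average CPU latencies in µs, copied from the CPU table. None marks a DNF entry.
CPU_AVG_US = {
    "4 MB": {"Dora Rust": 889, "Dora Python": 1170,
             "ROS2 C++": 20899, "ROS2 Python": 18422},
    "40 MB": {"Dora Rust": 2523, "Dora Python": 3187,
              "ROS2 C++": None, "ROS2 Python": None},
}

# Dora Rust p50 at 40 MB from the CUDA IPC table, in µs.
CUDA_P50_DORA_RUST_40MB_US = 265


def ratio(slow_us, fast_us):
    """Return how many times larger slow_us is than fast_us, or None if a value is missing."""
    if slow_us is None or fast_us is None:
        return None
    return slow_us / fast_us


if __name__ == "__main__":
    row = CPU_AVG_US["4 MB"]
    print(f"ROS2 C++ vs Dora Rust at 4 MB: {ratio(row['ROS2 C++'], row['Dora Rust']):.1f}x")
    print(f"ROS2 Python vs Dora Python at 4 MB: {ratio(row['ROS2 Python'], row['Dora Python']):.1f}x")
    cpu_40mb = CPU_AVG_US["40 MB"]["Dora Rust"]
    print(f"Dora Rust CPU vs CUDA IPC at 40 MB: {ratio(cpu_40mb, CUDA_P50_DORA_RUST_40MB_US):.1f}x")
```

Returning `None` for DNF rows avoids producing a misleading number for a bracket where one side has no measurement at all.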

Notes

Dora uses Unix domain sockets for local IPC (dora-rs/dora#1622).
Dora benchmarks use dora run <yaml> (fresh coordinator/daemon per run) instead of long-lived dora up. We observed that leftover state in a persistent daemon systematically slowed Dora Rust by ~100 µs per size bracket. Always use dora run for benchmarks.
Dora C++ receiver uses event_as_input (raw uint8 payload) instead of Arrow C-Data Interface to avoid FFI overhead. Sender packs [timestamp][num_elements][ipc_handle] into a single 80-byte uint8 array.
ROS2 C++ p50 is lowest at 400KB–4MB — FastDDS has very low per-message overhead on the steady-state path, but it has higher cold-start jitter at tiny sizes.
Noise is real — single-digit percent variance between runs; exact p50 ordering shifts between runs, especially at the 8B–512B boundary.
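The 80-byte [timestamp][num_elements][ipc_handle] payload mentioned in the notes above can be illustrated with Python's struct module. This is only a sketch: the field widths (two 8-byte unsigned integers followed by the 64-byte handle, which do add up to 80 bytes), the little-endian byte order, the nanosecond timestamp unit, and the function names are all assumptions, not taken from the benchmark source.

```python
import struct

IPC_HANDLE_SIZE = 64  # size of a CUDA IPC memory handle in bytes, as stated above

# Assumed layout: little-endian uint64 timestamp, uint64 element count,
# then the raw handle bytes. "<" disables padding, so the size is exactly 80.
LAYOUT = struct.Struct(f"<QQ{IPC_HANDLE_SIZE}s")


def pack_message(timestamp_ns, num_elements, ipc_handle):
    """Pack the three fields into one 80-byte buffer."""
    if len(ipc_handle) != IPC_HANDLE_SIZE:
        raise ValueError(f"IPC handle must be exactly {IPC_HANDLE_SIZE} bytes")
    return LAYOUT.pack(timestamp_ns, num_elements, ipc_handle)


def unpack_message(payload):
    """Return (timestamp_ns, num_elements, ipc_handle) from an 80-byte buffer."""
    if len(payload) != LAYOUT.size:
        raise ValueError(f"payload must be exactly {LAYOUT.size} bytes")
    return LAYOUT.unpack(bytes(payload))
```

Sending one flat byte buffer keeps the receiver's work to a fixed-offset read, which fits the note's stated goal of avoiding FFI overhead on the receive path.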

Getting started Dora CUDA GPU-to-GPU (Python)

pip install dora-rs torch numpy tqdm
cd dora-rs/cuda-latency
dora up
dora start cuda_bench.yml --attach
cat benchmark_data.csv

Getting started Dora CUDA GPU-to-GPU (Rust)

cd dora-rs/cuda-latency-rust/receiver
cargo build --release
cd ..
dora up
dora start dataflow.yml --attach
cat time.csv

Getting started Dora CUDA GPU-to-GPU (C++)

Uses prebuilt dora C++ libraries downloaded automatically from GitHub releases.

Requires: Arrow C++ (libarrow-dev), CUDA toolkit, CMake 3.17+.

cd dora-rs/cuda-latency-cpp
mkdir -p build && cd build
cmake ..
make -j$(nproc)
cd ..
dora up
dora start dataflow.yml --attach
cat time.csv

Getting started ROS2 CUDA GPU-to-GPU (IPC)

Measures GPU-to-GPU latency using CUDA IPC handles over ROS2, comparable to the Dora CUDA benchmark (dora-rs/cuda-latency/). Only the 64-byte IPC handle traverses ROS2 -- the bulk data stays on the GPU.

Requires: ROS2 Jazzy, NVIDIA GPU, CUDA toolkit.

cd ros2

# Build all three packages
source /opt/ros/jazzy/setup.bash
colcon build --packages-select cuda_interfaces cuda_cpp_pubsub cuda_py_pubsub
source install/setup.sh

# --- C++ CUDA IPC benchmark ---
install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_listener &
install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_talker
# Ctrl-C the listener after the talker finishes
cat time.csv

# --- Python CUDA IPC benchmark ---
install/cuda_py_pubsub/lib/cuda_py_pubsub/cuda_listener &
install/cuda_py_pubsub/lib/cuda_py_pubsub/cuda_talker
# Ctrl-C the listener after the talker finishes
cat time.csv

# --- CPU fallback baseline (data goes GPU→CPU→ROS2→CPU→GPU) ---
CUDA_BENCH_MODE=cpu install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_listener &
CUDA_BENCH_MODE=cpu install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_talker

Sizes match the Dora CUDA benchmark: 512, 5120, 51200, 512000, 5120000 int64 elements (4 KB -- 40 MB). 100 samples per size, 10 warmup iterations.
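The mapping from element counts to payload sizes is plain arithmetic: each int64 element occupies 8 bytes. A minimal sketch (the names are made up for this example):

```python
INT64_BYTES = 8  # one int64 element

# Element counts listed above.
ELEMENT_COUNTS = [512, 5120, 51200, 512000, 5120000]


def payload_bytes(num_elements, element_size=INT64_BYTES):
    """Return the payload size in bytes for a buffer of num_elements elements."""
    return num_elements * element_size


SIZES_BYTES = [payload_bytes(n) for n in ELEMENT_COUNTS]

if __name__ == "__main__":
    for count, size in zip(ELEMENT_COUNTS, SIZES_BYTES):
        print(f"{count} elements -> {size} bytes")
```

The resulting sizes are 4096, 40960, 409600, 4096000 and 40960000 bytes, so the "4 KB" to "40 MB" labels are rounded names for these brackets rather than exact byte counts.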

CSV output schema

Both the dora-rs Rust benchmark (dora-rs/rs-latency/timer.csv) and the ROS 2 Python benchmark (ros2/py_pubsub/time.csv) share the same columns so the results can be compared directly:

date, language, platform, name, size_bytes, avg_us, p50_us, p90_us, p99_us, n

Each row summarises one payload size bracket: average, p50, p90, p99 latency (in microseconds) over n samples. The benchmarks use 1000 samples per size and match payload sizes (8 B, 40 KB, 400 KB, 4 MB, 40 MB) for a direct comparison.
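For comparing the two CSV files programmatically, a small parser keyed on the column list above could look like the following sketch. It assumes comma-separated fields with no embedded commas, integer `size_bytes` and `n`, and numeric latency columns; it tolerates an optional header row because it is not clear from the schema alone whether the files include one. The function name `parse_results` is made up for this example.

```python
import csv
import io

COLUMNS = ["date", "language", "platform", "name", "size_bytes",
           "avg_us", "p50_us", "p90_us", "p99_us", "n"]
INT_FIELDS = ("size_bytes", "n")
FLOAT_FIELDS = ("avg_us", "p50_us", "p90_us", "p99_us")


def parse_results(text):
    """Parse benchmark CSV text into a list of dicts with typed numeric fields."""
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        cells = [cell.strip() for cell in raw]
        if not any(cells) or cells == COLUMNS:
            continue  # skip blank lines and a header row, if one is present
        if len(cells) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(cells)}")
        row = dict(zip(COLUMNS, cells))
        for key in INT_FIELDS:
            row[key] = int(row[key])
        for key in FLOAT_FIELDS:
            row[key] = float(row[key])
        rows.append(row)
    return rows
```

With both files parsed this way, rows can be joined on `size_bytes` to line up the dora-rs and ROS 2 results for the same payload bracket.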
