dora-rs/dora-benchmark

Robotic Dataflow Benchmark


Getting started

## If you do not have dora-rs installed
# sudo apt update && sudo apt-get install wget
cargo install dora-cli --locked
pip install dora-rs
## In development, within the dora-node-api-python folder, you can also use:
## maturin develop --release

## To test Python version of dora
cd dora-rs/py-latency
dora up

### For operators
dora start dataflow_op.yml --attach
# Ctrl + C at the end

cat benchmark_data.csv

## To test Rust version of dora
cd dora-rs/rs-latency
cargo build --release --all
dora start dataflow.yml

Getting started ROS2 Python

docker run --network=host -e DISPLAY=${DISPLAY} -v $(pwd):/dora-benchmark -it osrf/ros:humble-desktop

# Within the docker container
cd dora-benchmark/ros2/py_pubsub
colcon build
. install/setup.bash
ros2 run py_pubsub listener & ros2 run py_pubsub talker
cat time.csv

Getting started ROS2 Rust (rclrs)

A pure-Rust ROS 2 publisher/subscriber using the official rclrs client library, mirroring the dora-rs Rust benchmark for a fair Rust-vs-Rust comparison.

# Build inside a Jazzy ROS 2 container. The script clones rclrs +
# message-package dependencies into a colcon workspace and builds.
docker volume create ros2-ws
docker run --rm --platform linux/amd64 \
    -v ros2-ws:/ros2_ws \
    -v $(pwd):/dora-benchmark \
    osrf/ros:jazzy-desktop \
    bash /dora-benchmark/ros2/rs_pubsub/build.sh

# Run the benchmark
docker run --rm --platform linux/amd64 \
    -v ros2-ws:/ros2_ws \
    -v $(pwd):/dora-benchmark \
    -w /dora-benchmark/ros2/rs_pubsub \
    osrf/ros:jazzy-desktop bash -c '
        set +u
        source /opt/ros/jazzy/setup.sh
        source /ros2_ws/install/setup.sh
        rm -f time.csv
        ros2 run ros2_benchmark_rs listener > /tmp/listener.log 2>&1 &
        sleep 2
        ros2 run ros2_benchmark_rs talker
        sleep 5
        cat time.csv
    '

CUDA GPU-to-GPU Latency

Measures GPU-to-GPU data transfer latency using CUDA IPC handles. Only the 64-byte IPC handle traverses the messaging framework; the bulk data stays on the GPU. All five configurations use identical raw cudaIpcGetMemHandle / cudaIpcOpenMemHandle calls, so the differences reflect messaging-framework overhead.

1000 samples per size, 10 warmup iterations discarded, 5 ms inter-send delay. Dora benchmarks use the direct node-to-node TCP optimization (dora-rs/dora#1621).
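
The sampling protocol above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function name `summarize_latencies` and the nearest-rank percentile method are assumptions, since the README does not say how the percentiles are computed.

```python
import math


def summarize_latencies(samples_us, warmup=10):
    """Drop the warmup samples, then return p50/p90/p99 (nearest-rank method)."""
    kept = sorted(samples_us[warmup:])
    if not kept:
        raise ValueError("no samples left after discarding warmup")

    def percentile(p):
        # Nearest-rank: smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p * len(kept) / 100))
        return kept[rank - 1]

    return {"p50": percentile(50), "p90": percentile(90), "p99": percentile(99)}


# Hypothetical data: 10 slow warmup iterations followed by 1000 measured samples.
samples = [5000] * 10 + list(range(1, 1001))
stats = summarize_latencies(samples)
print(stats)
```

Discarding the first iterations keeps one-off costs (for example, first-touch allocation or connection setup) out of the reported percentiles.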

p50 latency (microseconds)

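| Size | Dora C++ | Dora Rust | Dora Python | ROS2 C++ | ROS2 Python |
|---|---|---|---|---|---|
| 8 B | 236 | 173 | 320 | 256 | 436 |
| 64 B | 201 | 174 | 219 | 289 | 425 |
| 512 B | 248 | 181 | 252 | 369 | 452 |
| 4 KB | 164 | 170 | 277 | 260 | 456 |
| 40 KB | 202 | 141 | 339 | 327 | 320 |
| 400 KB | 202 | 242 | 306 | 177 | 288 |
| 4 MB | 290 | 263 | 318 | 207 | 346 |
| 40 MB | 269 | 265 | 500 | 316 | 510 |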

p90 latency (microseconds)

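| Size | Dora C++ | Dora Rust | Dora Python | ROS2 C++ | ROS2 Python |
|---|---|---|---|---|---|
| 8 B | 374 | 220 | 423 | 353 | 489 |
| 64 B | 351 | 212 | 307 | 411 | 499 |
| 512 B | 305 | 306 | 325 | 413 | 533 |
| 4 KB | 280 | 226 | 402 | 363 | 515 |
| 40 KB | 362 | 211 | 436 | 398 | 411 |
| 400 KB | 303 | 353 | 365 | 252 | 339 |
| 4 MB | 371 | 399 | 401 | 276 | 395 |
| 40 MB | 310 | 308 | 648 | 516 | 692 |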

p99 latency (microseconds)

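| Size | Dora C++ | Dora Rust | Dora Python | ROS2 C++ | ROS2 Python |
|---|---|---|---|---|---|
| 8 B | 458 | 280 | 489 | 416 | 534 |
| 64 B | 439 | 260 | 358 | 468 | 582 |
| 512 B | 387 | 376 | 385 | 457 | 601 |
| 4 KB | 347 | 324 | 461 | 421 | 580 |
| 40 KB | 435 | 328 | 484 | 460 | 493 |
| 400 KB | 388 | 434 | 415 | 320 | 383 |
| 4 MB | 470 | 459 | 491 | 343 | 454 |
| 40 MB | 372 | 384 | 725 | 583 | 762 |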

Framework comparison: range analysis

Across all 8 size brackets (p50):

| Framework | min (µs) | max (µs) | mean (µs) |
|---|---|---|---|
| Dora Rust | 141 | 265 | 201 |
| Dora C++ | 164 | 290 | 226 |
| ROS2 C++ | 177 | 369 | 275 |
| Dora Python | 219 | 500 | 316 |
| ROS2 Python | 288 | 510 | 404 |
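
The summary rows can be recomputed from the p50 table. The sketch below copies the p50 columns by hand and uses only the standard library; the dictionary name `p50_us` is an illustrative choice. Note that the Dora C++ mean comes out at 226.5, which the table lists as 226.

```python
from statistics import mean

# p50 latencies in microseconds, copied from the p50 table (8 B through 40 MB).
p50_us = {
    "Dora C++":    [236, 201, 248, 164, 202, 202, 290, 269],
    "Dora Rust":   [173, 174, 181, 170, 141, 242, 263, 265],
    "Dora Python": [320, 219, 252, 277, 339, 306, 318, 500],
    "ROS2 C++":    [256, 289, 369, 260, 327, 177, 207, 316],
    "ROS2 Python": [436, 425, 452, 456, 320, 288, 346, 510],
}

# (min, max, mean) per framework across the 8 size brackets.
summary = {
    name: (min(vals), max(vals), mean(vals)) for name, vals in p50_us.items()
}

# Print frameworks ordered by mean latency, fastest first.
for name, (lo, hi, avg) in sorted(summary.items(), key=lambda kv: kv[1][2]):
    print(f"{name:12s} min={lo} max={hi} mean={avg:.1f}")
```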

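Dora Rust has the lowest mean p50 for GPU-to-GPU CUDA IPC (201 µs, about 27% lower than ROS2 C++'s 275 µs), although it is not the fastest in every bracket: Dora C++ edges it at 4 KB, and ROS2 C++ leads at 400 KB and 4 MB. Dora C++ comes second on the mean. Dora Python beats ROS2 Python by ~90 µs on average.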

Range overlap: 4 of 5 frameworks (everything except ROS2 Python) overlap in the [219–265] µs window. ROS2 Python is the only outlier on the slow side: its minimum (288 µs) is already above Dora Rust's maximum (265 µs).
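
The overlap window is the largest minimum paired with the smallest maximum of the ranges involved. A short sketch using the (min, max) values from the range table; the helper name `common_window` is hypothetical:

```python
# (min, max) p50 ranges in microseconds, from the range-analysis table.
ranges = {
    "Dora Rust": (141, 265),
    "Dora C++": (164, 290),
    "ROS2 C++": (177, 369),
    "Dora Python": (219, 500),
    "ROS2 Python": (288, 510),
}


def common_window(names):
    """Return the (low, high) interval shared by all named ranges, or None."""
    low = max(ranges[n][0] for n in names)
    high = min(ranges[n][1] for n in names)
    return (low, high) if low <= high else None


without_ros2_py = [n for n in ranges if n != "ROS2 Python"]
print(common_window(without_ros2_py))  # (219, 265)
print(common_window(list(ranges)))     # None
```

With ROS2 Python included the intersection is empty, because its minimum of 288 µs exceeds Dora Rust's maximum of 265 µs.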

CPU benchmark (same machine, same sizes)

Same machine, same 8 size brackets as the CUDA table, but moving actual bulk data through the framework instead of just a 64-byte IPC handle. Values are average latency in µs per message, 1000 samples per bracket.

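| Size | Dora Rust | Dora Python | ROS2 C++ | ROS2 Python |
|---|---|---|---|---|
| 8 B | 104 | 398 | 279 | 428 |
| 64 B | 116 | 385 | 303 | 386 |
| 512 B | 127 | 391 | 304 | 351 |
| 4 KB | 347 | 824 | 283 | 341 |
| 40 KB | 428 | 844 | 321 | 473 |
| 400 KB | 349 | 864 | 349 | 1201 |
| 4 MB | 889 | 1170 | 20899 | 18422 |
| 40 MB | 2523 | 3187 | DNF | DNF |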

ROS2 C++ and ROS2 Python both failed to sustain the 20 ms publish rate at 40 MB: they dropped messages, so the benchmark could not collect 1000 samples (shown as DNF in the table).

Compare to the CUDA IPC table: at 40 MB, Dora Rust CPU = 2523 µs vs Dora Rust CUDA IPC ≈ 265 µs, roughly 10× slower when the actual data travels. ROS2 C++ goes from ~300 µs at small sizes to 21 ms at 4 MB (a 70× blow-up), and ROS2 Python hits the same wall. FastDDS+CDR serialization of variable-size arrays appears to be quadratic-ish at large sizes.
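
The ratios quoted above follow directly from the table values. The sketch below redoes the arithmetic; note that it mixes an average (CPU table) with a p50 (CUDA table), as the comparison in the text does, and that the ~300 µs small-message level is an approximation taken from the text.

```python
# Latencies in microseconds, taken from the CPU and CUDA IPC tables.
dora_rust_cpu_40mb = 2523    # CPU benchmark, average
dora_rust_cuda_40mb = 265    # CUDA IPC benchmark, p50
ros2_cpp_cpu_small = 300     # approximate small-message level (text: "~300")
ros2_cpp_cpu_400kb = 349
ros2_cpp_cpu_4mb = 20899

cpu_vs_cuda = dora_rust_cpu_40mb / dora_rust_cuda_40mb
blow_up = ros2_cpp_cpu_4mb / ros2_cpp_cpu_small
# 10x more data (400 KB -> 4 MB) versus the resulting growth in latency.
growth_per_10x = ros2_cpp_cpu_4mb / ros2_cpp_cpu_400kb

print(f"Dora Rust CPU vs CUDA IPC at 40 MB: {cpu_vs_cuda:.1f}x")
print(f"ROS2 C++ 4 MB vs small messages:    {blow_up:.1f}x")
print(f"ROS2 C++ 400 KB -> 4 MB:            {growth_per_10x:.1f}x for 10x the data")
```

A roughly 60× latency increase for 10× the data is clearly super-linear, though below the 100× that strictly quadratic scaling would give.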

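Takeaway: framework differences are dramatic for bulk data on the CPU (at 4 MB, Dora is roughly 16–24× faster than ROS2, and ROS2 did not finish at 40 MB) and comparatively small for GPU-to-GPU via CUDA IPC (mean p50 values span about 200 µs across all five configurations, and only about 75 µs among Dora Rust, Dora C++, and ROS2 C++). If your data is on the GPU, use CUDA IPC; the framework choice then matters far less. If your data is on the CPU, pick dora.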

Notes

  • Dora uses Unix domain sockets for local IPC (dora-rs/dora#1622).
  • Dora benchmarks use dora run <yaml> (fresh coordinator/daemon per run) instead of long-lived dora up. We observed that leftover state in a persistent daemon systematically slowed Dora Rust by ~100 µs per size bracket. Always use dora run for benchmarks.
  • Dora C++ receiver uses event_as_input (raw uint8 payload) instead of Arrow C-Data Interface to avoid FFI overhead. Sender packs [timestamp][num_elements][ipc_handle] into a single 80-byte uint8 array.
  • ROS2 C++ p50 is lowest at 400 KB–4 MB: FastDDS has very low per-message overhead on the steady-state path, but it has higher cold-start jitter at tiny sizes.
  • Noise is real: single-digit percent variance between runs; exact p50 ordering shifts between runs, especially in the 8 B–512 B range.
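
The 80-byte payload described in the notes can be illustrated with Python's `struct` module. This is a sketch under assumptions: the README gives only the field order and the 80-byte total, so the field widths (8-byte timestamp, 8-byte element count, 64-byte handle), the signed integer types, and the little-endian byte order are guesses, and `pack_payload` / `unpack_payload` are hypothetical names.

```python
import struct
import time

IPC_HANDLE_SIZE = 64  # a CUDA IPC memory handle is 64 bytes

# Assumed layout: little-endian int64 timestamp, int64 element count,
# then the raw 64-byte handle. 8 + 8 + 64 = 80 bytes, no padding.
PAYLOAD = struct.Struct("<qq64s")


def pack_payload(timestamp_ns, num_elements, ipc_handle):
    """Pack [timestamp][num_elements][ipc_handle] into one 80-byte buffer."""
    if len(ipc_handle) != IPC_HANDLE_SIZE:
        raise ValueError("IPC handle must be exactly 64 bytes")
    return PAYLOAD.pack(timestamp_ns, num_elements, ipc_handle)


def unpack_payload(buf):
    """Inverse of pack_payload: returns (timestamp_ns, num_elements, handle)."""
    return PAYLOAD.unpack(buf)


# A zero-filled placeholder stands in for a real handle from cudaIpcGetMemHandle.
fake_handle = bytes(IPC_HANDLE_SIZE)
msg = pack_payload(time.time_ns(), 512, fake_handle)
print(len(msg))  # 80
```

Sending one fixed-size byte array keeps the message a flat uint8 buffer, which matches the raw-payload path the note describes for the C++ receiver.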

Getting started Dora CUDA GPU-to-GPU (Python)

pip install dora-rs torch numpy tqdm
cd dora-rs/cuda-latency
dora up
dora start cuda_bench.yml --attach
cat benchmark_data.csv

Getting started Dora CUDA GPU-to-GPU (Rust)

cd dora-rs/cuda-latency-rust/receiver
cargo build --release
cd ..
dora up
dora start dataflow.yml --attach
cat time.csv

Getting started Dora CUDA GPU-to-GPU (C++)

Uses prebuilt dora C++ libraries downloaded automatically from GitHub releases.

Requires: Arrow C++ (libarrow-dev), CUDA toolkit, CMake 3.17+.

cd dora-rs/cuda-latency-cpp
mkdir -p build && cd build
cmake ..
make -j$(nproc)
cd ..
dora up
dora start dataflow.yml --attach
cat time.csv

Getting started ROS2 CUDA GPU-to-GPU (IPC)

Measures GPU-to-GPU latency using CUDA IPC handles over ROS2, comparable to the Dora CUDA benchmark (dora-rs/cuda-latency/). Only the 64-byte IPC handle traverses ROS2; the bulk data stays on the GPU.

Requires: ROS2 Jazzy, NVIDIA GPU, CUDA toolkit.

cd ros2

# Build all three packages
source /opt/ros/jazzy/setup.bash
colcon build --packages-select cuda_interfaces cuda_cpp_pubsub cuda_py_pubsub
source install/setup.sh

# --- C++ CUDA IPC benchmark ---
install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_listener &
install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_talker
# Ctrl-C the listener after the talker finishes
cat time.csv

# --- Python CUDA IPC benchmark ---
install/cuda_py_pubsub/lib/cuda_py_pubsub/cuda_listener &
install/cuda_py_pubsub/lib/cuda_py_pubsub/cuda_talker
# Ctrl-C the listener after the talker finishes
cat time.csv

# --- CPU fallback baseline (data goes GPU→CPU→ROS2→CPU→GPU) ---
CUDA_BENCH_MODE=cpu install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_listener &
CUDA_BENCH_MODE=cpu install/cuda_cpp_pubsub/lib/cuda_cpp_pubsub/cuda_talker

Sizes match the Dora CUDA benchmark: 512, 5120, 51200, 512000, 5120000 int64 elements (4 KB -- 40 MB). 100 samples per size, 10 warmup iterations.
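
The size labels follow from the element counts at 8 bytes per int64 element. A quick check of the arithmetic (the variable names are illustrative):

```python
INT64_BYTES = 8
element_counts = [512, 5120, 51200, 512000, 5120000]

# Payload size in bytes for each element count.
sizes_bytes = [n * INT64_BYTES for n in element_counts]
for n, size in zip(element_counts, sizes_bytes):
    print(f"{n:>8} elements -> {size:>9} bytes ({size / 1024:g} KiB)")
```

The labels are nominal: 512 elements is exactly 4096 bytes (4 KiB), while the largest payload is 40,960,000 bytes, which the README rounds to 40 MB.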

CSV output schema

Both the dora-rs Rust benchmark (dora-rs/rs-latency/timer.csv) and the ROS 2 Python benchmark (ros2/py_pubsub/time.csv) share the same columns so the results can be compared directly:

date, language, platform, name, size_bytes, avg_us, p50_us, p90_us, p99_us, n

Each row summarises one payload size bracket: average, p50, p90, p99 latency (in microseconds) over n samples. The benchmarks use 1000 samples per size and match payload sizes (8 B, 40 KB, 400 KB, 4 MB, 40 MB) for a direct comparison.
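
Because the columns are shared, results from different benchmarks can be loaded with one parser. The sketch below uses only the standard library and assumes the files have no header row (if yours do, drop the `fieldnames` argument); the two sample rows are hypothetical values for illustration, not measured results.

```python
import csv
import io

COLUMNS = ["date", "language", "platform", "name", "size_bytes",
           "avg_us", "p50_us", "p90_us", "p99_us", "n"]

# Hypothetical rows for illustration only; these are not measured results.
sample_csv = """\
2025-01-01,rust,dora,example,8,100.0,95.0,120.0,150.0,1000
2025-01-01,python,ros2,example,8,400.0,390.0,450.0,500.0,1000
"""


def load_rows(text):
    """Parse header-less benchmark CSV text into dicts with numeric fields."""
    rows = []
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS,
                            skipinitialspace=True)
    for row in reader:
        for key in ("avg_us", "p50_us", "p90_us", "p99_us"):
            row[key] = float(row[key])
        row["size_bytes"] = int(row["size_bytes"])
        row["n"] = int(row["n"])
        rows.append(row)
    return rows


rows = load_rows(sample_csv)
# Compare p50 latency per platform for the 8-byte bracket.
by_platform = {r["platform"]: r["p50_us"] for r in rows if r["size_bytes"] == 8}
print(by_platform)
```

To compare real runs, read each time.csv or timer.csv file and pass its contents to `load_rows`.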

About

Benchmarking tool to evaluate dora-rs communication speed in different scenarios ⚡🚀💨🏃
