
thanos/ex_arrow


ExArrow


Native Apache Arrow for the BEAM: IPC streaming, Arrow Flight, and ADBC database bindings. Column data lives in Rust buffers; Elixir holds lightweight opaque handles. Precompiled NIFs for Linux, macOS, and Windows — no Rust required to use.




Why ExArrow was built

The Arrow ecosystem has become the de-facto interchange standard for columnar data. Python, R, Rust, Java, Go, and C++ all speak Arrow natively. Data warehouses, query engines, stream processors, ML frameworks, and databases expose Arrow Flight endpoints or ADBC interfaces. The BEAM had no first-class way to participate in this ecosystem.

ExArrow was written to fill that gap. It gives Elixir and Erlang applications the same low-level, zero-copy Arrow primitives that the rest of the ecosystem already takes for granted — without requiring callers to understand NIF memory management, dirty schedulers, or the Arrow C Data Interface.

The design goal is intentionally narrow: be the reliable Arrow transport and interchange layer for the BEAM, and let other libraries (Explorer, Nx, etc.) do the analysis on top of it.


What it brings to the Elixir Ecosystem

Prior to ExArrow, an Elixir application that needed to exchange data with a Flight server, query a database via ADBC, or read/write an Arrow IPC file had three options: shell out to Python, implement the protocol manually in Elixir (row-by-row, with all the copying that entails), or simply not do it.

ExArrow adds:

  • IPC reading and writing — Arrow stream and file formats, from binary or a file path, in both directions. Read a file produced by PyArrow, DuckDB, or Pandas; write a file for the same consumers. No format conversion needed.
  • Arrow Flight client and server — Connect to Dremio, InfluxDB IOx, Snowflake Flight endpoints, or any custom Flight service. Run an in-process echo server for testing. Transfer Arrow streams over gRPC with one API call.
  • ADBC database connectivity — Execute SQL against any ADBC-compatible database (SQLite, PostgreSQL, DuckDB, BigQuery, Snowflake, and more) and receive the results as a lazy Arrow stream — never materialising rows into BEAM terms unless you ask for them.
  • Zero-copy streaming — Column buffers are allocated once in Rust and held there until consumed. The BEAM scheduler is never stalled on large copies. Dirty NIF schedulers are used for blocking I/O.
  • A uniform stream abstraction — ExArrow.Stream works identically for IPC, Flight, and ADBC results. Code that processes batches does not know or care where the data came from.
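Because every source yields the same stream handle, batch-processing code can be written once against a single pull function. A minimal stdlib-only sketch of that pattern, where `next_fun` stands in for `fn -> ExArrow.Stream.next(stream) end` (hypothetical wiring) so the example runs without the NIF:

```elixir
defmodule BatchReduce do
  # Fold over batches pulled from any source (IPC, Flight, or ADBC).
  # `next_fun` is a zero-arity function returning a batch, nil at end
  # of stream, or an {:error, reason} tuple -- the contract that
  # ExArrow.Stream.next/1 follows in the examples below.
  def reduce(next_fun, acc, fun) do
    case next_fun.() do
      nil -> {:ok, acc}
      {:error, _reason} = err -> err
      batch -> reduce(next_fun, fun.(batch, acc), fun)
    end
  end
end
```

The same reduce/3 call then works whether the handle came from ExArrow.IPC.Reader, ExArrow.Flight.Client.do_get/2, or ExArrow.ADBC.Statement.execute/1.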

How ExArrow differs from Explorer, Nx, ADBC, and ExZarr

These libraries are complementary, not competing. Each has a distinct role.

Library Role Overlap with ExArrow
Explorer In-memory dataframe analysis (filter, group, sort, plot). Backed by Polars/Arrow internally. Explorer can load/dump Arrow IPC streams. ExArrow is the transport; Explorer is the analysis layer.
Nx Numerical computing and tensor operations (multi-dimensional arrays, GPU support, ML). Nx tensors and Arrow columns are both typed flat arrays. There is currently no direct bridge, but ExArrow IPC can produce data for downstream tensor conversion.
adbc (livebook-dev) Elixir wrapper around the ADBC C library for driver management — downloading and configuring drivers. ExArrow uses adbc optionally for driver download; adbc's core purpose is driver lifecycle, not Arrow streaming or Flight.
ExZarr Read/write Zarr v2/v3 chunked array format (used in climate science, genomics, cloud-native ND arrays). Zarr and Arrow are complementary storage formats. ExZarr addresses ND chunk storage; ExArrow addresses columnar interchange and network transport.

In short: ExArrow is a transport and interchange library. It moves Arrow data between processes, databases, services, and files as efficiently as possible. It does not analyse, transform, or visualise data — that is the job of Explorer, Nx, or your own application logic.


Where ExArrow fits

flowchart TB
    App("Your Elixir Application")

    App --> Explorer("Explorer\ndataframes & analysis")
    App --> Nx("Nx\ntensors & ML")
    App --> ExArrow("ExArrow\nIPC · Flight · ADBC")
    App --> ExZarr("ExZarr\nZarr chunked arrays")

    ExArrow --> IPC("Arrow IPC\nstream & file")
    ExArrow --> FlightSvr("Arrow Flight\ngRPC server")
    ExArrow --> ADBCDrv("ADBC\ndriver")

    IPC -. "interop via IPC binary" .-> Explorer

    FlightSvr --> FlightSvcs("Dremio · InfluxDB IOx\nDuckDB · Snowflake")
    ADBCDrv   --> Databases("PostgreSQL · SQLite\nDuckDB · BigQuery")

    classDef app      fill:#1a1a2e,stroke:#4a90d9,color:#e0e0e0,rx:6
    classDef lib      fill:#16213e,stroke:#4a90d9,color:#e0e0e0,rx:6
    classDef proto    fill:#0f3460,stroke:#4a90d9,color:#e0e0e0,rx:6
    classDef external fill:#1a1a2e,stroke:#888,color:#aaa,rx:6,stroke-dasharray:4 4

    class App app
    class Explorer,Nx,ExArrow,ExZarr lib
    class IPC,FlightSvr,ADBCDrv proto
    class FlightSvcs,Databases external

ExArrow sits at the boundary between the BEAM and the Arrow ecosystem. It speaks the protocols that data infrastructure uses — IPC, Flight, ADBC — and surfaces them as idiomatic Elixir APIs. Explorer and Nx sit above it and consume the data it delivers.


What this enables

  • Elixir as a data pipeline node. Read Arrow IPC from Kafka, HTTP, or a socket; apply lightweight routing or filtering; forward via Flight or write to file — without ever copying column data into BEAM terms.
  • Zero-copy query results. Run SQL against PostgreSQL, DuckDB, SQLite, or BigQuery via ADBC. The result stream is backed by native Arrow buffers. A 100-million-row result set uses minimal BEAM heap regardless of size.
  • Interop with the Python/R data world. Read files produced by PyArrow, Pandas, or Polars. Write files that DuckDB, R's arrow package, or any Arrow consumer can read. No CSV conversion, no schema translation.
  • First-class Flight client. Connect to Dremio, InfluxDB IOx, or any service that exposes an Arrow Flight endpoint. List flights, fetch schemas, stream data, or call custom actions — from a Phoenix controller, a GenServer, or a Livebook cell.
  • Benchmarked, observable performance. The included Benchee suite quantifies the zero-copy advantage and publishes results per commit at thanos.github.io/ex_arrow/dev/bench.

Requirements

  • Elixir ~> 1.14 (OTP 25 / NIF 2.15 and OTP 26+ / NIF 2.16)

Installation

Add the dependency:

def deps do
  [{:ex_arrow, "~> 0.1.0"}]
end

Using precompiled NIFs (default)

After mix deps.get and mix compile, ExArrow downloads a prebuilt NIF for your platform from the project's GitHub releases. No Rust or C toolchain is required. Supported platforms: Linux x86_64/aarch64, macOS x86_64/arm64, Windows x86_64.

Building from source

If no precompiled NIF exists for your platform, or you are developing ExArrow itself, set EX_ARROW_BUILD=1 and have Rust installed:

EX_ARROW_BUILD=1 mix deps.get
EX_ARROW_BUILD=1 mix compile

The optional dependency {:rustler, "~> 0.32.0", optional: true} is required for source builds and is already listed in ExArrow's own mix.exs.

For path dependencies (e.g. Livebook or Mix.install), add rustler explicitly and have Rust available:

Mix.install([
  {:ex_arrow, path: "/path/to/ex_arrow"},
  {:rustler, "~> 0.37.3", optional: true}
])

Alternatively, use the published Hex package so the precompiled NIF is used and no Rust is needed: Mix.install([{:ex_arrow, "~> 0.1.0"}]).


Quick start

Read an Arrow IPC stream:

{:ok, stream} = ExArrow.IPC.Reader.from_file("/path/to/data.arrow")
{:ok, schema} = ExArrow.Stream.schema(stream)
fields = ExArrow.Schema.fields(schema)

case ExArrow.Stream.next(stream) do
  %ExArrow.RecordBatch{} = batch -> IO.inspect(ExArrow.RecordBatch.num_rows(batch))
  nil -> :done
  {:error, msg} -> IO.puts("Error: #{msg}")
end

Connect to an Arrow Flight server:

{:ok, client} = ExArrow.Flight.Client.connect("localhost", 9999, [])
{:ok, stream} = ExArrow.Flight.Client.do_get(client, "my_ticket")
{:ok, schema} = ExArrow.Stream.schema(stream)
batch = ExArrow.Stream.next(stream)

Query a database with ADBC:

{:ok, db} = ExArrow.ADBC.Database.open(driver_name: "adbc_driver_sqlite", uri: ":memory:")
{:ok, conn} = ExArrow.ADBC.Connection.open(db)
{:ok, stmt} = ExArrow.ADBC.Statement.new(conn, "SELECT 1 AS n")
{:ok, stream} = ExArrow.ADBC.Statement.execute(stmt)
{:ok, schema} = ExArrow.Stream.schema(stream)
batch = ExArrow.Stream.next(stream)

Livebook tutorials

Interactive notebooks (open in Livebook):

  • Quick start — IPC, Flight, and ADBC in one notebook.
  • 01 IPC — Stream vs file format, read/write, schema, Explorer interop.
  • 02 Flight — Echo server, client, list_flights, get_schema, actions.
  • 03 ADBC — Database, Connection, Statement, Stream, metadata APIs.

See livebook/README.md for run instructions.


IPC: stream and file

Stream (sequential) — from binary or file path:

{:ok, stream} = ExArrow.IPC.Reader.from_binary(ipc_bytes)
{:ok, stream} = ExArrow.IPC.Reader.from_file("/data/events.arrow")

{:ok, schema} = ExArrow.Stream.schema(stream)

Stream.repeatedly(fn -> ExArrow.Stream.next(stream) end)
|> Enum.take_while(&(&1 != nil and not match?({:error, _}, &1)))
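The repeatedly-plus-take_while drain above can also be expressed with Stream.unfold, which halts cleanly on either terminator and yields a lazy Enumerable. A stdlib-only sketch, where `next_fun` again stands in for `fn -> ExArrow.Stream.next(stream) end` (hypothetical wiring):

```elixir
defmodule BatchDrain do
  # Wrap a "pull next" function as a lazy Enumerable that stops on
  # nil (end of stream) or an {:error, _} tuple, so consumers can use
  # ordinary Enum/Stream functions on the batches.
  def to_lazy(next_fun) do
    Stream.unfold(:ok, fn :ok ->
      case next_fun.() do
        nil -> nil
        {:error, _reason} -> nil
        batch -> {batch, :ok}
      end
    end)
  end
end

# In real use (requires the NIF):
# batches = BatchDrain.to_lazy(fn -> ExArrow.Stream.next(stream) end) |> Enum.to_list()
```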

Write to binary or file:

{:ok, binary} = ExArrow.IPC.Writer.to_binary(schema, batches)
:ok = ExArrow.IPC.Writer.to_file("/out/result.arrow", schema, batches)

File format (random access):

{:ok, file} = ExArrow.IPC.File.from_file("/data/large.arrow")
{:ok, schema} = ExArrow.IPC.File.schema(file)
n = ExArrow.IPC.File.batch_count(file)
{:ok, batch} = ExArrow.IPC.File.get_batch(file, 0)

Arrow Flight: client and server

Start the built-in echo server:

{:ok, server} = ExArrow.Flight.Server.start_link(9999)
{:ok, port} = ExArrow.Flight.Server.port(server)
:ok = ExArrow.Flight.Server.stop(server)

Transfer data:

{:ok, client} = ExArrow.Flight.Client.connect("localhost", 9999, [])

:ok = ExArrow.Flight.Client.do_put(client, schema, [batch1, batch2])

{:ok, stream} = ExArrow.Flight.Client.do_get(client, "echo")
batch = ExArrow.Stream.next(stream)

Metadata:

{:ok, flights} = ExArrow.Flight.Client.list_flights(client, <<>>)
{:ok, info}    = ExArrow.Flight.Client.get_flight_info(client, {:cmd, "echo"})
{:ok, schema}  = ExArrow.Flight.Client.get_schema(client, {:cmd, "echo"})
{:ok, actions} = ExArrow.Flight.Client.list_actions(client)
{:ok, ["pong"]} = ExArrow.Flight.Client.do_action(client, "ping", <<>>)

Flight is plaintext only in this release. Products that speak Arrow Flight include Dremio, InfluxDB IOx, and custom analytics servers.


ADBC: database to Arrow streams

SQLite in-memory:

{:ok, db}   = ExArrow.ADBC.Database.open(driver_name: "adbc_driver_sqlite", uri: ":memory:")
{:ok, conn} = ExArrow.ADBC.Connection.open(db)
{:ok, stmt} = ExArrow.ADBC.Statement.new(conn, "SELECT 1 AS n, 'hello' AS s")
{:ok, stream} = ExArrow.ADBC.Statement.execute(stmt)
batch = ExArrow.Stream.next(stream)

PostgreSQL:

{:ok, db} = ExArrow.ADBC.Database.open(
  driver_name: "adbc_driver_postgresql",
  uri: "postgresql://user:pass@localhost:5432/mydb"
)
{:ok, conn}   = ExArrow.ADBC.Connection.open(db)
{:ok, stmt}   = ExArrow.ADBC.Statement.new(conn, "SELECT id, name FROM users")
{:ok, stream} = ExArrow.ADBC.Statement.execute(stmt)

Metadata:

{:ok, types_stream} = ExArrow.ADBC.Connection.get_table_types(conn)
{:ok, schema}       = ExArrow.ADBC.Connection.get_table_schema(conn, nil, nil, "users")
{:ok, objs_stream}  = ExArrow.ADBC.Connection.get_objects(conn, depth: "tables")

Optional driver download via the adbc package:

# Add {:adbc, "~> 0.7"} to deps, then:
Adbc.download_driver!(:sqlite)
{:ok, db} = ExArrow.ADBC.Database.open(driver_name: "adbc_driver_sqlite", uri: ":memory:")

Or use the convenience helper which calls Adbc.download_driver!/1 when the package is available: ExArrow.ADBC.DriverHelper.ensure_driver_and_open/2.


Using ExArrow with Explorer

Explorer handles in-memory analysis. ExArrow handles streaming and transport. They connect via Arrow IPC.

ExArrow to Explorer:

{:ok, stream} = ExArrow.IPC.Reader.from_file("/data/source.arrow")
{:ok, schema} = ExArrow.Stream.schema(stream)
batches =
  Stream.repeatedly(fn -> ExArrow.Stream.next(stream) end)
  |> Enum.take_while(fn nil -> false; {:error, _} -> false; _ -> true end)
{:ok, binary} = ExArrow.IPC.Writer.to_binary(schema, batches)
df = Explorer.DataFrame.load_ipc_stream!(binary)

Explorer to ExArrow:

df = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
binary = Explorer.DataFrame.dump_ipc_stream!(df)
{:ok, stream} = ExArrow.IPC.Reader.from_binary(binary)
batch = ExArrow.Stream.next(stream)

Use case examples

Ingest IPC from HTTP or Kafka and write to file

ipc_bytes = get_arrow_stream_from_http_or_kafka()
{:ok, stream} = ExArrow.IPC.Reader.from_binary(ipc_bytes)
{:ok, schema} = ExArrow.Stream.schema(stream)
batches =
  Stream.repeatedly(fn -> ExArrow.Stream.next(stream) end)
  |> Enum.take_while(fn nil -> false; {:error, _} -> false; _ -> true end)
:ok = ExArrow.IPC.Writer.to_file("/data/ingested.arrow", schema, batches)

Query a database and forward via Flight

{:ok, db}     = ExArrow.ADBC.Database.open(driver_name: "adbc_driver_sqlite", uri: "file:report.db")
{:ok, conn}   = ExArrow.ADBC.Connection.open(db)
{:ok, stmt}   = ExArrow.ADBC.Statement.new(conn, "SELECT * FROM sales WHERE year = 2024")
{:ok, stream} = ExArrow.ADBC.Statement.execute(stmt)
{:ok, schema} = ExArrow.Stream.schema(stream)
batches =
  Stream.repeatedly(fn -> ExArrow.Stream.next(stream) end)
  |> Enum.take_while(fn nil -> false; {:error, _} -> false; _ -> true end)

{:ok, client} = ExArrow.Flight.Client.connect("flight.example.com", 32010, [])
:ok = ExArrow.Flight.Client.do_put(client, schema, batches)

Connect to Dremio, InfluxDB IOx, or a custom Flight service

{:ok, client}  = ExArrow.Flight.Client.connect("dremio.example.com", 32010, connect_timeout_ms: 5_000)
{:ok, flights} = ExArrow.Flight.Client.list_flights(client, <<>>)
{:ok, stream}  = ExArrow.Flight.Client.do_get(client, ticket_from_service)
batch = ExArrow.Stream.next(stream)

Interchange with Python or R

# Read a file written by PyArrow or Pandas
{:ok, file}   = ExArrow.IPC.File.from_file("/data/from_python.arrow")
{:ok, schema} = ExArrow.IPC.File.schema(file)
n = ExArrow.IPC.File.batch_count(file)
for i <- 0..(n - 1) do
  {:ok, batch} = ExArrow.IPC.File.get_batch(file, i)
  # process batch
end

# Write for Python, R, or DuckDB
:ok = ExArrow.IPC.Writer.to_file("/data/for_python.arrow", schema, batches)

End-to-end: ADBC to Flight

{:ok, db}     = ExArrow.ADBC.Database.open(driver_name: "adbc_driver_postgresql",
                  uri: "postgresql://localhost/mydb")
{:ok, conn}   = ExArrow.ADBC.Connection.open(db)
{:ok, stmt}   = ExArrow.ADBC.Statement.new(conn, "SELECT * FROM sensor_readings")
{:ok, stream} = ExArrow.ADBC.Statement.execute(stmt)
{:ok, schema} = ExArrow.Stream.schema(stream)
batches =
  Stream.repeatedly(fn -> ExArrow.Stream.next(stream) end)
  |> Enum.take_while(fn nil -> false; {:error, _} -> false; _ -> true end)

{:ok, client} = ExArrow.Flight.Client.connect("flight.internal", 32010, [])
:ok = ExArrow.Flight.Client.do_put(client, schema, batches)

Benchmarks

ExArrow ships a Benchee-based benchmark suite in bench/ that quantifies the zero-copy streaming advantage over row-oriented alternatives.

Running locally

Benchee is a :dev-only dependency; MIX_ENV=dev is required.

MIX_ENV=dev mix run bench/ipc_read_bench.exs   # single suite
MIX_ENV=dev mix run bench/run_all.exs           # all suites
MIX_ENV=dev mix bench                           # convenience alias

HTML reports are written to bench/output/ (gitignored).

Suites

File What it measures
ipc_read_bench.exs Stream handle vs materialise — BEAM memory saved by keeping data native
ipc_write_bench.exs IPC serialisation vs :erlang.term_to_binary — columnar vs row-oriented write
flight_bench.exs Flight do_put / do_get / roundtrip latency with in-process server
adbc_bench.exs Stream handle vs schema peek vs full collect
pipeline_bench.exs End-to-end: IPC file on disk to Flight do_put without materialising in BEAM

Published results

Results from every push to main are published at: https://thanos.github.io/ex_arrow/dev/bench/

The CI workflow posts a PR alert comment when any scenario regresses more than 20% relative to the previous baseline.


Documentation

API reference: mix docs or hexdocs.pm/ex_arrow.


Development

mix deps.get
EX_ARROW_BUILD=1 mix compile    # build NIF from source
mix test                         # exclude :adbc / :adbc_package tags if no drivers installed
mix docs                         # generate ExDoc
MIX_ENV=dev mix bench            # run benchmark suite

Local CI script (runs format, credo, dialyzer, tests, coverage, docs):

script/ci

Roadmap

The items below represent the planned direction for ExArrow. Contributions are welcome for any of them.

Near-term (v0.2)

  • TLS for Arrow Flight — encrypted connections for non-loopback Flight endpoints (mTLS and system CA store).
  • Flight server routing — configurable ticket-to-dataset mapping so the built-in server can serve multiple named datasets, not just the last upload.
  • Larger test matrix — integration tests against PostgreSQL, DuckDB, and BigQuery ADBC drivers in CI.
  • ADBC connection pooling — first-class NimblePool-backed pool exposed through the public API.

Medium-term (v0.3)

  • Arrow compute kernels — thin NIF bindings to arrow-compute for filter/project/sort on native buffers without materialising into BEAM.
  • Parquet support — read and write Parquet files via the Arrow Rust parquet crate; complement Explorer's Parquet support with a streaming API.
  • Explorer bridge module — ExArrow.Explorer for direct conversion between ExArrow.Stream / ExArrow.RecordBatch and Explorer.DataFrame without the IPC round-trip.
  • Nx bridge module — ExArrow.Nx for converting a record batch column into an Nx.Tensor without copying through BEAM binary.

Longer-term

  • Flight SQL — the Flight SQL protocol for databases that expose it (DuckDB, CockroachDB, Dremio).
  • Streaming writes to Parquet and Delta Lake — sink for data pipeline nodes.
  • OTel / telemetry integration — :telemetry events for IPC read/write throughput, Flight request latency, and ADBC query duration.
  • Windows aarch64 precompiled NIF — once GitHub-hosted Windows arm64 runners are generally available.

FAQ

When should I use ExArrow? Use ExArrow when you need to read or write Arrow IPC (stream or file), connect to an Arrow Flight server (Dremio, InfluxDB IOx, custom), or run SQL via ADBC and receive Arrow result streams. Good fit for data pipelines, ETL, and interchange with systems that already speak Arrow.

When should I not use ExArrow? Do not use it as a dataframe or query engine. For in-memory analysis, filtering, grouping, and plotting, use Explorer. Do not use it as a replacement for Ecto when you only need normal SQL results. For Parquet-only workflows with no Flight/ADBC, consider Explorer's Parquet support first.

Can I use ExArrow and Explorer together? Yes. ExArrow handles transport and protocol layers. Use ExArrow.IPC.Writer.to_binary/2 to produce IPC, then Explorer.DataFrame.load_ipc_stream!/1 to load it. In the other direction, Explorer.DataFrame.dump_ipc_stream!/1 produces bytes that ExArrow.IPC.Reader.from_binary/1 can read.

Why do I get a 404 or "couldn't fetch NIF" on compile? Precompiled NIFs are hosted on GitHub releases. If you are on an unsupported platform or an unreleased version, the download fails. Set EX_ARROW_BUILD=1, install Rust, and run mix compile to build from source.

Is Arrow Flight over TLS supported? Not yet. Flight in this release is plaintext only. Use on localhost or trusted networks. TLS is on the roadmap for v0.2.

Which ADBC drivers are supported? Any ADBC driver that provides a shared library — for example adbc_driver_sqlite, adbc_driver_postgresql, or the DuckDB ADBC driver. You must install the driver and pass its path, or ensure the driver manager can find it. Metadata and binding support depend on the individual driver.


License

MIT. See LICENSE for details. Copyright (c) 2025 Thanos Vassilakis.
