-
Notifications
You must be signed in to change notification settings - Fork 70
Description
Overview
The addition of Tensor support is valuable for positioning Arrow.jl as a great library for machine learning and scientific computing. These multi-dimensional data structures are currently unsupported but are defined within the Arrow format specification. The Arrow specification provides distinct formats for dense and sparse multi-dimensional arrays, this issue are focusing on sparse multi-dimensional arrays.
Sparse Tensors
Sparse tensors provide an efficient storage format for data where a majority of the elements are zero. The Arrow format defines a SparseTensor message type in its FlatBuffers schema, which supports several index formats to represent the locations of the non-zero values.
- Coordinate (COO): This is the most straightforward sparse format. It explicitly stores the coordinates of every non-zero element.
- Storage: It uses two primary buffers:
- data: A contiguous buffer containing the non-zero values themselves.
- indicesBuffer: A buffer representing an N×M matrix, where N is the number of non-zero values and M is the number of dimensions of the tensor. Each row of this matrix is a tuple of coordinates for a corresponding value in the data buffer.
- Use Case: COO is best when you're adding data points one by one, or when your data is scattered randomly without a clear pattern.
- Storage: It uses two primary buffers:
- Compressed Sparse Matrix (CSX): This format is specialized for 2D tensors (matrices) and includes Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC) variants. It achieves compression by not repeating row (for CSR) or column (for CSC) indices.
- Storage: It uses three buffers:
- data: A buffer of the non-zero values.
- indicesBuffer: A buffer of indices for the uncompressed dimension (e.g., column indices for CSR).
- indptrBuffer (Index Pointer Buffer): A buffer of length (compressed_dimension_size + 1). It stores offsets into the indicesBuffer and data buffer. The non-zero elements for row i are located between indptr[i] and indptr[i+1].
- Use Case: CSX for matrix operations like matrix-vector multiplication.
- Storage: It uses three buffers:
- Compressed Sparse Fiber (CSF): This is the most general and advanced format, extending the compression scheme of CSR/CSC to tensors of arbitrary dimension. It recursively compresses the tensor dimension by dimension.
- Storage: Its structure is more complex and hierarchical:
- data: A buffer of the non-zero values.
- indicesBuffers: An array of buffers, one for each dimension of the tensor, storing the coordinate values at each level of the hierarchy.
- indptrBuffer (Index Pointer Buffer): An array of buffers, one for each level of the hierarchy (except the last), storing pointers that define the relationship between a coordinate in one dimension and its children coordinates in the next dimension.
- Use Case: CSF provides excellent compression and performance for a wide range of tensor algebra operations on highly structured sparse data but is the most complex to implement. https://stephenchou.net/files/chou-18-sm-thesis.pdf
- Storage: Its structure is more complex and hierarchical:
Proposed Design
New struct types could be defined that act as zero-copy views over the underlying Arrow memory buffers.
An abstract supertype could be defined, with concrete implementations for each sparse format.
abstract type AbstractSparseTensor{T, N} <: AbstractArray{T, N} end
struct SparseTensorCOO{T, N} <: AbstractSparseTensor{T, N}
indices::AbstractMatrix{<:Integer} # View over indicesBuffer
data::AbstractVector{T} # View over data buffer
shape::NTuple{N, Int}
end
struct SparseTensorCSX{T} <: AbstractSparseTensor{T, 2}
indptr::AbstractVector{<:Integer} # View over indptrBuffer
indices::AbstractVector{<:Integer} # View over indicesBuffer
data::AbstractVector{T} # View over data buffer
shape::NTuple{2, Int}
compressed_axis::Symbol # :Row or :Column
end
This design allows the DenseTensor to leverage Julia's rich AbstractArray interface for slicing and other operations while ensuring data is never copied from the Arrow buffer. Each of these structs would hold views (e.g., created with unsafe_wrap) into the raw memory buffers provided by the Arrow data, again ensuring a zero-copy representation.