Sparse Tensor Support

### Overview

The addition of Tensor support is valuable for positioning Arrow.jl as a great library for machine learning and scientific computing. These multi-dimensional data structures are currently unsupported but are defined within the Arrow format specification. The Arrow specification provides distinct formats for dense and sparse multi-dimensional arrays, this issue are focusing on sparse multi-dimensional arrays.

### Sparse Tensors

Sparse tensors provide an efficient storage format for data where a majority of the elements are zero. The Arrow format defines a [SparseTensor message type in its FlatBuffers schema](https://github.com/apache/arrow/blob/main/format/SparseTensor.fbs#L209), which supports several index formats to represent the locations of the non-zero values.
* [Coordinate (COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)): This is the most straightforward sparse format. It explicitly stores the coordinates of every non-zero element.
  * Storage: It uses two primary buffers:
    1. data: A contiguous buffer containing the non-zero values themselves.
    2. [indicesBuffer](https://github.com/apache/arrow/blob/main/format/SparseTensor.fbs#L34): A buffer representing an N×M matrix, where N is the number of non-zero values and M is the number of dimensions of the tensor. Each row of this matrix is a tuple of coordinates for a corresponding value in the data buffer.
  * Use Case: COO is best when you're adding data points one by one, or when your data is scattered randomly without a clear pattern.
* [Compressed Sparse Matrix (CSX)](https://www.researchgate.net/publication/221643637_CSX_An_Extended_Compression_Format_for_SpMV_on_Shared_Memory_Systems): This format is specialized for 2D tensors (matrices) and includes [Compressed Sparse Row (CSR)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)) and [Compressed Sparse Column (CSC) ](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS))variants. It achieves compression by not repeating row (for CSR) or column (for CSC) indices.
  * Storage: It uses three buffers:
    1. data: A buffer of the non-zero values.
    2. indicesBuffer: A buffer of indices for the uncompressed dimension (e.g., column indices for CSR).
    3. [indptrBuffer (Index Pointer Buffer)](https://github.com/apache/arrow/blob/main/format/SparseTensor.fbs#L87): A buffer of length (compressed_dimension_size + 1). It stores offsets into the indicesBuffer and data buffer. The non-zero elements for row i are located between indptr[i] and indptr[i+1].
  * Use Case: CSX for matrix operations like matrix-vector multiplication.
* [Compressed Sparse Fiber (CSF)](https://www.boristhebrave.com/2021/01/01/compressed-sparse-fibers-explained/): This is the most general and advanced format, extending the compression scheme of CSR/CSC to tensors of arbitrary dimension. It recursively compresses the tensor dimension by dimension.
   * Storage: Its structure is more complex and hierarchical:
      1. data: A buffer of the non-zero values.
      2. indicesBuffers: An array of buffers, one for each dimension of the tensor, storing the coordinate values at each level of the hierarchy.
      3. [indptrBuffer (Index Pointer Buffer)](https://github.com/apache/arrow/blob/main/format/SparseTensor.fbs#L87): An array of buffers, one for each level of the hierarchy (except the last), storing pointers that define the relationship between a coordinate in one dimension and its children coordinates in the next dimension.
  * Use Case: CSF provides excellent compression and performance for a wide range of tensor algebra operations on highly structured sparse data but is the most complex to implement. https://stephenchou.net/files/chou-18-sm-thesis.pdf

### Proposed Design
New struct types could be defined that act as zero-copy views over the underlying Arrow memory buffers.
An abstract supertype could be defined, with concrete implementations for each sparse format.

```
abstract type AbstractSparseTensor{T, N} <: AbstractArray{T, N} end

struct SparseTensorCOO{T, N} <: AbstractSparseTensor{T, N}
    indices::AbstractMatrix{<:Integer} # View over indicesBuffer
    data::AbstractVector{T}             # View over data buffer
    shape::NTuple{N, Int}
end

struct SparseTensorCSX{T} <: AbstractSparseTensor{T, 2}
    indptr::AbstractVector{<:Integer}   # View over indptrBuffer
    indices::AbstractVector{<:Integer}  # View over indicesBuffer
    data::AbstractVector{T}             # View over data buffer
    shape::NTuple{2, Int}
    compressed_axis::Symbol # :Row or :Column
end
```

This design allows the DenseTensor to leverage Julia's rich AbstractArray interface for slicing and other operations while ensuring data is never copied from the Arrow buffer. Each of these structs would hold views (e.g., created with unsafe_wrap) into the raw memory buffers provided by the Arrow data, again ensuring a zero-copy representation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Sparse Tensor Support #565

Overview

Sparse Tensors

Proposed Design

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Sparse Tensor Support #565

Description

Overview

Sparse Tensors

Proposed Design

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions