[DISCUSSION] Usage of arrow-rs in other areas #9423

@alamb

In any case, at least at my company we probably have a few PiB of data written with this or an even earlier version.

BTW this is so cool to hear

I don't want to spam this PR with too much side info, but I wanted to take this opportunity to share that a coworker and I will give the talk "Scaling Data Processing for Training Workloads at DeepL Research with Rust" at this year's PyCon DE / PyData in Darmstadt (Germany), where we will go into a bit more detail about this!

Working with arrow-rs (+ PyO3 as the Python binding layer) has been an absolute blast so far for building a highly optimized deep learning data ingress pipeline.

Especially compared to pyarrow, we have rarely if ever seen surprisingly high resource usage, memory leaks, or unexpectedly unsupported features. For example, I'm fairly sure that selectively decoding specific rows by row index, to reduce memory usage during sparse decoding, isn't possible in a non-clunky way with pyarrow, whereas with arrow-rs's RowSelection it was trivially easy, even as a feature exposed to Python. Happy to stay connected on this topic.

Originally posted by @jonded94 in #9374 (comment)
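
For readers unfamiliar with the idea behind RowSelection mentioned in the quoted comment: in the parquet crate a selection is, as far as I know, expressed as a sequence of "skip n rows" / "select n rows" runs that is handed to the reader builder (via `with_row_selection`). The sketch below is standard-library-only and does not use the parquet crate; the `Run` enum and `runs_from_indices` function are hypothetical names invented for illustration. It only shows how a sorted list of row indices could be turned into such runs, and it is not the code described in the comment.

```rust
/// One run in a selection: skip or select `n` consecutive rows.
/// Hypothetical stand-in for the skip/select runs a row selection is made of.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// Convert sorted, unique row indices into alternating skip/select runs
/// that together cover rows `0..total_rows`.
fn runs_from_indices(indices: &[usize], total_rows: usize) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut cursor = 0; // first row not yet covered by any run
    let mut i = 0;
    while i < indices.len() {
        let start = indices[i];
        assert!(
            start >= cursor && start < total_rows,
            "indices must be sorted, unique, and in range"
        );
        // Extend the selected run over directly consecutive indices.
        let mut end = start + 1;
        i += 1;
        while i < indices.len() && indices[i] == end {
            end += 1;
            i += 1;
        }
        assert!(end <= total_rows, "indices must be in range");
        if start > cursor {
            runs.push(Run::Skip(start - cursor));
        }
        runs.push(Run::Select(end - start));
        cursor = end;
    }
    // Skip whatever remains after the last selected row.
    if cursor < total_rows {
        runs.push(Run::Skip(total_rows - cursor));
    }
    runs
}

fn main() {
    // Select rows 2, 3 and 7 out of 10.
    let runs = runs_from_indices(&[2, 3, 7], 10);
    println!("{:?}", runs);
    // prints: [Skip(2), Select(2), Skip(3), Select(1), Skip(2)]
}
```

With a representation like this, a reader only needs to materialize the selected runs, which is where the memory savings for sparse decoding would come from.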

Labels: documentation (Improvements or additions to documentation)
