In any case, at least at my company we probably have a few PiB of data written with this or an even earlier version.
BTW, this is so cool to hear!
I don't want to spam this PR with too much side info, but I wanted to use this opportunity to share that a coworker and I will give the talk "Scaling Data Processing for Training Workloads at DeepL Research with Rust" at this year's PyCon DE / PyData in Darmstadt (Germany), where we'll go into a bit of detail about this!
Working with arrow-rs (+ PyO3 as the Python binding layer) has been an absolute blast so far for coming up with a highly optimized and efficient deep learning data ingress pipeline.
Especially compared to pyarrow, we've rarely or never seen issues with surprisingly high resource usage, memory leaks, or unexpectedly unsupported features. (I'm somewhat sure that selectively decoding specific rows by row index, to reduce memory usage during sparse decoding, isn't possible in a non-clunky way with pyarrow; with arrow-rs's RowSelection this was trivially easy, even as a feature exposed to Python.) Happy to stay connected on this topic.
Originally posted by @jonded94 in #9374 (comment)
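For readers unfamiliar with the row-selection idea mentioned above, here is a minimal, dependency-free sketch of the underlying technique: turning a sorted list of row indices into alternating "skip N rows" / "select N rows" runs. To my understanding, the `parquet` crate's `RowSelection` is built from a comparable list of `RowSelector::skip(..)` / `RowSelector::select(..)` entries and handed to the reader builder, but the `Run` enum and `indices_to_runs` function below are hypothetical names made up for illustration. They are not the arrow-rs API and not the pipeline described in the comment.

```rust
/// One run of consecutive rows that is either skipped or decoded.
/// (Hypothetical type for illustration; not part of arrow-rs.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// Convert strictly increasing row indices into skip/select runs that
/// together cover exactly `total_rows` rows.
fn indices_to_runs(indices: &[usize], total_rows: usize) -> Vec<Run> {
    // Assumption: the caller passes sorted, unique, in-range indices.
    assert!(indices.windows(2).all(|w| w[0] < w[1]), "indices must be strictly increasing");
    assert!(indices.iter().all(|&r| r < total_rows), "index out of range");

    let mut runs = Vec::new();
    let mut cursor = 0; // first row not yet covered by any run
    let mut i = 0;
    while i < indices.len() {
        let start = indices[i];
        let mut end = start + 1;
        i += 1;
        // Merge directly adjacent indices into a single select run.
        while i < indices.len() && indices[i] == end {
            end += 1;
            i += 1;
        }
        if start > cursor {
            runs.push(Run::Skip(start - cursor));
        }
        runs.push(Run::Select(end - start));
        cursor = end;
    }
    // Skip whatever remains after the last selected row.
    if cursor < total_rows {
        runs.push(Run::Skip(total_rows - cursor));
    }
    runs
}

fn main() {
    // Decode only rows 1, 2 and 5 out of 8.
    let runs = indices_to_runs(&[1, 2, 5], 8);
    println!("{:?}", runs);
}
```

A reader that understands such runs can avoid materializing the skipped rows at all, which is where the memory savings for sparse decoding come from.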