Consider columnar storage #38

@adsharma

This is about the Hugging Face datasets.

Many of them are compressed CSV/JSON dumps, which cannot be viewed or queried in the Hugging Face UI. Have you considered using the Parquet or DuckDB file formats?

I have some scripts that process the llama3*.zip files to produce Parquet/DuckDB output. They produce an entity -> event -> event graph. I'm not sure about the concepts graph.
