Potential data overlap and incomplete coverage due to random.shuffle in concept_generation.py #42

@hjhhsy120

Description

Hi, I would like to report a potential issue in atlas_rag/kg_construction/concept_generation.py at line 113. The load_data_with_shard function calls random.shuffle when processing multiple shards. When shards run concurrently, this might lead to data overlap and incomplete coverage: each shard shuffles the dataset independently before selecting its subset, so unless every shard produces the same ordering, the subsets are taken from different permutations. Some items may then be processed by more than one shard while others are never processed. I am not sure whether this behavior is by design or whether I have misunderstood the sharding logic. I would appreciate your feedback, and if it is indeed an issue, I hope it can be addressed. Thank you!
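To make the concern concrete, here is a minimal sketch of the pattern as I understand it. It is not the actual code from concept_generation.py: the function names, the contiguous slicing scheme, and the seed values are all assumptions made for illustration. Each simulated worker gets its own RNG to stand in for separate processes with different random states.

```python
import random


def shard_with_independent_shuffle(items, shard_idx, num_shards, rng):
    """Hypothetical stand-in for the reported pattern: shuffle, then slice."""
    data = list(items)
    rng.shuffle(data)  # each worker ends up with its own permutation
    size = len(data) // num_shards
    start = shard_idx * size
    # The last shard takes any remainder.
    end = len(data) if shard_idx == num_shards - 1 else start + size
    return data[start:end]


def shard_with_shared_seed(items, shard_idx, num_shards, seed=0):
    """One possible fix (an assumption, not the project's method):
    every worker shuffles with the same seed, then takes a strided slice."""
    data = list(items)
    random.Random(seed).shuffle(data)  # identical permutation in every worker
    return data[shard_idx::num_shards]


def coverage(shards, total):
    """Count items never processed and items processed more than once."""
    seen = [x for shard in shards for x in shard]
    unique = len(set(seen))
    return {"missing": total - unique, "duplicates": len(seen) - unique}


total = 1000
num_shards = 4
items = range(total)

# Simulate separate processes by seeding each worker's RNG differently.
independent = [
    shard_with_independent_shuffle(items, i, num_shards, random.Random(100 + i))
    for i in range(num_shards)
]
shared = [shard_with_shared_seed(items, i, num_shards) for i in range(num_shards)]

independent_report = coverage(independent, total)
shared_report = coverage(shared, total)
print("independent shuffle:", independent_report)
print("shared seed:", shared_report)
```

In this sketch the independent version should report nonzero missing and duplicate counts, while the shared-seed version covers every item exactly once. If the shuffle is not needed at all, dropping it and slicing the unshuffled data would avoid the problem as well.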
