Skip to content

Datasets core#900

Open
ThomSerg wants to merge 181 commits into
masterfrom
dataset_core
Open

Datasets core#900
ThomSerg wants to merge 181 commits into
masterfrom
dataset_core

Conversation

@ThomSerg

@ThomSerg ThomSerg commented Apr 1, 2026

Copy link
Copy Markdown
Collaborator

A first PR in a larger sequence of upcoming ones, bringing the work on datasets, IO and benchmarks from branch benchmark_datasets into master. Tried to keep it as minimal as possible, also with just a single dataset implementation; XCSP3Dataset. In some places you will notice some placeholders / things that don't seem needed right now but that future PRs will use to build upon. Some of these are labelled with "TODO".

@ThomSerg ThomSerg requested a review from OrestisLomis April 1, 2026 14:00

@OrestisLomis OrestisLomis left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is really nice! Some questions about the class hierarchy remain for me, please have a look. Few more small comments as well

Comment thread cpmpy/tools/datasets/core.py Outdated
}


class IndexedDataset(Dataset):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this functionality not directly in Dataset? I do not really see what kind of data would not be implementable in an indexable way. Is it for instance generator datasets?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any dataset in which the instances must be consumed in order (where random access is not possible). This can for example happen when instances are generated on the spot and this generation process is sequence dependent, e.g. it uses a initial random seed and then goes on from there. Or maybe a very large dataset that is for some reason in one very big file, where reading the entire file into memory is not efficient and it is better to just stream instance by instance. I do agree that for now we don't have a usecase for it, but I wanted to prevent that in the future we or any other contributor couldn't add one of these alternative dataset types due to us hard-coding indexability into the base class. With a very generic and limited base class, these future additions are still a possibility.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After internal discussion, I removed the intermediate IndexedDataset. All datasets now inherit from Dataset, which requires implementation for indexability and iterability. Datasets that are only one of the two are more exception than the norm, so one idea is to just keep the code simple. Though this will cause issues for those datasets (streaming datasets, generator-based datasets), since they will have to through an exception that random access is not allowed. And it's not that nice of a design that the user has to take into account that index-based access could throw an exception for some datasets. In my opinion the previous code for keeping indexability out of the base dataset class is not a lot, but I can also see the other point of view in terms of complexity and maintainability.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that forcing someone who wants to implement a Dataset to throw errors for abstract functions he doesn't need is maybe not the cleanest design, however I would argue it is also not necessary. Both the cases you mentioned can naturally be iterated over, the indexing is perhaps a little trickier and maybe not so clean, but still possible by naively just starting from the startpoint every time you want to index an instance. This can be improved with hashing/caching perhaps, but up to the implementor.

Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/xcsp3.py
@ThomSerg ThomSerg requested a review from OrestisLomis April 17, 2026 08:51

@OrestisLomis OrestisLomis left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is alright as is now. See comment for more detailed discussion.

Comment thread cpmpy/tools/datasets/core.py Outdated
}


class IndexedDataset(Dataset):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that forcing someone who wants to implement a Dataset to throw errors for abstract functions he doesn't need is maybe not the cleanest design, however I would argue it is also not necessary. Both the cases you mentioned can naturally be iterated over, the indexing is perhaps a little trickier and maybe not so clean, but still possible by naively just starting from the startpoint every time you want to index an instance. This can be improved with hashing/caching perhaps, but up to the implementor.

@tias tias left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great!

also nicely extensive test-suite.

Some questions/remarks where I think it can be a bit cleaner/simpler, but you know better what is still coming so responses very welcome

import cpmpy as cp


def _format_bytes(bytes_num):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you import typing I kind-of expect typed functions... e.g. this is float or int?

bytes_num /= 1024.0


class classproperty:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems like sugar coating... do we care?

things like this can stop us from using mypyc in the future, while file parsing/loading is something where C code can really shine...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(read other comments lower first)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning for this was to be able to define "abstract" on class fields, so that when a user is developing a new dataset they get immediate feedback when they forgot to overwrite one of these fields. With that decorator it acts just like an abstractmethod. But I do also fear that this will cause issues with mypyc. An alternative would be to just use normal fields with placeholder values that should be replaced (or leave empty) and then use the __init_subclass__ method to check, before construction, that the user did so.

class Dataset(ABC):
    name: ClassVar[str]
    description: ClassVar[str]
    homepage: ClassVar[str]
    citation: ClassVar[List[str]] = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls is Dataset:
            return
        for attr in ("name", "description", "homepage"):
            if attr not in cls.__dict__:
                raise TypeError(f"{cls.__name__} must define class attribute {attr!r}")

Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py
fp = futures[future]
print(f"Error collecting metadata for {fp.name}: {e}")

def _collect_one_metadata(self, file_path):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

euh... we have 'instance_metadata()', why do we need this? it looks duplicate... (its also untyped)

Comment thread cpmpy/tools/datasets/core.py
class FromFilesDataset(FileDataset):
# Plain class attributes so that dataset_metadata() (a classmethod
# that reads cls.name / cls.description / ...) works correctly.
name = ""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the better, cleaner way... just plain class attributes, should be like that in the superclass too?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if its required, you can check in the constructor that it should be non-empty...

Comment thread cpmpy/tools/datasets/core.py
@tias tias added this to the v0.20 milestone Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants