Datasets core#900
Conversation
OrestisLomis
left a comment
There was a problem hiding this comment.
I think it is really nice! Some questions about the class hierarchy remain for me, please have a look. Few more small comments as well
| } | ||
|
|
||
|
|
||
| class IndexedDataset(Dataset): |
There was a problem hiding this comment.
Why is this functionality not directly in Dataset? I do not really see what kind of data would not be implementable in an indexable way. Is it for instance generator datasets?
There was a problem hiding this comment.
Any dataset in which the instances must be consumed in order (where random access is not possible). This can for example happen when instances are generated on the spot and this generation process is sequence dependent, e.g. it uses a initial random seed and then goes on from there. Or maybe a very large dataset that is for some reason in one very big file, where reading the entire file into memory is not efficient and it is better to just stream instance by instance. I do agree that for now we don't have a usecase for it, but I wanted to prevent that in the future we or any other contributor couldn't add one of these alternative dataset types due to us hard-coding indexability into the base class. With a very generic and limited base class, these future additions are still a possibility.
There was a problem hiding this comment.
After internal discussion, I removed the intermediate IndexedDataset. All datasets now inherit from Dataset, which requires implementation for indexability and iterability. Datasets that are only one of the two are more exception than the norm, so one idea is to just keep the code simple. Though this will cause issues for those datasets (streaming datasets, generator-based datasets), since they will have to through an exception that random access is not allowed. And it's not that nice of a design that the user has to take into account that index-based access could throw an exception for some datasets. In my opinion the previous code for keeping indexability out of the base dataset class is not a lot, but I can also see the other point of view in terms of complexity and maintainability.
There was a problem hiding this comment.
I do agree that forcing someone who wants to implement a Dataset to throw errors for abstract functions he doesn't need is maybe not the cleanest design, however I would argue it is also not necessary. Both the cases you mentioned can naturally be iterated over, the indexing is perhaps a little trickier and maybe not so clean, but still possible by naively just starting from the startpoint every time you want to index an instance. This can be improved with hashing/caching perhaps, but up to the implementor.
OrestisLomis
left a comment
There was a problem hiding this comment.
I think it is alright as is now. See comment for more detailed discussion.
| } | ||
|
|
||
|
|
||
| class IndexedDataset(Dataset): |
There was a problem hiding this comment.
I do agree that forcing someone who wants to implement a Dataset to throw errors for abstract functions he doesn't need is maybe not the cleanest design, however I would argue it is also not necessary. Both the cases you mentioned can naturally be iterated over, the indexing is perhaps a little trickier and maybe not so clean, but still possible by naively just starting from the startpoint every time you want to index an instance. This can be improved with hashing/caching perhaps, but up to the implementor.
tias
left a comment
There was a problem hiding this comment.
great!
also nicely extensive test-suite.
Some questions/remarks where I think it can be a bit cleaner/simpler, but you know better what is still coming so responses very welcome
| import cpmpy as cp | ||
|
|
||
|
|
||
| def _format_bytes(bytes_num): |
There was a problem hiding this comment.
if you import typing I kind-of expect typed functions... e.g. this is float or int?
| bytes_num /= 1024.0 | ||
|
|
||
|
|
||
| class classproperty: |
There was a problem hiding this comment.
that seems like sugar coating... do we care?
things like this can stop us from using mypyc in the future, while file parsing/loading is something where C code can really shine...
There was a problem hiding this comment.
(read other comments lower first)
There was a problem hiding this comment.
My reasoning for this was to be able to define "abstract" on class fields, so that when a user is developing a new dataset they get immediate feedback when they forgot to overwrite one of these fields. With that decorator it acts just like an abstractmethod. But I do also fear that this will cause issues with mypyc. An alternative would be to just use normal fields with placeholder values that should be replaced (or leave empty) and then use the __init_subclass__ method to check, before construction, that the user did so.
class Dataset(ABC):
name: ClassVar[str]
description: ClassVar[str]
homepage: ClassVar[str]
citation: ClassVar[List[str]] = []
def __init_subclass__(cls, **kwargs):
super().__init_subclass__(**kwargs)
if cls is Dataset:
return
for attr in ("name", "description", "homepage"):
if attr not in cls.__dict__:
raise TypeError(f"{cls.__name__} must define class attribute {attr!r}")| fp = futures[future] | ||
| print(f"Error collecting metadata for {fp.name}: {e}") | ||
|
|
||
| def _collect_one_metadata(self, file_path): |
There was a problem hiding this comment.
euh... we have 'instance_metadata()', why do we need this? it looks duplicate... (its also untyped)
| class FromFilesDataset(FileDataset): | ||
| # Plain class attributes so that dataset_metadata() (a classmethod | ||
| # that reads cls.name / cls.description / ...) works correctly. | ||
| name = "" |
There was a problem hiding this comment.
I think this is the better, cleaner way... just plain class attributes, should be like that in the superclass too?
There was a problem hiding this comment.
if its required, you can check in the constructor that it should be non-empty...
A first PR in a larger sequence of upcoming ones, bringing the work on datasets, IO and benchmarks from branch
benchmark_datasetsinto master. Tried to keep it as minimal as possible, also with just a single dataset implementation; XCSP3Dataset. In some places you will notice some placeholders / things that don't seem needed right now but that future PRs will use to build upon. Some of these are labelled with "TODO".