Map-style dataset and iterable-style dataset
Question
A question was asked in [1] that I also wanted answered:
A map-style dataset in PyTorch has the __getitem__() and __len__() protocols, and an iterable-style dataset has the __iter__() protocol. If we use map-style, we can access the data with dataset[idx], which is great; however, with the iterable dataset we can't. My question is: why was this distinction necessary? What makes random reads of the data so expensive or even improbable? From [1]
What are these two dataset styles?
Based on [4]
Map-style datasets [4]
A map-style dataset is one that implements the __getitem__() and __len__() protocols, and represents a map from (possibly non-integral) indices/keys to data samples.
For example, such a dataset, when accessed with dataset[idx], could read the idx-th image and its corresponding label from a folder on the disk.
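To make the map-style protocol concrete, here is a minimal sketch. The class name SquaresDataset and its data are hypothetical and chosen only for illustration. The import falls back to a plain base class when PyTorch is not installed, since the protocol itself is ordinary Python.

```python
try:
    from torch.utils.data import Dataset
except ImportError:
    # Fallback (assumption) so the sketch still runs without PyTorch;
    # the map-style protocol is just __getitem__ plus __len__.
    Dataset = object


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: index i maps to the pair (i, i * i)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # The number of samples is known up front.
        return self.size

    def __getitem__(self, idx):
        # Random access: any sample can be fetched directly by its index.
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        return idx, idx * idx


dataset = SquaresDataset(5)
print(len(dataset))  # 5
print(dataset[3])    # (3, 9)
```

Because every sample is reachable by index and the length is known, a loader can visit the samples in any order, which is what makes index-based shuffling possible.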
Iterable-style datasets [4]
An iterable-style dataset is an instance of a subclass of IterableDataset that implements the __iter__() protocol, and represents an iterable over data samples. This type of datasets is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data.
For example, such a dataset, when called iter(dataset), could return a stream of data reading from a database, a remote server, or even logs generated in real time.
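As a rough illustration of the iterable style, the sketch below wraps a text stream. The class name LineStreamDataset is hypothetical, and an in-memory io.StringIO stands in for a real source such as a log file or a network connection. As before, the import falls back to a plain base class if PyTorch is unavailable.

```python
import io

try:
    from torch.utils.data import IterableDataset
except ImportError:
    # Fallback (assumption) so the sketch still runs without PyTorch.
    IterableDataset = object


class LineStreamDataset(IterableDataset):
    """Hypothetical iterable-style dataset that yields non-empty lines from a stream."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        # Samples are produced sequentially; there is no lookup by index.
        for line in self.stream:
            line = line.strip()
            if line:
                yield line


# An in-memory stream stands in for a database cursor, socket, or log file.
stream = io.StringIO("first\nsecond\n\nthird\n")
dataset = LineStreamDataset(stream)
samples = list(iter(dataset))
print(samples)  # ['first', 'second', 'third']
```

Note that the dataset never needs to know how many lines the stream holds, and reaching the third sample means reading past the first two.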
Answer
As I understand it, the main difference between these datasets is that IterableDataset provides a clean way to yield data from, e.g., a stream, i.e. where the length of the dataset is unknown or cannot be simply calculated. Inside the __iter__ method you would thus have to make sure to exit the iteration at some point, e.g. when your data stream is empty. From [2]
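The point about exiting the iteration can be sketched as follows. This is only one possible approach, assuming the stream is modeled as a queue.Queue; the class name QueueDataset is hypothetical. The generator returns as soon as the queue reports that it is empty, which ends the iteration cleanly.

```python
import queue

try:
    from torch.utils.data import IterableDataset
except ImportError:
    # Fallback (assumption) so the sketch still runs without PyTorch.
    IterableDataset = object


class QueueDataset(IterableDataset):
    """Hypothetical dataset that drains a queue whose length is not known in advance."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        while True:
            try:
                item = self.source.get_nowait()
            except queue.Empty:
                # The stream is empty: stop yielding so the loop terminates.
                return
            yield item


source = queue.Queue()
for value in (10, 20, 30):
    source.put(value)

drained = list(QueueDataset(source))
print(drained)  # [10, 20, 30]
```

Without the `return` on an empty queue, the `while True` loop would never finish, which is the failure mode the answer warns about.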
Further reading
See [3]
References
[1] https://discuss.pytorch.org/t/dataset-map-style-vs-iterable-style/92329
[2] https://discuss.pytorch.org/t/dataset-map-style-vs-iterable-style/92329/2?u=jimmy_xiaoke_shen
[3] Quick guide to loading data in PyTorch and TensorFlow | Yizhe’s notebook (yizhepku.github.io)
[4] PyTorch documentation, torch.utils.data: https://pytorch.org/docs/stable/data.html