Map-style dataset and iterable-style dataset

Jimmy (xiaoke) Shen
2 min read · Jul 13, 2022


Question

A question was asked in [1], and it is also the question I wanted answered.

A map-style dataset in PyTorch has the __getitem__() and __len__() protocols, while an iterable-style dataset has the __iter__() protocol. If we use a map-style dataset, we can access the data with dataset[idx], which is great; with an iterable-style dataset we can't.

My question is: why was this distinction necessary? What makes random reads of the data so expensive or even improbable? From [1]

What are these two styles of dataset?

Based on [4]

Map-style datasets [4]

A map-style dataset is one that implements the __getitem__() and __len__() protocols, and represents a map from (possibly non-integral) indices/keys to data samples.

For example, such a dataset, when accessed with dataset[idx], could read the idx-th image and its corresponding label from a folder on the disk.
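For illustration, here is a minimal sketch of a map-style dataset. The class name SquaresDataset and its (index, index squared) samples are made up for this post, not taken from the PyTorch docs:

import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    # Toy map-style dataset: index i maps to the pair (i, i * i).
    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Lets len(dataset) and the default sampler know how many items exist.
        return self.n

    def __getitem__(self, idx):
        # Random access by index: dataset[idx] works directly.
        return torch.tensor(idx), torch.tensor(idx * idx)

dataset = SquaresDataset(10)
print(len(dataset))  # 10
print(dataset[3])    # (tensor(3), tensor(9))

Because __getitem__ gives cheap random access, a DataLoader can shuffle indices freely over such a dataset.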

Iterable-style datasets [4]

An iterable-style dataset is an instance of a subclass of IterableDataset that implements the __iter__() protocol, and represents an iterable over data samples. This type of datasets is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data.

For example, such a dataset, when called iter(dataset), could return a stream of data reading from a database, a remote server, or even logs generated in real time.
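And a minimal sketch of the iterable-style counterpart. Here fake_log_stream is a made-up stand-in for a real source such as a socket, a database cursor, or a log file that is still being written:

import torch
from torch.utils.data import IterableDataset, DataLoader

class LogStreamDataset(IterableDataset):
    # Toy iterable-style dataset that yields samples from a (simulated) stream.
    def __init__(self, stream_factory):
        self.stream_factory = stream_factory

    def __iter__(self):
        # No __len__ and no dataset[idx]; samples simply arrive one by one.
        for line in self.stream_factory():
            yield torch.tensor(len(line))  # toy "feature": the line length

def fake_log_stream():
    for line in ["GET /index", "POST /login", "GET /health"]:
        yield line

loader = DataLoader(LogStreamDataset(fake_log_stream), batch_size=2)
for batch in loader:
    print(batch)  # tensor([10, 11]) then tensor([11])

Note that shuffling makes no sense here: DataLoader rejects shuffle=True for an IterableDataset, since there are no indices to permute.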

Answer

I understood the main difference between these datasets to be that the IterableDataset provides a clean way to yield data from e.g. a stream, i.e. where the length of the dataset is unknown or cannot simply be calculated. Inside the __iter__ method you would thus have to make sure to exit the iteration at some point, e.g. if your data stream is empty. From [2]
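The point about exiting the iteration can be made concrete with a small sketch. The capped endless stream below (itertools.count plus a made-up max_samples argument) stands in for a stream whose total length is not known in advance:

import itertools
import torch
from torch.utils.data import IterableDataset

class CappedStreamDataset(IterableDataset):
    # Iterable-style dataset over a potentially endless stream.
    def __init__(self, stream, max_samples=None):
        self.stream = stream
        # max_samples is a made-up safeguard: stop even if the stream never ends.
        self.max_samples = max_samples

    def __iter__(self):
        it = self.stream
        if self.max_samples is not None:
            it = itertools.islice(it, self.max_samples)
        for x in it:
            yield torch.tensor(x)
        # When the (sliced) stream is exhausted, the generator simply returns,
        # which is how the iteration is exited.

endless = itertools.count(start=0)   # a stream with no natural end
ds = CappedStreamDataset(endless, max_samples=5)
print([int(x) for x in ds])          # [0, 1, 2, 3, 4]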

Further reading

See [3]

Reference

[1] https://discuss.pytorch.org/t/dataset-map-style-vs-iterable-style/92329

[2] https://discuss.pytorch.org/t/dataset-map-style-vs-iterable-style/92329/2?u=jimmy_xiaoke_shen

[3] Quick guide to loading data in PyTorch and TensorFlow | Yizhe’s notebook (yizhepku.github.io)

[4] torch.utils.data — PyTorch 1.12 documentation
