Understanding Dataset and DataLoader in PyTorch in One Article

Author | Chenllliang@Zhihu (reposted with permission)
Source | https://zhuanlan.zhihu.com/p/105578087
Editor | 極市平臺
Overview
Dataset is one of the arguments used to instantiate a DataLoader. This article starts from the source code of Dataset and then covers the main functions of DataLoader, so that readers can learn to define their own datasets.
In the deep learning era, data is king.
PyTorch provides two classes, Dataset and DataLoader. Dataset is responsible for creating datasets that PyTorch can consume, and DataLoader is responsible for delivering that data to the training loop. If you want to customize your dataset or the way data is delivered, you can write your own subclasses.
The discussion focuses on the main functions and skips minor details.
PS: This article is reposted from the author's blog, 陳亮的博客 | Liang's Blog (https://chenllliang.github.io), which contains some finished and in-progress articles. Feel free to get in touch.
Dataset
When to use Dataset
CIFAR10 is a dataset frequently used in computer vision training. In PyTorch (through torchvision), CIFAR10 is a ready-made Dataset, so using it takes only the following code:
data = datasets.CIFAR10("./data/", transform=transform, train=True, download=True)
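datasets.CIFAR10 is a subclass of Dataset, and data is an instance of that class.
Sometimes we need to use our own data, stored in a folder, as a dataset. In that case we can use the convenient ImageFolder API, which expects one subdirectory per class under the root folder.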
FaceDataset = datasets.ImageFolder('./data', transform=img_transform)
How to define a custom dataset
torch.utils.data.Dataset is an abstract class that represents a dataset. Any custom dataset needs to inherit from this class and override the relevant methods.
A dataset is essentially a class responsible for the mapping from an index to a sample.
PyTorch provides two kinds of datasets: map-style datasets and iterable-style datasets.
Map-style datasets
A map-style dataset must override two special methods, __getitem__(self, index) and __len__(self), which together express the mapping from index to sample.
For such a dataset, say dataset, the expression dataset[idx] can read the idx-th image of your dataset and its label (if there is one) from disk, and len(dataset) returns the size of the dataset.
A custom class looks roughly like this:
class CustomDataset(data.Dataset):  # must inherit from data.Dataset
    def __init__(self):
        # TODO
        # 1. Initialize file path or list of file names.
        pass

    def __getitem__(self, index):
        # TODO
        # 1. Read one data from file (e.g. using numpy.fromfile, PIL.Image.open).
        # 2. Preprocess the data (e.g. torchvision.Transform).
        # 3. Return a data pair (e.g. image and label).
        # Note that step 1 reads a single sample, not a batch.
        pass

    def __len__(self):
        # You should change 0 to the total size of your dataset.
        return 0
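To make the skeleton concrete, here is a minimal sketch of a map-style dataset that needs no files on disk. The class name SquaresDataset and its contents are hypothetical and chosen only for illustration: sample i is the pair (i, i * i).

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        # Nothing to load here; a real dataset would record file paths.
        self.size = size

    def __getitem__(self, index):
        # Raising IndexError keeps plain Python iteration well-behaved.
        if not 0 <= index < self.size:
            raise IndexError(index)
        x = torch.tensor(float(index))
        return x, x * x

    def __len__(self):
        return self.size


toy_dataset = SquaresDataset(5)
first_x, first_y = toy_dataset[2]  # indexing calls __getitem__(2)
print(len(toy_dataset), first_x.item(), first_y.item())
```

Indexing with toy_dataset[2] and calling len(toy_dataset) are exactly the two operations a DataLoader relies on for a map-style dataset.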
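Example 1: a dataset written for my own experiments. The image files are stored under the "./data/faces/" folder. The image names do not start from 1; they are read from the dictionary saved in final_train_tag_dict.txt, and the label information is read from the same file. You can follow the code with the comments in the skeleton above in mind.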
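import ast

import numpy as np
from PIL import Image
from torch.utils import data


class face_dataset(data.Dataset):
    def __init__(self):
        self.file_path = './data/faces/'
        # The file is assumed to hold a Python dict literal: image id -> label.
        # ast.literal_eval parses it without executing arbitrary code.
        with open("final_train_tag_dict.txt", "r") as f:
            self.label_dict = ast.literal_eval(f.read())
        # Build the index lookups once instead of on every access.
        self.img_ids = list(self.label_dict.keys())
        self.labels = list(self.label_dict.values())

    def __getitem__(self, index):
        # DataLoader indices start at 0, so index is used directly.
        label = self.labels[index]
        img_id = self.img_ids[index]
        img_path = self.file_path + str(img_id) + ".jpg"
        img = np.array(Image.open(img_path))
        return img, label

    def __len__(self):
        return len(self.label_dict)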
Next, let's look at the official MNIST dataset as an example.
# Excerpt from an older torchvision release. It relies on module-level imports
# (os, errno, torch, PIL.Image) and helpers (read_image_file, read_label_file)
# that are not shown here.
class MNIST(data.Dataset):
    """`MNIST <http://yann.lecun.com/exdb/mnist/>`_ Dataset.
    Args:
        root (string): Root directory of dataset where ``processed/training.pt``
            and ``processed/test.pt`` exist.
        train (bool, optional): If True, creates dataset from ``training.pt``,
            otherwise from ``test.pt``.
        download (bool, optional): If true, downloads the dataset from the internet and
            puts it in root directory. If dataset is already downloaded, it is not
            downloaded again.
        transform (callable, optional): A function/transform that takes in an PIL image
            and returns a transformed version. E.g, ``transforms.RandomCrop``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
    """
    urls = [
        'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
        'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
        'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
        'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz',
    ]
    raw_folder = 'raw'
    processed_folder = 'processed'
    training_file = 'training.pt'
    test_file = 'test.pt'
    classes = ['0 - zero', '1 - one', '2 - two', '3 - three', '4 - four',
               '5 - five', '6 - six', '7 - seven', '8 - eight', '9 - nine']
    class_to_idx = {_class: i for i, _class in enumerate(classes)}

    @property
    def targets(self):
        if self.train:
            return self.train_labels
        else:
            return self.test_labels

    def __init__(self, root, train=True, transform=None, target_transform=None, download=False):
        self.root = os.path.expanduser(root)
        self.transform = transform
        self.target_transform = target_transform
        self.train = train  # training set or test set
        if download:
            self.download()
        if not self._check_exists():
            raise RuntimeError('Dataset not found.' +
                               ' You can use download=True to download it')
        if self.train:
            self.train_data, self.train_labels = torch.load(
                os.path.join(self.root, self.processed_folder, self.training_file))
        else:
            self.test_data, self.test_labels = torch.load(
                os.path.join(self.root, self.processed_folder, self.test_file))

    def __getitem__(self, index):
        """
        Args:
            index (int): Index
        Returns:
            tuple: (image, target) where target is index of the target class.
        """
        if self.train:
            img, target = self.train_data[index], self.train_labels[index]
        else:
            img, target = self.test_data[index], self.test_labels[index]
        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img.numpy(), mode='L')
        if self.transform is not None:
            img = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return img, target

    def __len__(self):
        if self.train:
            return len(self.train_data)
        else:
            return len(self.test_data)

    def _check_exists(self):
        return os.path.exists(os.path.join(self.root, self.processed_folder, self.training_file)) and \
            os.path.exists(os.path.join(self.root, self.processed_folder, self.test_file))

    def download(self):
        """Download the MNIST data if it doesn't exist in processed_folder already."""
        from six.moves import urllib
        import gzip
        if self._check_exists():
            return
        # download files
        try:
            os.makedirs(os.path.join(self.root, self.raw_folder))
            os.makedirs(os.path.join(self.root, self.processed_folder))
        except OSError as e:
            if e.errno == errno.EEXIST:
                pass
            else:
                raise
        for url in self.urls:
            print('Downloading ' + url)
            data = urllib.request.urlopen(url)
            filename = url.rpartition('/')[2]
            file_path = os.path.join(self.root, self.raw_folder, filename)
            with open(file_path, 'wb') as f:
                f.write(data.read())
            with open(file_path.replace('.gz', ''), 'wb') as out_f, \
                    gzip.GzipFile(file_path) as zip_f:
                out_f.write(zip_f.read())
            os.unlink(file_path)
        # process and save as torch files
        print('Processing...')
        training_set = (
            read_image_file(os.path.join(self.root, self.raw_folder, 'train-images-idx3-ubyte')),
            read_label_file(os.path.join(self.root, self.raw_folder, 'train-labels-idx1-ubyte'))
        )
        test_set = (
            read_image_file(os.path.join(self.root, self.raw_folder, 't10k-images-idx3-ubyte')),
            read_label_file(os.path.join(self.root, self.raw_folder, 't10k-labels-idx1-ubyte'))
        )
        with open(os.path.join(self.root, self.processed_folder, self.training_file), 'wb') as f:
            torch.save(training_set, f)
        with open(os.path.join(self.root, self.processed_folder, self.test_file), 'wb') as f:
            torch.save(test_set, f)
        print('Done!')

    def __repr__(self):
        fmt_str = 'Dataset ' + self.__class__.__name__ + '\n'
        fmt_str += ' Number of datapoints: {}\n'.format(self.__len__())
        tmp = 'train' if self.train is True else 'test'
        fmt_str += ' Split: {}\n'.format(tmp)
        fmt_str += ' Root Location: {}\n'.format(self.root)
        tmp = ' Transforms (if any): '
        fmt_str += '{0}{1}\n'.format(tmp, self.transform.__repr__().replace('\n', '\n' + ' ' * len(tmp)))
        tmp = ' Target Transforms (if any): '
        fmt_str += '{0}{1}'.format(tmp, self.target_transform.__repr__().replace('\n', '\n' + ' ' * len(tmp)))
        return fmt_str
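Iterable-style datasets
An iterable-style dataset is a subclass of the abstract class data.IterableDataset that overrides the __iter__ method so that it returns an iterator over samples. This kind of dataset is mainly used when the data size is unknown, when the input arrives as a stream, or when the local files are not fixed, so samples have to be obtained by iteration rather than by index.
For background on iterators and generators, see the author's other article on Python iterators and generators and their use in the PyTorch source code.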
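As a small illustration of the iterable style, the sketch below defines a stream of integers. The class name CountStream is hypothetical; a real stream would read from a socket, a log file, or a similar source instead of a range.

```python
from torch.utils.data import IterableDataset


class CountStream(IterableDataset):
    """Hypothetical stream yielding integers from start up to (not including) end."""

    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        # No __getitem__ or __len__ is required: samples come from iteration.
        return iter(range(self.start, self.end))


stream = CountStream(3, 7)
streamed = list(stream)
print(streamed)
```

Because there is no index-to-sample mapping, a DataLoader built on such a dataset cannot use a sampler or shuffle; the order is whatever __iter__ produces.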
DataLoader
Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset. —— PyTorch Documents
In general, a deep learning training run in PyTorch goes through these steps:
1. Create the Dataset.
2. Pass the Dataset to a DataLoader.
3. Iterate over the DataLoader to produce training data for the model.
Correspondingly, there are usually these three pieces of code:
# Create the Dataset (it can be a custom one)
dataset = face_dataset()  # the face_dataset defined in the Dataset section
# Pass the Dataset to the DataLoader
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=False, num_workers=8)
# Iterate over the DataLoader to feed training data to the model
for i in range(epoch):
    for index, (img, label) in enumerate(dataloader):
        pass
At this point, PyTorch's dataset and data delivery mechanism should be fairly clear. Dataset builds the mapping from index to sample, and DataLoader iterates over the dataset in a specified way to produce batches of samples. During enumerate, the dataloader is in fact calling the __getitem__ method of its dataset, following the strategy defined by its sampler parameter.
Parameters
Let's first look at the parameters needed to instantiate a DataLoader. We only need to focus on a few key ones.
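The three steps can be tried end to end without any image files. The following is a minimal sketch that uses TensorDataset as a stand-in for a custom dataset; the tensor contents are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Step 1: create a Dataset (10 samples with 2 features each, plus a label).
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# Step 2: pass the Dataset to a DataLoader.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Step 3: iterate; each step yields one batch assembled from __getitem__ calls.
batch_shapes = [(x.shape, y.shape) for x, y in loader]
for x_shape, y_shape in batch_shapes:
    print(tuple(x_shape), tuple(y_shape))
```

With 10 samples and batch_size=4, the last batch holds only the 2 remaining samples, because drop_last defaults to False.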
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
batch_sampler=None, num_workers=0, collate_fn=None,
pin_memory=False, drop_last=False, timeout=0,
worker_init_fn=None)
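Parameter descriptions:
dataset (Dataset): a map-style or iterable-style dataset, defined as described earlier.
batch_size (int, optional): how many samples one batch contains (default: 1).
shuffle (bool, optional): whether the data is reshuffled at every epoch, or served in the same order each time (default: False).
sampler (Sampler, optional): defines the strategy for drawing samples from the dataset. If it is specified, shuffle must be False.
batch_sampler (Sampler, optional): similar to sampler, but returns the indices of all samples in one batch at a time. Mutually exclusive with the four parameters batch_size, shuffle, sampler, and drop_last.
num_workers (int, optional): how many subprocesses load data at the same time. These are worker processes, not threads; 0 means the data is loaded in the main process (default: 0).
collate_fn (callable, optional): merges a list of samples to form a mini-batch.
pin_memory (bool, optional): if True, the data loader copies tensors into CUDA pinned memory before returning them.
drop_last (bool, optional): if the dataset size is not divisible by batch_size, setting this to True drops the last incomplete batch. If it is False and the dataset size is not divisible by batch_size, the last batch will be smaller (default: False).
timeout (numeric, optional): if positive, the time to wait for a batch to be collected from the worker processes. If no batch arrives within that time, a RuntimeError is raised rather than the batch being silently skipped. The value should always be greater than or equal to 0 (default: 0).
worker_init_fn (callable, optional): initialization function called in each worker (default: None).
There is not much more to say about dataset: it is essential, and it must be defined as one of the two kinds of dataset described earlier, with the relevant methods overridden. batch_size is equally straightforward: it is the number of samples in one training batch. shuffle controls whether the order of the training samples changes from epoch to epoch; for training it is usually set to True.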
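Two of these parameters are easier to grasp with a small experiment. The sketch below, under the assumption that the samples are variable-length 1-D tensors, uses a hypothetical collate function pad_collate to zero-pad each batch, and compares drop_last=False with drop_last=True.

```python
import torch
from torch.utils.data import DataLoader

# Five variable-length sequences; a plain list already provides
# __getitem__ and __len__, so it works as a map-style dataset.
sequences = [torch.ones(n) for n in (1, 2, 3, 4, 5)]


def pad_collate(batch):
    """Zero-pad every sequence in the batch to the longest one."""
    max_len = max(seq.shape[0] for seq in batch)
    padded = torch.zeros(len(batch), max_len)
    for row, seq in enumerate(batch):
        padded[row, : seq.shape[0]] = seq
    return padded


keep_loader = DataLoader(sequences, batch_size=2, collate_fn=pad_collate, drop_last=False)
drop_loader = DataLoader(sequences, batch_size=2, collate_fn=pad_collate, drop_last=True)

shapes_keep = [tuple(batch.shape) for batch in keep_loader]
shapes_drop = [tuple(batch.shape) for batch in drop_loader]
print(shapes_keep)
print(shapes_drop)
```

Without a custom collate_fn, the default collation would fail here because tensors of different lengths cannot be stacked; with drop_last=True the final single-sample batch disappears.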
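Samplers
sampler is a key parameter. A sampler is an iterable over dataset indices. PyTorch provides several samplers, and users can also define their own.
All samplers inherit from the abstract class torch.utils.data.sampler.Sampler.
Background on iterators can be found in the author's article on Python iterators and generators and their use in the PyTorch source code.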
class Sampler(object):
    # """Base class for all Samplers.
    # Every Sampler subclass has to provide an __iter__ method, providing a way
    # to iterate over indices of dataset elements, and a __len__ method that
    # returns the length of the returned iterators.
    # """
    # Base class: an iterable over indices
    def __init__(self, data_source):
        pass

    def __iter__(self):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError
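Writing a custom sampler only takes the two methods named in the base class. As a sketch, the hypothetical ReverseSampler below walks the dataset from the last index to the first; the data values are made up for illustration.

```python
from torch.utils.data import DataLoader, Sampler


class ReverseSampler(Sampler):
    """Hypothetical sampler that yields indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


values = [10, 20, 30, 40]
loader = DataLoader(values, batch_size=2, sampler=ReverseSampler(values))
batches = [batch.tolist() for batch in loader]
print(batches)
```

The DataLoader never inspects the order itself; it simply asks the sampler for indices and fetches values[index] for each one.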
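Samplers built into PyTorch
SequentialSampler, RandomSampler, SubsetRandomSampler, WeightedRandomSampler
SequentialSampler is easy to understand: it samples in order.
It works by storing the dataset data_source at initialization. Its __iter__ method then builds a range with the same length as data_source and returns an iterator over it, which yields one index at a time.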
class SequentialSampler(Sampler):
    # r"""Samples elements sequentially, always in the same order.
    # Arguments:
    #     data_source (Dataset): dataset to sample from
    # """
    # Produces indices in sequential order
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)
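RandomSampler draws indices in random order. Its parameters:
data_source: same as above.
num_samples: the number of samples to draw; defaults to the size of the dataset.
replacement: if True, sampling is done with replacement, meaning the same sample can be drawn more than once, which may leave some samples never drawn. In that case we can set num_samples to increase the number of draws so that every sample has a chance of being drawn.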
class RandomSampler(Sampler):
    # r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
    # If with replacement, then user can specify ``num_samples`` to draw.
    # Arguments:
    #     data_source (Dataset): dataset to sample from
    #     num_samples (int): number of samples to draw, default=len(dataset)
    #     replacement (bool): samples are drawn with replacement if ``True``, default=False
    # """
    def __init__(self, data_source, replacement=False, num_samples=None):
        self.data_source = data_source
        self.replacement = replacement
        self.num_samples = num_samples
        if self.num_samples is not None and replacement is False:
            raise ValueError("With replacement=False, num_samples should not be specified, "
                             "since a random permute will be performed.")
        if self.num_samples is None:
            self.num_samples = len(self.data_source)
        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integeral "
                             "value, but got num_samples={}".format(self.num_samples))
        if not isinstance(self.replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(self.replacement))

    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
        return iter(torch.randperm(n).tolist())

    def __len__(self):
        return len(self.data_source)
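The difference between the two samplers, and the effect of replacement and num_samples, can be seen by listing the indices they produce. This is a small sketch over a made-up five-element dataset; the random orders will vary with the seed.

```python
import torch
from torch.utils.data import RandomSampler, SequentialSampler

items = list(range(5))
torch.manual_seed(0)  # only to make repeated runs comparable

sequential = list(SequentialSampler(items))
# Without replacement: a random permutation, each index exactly once.
shuffled = list(RandomSampler(items))
# With replacement: num_samples draws, indices may repeat or be missing.
oversampled = list(RandomSampler(items, replacement=True, num_samples=8))

print(sequential)
print(shuffled)
print(oversampled)
```

Passing shuffle=True to a DataLoader is a shorthand for using a RandomSampler without replacement, which is why sampler and shuffle cannot both be set.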
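SubsetRandomSampler samples randomly, without replacement, from a given list of indices. A common use for this sampler is splitting a training set into a training part and a validation part: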
class SubsetRandomSampler(Sampler):
    # r"""Samples elements randomly from a given list of indices, without replacement.
    # Arguments:
    #     indices (sequence): a sequence of indices
    # """
    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return (self.indices[i] for i in torch.randperm(len(self.indices)))

    def __len__(self):
        return len(self.indices)
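A train/validation split with this sampler might look like the following sketch. The 80/20 ratio, the ten-element dataset, and the variable names are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))

# Shuffle all indices once, then cut them into two disjoint groups.
indices = torch.randperm(len(dataset)).tolist()
split = 8  # assumed 80/20 split
train_sampler = SubsetRandomSampler(indices[:split])
val_sampler = SubsetRandomSampler(indices[split:])

# Both loaders share the same dataset but only see their own indices.
train_loader = DataLoader(dataset, batch_size=4, sampler=train_sampler)
val_loader = DataLoader(dataset, batch_size=4, sampler=val_sampler)

train_seen = sorted(int(v) for (batch,) in train_loader for v in batch)
val_seen = sorted(int(v) for (batch,) in val_loader for v in batch)
print(train_seen, val_seen)
```

The training loader still reshuffles its eight indices on every epoch, while the validation samples never leak into training.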
batch_sampler
The samplers above return only one index at a time, but training operates on batches of data, and that is the job of BatchSampler. In other words, BatchSampler collects the indices produced by an underlying Sampler and, once their number reaches the batch size, returns that batch of indices.
class BatchSampler(Sampler):
    # Wraps another sampler to yield a mini-batch of indices.
    # Args:
    #     sampler (Sampler): Base sampler.
    #     batch_size (int): Size of mini-batch.
    #     drop_last (bool): If ``True``, the sampler will drop the last batch if
    #         its size would be less than ``batch_size``
    # Example:
    #     >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
    #     [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    #     >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
    #     [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    # Batch-level sampling
    def __init__(self, sampler, batch_size, drop_last):
        if not isinstance(sampler, Sampler):
            raise ValueError("sampler should be an instance of "
                             "torch.utils.data.Sampler, but got sampler={}"
                             .format(sampler))
        if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \
                batch_size <= 0:
            raise ValueError("batch_size should be a positive integeral value, "
                             "but got batch_size={}".format(batch_size))
        if not isinstance(drop_last, bool):
            raise ValueError("drop_last should be a boolean value, but got "
                             "drop_last={}".format(drop_last))
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size
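To see how a BatchSampler plugs into a DataLoader, the sketch below passes one through the batch_sampler argument. The data values are made up so that indices and samples are easy to tell apart.

```python
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

values = [i * 10 for i in range(10)]

# Group sequential indices into lists of three; keep the short last batch.
batch_sampler = BatchSampler(SequentialSampler(values), batch_size=3, drop_last=False)
index_batches = list(batch_sampler)

# batch_size, shuffle, sampler and drop_last must be left at their defaults
# when batch_sampler is given.
loader = DataLoader(values, batch_sampler=batch_sampler)
value_batches = [batch.tolist() for batch in loader]

print(index_batches)
print(value_batches)
```

When batch_sampler is not given, the DataLoader builds an equivalent BatchSampler internally from sampler, batch_size, and drop_last.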
Multi-process loading
The num_workers parameter sets the number of worker subprocesses (not threads) that read data at the same time. Loading in parallel can speed up data reading and improve GPU and CPU utilization.
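One way to observe where loading happens is to have each sample record the ID of the process that produced it. This is only a sketch: the PidDataset class and the loader_pids helper are hypothetical, and the worker run is kept behind a main guard because worker processes may re-import the script on some platforms.

```python
import os

from torch.utils.data import DataLoader, Dataset


class PidDataset(Dataset):
    """Hypothetical dataset whose samples record the process that loaded them."""

    def __len__(self):
        return 8

    def __getitem__(self, index):
        return index, os.getpid()


def loader_pids(num_workers):
    """Return the set of process IDs that produced the samples."""
    loader = DataLoader(PidDataset(), batch_size=2, num_workers=num_workers)
    pids = set()
    for _, pid_batch in loader:
        pids.update(pid_batch.tolist())
    return pids


if __name__ == "__main__":
    print("main process:", os.getpid())
    print("num_workers=0:", loader_pids(0))
    print("num_workers=2:", loader_pids(2))
```

With num_workers=0 every sample comes from the main process, while with num_workers=2 the recorded IDs belong to separate worker processes.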

