Author | Chenllliang@Zhihu (reposted with permission)

Source | https://zhuanlan.zhihu.com/p/105578087

Editor | Jishi Platform (極市平臺)

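Overview

Dataset is one of the arguments used to instantiate a DataLoader. This article starts from the Dataset source code and then covers the main functions in DataLoader, so that you can learn to define your own dataset.

In the deep learning era, data is king.

PyTorch provides two classes, Dataset and DataLoader. The first creates datasets that PyTorch can consume, and the second delivers that data to training. If you want to customize your dataset or the way data is delivered, you can write your own subclasses.

Dataset is an argument used to instantiate a DataLoader, so this article starts with the Dataset source code and then moves on to DataLoader. It focuses on the main functions and skips the minor details, so that you can learn to define your own dataset.

P.S. This article is reposted from the author's blog, Liang's Blog (https://chenllliang.github.io), which contains some finished and in-progress articles. Feel free to get in touch and discuss.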

Dataset

When to use Dataset

CIFAR10 is a dataset commonly used in computer vision training. In PyTorch (via torchvision.datasets), CIFAR10 is a ready-made Dataset, so using it takes only the following code:

data = datasets.CIFAR10("./data/", transform=transform, train=True, download=True)

datasets.CIFAR10 is a subclass of Dataset, and data is an instance of that class.

Sometimes we want to use our own data, stored in a folder, as the dataset. In that case we can use the convenient ImageFolder API.

FaceDataset = datasets.ImageFolder('./data', transform=img_transform)

How to define a custom dataset

torch.utils.data.Dataset is an abstract class that represents a dataset. Any custom dataset needs to inherit from this class and override the relevant methods.

A dataset is really just a class responsible for mapping an index to a sample.

PyTorch provides two kinds of datasets: map-style datasets and iterable-style datasets.

Map-style datasets

A map-style dataset must override two special methods, __getitem__(self, index) and __len__(self), which together express the mapping (map) from indices to samples.

For such a dataset, dataset[idx] can, for example, read the idx-th image and its label (if it has one) from disk, and len(dataset) returns the size of the dataset.

A custom class looks roughly like this:

class CustomDataset(data.Dataset):  # must inherit from data.Dataset
    def __init__(self):
        # TODO
        # 1. Initialize file path or list of file names.
        pass
    def __getitem__(self, index):
        # TODO
        # 1. Read one data from file (e.g. using numpy.fromfile, PIL.Image.open).
        # 2. Preprocess the data (e.g. torchvision.Transform).
        # 3. Return a data pair (e.g. image and label).
        # Note: step 1 reads a single sample, not a whole batch.
        pass
    def __len__(self):
        # You should change 0 to the total size of your dataset.
        return 0
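
To make the template concrete, here is a minimal runnable sketch of a map-style dataset. The class name SquaresDataset and its in-memory data are hypothetical and chosen only for illustration; a real dataset would usually read from disk inside __getitem__.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        # Nothing is loaded up front; only the dataset size is stored.
        self.size = size

    def __getitem__(self, index):
        # Map one index to one sample (input, target).
        if not 0 <= index < self.size:
            raise IndexError(index)
        x = torch.tensor(float(index))
        return x, x * x

    def __len__(self):
        # Total number of samples the dataset can serve.
        return self.size


dataset = SquaresDataset(5)
x, y = dataset[3]  # calls __getitem__(3)
print(len(dataset), x.item(), y.item())
```

Indexing with dataset[3] and calling len(dataset) are exactly the two operations a map-style dataset has to support.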

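Example 1: an example from my own experiments. The image files are stored in the "./data/faces/" folder. The image names do not start from 1; they are read from a dictionary saved in the file final_train_tag_dict.txt, and the label information is read from the same file. You can read this code alongside the comments in the template above.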

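import ast

from torch.utils import data
import numpy as np
from PIL import Image


class face_dataset(data.Dataset):
    def __init__(self):
        self.file_path = './data/faces/'
        # The file holds a dict literal {image_id: label}. ast.literal_eval
        # parses it without executing arbitrary code, unlike eval.
        with open("final_train_tag_dict.txt", "r") as f:
            self.label_dict = ast.literal_eval(f.read())
        # Build the index -> (image_id, label) list once, not on every access.
        self.samples = list(self.label_dict.items())

    def __getitem__(self, index):
        # DataLoader indices start at 0, so use index directly (not index - 1).
        img_id, label = self.samples[index]
        img_path = self.file_path + str(img_id) + ".jpg"
        img = np.array(Image.open(img_path))
        return img, label

    def __len__(self):
        return len(self.samples)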

Next, let's look at the official MNIST dataset as an example.

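import errno
import os

import torch
from PIL import Image
from torch.utils import data


# Excerpt from an older torchvision release. read_image_file and
# read_label_file are helper functions defined in the same torchvision module.
class MNIST(data.Dataset):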
    """`MNIST <http://yann.lecun.com/exdb/mnist/>`_ Dataset.
    Args:
        root (string): Root directory of dataset where ``processed/training.pt``
            and  ``processed/test.pt`` exist.
        train (bool, optional): If True, creates dataset from ``training.pt``,
            otherwise from ``test.pt``.
        download (bool, optional): If true, downloads the dataset from the internet and
            puts it in root directory. If dataset is already downloaded, it is not
            downloaded again.
        transform (callable, optional): A function/transform that  takes in an PIL image
            and returns a transformed version. E.g, ``transforms.RandomCrop``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
    """
    urls = [
        'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
        'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
        'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
        'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz',
    ]
    raw_folder = 'raw'
    processed_folder = 'processed'
    training_file = 'training.pt'
    test_file = 'test.pt'
    classes = ['0 - zero', '1 - one', '2 - two', '3 - three', '4 - four',
               '5 - five', '6 - six', '7 - seven', '8 - eight', '9 - nine']
    class_to_idx = {_class: i for i, _class in enumerate(classes)}

    @property
    def targets(self):
        if self.train:
            return self.train_labels
        else:
            return self.test_labels

    def __init__(self, root, train=True, transform=None, target_transform=None, download=False):
        self.root = os.path.expanduser(root)
        self.transform = transform
        self.target_transform = target_transform
        self.train = train  # training set or test set

        if download:
            self.download()

        if not self._check_exists():
            raise RuntimeError('Dataset not found.' +
                               ' You can use download=True to download it')

        if self.train:
            self.train_data, self.train_labels = torch.load(
                os.path.join(self.root, self.processed_folder, self.training_file))
        else:
            self.test_data, self.test_labels = torch.load(
                os.path.join(self.root, self.processed_folder, self.test_file))

    def __getitem__(self, index):
        """
        Args:
            index (int): Index
        Returns:
            tuple: (image, target) where target is index of the target class.
        """
        if self.train:
            img, target = self.train_data[index], self.train_labels[index]
        else:
            img, target = self.test_data[index], self.test_labels[index]

        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img.numpy(), mode='L')

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target

    def __len__(self):
        if self.train:
            return len(self.train_data)
        else:
            return len(self.test_data)

    def _check_exists(self):
        return os.path.exists(os.path.join(self.root, self.processed_folder, self.training_file)) and \
            os.path.exists(os.path.join(self.root, self.processed_folder, self.test_file))

    def download(self):
        """Download the MNIST data if it doesn't exist in processed_folder already."""
        from six.moves import urllib
        import gzip

        if self._check_exists():
            return

        # download files
        try:
            os.makedirs(os.path.join(self.root, self.raw_folder))
            os.makedirs(os.path.join(self.root, self.processed_folder))
        except OSError as e:
            if e.errno == errno.EEXIST:
                pass
            else:
                raise

        for url in self.urls:
            print('Downloading ' + url)
            data = urllib.request.urlopen(url)
            filename = url.rpartition('/')[2]
            file_path = os.path.join(self.root, self.raw_folder, filename)
            with open(file_path, 'wb') as f:
                f.write(data.read())
            with open(file_path.replace('.gz', ''), 'wb') as out_f, \
                    gzip.GzipFile(file_path) as zip_f:
                out_f.write(zip_f.read())
            os.unlink(file_path)

        # process and save as torch files
        print('Processing...')

        training_set = (
            read_image_file(os.path.join(self.root, self.raw_folder, 'train-images-idx3-ubyte')),
            read_label_file(os.path.join(self.root, self.raw_folder, 'train-labels-idx1-ubyte'))
        )
        test_set = (
            read_image_file(os.path.join(self.root, self.raw_folder, 't10k-images-idx3-ubyte')),
            read_label_file(os.path.join(self.root, self.raw_folder, 't10k-labels-idx1-ubyte'))
        )
        with open(os.path.join(self.root, self.processed_folder, self.training_file), 'wb') as f:
            torch.save(training_set, f)
        with open(os.path.join(self.root, self.processed_folder, self.test_file), 'wb') as f:
            torch.save(test_set, f)

        print('Done!')

    def __repr__(self):
        fmt_str = 'Dataset ' + self.__class__.__name__ + '\n'
        fmt_str += '    Number of datapoints: {}\n'.format(self.__len__())
        tmp = 'train' if self.train is True else 'test'
        fmt_str += '    Split: {}\n'.format(tmp)
        fmt_str += '    Root Location: {}\n'.format(self.root)
        tmp = '    Transforms (if any): '
        fmt_str += '{0}{1}\n'.format(tmp, self.transform.__repr__().replace('\n', '\n' + ' ' * len(tmp)))
        tmp = '    Target Transforms (if any): '
        fmt_str += '{0}{1}'.format(tmp, self.target_transform.__repr__().replace('\n', '\n' + ' ' * len(tmp)))
        return fmt_str

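Iterable-style datasets

An iterable-style dataset is a subclass of the abstract class data.IterableDataset that overrides the __iter__ method to return an iterator over samples. This kind of dataset is mainly used when the data size is unknown, when the input arrives as a stream, or when the local files are not fixed, so samples have to be obtained by iteration rather than by index.

For background on iterators and generators, see the author's other article on Python iterators and generators and their use in the PyTorch source code.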
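
As a rough illustration of the iterable style, the sketch below wraps a simple counter as a stream. The class name CounterStream is hypothetical; a real stream might read lines from a socket or a growing log file instead.

```python
from torch.utils.data import DataLoader, IterableDataset


class CounterStream(IterableDataset):
    """Hypothetical iterable-style dataset that yields 0, 1, ..., limit - 1."""

    def __init__(self, limit):
        self.limit = limit

    def __iter__(self):
        # No __getitem__ or __len__: samples are produced one after another.
        for value in range(self.limit):
            yield value


stream = CounterStream(5)
# DataLoader groups consecutive samples from the stream into batches.
batches = [batch.tolist() for batch in DataLoader(stream, batch_size=2)]
print(batches)
```

Because there are no indices to sample from, the batches simply follow the order in which the stream yields its items.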

DataLoader

Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset. —— PyTorch Documents

In general, deep learning training in PyTorch follows this flow: 1. Create a Dataset. 2. Pass the Dataset to a DataLoader. 3. Iterate over the DataLoader to produce training data for the model.

The code usually has three corresponding parts:

import torch

# Create the Dataset (it can be a custom one)
dataset = face_dataset()  # an instance of the custom face_dataset from the Dataset section
# Pass the Dataset to a DataLoader
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=False, num_workers=8)
# Iterate over the DataLoader to feed training data to the model
for i in range(epoch):  # epoch is the number of training epochs, defined elsewhere
    for index, (img, label) in enumerate(dataloader):
        pass

At this point, PyTorch's dataset and data-delivery mechanism should be fairly clear. Dataset builds the mapping from indices to samples, and DataLoader iterates over the dataset in a specified way to produce batches of samples. During the enumerate loop, the dataloader is in effect calling its dataset's __getitem__ method according to the strategy defined by its sampler argument.
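
One way to see this interaction is to record which indices the DataLoader asks for. The sketch below uses a hypothetical RecordingDataset and keeps num_workers at its default of 0 so that the recorded list lives in the main process.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RecordingDataset(Dataset):
    """Hypothetical dataset that remembers every index passed to __getitem__."""

    def __init__(self, size):
        self.size = size
        self.accessed = []

    def __getitem__(self, index):
        self.accessed.append(index)
        # Return a fake "image" tensor and a fake label.
        return torch.tensor([index]), index % 2

    def __len__(self):
        return self.size


dataset = RecordingDataset(5)
loader = DataLoader(dataset, batch_size=2, shuffle=False)
# Each iteration yields one batch: samples stacked along a new first dimension.
shapes = [tuple(img.shape) for img, label in loader]
print(dataset.accessed, shapes)
```

With shuffle=False the indices are requested in order, and the last batch is smaller because 5 is not divisible by 2.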

Parameters

Let's first look at the parameters needed to instantiate a DataLoader. We only need to focus on a few key ones.

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None)

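Parameter descriptions:
dataset (Dataset) – a map-style or iterable-style dataset that has already been defined.
batch_size (int, optional) – how many samples one batch contains (default: 1).
shuffle (bool, optional) – whether the data is reshuffled at every epoch (True) or served in the same order each time (default: False).
sampler (Sampler, optional) – defines the strategy for drawing samples from the dataset. If specified, shuffle must be False.
batch_sampler (Sampler, optional) – like sampler, but returns the indices of all samples in a batch at once. Mutually exclusive with the four parameters batch_size, shuffle, sampler, and drop_last.
num_workers (int, optional) – how many subprocesses load data in parallel. These are worker processes rather than threads; 0 means the data is loaded in the main process (default: 0).
collate_fn (callable, optional) – merges a list of samples to form a mini-batch.
pin_memory (bool, optional) – if True, the data loader copies tensors into CUDA pinned memory before returning them.
drop_last (bool, optional) – if the dataset size is not divisible by batch_size, setting this to True drops the last incomplete batch. If it is False and the dataset size is not divisible by batch_size, the last batch will be smaller (default: False).
timeout (numeric, optional) – if positive, the time to wait for a batch to be collected from the worker processes. If no batch arrives within that time, loading stops with an error instead of waiting indefinitely. The value should always be non-negative (default: 0).
worker_init_fn (callable, optional) – an initialization function called in each worker (default: None).
dataset: there is not much to add, but it is important. It has to be defined as one of the two dataset kinds described earlier, with the relevant methods overridden.
batch_size: also straightforward; it is the number of samples in one training batch.
shuffle: controls whether the order of training samples changes from epoch to epoch; it is usually set to True for training.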
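
The following sketch shows how collate_fn and drop_last work together on variable-length samples. The function name pad_collate and the toy sequences are assumptions made for illustration; the padding itself uses torch.nn.utils.rnn.pad_sequence.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Five toy sequences of different lengths; a plain list works as a map-style dataset.
sequences = [torch.ones(n) for n in (3, 1, 2, 4, 2)]


def pad_collate(batch):
    """Hypothetical collate_fn: zero-pad a list of 1-D tensors to the same length."""
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True)
    return padded, lengths


# drop_last=True discards the fifth sequence, which cannot fill a batch of 2.
loader = DataLoader(sequences, batch_size=2, collate_fn=pad_collate, drop_last=True)
batches = list(loader)
for padded, lengths in batches:
    print(tuple(padded.shape), lengths.tolist())
```

The default collate function would fail here because tensors of different lengths cannot be stacked, which is the typical reason to supply a custom collate_fn.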

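Samplers

sampler is a key parameter. A sampler is an iterable that yields dataset indices. PyTorch provides several samplers, and users can also define their own.

All samplers inherit from the abstract class torch.utils.data.sampler.Sampler.

The basics of iterators are covered in the author's article on Python iterators and generators and their use in the PyTorch source code.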

class Sampler(object):
    # """Base class for all Samplers.
    # Every Sampler subclass has to provide an __iter__ method, providing a way
    # to iterate over indices of dataset elements, and a __len__ method that
    # returns the length of the returned iterators.
    # """
    # Base class for samplers, which iterate over dataset indices
    def __init__(self, data_source):
        pass

    def __iter__(self):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError
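
As a sketch of a user-defined sampler, the hypothetical ReverseSampler below walks the dataset from the last index to the first. It only needs to provide __iter__ and __len__, as the base class requires.

```python
from torch.utils.data import DataLoader, Sampler


class ReverseSampler(Sampler):
    """Hypothetical sampler that yields indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # Indices n - 1, n - 2, ..., 0
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


data = [10, 20, 30, 40]
loader = DataLoader(data, batch_size=2, sampler=ReverseSampler(data))
batches = [batch.tolist() for batch in loader]
print(batches)
```

The DataLoader never inspects the sampler's logic; it just consumes the indices the sampler yields and groups them into batches.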

Samplers built into PyTorch

SequentialSampler
RandomSampler
SubsetRandomSampler
WeightedRandomSampler

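SequentialSampler is easy to understand: it samples indices in order.

It works as follows. At initialization it receives the dataset data_source. Then, in the __iter__ method, it builds an iterator over a range with the same length as data_source. Each step returns a single index.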

class SequentialSampler(Sampler):
    # r"""Samples elements sequentially, always in the same order.
    # Arguments:
    #     data_source (Dataset): dataset to sample from
    # """
    # Yields indices in sequential order
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)

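RandomSampler draws indices in random order. Its parameters:

data_source: same as above.
num_samples: the number of samples to draw; by default, the whole dataset.
replacement: if True, sampling is done with replacement, so the same sample can be drawn more than once, which may leave some samples undrawn. In that case we can set num_samples to draw more samples so that every sample has a better chance of being drawn.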

class RandomSampler(Sampler):
    # r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
    # If with replacement, then user can specify ``num_samples`` to draw.
    # Arguments:
    #     data_source (Dataset): dataset to sample from
    #     num_samples (int): number of samples to draw, default=len(dataset)
    #     replacement (bool): samples are drawn with replacement if ``True``, default=False
    # """

    def __init__(self, data_source, replacement=False, num_samples=None):
        self.data_source = data_source
        self.replacement = replacement
        self.num_samples = num_samples

        if self.num_samples is not None and replacement is False:
            raise ValueError("With replacement=False, num_samples should not be specified, "
                             "since a random permute will be performed.")

        if self.num_samples is None:
            self.num_samples = len(self.data_source)

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integeral "
                             "value, but got num_samples={}".format(self.num_samples))
        if not isinstance(self.replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(self.replacement))

    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
        return iter(torch.randperm(n).tolist())

    def __len__(self):
        return len(self.data_source)
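
A short sketch of the two modes, using a small in-memory list as the data source. The variable names are illustrative only; the exact indices drawn depend on the random state, so only their general properties are noted in the comments.

```python
import torch
from torch.utils.data import RandomSampler

torch.manual_seed(0)
data = list(range(5))

# Without replacement: a random permutation, every index appears exactly once.
permutation = list(RandomSampler(data))

# With replacement: num_samples indices are drawn independently, so repeats
# are possible and some indices may never appear.
with_replacement = list(RandomSampler(data, replacement=True, num_samples=8))

print(permutation)
print(with_replacement)
```

Passing shuffle=True to a DataLoader has the same effect as the first mode: the loader creates a RandomSampler without replacement internally.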

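SubsetRandomSampler samples randomly, without replacement, from a given list of indices. A common use is to split the training set into a training subset and a validation subset: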

class SubsetRandomSampler(Sampler):
    # r"""Samples elements randomly from a given list of indices, without replacement.
    # Arguments:
    #     indices (sequence): a sequence of indices
    # """

    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return (self.indices[i] for i in torch.randperm(len(self.indices)))

    def __len__(self):
        return len(self.indices)
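
The split mentioned above could look roughly like this. The 80/20 ratio and the toy data are assumptions for illustration; the data values equal their indices, which makes it easy to see which samples each loader serves.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

data = list(range(10))  # toy dataset: sample i has the value i

# Shuffle all indices once, then cut them into two disjoint lists.
indices = torch.randperm(len(data)).tolist()
train_idx, val_idx = indices[:8], indices[8:]

# Both loaders share the same dataset but draw from different index lists.
train_loader = DataLoader(data, batch_size=4, sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(data, batch_size=4, sampler=SubsetRandomSampler(val_idx))

train_seen = sorted(x for batch in train_loader for x in batch.tolist())
val_seen = sorted(x for batch in val_loader for x in batch.tolist())
print(train_seen, val_seen)
```

Because the two index lists are disjoint, no sample appears in both the training and the validation loader.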

batch_sampler

The samplers above return only one index at a time, but training works on batches of data, and that is the job of BatchSampler. In other words, BatchSampler collects the indices produced by an underlying Sampler and, once their number reaches the batch size, returns that group of indices as one batch.

class BatchSampler(Sampler):
    #     Wraps another sampler to yield a mini-batch of indices.
    # Args:
    #     sampler (Sampler): Base sampler.
    #     batch_size (int): Size of mini-batch.
    #     drop_last (bool): If ``True``, the sampler will drop the last batch if
    #         its size would be less than ``batch_size``
    # Example:
    #     >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
    #     [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    #     >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
    #     [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

    # Batch sampling
    def __init__(self, sampler, batch_size, drop_last):
        if not isinstance(sampler, Sampler):
            raise ValueError("sampler should be an instance of "
                             "torch.utils.data.Sampler, but got sampler={}"
                             .format(sampler))
        if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \
                batch_size <= 0:
            raise ValueError("batch_size should be a positive integeral value, "
                             "but got batch_size={}".format(batch_size))
        if not isinstance(drop_last, bool):
            raise ValueError("drop_last should be a boolean value, but got "
                             "drop_last={}".format(drop_last))
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size
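
A BatchSampler can also be handed to a DataLoader directly through the batch_sampler argument. The sketch below does this with a toy list; note that batch_size, shuffle, sampler, and drop_last are then left at their defaults.

```python
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

data = list(range(10))  # toy dataset: sample i has the value i

# Group sequential indices into batches of 3, keeping the final short batch.
batch_sampler = BatchSampler(SequentialSampler(data), batch_size=3, drop_last=False)
loader = DataLoader(data, batch_sampler=batch_sampler)

batches = [batch.tolist() for batch in loader]
print(batches)
print(len(batch_sampler))  # ceil(10 / 3), matching the __len__ formula above
```

When batch_size is passed to a DataLoader instead, it builds an equivalent BatchSampler internally around the chosen sampler.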

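Multi-process loading

The num_workers parameter sets the number of worker subprocesses that read data in parallel (processes, not threads). Loading in parallel can speed up data reading and improve GPU and CPU utilization.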

Understanding Dataset and DataLoader in PyTorch in One Article