Reposted from: 機器學習小王子

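The main problems in computer vision include image classification, object detection, and image segmentation. For image classification, there are two routes to higher accuracy: modifying the model, and applying data-processing and training tricks. Tricks developed for image classification also work well for tasks such as object detection and image segmentation, so they are worth summarizing carefully. Based on a close reading of the papers, this article summarizes the following tricks for image classification: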

Warmup
Linear scaling learning rate
Label-smoothing
Random image cropping and patching
Knowledge Distillation
Cutout
Random erasing
Cosine learning rate decay
Mixup training
AdaBound
AutoAugment
Other classic tricks

Warmup

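The learning rate is one of the most important hyperparameters in neural network training, and there are many tricks for it. Warmup is a learning rate warm-up method mentioned in the ResNet paper [1]. At the start of training the model weights are randomly initialized (setting them all to 0 is a pitfall; see [2]), so a large learning rate at this point can make training unstable. Warmup uses a small learning rate for the first few epochs or iterations and switches to the preset learning rate once the model has stabilized. In [1], a 110-layer ResNet trained on CIFAR-10 first uses a learning rate of 0.01 until the training error drops below 80% (about 400 iterations), then continues with a learning rate of 0.1.

The method above is constant warmup. In 2017, Facebook improved on it [3], because jumping from a very small learning rate to a relatively large one can cause a sudden increase in training error. Paper [3] proposes gradual warmup to address this: start from the small initial learning rate and increase it slightly at every iteration until it reaches the larger preset learning rate.

The code for gradual warmup is as follows: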

    
    from torch.optim.lr_scheduler import _LRScheduler


    class GradualWarmupScheduler(_LRScheduler):
        """Gradually warm up (increase) the learning rate.

        Args:
            optimizer (Optimizer): Wrapped optimizer.
            multiplier: target learning rate = base lr * multiplier
            total_epoch: target learning rate is reached at total_epoch, gradually
            after_scheduler: after total_epoch, use this scheduler (e.g. StepLR)
        """

        def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
            self.multiplier = multiplier
            if self.multiplier <= 1.:
                raise ValueError('multiplier should be greater than 1.')
            self.total_epoch = total_epoch
            self.after_scheduler = after_scheduler
            self.finished = False
            super().__init__(optimizer)

        def get_lr(self):
            # After warmup: hand over to after_scheduler, or hold the target rate.
            if self.last_epoch > self.total_epoch:
                if self.after_scheduler:
                    if not self.finished:
                        self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                        self.finished = True
                    return self.after_scheduler.get_lr()
                return [base_lr * self.multiplier for base_lr in self.base_lrs]

            # During warmup: interpolate linearly from base_lr to base_lr * multiplier.
            return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]

        def step(self, epoch=None):
            if self.finished and self.after_scheduler:
                return self.after_scheduler.step(epoch)
            else:
                return super(GradualWarmupScheduler, self).step(epoch)
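
To make the warmup formula concrete, the following is a minimal plain-Python sketch of the same linear interpolation used in `get_lr` above, without any PyTorch dependency. The function name `warmup_lr` and the example values (0.01 warmed up to 0.1 over 5 steps) are illustrative assumptions, not part of the original code.

```python
def warmup_lr(base_lr, multiplier, step, total_steps):
    """Learning rate at `step` under gradual (linear) warmup."""
    if step >= total_steps:
        # Warmup is over: hold the target learning rate.
        return base_lr * multiplier
    # Linear interpolation from base_lr to base_lr * multiplier.
    return base_lr * ((multiplier - 1.0) * step / total_steps + 1.0)


# Illustrative values: warm up from 0.01 to 0.1 over 5 steps.
schedule = [warmup_lr(0.01, 10.0, step, 5) for step in range(7)]
for step, lr in enumerate(schedule):
    print(step, round(lr, 4))
```

The rate grows by the same amount at every step and then stays at the target value, which avoids the sudden jump of constant warmup.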

Linear scaling learning rate

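Linear scaling learning rate is a method proposed in [3] for training with large batch sizes.

In convex optimization, the convergence rate decreases as the batch size increases, and similar empirical results hold for neural networks. As the batch size grows, the same amount of data is processed faster, but more epochs are needed to reach the same accuracy. In other words, for the same number of epochs, a model trained with a large batch size has lower validation accuracy than one trained with a small batch size.

The gradual warmup described above is one remedy. Linear scaling of the learning rate is another effective one. In mini-batch SGD, the gradient estimate is random because each batch is sampled at random. Increasing the batch size does not change the expectation of the gradient, but it reduces its variance. In other words, a large batch size reduces the noise in the gradient, so we can increase the learning rate to speed up convergence.

The recipe is simple. In the original ResNet paper [1], the learning rate is 0.1 for a batch size of 256; when the batch size is changed to a larger value b, the learning rate should become 0.1 × b/256.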
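
The rule can be written as a one-line helper. This is only a sketch: the function name `scaled_lr` and its default arguments (0.1 at batch size 256, the ResNet setting quoted above) are illustrative assumptions.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size


# Example: going from batch size 256 to 1024 multiplies the learning rate by 4.
for b in (256, 512, 1024):
    print(b, scaled_lr(b))
```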

Label-smoothing

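In classification, the last layer is usually a fully connected layer whose outputs are matched against a one-hot encoding of the label: the true class is encoded as 1 and all others as 0. Combining this encoding with cross-entropy minimization causes problems: it encourages the model to produce very different output scores for different classes, or in other words, to be overconfident in its predictions. However, in a dataset annotated by several people, annotators may follow different criteria and each may make mistakes. Trusting the labels too much leads to overfitting.

Label-smoothing regularization (LSR) is one effective remedy. The idea is to reduce our trust in the labels, for example by lowering the target value from 1 to 0.9, or raising it from 0 to 0.1. Label smoothing was first proposed in Inception-v2 [4], which changes the true probability to the following (written in the form that the LSR code in this section implements):

$$
q_i =
\begin{cases}
1 - \varepsilon + \varepsilon / K, & i = y \\
\varepsilon / K, & \text{otherwise}
\end{cases}
$$

where ε is a small constant, K is the number of classes, y is the true label of the image, i denotes the i-th class, and q_i is the probability that the image belongs to class i.

In short, LSR is a regularization method that constrains the model and reduces overfitting by adding noise to the label y.

The LSR code is as follows: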

    
    import torch
    import torch.nn as nn


    class LSR(nn.Module):

        def __init__(self, e=0.1, reduction='mean'):
            super().__init__()

            self.log_softmax = nn.LogSoftmax(dim=1)
            self.e = e
            self.reduction = reduction

        def _one_hot(self, labels, classes, value=1):
            """
            Convert labels to one hot vectors

            Args:
                labels: torch tensor in format [label1, label2, label3, ...]
                classes: int, number of classes
                value: label value in one hot vector, default to 1

            Returns:
                return one hot format labels in shape [batchsize, classes]
            """
            one_hot = torch.zeros(labels.size(0), classes, device=labels.device)

            # labels and value_added size must match
            labels = labels.view(labels.size(0), -1)
            value_added = torch.full((labels.size(0), 1), float(value), device=labels.device)

            one_hot.scatter_add_(1, labels, value_added)

            return one_hot

        def _smooth_label(self, target, length, smooth_factor):
            """Convert targets to one-hot format, and smooth them.

            Args:
                target: target in form with [label1, label2, label_batchsize]
                length: length of one-hot format (number of classes)
                smooth_factor: smooth factor for label smooth

            Returns:
                smoothed labels in one hot format
            """
            one_hot = self._one_hot(target, length, value=1 - smooth_factor)
            one_hot += smooth_factor / length

            return one_hot.to(target.device)

        def forward(self, x, target):
            # NOTE: forward() is missing from the listing above; this minimal
            # version is an assumption. x: logits [batchsize, classes],
            # target: class indices [batchsize].
            smoothed_target = self._smooth_label(target, x.size(1), self.e)
            loss = torch.sum(-self.log_softmax(x) * smoothed_target, dim=1)
            if self.reduction == 'none':
                return loss
            if self.reduction == 'sum':
                return loss.sum()
            return loss.mean()
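
For a single sample, the smoothing step and the resulting loss can be checked by hand. The sketch below uses plain Python only; the helper names `smooth_labels` and `cross_entropy` and the example logits are illustrative assumptions. It follows the same convention as `_smooth_label` above: the true class gets 1 - ε, then ε/K is added to every class.

```python
import math


def smooth_labels(label, num_classes, eps=0.1):
    """Smoothed target: eps / K on every class, plus (1 - eps) on the true class."""
    q = [eps / num_classes] * num_classes
    q[label] += 1.0 - eps
    return q


def cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(q * (v - log_z) for q, v in zip(target_dist, logits))


logits = [1.0, 0.5, 3.0, -1.0]            # example scores for K = 4 classes
hard = [0.0, 0.0, 1.0, 0.0]               # one-hot target, true class 2
soft = smooth_labels(label=2, num_classes=4, eps=0.1)

print(soft)                                # roughly [0.025, 0.025, 0.925, 0.025]
print(cross_entropy(logits, hard), cross_entropy(logits, soft))
```

Because some probability mass is moved to the wrong classes, the smoothed loss stays above the one-hot loss for a confident correct prediction, which is what discourages overconfident outputs.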

Random image cropping and patching

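Random image cropping and patching (RICAP) [7] randomly crops a region from each of four images, patches the crops together into a single image, and mixes the labels of the four images.

RICAP achieves an error rate of 2.19% on CIFAR-10.

As shown in the figure below, I_x and I_y are the width and height of the original image. w and h are called the boundary position, which determines the sizes of the four cropped patches. w and h are generated at random from a beta distribution Beta(β, β), scaled by I_x and I_y respectively, where β is a hyperparameter of RICAP. The patched image has the same size as the original image.

The RICAP code is as follows: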

    
    import numpy as np
    import torch

    # train_loader, model, criterion and accuracy are assumed to be defined
    # elsewhere; the model is assumed to live on the GPU.
    beta = 0.3  # hyperparameter
    for (images, targets) in train_loader:

        # get the image size
        I_x, I_y = images.size()[2:]

        # draw a boundary position (w, h)
        w = int(np.round(I_x * np.random.beta(beta, beta)))
        h = int(np.round(I_y * np.random.beta(beta, beta)))
        w_ = [w, I_x - w, w, I_x - w]
        h_ = [h, h, I_y - h, I_y - h]

        # select and crop four images
        cropped_images = {}
        c_ = {}
        W_ = {}
        for k in range(4):
            index = torch.randperm(images.size(0))
            x_k = np.random.randint(0, I_x - w_[k] + 1)
            y_k = np.random.randint(0, I_y - h_[k] + 1)
            cropped_images[k] = images[index][:, :, x_k:x_k + w_[k], y_k:y_k + h_[k]]
            c_[k] = targets[index].cuda()
            # label weight = area of patch k relative to the whole image
            W_[k] = w_[k] * h_[k] / (I_x * I_y)

        # patch cropped images
        patched_images = torch.cat(
            (torch.cat((cropped_images[0], cropped_images[1]), 2),
             torch.cat((cropped_images[2], cropped_images[3]), 2)),
            3)
        patched_images = patched_images.cuda()

        # get output
        output = model(patched_images)

        # calculate loss and accuracy
        loss = sum([W_[k] * criterion(output, c_[k]) for k in range(4)])
        acc = sum([W_[k] * accuracy(output, c_[k])[0] for k in range(4)])
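
The label mixing in RICAP is driven entirely by patch areas. The plain-Python sketch below isolates that step; the function name `ricap_weights`, the 32×32 image size, the boundary position (8, 24), and the class ids are illustrative assumptions.

```python
def ricap_weights(I_x, I_y, w, h):
    """Area fraction of each of the four patches for boundary position (w, h)."""
    w_ = [w, I_x - w, w, I_x - w]
    h_ = [h, h, I_y - h, I_y - h]
    return [w_[k] * h_[k] / (I_x * I_y) for k in range(4)]


weights = ricap_weights(32, 32, 8, 24)
print(weights)  # four area fractions that sum to 1

# Mix the labels of the four source images (hypothetical class ids).
classes = [3, 3, 7, 1]
mixed_label = {}
for c, wt in zip(classes, weights):
    mixed_label[c] = mixed_label.get(c, 0.0) + wt
print(mixed_label)
```

Since the four patches tile the whole image, the weights always sum to 1, so the mixed label is a valid probability distribution.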

Knowledge Distillation

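A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and average their predictions. However, predicting with a whole ensemble is cumbersome and may be too computationally expensive to deploy to a large number of users. Knowledge distillation [8] is one effective way to deal with this problem.

In knowledge distillation, a teacher model helps train the current model (the student). The teacher is a pre-trained model with higher accuracy, so the student can gain accuracy while its model complexity stays unchanged. For example, ResNet-152 can serve as the teacher for a ResNet-50 student. During training, a distillation loss is added to penalize the difference between the outputs of the student and the teacher.

Given an input, let p be the true probability distribution, and let z and r be the outputs of the last fully connected layer of the student and the teacher, respectively. Previously we used the cross-entropy loss ℓ(p, softmax(z)) to measure the difference between p and z; the distillation loss also uses cross-entropy. The total loss with knowledge distillation is therefore

$$
\ell\big(p, \mathrm{softmax}(z)\big) + T^2 \, \ell\big(\mathrm{softmax}(r/T), \mathrm{softmax}(z/T)\big)
$$

In the formula above, the first term is the original loss, and the second is the added distillation loss that penalizes the difference between the student and teacher outputs. T is a temperature hyperparameter that makes the softmax outputs smoother. Experiments show that using ResNet-152 as the teacher to train ResNet-50 improves the latter's accuracy.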
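
The two terms can be computed by hand for a single sample. The following plain-Python sketch assumes the common form of the total loss, a cross-entropy against the true labels plus T² times a cross-entropy between the temperature-softened teacher and student outputs; the function names, the example logits, and T = 20 are illustrative assumptions.

```python
import math


def softmax(logits, T=1.0):
    """Softmax with temperature T; a larger T gives a smoother distribution."""
    scaled = [v / T for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred):
    """Cross-entropy l(target, pred) between two distributions."""
    return -sum(t * math.log(q) for t, q in zip(target, pred) if t > 0)


def distillation_loss(p, z, r, T=20.0):
    """l(p, softmax(z)) + T^2 * l(softmax(r / T), softmax(z / T))."""
    hard_term = cross_entropy(p, softmax(z))
    soft_term = cross_entropy(softmax(r, T), softmax(z, T))
    return hard_term + T * T * soft_term


p = [0.0, 1.0, 0.0]      # true distribution (one-hot)
z = [0.2, 1.5, -0.3]     # student logits (example values)
r = [0.1, 3.0, -1.0]     # teacher logits (example values)
loss = distillation_loss(p, z, r, T=20.0)
print(loss)
```

In practice the teacher logits r are computed without gradients, so only the student is updated by this loss.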

Cutout

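Cutout [9] is a regularization method. It randomly masks out part of each image during training, which improves the model's robustness. It is motivated by the object occlusion problem that is common in computer vision tasks. Generating occluded-looking samples with cutout not only helps the model cope with occlusion, but also makes it take more of the surrounding context into account when making decisions.

The code is as follows: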

    
    import torch
    import numpy as np


    class Cutout(object):
        """Randomly mask out one or more patches from an image.

        Args:
            n_holes (int): Number of patches to cut out of each image.
            length (int): The length (in pixels) of each square patch.
        """
        def __init__(self, n_holes, length):
            self.n_holes = n_holes
            self.length = length

        def __call__(self, img):
            """
            Args:
                img (Tensor): Tensor image of size (C, H, W).
            Returns:
                Tensor: Image with n_holes of dimension length x length cut out of it.
            """
            h = img.size(1)
            w = img.size(2)

            mask = np.ones((h, w), np.float32)

            for n in range(self.n_holes):
                # pick a random center; the patch is clipped at the image border
                y = np.random.randint(h)
                x = np.random.randint(w)

                y1 = np.clip(y - self.length // 2, 0, h)
                y2 = np.clip(y + self.length // 2, 0, h)
                x1 = np.clip(x - self.length // 2, 0, w)
                x2 = np.clip(x + self.length // 2, 0, w)

                mask[y1: y2, x1: x2] = 0.

            # broadcast the (H, W) mask over all channels and zero out the patches
            mask = torch.from_numpy(mask)
            mask = mask.expand_as(img)
            img = img * mask

            return img

The effect is shown in the figure below: a small part of each image has been cut out.

Random erasing

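Random erasing [6] is very similar to cutout; it is also a data augmentation method that simulates object occlusion. The difference is that cutout sets the pixels in a randomly chosen rectangular region to 0, which amounts to cropping the region out, whereas random erasing replaces the original pixel values with random values or with the mean pixel value of the dataset. In addition, the region removed by cutout has a fixed size, while the region replaced by random erasing has a random size.

The random erasing code is as follows: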

    
    from __future__ import absolute_import
    from torchvision.transforms import *
    from PIL import Image
    import random
    import math
    import numpy as np
    import torch


    class RandomErasing(object):
        '''
        probability: The probability that the operation will be performed.
        sl: min erasing area
        sh: max erasing area
        r1: min aspect ratio
        mean: erasing value
        '''
        def __init__(self, probability=0.5, sl=0.02, sh=0.4, r1=0.3, mean=[0.4914, 0.4822, 0.4465]):
            self.probability = probability
            self.mean = mean
            self.sl = sl
            self.sh = sh
            self.r1 = r1

        def __call__(self, img):

            if random.uniform(0, 1) > self.probability:
                return img

            # try up to 100 times to sample a rectangle that fits inside the image
            for attempt in range(100):
                area = img.size()[1] * img.size()[2]

                target_area = random.uniform(self.sl, self.sh) * area
                aspect_ratio = random.uniform(self.r1, 1 / self.r1)

                h = int(round(math.sqrt(target_area * aspect_ratio)))
                w = int(round(math.sqrt(target_area / aspect_ratio)))

                if w < img.size()[2] and h < img.size()[1]:
                    x1 = random.randint(0, img.size()[1] - h)
                    y1 = random.randint(0, img.size()[2] - w)
                    if img.size()[0] == 3:
                        img[0, x1:x1+h, y1:y1+w] = self.mean[0]
                        img[1, x1:x1+h, y1:y1+w] = self.mean[1]
                        img[2, x1:x1+h, y1:y1+w] = self.mean[2]
                    else:
                        img[0, x1:x1+h, y1:y1+w] = self.mean[0]
                    return img

            return img

Cosine learning rate decay

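After warmup, steadily decaying the learning rate is a good way to improve accuracy. Options include step decay and cosine decay: the former multiplies the learning rate by a constant factor (for example 0.1) every fixed number of epochs, while the latter lowers the learning rate along a cosine curve as training proceeds.

For cosine decay, assume there are T batches in total (ignoring the warmup stage). At batch t, the learning rate η_t is:

$$
\eta_t = \frac{1}{2}\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)\eta
$$

Here η is the initial learning rate. This schedule is called cosine decay.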
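
As a quick numerical check, the sketch below evaluates the standard cosine schedule, η_t = ½(1 + cos(tπ/T))·η, in plain Python. The function name `cosine_lr` and the example values (T = 100, η = 0.1) are illustrative assumptions.

```python
import math


def cosine_lr(t, T, eta):
    """Cosine-decayed learning rate at batch t out of T, starting from eta."""
    return 0.5 * (1.0 + math.cos(t * math.pi / T)) * eta


# The rate starts at eta, passes eta / 2 at the midpoint, and reaches 0 at t = T.
for t in (0, 25, 50, 75, 100):
    print(t, round(cosine_lr(t, 100, 0.1), 5))
```

The decay is slow at the beginning and at the end and roughly linear in the middle.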

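Below is a visualization of learning rate decay with warmup, from the Bag of Tricks paper [4]. Panel (a) shows the learning rate decreasing over the epochs; cosine decay is smoother than step decay. Panel (b) shows accuracy over the epochs; the final accuracies of the two do not differ much, but the learning process with cosine decay is smoother.

PyTorch's torch.optim.lr_scheduler offers more learning rate decay methods. Which one works best probably depends on the problem. Step decay is used as follows: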

    
    # Assuming optimizer uses lr = 0.05 for all groups
    # lr = 0.05     if epoch < 30
    # lr = 0.005    if 30 <= epoch < 60
    # lr = 0.0005   if 60 <= epoch < 90

    from torch.optim.lr_scheduler import StepLR

    scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
    for epoch in range(100):
        train(...)
        validate(...)
        # step the scheduler once per epoch, after the optimizer updates
        scheduler.step()
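
The values listed in the comments of the snippet above follow a simple closed form: the base rate times gamma raised to the number of completed steps. A plain-Python sketch, with the function name `step_decay_lr` as an illustrative assumption:

```python
def step_decay_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Step decay: multiply the learning rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 59, 60, 89):
    print(epoch, step_decay_lr(0.05, epoch))
```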

Mixup training

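Mixup [10] is a data augmentation method. In mixup training, two images are drawn each time and linearly combined into a new image, which is then used as a new training sample. In the formulas below, x denotes image data and y denotes the label, and (x̂, ŷ) is the new sample:

$$
\hat{x} = \lambda x_i + (1 - \lambda) x_j
$$

$$
\hat{y} = \lambda y_i + (1 - \lambda) y_j
$$

where λ is a random number in [0, 1] sampled from Beta(α, α). During training, only (x̂, ŷ) is used.

Mixup mainly strengthens linear behavior between training samples, which improves the network's generalization. However, mixup needs more training time to converge well.

The mixup code is as follows: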

    
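    import numpy as np
    import torch

    # mixup_alpha (the Beta hyperparameter), train_loader, model, criterion
    # and accuracy are assumed to be defined elsewhere.
    for (images, labels) in train_loader:

        # mixing coefficient lambda ~ Beta(alpha, alpha)
        l = np.random.beta(mixup_alpha, mixup_alpha)

        # pair every sample with a randomly chosen sample from the same batch
        index = torch.randperm(images.size(0))
        images_a, images_b = images, images[index]
        labels_a, labels_b = labels, labels[index]

        mixed_images = l * images_a + (1 - l) * images_b

        outputs = model(mixed_images)
        loss = l * criterion(outputs, labels_a) + (1 - l) * criterion(outputs, labels_b)
        acc = l * accuracy(outputs, labels_a)[0] + (1 - l) * accuracy(outputs, labels_b)[0]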
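
The training loop above mixes the labels implicitly by weighting two losses. The plain-Python sketch below shows the equivalent explicit mixing of one pair of samples with one-hot labels; the function name `mixup_pair`, the toy two-pixel "images", and α = 0.2 are illustrative assumptions.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Mix two samples with a coefficient lam drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x_hat = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_hat = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_hat, y_hat, lam


rng = random.Random(0)                        # seeded for reproducibility
x_hat, y_hat, lam = mixup_pair(
    [0.0, 1.0], [1.0, 0.0],                   # sample i: pixels, one-hot label
    [1.0, 0.0], [0.0, 1.0],                   # sample j: pixels, one-hot label
    alpha=0.2, rng=rng)
print(lam, x_hat, y_hat)
```

Because cross-entropy is linear in the target distribution, training on the mixed label ŷ gives the same loss as the weighted sum of the two per-label losses used in the loop above.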

AdaBound

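AdaBound was proposed in a recent paper [5]. According to its authors, AdaBound trains as fast as Adam and as well as SGD.

As shown in the figure below, AdaBound converges faster, trains more smoothly, and reaches better results.

In addition, the method is less sensitive to hyperparameter changes than SGD, that is, it is more robust. Hyperparameters still need to be tuned for each problem, but this may take less time.

That said, AdaBound has not yet been widely validated, and it may work well only on certain problems.

Usage is as follows. Install AdaBound: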

    
     pip install adabound

Use AdaBound (same usage as other PyTorch optimizers):

    
    import adabound

    optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)

AutoAugment

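Data augmentation plays an important role in image classification, but there are many augmentation methods, and applying all of them at once is not necessarily best. So how do we choose the best augmentation? AutoAugment [11] is a method that searches for augmentation methods suited to the problem at hand. It defines a search space of augmentation policies and uses a search algorithm to select the policy that suits a particular dataset. Moreover, a policy learned on one dataset transfers well to other, similar datasets.

AutoAugment's results on CIFAR-10 are shown in the table below; it reaches an accuracy of 98.52%.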

Other classic tricks

Commonly used regularization and data augmentation methods:

Dropout
L1/L2 regularization
Batch Normalization
Early stopping
Random cropping
Mirroring
Rotation
Color shifting
PCA color augmentation
...

Others

Xavier init[12]
...

References

[1] Deep Residual Learning for Image Recognition (https://arxiv.org/pdf/1512.03385.pdf)

[2] http://cs231n.github.io/neural-networks-2/

[3] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/pdf/1706.02677v2.pdf)

[4] Rethinking the Inception Architecture for Computer Vision (https://arxiv.org/pdf/1512.00567v3.pdf)

[4] Bag of Tricks for Image Classification with Convolutional Neural Networks (https://arxiv.org/pdf/1812.01187.pdf)

[5] Adaptive Gradient Methods with Dynamic Bound of Learning Rate (https://www.luolc.com/publications/adabound/)

[6] Random erasing (https://arxiv.org/pdf/1708.04896v2.pdf)

[7] RICAP (https://arxiv.org/pdf/1811.09030.pdf)

[8] Distilling the Knowledge in a Neural Network (https://arxiv.org/pdf/1503.02531.pdf)

[9] Improved Regularization of Convolutional Neural Networks with Cutout (https://arxiv.org/pdf/1708.04552.pdf)

[10] mixup: Beyond Empirical Risk Minimization (https://arxiv.org/pdf/1710.09412.pdf)

[11] AutoAugment: Learning Augmentation Policies from Data (https://arxiv.org/pdf/1805.09501.pdf)

[12] Understanding the difficulty of training deep feedforward neural networks (http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
