
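Author: 土豆 (Zhihu)

Source: https://zhuanlan.zhihu.com/p/180020358

This article is shared for academic purposes only; copyright belongs to the author. In case of infringement, please contact the account to have the article removed.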

Editor's summary

This article summarizes 10 pitfalls in everyday PyTorch usage, to help users avoid unnecessary trouble.

Cross-entropy in PyTorch

PyTorch's cross-entropy loss, nn.CrossEntropyLoss, has the softmax operation built in (it combines a log-softmax with the negative log-likelihood loss). You therefore only need to feed it the raw, unnormalized model outputs (logits), and should not add a softmax layer in front of it. This mirrors TensorFlow's tf.nn.softmax_cross_entropy_with_logits. [1][2]
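To make the built-in softmax concrete, here is a minimal sketch that compares nn.CrossEntropyLoss on raw logits with a manual log-softmax followed by the negative log-likelihood loss. The tensor shapes and label values are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw scores: 4 samples, 3 classes
labels = torch.tensor([0, 2, 1, 2])   # class indices

# CrossEntropyLoss applied directly to the raw logits.
ce = nn.CrossEntropyLoss()(logits, labels)

# The same quantity computed by hand: log-softmax, then NLL.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Adding an extra softmax in front silently changes the loss value.
double_softmax = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), labels)

print(ce.item(), manual.item(), double_softmax.item())
```

The first two values should agree, while the third differs. No error is raised in the third case, which is what makes an extra softmax layer easy to miss.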

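MSELoss and KLDivLoss in PyTorch

Mean squared error (MSELoss) and KL divergence (KLDivLoss) are two commonly used losses in deep learning, and PyTorch provides both. For example:

    import torch
    import torch.nn as nn

    loss = nn.MSELoss()
    input = torch.randn(3, 5, requires_grad=True)
    target = torch.randn(3, 5)
    output = loss(input, target)
    output.backward()

Note that the label, target, must be a value that is not trainable, i.e. one with requires_grad=False. Otherwise, in the PyTorch versions this article was written against, an error such as the following is raised:

    AssertionError: nn criterions don't compute the gradient w.r.t. targets - please mark these variables as volatile or not requiring gradients

This applies not only to MSELoss: many other losses, such as cross-entropy and KL divergence, also require a non-trainable target. This differs from TensorFlow's tf.nn.softmax_cross_entropy_with_logits_v2, which accepts a trainable target; see [3] for details.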
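When the target itself comes out of another network, one common way to satisfy this requirement is to detach it from the graph. The sketch below assumes a hypothetical two-network setup (the names student and teacher are illustrative only and do not come from the original article).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
student = nn.Linear(5, 5)   # network being trained
teacher = nn.Linear(5, 5)   # hypothetical network that produces the target
x = torch.randn(3, 5)

prediction = student(x)
target = teacher(x).detach()   # cut the target out of the autograd graph

loss = nn.MSELoss()(prediction, target)
loss.backward()

print(target.requires_grad)              # the target is not trainable
print(student.weight.grad is not None)   # the student received gradients
print(teacher.weight.grad is None)       # the teacher did not
```

Detaching makes the intent explicit: gradients flow only into the network that produced the prediction, whichever PyTorch version is in use.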

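Disabling gradients during validation and testing (no_grad)

During training we usually monitor the model's performance by evaluating it on the validation set [4] after every few epochs. Validation and testing only need a forward pass, so there is no need to keep the information required to compute gradients. Keeping it costs extra GPU or host memory and can even cause OOM (Out Of Memory) errors during validation, so it is best to disable gradient tracking explicitly in these phases. In PyTorch 0.4 and later, the torch.no_grad() context manager does this:

    model.train()
    # train the model here (code omitted)
    model.eval()  # start evaluating the model
    with torch.no_grad():
        for each in eval_data:
            data, label = each
            logit = model(data)
            ...  # remaining evaluation code omitted

As shown above, wrapping the evaluation loop in this context manager is all that is needed to disable gradients. Earlier PyTorch versions achieved the same effect by setting volatile=True, but that usage has since been deprecated.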
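As a small sanity check, the sketch below runs the same forward pass with and without torch.no_grad() and inspects whether autograd recorded a graph. The toy model and input shapes are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
data = torch.randn(4, 10)

tracked = model(data)          # autograd records this forward pass

model.eval()
with torch.no_grad():
    untracked = model(data)    # no graph is recorded inside the context

print(tracked.requires_grad, tracked.grad_fn)      # True, an autograd node
print(untracked.requires_grad, untracked.grad_fn)  # False, None
```

The values of both outputs are the same; only the bookkeeping for backpropagation, and the memory it occupies, is skipped.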

Explicitly calling `model.train()` and `model.eval()`

Models often contain submodules that behave differently during training and testing, such as dropout [6] (the drop rate) and Batch Normalization [5] (the mean and variance statistics). We therefore need to state explicitly which phase the model is in (training or testing), which PyTorch does through model.train() and model.eval():

    model = CNNNet(params)

    # start training
    model.train()
    for each in train_data:
        data, label = each
        logit = model(data)
        loss = criterion(logit, label)
        ...  # remaining training code omitted

    # start evaluation
    model.eval()
    with torch.no_grad():  # gradients are not needed in the eval phase
        for each in eval_data:
            data, label = each
            logit = model(data)
            loss = criterion(logit, label)
            ...  # remaining evaluation code omitted

Note: whenever the model contains BN or dropout layers, train() and eval() must be called explicitly for the training and testing phases.
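The effect of the two calls can be observed directly. The following sketch uses a tiny made-up model with an nn.Dropout layer and checks the training flag of every submodule, as well as whether repeated forward passes agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(64, 4)

model.eval()                                   # propagates to all submodules
eval_flags = [m.training for m in model.modules()]
eval_a, eval_b = model(x), model(x)            # dropout is a no-op here

model.train()
train_flags = [m.training for m in model.modules()]
train_a, train_b = model(x), model(x)          # a fresh random mask each call

print(eval_flags, torch.equal(eval_a, eval_b))
print(train_flags, torch.equal(train_a, train_b))
```

In eval mode the two outputs are identical; in train mode they differ because each call draws a new dropout mask.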

Using `retain_graph`

To backpropagate a loss in PyTorch, simply call out.backward(). A small example:

    import torch
    import torch.nn as nn
    import numpy as np

    class net(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(10, 2)
            self.act = nn.ReLU()

        def forward(self, inputv):
            return self.act(self.fc1(inputv))

    n = net()
    opt = torch.optim.Adam(n.parameters(), lr=3e-4)
    inputv = torch.tensor(np.random.normal(size=(4, 10))).float()
    output = n(inputv)
    target = torch.tensor(np.ones((4, 2))).float()
    loss = nn.functional.mse_loss(output, target)
    loss.backward()  # compute the gradient w.r.t. the leaf nodes

Backpropagating loss yields the gradient of the loss with respect to every leaf node. The documentation of the .backward() API lists several parameters:

    backward(gradient=None, retain_graph=None, create_graph=False)

Here we focus on retain_graph. If it is False or None, the graph that was built is freed once backpropagation finishes; if it is True, the graph is kept [7][8].

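This raises a question: once the gradients have been computed, why keep the graph at all? Why not free it and wait for the next iteration, and is keeping it not a waste of memory? Following the discussion in [7], we briefly explain why the graph must be retained in some cases. We build the graph in code:

    import torch

    a = torch.rand(1, 4, requires_grad=True)
    b = a**2
    c = b*2
    d = c.mean()
    e = c.sum()

Suppose we first compute gradients from the end node d:

    d.backward()

The problem is that, because we did not explicitly ask for the graph to be kept, the system frees the graph's memory after this backward pass. If the next step needs gradients from node e, it fails because the graph is gone. Hence the following example:

    d.backward(retain_graph=True)  # fine
    e.backward(retain_graph=True)  # fine
    d.backward()  # also fine
    e.backward()  # error will occur!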
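The same behaviour can be reproduced as a self-contained script. This is a minimal sketch using the graph from the text (a, b, c, d, e); the try/except wrapper and the helper variable names are additions for illustration.

```python
import torch

a = torch.rand(1, 4, requires_grad=True)
b = a ** 2
c = b * 2
d = c.mean()
e = c.sum()

d.backward(retain_graph=True)     # the shared part of the graph is kept
grad_after_d = a.grad.clone()     # d = mean(2 * a^2), so the gradient is a

e.backward()                      # no retain_graph: buffers are freed afterwards
grad_total = a.grad.clone()       # gradients accumulate: a + 4 * a

second_pass_failed = False
try:
    d.backward()                  # needs the buffers that were just freed
except RuntimeError as err:
    second_pass_failed = True
    print("backward failed:", err)

print(grad_after_d, grad_total, second_pass_failed)
```

Note that the gradients from d and e are summed into a.grad, which is the accumulation behaviour that clearing gradients is meant to reset.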

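This property is useful in some scenarios. In a generative adversarial network (GAN), for example, one module is trained first and another afterwards (in the code below, the discriminator and then the generator), so the network has two or more losses:

    G_loss = ...
    D_loss = ...

    opt.zero_grad()  # zero all gradients
    D_loss.backward(retain_graph=True)  # keep the graph; it is needed again below
    opt.step()  # update parameters; only D is updated, because only D's gradients are nonzero

    opt.zero_grad()  # zero all gradients
    G_loss.backward(retain_graph=False)  # the graph can be freed now;
    # the forward pass of the next iteration will build it again
    opt.step()  # update parameters; only G is updated, because only G's gradients are nonzero

In this way, the multiple losses of a network can be trained step by step.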

Gradient accumulation: training with a large `batch_size` when memory is tight

The retain_graph discussion above also leads to gradient accumulation, which, when GPU memory is tight, can be equivalent to training with a larger batch_size. The key point is that calling .backward() computes the gradient of the loss with respect to each node and stores the result on that node. Unless opt.zero_grad() resets them, the gradients keep accumulating with every call to .backward(), which amounts to training with a large batch_size. A few examples illustrate this.

    import torch
    import torch.nn as nn
    import numpy as np

    class net(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(10, 2)
            self.act = nn.ReLU()

        def forward(self, inputv):
            return self.act(self.fc1(inputv))

    n = net()
    inputv = torch.tensor(np.random.normal(size=(4, 10))).float()
    output = n(inputv)
    target = torch.tensor(np.ones((4, 2))).float()
    loss = nn.functional.mse_loss(output, target)
    loss.backward(retain_graph=True)
    opt = torch.optim.Adam(n.parameters(), lr=0.01)
    for each in n.parameters():
        print(each.grad)

First output:

    tensor([[ 0.0493, -0.0581, -0.0451,  0.0485,  0.1147,  0.1413, -0.0712, -0.1459,
              0.1090, -0.0896],
            [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
              0.0000,  0.0000]])
    tensor([-0.1192,  0.0000])

Running loss.backward(retain_graph=True) once more gives:

    tensor([[ 0.0987, -0.1163, -0.0902,  0.0969,  0.2295,  0.2825, -0.1424, -0.2917,
              0.2180, -0.1792],
            [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
              0.0000,  0.0000]])
    tensor([-0.2383,  0.0000])

Likewise, after the third call:

    tensor([[ 0.1480, -0.1744, -0.1353,  0.1454,  0.3442,  0.4238, -0.2136, -0.4376,
              0.3271, -0.2688],
            [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
              0.0000,  0.0000]])
    tensor([-0.3575,  0.0000])

After calling opt.zero_grad() once, the output is (in the PyTorch version used here; newer releases may reset the gradients to None instead of zero tensors):

    tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
    tensor([0., 0.])

Now it should be clear why we normally call opt.zero_grad() when computing gradients: we do not want the current gradients to be affected by the previous ones. In some situations, though, this 'effect' can be put to use.
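To show how accumulation can stand in for a larger batch, here is a minimal sketch that compares the gradient of one batch of 8 samples with the gradient accumulated over 4 micro-batches of 2 samples. The model, the data and the name accum_steps are illustrative assumptions; dividing each micro-batch loss by accum_steps is what makes the two gradients match for a mean-reduced loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
data = torch.randn(8, 10)
target = torch.randn(8, 2)

# Reference: the gradient from one full batch of 8 samples.
model.zero_grad()
nn.functional.mse_loss(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2 samples, no zero_grad in between.
model.zero_grad()
accum_steps = 4
for x, y in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()               # gradients add up in .grad
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

In a real training loop, the optimizer step and the gradient reset would come once after the inner loop. The equivalence is exact only for layers whose output does not depend on batch statistics, so models with Batch Normalization behave slightly differently.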

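The mischievous `dropout`

The signature of torch.nn.functional.dropout is:

    torch.nn.functional.dropout(input, p=0.5, training=True, inplace=False)

Note the training argument, which states whether we are in the training phase, i.e. whether neuron outputs should be randomly dropped. You must set it yourself; calling model.train() or model.eval() does not change it. This differs from torch.nn.Dropout, which is a layer (Layer) and follows the model's mode, whereas the functional version is just a function and cannot record state [9].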
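A common way to handle this is to pass the module's own training flag into the functional call. The sketch below assumes a hypothetical module named FunctionalDropoutNet, written only for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FunctionalDropoutNet(nn.Module):
    """Hypothetical module that forwards its own mode to F.dropout."""

    def forward(self, x):
        # self.training is toggled by model.train() / model.eval()
        return F.dropout(x, p=0.5, training=self.training)

net = FunctionalDropoutNet()
x = torch.ones(1000)

net.eval()
eval_out = net(x)                 # identity: nothing is dropped

net.train()
train_out = net(x)                # roughly half the entries are zeroed

# With the default training=True, dropout stays active regardless of mode.
net.eval()
default_out = F.dropout(x, p=0.5)

print(torch.equal(eval_out, x))
print((train_out == 0).any().item(), (default_out == 0).any().item())
```

The last call shows the pitfall: although the module is in eval mode, the bare functional call still drops values because its training argument defaults to True.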

Hey, check yourself, I'm talking to you, `index_select`

torch.index_select() indexes the elements of a given tensor along one dimension. Its API reference reads:

    torch.index_select(input, dim, index, out=None) → Tensor

    Parameters:
        input (Tensor): the input tensor to be indexed
        dim (int): the dimension along which to index
        index (LongTensor): a 1-D tensor containing the indices
        out (Tensor, optional): the output tensor; may be omitted

Its purpose is simple. Suppose the input tensor has shape 1000 * 10, where 1000 is the number of samples and 10 is the number of features. To pick out specific samples, say samples 1-100, 300-400 and so on, build an index tensor and apply torch.index_select(). For example:

    >>> x = torch.randn(3, 4)
    >>> x
    tensor([[ 0.1427,  0.0231, -0.5414, -1.0009],
            [-0.4664,  0.2647, -0.1228, -1.1068],
            [-1.1734, -0.6571,  0.7230, -0.6004]])
    >>> indices = torch.tensor([0, 2])
    >>> torch.index_select(x, 0, indices)  # index by row
    tensor([[ 0.1427,  0.0231, -0.5414, -1.0009],
            [-1.1734, -0.6571,  0.7230, -0.6004]])
    >>> torch.index_select(x, 1, indices)  # index by column
    tensor([[ 0.1427, -0.5414],
            [-0.4664, -0.1228],
            [-1.1734,  0.7230]])

There is a catch, however: when running on the GPU, PyTorch does not appear to check whether index is out of bounds. If your index is out of range, the error may surface not at the index_select() call but somewhere in later code, so keep an eye on your index. Also remember that index must be a LongTensor.
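One defensive option is to validate the index yourself before the call, so that a bad index fails at the call site. The helper below, checked_index_select, is a hypothetical wrapper written for this illustration and is not part of PyTorch. Note that on the GPU the bounds check forces a device synchronization, so it is better suited to debugging than to hot loops.

```python
import torch

def checked_index_select(x, dim, index):
    """Hypothetical helper: validate `index`, then call torch.index_select."""
    if index.dtype != torch.long:
        raise TypeError("index must be a LongTensor")
    size = x.size(dim)
    if index.numel() > 0 and (index.min() < 0 or index.max() >= size):
        raise IndexError(f"index out of range for dim {dim} with size {size}")
    return torch.index_select(x, dim, index)

x = torch.randn(1000, 10)                       # 1000 samples, 10 features
idx = torch.cat([torch.arange(0, 100), torch.arange(300, 400)])
rows = checked_index_select(x, 0, idx)          # selects 200 samples
print(rows.shape)

try:
    checked_index_select(x, 0, torch.tensor([1000]))   # one past the end
except IndexError as err:
    print("caught:", err)
```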

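Updating on the quiet: the BN layer is a little cutie

In training mode, a BN layer's statistics running_mean and running_var are updated as soon as forward() is called. This differs from ordinary parameters and easily causes confusion. The full explanation is long, so please refer to [11].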
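A short experiment makes this visible. The sketch below, with a made-up input, runs a forward pass through nn.BatchNorm1d in train mode without any backward pass or optimizer step, and then again in eval mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(16, 3) + 5.0          # batch whose mean is clearly nonzero

before = bn.running_mean.clone()      # initialised to zeros

bn.train()
with torch.no_grad():                 # no backward, no optimizer step
    bn(x)
after_train = bn.running_mean.clone() # already moved towards the batch mean

bn.eval()
bn(x)
after_eval = bn.running_mean.clone()  # unchanged in eval mode

print(before, after_train, after_eval)
```

With the default momentum of 0.1, the running mean after one training-mode forward pass is 0.1 times the batch mean. Merely running data through a model left in train mode therefore changes its state.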

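Issues with `F.interpolate`

We often need to interpolate images, and PyTorch does provide this for images represented as tensors, through the function torch.nn.functional.interpolate [12]. This function has a peculiarity, though: it operates on batches of images. If you try to interpolate with the following code:

    image = torch.rand(3, 112, 112)  # image with C = 3, H = 112, W = 112
    image = torch.nn.functional.interpolate(image, size=(224, 224))

an error is raised. For a 3-D input, interpolate treats the first dimension (3) as batch_size and the second as channels, so only the last dimension (W) is resized and size accepts only a single integer. To interpolate over both H and W, add a batch dimension first:

    image = torch.rand(3, 112, 112)  # image with C = 3, H = 112, W = 112
    image = image.unsqueeze(0)  # shape becomes (1, 3, 112, 112)
    image = torch.nn.functional.interpolate(image, size=(224, 224))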
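If single images are resized often, the unsqueeze step can be wrapped once. The function resize_image below is a hypothetical convenience helper, not a PyTorch API. It adds the batch dimension, calls interpolate, and removes the batch dimension again.

```python
import torch
import torch.nn.functional as F

def resize_image(image, size, mode="nearest"):
    """Hypothetical helper: resize one (C, H, W) image tensor."""
    batched = image.unsqueeze(0)                        # (1, C, H, W)
    resized = F.interpolate(batched, size=size, mode=mode)
    return resized.squeeze(0)                           # (C, H_out, W_out)

image = torch.rand(3, 112, 112)
resized = resize_image(image, (224, 224))
print(resized.shape)
```

The default mode here is "nearest", matching the default of interpolate. Other modes such as "bilinear" can be passed through the mode argument.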

References

[1]. Why does CrossEntropyLoss include the softmax function? https://discuss.pytorch.org/t/why-does-crossentropyloss-include-the-softmax-function/4420

[2]. Do I need to use softmax before nn.CrossEntropyLoss()? https://discuss.pytorch.org/t/do-i-need-to-use-softmax-before-nn-crossentropyloss/16739/2

[3]. tf.nn.softmax_cross_entropy_with_logits will be deprecated in the future (in Chinese), https://blog.csdn.net/LoseInVain/article/details/80932605

[4]. The differences between training, test and validation sets, and cross-validation (in Chinese), https://blog.csdn.net/LoseInVain/article/details/78108955

[5]. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[J]. arXiv preprint arXiv:1502.03167, 2015.

[6]. Hinton G E, Srivastava N, Krizhevsky A, et al. Improving neural networks by preventing co-adaptation of feature detectors[J]. arXiv preprint arXiv:1207.0580, 2012.

[7]. What does the parameter retain_graph mean in the Variable's backward() method? https://stackoverflow.com/questions/46774641/what-does-the-parameter-retain-graph-mean-in-the-variables-backward-method

[8]. https://pytorch.org/docs/stable/autograd.html?highlight=backward#torch.Tensor.backward

[9]. https://pytorch.org/docs/stable/nn.html?highlight=dropout#torch.nn.functional.dropout

[10]. https://github.com/pytorch/pytorch/issues/571

[11]. Common problems when using PyTorch's BatchNorm layer (in Chinese), https://blog.csdn.net/LoseInVain/article/details/86476010

[12]. https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.interpolate

