Author | xiaopl@Zhihu (reposted with permission)

Source | https://zhuanlan.zhihu.com/p/67184419

Editor | Jishi Platform (極市平臺)

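This article centers on the tensor in PyTorch and discusses how autograd tracks tensors, how to move tensors between devices, and how network weights are updated. It is aimed at readers who have already used PyTorch for a while. The code examples are based on Python 3 and PyTorch 1.1. If anything is wrong or unclear, corrections and discussion in the comments are welcome.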

The article is divided into the following six parts:

tensor.requires_grad
torch.no_grad()
Backpropagation and updating the network
tensor.detach()
CPU and GPU
tensor.item()

1. requires_grad

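When we create a tensor without specifying otherwise, it does not require gradients by default. We can check whether a tensor requires gradients through tensor.requires_grad.

In a computation on tensors, if at least one input requires gradients, the output requires gradients as well. Conversely, the output does not require gradients only when none of the inputs do.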
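
A minimal sketch of this propagation rule. The tensors `a` and `b` are hypothetical and were introduced only for illustration; they are not part of the original article.

```python
import torch

# Hypothetical tensors, used only to illustrate the rule.
a = torch.ones(3)                      # requires_grad defaults to False
b = torch.ones(3, requires_grad=True)  # explicitly tracked by autograd

no_grad_out = a * 2  # every input has requires_grad=False
mixed_out = a + b    # one input has requires_grad=True

print(no_grad_out.requires_grad)
# False
print(mixed_out.requires_grad)
# True
```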

Take a simple example. When training a network, the mini-batch we read from a DataLoader does not require gradients by default. We never explicitly ask for gradients on the network output either, nor on the ground truth. So how is it that the loss can still be differentiated automatically? The reason is the rule above: the training inputs do not require gradients, but all the parameters of our model do by default, and as soon as one input to a computation requires gradients, so does the network output. Here is a concrete example:

import torch
import torch.nn as nn

input = torch.randn(8, 3, 50, 100)
print(input.requires_grad)
# False

net = nn.Sequential(nn.Conv2d(3, 16, 3, 1),
                    nn.Conv2d(16, 32, 3, 1))
for name, param in net.named_parameters():
    print(name, param.requires_grad)
# 0.weight True
# 0.bias True
# 1.weight True
# 1.bias True

output = net(input)
print(output.requires_grad)
# True

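It really works! Note, though, that this was only an illustration. When writing real code, do not set requires_grad to True on the network input or the ground truth. Doing so does not break backpropagation, but it forces autograd to compute gradients for the input and the ground truth as well, which costs extra computation and memory and produces gradients nobody uses. We only need the gradients of the network parameters in order to update the network; all other gradients are unnecessary.

With this example as groundwork, let's push a little further and see what happens when we set requires_grad of the network parameters to False, using the same network: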

import torch
import torch.nn as nn

input = torch.randn(8, 3, 50, 100)
print(input.requires_grad)
# False

net = nn.Sequential(nn.Conv2d(3, 16, 3, 1),
                    nn.Conv2d(16, 32, 3, 1))
for name, param in net.named_parameters():
    param.requires_grad = False
    print(name, param.requires_grad)
# 0.weight False
# 0.bias False
# 1.weight False
# 1.bias False

output = net(input)
print(output.requires_grad)
# False

What is this good for? Quite a lot. It lets us freeze part of a network during training so that the parameters of those layers are no longer updated, which is very useful in transfer learning. Here is an example from the tutorial "Finetuning Torchvision Models", https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html#initialize-and-reshape-the-networks :

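import torch.nn as nn
import torch.optim as optim
import torchvision

# pretrained=True matches the PyTorch 1.1 era;
# newer torchvision releases use the weights= argument instead
model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the original fully connected layer with a new fc layer;
# the parameters of a newly built layer have requires_grad=True by default
model.fc = nn.Linear(512, 100)

# Only update the parameters of the fc layer
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)

# This freezes every layer of the ResNet before fc,
# so training only updates the parameters of the final fc layer.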
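
The same freezing pattern can be checked without downloading pretrained weights. The sketch below is an illustration under stated assumptions: the small `nn.Sequential` model is a hypothetical stand-in for a pretrained backbone, and the names `trainable` and `trainable_names` are introduced here, not taken from the tutorial.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in for a pretrained backbone (nothing is downloaded).
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

# Freeze everything first.
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer; its parameters are created with requires_grad=True.
model[2] = nn.Linear(16, 2)

# Hand only the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=1e-2, momentum=0.9)

trainable_names = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable_names)
# ['2.weight', '2.bias']
```

Filtering on requires_grad is one way to avoid naming the new layer explicitly; passing the new layer's parameters directly, as the tutorial code does, is equivalent here.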

2. torch.no_grad()

During evaluation, when no gradients are needed, we can wrap the inference code in with torch.no_grad(): so that autograd temporarily stops tracking operations, which cuts unnecessary computation and memory use. Here is the example from https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients :

import torch

x = torch.randn(3, requires_grad=True)
print(x.requires_grad)
# True
print((x ** 2).requires_grad)
# True

with torch.no_grad():
    print((x ** 2).requires_grad)
    # False

print((x ** 2).requires_grad)
# True
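
In an evaluation routine this usually looks like the sketch below. The `nn.Linear` model and the random inputs are hypothetical placeholders for a trained network and real data.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)     # hypothetical stand-in for a trained network
inputs = torch.randn(5, 4)  # fake evaluation batch

model.eval()                # puts layers such as dropout and batch norm in eval mode
with torch.no_grad():       # no autograd graph is built inside this block
    eval_out = model(inputs)

train_out = model(inputs)   # outside the block, tracking is active again

print(eval_out.requires_grad, eval_out.grad_fn)
# False None
print(train_out.requires_grad)
# True
```

Note that model.eval() and torch.no_grad() do different jobs: the first changes layer behaviour, the second turns off gradient tracking, so evaluation code typically uses both.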

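3. Backpropagation and updating the network

This part briefly covers how, once we have the network output, we use it to update the network parameters. We use a very simple custom network to illustrate: it has two convolutional layers and one fully connected layer, and its output is 20-dimensional, as in a classification problem with 20 classes. The network is: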

import torch
import torch.nn as nn

class Simple(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, 1, padding=1, bias=False)
        self.conv2 = nn.Conv2d(16, 32, 3, 1, padding=1, bias=False)
        self.linear = nn.Linear(32*10*10, 20, bias=False)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        # Flatten everything except the batch dimension
        x = self.linear(x.view(x.size(0), -1))
        return x
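
The 32*10*10 input size of the linear layer follows from the convolution settings: with a 3x3 kernel, stride 1 and padding 1, each convolution keeps the spatial size unchanged. A small sketch of that arithmetic, using the standard output-size formula for a convolution with dilation 1 (the helper `conv2d_out_size` is hypothetical and assumes a 10x10 input, as used later in the article):

```python
def conv2d_out_size(size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

side = 10  # assumed input height/width
for _ in range(2):  # conv1, then conv2
    side = conv2d_out_size(side, kernel_size=3, stride=1, padding=1)

flattened = 32 * side * side  # 32 output channels from conv2
print(side, flattened)
# 10 3200
```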

Next, we use this network to walk through the whole update process:

# Create a very simple network: two conv layers and one fully connected layer
model = Simple()
# To make the changes easy to follow, initialize all parameters to 0.1
for m in model.parameters():
    nn.init.constant_(m, 0.1)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

model.train()
# Simulate a batch of 8 samples, each of size 10x10, with all values
# set to 1 so the output is the same on every run and easy to inspect
images = torch.ones(8, 3, 10, 10)
targets = torch.ones(8, dtype=torch.long)

output = model(images)
print(output.shape)
# torch.Size([8, 20])

loss = criterion(output, targets)

print(model.conv1.weight.grad)
# None
loss.backward()
print(model.conv1.weight.grad[0][0][0])
# tensor([-0.0782, -0.0842, -0.0782])
# One backward pass computes the gradients of the network parameters;
# to save space we only look at a small slice

print(model.conv1.weight[0][0][0])
# tensor([0.1000, 0.1000, 0.1000], grad_fn=<SelectBackward>)
# As expected, the parameters were all initialized to 0.1

optimizer.step()
print(model.conv1.weight[0][0][0])
# tensor([0.1782, 0.1842, 0.1782], grad_fn=<SelectBackward>)
# The learning rate is 1, so the updated value is exactly
# (original weight - gradient)

optimizer.zero_grad()
print(model.conv1.weight.grad[0][0][0])
# tensor([0., 0., 0.])
# (PyTorch 2.0+ sets .grad to None here by default;
# pass set_to_none=False to get zeros instead)
# Remember to zero the gradients after every weight update,
# otherwise the next backward pass accumulates onto the previous result.
# zero_grad() can also be called earlier, as long as the gradients
# are zero before the next backward pass.
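
Put together, one common ordering of these calls in a training loop is sketched below. This is an illustration, not the author's code: the tiny `nn.Linear` classifier, the random data and the learning rate are all assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)                # hypothetical tiny classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(16, 4)            # fake data, for illustration only
targets = torch.randint(0, 3, (16,))

losses = []
for step in range(20):
    optimizer.zero_grad()                      # clear gradients from the previous step
    loss = criterion(model(inputs), targets)   # forward pass
    loss.backward()                            # compute fresh gradients
    optimizer.step()                           # apply the update
    losses.append(loss.item())

print(losses[0], losses[-1])           # the loss should go down on this fixed batch
```

Calling zero_grad() at the top of the loop guarantees the gradients are clean before every backward pass, regardless of what happened before the loop started.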

One more remark: we passed all of the network's parameters to the optimizer, and in that case calling model.zero_grad() has the same effect as calling optimizer.zero_grad(). That is good to know, but I recommend sticking with optimizer.zero_grad(). Now let's see what happens when zero_grad() is not called:

# ...
# Same setup code as before
model.train()

# First pass
images = torch.ones(8, 3, 10, 10)
targets = torch.ones(8, dtype=torch.long)

output = model(images)
loss = criterion(output, targets)
loss.backward()
print(model.conv1.weight.grad[0][0][0])
# tensor([-0.0782, -0.0842, -0.0782])

# Second pass
output = model(images)
loss = criterion(output, targets)
loss.backward()
print(model.conv1.weight.grad[0][0][0])
# tensor([-0.1564, -0.1684, -0.1564])

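We can see that the second result is exactly twice the first. We did not update the weights after the first pass, so the second backward pass computes the same gradients as the first; and because we did not zero the gradients, they were added to the ones already stored, giving exactly double. For more background, see the blog post "torch code walkthrough: why use optimizer.zero_grad()", https://blog.csdn.net/scut_salmon/article/details/82414730 , which I think explains it very well.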
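
The accumulation can also be verified programmatically rather than by reading printed values. The following is a minimal sketch with a hypothetical `nn.Linear` model and random data; it is not the network used above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)          # hypothetical model
inputs = torch.randn(8, 4)       # fake data
targets = torch.randn(8, 2)
criterion = nn.MSELoss()

criterion(model(inputs), targets).backward()
first = model.weight.grad.clone()             # copy, because .grad is updated in place

criterion(model(inputs), targets).backward()  # no zero_grad() in between
second = model.weight.grad.clone()

model.zero_grad()                             # reset before the next pass
criterion(model(inputs), targets).backward()
fresh = model.weight.grad.clone()

print(torch.allclose(second, 2 * first))
# True
print(torch.allclose(fresh, first))
# True
```

The same behaviour is sometimes used on purpose: calling backward() on several small batches before a single optimizer.step() accumulates their gradients.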

4. tensor.detach()

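Next, let's look at two leftovers from the 0.4.0 release. The first is tensor.data versus tensor.detach().

Before 0.4.0, .data was used to get the tensor out of a Variable. Variable was later deprecated, but .data remained. Calling tensor.data now returns a tensor that holds the same data with requires_grad=False, and the two share storage, so modifying one also changes the other. Because autograd does not track changes made through tensor.data, using it can silently produce wrong gradients. The official recommendation is to use tensor.detach() instead: it behaves similarly, but in-place changes to the detached tensor are tracked by autograd, which makes it safe to use. Enough talk, here is an example: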

import torch

a = torch.tensor([7., 0, 0], requires_grad=True)
b = a + 2
print(b)
# tensor([9., 2., 2.], grad_fn=<AddBackward0>)

loss = torch.mean(b * b)

b_ = b.detach()
b_.zero_()
print(b)
# tensor([0., 0., 0.], grad_fn=<AddBackward0>)
# Storage is shared: modifying b_ also changes b

loss.backward()
# RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

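In this example, b is a variable used to compute the loss, and we modify its value after computing the loss but before backpropagation. That would make the gradient wrong, because the backward pass still needs the value of b, which has changed (it no longer matches the value seen in the forward pass). In this situation PyTorch raises an error to warn us. With tensor.data, however, this is what happens: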

import torch

a = torch.tensor([7., 0, 0], requires_grad=True)
b = a + 2
print(b)
# tensor([9., 2., 2.], grad_fn=<AddBackward0>)

loss = torch.mean(b * b)

b_ = b.data
b_.zero_()
print(b)
# tensor([0., 0., 0.], grad_fn=<AddBackward0>)

loss.backward()

print(a.grad)
# tensor([0., 0., 0.])

# The correct result would be 2 * b / 3:
# tensor([6.0000, 1.3333, 1.3333])

This gradient is clearly wrong, yet there is no warning at all, which makes later debugging very painful. So I recommend always using tensor.detach(). The code example above was inspired by https://github.com/pytorch/pytorch/issues/6990 .
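
If the goal is to change a copy of an intermediate value without touching the graph at all, one option is to add clone() after detach(). This is a sketch of that variation, not something the original article proposes; `b_copy` is a name introduced here.

```python
import torch

a = torch.tensor([7., 0, 0], requires_grad=True)
b = a + 2
loss = torch.mean(b * b)

b_copy = b.detach().clone()  # clone() allocates new storage
b_copy.zero_()               # b itself is untouched

loss.backward()              # no error, because b still holds its forward-pass values
print(b.detach())
# tensor([9., 2., 2.])
print(a.grad)
# tensor([6.0000, 1.3333, 1.3333])
```

The price is one extra copy of the data, which is usually acceptable for small tensors but worth keeping in mind for large activations.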

5. CPU and GPU

The next topic is tensor.cuda() versus tensor.to(device). The latter was added in 0.4.0, and when device is a GPU the two do the same thing. So why add the new form? Because with it we can define device once at the top of the code and simply call to(device) everywhere after that:

import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

a = torch.rand([3,3]).to(device)
# do other work
b = torch.rand([3,3]).to(device)
# do other work
c = torch.rand([3,3]).to(device)

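In earlier versions, every time we moved data between devices we had to check torch.cuda.is_available() to see whether a GPU could be used, which was tedious. This nice explanation comes from https://stackoverflow.com/questions/53331247/pytorch-0-4-0-there-are-three-ways-to-create-tensors-on-cuda-device-is-there-s . The old style looked like this: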

if torch.cuda.is_available():
    a = torch.rand([3,3]).cuda()
# do other work
if torch.cuda.is_available():
    b = torch.rand([3,3]).cuda()
# do other work
if torch.cuda.is_available():
    c = torch.rand([3,3]).cuda()
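
A related device-agnostic pattern is to pass device= to the factory function, which creates the tensor on the target device directly instead of creating it on the CPU and copying it. The sketch below assumes a hypothetical `nn.Linear` model and runs on the CPU when no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Factory functions accept device=, so no separate copy step is needed.
a = torch.rand(3, 3, device=device)

# Modules are moved with .to(device) as well (hypothetical model).
model = torch.nn.Linear(3, 2).to(device)
out = model(a)

print(out.device.type == device.type)
# True
```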

One more point about GPUs: to convert a GPU tensor to a NumPy array, we first have to move the tensor to the CPU, because NumPy is CPU-only. In addition, if the tensor requires gradients, we need a detach step before converting to NumPy. For example:

import time
import torch

x  = torch.rand([3,3], device='cuda')
x_ = x.cpu().numpy()

y  = torch.rand([3,3], requires_grad=True, device='cuda')
y_ = y.cpu().detach().numpy()
# y_ = y.detach().cpu().numpy() also works
# The two look about the same; let's compare timings:
start_t = time.time()
for i in range(10000):
    y_ = y.cpu().detach().numpy()
print(time.time() - start_t)
# 1.1049120426177979

start_t = time.time()
for i in range(10000):
    y_ = y.detach().cpu().numpy()
print(time.time() - start_t)
# 1.115112543106079
# The difference is small, although it may depend on the hardware
# (for example an expensive GPU paired with a weak CPU).
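
Both steps can be wrapped in a small helper so that calling code does not need to know where a tensor lives. `to_numpy` is a hypothetical helper introduced for this sketch, which runs on the CPU so that it works without a GPU.

```python
import numpy as np
import torch

def to_numpy(tensor):
    """Hypothetical helper: return a NumPy array for any tensor."""
    # detach() drops the autograd history; cpu() is a no-op for CPU tensors.
    return tensor.detach().cpu().numpy()

x = torch.ones(2, 2, requires_grad=True)
arr = to_numpy(x * 3)
print(type(arr).__name__, arr.shape)
# ndarray (2, 2)
```

Calling detach() first means the device copy itself is not recorded by autograd, which is one reason to prefer that order, though as the timings above suggest the practical difference is small.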

6. tensor.item()

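When we want the plain numeric value of a loss, we often use loss.item(), which returns a Python number. Unlike converting a tensor to NumPy (where we have to think about whether the tensor lives on the CPU or the GPU and whether it requires gradients), item() can be called directly in every case. If the value has to be moved from the GPU to the CPU, PyTorch handles that automatically.

Note, however, that item() only works when the tensor contains exactly one element. In most cases our loss has just one element, which is why loss.item() is so common. To convert a tensor with multiple elements to a Python list, use tensor.tolist().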

import torch

x = torch.randn(1, requires_grad=True, device='cuda')
print(x)
# tensor([-0.4717], device='cuda:0', requires_grad=True)

y = x.item()
print(y, type(y))
# -0.4717346727848053 <class 'float'>

x = torch.randn([2, 2])
y = x.tolist()
print(y)
# [[-1.3069953918457031, -0.2710231840610504], [-1.26217520236969, 0.5559719800949097]]
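
A typical use of item() is accumulating a running loss as a plain Python float, so that no autograd history is kept alive between iterations. The sketch below uses two hand-made scalar tensors as hypothetical stand-ins for per-batch losses, and also shows that item() refuses a multi-element tensor.

```python
import torch

# Hypothetical per-batch losses: one tracked by autograd, one not.
losses = [torch.tensor(0.5, requires_grad=True) * 2, torch.tensor(1.5)]

running = 0.0
for loss in losses:
    running += loss.item()   # plain Python float, detached from the graph

print(running, type(running))
# 2.5 <class 'float'>

try:
    torch.ones(2).item()     # more than one element: not allowed
except (ValueError, RuntimeError) as err:  # the exception type may differ between PyTorch versions
    item_error = type(err).__name__
```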

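Closing remarks

The points above are what I find worth paying attention to when writing code day to day. The article uses some simple code as examples to aid understanding. It is a long read, so thank you for making it this far.

As always, I hope this article helps you learn and understand PyTorch.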

References

PyTorch Docs: Autograd mechanics, https://pytorch.org/docs/stable/notes/autograd.html
PyTorch 0.4.0 release notes, https://github.com/pytorch/pytorch/releases/tag/v0.4.0

Practical Tutorial | A Brief Look at Tensors in PyTorch and How to Use Them