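This article is reposted from Jishi Platform

Author | Liu Xinchen @ Zhihu

Source | https://zhuanlan.zhihu.com/p/268308900

Editor | Jishi Platform

Overview

Stacking more layers (increasing depth) is a very effective way to strengthen a network's representations and improve feature learning, but deep networks suffer from a performance degradation problem, which ResNet addresses. This article walks through the paper to show how ResNet is implemented and how it outperforms the prior state of the art on classification, object detection, and other tasks.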

Paper title: Deep Residual Learning for Image Recognition

1 motivation

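Summarizing prior work, we often arrive at the conclusion that stacking more layers (increasing depth) is a very effective way to strengthen representations and improve feature learning.

Why do deeper networks learn better representations?
Deep learning is hard to explain. A rough explanation is that different layers extract features at different levels of abstraction, and deeper layers extract more abstract features. A deep network can therefore integrate low-, mid-, and high-level features, which strengthens its representational power.

Fine, so let's just make the network deeper!

But it turns out not to be that simple.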

The gradient optimization problem:

We can't help asking: Is learning better networks as easy as stacking more layers?

First, deep networks are hard to optimize; for example, gradients can explode or vanish. That said, this problem has largely been addressed by normalized initialization and batch normalization.

The degradation problem:

Good, so let's go straight to a deeper network!

But a new problem appears: the deeper network does converge, yet its results degrade.

With deeper networks, accuracy saturates and then degrades rapidly.

Why? A deeper network has more parameters, so it should have more fitting capacity. Ah, it must be overfitting.

But overfitting does not seem to be the cause either:

The 56-layer network (red curve) also has a higher training error (left panel) than the 20-layer network (yellow curve), so this cannot be overfitting.

So what actually causes the degradation of deeper networks?

Now let's construct a deeper network in a different way:

the added layers are identity mapping, and the other layers are copied from the learned shallower model. (That is, the layers added on top of the original shallower network are treated as identity mappings.)

In other words, suppose the shallower network already achieves a decent result. The newly added layers then do nothing except fit an identity mapping, so that their output reproduces their input. That should always be possible.

If so, a deeper network built this way should have a training error no higher than that of its shallower counterpart.
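The construction argument above can be illustrated with a toy sketch. Everything here is hypothetical: `shallow_model` stands in for a trained shallower network, and the three data points are made up purely for illustration.

```python
def shallow_model(x):
    # Stand-in for a trained shallower network (hypothetical toy function).
    return 2.0 * x + 1.0


def identity_layer(x):
    # An added layer that does nothing: its output reproduces its input.
    return x


def deeper_model(x, extra_layers=3):
    # Copy the shallow model, then append identity layers on top of it.
    out = shallow_model(x)
    for _ in range(extra_layers):
        out = identity_layer(out)
    return out


def mean_squared_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


# Made-up training pairs (input, target), for illustration only.
train_data = [(0.0, 1.2), (1.0, 2.9), (2.0, 5.1)]
shallow_error = mean_squared_error(shallow_model, train_data)
deeper_error = mean_squared_error(deeper_model, train_data)
print(shallow_error, deeper_error)
```

By construction the two errors are identical, so a solution for the deeper network that is at least as good as the shallower one always exists; the open question is whether an optimizer can find it.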

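But experiments show, mercilessly, that this is not what happens: current solvers fail to find a solution as good as the shallower network's, and the result can even be worse.

Evidently, optimizing deep networks is not easy.

To summarize: intuitively a deeper network should have better representational power, but in practice its results get worse. We therefore believe the problem lies in optimization: the parameter space of a deeper network is more complex, which makes optimization harder.

Enter ResNet.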

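The idea: rather than directly fitting a desired underlying mapping $\mathcal{H}(x)$, let the network try to fit a residual mapping $\mathcal{F}(x) := \mathcal{H}(x) - x$.

That is:

the original mapping $\mathcal{H}(x)$ is recast as $\mathcal{F}(x) + x$.

The hypothesis here is that the residual mapping $\mathcal{F}(x)$ is easier to optimize than the original mapping $\mathcal{H}(x)$.
For example, if the identity mapping is optimal, it seems easier to push the residual to 0 than to fit the identity mapping directly with a stack of nonlinear layers.

$\mathcal{F}(x) + x$ can be realized with the shortcut connection structure shown in the figure above.

A shortcut is a connection that skips one or more layers. In ResNet it simply performs the simplest possible identity mapping, and its output is added to the output of the stacked layers. This adds neither extra parameters nor computational complexity.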

The concrete ResNet architecture is described in detail later.

Next, the paper runs experiments on ImageNet, CIFAR-10, and other datasets, mainly to verify two claims:

Deep residual nets are easier to optimize than their plain counterparts and reach lower training error.
Deep residual nets gain better representations from increased depth, which improves classification results.

2 solution

What does ResNet set out to do?

learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

If that doesn't make sense yet, read on.

2.1 Residual Learning

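Premise: if we assume that multiple nonlinear layers can asymptotically approximate a complicated function, then they can equally well asymptotically approximate the corresponding residual function.

Let $\mathcal{H}(x)$ denote the target function to be fitted.

So rather than fitting $\mathcal{H}(x)$, consider fitting its residual function $\mathcal{F}(x) := \mathcal{H}(x) - x$. The two may differ in how hard they are to fit.

Returning to the earlier discussion: if the added layers can be constructed as identity mappings, the deeper network should perform no worse than its shallower counterpart.

This suggests that the trouble lies in optimization: the solver has difficulty approximating identity mappings with multiple nonlinear layers.

With residual learning, if the identity mapping is optimal, the optimizer can simply drive the residual to 0, which is comparatively easy.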
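The zero-residual case can be written out explicitly. As an assumption for illustration, the residual function below takes the two-layer form $W_2\,\sigma(W_1 x)$ with $\sigma$ = ReLU and no biases, as in the paper's basic block:

```latex
\mathcal{H}(x) = x
\;\Longrightarrow\;
\mathcal{F}(x) = \mathcal{H}(x) - x = 0,
\qquad
\mathcal{F}(x) = W_2\,\sigma(W_1 x) = 0 \quad \text{whenever } W_2 = 0 .
```

So in the residual formulation the identity mapping corresponds to a single, easily reachable point in weight space (weights driven toward zero), whereas a plain stack would have to reproduce $x$ exactly through its nonlinear layers.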

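In real cases, however, the identity mapping is not necessarily optimal. The paper argues that the design is still useful because it may help to precondition the problem.

That is, if the optimal function is closer to an identity mapping, the optimizer should find it easier to learn the residual relative to the identity than to learn a new function from scratch.

Later experiments also show that the layer responses of residual networks have relatively small standard deviations (see the figure above, explained later), which supports the claim that learning a residual on top of an identity mapping is indeed easier (identity mappings provide reasonable preconditioning).

The explanation here remains somewhat vague, but in short the authors argue that learning residuals makes training substantially easier, and that this may be what resolves the degradation problem in deep networks.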

2.2 Identity Mapping by Shortcuts

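Let's look again at the famous residual block figure:

The identity mapping is implemented very simply, with just a shortcut.

Formally:

$$y = \mathcal{F}(x, \{W_i\}) + x$$

$\mathcal{F}(x, \{W_i\})$ denotes the residual mapping. In the figure above it is a two-layer network, i.e. $\mathcal{F} = W_2\,\sigma(W_1 x)$, where $\sigma$ denotes ReLU.

$\mathcal{F}$ and $x$ are then added element-wise.

Finally, an activation function is applied to $\mathcal{F} + x$, giving $\sigma(y)$.

A major benefit of this shortcut design is that it introduces no new parameters and no extra computational complexity.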
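The formula can be sketched in a few lines of plain Python for the fully connected case. This is a minimal sketch, not the paper's implementation: biases and batch normalization are omitted, and the helper names (`relu`, `matvec`, `residual_block`) and the weight matrices are chosen here for illustration.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    # Multiply matrix W (a list of rows) by vector v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    f = matvec(W2, relu(matvec(W1, x)))    # F(x) = W2 * relu(W1 * x)
    y = [fi + xi for fi, xi in zip(f, x)]  # y = F(x) + x, element-wise
    return relu(y)                         # second nonlinearity after the addition


x = [1.0, 0.5, 2.0]
zeros = [[0.0] * 3 for _ in range(3)]
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]

print(residual_block(x, zeros, zeros))  # zero residual: the block passes x through
print(residual_block(x, eye, half))     # F(x) = 0.5 * x, so the output is 1.5 * x
```

With all weights at zero the block reduces to `relu(x)`, which equals `x` for non-negative inputs; this is the "residual learned as 0" case discussed above.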

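Two small questions remain:

Question 1: the addition $\mathcal{F} + x$.

Since $\mathcal{F} + x$ is an element-wise addition, what if $\mathcal{F}$ and $x$ have different dimensions?

Option 1: pad $x$ with zeros.

Option 2: add a layer (with parameters $W_s$) that changes the dimension of $x$, i.e.:

$$y = \mathcal{F}(x, \{W_i\}) + W_s x$$

In fact, a projection layer $W_s$ (which can be implemented as a single linear, perceptron-style layer) could be added to every shortcut; when no dimension change is needed, $W_s$ is just a square matrix. Later experiments show, however, that the plain identity shortcut is already good enough, so there is no need to spend the extra parameters and computation.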
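The two options can be compared on plain vectors. This is a simplified sketch: the function names, the example values, and the matrix `Ws` are made up for illustration, and zeros are appended at the end of the vector for simplicity.

```python
def zero_pad_shortcut(x, out_dim):
    # Option 1: extend x with zeros; no learnable parameters involved.
    if out_dim < len(x):
        raise ValueError("zero-padding can only increase the dimension")
    return list(x) + [0.0] * (out_dim - len(x))


def projection_shortcut(x, Ws):
    # Option 2: multiply x by a learnable matrix Ws of shape (out_dim, in_dim).
    return [sum(w * a for w, a in zip(row, x)) for row in Ws]


def add(f, shortcut):
    if len(f) != len(shortcut):
        raise ValueError("element-wise addition needs equal dimensions")
    return [a + b for a, b in zip(f, shortcut)]


x = [1.0, 2.0]                 # 2-dimensional block input
f = [0.5, 0.25, 0.125, 1.0]    # pretend F(x) is 4-dimensional
Ws = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # hypothetical 4x2 projection

y_pad = add(f, zero_pad_shortcut(x, len(f)))
y_proj = add(f, projection_shortcut(x, Ws))
print(y_pad)   # [1.5, 2.25, 0.125, 1.0]
print(y_proj)  # [1.5, 2.25, 3.125, 1.0]
```

Zero-padding keeps the shortcut parameter-free, while the projection costs `out_dim * in_dim` extra weights per shortcut.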

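Question 2: what should the structure of $\mathcal{F}$ look like?

$\mathcal{F}$ can have 2 or 3 layers, or more; but it should not have just 1 layer, which does not work well (with a single layer the block reduces to $y = W_1 x + x$, which is similar to an ordinary linear layer).

Finally, the shortcut design is not limited to fully connected layers; it applies equally to convolutional layers.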

2.3 Network architecture

Inspired by VGGNet (left), the authors design a 34-layer plain network (middle) and its residual counterpart (right).

Note: the plain network in the middle and the residual network on the right have the same number of layers, and can also be made to have exactly the same parameters (when dimensions do not match for the element-wise addition, zero-padding adds no learnable parameters).

Design rules for the 34-layer plain network (following VGGNet):

for the same output feature map size, the layers have the same number of filters
if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer
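The second rule can be checked with simple arithmetic. In this sketch the cost of a convolution is counted as multiply-adds (output height × output width × input channels × output channels × kernel area); the concrete sizes (56×56 with 64 filters, 28×28 with 128 filters) are illustrative values assumed here, not taken from the text above.

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel=3):
    # Multiply-adds of one conv layer producing a height x width feature map.
    return height * width * in_channels * out_channels * kernel * kernel


# Halve the feature map side (56 -> 28) and double the filters (64 -> 128).
before = conv_multiply_adds(56, 56, 64, 64)
after = conv_multiply_adds(28, 28, 128, 128)
print(before, after)
```

Halving both spatial dimensions divides the number of output positions by 4, while doubling both input and output channels multiplies the per-position cost by 4, so the per-layer cost is unchanged.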

3 dataset and experiments

3.1 ImageNet on Classification

3.1.1 Comparison with plain networks

This is the core experiment: it shows that the residual network handles the degradation caused by increased depth remarkably well.

Left: plain networks; right: ResNets. Thin curves are training error, bold curves are validation error.

For plain networks, increasing the number of layers raises both training error and validation error.

Why?
First, overfitting is ruled out, because training error rises as well.
Second, vanishing gradients are ruled out: the networks use batch normalization, and the authors also verified experimentally that the gradients do not vanish.
In fact, the 34-layer plain network still reaches fairly good accuracy, which suggests that the network does work to some extent.
The authors conjecture: We conjecture that the deep plain nets may have exponentially low convergence rates. In other words, greater depth may slow convergence exponentially.

Below is the quantitative comparison between residual and plain networks:

From the two figures above we can conclude:

With ResNet, adding layers actually lowers both training error and validation error, which shows that depth really can improve performance. The degradation problem is addressed to a considerable extent.
Compared with the 34-layer plain network, the 34-layer ResNet reduces the top-1 error rate by 3.5%. ResNet achieves this lower error rate without adding any parameters, so the network is more efficient.
Comparing the 18-layer plain and residual networks, the two reach similar error rates, but the ResNet converges faster.

In short, without adding any parameters, using only shortcuts and zero-padding for matching dimensions, ResNet achieves the following:

It solves the degradation problem, reaches higher accuracy, and converges faster.

Truly impressive!

3.1.2 Identity v.s. Projection shortcuts

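A projection shortcut is a shortcut that contains learnable parameters (which can be used to match dimensions so that the element-wise addition is possible):

$$y = \mathcal{F}(x, \{W_i\}) + W_s x$$

Three options, A, B, and C, are compared:

A: identity shortcuts only:

zero-padding is used to increase dimensions when they do not match

this option adds no parameters

B: projection shortcuts only where dimensions need to be matched, parameter-free identity shortcuts everywhere else

C: projection shortcuts everywhere

The results for the three options are as follows:

All three options are considerably better than the plain counterpart.

C is slightly better than A and B, but it introduces many extra parameters and increases time and memory cost.

The authors conclude: projection shortcuts are not essential for addressing the degradation problem.

The remaining experiments therefore use option A or B.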
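To get a rough feel for the cost of each option, the sketch below counts the 1×1 projection weights that the shortcuts would add in a 34-layer ResNet. The stage layout (64/128/256/512 channels with 3/4/6/3 two-layer blocks, after a 64-channel stem) is an assumption taken from the paper's architecture table, and batch-norm parameters on the projections are ignored.

```python
# (output channels, number of 2-layer blocks) per stage; assumed ResNet-34 layout.
STAGES = [(64, 3), (128, 4), (256, 6), (512, 3)]
STEM_CHANNELS = 64


def projection_weights(option):
    # Count the 1x1 conv weights added by the shortcuts under option A, B, or C.
    total = 0
    in_ch = STEM_CHANNELS
    for out_ch, n_blocks in STAGES:
        for _ in range(n_blocks):
            needs_match = in_ch != out_ch
            if option == "C" or (option == "B" and needs_match):
                total += in_ch * out_ch  # 1x1 projection, no bias
            in_ch = out_ch
    return total


for option in "ABC":
    print(option, projection_weights(option))
```

Under these assumptions option A adds nothing, option B adds projections only at the three stage transitions, and option C adds one to each of the 16 blocks, which is where its extra parameters come from.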

3.1.3 Deeper Bottleneck Architectures.

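To explore deeper networks while keeping training time manageable, the authors redesign the building block as a bottleneck version.

Left: the original block; right: the bottleneck block.

The bottleneck block uses convolutions of size 1×1, 3×3, and 1×1. The 1×1 layers first reduce and then restore the channel dimension, so the 3×3 convolution operates on fewer channels; although the block grows to 3 layers, the parameter count drops.
The authors' aim here is to probe how far depth can really be pushed rather than merely to chase a very low error rate, hence the more economical bottleneck building block.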
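The parameter saving can be checked by counting convolution weights. The channel widths used below (a 256-channel block input with a 64-channel bottleneck, and a 64-channel basic block) are illustrative values assumed from the paper's block figure; biases and batch-norm parameters are ignored.

```python
def conv_weights(in_channels, out_channels, kernel):
    # Number of weights in a conv layer without bias.
    return in_channels * out_channels * kernel * kernel


# Two 3x3 convs applied directly to a 256-channel input.
basic_256 = 2 * conv_weights(256, 256, 3)

# Bottleneck on the same input: 1x1 reduce, 3x3 at low width, 1x1 restore.
bottleneck_256 = (
    conv_weights(256, 64, 1)
    + conv_weights(64, 64, 3)
    + conv_weights(64, 256, 1)
)

# Two 3x3 convs on a 64-channel input, i.e. a basic block at width 64.
basic_64 = 2 * conv_weights(64, 64, 3)

print(basic_256, bottleneck_256, basic_64)
```

A 256-channel bottleneck block has roughly as many weights as a 64-channel basic block, and about 17 times fewer than a basic block applied at 256 channels.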

50-layer: each 2-layer block in the 34-layer network is replaced with a 3-layer bottleneck block.

101-layer/152-layer: more 3-layer bottleneck blocks are added.

The detailed network configurations are shown in the figure below:

The experimental results are as follows:

Deeper networks do achieve better results. The degradation problem seen in plain networks appears to be gone.

3.2 CIFAR-10 experiments and analysis

Solid curves are test error, dashed curves are training error.

Left: plain networks; middle: ResNets; right: ResNets with 110 and 1202 layers.

The conclusions largely match the earlier ones, but at 1202 layers ResNet degrades again (its result is worse than the 110-layer network's), which the authors attribute to overfitting.

Additionally: Analysis of Layer Responses

On responses: The responses are the outputs of each 3×3 layer, after BN and before other nonlinearity (ReLU/addition).

The figure above shows directly that ResNets generally have smaller response standard deviations than their plain counterparts.

Moreover: deeper ResNet has smaller magnitudes of responses

This indicates that:

residual functions (i.e. $\mathcal{F}(x)$) might be generally closer to zero than the non-residual functions.
When there are more layers, an individual layer of ResNets tends to modify the signal less. (That is, later layers get progressively closer to an identity mapping: the residual left to fit keeps shrinking as the network approaches the target.)

4 code review

ResNet is very simple to implement, and implementations abound online; one was picked arbitrarily here for reference:

The code targets CIFAR-10:

The 2-layer BasicBlock:

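import torch.nn as nn
import torch.nn.functional as F


class LambdaLayer(nn.Module):
    # Helper used by option A below but not shown in the original excerpt:
    # it wraps an arbitrary function as a module.
    def __init__(self, lambd):
        super(LambdaLayer, self).__init__()
        self.lambd = lambd

    def forward(self, x):
        return self.lambd(x)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_planes, planes, stride=1, option='A'):
        super(BasicBlock, self).__init__()
        # Residual branch F(x): two 3x3 convs, each followed by batch norm.
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        # Default shortcut: identity (an empty Sequential returns its input).
        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != planes:
            if option == 'A':
                # For CIFAR10 ResNet paper uses option A: subsample spatially by 2
                # and zero-pad the channels (planes//4 on each side). Parameter-free.
                self.shortcut = LambdaLayer(lambda x:
                                            F.pad(x[:, :, ::2, ::2], (0, 0, 0, 0, planes//4, planes//4), "constant", 0))
            elif option == 'B':
                # Option B: learnable 1x1 conv projection followed by batch norm.
                self.shortcut = nn.Sequential(
                     nn.Conv2d(in_planes, self.expansion * planes, kernel_size=1, stride=stride, bias=False),
                     nn.BatchNorm2d(self.expansion * planes)
                )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # y = F(x) + shortcut(x)
        out = F.relu(out)        # activation after the addition
        return out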
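The option A shortcut in the block above is the least obvious line, so here is a small standalone sketch of what it does to tensor shapes. The function name `option_a_shortcut` and the input sizes are chosen for illustration; the slicing and padding expression is the same as in the block.

```python
import torch
import torch.nn.functional as F


def option_a_shortcut(x, planes):
    # Keep every second row and column: spatial size is halved (stride 2).
    subsampled = x[:, :, ::2, ::2]
    # F.pad reads the tuple from the last dimension backwards:
    # (W_left, W_right, H_top, H_bottom, C_front, C_back).
    pad = planes // 4
    return F.pad(subsampled, (0, 0, 0, 0, pad, pad), "constant", 0)


x = torch.randn(2, 16, 32, 32)        # batch of 2, 16 channels, 32x32 maps
y = option_a_shortcut(x, planes=32)   # going from 16 to 32 channels
print(tuple(y.shape))                 # (2, 32, 16, 16)
```

The original 16 channels end up in the middle of the output with 8 all-zero channels on each side. Note that padding `planes // 4` on each side adds `planes // 2` channels in total, so this only matches the residual branch when `planes` is exactly twice `in_planes`, which holds for the 16→32→64 layout used here.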

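The ResNet skeleton:

A few notes:

The forward function defines the ResNet skeleton:

Head: 1 conv layer
Body: layer1, layer2, and layer3, each built from $n$ BasicBlocks; since each BasicBlock has 2 layers, the body has $3 \times 2n = 6n$ layers
Tail: 1 fc layer

So the total is $6n + 2$ layers!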
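The layer count can be verified with a few lines of arithmetic. The helper name `cifar_resnet_depth` is made up for this sketch; the block counts are the per-stage BasicBlock numbers that appear in the constructor functions of the code being reviewed.

```python
def cifar_resnet_depth(num_blocks, layers_per_block=2):
    # 1 head conv + layers inside all blocks + 1 tail fc layer.
    return 1 + layers_per_block * sum(num_blocks) + 1


for n in (3, 5, 7, 9, 18, 200):
    print(n, cifar_resnet_depth([n, n, n]))  # 6n + 2
```

For example, 3 blocks per stage gives 6·3 + 2 = 20 layers, and 200 blocks per stage gives 1202.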

The output channel dimensions of layer1, layer2, and layer3 are 16, 32, and 64, respectively.

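def _weights_init(m):
    # Initializer applied in ResNet.__init__ but not shown in the original excerpt;
    # Kaiming (He) normal initialization is assumed here.
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(m.weight)


class ResNet(nn.Module):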
    def __init__(self, block, num_blocks, num_classes=10):
        super(ResNet, self).__init__()
        self.in_planes = 16

        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(16)
        self.layer1 = self._make_layer(block, 16, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 32, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 64, num_blocks[2], stride=2)
        self.linear = nn.Linear(64, num_classes)

        self.apply(_weights_init)

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion

        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = F.avg_pool2d(out, out.size()[3])
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out

Finally, like stacking building blocks, ResNets of different depths are obtained by setting the number of BasicBlocks in layer1, layer2, and layer3:

def resnet20():
    return ResNet(BasicBlock, [3, 3, 3])


def resnet32():
    return ResNet(BasicBlock, [5, 5, 5])


def resnet44():
    return ResNet(BasicBlock, [7, 7, 7])


def resnet56():
    return ResNet(BasicBlock, [9, 9, 9])


def resnet110():
    return ResNet(BasicBlock, [18, 18, 18])


def resnet1202():
    return ResNet(BasicBlock, [200, 200, 200])

5 conclusion

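The core of ResNet is residual learning and the identity shortcut mapping. The implementation is extremely simple, yet the results are excellent: it beats the prior state of the art by a wide margin on classification, object detection, and other tasks. An innovation this general is hard to come by, which is probably why ResNet is held in such high regard.

Another takeaway for me: it is no longer just "talk is cheap, show me the code", but rather "code is also relatively cheap, show me ur sense and thinking"!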


Revisiting a Classic: A Complete Analysis of ResNet, the Feature-Learning Powerhouse