CeiT: A Faster-Training ViT with Multi-Layer Feature Extraction
[GiantPandaCV Introduction]
This work from SenseTime and Nanyang Technological University also uses convolutions to strengthen a ViT's ability to extract low-level features and capture locality. The core contribution is the LCA module, which captures multi-layer feature representations. Compared with DeiT, CeiT trains faster.
Introduction
Previous Transformer architectures need large amounts of extra data or extra supervision (DeiT) to reach performance comparable to convolutional networks. To overcome this shortcoming, CeiT incorporates CNN components to compensate for the Transformer's weaknesses:
(1) An Image-to-Tokens (I2T) module is designed to obtain embeddings from low-level features.
(2) The Feed-Forward module in the Transformer is replaced with a Locally-enhanced Feed-Forward (LeFF) module, which increases the correlation between neighboring tokens.
(3) Layer-wise Class-token Attention (LCA) is used to capture multi-layer feature representations.
With these modifications, the model's efficiency and generalization improve, and convergence is also better, as shown below:

Method
1. Image-to-Tokens

A convolution + max-pooling stem replaces the original ViT's direct tokenization of large (16×16) patches.
2. LeFF

The patch tokens are reassembled into a feature map, a depthwise convolution is applied to add locality, and a Linear layer then projects the result back to tokens.
3. LCA
The first two are fairly conventional; the last one is the distinctive part: the Layer-wise Class-token Attention applied after all Transformer layers, shown below:

The LCA module takes the class tokens produced by all Transformer blocks as input, then applies one MSA + FFN on top of them to obtain the final logits. The authors argue this captures multi-scale representations.
Experiments
Comparison with SOTA:

I2T ablation:

LeFF ablation:

LCA effectiveness:

Convergence speed comparison:

Code
Module 1: I2T (Image-to-Tokens)
# I2T: convolution + max-pool stem, then patch embedding
self.conv = nn.Sequential(
    nn.Conv2d(in_channels, out_channels, conv_kernel, stride, 4),
    nn.BatchNorm2d(out_channels),
    nn.MaxPool2d(pool_kernel, stride)
)

# conv (stride 2) + pool (stride 2) shrink the image by a factor of 4
feature_size = image_size // 4
assert feature_size % patch_size == 0, 'Image dimensions must be divisible by the patch size.'
num_patches = (feature_size // patch_size) ** 2
patch_dim = out_channels * patch_size ** 2
self.to_patch_embedding = nn.Sequential(
    Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_size, p2=patch_size),
    nn.Linear(patch_dim, dim),
)
Module 2: LeFF
class LeFF(nn.Module):

    def __init__(self, dim=192, scale=4, depth_kernel=3):
        super().__init__()

        scale_dim = dim * scale
        # Linear up-projection, then reshape the tokens into a 14x14 feature map
        self.up_proj = nn.Sequential(
            nn.Linear(dim, scale_dim),
            Rearrange('b n c -> b c n'),
            nn.BatchNorm1d(scale_dim),
            nn.GELU(),
            Rearrange('b c (h w) -> b c h w', h=14, w=14)
        )

        # Depthwise convolution (groups == channels) adds locality between neighboring tokens
        self.depth_conv = nn.Sequential(
            nn.Conv2d(scale_dim, scale_dim, kernel_size=depth_kernel, padding=1,
                      groups=scale_dim, bias=False),
            nn.BatchNorm2d(scale_dim),
            nn.GELU(),
            Rearrange('b c h w -> b (h w) c', h=14, w=14)
        )

        # Linear down-projection back to the token dimension
        self.down_proj = nn.Sequential(
            nn.Linear(scale_dim, dim),
            Rearrange('b n c -> b c n'),
            nn.BatchNorm1d(dim),
            nn.GELU(),
            Rearrange('b c n -> b n c')
        )

    def forward(self, x):
        x = self.up_proj(x)
        x = self.depth_conv(x)
        x = self.down_proj(x)
        return x
????????
class TransformerLeFF(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, scale=4, depth_kernel=3, dropout=0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                Residual(PreNorm(dim, Attention(dim, heads=heads, dim_head=dim_head, dropout=dropout))),
                Residual(PreNorm(dim, LeFF(dim, scale, depth_kernel)))
            ]))

    def forward(self, x):
        c = list()  # collects the class token from every block, later consumed by LCA
        for attn, leff in self.layers:
            x = attn(x)
            cls_tokens = x[:, 0]
            c.append(cls_tokens)
            # LeFF only processes the patch tokens; the class token bypasses it
            x = leff(x[:, 1:])
            x = torch.cat((cls_tokens.unsqueeze(1), x), dim=1)
        # second return value: class tokens stacked as (batch, depth, dim)
        return x, torch.stack(c).transpose(0, 1)
Module 3: LCA
class LCAttention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        b, n, _, h = *x.shape, self.heads
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=h), qkv)
        q = q[:, :, -1, :].unsqueeze(2)  # only the last (Lth) class token is used as the query
        dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = dots.softmax(dim=-1)
        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        out = self.to_out(out)
        return out
class LCA(nn.Module):
    # The generic Residual wrapper is omitted here: the paper does not explicitly
    # mention a residual connection, although the code also works with one.
    def __init__(self, dim, heads, dim_head, mlp_dim, dropout=0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        self.layers.append(nn.ModuleList([
            PreNorm(dim, LCAttention(dim, heads=heads, dim_head=dim_head, dropout=dropout)),
            PreNorm(dim, FeedForward(dim, mlp_dim, dropout=dropout))
        ]))

    def forward(self, x):
        for attn, ff in self.layers:
            # only the last class token is carried through the skip connections
            x = attn(x) + x[:, -1].unsqueeze(1)
            x = x[:, -1].unsqueeze(1) + ff(x)
        return x
References
https://arxiv.org/abs/2103.11816
https://github.com/rishikksh20/CeiT-pytorch/blob/master/ceit.py