ICCV 2021 Best Paper: How Swin Transformer Was Made!
Swin Transformer recently won the ICCV 2021 best paper award; this article walks you through how it works.
Feel free to star the detectron2-based SwinT implementation: xiaohu2015/SwinT_detectron2
https://github.com/xiaohu2015/SwinT_detectron2
Swin Transformer, proposed by Microsoft Research Asia, recently set new SOTA results on object detection and segmentation: 58.7 box AP and 51.1 mask AP on COCO test-dev, and 53.5 mIoU on ADE20K val. This success demonstrates the advantage of vision transformers on dense prediction tasks (modeling long-range dependencies). The core of Swin Transformer lies in two designs that make the vision transformer more efficient: a hierarchical (pyramid) structure and window attention.

Hierarchical Structure
Mainstream CNNs adopt a pyramid, or hierarchical, structure: the model is divided into stages, and each stage reduces the feature map size while increasing the number of channels. Models like ViT, however, have no notion of stages: all layers share the same configuration and operate on features of the same resolution. For images, this design is not very friendly in terms of computation, so recent works such as PVT introduce a pyramid structure into vision transformers and strike a good balance between speed and accuracy. The pyramid structure of Swin Transformer is essentially the same as PVT's; the network consists of a patch partition step and 4 stages:
(1) patch partition: the image is split into (H/4)×(W/4) patches, each of size 4×4 (ViT typically uses 16×16 or 32×32 patches);
(2) stage 1 starts with a patch embedding operation: a linear embedding layer maps each patch to a C-dimensional embedding, which can be implemented with a single 4×4 convolution of stride 4 (a minimal sketch appears right after this list);
(3) each of the remaining 3 stages starts with a patch merging step: the features of every 2×2 group of neighboring patches are concatenated into a 4C-dimensional vector and then projected to 2C dimensions by a linear layer. This reduces the number of patches by 4x while doubling the feature dimension, much like a stride-2 downsampling in a CNN; in fact, patch merging is equivalent to a patch embedding applied to 2×2 regions. The implementation is as follows:
import torch
import torch.nn as nn


class PatchMerging(nn.Module):
    r""" Patch Merging Layer.
    Args:
        input_resolution (tuple[int]): Resolution of input feature.
        dim (int): Number of input channels.
        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm
    """
    def __init__(self, input_resolution, dim, norm_layer=nn.LayerNorm):
        super().__init__()
        self.input_resolution = input_resolution
        self.dim = dim
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
        self.norm = norm_layer(4 * dim)

    def forward(self, x):
        """
        x: B, H*W, C
        """
        H, W = self.input_resolution
        B, L, C = x.shape
        assert L == H * W, "input feature has wrong size"
        assert H % 2 == 0 and W % 2 == 0, f"x size ({H}*{W}) are not even."
        x = x.view(B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]  # B H/2 W/2 C
        x1 = x[:, 1::2, 0::2, :]  # B H/2 W/2 C
        x2 = x[:, 0::2, 1::2, :]  # B H/2 W/2 C
        x3 = x[:, 1::2, 1::2, :]  # B H/2 W/2 C
        x = torch.cat([x0, x1, x2, x3], -1)  # B H/2 W/2 4*C
        x = x.view(B, -1, 4 * C)  # B H/2*W/2 4*C
        x = self.norm(x)
        x = self.reduction(x)
        return x
(4) Each stage then stacks transformer blocks with the same configuration; the feature map resolutions from stage 1 to stage 4 are 1/4, 1/8, 1/16 and 1/32 of the input image, which makes Swin Transformer easy to plug into FPN-style dense prediction frameworks;
(5) finally, all patch embeddings are averaged (the global average pooling commonly used in CNNs) and fed into a linear classifier; this differs from ViT, which classifies with a class token.
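For steps (1)-(2), patch partition and linear embedding can be fused into a single strided convolution. A minimal sketch (this PatchEmbed is illustrative, not copied from the official repo; it assumes torch and torch.nn are imported as in the code above):

class PatchEmbed(nn.Module):
    """Split an image into 4x4 patches and linearly embed each one,
    implemented as a 4x4 convolution with stride 4 (illustrative sketch)."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, C, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)      # (B, H/4 * W/4, C)
        return self.norm(x)

print(PatchEmbed()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 3136, 96])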
Window Attention
Self-attention in ViT attends over all tokens. This builds global connections among tokens, but the cost grows quadratically with the number of tokens. Swin Transformer instead uses window attention to cut the cost: the feature map is first divided into non-overlapping windows, each containing M×M neighboring patches, and self-attention is computed separately inside each window, which can be seen as a form of local attention. For a feature map with h×w patches, the computational complexity of window-based attention (W-MSA) versus the original MSA is:

Ω(MSA) = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC

MSA thus scales quadratically with the number of patches hw, while W-MSA scales only linearly in hw because the window size M is a fixed value (7 by default in the paper, much smaller than the feature map), which is a large saving for high-resolution inputs. In addition, the attention parameters are shared across windows, just as a convolution kernel is shared across spatial locations, so W-MSA has the same two properties as convolution: locality and parameter sharing. To implement W-MSA we first need window partition and reverse operations (because of parameter sharing, num_windows can be folded into the batch dimension):
def window_partition(x, window_size):
    """
    Args:
        x: (B, H, W, C)
        window_size (int): window size
    Returns:
        windows: (num_windows*B, window_size, window_size, C)
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows


def window_reverse(windows, window_size, H, W):
    """
    Args:
        windows: (num_windows*B, window_size, window_size, C)
        window_size (int): Window size
        H (int): Height of image
        W (int): Width of image
    Returns:
        x: (B, H, W, C)
    """
    B = int(windows.shape[0] / (H * W / window_size / window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
    return x
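A quick round-trip check of the two helpers above (illustrative values; assumes torch is imported):

x = torch.randn(2, 56, 56, 96)                              # (B, H, W, C)
windows = window_partition(x, 7)                            # 2 images * 8*8 windows each
print(windows.shape)                                        # torch.Size([128, 7, 7, 96])
print(torch.equal(window_reverse(windows, 7, 56, 56), x))   # True: partition and reverse are inverses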
Since self-attention is permutation-invariant, positional information has to be injected via a position embedding. Swin Transformer uses a relative position encoding: a relative position bias B is added to the query-key similarity when computing attention,

Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V

where Q, K, V ∈ R^(M²×d) are the query, key and value, d is the head dimension, M² is the number of patches in a window, and B ∈ R^(M²×M²) is the relative position bias encoding the relative positions between patches; adding it to the attention logits (together with the attention mask, if any) injects position information. In practice we do not need a parameter matrix that large: within a window, the relative position of two tokens along each axis lies in [-M+1, M-1], i.e. 2M-1 possible values, so with a 2D relative position encoding there are only (2M-1)×(2M-1) distinct relative positions. It therefore suffices to learn a (2M-1)×(2M-1) bias table B̂ and obtain the actual B by indexing into it, which greatly reduces the number of parameters (in the paper, each W-MSA layer has its own relative position bias table). For example, with M = 7 the table has 13×13 = 169 entries per head instead of 49×49 = 2401. According to the ablation in the paper, using only the relative position bias works best; adding a ViT-style absolute position embedding (abs. pos.) leaves classification accuracy about the same but degrades segmentation.

The full window attention implementation is shown below:
from timm.models.layers import trunc_normal_


class WindowAttention(nn.Module):
    r""" Window based multi-head self attention (W-MSA) module with relative position bias.
    It supports both of shifted and non-shifted window.
    Args:
        dim (int): Number of input channels.
        window_size (tuple[int]): The height and width of the window.
        num_heads (int): Number of attention heads.
        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set
        attn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0
        proj_drop (float, optional): Dropout ratio of output. Default: 0.0
    """
    def __init__(self, dim, window_size, num_heads, qkv_bias=True, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.dim = dim
        self.window_size = window_size  # Wh, Ww
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5
        # define a parameter table of relative position bias (one set of biases per head)
        self.relative_position_bias_table = nn.Parameter(
            torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads))  # 2*Wh-1 * 2*Ww-1, nH
        # get pair-wise relative position index for each token inside the window
        coords_h = torch.arange(self.window_size[0])
        coords_w = torch.arange(self.window_size[1])
        coords = torch.stack(torch.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
        coords_flatten = torch.flatten(coords, 1)  # 2, Wh*Ww
        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]  # 2, Wh*Ww, Wh*Ww
        relative_coords = relative_coords.permute(1, 2, 0).contiguous()  # Wh*Ww, Wh*Ww, 2
        relative_coords[:, :, 0] += self.window_size[0] - 1  # shift to start from 0
        relative_coords[:, :, 1] += self.window_size[1] - 1
        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1  # scale row offsets so (row, col) maps to a unique 1D index
        relative_position_index = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww
        self.register_buffer("relative_position_index", relative_position_index)

        # the same projection weights are shared by all windows, so only one set is defined
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)
        trunc_normal_(self.relative_position_bias_table, std=.02)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, mask=None):
        """
        Args:
            x: input features with shape of (num_windows*B, N, C)
            mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww) or None
        """
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)
        q = q * self.scale
        attn = (q @ k.transpose(-2, -1))

        # look up the relative position bias B from the table
        relative_position_bias = self.relative_position_bias_table[self.relative_position_index.view(-1)].view(
            self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1)  # Wh*Ww,Wh*Ww,nH
        relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous()  # nH, Wh*Ww, Wh*Ww
        attn = attn + relative_position_bias.unsqueeze(0)

        # attention mask (used by SW-MSA)
        if mask is not None:
            nW = mask.shape[0]
            attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)
            attn = attn.view(-1, self.num_heads, N, N)
            attn = self.softmax(attn)
        else:
            attn = self.softmax(attn)
        attn = self.attn_drop(attn)
        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
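A minimal usage sketch of WindowAttention (illustrative values; the input is windows flattened to (num_windows*B, M*M, C), e.g. the output of window_partition above):

attn = WindowAttention(dim=96, window_size=(7, 7), num_heads=3)
x_windows = torch.randn(128, 49, 96)   # 2 images * 64 windows, 7*7 tokens per window, C=96
print(attn(x_windows).shape)           # torch.Size([128, 49, 96]); pass mask=... for SW-MSA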
One remaining detail of window attention: if the image or feature map cannot be divided evenly into windows, padding can be used. The paper pads the bottom-right of the feature map (the influence of the padded positions can be removed with an attention mask, although leaving them in has little effect):
To make the window size (M, M) divisible by the feature map size of (h, w), bottom-right padding is employed on the feature map if needed.
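A minimal sketch of such bottom-right padding, assuming a (B, H, W, C) feature map (the helper name and layout are my own, not the official implementation):

import torch
import torch.nn.functional as F

def pad_to_window_size(x, window_size):
    """Pad a (B, H, W, C) feature map on the bottom/right so that H and W
    become multiples of window_size (illustrative sketch)."""
    B, H, W, C = x.shape
    pad_h = (window_size - H % window_size) % window_size
    pad_w = (window_size - W % window_size) % window_size
    # F.pad pads from the last dim backwards: (C_left, C_right, W_left, W_right, H_top, H_bottom)
    return F.pad(x, (0, 0, 0, pad_w, 0, pad_h))

x = torch.randn(1, 10, 10, 96)
print(pad_to_window_size(x, 7).shape)  # torch.Size([1, 14, 14, 96])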
Shifted Window
Window attention is, after all, a local attention: if every layer in a stage used the same window partition, information would only be exchanged inside each window. In CNN terms, the receptive field would not grow at all within a stage and would only double when entering the next stage. The solution proposed in the paper is the shifted window, which creates connections between windows. A regular window partition starts from the top-left corner of the feature map and splits it uniformly; for example, an 8×8 feature map is split into 4 windows with M = 4. A shifted partition instead displaces the windows by ⌊M/2⌋ patches along both height and width, so the windows at the borders are no longer M×M. Window attention over shifted windows is denoted SW-MSA; alternating W-MSA and SW-MSA strengthens the model, because information now also flows between windows.

Shifted windows introduce a problem, however: the windows at the borders change size, and the number of windows increases from ⌈h/M⌉×⌈w/M⌉ to (⌈h/M⌉+1)×(⌈w/M⌉+1). The most straightforward way to implement SW-MSA is to pad the border windows to M×M so that, as in W-MSA, all windows can be batched together, but this increases the computation, especially when the total number of windows is small. In the 8×8, M = 4 example above, W-MSA has 2×2 = 4 windows while SW-MSA has 3×3 = 9, making the computation 2.25x larger. The paper instead proposes a cyclic-shift strategy to implement SW-MSA efficiently: intuitively, the regions along the top and left edges are rolled to the bottom-right, after which the 8 smaller windows recombine into 3 regular M×M windows; the total number of windows, and hence the computation, stays the same, and after the attention computation the shift is simply reversed. The cyclic shift can be implemented with torch.roll. Although the small windows are now packed into regular-sized windows, an attention mask is needed during the attention computation to keep the result equivalent to attention over the original shifted windows. Generating this mask is straightforward: the shifted partition splits the feature map into 9 groups, 8 of them border windows and one being the set of regular interior M×M windows; each group is assigned a different id, and after the windows are repacked, tokens with different ids are masked so they do not attend to each other.
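The cyclic shift itself is just torch.roll and its inverse; a tiny standalone demo (illustrative):

import torch

x = torch.arange(64).view(1, 8, 8, 1)                       # an 8x8 "feature map", M=4, shift=2
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))       # top 2 rows / left 2 cols wrap to bottom/right
restored = torch.roll(shifted, shifts=(2, 2), dims=(1, 2))  # reverse cyclic shift
print(torch.equal(restored, x))                             # True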

The implementation is shown below; shift_size distinguishes W-MSA (shift_size = 0) from SW-MSA (shift_size > 0):
from timm.models.layers import DropPath, Mlp, to_2tuple
# Mlp from timm has the same interface as the two-layer MLP (fc -> GELU -> drop -> fc -> drop)
# defined in the official Swin repo.


class SwinTransformerBlock(nn.Module):
    r""" Swin Transformer Block.
    Args:
        dim (int): Number of input channels.
        input_resolution (tuple[int]): Input resolution.
        num_heads (int): Number of attention heads.
        window_size (int): Window size.
        shift_size (int): Shift size for SW-MSA.
        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.
        drop (float, optional): Dropout rate. Default: 0.0
        attn_drop (float, optional): Attention dropout rate. Default: 0.0
        drop_path (float, optional): Stochastic depth rate. Default: 0.0
        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU
        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm
    """
    def __init__(self, dim, input_resolution, num_heads, window_size=7, shift_size=0,
                 mlp_ratio=4., qkv_bias=True, qk_scale=None, drop=0., attn_drop=0., drop_path=0.,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super().__init__()
        self.dim = dim
        self.input_resolution = input_resolution
        self.num_heads = num_heads
        self.window_size = window_size
        self.shift_size = shift_size
        self.mlp_ratio = mlp_ratio

        # if the input resolution is no larger than M, a single (unshifted) window attention suffices
        if min(self.input_resolution) <= self.window_size:
            # if window size is larger than input resolution, we don't partition windows
            self.shift_size = 0
            self.window_size = min(self.input_resolution)
        assert 0 <= self.shift_size < self.window_size, "shift_size must be in [0, window_size)"
        self.norm1 = norm_layer(dim)
        self.attn = WindowAttention(
            dim, window_size=to_2tuple(self.window_size), num_heads=num_heads,
            qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

        # for SW-MSA, precompute the attention mask
        if self.shift_size > 0:
            # calculate attention mask for SW-MSA
            # assign a different id (0~8) to each of the 9 groups of windows
            H, W = self.input_resolution
            img_mask = torch.zeros((1, H, W, 1))  # 1 H W 1
            h_slices = (slice(0, -self.window_size),
                        slice(-self.window_size, -self.shift_size),
                        slice(-self.shift_size, None))
            w_slices = (slice(0, -self.window_size),
                        slice(-self.window_size, -self.shift_size),
                        slice(-self.shift_size, None))
            cnt = 0
            for h in h_slices:
                for w in w_slices:
                    img_mask[:, h, w, :] = cnt
                    cnt += 1
            mask_windows = window_partition(img_mask, self.window_size)  # nW, window_size, window_size, 1
            mask_windows = mask_windows.view(-1, self.window_size * self.window_size)
            attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
            attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0))
        else:
            attn_mask = None
        self.register_buffer("attn_mask", attn_mask)

    def forward(self, x):
        H, W = self.input_resolution
        B, L, C = x.shape
        assert L == H * W, "input feature has wrong size"
        shortcut = x
        x = self.norm1(x)
        x = x.view(B, H, W, C)
        # cyclic shift
        if self.shift_size > 0:
            shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
        else:
            shifted_x = x
        # partition windows
        x_windows = window_partition(shifted_x, self.window_size)  # nW*B, window_size, window_size, C
        x_windows = x_windows.view(-1, self.window_size * self.window_size, C)  # nW*B, window_size*window_size, C
        # W-MSA/SW-MSA
        attn_windows = self.attn(x_windows, mask=self.attn_mask)  # nW*B, window_size*window_size, C
        # merge windows
        attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)
        shifted_x = window_reverse(attn_windows, self.window_size, H, W)  # B H' W' C
        # reverse cyclic shift
        if self.shift_size > 0:
            x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
        else:
            x = shifted_x
        x = x.view(B, H * W, C)
        # FFN
        x = shortcut + self.drop_path(x)
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x
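A usage sketch pairing one W-MSA block with one SW-MSA block, the way consecutive blocks in a stage are arranged (illustrative values; assumes the classes and imports above):

block_w = SwinTransformerBlock(dim=96, input_resolution=(56, 56), num_heads=3,
                               window_size=7, shift_size=0)  # W-MSA
block_sw = SwinTransformerBlock(dim=96, input_resolution=(56, 56), num_heads=3,
                                window_size=7, shift_size=3)  # SW-MSA, shift = 7 // 2
x = torch.randn(2, 56 * 56, 96)
print(block_sw(block_w(x)).shape)  # torch.Size([2, 3136, 96])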
Swin Transformer
The core of Swin Transformer is to alternate W-MSA and SW-MSA within each stage, with window size M = 7. The paper designs four model variants, Swin-T, Swin-S, Swin-B and Swin-L, with the following configurations:
- Swin-T: C = 96, layer numbers = {2, 2, 6, 2}
- Swin-S: C = 96, layer numbers = {2, 2, 18, 2}
- Swin-B: C = 128, layer numbers = {2, 2, 18, 2}
- Swin-L: C = 192, layer numbers = {2, 2, 18, 2}
The variants differ in the feature dimension C and the number of blocks per stage; Swin-T and Swin-S are comparable in complexity to ResNet-50 (DeiT-S) and ResNet-101, respectively. In all models, each attention head has a dimension of 32 (only the number of heads changes across stages), and the FFN expansion ratio is 4.
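For reference, the four variants can be summarized as a config dict (the layout below is my own; the head counts follow from C per stage and the fixed head dimension of 32):

swin_configs = {
    # name:  embed dim C, blocks per stage, heads per stage (head_dim = 32 everywhere)
    "Swin-T": dict(embed_dim=96,  depths=(2, 2, 6, 2),  num_heads=(3, 6, 12, 24)),
    "Swin-S": dict(embed_dim=96,  depths=(2, 2, 18, 2), num_heads=(3, 6, 12, 24)),
    "Swin-B": dict(embed_dim=128, depths=(2, 2, 18, 2), num_heads=(4, 8, 16, 32)),
    "Swin-L": dict(embed_dim=192, depths=(2, 2, 18, 2), num_heads=(6, 12, 24, 48)),
}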
On ImageNet, Swin Transformer outperforms DeiT and also achieves a good speed-accuracy trade-off compared with CNNs such as RegNet and EfficientNet.

For object detection (and instance segmentation), replacing the backbone with Swin Transformer consistently improves results; the HTC++ model with a Swin-L backbone reaches 58.7 box AP and 51.1 mask AP on COCO test-dev.

For semantic segmentation, switching the backbone to Swin Transformer also brings gains; UperNet with a Swin-L backbone reaches 53.5 mIoU on ADE20K val.

So Swin Transformer performs very well on dense prediction tasks.
Self-Supervised Learning
Recently, the SwinT authors also released a report on using the model for self-supervised training: Self-Supervised Learning with Swin Transformers. Several papers have already applied vision transformers to self-supervised learning, including two from Facebook AI: MoCo v3 (An Empirical Study of Training Self-Supervised Vision Transformers) and DINO (Emerging Properties in Self-Supervised Vision Transformers), both of which demonstrate the promise of ViT models in this area. The SwinT report focuses more on showing that a self-supervised SwinT, like a CNN, can transfer to downstream tasks and achieve good results.
The self-supervised method used in the report is MoBY, which essentially combines MoCo v2 and BYOL:
MoBY is a combination of two popular self-supervised learning approaches: MoCo v2 and BYOL. It inherits the momentum design, the key queue, and the contrastive loss used in MoCo v2, and inherits the asymmetric encoders, asymmetric data augmentations and the momentum scheduler in BYOL.

On ImageNet-1K linear evaluation, MoBY outperforms MoCo v3 and DINO (without the multi-crop scheme), and the SwinT-based model also outperforms DeiT.

When transferred to downstream tasks such as instance segmentation, the self-supervised model performs on par with (slightly below) its supervised counterpart.

Conclusion
The success of Swin Transformer shows the potential of vision transformer models for dense prediction tasks. Personally, though, I feel its design is still somewhat complex; Twins (Twins: Revisiting the Design of Spatial Attention in Vision Transformers), recently proposed by Meituan, feels more elegant.
References
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- microsoft/Swin-Transformer
- xiaohu2015/SwinT_detectron2
- Self-Supervised Learning with Swin Transformers