Hot on the heels of Google's MLP-Mixer! Facebook AI proposes ResMLP!
Recently, everyone's timelines were flooded by Google AI's release of MLP-Mixer: An all-MLP Architecture for Vision (Google AI proposes MLP-Mixer: MLPs alone reach SOTA on ImageNet!). The paper shows that an architecture built from nothing but the simplest MLP components can reach SOTA-level accuracy on ImageNet. The very day after Google released its paper, Facebook AI published another one: ResMLP: Feedforward networks for image classification with data-efficient training. The ResMLP model it proposes is strikingly similar to Google's MLP-Mixer; both demonstrate that a simple MLP architecture can achieve fairly good results on image classification.

The ResMLP architecture is shown in the figure above. The input is again a sequence of patch embeddings, and the basic block consists of a linear layer and an MLP: the linear layer handles interaction across patches, while the MLP handles interaction across the channels of each patch (i.e. the FFN of the original Transformer):

Although the overall idea is the same as MLP-Mixer, ResMLP differs in the following ways:
(1) ResMLP does not use LayerNorm. Instead, it normalizes with an affine transformation (Aff). Unlike LayerNorm, no statistics need to be computed for normalization; it simply applies a linear transformation with two learned parameter vectors α and β:
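Concretely, matching the element-wise implementation below, the operation on a patch embedding x is just a per-channel rescale and shift, with no mean or variance involved:

Aff_{α,β}(x) = Diag(α) · x + β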

The implementation is quite simple:
import torch
import torch.nn as nn

# No norm layer
class Affine(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.alpha * x + self.beta
This normalization is very similar to the LayerScale used in Going deeper with image transformers, except that LayerScale has no bias term. In ResMLP, the Aff transform is used as the pre-normalization, while a LayerScale-style scaling is applied to the output of each residual branch:
# MLP on channels
class Mlp(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.fc2(x)
        return x

# ResMLP blocks: a linear between patches + a MLP to process them independently
class ResMLP_BLocks(nn.Module):
    def __init__(self, nb_patches, dim, layerscale_init):
        super().__init__()
        self.affine_1 = Affine(dim)
        self.affine_2 = Affine(dim)
        self.linear_patches = nn.Linear(nb_patches, nb_patches)  # Linear layer on patches
        self.mlp_channels = Mlp(dim)  # MLP on channels
        self.layerscale_1 = nn.Parameter(layerscale_init * torch.ones((dim)))  # LayerScale
        self.layerscale_2 = nn.Parameter(layerscale_init * torch.ones((dim)))  # parameters

    def forward(self, x):
        res_1 = self.linear_patches(self.affine_1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.layerscale_1 * res_1
        res_2 = self.mlp_channels(self.affine_2(x))
        x = x + self.layerscale_2 * res_2
        return x
Like ViT and MLP-Mixer, ResMLP extracts features by stacking identical blocks:
# ResMLP model: Stacking the full network
class ResMLP_models(nn.Module):
    def __init__(self, dim, depth, nb_patches, layerscale_init, num_classes):
        super().__init__()
        self.patch_projector = Patch_projector()  # patch embedding layer (not defined in the paper's snippet)
        self.blocks = nn.ModuleList([
            ResMLP_BLocks(nb_patches, dim, layerscale_init)
            for i in range(depth)])
        self.affine = Affine(dim)
        self.linear_classifier = nn.Linear(dim, num_classes)

    def forward(self, x):
        B, C, H, W = x.shape
        x = self.patch_projector(x)
        for blk in self.blocks:
            x = blk(x)
        x = self.affine(x)
        x = x.mean(dim=1).reshape(B, -1)  # average pooling
        return self.linear_classifier(x)
(2) Unlike MLP-Mixer, ResMLP does not use two MLPs: for the token-mixing part it uses only a single linear layer. The original intent of ResMLP was simply to replace self-attention with an MLP, and since the FFN that follows self-attention is itself an MLP, this would have made it essentially the same as Google's MLP-Mixer. However, the experiments showed that the larger the hidden dimension of the MLP replacing self-attention, the worse the results, so it was simplified to a simple linear layer of size N × N (see the sketch below for contrast);
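For contrast, a Mixer-style token-mixing sub-layer would look roughly like the following sketch (the class name and the expansion factor are illustrative assumptions, not taken from either paper); ResMLP collapses this whole sub-layer into a single nn.Linear(nb_patches, nb_patches):

import torch.nn as nn

# MLP-Mixer-style token mixing, acting across the patch dimension
# (illustrative only; name and expansion factor are assumptions)
class TokenMixingMLP(nn.Module):
    def __init__(self, nb_patches, expansion=0.5):
        super().__init__()
        hidden = int(nb_patches * expansion)
        self.fc1 = nn.Linear(nb_patches, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, nb_patches)

    def forward(self, x):  # x: (B, dim, nb_patches), i.e. already transposed
        return self.fc2(self.act(self.fc1(x)))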

(3) There is no pretraining on large datasets (ImageNet-21K, JFT-300M); the models are trained directly on ImageNet-1K, and good performance comes mainly from the training strategy (heavy data augmentation and optionally distillation), as sketched below.
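The recipe is in the spirit of DeiT-style training. A rough sketch of such a heavy-augmentation pipeline is shown below; the specific operations and magnitudes are illustrative assumptions, not the paper's exact hyperparameters:

from torchvision import transforms

# DeiT/ResMLP-style heavy augmentation (illustrative values, not the official recipe)
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),   # automated strong augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25),                 # random-erasing regularization
])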

Compared with ViT and CNNs, ResMLP is still slightly worse, but knowledge distillation improves it further. This suggests that MLP models overfit rather easily, and that distillation probably acts as a form of regularization that boosts performance.
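As a reminder of what distillation amounts to, a generic soft-label knowledge-distillation loss is sketched below (the paper itself follows the DeiT distillation procedure with a convnet teacher; the temperature and weighting here are illustrative assumptions):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    # Cross-entropy against the ground-truth labels
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened student and teacher distributions
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)
    return (1 - alpha) * ce + alpha * kd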
Another point is that because ResMLP uses a linear layer for patch mixing, the learned weights can easily be visualized (as an N² × N² image). The visualization shows that the learned parameters look very much like convolutions, exhibiting locality (especially in the earlier layers):
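A minimal sketch of this kind of visualization, assuming a model built from the classes above with a 14×14 patch grid (so linear_patches.weight has shape 196×196), could look like this:

import matplotlib.pyplot as plt

def show_patch_filters(block, grid=14, rows=4, cols=4):
    # Each row of the patch-mixing weight matrix says how one patch aggregates
    # all the others; reshaping it to the patch grid reveals conv-like, local patterns.
    W = block.linear_patches.weight.detach().cpu()  # shape: (grid*grid, grid*grid)
    fig, axes = plt.subplots(rows, cols, figsize=(8, 8))
    step = W.shape[0] // (rows * cols)
    for i, ax in enumerate(axes.flat):
        ax.imshow(W[i * step].reshape(grid, grid))
        ax.axis('off')
    plt.show()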

The paper further analyzes the sparsity of the model and finds that not only the linear (patch-mixing) layers but also the MLP weights are quite sparse, which could be useful for model compression:
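A rough way to reproduce this kind of sparsity check is sketched below: it simply reports, for every 2-D weight matrix, the fraction of entries whose magnitude falls below a threshold (the 1e-2 threshold is an arbitrary illustrative choice, not the paper's metric):

import torch

@torch.no_grad()
def weight_sparsity(model, threshold=1e-2):
    stats = {}
    for name, param in model.named_parameters():
        if 'weight' in name and param.dim() == 2:
            # Fraction of near-zero entries in this weight matrix
            stats[name] = (param.abs() < threshold).float().mean().item()
    return stats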

Although, under comparable conditions, MLPs are still slightly behind ViT and CNNs, the MLP architecture is extremely simple; perhaps in the near future, better training strategies and designs will push MLP performance even higher.

