黄视频大几吧操逼,丰满少妇一级A片,日韩乱码人妻无码超清蜜桃丨,免费观看91看片,黑人大屌在线,免费的看污片丝瓜视频,欧美另类性爱视频,亚洲中文字幕无码视频在线观看

↑ 點擊藍字?關注極市平臺

作者丨HUST小菜雞@知乎

來源丨h(huán)ttps://zhuanlan.zhihu.com/p/146130215

編輯丨極市平臺

極市導讀

Attention是一種資源分配的機制，在視覺方面，其核心思想是突出對象的某些重要特征。本文主要介紹了多種注意力機制的基本結構和實現(xiàn)過程。

之前在看DETR這篇論文中的self_attention，然后結合之前實驗室組會經(jīng)常提起的注意力機制，所以本周時間對注意力機制進行了相關的梳理，以及相關的源碼閱讀了解其實現(xiàn)的機制。

一、注意力機制（attention mechanism）

attention機制可以它認為是一種資源分配的機制，可以理解為對于原本平均分配的資源根據(jù)attention對象的重要程度重新分配資源，重要的單位就多分一點，不重要或者不好的單位就少分一點，在深度神經(jīng)網(wǎng)絡的結構設計中，attention所要分配的資源基本上就是權重了。

視覺注意力分為幾種，核心思想是基于原有的數(shù)據(jù)找到其之間的關聯(lián)性，然后突出其某些重要特征，有通道注意力，像素注意力，多階注意力等，也有把NLP中的自注意力引入。

二、自注意力（self-attention）

參考文獻：http://papers.nips.cc/paper/7181-attention-is-all-you-need

參考資料：https://zhuanlan.zhihu.com/p/48508221

GitHub：https://github.com/huggingface/transformers

自注意力有時候也稱為內(nèi)部注意力，是一個與單個序列的不同位置相關的注意力機制，目的是計算序列的表達形式，因為解碼器的位置不變性，以及在DETR中，每個像素不僅僅包含數(shù)值信息，并且每個像素的位置信息也很重要。

所有的編碼器在結構上都是相同的，但它們沒有共享參數(shù)。每個編碼器都可以分解成兩個子層：

在transformer中，每個encoder子層有Multi-head self-attention和position-wise FFN組成。

輸入的每個單詞通過嵌入的方式形成詞向量，通過自注意進行編碼，然后再送入FFN得出一個層級的編碼。

解碼器在結構上也是多個相同的堆疊而成，在有和encoder相似的結構的Multi-head self-attention和position-wise FFN，同時還多了一個注意力層用來關注輸入句子的相關部分。

Self-Attention

Self-Attention是Transformer最核心的內(nèi)容，可以理解位將隊列和一組值與輸入對應，即形成querry，key，value向output的映射，output可以看作是value的加權求和，加權值則是由Self-Attention來得出的。

具體實施細節(jié)如下：

在self-attention中，每個單詞有3個不同的向量，它們分別是Query向量，Key向量和Value向量，長度均是64。它們是通過3個不同的權值矩陣由嵌入向量X乘以三個不同的權值矩陣得到，其中三個矩陣的尺寸也是相同的。均是512×64。

Self_attention的計算過程如下

將輸入單詞轉(zhuǎn)化成嵌入向量；
根據(jù)嵌入向量得到q，k，v三個向量；
為每個向量計算一個score：score=q×v；
為了梯度的穩(wěn)定，Transformer使用了score歸一化，即除以sqrt(dk)；
對score施以softmax激活函數(shù)；
softmax點乘Value值v，得到加權的每個輸入向量的評分v；
相加之后得到最終的輸出結果z。

矩陣形式的計算過程：

對于Multi-head self-attention，通過論文可以看出就是將單個點積注意力進行融合，兩者相結合得出了transformer

具體的實施可以參照detr的models/transformer

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved"""DETR Transformer class.
Copy-paste from torch.nn.Transformer with modifications:    * positional encodings are passed in MHattention    * extra LN at the end of encoder is removed    * decoder returns a stack of activations from all decoding layers"""import copyfrom typing import Optional, List
import torchimport torch.nn.functional as Ffrom torch import nn, Tensor

class Transformer(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,                 activation="relu", normalize_before=False,                 return_intermediate_dec=False):        super().__init__()
        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,                                                dropout, activation, normalize_before)        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,                                                dropout, activation, normalize_before)        decoder_norm = nn.LayerNorm(d_model)        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,                                          return_intermediate=return_intermediate_dec)
        self._reset_parameters()
        self.d_model = d_model        self.nhead = nhead
    def _reset_parameters(self):        for p in self.parameters():            if p.dim() > 1:                nn.init.xavier_uniform_(p)
    def forward(self, src, mask, query_embed, pos_embed):        # flatten NxCxHxW to HWxNxC        bs, c, h, w = src.shape        src = src.flatten(2).permute(2, 0, 1)        pos_embed = pos_embed.flatten(2).permute(2, 0, 1)        query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)        mask = mask.flatten(1)
        tgt = torch.zeros_like(query_embed)        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)        hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,                          pos=pos_embed, query_pos=query_embed)        return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)

class TransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_layers, norm=None):        super().__init__()        self.layers = _get_clones(encoder_layer, num_layers)        self.num_layers = num_layers        self.norm = norm
    def forward(self, src,                mask: Optional[Tensor] = None,                src_key_padding_mask: Optional[Tensor] = None,                pos: Optional[Tensor] = None):        output = src
        for layer in self.layers:            output = layer(output, src_mask=mask,                           src_key_padding_mask=src_key_padding_mask, pos=pos)
        if self.norm is not None:            output = self.norm(output)
        return output

class TransformerDecoder(nn.Module):
    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):        super().__init__()        self.layers = _get_clones(decoder_layer, num_layers)        self.num_layers = num_layers        self.norm = norm        self.return_intermediate = return_intermediate
    def forward(self, tgt, memory,                tgt_mask: Optional[Tensor] = None,                memory_mask: Optional[Tensor] = None,                tgt_key_padding_mask: Optional[Tensor] = None,                memory_key_padding_mask: Optional[Tensor] = None,                pos: Optional[Tensor] = None,                query_pos: Optional[Tensor] = None):        output = tgt
        intermediate = []
        for layer in self.layers:            output = layer(output, memory, tgt_mask=tgt_mask,                           memory_mask=memory_mask,                           tgt_key_padding_mask=tgt_key_padding_mask,                           memory_key_padding_mask=memory_key_padding_mask,                           pos=pos, query_pos=query_pos)            if self.return_intermediate:                intermediate.append(self.norm(output))
        if self.norm is not None:            output = self.norm(output)            if self.return_intermediate:                intermediate.pop()                intermediate.append(output)
        if self.return_intermediate:            return torch.stack(intermediate)
        return output

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,                 activation="relu", normalize_before=False):        super().__init__()        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)        # Implementation of Feedforward model        self.linear1 = nn.Linear(d_model, dim_feedforward)        self.dropout = nn.Dropout(dropout)        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)        self.norm2 = nn.LayerNorm(d_model)        self.dropout1 = nn.Dropout(dropout)        self.dropout2 = nn.Dropout(dropout)
        self.activation = _get_activation_fn(activation)        self.normalize_before = normalize_before
    def with_pos_embed(self, tensor, pos: Optional[Tensor]):        return tensor if pos is None else tensor + pos
    def forward_post(self,                     src,                     src_mask: Optional[Tensor] = None,                     src_key_padding_mask: Optional[Tensor] = None,                     pos: Optional[Tensor] = None):        q = k = self.with_pos_embed(src, pos)        src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,                              key_padding_mask=src_key_padding_mask)[0]        src = src + self.dropout1(src2)        src = self.norm1(src)        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))        src = src + self.dropout2(src2)        src = self.norm2(src)        return src
    def forward_pre(self, src,                    src_mask: Optional[Tensor] = None,                    src_key_padding_mask: Optional[Tensor] = None,                    pos: Optional[Tensor] = None):        src2 = self.norm1(src)        q = k = self.with_pos_embed(src2, pos)        src2 = self.self_attn(q, k, value=src2, attn_mask=src_mask,                              key_padding_mask=src_key_padding_mask)[0]        src = src + self.dropout1(src2)        src2 = self.norm2(src)        src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))        src = src + self.dropout2(src2)        return src
    def forward(self, src,                src_mask: Optional[Tensor] = None,                src_key_padding_mask: Optional[Tensor] = None,                pos: Optional[Tensor] = None):        if self.normalize_before:            return self.forward_pre(src, src_mask, src_key_padding_mask, pos)        return self.forward_post(src, src_mask, src_key_padding_mask, pos)

class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,                 activation="relu", normalize_before=False):        super().__init__()        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)        # Implementation of Feedforward model        self.linear1 = nn.Linear(d_model, dim_feedforward)        self.dropout = nn.Dropout(dropout)        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)        self.norm2 = nn.LayerNorm(d_model)        self.norm3 = nn.LayerNorm(d_model)        self.dropout1 = nn.Dropout(dropout)        self.dropout2 = nn.Dropout(dropout)        self.dropout3 = nn.Dropout(dropout)
        self.activation = _get_activation_fn(activation)        self.normalize_before = normalize_before
    def with_pos_embed(self, tensor, pos: Optional[Tensor]):        return tensor if pos is None else tensor + pos
    def forward_post(self, tgt, memory,                     tgt_mask: Optional[Tensor] = None,                     memory_mask: Optional[Tensor] = None,                     tgt_key_padding_mask: Optional[Tensor] = None,                     memory_key_padding_mask: Optional[Tensor] = None,                     pos: Optional[Tensor] = None,                     query_pos: Optional[Tensor] = None):        q = k = self.with_pos_embed(tgt, query_pos)        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,                              key_padding_mask=tgt_key_padding_mask)[0]        tgt = tgt + self.dropout1(tgt2)        tgt = self.norm1(tgt)        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),                                   key=self.with_pos_embed(memory, pos),                                   value=memory, attn_mask=memory_mask,                                   key_padding_mask=memory_key_padding_mask)[0]        tgt = tgt + self.dropout2(tgt2)        tgt = self.norm2(tgt)        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))        tgt = tgt + self.dropout3(tgt2)        tgt = self.norm3(tgt)        return tgt
    def forward_pre(self, tgt, memory,                    tgt_mask: Optional[Tensor] = None,                    memory_mask: Optional[Tensor] = None,                    tgt_key_padding_mask: Optional[Tensor] = None,                    memory_key_padding_mask: Optional[Tensor] = None,                    pos: Optional[Tensor] = None,                    query_pos: Optional[Tensor] = None):        tgt2 = self.norm1(tgt)        q = k = self.with_pos_embed(tgt2, query_pos)        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,                              key_padding_mask=tgt_key_padding_mask)[0]        tgt = tgt + self.dropout1(tgt2)        tgt2 = self.norm2(tgt)        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),                                   key=self.with_pos_embed(memory, pos),                                   value=memory, attn_mask=memory_mask,                                   key_padding_mask=memory_key_padding_mask)[0]        tgt = tgt + self.dropout2(tgt2)        tgt2 = self.norm3(tgt)        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))        tgt = tgt + self.dropout3(tgt2)        return tgt
    def forward(self, tgt, memory,                tgt_mask: Optional[Tensor] = None,                memory_mask: Optional[Tensor] = None,                tgt_key_padding_mask: Optional[Tensor] = None,                memory_key_padding_mask: Optional[Tensor] = None,                pos: Optional[Tensor] = None,                query_pos: Optional[Tensor] = None):        if self.normalize_before:            return self.forward_pre(tgt, memory, tgt_mask, memory_mask,                                    tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)        return self.forward_post(tgt, memory, tgt_mask, memory_mask,                                 tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)

def _get_clones(module, N):    return nn.ModuleList([copy.deepcopy(module) for i in range(N)])

def build_transformer(args):    return Transformer(        d_model=args.hidden_dim,        dropout=args.dropout,        nhead=args.nheads,        dim_feedforward=args.dim_feedforward,        num_encoder_layers=args.enc_layers,        num_decoder_layers=args.dec_layers,        normalize_before=args.pre_norm,        return_intermediate_dec=True,    )

def _get_activation_fn(activation):    """Return an activation function given a string"""    if activation == "relu":        return F.relu    if activation == "gelu":        return F.gelu    if activation == "glu":        return F.glu    raise RuntimeError(F"activation should be relu/gelu, not {activation}.")

三、軟注意力（soft-attention）

軟注意力是一個[0,1]間的連續(xù)分布問題，更加關注區(qū)域或者通道，軟注意力是確定性注意力，學習完成后可以通過網(wǎng)絡生成，并且是可微的，可以通過神經(jīng)網(wǎng)絡計算出梯度并且可以前向傳播和后向反饋來學習得到注意力的權重。

1、空間域注意力（spatial transformer network）

論文地址：http://papers.nips.cc/paper/5854-spatial-transformer-networks

GitHub地址：https://github.com/fxia22/stn.pytorch

空間區(qū)域注意力可以理解為讓神經(jīng)網(wǎng)絡在看哪里。通過注意力機制，將原始圖片中的空間信息變換到另一個空間中并保留了關鍵信息，在很多現(xiàn)有的方法中都有使用這種網(wǎng)絡，自己接觸過的一個就是ALPHA Pose。spatial transformer其實就是注意力機制的實現(xiàn)，因為訓練出的spatial transformer能夠找出圖片信息中需要被關注的區(qū)域，同時這個transformer又能夠具有旋轉(zhuǎn)、縮放變換的功能，這樣圖片局部的重要信息能夠通過變換而被框盒提取出來。

主要在于空間變換矩陣的學習

class STN(Module):    def __init__(self, layout = 'BHWD'):        super(STN, self).__init__()        if layout == 'BHWD':            self.f = STNFunction()        else:            self.f = STNFunctionBCHW()    def forward(self, input1, input2):        return self.f(input1, input2)

class STNFunction(Function):    def forward(self, input1, input2):        self.input1 = input1        self.input2 = input2        self.device_c = ffi.new("int *")        output = torch.zeros(input1.size()[0], input2.size()[1], input2.size()[2], input1.size()[3])        #print('decice %d' % torch.cuda.current_device())        if input1.is_cuda:            self.device = torch.cuda.current_device()        else:            self.device = -1        self.device_c[0] = self.device        if not input1.is_cuda:            my_lib.BilinearSamplerBHWD_updateOutput(input1, input2, output)        else:            output = output.cuda(self.device)            my_lib.BilinearSamplerBHWD_updateOutput_cuda(input1, input2, output, self.device_c)        return output
    def backward(self, grad_output):        grad_input1 = torch.zeros(self.input1.size())        grad_input2 = torch.zeros(self.input2.size())        #print('backward decice %d' % self.device)        if not grad_output.is_cuda:            my_lib.BilinearSamplerBHWD_updateGradInput(self.input1, self.input2, grad_input1, grad_input2, grad_output)        else:            grad_input1 = grad_input1.cuda(self.device)            grad_input2 = grad_input2.cuda(self.device)            my_lib.BilinearSamplerBHWD_updateGradInput_cuda(self.input1, self.input2, grad_input1, grad_input2, grad_output, self.device_c)        return grad_input1, grad_input2

2、通道注意力（Channel Attention，CA）

通道注意力可以理解為讓神經(jīng)網(wǎng)絡在看什么，典型的代表是SENet。卷積網(wǎng)絡的每一層都有好多卷積核，每個卷積核對應一個特征通道，相對于空間注意力機制，通道注意力在于分配各個卷積通道之間的資源，分配粒度上比前者大了一個級別。

論文：Squeeze-and-Excitation Networks（https://arxiv.org/abs/1709.01507）

GitHub地址：https://github.com/moskomule/senet.pytorch

Squeeze操作：將各通道的全局空間特征作為該通道的表示，使用全局平均池化生成各通道的統(tǒng)計量

Excitation操作：學習各通道的依賴程度，并根據(jù)依賴程度對不同的特征圖進行調(diào)整，得到最后的輸出，需要考察各通道的依賴程度

整體的結構如圖所示：

卷積層的輸出并沒有考慮對各通道的依賴，SEBlock的目的在于然根網(wǎng)絡選擇性的增強信息量最大的特征，是的后期處理充分利用這些特征并抑制無用的特征。

將輸入特征進行 Global avgpooling，得到1×1×Channel
然后bottleneck特征交互一下，先壓縮channel數(shù)，再重構回channel數(shù)
最后接個sigmoid，生成channel間0~1的attention weights，最后scale乘回原輸入特征

SE-ResNet的SE-Block

class SEBasicBlock(nn.Module):    expansion = 1
    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,                 base_width=64, dilation=1, norm_layer=None,                 *, reduction=16):        super(SEBasicBlock, self).__init__()        self.conv1 = conv3x3(inplanes, planes, stride)        self.bn1 = nn.BatchNorm2d(planes)        self.relu = nn.ReLU(inplace=True)        self.conv2 = conv3x3(planes, planes, 1)        self.bn2 = nn.BatchNorm2d(planes)        self.se = SELayer(planes, reduction)        self.downsample = downsample        self.stride = stride
    def forward(self, x):        residual = x        out = self.conv1(x)        out = self.bn1(out)        out = self.relu(out)
        out = self.conv2(out)        out = self.bn2(out)        out = self.se(out)
        if self.downsample is not None:            residual = self.downsample(x)
        out += residual        out = self.relu(out)
        return out


class SELayer(nn.Module):    def __init__(self, channel, reduction=16):        super(SELayer, self).__init__()        self.avg_pool = nn.AdaptiveAvgPool2d(1)        self.fc = nn.Sequential(            nn.Linear(channel, channel // reduction, bias=False),            nn.ReLU(inplace=True),            nn.Linear(channel // reduction, channel, bias=False),            nn.Sigmoid()        )
    def forward(self, x):        b, c, _, _ = x.size()        y = self.avg_pool(x).view(b, c)        y = self.fc(y).view(b, c, 1, 1)        return x * y.expand_as(x)

ResNet的Basic Block

class BasicBlock(nn.Module):    def __init__(self, inplanes, planes, stride=1):        super(BasicBlock, self).__init__()        self.conv1 = conv3x3(inplanes, planes, stride)        self.bn1 = nn.BatchNorm2d(planes)        self.relu = nn.ReLU(inplace=True)        self.conv2 = conv3x3(planes, planes)        self.bn2 = nn.BatchNorm2d(planes)        if inplanes != planes:            self.downsample = nn.Sequential(nn.Conv2d(inplanes, planes, kernel_size=1, stride=stride, bias=False),                                            nn.BatchNorm2d(planes))        else:            self.downsample = lambda x: x        self.stride = stride
    def forward(self, x):        residual = self.downsample(x)        out = self.conv1(x)        out = self.bn1(out)        out = self.relu(out)
        out = self.conv2(out)        out = self.bn2(out)
        out += residual        out = self.relu(out)
        return out

兩者的差別主要體現(xiàn)在多了一個SElayer，詳細可以查看源碼

3、混合域模型(融合空間域和通道域注意力)

（1）論文：Residual Attention Network for image classification（CVPR 2017 Open Access Repository）

文章中注意力的機制是軟注意力基本的加掩碼(mask)機制，但是不同的是，這種注意力機制的mask借鑒了殘差網(wǎng)絡的想法，不只根據(jù)當前網(wǎng)絡層的信息加上mask，還把上一層的信息傳遞下來，這樣就防止mask之后的信息量過少引起的網(wǎng)絡層數(shù)不能堆疊很深的問題。

該文章的注意力機制的創(chuàng)新點在于提出了殘差注意力學習(residual attention learning)，不僅只把mask之后的特征張量作為下一層的輸入，同時也將mask之前的特征張量作為下一層的輸入，這時候可以得到的特征更為豐富，從而能夠更好的注意關鍵特征。同時采用三階注意力模塊來構成整個的注意力。

（2）Dual Attention Network for Scene Segmentation（CVPR 2019 Open Access Repository）

4、Non-Local

論文：non-local neural networks(CVPR 2018 Open Access Repository)

GitHub地址：https://github.com/AlexHex7/Non-local_pytorch

Local這個詞主要是針對感受野(receptive field)來說的。以單一的卷積操作為例，它的感受野大小就是卷積核大小，而我們一般都選用3*3，5*5之類的卷積核，它們只考慮局部區(qū)域，因此都是local的運算。同理，池化(Pooling)也是。相反的，non-local指的就是感受野可以很大，而不是一個局部領域。全連接就是non-local的，而且是global的。但是全連接帶來了大量的參數(shù)，給優(yōu)化帶來困難。卷積層的堆疊可以增大感受野，但是如果看特定層的卷積核在原圖上的感受野，它畢竟是有限的。這是local運算不能避免的。然而有些任務，它們可能需要原圖上更多的信息，比如attention。如果在某些層能夠引入全局的信息，就能很好地解決local操作無法看清全局的情況，為后面的層帶去更豐富的信息。

文章定義的對于神經(jīng)網(wǎng)絡通用的Non-Local計算如下所示：

如果按照上面的公式，用for循環(huán)實現(xiàn)肯定是很慢的。此外，如果在尺寸很大的輸入上應用non-local layer，也是計算量很大的。后者的解決方案是，只在高階語義層中引入non-local layer。還可以通過對embedding(θ,?,g)的結果加pooling層來進一步地減少計算量。

首先對輸入的 feature map X 進行線性映射（通過1x1卷積，來壓縮通道數(shù)），然后得到?θ,?,g?特征
通過reshape操作，強行合并上述的三個特征除通道數(shù)外的維度，然后對進行矩陣點乘操作，得到類似協(xié)方差矩陣的東西（這個過程很重要，計算出特征中的自相關性，即得到每幀中每個像素對其他所有幀所有像素的關系）
然后對自相關特征以列or以行（具體看矩陣 g 的形式而定）進行 Softmax 操作，得到0~1的weights，這里就是我們需要的 Self-attention 系數(shù)
最后將 attention系數(shù)，對應乘回特征矩陣g中，然后再上擴channel 數(shù)，與原輸入feature map X殘差

5、位置注意力（position-wise attention）

論文：CCNet: Criss-Cross Attention for Semantic Segmentation(ICCV 2019 Open Access Repository)

Github地址：https://github.com/speedinghzl/CCNet

本篇文章的亮點在于用了巧妙的方法減少了參數(shù)量。在上面的DANet中，attention map計算的是所有像素與所有像素之間的相似性，空間復雜度為(HxW)x(HxW)，而本文采用了criss-cross思想，只計算每個像素與其同行同列即十字上的像素的相似性，通過進行循環(huán)(兩次相同操作)，間接計算到每個像素與每個像素的相似性，將空間復雜度降為(HxW)x(H+W-1)

在計算矩陣相乘時每個像素只抽取特征圖中對應十字位置的像素進行點乘，計算相似度。和non-local的方法相比極大的降低了計算量，同時采用二階注意力，能夠從所有像素中獲取全圖像的上下文信息，以生成具有密集且豐富的上下文信息的新特征圖。在計算矩陣相乘時，每個像素只抽取特征圖中對應十字位置的像素進行點乘，計算相似度。

def _check_contiguous(*args):    if not all([mod is None or mod.is_contiguous() for mod in args]):        raise ValueError("Non-contiguous input")

class CA_Weight(autograd.Function):    @staticmethod    def forward(ctx, t, f):        # Save context        n, c, h, w = t.size()        size = (n, h+w-1, h, w)        weight = torch.zeros(size, dtype=t.dtype, layout=t.layout, device=t.device)
        _ext.ca_forward_cuda(t, f, weight)
        # Output        ctx.save_for_backward(t, f)
        return weight
    @staticmethod    @once_differentiable    def backward(ctx, dw):        t, f = ctx.saved_tensors
        dt = torch.zeros_like(t)        df = torch.zeros_like(f)
        _ext.ca_backward_cuda(dw.contiguous(), t, f, dt, df)
        _check_contiguous(dt, df)
        return dt, df
class CA_Map(autograd.Function):    @staticmethod    def forward(ctx, weight, g):        # Save context        out = torch.zeros_like(g)        _ext.ca_map_forward_cuda(weight, g, out)
        # Output        ctx.save_for_backward(weight, g)
        return out
    @staticmethod    @once_differentiable    def backward(ctx, dout):        weight, g = ctx.saved_tensors
        dw = torch.zeros_like(weight)        dg = torch.zeros_like(g)
        _ext.ca_map_backward_cuda(dout.contiguous(), weight, g, dw, dg)
        _check_contiguous(dw, dg)
        return dw, dg
ca_weight = CA_Weight.applyca_map = CA_Map.apply

class CrissCrossAttention(nn.Module):    """ Criss-Cross Attention Module"""    def __init__(self,in_dim):        super(CrissCrossAttention,self).__init__()        self.chanel_in = in_dim
        self.query_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim//8 , kernel_size= 1)        self.key_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim//8 , kernel_size= 1)        self.value_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim , kernel_size= 1)        self.gamma = nn.Parameter(torch.zeros(1))
    def forward(self,x):        proj_query = self.query_conv(x)        proj_key = self.key_conv(x)        proj_value = self.value_conv(x)
        energy = ca_weight(proj_query, proj_key)        attention = F.softmax(energy, 1)        out = ca_map(attention, proj_value)        out = self.gamma*out + x
        return out


__all__ = ["CrissCrossAttention", "ca_weight", "ca_map"]

四、強注意力（hard attention）

0/1問題，哪些被attention，哪些不被attention。更加關注點，圖像中的每個點都可能延伸出注意力，同時強注意力是一個隨機預測的過程，更加強調(diào)動態(tài)變化，并且是不可微，所以訓練過程往往通過增強學習。

參考資料

https://blog.csdn.net/xys430381_1/article/details/89323444

Gapeng：Non-local neural networks

NX-8MAA09148HY：雙注意力網(wǎng)絡，是豐富了還是牽強了attention？

推薦閱讀

魔改Attention大集合
使用注意力機制來做醫(yī)學圖像分割的解釋和Pytorch實現(xiàn)
深入理解圖注意力機制

還有1天！火焰/摔倒/頭盔識別等四個賽題即將截止報名

添加極市小助手微信（ID : cvmart2），備注：姓名-學校/公司-研究方向-城市（如：小極-北大-目標檢測-深圳），即可申請加入極市目標檢測/圖像分割/工業(yè)檢測/人臉/醫(yī)學影像/3D/SLAM/自動駕駛/超分辨率/姿態(tài)估計/ReID/GAN/圖像增強/OCR/視頻理解等技術交流群：每月大咖直播分享、真實項目需求對接、求職內(nèi)推、算法競賽、干貨資訊匯總、與?10000+來自港科大、北大、清華、中科院、CMU、騰訊、百度等名校名企視覺開發(fā)者互動交流~

△長按添加極市小助手

△長按關注極市平臺，獲取最新CV干貨

覺得有用麻煩給個在看啦~??

綜述｜計算機視覺中的注意力機制

一、注意力機制（attention mechanism）

二、自注意力（self-attention）

三、軟注意力（soft-attention）

1、空間域注意力（spatial transformer network）

2、通道注意力（Channel Attention，CA）