
Author: 林小平 @ Zhihu (reposted with permission)

Source: https://zhuanlan.zhihu.com/p/353365423

Editor: 極市平臺

Editor's summary

This article supplements the PyTorch documentation on the mask parameters of nn.Transformer, focusing on what the *_mask and *_key_padding_mask arguments do.

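PyTorch ships its own Transformer implementation. Compared with Hugging Face and other libraries, its mask parameters are harder to understand (even with the documentation at hand), so this article adds some explanation. (In passing: nn.Transformer does not include a position embedding, you have to implement it yourself, so don't cheerfully start feeding it data right away.)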

>>> import torch
>>> import torch.nn as nn
>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
>>> src = torch.rand((10, 32, 512))
>>> tgt = torch.rand((20, 32, 512))
>>> out = transformer_model(src, tgt)  # no position embedding here, and the masks must also be supplied by you; otherwise this is not the Transformer you have in mind
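Since nn.Transformer leaves the position embedding to you, here is a minimal sketch of one possible choice: the fixed sinusoidal encoding from "Attention Is All You Need". The article does not say which scheme the author uses, so this is an assumption, and the function name is hypothetical. It is written in dependency-free Python to show the formula only; in practice you would build the same table as a tensor and add it to the src and tgt embeddings before calling the model.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Hypothetical sizes, chosen small for readability.
pe = sinusoidal_positional_encoding(seq_len=10, d_model=8)
```

Each row of `pe` would be added to the embedding of the token at that position.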

First, the parameters as listed in the official documentation:

src – the sequence to the encoder (required).
tgt – the sequence to the decoder (required).
src_mask – the additive mask for the src sequence (optional).
tgt_mask – the additive mask for the tgt sequence (optional).
memory_mask – the additive mask for the encoder output (optional).
src_key_padding_mask – the ByteTensor mask for src keys per batch (optional).
tgt_key_padding_mask – the ByteTensor mask for tgt keys per batch (optional).
memory_key_padding_mask – the ByteTensor mask for memory keys per batch (optional).

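The key distinction here is between *_mask and *_key_padding_mask. Whether the prefix * is src, tgt, or memory matters little: it is src for the attention module in the encoder, tgt for the self-attention in the decoder, and memory for the second sub-layer of each decoder block, which performs cross-attention over the encoder output.

*_mask corresponds to the attn_mask argument, and *_key_padding_mask corresponds to the key_padding_mask argument.

Here is how the MultiheadAttention module in torch/nn/modules/activation.py documents these two arguments: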


def forward(self, query, key, value, key_padding_mask=None,
            need_weights=True, attn_mask=None):
    # type: (Tensor, Tensor, Tensor, Optional[Tensor], bool, Optional[Tensor]) -> Tuple[Tensor, Optional[Tensor]]
    r"""
    Args:
        query, key, value: map a query and a set of key-value pairs to an output.
            See "Attention Is All You Need" for more details.
        key_padding_mask: if provided, specified padding elements in the key will
            be ignored by the attention. When given a binary mask and a value is True,
            the corresponding value on the attention layer will be ignored. When given
            a byte mask and a value is non-zero, the corresponding value on the attention
            layer will be ignored
        need_weights: output attn_output_weights.
        attn_mask: 2D or 3D mask that prevents attention to certain positions. A 2D mask will be broadcasted for all
            the batches while a 3D mask allows to specify a different mask for the entries of each batch.

    Shape:
        - Inputs:
        - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is
          the embedding dimension.
        - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is
          the embedding dimension.
        - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is
          the embedding dimension.
        - key_padding_mask: :math:`(N, S)` where N is the batch size, S is the source sequence length.
          If a ByteTensor is provided, the non-zero positions will be ignored while the position
          with the zero positions will be unchanged. If a BoolTensor is provided, the positions with the
          value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.
        - attn_mask: 2D mask :math:`(L, S)` where L is the target sequence length, S is the source sequence length.
          3D mask :math:`(N*num_heads, L, S)` where N is the batch size, L is the target sequence length,
          S is the source sequence length. attn_mask ensure that position i is allowed to attend the unmasked
          positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend
          while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``
          is not allowed to attend while ``False`` values will be unchanged. If a FloatTensor
          is provided, it will be added to the attention weight.

        - Outputs:
        - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size,
          E is the embedding dimension.
        - attn_output_weights: :math:`(N, L, S)` where N is the batch size,
          L is the target sequence length, S is the source sequence length.
    """

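key_padding_mask: hides <PAD> positions so that the embeddings of pad tokens are not attended to. Required shape: (N, S).
attn_mask: a 2D or 3D matrix that prevents attention to specified positions. A 2D mask must have shape (L, S); a 3D mask is also supported and must have shape (N*num_heads, L, S).

Here N is the batch size, L is the target sequence length, and S is the source sequence length. This module appears in the three orange regions of the Transformer architecture figure (the three multi-head attention blocks), so the target sequence is not necessarily the decoder input, and the source sequence is not necessarily the encoder input.

More precisely, the target sequence is the sequence of queries (q) in multi-head attention, and the source sequence is the sequence of keys (k) and values (v). For example, when the decoder performs self-attention, the target and source sequences are the same sequence, so L = S, both equal to the length of the decoder's input sequence.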
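To make the roles of L and S concrete, the sketch below tabulates which mask shapes each of the three attention blocks expects. It is plain Python for illustration, not a PyTorch API: the function name and the block labels are hypothetical, and the sizes reuse the 10 / 20 / 32 of the nn.Transformer example earlier in the article.

```python
def expected_mask_shapes(batch_size, src_len, tgt_len):
    """Map each attention block of nn.Transformer to the mask shapes it expects."""
    blocks = {
        # block label: (query length L, key/value length S, argument prefix)
        "encoder self-attention": (src_len, src_len, "src"),
        "decoder self-attention": (tgt_len, tgt_len, "tgt"),
        "decoder cross-attention": (tgt_len, src_len, "memory"),
    }
    shapes = {}
    for label, (L, S, prefix) in blocks.items():
        shapes[label] = {
            prefix + "_mask": (L, S),                       # forwarded as attn_mask
            prefix + "_key_padding_mask": (batch_size, S),  # forwarded as key_padding_mask
        }
    return shapes


shapes = expected_mask_shapes(batch_size=32, src_len=10, tgt_len=20)
```

Only in cross-attention do L and S differ: the queries come from the decoder (length 20) while the keys and values come from the encoder output (length 10).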

What key_padding_mask does

Here is a simple example:

Take a batch with batch_size = 3 and sequence length 4, whose tokens look like this:


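[
    ['a', 'b', 'c', '<PAD>'],
    ['a', 'b', 'c', 'd'],
    ['a', 'b', '<PAD>', '<PAD>']
]

Now suppose you run self-attention over this batch (in either the encoder or the decoder). Taking the third row as an example, when 'a' goes through the qkv computation it attends, besides itself, to 'b', '<PAD>', and '<PAD>'. We do not want 'a' to see '<PAD>', because padding carries no meaning, so key_padding_mask is needed to hide those positions.

key_padding_mask has shape (N, S). For this example it looks as follows, with key_padding_mask.shape = (3, 4):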


[
    [False, False, False, True],
    [False, False, False, False],
    [False, False, True, True]
]
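The mask above can be derived mechanically from the token batch. A minimal sketch in plain Python, assuming the padding token is the literal string '<PAD>' as in the example (with real data you would compare a tensor of token ids against the pad id instead):

```python
PAD = "<PAD>"

batch = [
    ["a", "b", "c", PAD],
    ["a", "b", "c", "d"],
    ["a", "b", PAD, PAD],
]

# True marks a key position that attention should ignore.
key_padding_mask = [[token == PAD for token in row] for row in batch]

# Shape (N, S): one row per sequence, one entry per key position.
shape = (len(key_padding_mask), len(key_padding_mask[0]))
```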

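Note that key_padding_mask hides the key at a given position (its attention weight becomes 0), but the <PAD> token itself still takes part in the qkv computation as a query. Take the third position of the third row: its q is the embedding of <PAD>, its k and v come from 'a' in the first position and 'b' in the second, and it still outputs an embedding.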
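The effect on a single query can be sketched with a masked softmax in plain Python. This illustrates the semantics only and is not PyTorch's implementation; the function name and the raw scores are made up for the example, and the sketch assumes at least one key is left unmasked.

```python
import math


def masked_attention_weights(scores, key_is_pad):
    """Softmax over one query's scores, with padded keys forced to weight 0."""
    masked = [float("-inf") if pad else s for s, pad in zip(scores, key_is_pad)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Third row of the example: the keys are ['a', 'b', '<PAD>', '<PAD>'].
key_is_pad = [False, False, True, True]
# Hypothetical raw scores for the query at the third position, itself a <PAD>.
pad_query_scores = [0.3, -1.2, 2.0, 0.5]
weights = masked_attention_weights(pad_query_scores, key_is_pad)
```

The <PAD> query still receives a valid weight distribution over 'a' and 'b', which is why it still outputs an embedding.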

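So when you compute the loss on the final output of the transformer during training, you also need to set ignore_index=pad_index. For the third row, the supervision signal is [3205, 1890, 0, 0] with pad_index = 0. That way, even though the <PAD> positions still run qkv against the meaningful positions and output embeddings, their loss is never counted, so whatever they produce does no harm.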
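What ignore_index buys you can be sketched in plain Python: positions whose target equals the pad index are skipped, and the mean is taken over the remaining positions. This is an illustration of the idea rather than PyTorch's code; the function name, the 3-token vocabulary, and all logits are hypothetical stand-ins for the [3205, 1890, 0, 0] example.

```python
import math


def cross_entropy_ignoring(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    losses = []
    for position_logits, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        peak = max(position_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        losses.append(log_norm - position_logits[target])
    return sum(losses) / len(losses)


pad_index = 0
targets = [2, 1, pad_index, pad_index]  # two real tokens, then two pads
logits = [[0.1, 0.2, 2.0], [0.0, 1.5, 0.3], [5.0, -1.0, 0.0], [0.2, 0.2, 0.2]]
loss = cross_entropy_ignoring(logits, targets, pad_index)

# Change only the outputs at the padded positions: the loss is unaffected.
noisy_logits = logits[:2] + [[-9.0, 9.0, 9.0], [7.0, 0.0, -7.0]]
noisy_loss = cross_entropy_ignoring(noisy_logits, targets, pad_index)
```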

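What attn_mask does

When I first saw that there were two mask parameters I was completely confused, all the more because their required shapes differ. Where is attn_mask actually used?

Unlike the encoder, the decoder's self-attention lets each position see only the preceding context. key_padding_mask has shape (batch_size, source_length), which means every query position sees the same picture after key_padding_mask is applied (although the mask can differ between rows of the batch). This cannot meet the needs of the following module:

[Figure: the decoder's masked multi-head attention module]

The mask needed here is as follows:

[Figure: yellow cells are visible and purple cells are hidden; different positions need different parts masked]

PyTorch's nn.Transformer already provides a function that builds this mask for us: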
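To see why key_padding_mask alone cannot express a per-position pattern, the sketch below expands an (N, S) mask over the query dimension in plain Python. The helper name is hypothetical and the expansion is only a conceptual picture of how the mask applies to every query.

```python
def broadcast_key_padding_mask(key_padding_mask, query_len):
    """Expand an (N, S) mask to (N, L, S) by repeating each row for every query."""
    return [[list(row) for _ in range(query_len)] for row in key_padding_mask]


key_padding_mask = [
    [False, False, False, True],
    [False, False, False, False],
    [False, False, True, True],
]
expanded = broadcast_key_padding_mask(key_padding_mask, query_len=4)

# Within one sequence, every query position hides exactly the same keys.
same_for_all_queries = all(
    query_row == per_sequence[0]
    for per_sequence in expanded
    for query_row in per_sequence
)
```

A mask that hides different keys for different queries needs an explicit L dimension, which is what attn_mask provides.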


    def generate_square_subsequent_mask(self, sz: int) -> Tensor:
        r"""Generate a square mask for the sequence. The masked positions are filled with float('-inf').
            Unmasked positions are filled with float(0.0).
        """
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask
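For readers who want to see the resulting values without running PyTorch, here is a plain-Python equivalent of that mask. The function name is hypothetical; it reproduces the same layout with nested lists: 0.0 on and below the diagonal, -inf above it.

```python
def square_subsequent_mask(size):
    """Return a size x size additive mask: 0.0 on and below the diagonal, -inf above."""
    neg_inf = float("-inf")
    return [
        [0.0 if key <= query else neg_inf for key in range(size)]
        for query in range(size)
    ]


# Row index is the query position, column index is the key position.
mask = square_subsequent_mask(3)
```

Because the mask is additive, a 0.0 entry leaves the attention score unchanged, while a -inf entry drives the softmax weight for that key to 0.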

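Back to the example above, take the first row ['a', 'b', 'c', '<PAD>'] (assume we are generating with the decoder and looking at the first sub-layer of a block, i.e. self-attention). In this case:

'a' can see 'a'
'b' can see 'a', 'b'
'c' can see 'a', 'b', 'c'
'<PAD>' should in theory see nothing, but as long as the supervision signal above it is ignore_index this does not matter, so we let it see 'a', 'b', 'c', '<PAD>'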
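The list above can be reproduced by applying a causal additive mask to the token row. A small plain-Python sketch, with hypothetical helper names, that reads the visible tokens straight off the mask:

```python
def causal_mask(size):
    """Additive mask: 0.0 where the key is at or before the query, -inf otherwise."""
    neg_inf = float("-inf")
    return [[0.0 if k <= q else neg_inf for k in range(size)] for q in range(size)]


def visible_tokens(tokens, attn_mask):
    """For each query position, list the tokens whose mask entry is 0.0."""
    return [
        [token for token, entry in zip(tokens, mask_row) if entry == 0.0]
        for mask_row in attn_mask
    ]


row = ["a", "b", "c", "<PAD>"]
seen = visible_tokens(row, causal_mask(len(row)))
for query_token, keys in zip(row, seen):
    print(query_token, "can see", keys)
```

This uses attn_mask only. If a key padding mask for the same row were passed as well, the '<PAD>' key would additionally be hidden from every query.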

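Recall the shape requirements of attn_mask: (L, S) in 2D and (N*num_heads, L, S) in 3D. Here q, k, and v all come from the same sequence (the decoder's input sequence), so L = S. And because the masking rule is the same for every row of the batch, namely that position i can only see the preceding context, a 2D attn_mask is enough; internally the mask matrix is broadcast to every row of the batch: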