Understand the Transformer in 10 Minutes: Principles and Implementation
Reposted from: 深度學(xué)習(xí)這件小事
Overall architecture
Input & Output Embedding
OneHot Encoding
Word Embedding
Positional Embedding
Input short summary
Encoder
Encoder Sub-layer 1: Multi-Head Attention Mechanism
Step 1
Step 2
Step 3
Encoder Sub-layer 2: Position-Wise fully connected feed-forward
Encoder short summary
Decoder
Diff_1: "masked" Multi-Headed Attention
Diff_2: encoder-decoder multi-head attention
Diff_3: Linear and Softmax to Produce Output Probabilities
greedy search
beam search
Scheduled Sampling
0. Model Architecture
Embedding part
Encoder part
Decoder part
1. Representing the Input and Output
1.1 Representing the Input
1.2 Word Embedding
Use pre-trained embeddings and freeze them; in this case the embedding layer is effectively just a lookup table.
Initialize the embeddings randomly (or from pre-trained weights) but keep them trainable, so that they are continually refined during training.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F  # used by later snippets (softmax, relu, log_softmax)

class Embeddings(nn.Module):
    "Token embedding scaled by sqrt(d_model), as in the paper."
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
1.3 Positional Embedding
Learn the positional encoding vectors during training, or
Compute the positional encoding vectors with a fixed formula (the one sketched below).
pos is the position of the word within the sentence.
i indexes the embedding dimension: with d_model = 512, i runs over the 512 dimensions, with sine used for the even indices and cosine for the odd indices.
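For reference, the sinusoidal formula from the original paper, which the code below implements:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$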
class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # shape [1, max_len, d_model], broadcast over the batch
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add the fixed, non-trainable positional encoding to the input embeddings.
        x = x + self.pe[:, :x.size(1)].requires_grad_(False)
        return self.dropout(x)
1.4 Input short summary
After word embedding and positional encoding, the encoder input is a tensor of shape [nbatches, L, 512], where:
nbatches is the batch_size;
L is the sequence length (e.g. for the sentence "我愛你", L = 3);
512 is the embedding dimension d_model.
A quick shape check is sketched below.
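A minimal sketch verifying these shapes, using the Embeddings and PositionalEncoding classes above (the vocabulary size and batch here are made-up toy numbers):

# Toy shape check; vocab size and batch are hypothetical.
d_model, vocab, nbatches, L = 512, 10000, 2, 3
embed = nn.Sequential(Embeddings(d_model, vocab),
                      PositionalEncoding(d_model, dropout=0.1))
tokens = torch.randint(0, vocab, (nbatches, L))  # [nbatches, L] token ids
x = embed(tokens)
print(x.shape)  # torch.Size([2, 3, 512]) -> [nbatches, L, d_model]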
2. Encoder
Each encoder layer contains two sub-layers:
The first is the "multi-head self-attention mechanism".
The second is a "simple, position-wise fully connected feed-forward network".
class Encoder(nn.Module):
    "Core encoder is a stack of N layers"
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)
The class Encoder stacks <layer> N times, where <layer> is an instance of class EncoderLayer.
EncoderLayer is initialized with <size>, <self_attn>, <feed_forward> and <dropout>:
<size> corresponds to d_model, which is 512 in the paper;
<self_attn> is an instance of class MultiHeadedAttention and corresponds to sub-layer 1;
<feed_forward> is an instance of class PositionwiseFeedForward and corresponds to sub-layer 2;
<dropout> corresponds to the dropout rate.
The helpers clones, LayerNorm and SublayerConnection used above are not defined in this article; a sketch is given below.
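A minimal sketch of those three helpers, following The Annotated Transformer (on which this code is based):

import copy

def clones(module, N):
    "Produce N identical copies of a module as a ModuleList."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class LayerNorm(nn.Module):
    "Layer normalization with learnable gain (a_2) and bias (b_2)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

class SublayerConnection(nn.Module):
    "Residual connection around a sub-layer, with layer norm applied to the sub-layer input."
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))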
2.1 Encoder Sub-layer 1: Multi-Head Attention Mechanism
We denote the input of the attention mechanism as x. What x means depends on where we are in the Encoder: at the very beginning, x is the representation of the input sentence; inside the stack, x is the output of the previous EncoderLayer. From x, the query, key and value are obtained through three separate linear transformations:
key = linear_k(x)
query = linear_q(x)
value = linear_v(x)
The attention scores are the dot products QK^T, scaled by 1/sqrt(d_k). As d_k grows, the dot products grow large in magnitude, which is why the scaling is applied. Quoting the paper: "We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients." Applying softmax to the scaled scores yields values between 0 and 1, which can be read as attention weights; the weighted sum of V under these weights gives the output:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
<h> = 8, the number of attention "heads"; the Transformer base model uses 8 heads.
<d_model> = 512.
<dropout> = dropout rate = 0.1.
The per-head dimension d_k is computed as d_model / h; in this example, d_k = 512 / 8 = 64.
Inside forward, the "query" argument has shape [nbatches, L, 512], where:
nbatches corresponds to the batch size;
L corresponds to the sequence length, and 512 to d_model;
"key" and "value" also have shape [nbatches, L, 512].
After the linear transforms, "query", "key" and "value" still have shape [nbatches, L, 512].
view() then reshapes each of them to [nbatches, L, 8, 64], where h = 8 is the number of heads and d_k = 64 is the per-head key dimension.
transpose swaps dimensions 1 and 2, giving shape [nbatches, 8, L, 64]; these steps are traced in the sketch below.
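A small sketch tracing these reshapes with hypothetical sizes:

# Hypothetical sizes, just to trace the reshaping done inside MultiHeadedAttention.
nbatches, L, h, d_k = 2, 10, 8, 64
q = torch.randn(nbatches, L, h * d_k)  # [2, 10, 512], i.e. after the linear projection
q = q.view(nbatches, -1, h, d_k)       # [2, 10, 8, 64], split the last dim into heads
q = q.transpose(1, 2)                  # [2, 8, 10, 64], one (L, d_k) matrix per head
print(q.shape)                         # torch.Size([2, 8, 10, 64])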
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
2.2 Encoder Sub-layer 2: Position-Wise fully connected feed-forward network
class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))
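This implements the feed-forward equation from the paper, where the inner dimension d_ff is 2048 in the base model:

$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)\,W_2 + b_2$$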
2.3 Encoder short summary
SubLayer-1 performs the Multi-Headed Attention.
SubLayer-2 is the position-wise feed-forward network.
Putting the pieces together, an encoder stack can be instantiated as sketched below.
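A minimal assembly sketch with the base-model hyperparameters (N = 6 layers), assuming the classes and helpers defined above:

# Hypothetical assembly of the encoder stack with base-model sizes.
c = copy.deepcopy
attn = MultiHeadedAttention(h=8, d_model=512)
ff = PositionwiseFeedForward(d_model=512, d_ff=2048, dropout=0.1)
encoder = Encoder(EncoderLayer(512, c(attn), c(ff), dropout=0.1), N=6)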
3. The Decoder
The Decoder differs from the Encoder in three ways:
Diff_1: the Decoder's SubLayer-1 uses "masked" Multi-Headed Attention, which prevents the model from seeing the tokens it is supposed to predict, i.e. prevents information leakage.
Diff_2: SubLayer-2 is an encoder-decoder multi-head attention.
Diff_3: a Linear layer and a Softmax layer are applied to the output of SubLayer-3 to produce the probabilities of the output words.
3.1 Diff_1: "masked" Multi-Headed Attention
if mask is not None:
    # Positions where mask == 0 get a large negative score, so softmax assigns them (almost) zero weight.
    scores = scores.masked_fill(mask == 0, -1e9)
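The article does not show how the target mask itself is built. A common helper for hiding future positions (as in The Annotated Transformer) is sketched here:

def subsequent_mask(size):
    "Mask out subsequent positions: the upper triangle (future tokens) is disallowed."
    attn_shape = (1, size, size)
    mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(torch.uint8)
    return mask == 0  # True where a position is allowed to attend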
3.2 Diff_2: encoder-decoder multi-head attention
class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)
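The decoder stack that wraps DecoderLayer is not shown in the article; it mirrors the Encoder class. A sketch following The Annotated Transformer:

class Decoder(nn.Module):
    "Generic N-layer decoder with masking."
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        # memory is the encoder output; tgt_mask hides future target positions.
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)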
3.3 Diff_3: Linear and Softmax to Produce Output Probabilities
Feed the decoder the encoder's representation of the whole source sentence together with a special start symbol </s>. The decoder produces a prediction, which in our example should be "I".
Feed the decoder the encoder output together with "</s>I"; at this step the decoder should predict "Love".
Feed the decoder the encoder output together with "</s>I Love"; at this step the decoder should predict "China".
Feed the decoder the encoder output together with "</s>I Love China"; at this step the decoder should produce the end-of-sentence token "</eos>".
Once the decoder generates </eos>, the translation is complete. (A code sketch of this greedy decoding is given after the Generator class below.)
class Generator(nn.Module):
    "Define standard linear + softmax generation step."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)
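Putting the walkthrough above into code, a minimal greedy-decoding sketch. It assumes a full model object exposing encode, decode and a generator attribute (as in The Annotated Transformer's EncoderDecoder wrapper), plus the subsequent_mask helper sketched earlier; these names are assumptions, not defined in this article:

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    "Greedy search: at every step pick the single most probable next token."
    memory = model.encode(src, src_mask)                 # assumed: EncoderDecoder.encode
    ys = torch.full((1, 1), start_symbol, dtype=src.dtype)
    for _ in range(max_len - 1):
        tgt_mask = subsequent_mask(ys.size(1)).type_as(src_mask.data)
        out = model.decode(memory, src_mask, ys, tgt_mask)  # assumed: EncoderDecoder.decode
        prob = model.generator(out[:, -1])               # Generator above: linear + log_softmax
        next_word = prob.argmax(dim=1).item()
        ys = torch.cat([ys, torch.full((1, 1), next_word, dtype=src.dtype)], dim=1)
    return ys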