
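Overview

Learn about the state-of-the-art Transformer model.
Learn how to implement a Transformer with TensorFlow for the image captioning problem we have already seen.
Compare the results of the Transformer with those of the attention model.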

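Introduction

We have seen that attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks (such as image captioning), allowing dependencies to be modeled regardless of their distance in the input or output sequences.

The Transformer is a model architecture that avoids recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer architecture allows for more parallelization and can reach state-of-the-art translation quality.

In this article, let's see how to use TensorFlow to implement the attention mechanism for caption generation with Transformers.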

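Prerequisites before getting started:

Python programming
TensorFlow and Keras
RNN and LSTM
Transfer learning
Encoder and decoder architecture
Essentials of deep learning: attention in sequence-to-sequence modeling

I recommend reading the following before this article:

A hands-on tutorial to learn the attention mechanism for image caption generation in Python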

https://www.analyticsvidhya.com/blog/2020/11/attention-mechanism-for-caption-generation/

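1. Transformer architecture

2. Implementation of the attention mechanism for caption generation with Transformers using TensorFlow

2.1 Importing the required libraries

2.2 Data loading and preprocessing

2.3 Model definition

2.4 Positional encoding

2.5 Multi-head attention

2.6 Encoder-decoder layers

2.7 Transformer

2.8 Model hyperparameters

2.9 Model training

2.10 BLEU evaluation

2.11 Comparison

3. What's next?

4. End notes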

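Transformer architecture

The Transformer network uses an encoder-decoder architecture similar to that of an RNN. The main difference is that a Transformer can receive the input sentence/sequence in parallel, i.e. there are no time steps associated with the input, and all the words in the sentence can be passed in simultaneously.

Let's start by understanding the input to the Transformer.

Consider an English-to-German translation. We feed the whole English sentence into the input embedding. The input embedding layer can be thought of as a space in which words with similar meanings lie physically closer to each other, i.e. each word maps to a vector of continuous values that represents the word.

The problem now is that the same word can have different meanings in different sentences, and this is where positional encoding comes in. Since the Transformer contains no recurrence and no convolution, for the model to make use of the order of the sequence it must use some information about the relative or absolute position of the words in the sequence. The idea is to use fixed or learned weights that encode information related to the specific position of a token in the sentence.

Similarly, the target German words are fed into the output embedding, and the resulting vectors, with positional encoding added, are passed into the decoder block.

The encoder block has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple position-wise fully connected feed-forward network. For every word we can generate an attention vector that captures the contextual relationships between the words in the sentence. Multi-head attention in the encoder applies a specific attention mechanism called self-attention, which allows the model to associate each word in the input with the other words.

In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. The attention vectors of the German words and the attention vectors of the English sentence coming from the encoder are passed to the second multi-head attention block.

This attention block determines how related each word vector is to every other. This is where the English-to-German word mapping happens. The decoder is capped with a linear layer that acts as a classifier, followed by a softmax that yields the word probabilities.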
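
To make that last step concrete, here is a minimal sketch of how a softmax turns the raw scores (logits) of the final linear layer into word probabilities. The three-word vocabulary and the logit values are made up for illustration and do not come from the model in this article.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a toy vocabulary (illustrative values only).
vocab = ["hund", "katze", "haus"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# Greedy decoding picks the word with the highest probability.
best_word = vocab[probs.index(max(probs))]
```

The evaluation loop later in the article makes the same greedy choice with `tf.argmax` over the logits; since softmax preserves the ordering of the scores, taking the argmax of the logits selects the same word.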

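Now that you have a basic understanding of how Transformers work, let's see how to implement one with TensorFlow for the image captioning task and compare our results with other methods.

Implementation of the attention mechanism for caption generation with Transformers using TensorFlow

Step 1: Importing the required libraries

Here we will use TensorFlow to create and train the model. Most of the code is credited to the TensorFlow tutorials. If you want a GPU for training, you can use Google Colab or Kaggle notebooks.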

import string
import numpy as np
import pandas as pd
from numpy import array
from PIL import Image
import pickle
 
import matplotlib.pyplot as plt
import sys, time, os, warnings
warnings.filterwarnings("ignore")
import re
 
import keras
import tensorflow as tf
from tqdm import tqdm
from nltk.translate.bleu_score import sentence_bleu
 
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense, BatchNormalization
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers import add  # keras.layers.merge is not available in recent Keras releases
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
 
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

Step 2: Data loading and preprocessing

Define the image and caption paths and check how many images the dataset contains in total.

image_path = "/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset"
dir_Flickr_text = "/content/gdrive/My Drive/FLICKR8K/Flickr8k_text/Flickr8k.token.txt"
jpgs = os.listdir(image_path)
 
print("Total Images in Dataset = {}".format(len(jpgs)))

Output:

We create a DataFrame to store the image IDs and captions for ease of use.

file = open(dir_Flickr_text,'r')
text = file.read()
file.close()
 
datatxt = []
for line in text.split('\n'):
   col = line.split('\t')
   if len(col) == 1:
       continue
   w = col[0].split("#")
   datatxt.append(w + [col[1].lower()])
 
data = pd.DataFrame(datatxt,columns=["filename","index","caption"])
data = data.reindex(columns =['index','filename','caption'])
data = data[data.filename != '2258277193_586949ec62.jpg.1']
uni_filenames = np.unique(data.filename.values)
 
data.head()
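
The loop above relies on each annotation line having the layout `<image file>#<caption index>`, a tab, then the caption. The sketch below applies the same splitting logic to a single line so the three resulting columns are easy to see. The sample line and the helper name `parse_token_line` are made up for illustration.

```python
# A made-up line in the "<file>#<index>\t<caption>" layout the loader expects.
sample_line = "example_image.jpg#0\tA dog runs across the grass ."

def parse_token_line(line):
    """Split one annotation line into [filename, index, lowercased caption]."""
    col = line.split("\t")
    if len(col) == 1:
        return None  # no tab: a blank or malformed line, which the loader skips
    filename, index = col[0].split("#")
    return [filename, index, col[1].lower()]

row = parse_token_line(sample_line)
print(row)  # ['example_image.jpg', '0', 'a dog runs across the grass .']
```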

Output:

Next, let's visualize a few images together with their 5 captions:

npic = 5
npix = 224
target_size = (npix,npix,3)
count = 1
 
fig = plt.figure(figsize=(10,20))
for jpgfnm in uni_filenames[10:14]:
   filename = image_path + '/' + jpgfnm
   captions = list(data["caption"].loc[data["filename"]==jpgfnm].values)
   image_load = load_img(filename, target_size=target_size)
   ax = fig.add_subplot(npic,2,count,xticks=[],yticks=[])
   ax.imshow(image_load)
   count += 1
 
   ax = fig.add_subplot(npic,2,count)
   plt.axis('off')
   ax.plot()
   ax.set_xlim(0,1)
   ax.set_ylim(0,len(captions))
   for i, caption in enumerate(captions):
       ax.text(0,i,caption,fontsize=20)
   count += 1
plt.show()

Output:

Next, let's see how large our current vocabulary is:

vocabulary = []
for txt in data.caption.values:
   vocabulary.extend(txt.split())
print('Vocabulary Size: %d' % len(set(vocabulary)))

Output:

Next, perform some text cleaning, such as removing punctuation, single characters, and numeric values:

def remove_punctuation(text_original):
   # str.translate needs a translation table; this one deletes all punctuation
   text_no_punctuation = text_original.translate(str.maketrans('', '', string.punctuation))
   return(text_no_punctuation)
 
def remove_single_character(text):
   text_len_more_than1 = ""
   for word in text.split():
       if len(word) > 1:
           text_len_more_than1 += " " + word
   return(text_len_more_than1)
 
def remove_numeric(text):
   text_no_numeric = ""
   for word in text.split():
       isalpha = word.isalpha()
       if isalpha:
           text_no_numeric += " " + word
   return(text_no_numeric)
 
def text_clean(text_original):
   text = remove_punctuation(text_original)
   text = remove_single_character(text)
   text = remove_numeric(text)
   return(text)
 
# Clean the whole column at once; this avoids pandas chained-assignment problems
data["caption"] = data["caption"].apply(text_clean)
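
As a quick check of what the three cleaning steps do together, here is a compact, self-contained sketch applied to a single caption. It deletes punctuation with `str.maketrans`, which is one common approach. The function name `clean_caption` and the sample caption are made up, and unlike the functions above it joins the words without a leading space.

```python
import string

def clean_caption(text):
    """Drop punctuation, then single-character and non-alphabetic words."""
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    text = text.translate(table)
    # Keep only words longer than one character that are purely alphabetic.
    words = [w for w in text.split() if len(w) > 1 and w.isalpha()]
    return " ".join(words)

cleaned = clean_caption("a dog, aged 3, runs in the park !")
print(cleaned)  # dog aged runs in the park
```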

Now let's look at the size of the vocabulary after cleaning:

clean_vocabulary = []
for txt in data.caption.values:
   clean_vocabulary.extend(txt.split())
print('Clean Vocabulary Size: %d' % len(set(clean_vocabulary)))

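Output:

Next, we save all the captions and image paths in two lists so that we can load the images at once using the set of paths. We also add "<start>" and "<end>" tags to every caption so that the model understands where each caption begins and ends.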

PATH = "/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset/"
all_captions = []
for caption  in data["caption"].astype(str):
   caption = '<start> ' + caption+ ' <end>'
   all_captions.append(caption)
 
all_captions[:10]

Output:

all_img_name_vector = []
for annot in data["filename"]:
   full_image_path = PATH + annot
   all_img_name_vector.append(full_image_path)
 
all_img_name_vector[:10]

Now you can see that we have 40455 image paths and captions.

print(f"len(all_img_name_vector) : {len(all_img_name_vector)}")
print(f"len(all_captions) : {len(all_captions)}")

Output:

We will keep only 40000 of them so that the batch size divides the data evenly, i.e. with a batch size of 64 we get exactly 625 batches. To do this, we define a function that limits the dataset to 40000 images and captions.

def data_limiter(num,total_captions,all_img_name_vector):
   train_captions, img_name_vector = shuffle(total_captions,all_img_name_vector,random_state=1)
   train_captions = train_captions[:num]
   img_name_vector = img_name_vector[:num]
   return train_captions,img_name_vector
 
# all_captions is the caption list built above (total_captions was never defined)
train_captions,img_name_vector = data_limiter(40000,all_captions,all_img_name_vector)
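
The arithmetic behind the choice of 40000 can be checked in a few lines. This is only a sketch of the reasoning, using the counts quoted above; the variable names are made up.

```python
total_pairs = 40455   # image-caption pairs reported above
batch_size = 64
kept_pairs = 40000    # size passed to data_limiter

# With all pairs, the last batch would be incomplete.
full_batches, leftover = divmod(total_pairs, batch_size)
# After limiting, every batch is full.
batches_after_limit, remainder = divmod(kept_pairs, batch_size)

print(full_batches, leftover)          # 632 7
print(batches_after_limit, remainder)  # 625 0
```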

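Step 3: Model definition

Let's define the image feature extraction model using InceptionV3. Keep in mind that we do not need to classify the images here; we only need to extract an image vector for each image. We therefore remove the softmax layer from the model. All images must first be preprocessed to the same size, i.e. 299×299, before being fed into the model, and the output of this layer has shape 8×8×2048.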

def load_image(image_path):
   img = tf.io.read_file(image_path)
   img = tf.image.decode_jpeg(img, channels=3)
   img = tf.image.resize(img, (299, 299))
   img = tf.keras.applications.inception_v3.preprocess_input(img)
   return img, image_path
 
image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

Next, let's map each image name to the function that loads the image. We will preprocess each image with InceptionV3, cache the output to disk, and reshape the image features to 64×2048.

encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(64)

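We extract the features and store them in their respective .npy files, and then pass these features through the encoder. A .npy file stores all the information needed to reconstruct the array on any computer, including dtype and shape information.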

for img, path in tqdm(image_dataset):
   batch_features = image_features_extract_model(img)
   batch_features = tf.reshape(batch_features,
                              (batch_features.shape[0], -1, batch_features.shape[3]))

   # np.save appends ".npy", so each feature file sits next to its image
   for bf, p in zip(batch_features, path):
      path_of_feature = p.numpy().decode("utf-8")
      np.save(path_of_feature, bf.numpy())

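Next, we tokenize the captions and build a vocabulary of all the unique words in the data. We also limit the vocabulary to the top 5000 words to save memory. Words that are not in the vocabulary are replaced with the token "<unk>".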

top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                 oov_token="<unk>",
                                                 filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
 
tokenizer.fit_on_texts(train_captions)
train_seqs = tokenizer.texts_to_sequences(train_captions)
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'
 
train_seqs = tokenizer.texts_to_sequences(train_captions)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
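
To illustrate what `padding='post'` means, here is a pure-Python sketch that right-pads sequences with 0 up to the length of the longest one. It only mimics the idea and is not the Keras implementation; the helper name `pad_post` and the token IDs are made up.

```python
def pad_post(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Hypothetical token IDs for three captions of different lengths.
seqs = [[3, 7, 12, 4], [3, 9, 4], [3, 5, 8, 21, 4]]
padded = pad_post(seqs)
for row in padded:
    print(row)
# [3, 7, 12, 4, 0]
# [3, 9, 4, 0, 0]
# [3, 5, 8, 21, 4]
```

Because 0 is reserved for "<pad>", the loss function and the padding mask later in the article can recognise and ignore these positions.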

Next, create the training and validation sets using an 80-20 split:

img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector,cap_vector, test_size=0.2, random_state=0)

Next, let's create a tf.data dataset to use for training our model.

BATCH_SIZE = 64
BUFFER_SIZE = 1000
num_steps = len(img_name_train) // BATCH_SIZE
 
def map_func(img_name, cap):
   img_tensor = np.load(img_name.decode('utf-8')+'.npy')
   return img_tensor, cap
 
dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))
dataset = dataset.map(lambda item1, item2: tf.numpy_function(map_func, [item1, item2], [tf.float32, tf.int32]),num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

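Step 4: Positional encoding

Positional encoding uses sine and cosine functions of different frequencies. For every odd index of the input vector, create a vector using the cos function; for every even index, create a vector using the sin function. These vectors are then added to their corresponding input embeddings, which gives the network information about the position of each vector.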
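
Written out, the usual sinusoidal formulation looks as follows. Here pos is the position in the sequence, i indexes pairs of embedding channels, and d_model is the embedding width. This is the standard form from the Transformer literature, and it appears to match what `get_angles` computes with `2 * (i//2) / d_model` in the exponent.

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
```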

def get_angles(pos, i, d_model):
   angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
   return pos * angle_rates
 
def positional_encoding_1d(position, d_model):
   angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                           np.arange(d_model)[np.newaxis, :],
                           d_model)
 
   angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
   angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
   pos_encoding = angle_rads[np.newaxis, ...]
   return tf.cast(pos_encoding, dtype=tf.float32)
 
def positional_encoding_2d(row,col,d_model):
   assert d_model % 2 == 0
   row_pos = np.repeat(np.arange(row),col)[:,np.newaxis]
   col_pos = np.repeat(np.expand_dims(np.arange(col),0),row,axis=0).reshape(-1,1)
 
   angle_rads_row = get_angles(row_pos,np.arange(d_model//2)[np.newaxis,:],d_model//2)
   angle_rads_col = get_angles(col_pos,np.arange(d_model//2)[np.newaxis,:],d_model//2)
 
   angle_rads_row[:, 0::2] = np.sin(angle_rads_row[:, 0::2])
   angle_rads_row[:, 1::2] = np.cos(angle_rads_row[:, 1::2])
   angle_rads_col[:, 0::2] = np.sin(angle_rads_col[:, 0::2])
   angle_rads_col[:, 1::2] = np.cos(angle_rads_col[:, 1::2])
   pos_encoding = np.concatenate([angle_rads_row,angle_rads_col],axis=1)[np.newaxis, ...]
   return tf.cast(pos_encoding, dtype=tf.float32)
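
`positional_encoding_2d` builds a row index and a column index for every cell of the flattened feature grid, then encodes each with half of the channels. The pure-Python sketch below shows the coordinates that the `row_pos` and `col_pos` arrays should contain for a row-major flattened grid. The helper name `grid_positions` is made up for illustration.

```python
def grid_positions(rows, cols):
    """Return (row, col) for each index of a row-major flattened grid."""
    return [(idx // cols, idx % cols) for idx in range(rows * cols)]

positions = grid_positions(2, 3)
print(positions)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

# The 8x8 InceptionV3 feature map used in this article flattens to 64 positions.
print(len(grid_positions(8, 8)))  # 64
```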

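Step 5: Multi-head attention

Compute the attention weights. q, k, and v must have matching leading dimensions. k and v must have matching penultimate dimensions, i.e. seq_len_k = seq_len_v. The mask has a different shape depending on its type (padding or look-ahead), but it must be broadcastable for addition.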

def create_padding_mask(seq):
   seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
   return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)
 
def create_look_ahead_mask(size):
   mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
   return mask  # (seq_len, seq_len)
 
def scaled_dot_product_attention(q, k, v, mask):
   matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
   dk = tf.cast(tf.shape(k)[-1], tf.float32)
   scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

   # Large negative logits make softmax assign ~0 weight to masked positions
   if mask is not None:
      scaled_attention_logits += (mask * -1e9)
 
   attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1) 
   output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
 
   return output, attention_weights
 
class MultiHeadAttention(tf.keras.layers.Layer):
   def __init__(self, d_model, num_heads):
      super(MultiHeadAttention, self).__init__()
      self.num_heads = num_heads
      self.d_model = d_model
      assert d_model % self.num_heads == 0
      self.depth = d_model // self.num_heads
      self.wq = tf.keras.layers.Dense(d_model)
      self.wk = tf.keras.layers.Dense(d_model)
      self.wv = tf.keras.layers.Dense(d_model)
      self.dense = tf.keras.layers.Dense(d_model)
 
   def split_heads(self, x, batch_size):
      x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
      return tf.transpose(x, perm=[0, 2, 1, 3])
 
   def call(self, v, k, q, mask=None):
      batch_size = tf.shape(q)[0]
      q = self.wq(q)  # (batch_size, seq_len, d_model)
      k = self.wk(k)  # (batch_size, seq_len, d_model)
      v = self.wv(v)  # (batch_size, seq_len, d_model)
 
      q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
      k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
      v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
 
      scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
      scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q,      num_heads, depth)
 
      concat_attention = tf.reshape(scaled_attention,
                                 (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)
 
      output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
      return output, attention_weights
 
def point_wise_feed_forward_network(d_model, dff):
   return tf.keras.Sequential([
                tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
                tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
   ])
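
To see what `create_look_ahead_mask` produces without running TensorFlow, here is a pure-Python sketch of the same pattern: a 1 marks a future position that must not be attended to, which is why the mask is multiplied by -1e9 before the softmax. The function below is an illustration, not the article's implementation.

```python
def look_ahead_mask(size):
    """mask[i][j] is 1 where position j lies in the future of position i."""
    return [[1 if j > i else 0 for j in range(size)] for i in range(size)]

mask = look_ahead_mask(4)
for row in mask:
    print(row)
# [0, 1, 1, 1]
# [0, 0, 1, 1]
# [0, 0, 0, 1]
# [0, 0, 0, 0]
```

Row i describes what the i-th target token may look at: itself and everything before it, but nothing after it.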

Step 6: Encoder-decoder layers

class EncoderLayer(tf.keras.layers.Layer):
   def __init__(self, d_model, num_heads, dff, rate=0.1):
      super(EncoderLayer, self).__init__()
      self.mha = MultiHeadAttention(d_model, num_heads)
      self.ffn = point_wise_feed_forward_network(d_model, dff)
 
      self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
      self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
 
      self.dropout1 = tf.keras.layers.Dropout(rate)
      self.dropout2 = tf.keras.layers.Dropout(rate)
 
 
   def call(self, x, training, mask=None):
      attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
      attn_output = self.dropout1(attn_output, training=training)
      out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)
 
      ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
      ffn_output = self.dropout2(ffn_output, training=training)
      out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)
      return out2
 
class DecoderLayer(tf.keras.layers.Layer):
   def __init__(self, d_model, num_heads, dff, rate=0.1):
      super(DecoderLayer, self).__init__()
      self.mha1 = MultiHeadAttention(d_model, num_heads)
      self.mha2 = MultiHeadAttention(d_model, num_heads)
 
      self.ffn = point_wise_feed_forward_network(d_model, dff)
 
      self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
      self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
      self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
 
      self.dropout1 = tf.keras.layers.Dropout(rate)
      self.dropout2 = tf.keras.layers.Dropout(rate)
      self.dropout3 = tf.keras.layers.Dropout(rate)
 
   def call(self, x, enc_output, training,look_ahead_mask=None, padding_mask=None):
      attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
      attn1 = self.dropout1(attn1, training=training)
      out1 = self.layernorm1(attn1 + x)
 
      attn2, attn_weights_block2 = self.mha2(enc_output, enc_output, out1, padding_mask) 
      attn2 = self.dropout2(attn2, training=training)
      out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)
 
      ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
      ffn_output = self.dropout3(ffn_output, training=training)
      out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)
 
      return out3, attn_weights_block1, attn_weights_block2
 
 
class Encoder(tf.keras.layers.Layer):
   def __init__(self, num_layers, d_model, num_heads, dff, row_size,col_size,rate=0.1):
      super(Encoder, self).__init__()
      self.d_model = d_model
      self.num_layers = num_layers
 
      self.embedding = tf.keras.layers.Dense(self.d_model,activation='relu')
      self.pos_encoding = positional_encoding_2d(row_size,col_size,self.d_model)
 
      self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]
      self.dropout = tf.keras.layers.Dropout(rate)
 
   def call(self, x, training, mask=None):
      seq_len = tf.shape(x)[1]
      x = self.embedding(x)  # (batch_size, input_seq_len(H*W), d_model)
      x += self.pos_encoding[:, :seq_len, :]
      x = self.dropout(x, training=training)
 
      for i in range(self.num_layers):
         x = self.enc_layers[i](x, training, mask)
 
      return x  # (batch_size, input_seq_len, d_model)
 
 
class Decoder(tf.keras.layers.Layer):
   def __init__(self, num_layers,d_model,num_heads,dff, target_vocab_size, maximum_position_encoding,   rate=0.1):
      super(Decoder, self).__init__()
      self.d_model = d_model
      self.num_layers = num_layers
 
      self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
      self.pos_encoding = positional_encoding_1d(maximum_position_encoding, d_model)
 
      self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                         for _ in range(num_layers)]
      self.dropout = tf.keras.layers.Dropout(rate)
 
   def call(self, x, enc_output, training,look_ahead_mask=None, padding_mask=None):
      seq_len = tf.shape(x)[1]
      attention_weights = {}
 
      x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
      x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
      x += self.pos_encoding[:, :seq_len, :]
      x = self.dropout(x, training=training)
 
      for i in range(self.num_layers):
         x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                            look_ahead_mask, padding_mask)
         
         attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
         attention_weights['decoder_layer{}_block2'.format(i+1)] = block2
 
      return x, attention_weights

Step 7: Transformer

class Transformer(tf.keras.Model):
   def __init__(self, num_layers, d_model, num_heads, dff,row_size,col_size,
              target_vocab_size,max_pos_encoding, rate=0.1):
      super(Transformer, self).__init__()
      self.encoder = Encoder(num_layers, d_model, num_heads, dff,row_size,col_size, rate)
      self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                          target_vocab_size,max_pos_encoding, rate)
      self.final_layer = tf.keras.layers.Dense(target_vocab_size)
 
   def call(self, inp, tar, training,look_ahead_mask=None,dec_padding_mask=None,enc_padding_mask=None   ):
      enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model      )
      dec_output, attention_weights = self.decoder(
      tar, enc_output, training, look_ahead_mask, dec_padding_mask)
      final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)
      return final_output, attention_weights

Step 8: Model hyperparameters

Define the training parameters:

num_layer = 4
d_model = 512
dff = 2048
num_heads = 8
row_size = 8
col_size = 8
target_vocab_size = top_k + 1
dropout_rate = 0.1 
 
 
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
   def __init__(self, d_model, warmup_steps=4000):
      super(CustomSchedule, self).__init__()
      self.d_model = d_model
      self.d_model = tf.cast(self.d_model, tf.float32)
      self.warmup_steps = warmup_steps
 
   def __call__(self, step):
      step = tf.cast(step, tf.float32)  # the optimizer may pass an integer step
      arg1 = tf.math.rsqrt(step)
      arg2 = step * (self.warmup_steps ** -1.5)
      return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
 
 
learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                    epsilon=1e-9)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
 
def loss_function(real, pred):
   mask = tf.math.logical_not(tf.math.equal(real, 0))
   loss_ = loss_object(real, pred)
   mask = tf.cast(mask, dtype=loss_.dtype)
   loss_ *= mask
   return tf.reduce_sum(loss_)/tf.reduce_sum(mask)
 
 
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
transformer = Transformer(num_layer,d_model,num_heads,dff,row_size,col_size,target_vocab_size,                                 max_pos_encoding=target_vocab_size,rate=dropout_rate)
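
The learning-rate schedule defined in `CustomSchedule` can be explored without TensorFlow. The sketch below re-implements the same formula in plain Python, using the article's `d_model = 512` and `warmup_steps = 4000`; the function name `transformer_lr` is made up. The rate rises linearly during warm-up, peaks at the warm-up step, and then decays with the inverse square root of the step.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate with linear warm-up followed by inverse-sqrt decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

lr_early = transformer_lr(100)     # still warming up
lr_peak = transformer_lr(4000)     # peak, about 0.0007
lr_late = transformer_lr(40000)    # decaying

print(lr_early < lr_peak, lr_late < lr_peak)  # True True
```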

Step 9: Model training

def create_masks_decoder(tar):
   look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
   dec_target_padding_mask = create_padding_mask(tar)
   combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)
   return combined_mask
 

@tf.function
def train_step(img_tensor, tar):
   tar_inp = tar[:, :-1]
   tar_real = tar[:, 1:]
   dec_mask = create_masks_decoder(tar_inp)
   with tf.GradientTape() as tape:
      predictions, _ = transformer(img_tensor, tar_inp,True, dec_mask)
      loss = loss_function(tar_real, predictions)
 
   gradients = tape.gradient(loss, transformer.trainable_variables)   
   optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
   train_loss(loss)
   train_accuracy(tar_real, predictions)
 
for epoch in range(30):
   start = time.time()
   train_loss.reset_states()
   train_accuracy.reset_states()
   for (batch, (img_tensor, tar)) in enumerate(dataset):
      train_step(img_tensor, tar)
      if batch % 50 == 0:
         print ('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(
         epoch + 1, batch, train_loss.result(), train_accuracy.result()))
 
   print ('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1,
                                               train_loss.result(),
                                               train_accuracy.result()))
   print ('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))
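
The two slices at the top of `train_step` implement teacher forcing: the decoder is fed the caption without its last token and is trained to predict the caption shifted one position to the left. A small sketch with made-up token IDs (1 for "<start>", 2 for "<end>", 0 for "<pad>"; the real IDs come from the tokenizer) shows how inputs and targets line up:

```python
# Hypothetical padded caption: <start>=1, <end>=2, <pad>=0 are made-up IDs.
tar = [1, 15, 8, 42, 2, 0, 0]

tar_inp = tar[:-1]   # what the decoder sees
tar_real = tar[1:]   # what it must predict at each position

# Pair each input token with the token the model should output next.
pairs = list(zip(tar_inp, tar_real))
print(pairs)
# [(1, 15), (15, 8), (8, 42), (42, 2), (2, 0), (0, 0)]
```

The last two pairs have a target of 0, and `loss_function` masks those positions out so that padding does not contribute to the loss.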

Step 10: BLEU evaluation

def evaluate(image):
   temp_input = tf.expand_dims(load_image(image)[0], 0)
   img_tensor_val = image_features_extract_model(temp_input)
   img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
   start_token = tokenizer.word_index['<start>']
   end_token = tokenizer.word_index['<end>']
   decoder_input = [start_token]
   output = tf.expand_dims(decoder_input, 0) #tokens
   result = [] #word list
 
   for i in range(100):
      dec_mask = create_masks_decoder(output)
      predictions, attention_weights = transformer(img_tensor_val,output,False,dec_mask)
      predictions = predictions[: ,-1:, :]  # (batch_size, 1, vocab_size)
      predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)
      if predicted_id == end_token:
         return result,tf.squeeze(output, axis=0), attention_weights
      result.append(tokenizer.index_word[int(predicted_id)])
      output = tf.concat([output, predicted_id], axis=-1)
 
   return result,tf.squeeze(output, axis=0), attention_weights
 
 
 
rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
caption,result,attention_weights = evaluate(image)
 
first = real_caption.split(' ', 1)[1]
real_caption = first.rsplit(' ', 1)[0]
 
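# Drop unknown-word tokens from both the prediction and the reference
# (removing items from a list while iterating over it skips elements)
caption = [w for w in caption if w != "<unk>"]
real_caption = ' '.join(w for w in real_caption.split() if w != "<unk>")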
 
result_join = ' '.join(caption)
result_final = result_join.rsplit(' ', 1)[0]
real_appn = []
real_appn.append(real_caption.split())
reference = real_appn
candidate = caption
 
score = sentence_bleu(reference, candidate, weights=(1.0,0,0,0))
print(f"BLEU-1 score: {score*100}")
score = sentence_bleu(reference, candidate, weights=(0.5,0.5,0,0))
print(f"BLEU-2 score: {score*100}")
score = sentence_bleu(reference, candidate, weights=(0.3,0.3,0.3,0))
print(f"BLEU-3 score: {score*100}")
score = sentence_bleu(reference, candidate, weights=(0.25,0.25,0.25,0.25))
print(f"BLEU-4 score: {score*100}")
print ('Real Caption:', real_caption)
print ('Predicted Caption:', ' '.join(caption))
temp_image = np.array(Image.open(image))
plt.imshow(temp_image)

Output:
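
For intuition about what BLEU-1 measures, here is a simplified pure-Python sketch: clipped unigram precision multiplied by a brevity penalty, for a single reference. It is not NLTK's implementation (which also handles multiple references, higher-order n-grams, and smoothing), and the two sentences are made up.

```python
import math
from collections import Counter

def bleu1(reference, candidate):
    """Clipped unigram precision times the brevity penalty (single reference)."""
    if not candidate:
        return 0.0
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # A candidate word is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = overlap / len(candidate)
    # Penalise candidates that are shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * precision

ref_words = "dog runs through the grass".split()
cand_words = "dog runs on the grass".split()
score = bleu1(ref_words, cand_words)
print(round(score, 2))  # 0.8
```

Four of the five predicted words appear in the reference and the lengths are equal, so the score is 4/5.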

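Step 11: Comparison

Let's compare the BLEU scores obtained with Bahdanau attention in the previous article against those of our Transformer.

The BLEU scores on the left use Bahdanau attention; those on the right use the Transformer. As we can see, the Transformer performs far better than the attention model.

There you have it! We have successfully implemented a Transformer with TensorFlow and seen how it produces state-of-the-art results.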

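End notes

In summary, Transformers outperform all the other architectures we have seen so far because they avoid recurrence entirely: they process the sentence as a whole through multi-head attention and positional embeddings and learn the relationships between the words. It should also be noted that Transformers like the one built here with TensorFlow can only capture dependencies within the fixed input size used to train them.

There are many new, powerful Transformers, such as Transformer-XL, Entangled Transformers, and Meshed-Memory Transformers, which can also be implemented for applications such as image captioning to achieve even better results.

Author: 沂水寒城, CSDN blog expert. Research interests: machine learning, deep learning, NLP, CV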

Blog: http://yishuihancheng.blog.csdn.net

Implementing the attention mechanism for caption generation with Transformers using TensorFlow