
Overview

Understand the attention mechanism for image caption generation. Implement an attention mechanism to generate captions in Python.

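Introduction

The attention mechanism is a complex cognitive ability that humans possess. When people receive information, they can consciously select the primary information and ignore the secondary information.

This ability to self-select is called attention. The attention mechanism lets a neural network focus on a subset of its inputs in order to select specific features.

In recent years, neural networks have driven great progress in image captioning. Researchers are looking for more challenging applications for computer vision and sequence-to-sequence modeling systems. They are trying to describe the world in human terms. Previously we saw image captioning with the Merge architecture; today we will explore a more sophisticated design for this problem.

The attention mechanism has become a go-to method for practitioners in the deep learning community. It was originally designed in the context of neural machine translation with Seq2Seq models, but today we will look at its implementation in image captioning.

Rather than compressing the whole image into a static representation, the attention mechanism lets salient features dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in the image.

Let's take an example to understand this better:

Our goal is to generate a caption such as "two white dogs are running on the snow". To do so, we will see how to implement a specific type of attention mechanism called Bahdanau attention, which this article also refers to as local attention.

This way, we can see which parts of the image the model focuses on as it generates the caption. This implementation requires a strong background in deep learning.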

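1. Approach to the problem statement

2. Understanding the dataset

3. Implementation

3.1. Import the required libraries

3.2. Data loading and preprocessing

3.3. Model definition

3.4. Model training

3.5. Greedy search and BLEU evaluation

4. What's next?

5. End notes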

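Approach to the problem statement

An encoder-decoder image captioning system encodes the image with a pretrained convolutional neural network that produces a hidden state. It then uses an LSTM to decode this hidden state and generate a caption.

For each sequence element, the output of the previous elements is combined with the new sequence data and used as input. This gives the RNN a kind of memory, which can make captions more informative and context-aware.

But RNNs tend to be computationally expensive to train and evaluate, so in practice the memory is limited to just a few elements. Attention models can help with this by selecting the most relevant elements from the input image. With an attention mechanism, the image is first divided into n parts, and we compute a representation of each part. When the RNN generates a new word, the attention mechanism focuses on the relevant part of the image, so the decoder uses only specific parts of the image.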
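
The idea of weighting n image parts can be sketched in a few lines of plain Python. This is only an illustration: the region features and relevance scores below are made-up numbers, and the helper names softmax and context_vector are assumptions, not part of the article's model. The sketch turns one score per region into weights that sum to 1 and then forms the weighted sum that the decoder would consume.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(region_features, scores):
    """Return the attention weights and the weighted sum of region features."""
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context

# Three hypothetical image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical relevance scores for the word currently being generated.
scores = [0.1, 0.1, 3.0]

weights, context = context_vector(regions, scores)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The third region has the highest score, so it dominates the context vector, while the other two regions still contribute a little.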

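In local attention, attention is placed on only a few source positions. Global attention attends to every source position for each target word, which is computationally expensive. To overcome this drawback, local attention attends to only a small subset of the encoder hidden states for each target word. Strictly speaking, Bahdanau attention as originally published scores every source position, and the decoder implemented later in this article likewise scores all 49 image regions.

Local attention first finds an alignment position, then computes the attention weights within a window to the left and right of that position, and finally computes the context vector as a weighted sum. The main advantage of local attention is the reduced cost of computing the attention mechanism.

In the computation, rather than considering all words on the source side, local attention uses a prediction function to predict the source position to align to at the current decoding step, and then considers only the words inside a context window around that position.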
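
A minimal sketch of the windowing idea follows, in the style of the local attention described by Luong et al. It rests on several assumptions: the aligned position is passed in as an integer rather than predicted by a learned function, the scores are made-up numbers, and the name local_attention_weights is purely illustrative. Inside the window the softmax weights are scaled by a Gaussian centred on the aligned position; outside the window the weight is zero.

```python
import math

def local_attention_weights(scores, center, half_width):
    """Attention weights restricted to [center - half_width, center + half_width].

    Positions outside the window get weight 0. Inside the window, softmax
    weights are multiplied by a Gaussian centred on `center` whose standard
    deviation is half_width / 2.
    """
    lo = max(0, center - half_width)
    hi = min(len(scores) - 1, center + half_width)
    window = range(lo, hi + 1)

    shift = max(scores[j] for j in window)
    exps = {j: math.exp(scores[j] - shift) for j in window}
    total = sum(exps.values())
    sigma = half_width / 2.0

    weights = [0.0] * len(scores)
    for j in window:
        gauss = math.exp(-((j - center) ** 2) / (2 * sigma ** 2))
        weights[j] = exps[j] / total * gauss
    return weights

# Hypothetical alignment scores for 8 source positions.
scores = [0.5, 1.0, 2.0, 1.0, 0.5, 3.0, 0.2, 0.1]
weights = local_attention_weights(scores, center=2, half_width=1)
print([round(w, 4) for w in weights])
```

Position 5 has the highest raw score but lies outside the window around position 2, so it receives no attention. Only three scores are normalised instead of eight, which is where the cost saving comes from.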

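Design of Bahdanau attention

All hidden states of the encoder and the decoder are used to generate the context vector. The attention mechanism aligns the input and output sequences using an alignment score parameterized by a feed-forward network. It helps the model attend to the most relevant information in the source sequence. The model predicts a target word based on the context vector associated with the source positions and the previously generated target words.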
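
The feed-forward alignment score can be written as e_j = v · tanh(U h_j + W s), where h_j is one encoder state and s is the previous decoder state. The sketch below computes it in plain Python under simplifying assumptions: the matrices U, W and the vector v are tiny hand-picked values (in a real model they are learned layers, usually with bias terms), and the helper names matvec and additive_score are illustrative only.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_score(h_j, s_prev, U, W, v):
    """Additive alignment score e_j = v . tanh(U h_j + W s_prev)."""
    projected = [a + b for a, b in zip(matvec(U, h_j), matvec(W, s_prev))]
    return sum(v_k * math.tanh(p) for v_k, p in zip(v, projected))

# Hypothetical toy parameters; a trained model learns these.
U = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, 1.0]

encoder_states = [[1.0, 0.0], [0.0, -1.0], [2.0, 2.0]]
decoder_state = [0.0, 0.0]

scores = [additive_score(h, decoder_state, U, W, v) for h in encoder_states]
print([round(s, 4) for s in scores])
```

These raw scores are then passed through a softmax to obtain the attention weights.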

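To evaluate a generated caption against the reference captions, we use an evaluation method called BLEU. It is one of the most widely used evaluation metrics. It measures the n-gram overlap between the sentence being evaluated and the reference sentences.

In this article, multiple images are equivalent to multiple source-language sentences in translation. BLEU's advantage is that it takes longer matches into account, since its unit of granularity is the n-gram rather than the single word. Its drawback is that every matched n-gram is treated the same, whichever n-gram it is.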
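
The n-gram matching at the heart of BLEU can be illustrated with the clipped (modified) n-gram precision. The sketch below is a simplified stand-in, not the NLTK implementation used later: it handles a single reference, and the full metric additionally combines several n-gram orders and applies a brevity penalty. The two example sentences and the function names are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of `candidate` against one `reference`."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference = "two white dogs are running on the snow".split()
candidate = "two dogs are running on snow".split()

p1 = modified_precision(reference, candidate, 1)
p2 = modified_precision(reference, candidate, 2)
print(p1, p2)
```

Every word of the candidate appears in the reference, so the unigram precision is perfect, but only three of its five bigrams match. This is how longer n-grams reward correct word order.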

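I hope this gives you an idea of how we approach this problem statement. Let's dive into the implementation!

Understanding the dataset

I used the Flickr8k dataset, in which each image is associated with five different captions that describe the entities and events depicted in the image.

Flickr8k is small and can be trained easily on a low-end laptop or desktop using a CPU, so it is a good dataset to start with.

Our dataset consists of an image folder, Flicker8k_Dataset, and a caption file, Flickr8k.token.txt, inside the Flickr8k_text folder.

Let's implement the attention mechanism for caption generation!

Step 1: Import the required libraries

Here we use TensorFlow to create and train the model. Most of the code is credited to the TensorFlow tutorial. If you want a GPU for training, you can use Google Colab or a Kaggle notebook.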

# Only the modules used below are imported; the code targets TensorFlow 2.x.
import os
import string
import time
import warnings
warnings.filterwarnings("ignore")
 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from tqdm import tqdm
 
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing.image import load_img
 
from nltk.translate.bleu_score import sentence_bleu
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

Step 2: Data loading and preprocessing

Define the image and caption paths, and check how many images the dataset contains in total.

image_path = "/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset"
dir_Flickr_text = "/content/gdrive/My Drive/FLICKR8K/Flickr8k_text/Flickr8k.token.txt"
jpgs = os.listdir(image_path)
 
print("Total Images in Dataset = {}".format(len(jpgs)))

We create a dataframe to store the image IDs and captions for ease of use.

# read the whole caption file; the context manager closes it automatically
with open(dir_Flickr_text, 'r') as file:
   text = file.read()
 
datatxt = []
for line in text.split('\n'):
   col = line.split('\t')
   if len(col) == 1:
       continue
   w = col[0].split("#")
   datatxt.append(w + [col[1].lower()])
 
data = pd.DataFrame(datatxt,columns=["filename","index","caption"])
data = data.reindex(columns =['index','filename','caption'])
data = data[data.filename != '2258277193_586949ec62.jpg.1']
uni_filenames = np.unique(data.filename.values)
 
data.head()
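
The parsing code above implies that each line of the caption file has the layout filename#index, a tab, then the caption. The sketch below shows that layout on three made-up lines and groups the captions by image. The file names and captions are hypothetical, not taken from Flickr8k, and the variable names are illustrative.

```python
from collections import defaultdict

# Hypothetical lines in the "filename#index<TAB>caption" layout.
sample = (
    "example_1.jpg#0\tA dog runs across the grass .\n"
    "example_1.jpg#1\tA brown dog is running .\n"
    "example_2.jpg#0\tTwo children play in the sand .\n"
)

captions_by_image = defaultdict(list)
for line in sample.splitlines():
    if "\t" not in line:
        continue  # skip blank or malformed lines
    image_key, caption = line.split("\t", 1)
    filename, _index = image_key.split("#", 1)
    captions_by_image[filename].append(caption.lower())

print(dict(captions_by_image))
```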

Next, let's visualize a few images and their 5 captions:

npic = 5
npix = 224
target_size = (npix,npix,3)
count = 1
 
fig = plt.figure(figsize=(10,20))
for jpgfnm in uni_filenames[10:14]:
   filename = image_path + '/' + jpgfnm
   captions = list(data["caption"].loc[data["filename"]==jpgfnm].values)
   image_load = load_img(filename, target_size=target_size)
   ax = fig.add_subplot(npic,2,count,xticks=[],yticks=[])
   ax.imshow(image_load)
   count += 1
 
   ax = fig.add_subplot(npic,2,count)
   plt.axis('off')
   ax.plot()
   ax.set_xlim(0,1)
   ax.set_ylim(0,len(captions))
   for i, caption in enumerate(captions):
       ax.text(0,i,caption,fontsize=20)
   count += 1
plt.show()

Next, let's see how large our current vocabulary is:

vocabulary = []
for txt in data.caption.values:
   vocabulary.extend(txt.split())
print('Vocabulary Size: %d' % len(set(vocabulary)))

Next, we perform some text cleaning, such as removing punctuation, single characters, and numeric values:

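def remove_punctuation(text_original):
   # str.translate needs a translation table; this one deletes every punctuation character
   text_no_punctuation = text_original.translate(str.maketrans('', '', string.punctuation))
   return(text_no_punctuation)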
 
def remove_single_character(text):
   text_len_more_than1 = ""
   for word in text.split():
       if len(word) > 1:
           text_len_more_than1 += " " + word
   return(text_len_more_than1)
 
def remove_numeric(text):
   text_no_numeric = ""
   for word in text.split():
       isalpha = word.isalpha()
       if isalpha:
           text_no_numeric += " " + word
   return(text_no_numeric)
 
def text_clean(text_original):
   text = remove_punctuation(text_original)
   text = remove_single_character(text)
   text = remove_numeric(text)
   return(text)
 
# clean the whole column in one step instead of writing through .iloc row by row
data["caption"] = data["caption"].apply(text_clean)
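
The three cleaning steps can be checked on a single sentence. The sketch below is a compact, assumed equivalent (the name clean_caption and the sample sentence are illustrative): it deletes punctuation with a translation table built by str.maketrans, then keeps only tokens that are longer than one character and purely alphabetic.

```python
import string

def clean_caption(text):
    """Remove punctuation, single-character tokens, and non-alphabetic tokens."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    kept = [word for word in no_punct.split() if len(word) > 1 and word.isalpha()]
    return " ".join(kept)

cleaned = clean_caption("a black dog , number 3 , jumps over a tri-colored fence .")
print(cleaned)
```

Note that removing the hyphen first turns "tri-colored" into "tricolored", so the word survives the alphabetic filter instead of being dropped.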

Now let's look at the vocabulary size after cleaning:

clean_vocabulary = []
for txt in data.caption.values:
   clean_vocabulary.extend(txt.split())
print('Clean Vocabulary Size: %d' % len(set(clean_vocabulary)))

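Next, we save all the captions and image paths in two lists so that we can load the images at once using the list of paths. We also add "<start>" and "<end>" tags to every caption so that the model knows where each caption begins and ends.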

PATH = "/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset/"
all_captions = []
for caption  in data["caption"].astype(str):
   caption = '<start> ' + caption+ ' <end>'
   all_captions.append(caption)
 
all_captions[:10]


all_img_name_vector = []
for annot in data["filename"]:
   full_image_path = PATH + annot
   all_img_name_vector.append(full_image_path)
 
all_img_name_vector[:10]

Now you can see that we have 40455 image paths and captions.

print(f"len(all_img_name_vector) : {len(all_img_name_vector)}")
print(f"len(all_captions) : {len(all_captions)}")

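We will take only 40000 of each so that the batch size divides the data evenly; for example, with a batch size of 64 we get 625 batches. To do this, we define a function that limits the dataset to 40000 images and captions.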

def data_limiter(num,total_captions,all_img_name_vector):
 train_captions, img_name_vector = shuffle(total_captions,all_img_name_vector,random_state=1)
 train_captions = train_captions[:num]
 img_name_vector = img_name_vector[:num]
 return train_captions,img_name_vector
 
# all_captions is the caption list built earlier
train_captions,img_name_vector = data_limiter(40000,all_captions,all_img_name_vector)
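
The batch arithmetic behind the 40000 limit is easy to verify. The short check below uses only integers; the 80-20 split is the one applied later in the article, and the variable names are illustrative.

```python
num_examples = 40000
batch_size = 64

# 40000 examples split into batches of 64.
total_batches, remainder = divmod(num_examples, batch_size)

# An 80-20 split, computed with integers to avoid float rounding.
num_val = num_examples * 20 // 100
num_train = num_examples - num_val
train_steps = num_train // batch_size

print(total_batches, remainder)
print(num_train, num_val, train_steps)
```

Because both 40000 and the 32000 training examples are multiples of 64, every batch is full and no partial batch is left over.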

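Step 3: Model definition

Let's define the image feature extraction model using VGG16. Remember that we do not need to classify the images here; we only need to extract a feature vector for each image. Therefore, we remove the softmax layer from the model (with include_top=False, the fully connected classification head, including the softmax layer, is dropped). All images must first be preprocessed to the same size, 224×224, before being fed to the model.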

def load_image(image_path):
   img = tf.io.read_file(image_path)
   img = tf.image.decode_jpeg(img, channels=3)
   img = tf.image.resize(img, (224, 224))
   img = preprocess_input(img)
   return img, image_path
 
image_model = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
 
image_features_extract_model.summary()

Next, let's map each image name to the function that loads the image:

encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(64)

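We extract the features and store them in their own .npy files, then pass these features through the encoder. A .npy file stores all the information needed to reconstruct the array on any computer, including dtype and shape information.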

%%time
for img, path in tqdm(image_dataset):
 batch_features = image_features_extract_model(img)
 batch_features = tf.reshape(batch_features,
                             (batch_features.shape[0], -1, batch_features.shape[3]))
 
 for bf, p in zip(batch_features, path):
   path_of_feature = p.numpy().decode("utf-8")
   np.save(path_of_feature, bf.numpy())

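Next, we tokenize the captions and build a vocabulary of all the unique words in the data. We also limit the vocabulary to the top 5000 words to save memory. Words that are not in the vocabulary are replaced with the token "<unk>".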

top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                 oov_token="<unk>",
                                                 filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
 
tokenizer.fit_on_texts(train_captions)
# reserve index 0 for the padding token
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'
 
# convert each caption to a sequence of word indices, then pad to a common length
train_seqs = tokenizer.texts_to_sequences(train_captions)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
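
To make the top-k vocabulary, the "<unk>" replacement, and the post-padding concrete, here is a simplified stand-in written in plain Python. It is not the Keras Tokenizer and its index assignment differs in detail; the function names and the three captions are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(captions, top_k):
    """Map the top_k most frequent words to ids; 0 is <pad>, 1 is <unk>."""
    counts = Counter(word for caption in captions for word in caption.split())
    word_index = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(top_k):
        word_index[word] = len(word_index)
    return word_index

def encode(caption, word_index):
    """Replace each word by its id, falling back to the <unk> id."""
    return [word_index.get(word, word_index["<unk>"]) for word in caption.split()]

def pad_post(sequences, pad_id=0):
    """Append pad ids so that every sequence has the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

captions = [
    "<start> dog runs <end>",
    "<start> dog jumps over log <end>",
    "<start> cat runs <end>",
]
word_index = build_vocab(captions, top_k=4)
padded = pad_post([encode(c, word_index) for c in captions])
for row in padded:
    print(row)
```

With top_k=4 only "<start>", "<end>", "dog" and "runs" get their own ids, so "cat", "jumps", "over" and "log" are all encoded as the "<unk>" id.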

Let's look at the training captions and their tokenized vectors:

train_captions[:3]


train_seqs[:3]

Next, we can calculate the maximum and minimum length of all captions:

def calc_max_length(tensor):
   return max(len(t) for t in tensor)
max_length = calc_max_length(train_seqs)
 
def calc_min_length(tensor):
   return min(len(t) for t in tensor)
min_length = calc_min_length(train_seqs)
 
print('Max Length of any caption : Min Length of any caption = '+ str(max_length) +" : "+str(min_length))

Next, create the training and validation sets using an 80-20 split:

img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector,cap_vector, test_size=0.2, random_state=0)

Define the training parameters:

BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index) + 1
num_steps = len(img_name_train) // BATCH_SIZE
features_shape = 512
attention_features_shape = 49
 
 
 
def map_func(img_name, cap):
 img_tensor = np.load(img_name.decode('utf-8')+'.npy')
 return img_tensor, cap
dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))
 
# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
        map_func, [item1, item2], [tf.float32, tf.int32]),
         num_parallel_calls=tf.data.experimental.AUTOTUNE)
 
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

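Next, let's focus on defining the encoder-decoder architecture. The architecture defined in this article is similar to the one described in the paper "Show and Tell: A Neural Image Caption Generator".

The VGG-16 encoder is defined as follows: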

class VGG16_Encoder(tf.keras.Model):
   # This encoder passes the features through a Fully connected layer
   def __init__(self, embedding_dim):
       super(VGG16_Encoder, self).__init__()
       # shape after fc == (batch_size, 49, embedding_dim)
       self.fc = tf.keras.layers.Dense(embedding_dim)
       self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
 
   def call(self, x):
       #x= self.dropout(x)
       x = self.fc(x)
       x = tf.nn.relu(x)
       return x

We define the RNN based on GPU/CPU capabilities:

def rnn_type(units):
   if tf.config.list_physical_devices('GPU'):  # replaces the deprecated tf.test.is_gpu_available()
       return tf.compat.v1.keras.layers.CuDNNLSTM(units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
   else:
       return tf.keras.layers.GRU(units,
                                  return_sequences=True,
                                  return_state=True,
                                  recurrent_activation='sigmoid',
                                  recurrent_initializer='glorot_uniform')

Next, define the RNN decoder with Bahdanau attention:

'''The encoder output(i.e. 'features'), hidden state(initialized to 0)(i.e. 'hidden') and
the decoder input (which is the start token)(i.e. 'x') is passed to the decoder.'''
 
class Rnn_Local_Decoder(tf.keras.Model):
 def __init__(self, embedding_dim, units, vocab_size):
   super(Rnn_Local_Decoder, self).__init__()
   self.units = units
   self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
   self.gru = tf.keras.layers.GRU(self.units,
                                  return_sequences=True,
                                  return_state=True,
                                  recurrent_initializer='glorot_uniform')
  
   self.fc1 = tf.keras.layers.Dense(self.units)
 
   self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
   # every argument in the original call was the Keras default, so the call is shortened
   self.batchnormalization = tf.keras.layers.BatchNormalization()
 
   self.fc2 = tf.keras.layers.Dense(vocab_size)
 
   # Implementing Attention Mechanism
   self.Uattn = tf.keras.layers.Dense(units)
   self.Wattn = tf.keras.layers.Dense(units)
   self.Vattn = tf.keras.layers.Dense(1)
 
 def call(self, x, features, hidden):
   # features shape ==> (64,49,256) ==> Output from ENCODER
   # hidden shape == (batch_size, hidden_size) ==>(64,512)
   # hidden_with_time_axis shape == (batch_size, 1, hidden_size) ==> (64,1,512)
 
   hidden_with_time_axis = tf.expand_dims(hidden, 1)
 
   # score shape == (64, 49, 1)
   # Attention Function
   '''e(ij) = f(s(t-1),h(j))'''
   ''' e(ij) = Vattn(T)*tanh(Uattn * h(j) + Wattn * s(t))'''
 
   score = self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)))
 
   # self.Uattn(features) : (64,49,512)
   # self.Wattn(hidden_with_time_axis) : (64,1,512)
   # tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)) : (64,49,512)
   # self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis))) : (64,49,1) ==> score
 
   # you get 1 at the last axis because you are applying score to self.Vattn
   # Then find Probability using Softmax
   '''attention_weights(alpha(ij)) = softmax(e(ij))'''
 
   attention_weights = tf.nn.softmax(score, axis=1)
 
   # attention_weights shape == (64, 49, 1)
   # Give weights to the different pixels in the image
   ''' C(t) = Summation(j=1 to T) (attention_weights * VGG-16 features) '''
 
   context_vector = attention_weights * features
   context_vector = tf.reduce_sum(context_vector, axis=1)
 
   # Context Vector(64,256) = AttentionWeights(64,49,1) * features(64,49,256)
   # context_vector shape after sum == (64, 256)
   # x shape after passing through embedding == (64, 1, 256)
 
   x = self.embedding(x)
   # x shape after concatenation == (64, 1,  512)
 
   x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
   # passing the concatenated vector to the GRU, starting from the previous hidden state
 
   output, state = self.gru(x, initial_state=hidden)
   # shape == (batch_size, max_length, hidden_size)
 
   x = self.fc1(output)
   # x shape == (batch_size * max_length, hidden_size)
 
   x = tf.reshape(x, (-1, x.shape[2]))
 
   # Adding Dropout and BatchNorm Layers
   x= self.dropout(x)
   x= self.batchnormalization(x)
 
   # output shape == (64 * 512)
   x = self.fc2(x)
 
   # shape : (64 * 8329(vocab))
   return x, state, attention_weights
 
 def reset_state(self, batch_size):
   return tf.zeros((batch_size, self.units))
 
 
encoder = VGG16_Encoder(embedding_dim)
decoder = Rnn_Local_Decoder(embedding_dim, units, vocab_size)

Next, we define the loss function and the optimizer:

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
   from_logits=True, reduction='none')
 
def loss_function(real, pred):
 mask = tf.math.logical_not(tf.math.equal(real, 0))
 loss_ = loss_object(real, pred)
 mask = tf.cast(mask, dtype=loss_.dtype)
 loss_ *= mask
 
 return tf.reduce_mean(loss_)
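
The masking in loss_function can be followed by hand on a tiny batch. The sketch below is a plain-Python analogue under stated assumptions: it takes probabilities rather than logits, the numbers are made up, and the name masked_cross_entropy is illustrative. As in the TensorFlow version, padded positions contribute zero loss but are still counted in the mean.

```python
import math

def masked_cross_entropy(targets, predicted_probs, pad_id=0):
    """Mean cross-entropy where targets equal to pad_id contribute zero loss."""
    losses = []
    for target, probs in zip(targets, predicted_probs):
        if target == pad_id:
            losses.append(0.0)  # masked: padding carries no learning signal
        else:
            losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# One time step for a batch of three captions; the second has already ended.
targets = [2, 0, 1]
predicted_probs = [
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],    # ignored because its target is the pad id
    [0.25, 0.5, 0.25],
]
loss = masked_cross_entropy(targets, predicted_probs)
print(round(loss, 4))
```

Whatever the model predicts for the padded position, the loss stays the same, so the model is never rewarded or penalised for predictions after the end of a caption.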

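Step 4: Model training

Next, let's define the training step. We use a technique called teacher forcing, which passes the target word to the decoder as the next input. This technique helps the model quickly learn the correct sequence, or the correct statistical properties of the sequence.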

loss_plot = []
 
@tf.function
def train_step(img_tensor, target):
 loss = 0
 # initializing the hidden state for each batch
 # because the captions are not related from image to image
 
 hidden = decoder.reset_state(batch_size=target.shape[0])
 # size the start tokens from the actual batch, matching the hidden state above
 dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)
 
 with tf.GradientTape() as tape:
     features = encoder(img_tensor)
     for i in range(1, target.shape[1]):
         # passing the features through the decoder
         predictions, hidden, _ = decoder(dec_input, features, hidden)
         loss += loss_function(target[:, i], predictions)
 
         # using teacher forcing
         dec_input = tf.expand_dims(target[:, i], 1)
 
 total_loss = (loss / int(target.shape[1]))
 trainable_variables = encoder.trainable_variables + decoder.trainable_variables
 gradients = tape.gradient(loss, trainable_variables)
 optimizer.apply_gradients(zip(gradients, trainable_variables))
 
 return loss, total_loss
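
Teacher forcing is easiest to see as a list of (input, target) pairs. In the sketch below the token ids are hypothetical and the function name is illustrative. At every step the decoder is fed the true previous token, not its own prediction, and is asked to predict the true next token.

```python
def teacher_forcing_pairs(token_ids):
    """Pair each decoder input with the token it should predict next."""
    return list(zip(token_ids[:-1], token_ids[1:]))

# Hypothetical ids: 2 = <start>, 3 = <end>, the others are ordinary words.
caption = [2, 7, 9, 4, 3]
pairs = teacher_forcing_pairs(caption)
for step, (decoder_input, target) in enumerate(pairs, start=1):
    print(f"step {step}: input={decoder_input} target={target}")
```

This mirrors the loop in train_step, which starts from the start token and sets dec_input to target[:, i] after each step.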

Next, we train the model:

EPOCHS = 20
start_epoch = 0  # training starts from scratch; no checkpoint is restored here
for epoch in range(start_epoch, EPOCHS):
   start = time.time()
   total_loss = 0
 
   for (batch, (img_tensor, target)) in enumerate(dataset):
       batch_loss, t_loss = train_step(img_tensor, target)
       total_loss += t_loss
 
       if batch % 100 == 0:
           print ('Epoch {} Batch {} Loss {:.4f}'.format(
             epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))
   # storing the epoch end loss value to plot later
   loss_plot.append(total_loss / num_steps)
 
   print ('Epoch {} Loss {:.6f}'.format(epoch + 1,
                                        total_loss/num_steps))
 
   print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Let's plot the loss:

plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()

Step 5: Greedy search and BLEU evaluation

Let's define the greedy method for generating captions:

def evaluate(image):
   attention_plot = np.zeros((max_length, attention_features_shape))
 
   hidden = decoder.reset_state(batch_size=1)
   temp_input = tf.expand_dims(load_image(image)[0], 0)
   img_tensor_val = image_features_extract_model(temp_input)
   img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
 
   features = encoder(img_tensor_val)
   dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
   result = []
 
   for i in range(max_length):
       predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
       attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()
       predicted_id = tf.argmax(predictions[0]).numpy()
       result.append(tokenizer.index_word[predicted_id])
 
       if tokenizer.index_word[predicted_id] == '<end>':
           return result, attention_plot
 
       dec_input = tf.expand_dims([predicted_id], 0)
   attention_plot = attention_plot[:len(result), :]
 
   return result, attention_plot
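
Greedy search simply picks the most probable next word at every step and stops at the end token. The sketch below isolates that loop with a toy lookup table in place of the decoder. This is a strong simplification: the real decoder also conditions on the image features and the hidden state, and all probabilities and names here are made up for illustration.

```python
def greedy_decode(next_token_probs, start="<start>", end="<end>", max_length=10):
    """Repeatedly pick the most probable next token until `end` or max_length."""
    tokens = []
    current = start
    for _ in range(max_length):
        probs = next_token_probs[current]
        current = max(probs, key=probs.get)  # argmax over the candidate words
        if current == end:
            break
        tokens.append(current)
    return tokens

# Hypothetical next-word distributions standing in for the trained decoder.
toy_model = {
    "<start>": {"two": 0.6, "a": 0.4},
    "two": {"dogs": 0.7, "cats": 0.3},
    "a": {"dog": 0.9, "<end>": 0.1},
    "dogs": {"run": 0.8, "<end>": 0.2},
    "cats": {"run": 0.5, "<end>": 0.5},
    "dog": {"<end>": 1.0},
    "run": {"<end>": 0.9, "two": 0.1},
}
caption = greedy_decode(toy_model)
print(" ".join(caption))
```

Greedy search never revisits a choice, so a locally best word can lead to a worse overall caption; beam search is the usual alternative when that matters.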

We also define a function to plot the attention map for each generated word, as seen in the introduction:

def plot_attention(image, result, attention_plot):
   temp_image = np.array(Image.open(image))
   fig = plt.figure(figsize=(10, 10))
   len_result = len(result)
   for l in range(len_result):
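       # VGG16 yields 7 x 7 = 49 attention locations
       temp_att = np.resize(attention_plot[l], (7, 7))
       # a square grid with at least 2 rows that is large enough for every word
       grid_size = max(int(np.ceil(len_result / 2)), 2)
       ax = fig.add_subplot(grid_size, grid_size, l+1)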
       ax.set_title(result[l])
       img = ax.imshow(temp_image)
       ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())
 
   plt.tight_layout()
   plt.show()

Finally, let's generate a caption for the image from the beginning of the article and see what the attention mechanism focuses on:

# caption the image shown at the start of the article
image = '/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset/2319175397_3e586cfaf8.jpg'
real_caption = 'Two white dogs are playing in the snow'
 
result, attention_plot = evaluate(image)
 
# drop "<unk>" and "<end>" from the prediction
words = [word for word in result if word not in ("<unk>", "<end>")]
result_final = ' '.join(words)
 
# sentence_bleu expects a list of tokenized references and a tokenized candidate;
# the reference is lower-cased because the model was trained on lower-case captions
reference = [real_caption.lower().split()]
candidate = words
 
score = sentence_bleu(reference, candidate)
print(f"BLEU score: {score*100}")
 
print ('Real Caption:', real_caption)
print ('Prediction Caption:', result_final)
plot_attention(image, result, attention_plot)

You can see that we were able to generate the same caption as the real one. Let's try some other images from the test set.

rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
start = time.time()
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)
 
first = real_caption.split(' ', 1)[1]
real_caption = first.rsplit(' ', 1)[0]
 
# drop "<unk>" and "<end>" from the prediction
words = [word for word in result if word not in ("<unk>", "<end>")]
result_final = ' '.join(words)
 
print ('Real Caption:', real_caption)
print ('Prediction Caption:', result_final)
 
plot_attention(image, result, attention_plot)
print(f"time took to Predict: {round(time.time()-start)} sec")
 
Image.open(img_name_val[rid])

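You can see that even though our caption is quite different from the real caption, it is still very accurate. It was able to identify the woman's yellow shirt and her hand in her pocket.

Let's look at another one: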

rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
 
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)
 
# remove <start> and <end> from the real_caption
first = real_caption.split(' ', 1)[1]
real_caption = first.rsplit(' ', 1)[0]
 
# drop "<unk>" and "<end>" from the prediction
words = [word for word in result if word not in ("<unk>", "<end>")]
result_final = ' '.join(words)
 
# sentence_bleu expects a list of tokenized references and a tokenized candidate
reference = [[word for word in real_caption.split() if word != "<unk>"]]
candidate = words
 
score = sentence_bleu(reference, candidate)
print(f"BLEU score: {score*100}")
 
print ('Real Caption:', real_caption)
print ('Prediction Caption:', result_final)
 
plot_attention(image, result, attention_plot)

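Here we can see that our caption describes the image better than one of the real captions.

And there it is! We have successfully implemented an attention mechanism for generating image captions.

What's next?

Attention mechanisms have been used heavily in recent years, and this is only the beginning of more advanced systems. Things you can implement to improve the model:

1. Use a larger dataset, especially the MS COCO dataset or the Stock3M dataset, which is 26 times larger than MS COCO.

2. Implement different attention mechanisms, such as adaptive attention with a visual sentinel, and semantic attention.

3. Implement a Transformer-based model, which should perform much better than an LSTM.

4. Implement a better architecture for image feature extraction, such as Inception, Xception, or EfficientNet.

End notes

This was a very interesting look at the attention mechanism and how it applies to deep learning applications. A great deal of research has gone into attention mechanisms and into achieving state-of-the-art results. Be sure to try some of my suggestions! Did you find this article helpful? Please share your valuable feedback in the comments section below.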

Author: 沂水寒城, CSDN blog expert; research interests: machine learning, deep learning, NLP, CV

Blog: http://yishuihancheng.blog.csdn.net
