
BERT Source Code Walkthrough (Part 1)


          2021-02-04 09:27


The BERT model has been around for quite a while now. I previously read the paper and several blog posts interpreting it (see my earlier write-up, NLP 大殺器 BERT 模型解讀[1]), but I had never gone through the actual implementation in detail. Having needed it recently, I took the time to read the source carefully and write up these notes for discussion.

Note that this source-reading series assumes some prior NLP background, such as the attention mechanism, the Transformer architecture, and basic Python and TensorFlow. BERT's underlying theory is not the focus of this article.

A roundup of BERT-related papers, articles, and code resources is linked here: BERT 相關論文、文章和代碼資源匯總[2].

Today's topic is the main part of the BERT model implementation, BertModel. The code lives in:

• the modeling.py module[3]

Besides the prose between the code blocks, there are also explanatory comments inside them.

If anything in my reading is off, please do point it out.

1. Configuration class (BertConfig)

This part of the code mainly defines the BERT model's default hyperparameters, along with a few file-handling helper methods.

# Excerpt from modeling.py; it relies on these module-level imports:
#   import copy, json, six, tensorflow as tf

class BertConfig(object):
  """Configuration class for the BERT model."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

「Parameter meanings」

• vocab_size: vocabulary size
• hidden_size: number of hidden units
• num_hidden_layers: number of layers in the Transformer encoder
• num_attention_heads: number of heads in multi-head attention
• intermediate_size: size of the encoder's "intermediate" (feed-forward) layer
• hidden_act: hidden-layer activation function
• hidden_dropout_prob: dropout rate for the hidden layers
• attention_probs_dropout_prob: dropout rate for the attention probabilities
• max_position_embeddings: maximum position-embedding length
• type_vocab_size: vocabulary size of token_type_ids
• initializer_range: stdev of the truncated_normal_initializer

One point that may be confusing at first is the type_vocab_size parameter: it is simply the number of segment ids, i.e. Segment A and Segment B in the next sentence prediction task. The bert_config.json file shipped with the released checkpoints documents it as well and sets it to 2 (even though the constructor default above is 16). See this GitHub issue[4].
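As a quick illustration (my own sketch, not from the original post; the checkpoint directory name is just an assumed example), the config class round-trips cleanly through JSON:

# hypothetical path -- point it at your downloaded checkpoint directory
config = BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
print(config.type_vocab_size)   # 2 in the released configs (Segment A / Segment B)
print(config.to_json_string())  # dump every hyperparameter as JSON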

2. Word embedding lookup (embedding_lookup)

Given the input word_ids, this returns the looked-up embeddings together with the embedding table. The lookup can be done either with a one-hot matmul or with tf.gather().

def embedding_lookup(input_ids,                       # word ids: [batch_size, seq_length]
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  # The function assumes an input of shape [batch_size, seq_length, input_num].
  # If the input is 2-D, [batch_size, seq_length], it is expanded to
  # [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])  # [batch_size*seq_length*input_num]
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:  # plain index lookup
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)

  # output: [batch_size, seq_length, num_inputs]
  # reshaped to: [batch_size, seq_length, num_inputs*embedding_size]
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)

「Parameter meanings」

• input_ids: word ids, [batch_size, seq_length]
• vocab_size: size of the embedding vocabulary
• embedding_size: embedding dimension
• initializer_range: initialization range for the embeddings
• word_embedding_name: name of the embedding table variable
• use_one_hot_embeddings: whether to use one-hot embeddings (otherwise tf.gather)
• Return: [batch_size, seq_length, embedding_size]
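To see why the two lookup paths are interchangeable, here is a small check I put together (TensorFlow 1.x, not part of modeling.py); the one-hot route exists mainly because it is faster on TPUs:

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

ids = tf.constant([2, 0, 1])
table = tf.constant(np.arange(6, dtype=np.float32).reshape(3, 2))

gathered = tf.gather(table, ids)                       # plain row lookup
one_hot = tf.matmul(tf.one_hot(ids, depth=3), table)   # one-hot matmul lookup

with tf.Session() as sess:
    g, o = sess.run([gathered, one_hot])
    print(np.allclose(g, o))  # True: both select the same embedding rows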

3. Post-processing the embeddings (embedding_postprocessor)

BERT's input embedding has three parts: token embeddings, segment embeddings, and position embeddings. The previous section produced only the token embeddings; this code adds the remaining information, then applies layer normalization and dropout to produce the final embedding. Note that in the Transformer paper the position embeddings are fixed values generated by sin/cos functions, whereas in this implementation they are randomly initialized and trainable, just like ordinary word embeddings. The likely reason for this choice is that BERT is trained on far more data than the original Transformer, so the model can simply learn the positions itself.
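For contrast, here is a rough sketch of the fixed sinusoidal encoding from the Transformer paper (my own NumPy version, not part of modeling.py); BERT simply replaces this with the learned full_position_embeddings variable seen below:

import numpy as np

def sinusoidal_position_encoding(seq_length, width):
    """Fixed sin/cos position encoding as described in 'Attention Is All You Need'."""
    pos = np.arange(seq_length)[:, None]                          # [seq_length, 1]
    i = np.arange(width)[None, :]                                 # [1, width]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / np.float32(width))
    enc = np.zeros((seq_length, width), dtype=np.float32)
    enc[:, 0::2] = np.sin(angles[:, 0::2])                        # even dimensions
    enc[:, 1::2] = np.cos(angles[:, 1::2])                        # odd dimensions
    return enc                                                    # [seq_length, width]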

def embedding_postprocessor(input_tensor,              # [batch_size, seq_length, embedding_size]
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,  # usually 2
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,  # maximum position length; must be >= max_seq_len
                            dropout_prob=0.1):
  input_shape = get_shape_list(input_tensor, expected_rank=3)  # [batch_size, seq_length, embedding_size]
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  # Segment (token type) information
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # The token type vocabulary is small, so a one-hot matmul is faster here.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  # Position embedding information
  if use_position_embeddings:
    # Make sure seq_length <= max_position_embeddings
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))

      # The position embeddings are learnable parameters of shape
      # [max_position_embeddings, width]. The actual input sequence is usually
      # shorter than max_position_embeddings, so for speed tf.slice takes only
      # the first seq_length rows.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # The tensor after word embedding is [batch_size, seq_length, width].
      # The position encoding does not depend on the content, so its shape is
      # always [seq_length, width] and it cannot be added to the word
      # embeddings directly. It is therefore reshaped to [1, seq_length, width]
      # so that it can be added via broadcasting.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output

4. Building the attention_mask

This code builds the attention_mask that defines what each position is allowed to attend to. Every sample is padded to a fixed length, and during self-attention the padded positions must not be attended to. The inputs are the padded input_ids of shape [batch_size, from_seq_length, ...] and a mask vector of shape [batch_size, to_seq_length].

def create_attention_mask_from_input_mask(from_tensor, to_mask):
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  # mask: [batch_size, from_seq_length, to_seq_length]
  mask = broadcast_ones * to_mask

  return mask
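A toy sanity check (my own example; it assumes the BERT repo's modeling.py is importable): with one padded position, every query row gets the same [1, 1, 0] pattern, so nothing can attend to the padding:

import tensorflow as tf  # TensorFlow 1.x
from modeling import create_attention_mask_from_input_mask  # assumes modeling.py is on the path

input_ids = tf.constant([[25, 120, 0]])   # [batch_size=1, seq_length=3], last token is padding
input_mask = tf.constant([[1, 1, 0]])     # 1 = real token, 0 = padding

mask = create_attention_mask_from_input_mask(input_ids, input_mask)
with tf.Session() as sess:
    print(sess.run(mask))
    # [[[1. 1. 0.]
    #   [1. 1. 0.]
    #   [1. 1. 0.]]]  -> no query position may attend to the padded position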

5. The attention layer (attention_layer)

This code implements "multi-head attention", essentially as in the paper "Attention Is All You Need". It is key-query-value attention: the input from_tensor acts as the query and to_tensor as the key and value; when the two are the same tensor, this is self-attention. For a more detailed introduction to attention, see 理解 Attention 機制原理及模型[5].

def attention_layer(from_tensor,          # [batch_size, from_seq_length, from_width]
                    to_tensor,            # [batch_size, to_seq_length, to_width]
                    attention_mask=None,  # [batch_size, from_seq_length, to_seq_length]
                    num_attention_heads=1,   # number of attention heads
                    size_per_head=512,       # size of each head
                    query_act=None,          # activation for the query transform
                    key_act=None,            # activation for the key transform
                    value_act=None,          # activation for the value transform
                    attention_probs_dropout_prob=0.0,  # dropout on the attention probs
                    initializer_range=0.02,  # initializer range
                    do_return_2d_tensor=False,
                    # If True, the output has shape
                    #   [batch_size*from_seq_length, num_attention_heads*size_per_head];
                    # if False, it has shape
                    #   [batch_size, from_seq_length, num_attention_heads*size_per_head].
                    batch_size=None,
                    # If the inputs are 3-D, the batch size is the first dimension,
                    # but 3-D inputs may have been flattened to 2-D, in which case
                    # batch_size must be passed in explicitly.
                    from_seq_length=None,  # same as above
                    to_seq_length=None):   # same as above

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    # [batch_size, num_attention_heads, seq_length, width]
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Shorthand used in the shape comments below:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  # Flatten from_tensor and to_tensor to 2-D matrices.
  from_tensor_2d = reshape_to_matrix(from_tensor)  # [B*F, hidden_size]
  to_tensor_2d = reshape_to_matrix(to_tensor)      # [B*T, hidden_size]

  # Project from_tensor through a dense layer to get query_layer.
  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # Project to_tensor through a dense layer to get key_layer.
  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # Same again for value_layer.
  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # Split query_layer into heads: [B*F, N*H] ==> [B, F, N, H] ==> [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # Split key_layer into heads: [B*T, N*H] ==> [B, T, N, H] ==> [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Dot-product the query with the key, then scale, as in the original paper.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Where attention_mask is 1, adder is (1-1)*-10000 = 0;
    # where attention_mask is 0, adder is (1-0)*-10000 = -10000.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # The attention scores are never very large, so for masked positions the
    # resulting score is effectively negative infinity.
    attention_scores += adder

  # Softmax turns "negative infinity" into 0, so masked positions contribute
  # nothing to the attention distribution.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # Dropout on the attention probabilities. It may look odd, but this is what
  # the original Transformer paper does.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer

To summarize, the main flow of the attention layer:

• Validate the shapes of the input tensors and extract batch_size, from_seq_length, and to_seq_length;
• If the input is a 3-D tensor, flatten it into a 2-D matrix;
• from_tensor serves as the query and to_tensor as the key and value; each goes through a dense layer to produce query_layer, key_layer, and value_layer;
• transpose_for_scores reshapes those tensors into multi-head form;
• Compute attention_scores and attention_probs following the formula in the paper (note the attention_mask trick);
• Multiply attention_probs by the values and return a 2-D or 3-D tensor.
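Stripped of the reshaping and head bookkeeping, the core of the layer is plain scaled dot-product attention. A compact NumPy version for reference (my own sketch, single head, no mask):

import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q: [F, H], k: [T, H], v: [T, H]
    scores = q @ k.T / np.sqrt(q.shape[-1])        # [F, T], same scaling as attention_layer
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)     # softmax over T
    return probs @ v                               # [F, H], the "context layer"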

6. Transformer

The next block is the famous core of the Transformer; you can think of it as a re-implementation of the original "Attention Is All You Need" code. See the original paper[6] and the original code[7].

def transformer_model(input_tensor,         # [batch_size, seq_length, hidden_size]
                      attention_mask=None,  # [batch_size, seq_length, seq_length]
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,  # activation of the feed-forward layer
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  # The final output has size hidden_size. There are num_attention_heads
  # heads, each with size_per_head hidden units, so
  #   hidden_size = num_attention_heads * size_per_head.
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The encoder uses residual connections, so the input width must equal hidden_size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # Reshapes are cheap on CPU/GPU but expensive on TPU, so to avoid constantly
  # switching between 2-D and 3-D, all 3-D tensors are kept as 2-D matrices.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        # multi-head attention
        attention_heads = []
        with tf.variable_scope("self"):
          # self-attention
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # If there are multiple heads, concatenate them.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Project the attention output back to the input shape,
        # then dropout + residual + layer norm.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # feed-forward
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Project the feed-forward output back down to hidden_size,
      # then dropout + residual + layer norm.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

This is easier to follow alongside the Transformer architecture diagram (not reproduced here); since BERT uses only the encoder, the decoder half never makes an appearance.
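One relation worth keeping in mind from the divisibility check at the top of transformer_model (my own note, using BERT-Base numbers):

hidden_size, num_attention_heads = 768, 12           # BERT-Base values
assert hidden_size % num_attention_heads == 0
size_per_head = hidden_size // num_attention_heads   # 64, so hidden_size = N * H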

7. The entry point (__init__)

This is the constructor of the BertModel class. With the groundwork of the previous sections in place, we can now assemble the BERT model.

def __init__(self,
             config,                        # a BertConfig instance
             is_training,
             input_ids,                     # [batch_size, seq_length]
             input_mask=None,               # [batch_size, seq_length]
             token_type_ids=None,           # [batch_size, seq_length]
             use_one_hot_embeddings=False,  # use one-hot embeddings; otherwise tf.gather()
             scope=None):
  config = copy.deepcopy(config)
  if not is_training:
    config.hidden_dropout_prob = 0.0
    config.attention_probs_dropout_prob = 0.0

  input_shape = get_shape_list(input_ids, expected_rank=2)
  batch_size = input_shape[0]
  seq_length = input_shape[1]

  # No mask provided: treat every position as valid (all ones).
  if input_mask is None:
    input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

  if token_type_ids is None:
    token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

  with tf.variable_scope(scope, default_name="bert"):
    with tf.variable_scope("embeddings"):
      # word embedding
      (self.embedding_output, self.embedding_table) = embedding_lookup(
          input_ids=input_ids,
          vocab_size=config.vocab_size,
          embedding_size=config.hidden_size,
          initializer_range=config.initializer_range,
          word_embedding_name="word_embeddings",
          use_one_hot_embeddings=use_one_hot_embeddings)

      # Add position embeddings and segment embeddings,
      # then layer norm + dropout.
      self.embedding_output = embedding_postprocessor(
          input_tensor=self.embedding_output,
          use_token_type=True,
          token_type_ids=token_type_ids,
          token_type_vocab_size=config.type_vocab_size,
          token_type_embedding_name="token_type_embeddings",
          use_position_embeddings=True,
          position_embedding_name="position_embeddings",
          initializer_range=config.initializer_range,
          max_position_embeddings=config.max_position_embeddings,
          dropout_prob=config.hidden_dropout_prob)

    with tf.variable_scope("encoder"):
      # input_ids are padded word ids, e.g. [25, 120, 34, 0, 0]
      # input_mask marks the valid tokens, e.g. [1, 1, 1, 0, 0]
      attention_mask = create_attention_mask_from_input_mask(
          input_ids, input_mask)

      # Stack of transformer layers.
      # `sequence_output` shape = [batch_size, seq_length, hidden_size].
      self.all_encoder_layers = transformer_model(
          input_tensor=self.embedding_output,
          attention_mask=attention_mask,
          hidden_size=config.hidden_size,
          num_hidden_layers=config.num_hidden_layers,
          num_attention_heads=config.num_attention_heads,
          intermediate_size=config.intermediate_size,
          intermediate_act_fn=get_activation(config.hidden_act),
          hidden_dropout_prob=config.hidden_dropout_prob,
          attention_probs_dropout_prob=config.attention_probs_dropout_prob,
          initializer_range=config.initializer_range,
          do_return_all_layers=True)

    # `self.sequence_output` is the output of the last layer,
    # shape [batch_size, seq_length, hidden_size].
    self.sequence_output = self.all_encoder_layers[-1]

    # The "pooler" converts the encoder output [batch_size, seq_length, hidden_size]
    # into [batch_size, hidden_size].
    with tf.variable_scope("pooler"):
      # Take the tensor for the first token, [CLS], of the last layer;
      # this is what classification tasks build on.
      # sequence_output[:, 0:1, :] has shape [batch_size, 1, hidden_size];
      # squeeze removes the second dimension.
      first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
      # Then a dense layer; the output is still [batch_size, hidden_size].
      self.pooled_output = tf.layers.dense(
          first_token_tensor,
          config.hidden_size,
          activation=tf.tanh,
          kernel_initializer=create_initializer(config.initializer_range))

Wrapping up

With this closer look at the source, using BertModel becomes much more comfortable. Here is a simple usage example:

# Assumes: import tensorflow as tf, plus the BERT repo's modeling module.
# The input has already been tokenized into word ids. shape=[2, 3]
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
# segment embedding: in the first sample the first two tokens belong to sentence 1
# and the last token to sentence 2; in the second sample the first token belongs to
# sentence 1, the second to sentence 2, and the final 0 is padding.
# The original snippet uses the values below; the 2 does not seem necessary to me,
# unless I am misunderstanding something.
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

# Create a BertConfig instance.
# Note: hidden_size must be divisible by num_attention_heads, so 8 heads are used
# here rather than the 6 in the original snippet (512 % 6 != 0 would raise).
config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
                             num_hidden_layers=8, num_attention_heads=8,
                             intermediate_size=1024)

# Create a BertModel instance
model = modeling.BertModel(config=config, is_training=True,
                           input_ids=input_ids, input_mask=input_mask,
                           token_type_ids=token_type_ids)

label_embeddings = tf.get_variable(...)
# The first token of the last layer is [CLS]; its vector can be viewed as a
# sentence-level embedding.
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
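Besides get_pooled_output(), BertModel exposes a few other accessors in modeling.py that are useful for token-level tasks such as tagging or QA:

sequence_output = model.get_sequence_output()     # [batch_size, seq_length, hidden_size], last layer
all_layers = model.get_all_encoder_layers()       # list of per-layer outputs
embedding_output = model.get_embedding_output()   # output of the embedding + postprocessing step
embedding_table = model.get_embedding_table()     # the word embedding matrix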

The main flow of building the BERT model:

• Embed the input sequence (all three embeddings); from there on it is essentially the content of "Attention Is All You Need";
• In short: feed the embeddings into the Transformer and take its output;
• In slightly more detail: embedding --> N * [multi-head attention --> Add (residual) & Norm --> Feed-Forward --> Add (residual) & Norm];
• That is really all there is to it;
• The source contains a few other helper functions, but they are easy to follow, so I will not go over them here.

That's all for this post~

References

[1] NLP 大殺器 BERT 模型解讀: https://blog.csdn.net/Kaiyuan_sjtu/article/details/83991186
[2] BERT 相關論文、文章和代碼資源匯總: http://www.52nlp.cn/bert-paper-%E8%AE%BA%E6%96%87-%E6%96%87%E7%AB%A0-%E4%BB%A3%E7%A0%81%E8%B5%84%E6%BA%90%E6%B1%87%E6%80%BB
[3] modeling.py: https://github.com/google-research/bert/blob/master/modeling.py
[4] google-research/bert issue #16: https://github.com/google-research/bert/issues/16
[5] 理解 Attention 機制原理及模型: https://blog.csdn.net/Kaiyuan_sjtu/article/details/81806123
[6] Attention Is All You Need (original paper): https://arxiv.org/abs/1706.03762
[7] Original Transformer code (tensor2tensor): https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

