
[Relation Extraction: HBT] The Things You Didn't Know

About 39,291 characters; roughly a 79-minute read

2021-05-18 00:22

Author: 楊夕

Project: https://github.com/km1994/nlp_paper_study

bert4keras version: https://github.com/bojone/lic2020_baselines

PyTorch version: https://github.com/powerycy/Lic2020-

Paper: A Novel Hierarchical Binary Tagging Framework for Relational Triple Extraction

[Note: the images in this post may not display when reading on a phone.]


Abstract

          Extracting relational triples from unstructured text is crucial for large-scale knowledge graph construction. However, few existing works excel in solving the overlapping triple problem where multiple relational triples in the same sentence share the same entities. In this work, we introduce a fresh perspective to revisit the relational triple extraction task and propose a novel Hierarchical Binary Tagging (HBT) framework derived from a principled problem formulation. Instead of treating relations as discrete labels as in previous works, our new framework models relations as functions that map subjects to objects in a sentence, which naturally handles the overlapping problem. Experiments show that the proposed framework already outperforms state-of-the-art methods even when its encoder module uses a randomly initialized BERT encoder, showing the power of the new tagging framework. It enjoys further performance boost when employing a pretrained BERT encoder, outperforming the strongest baseline by 17.5 and 30.2 absolute gain in F1-score on two public datasets NYT and WebNLG, respectively. In-depth analysis on different scenarios of overlapping triples shows that the method delivers consistent performance gain across all these scenarios.


1. Introduction

1.1 Background

Relational Triple Extraction (RTE), also called joint entity and relation extraction, is a classic task in information extraction. It aims to extract structured relational triples (Subject, Relation, Object) from text in order to build knowledge graphs.

In recent years, with steady progress in NLP, triple extraction in simple contexts (e.g., a sentence containing only one relational triple) already works fairly well. But in complex contexts (a sentence containing multiple relational triples, sometimes more than five), and especially when several triples overlap, many existing models start to fall short.

1.2 Previous Approaches

1.2.1 Pipeline approach

1.2.1.1 Idea

The pipeline approach splits joint entity-relation extraction into two tasks, entity extraction + relation classification:

1. Entity extraction: a named entity recognition model identifies all entities in the sentence;

2. Relation classification: a relation classification model classifies the relation for every entity pair. [This step can be viewed as a text classification task, except that relation classification must learn not only the sentence itself but also where the entity pair sits in the sentence.]

1.2.1.2 Problems
• Error propagation: because the joint task is split into entity extraction + relation classification, mistakes made during entity extraction cannot be corrected later, so this approach suffers from error propagation;

1.2.2 Feature-based and neural-network-based models

1.2.2.1 Idea

By replacing hand-crafted features with learned representations, neural-network-based models have achieved considerable success on relational triple extraction.

1.2.2.2 Problems
• Overlapping entities and relations: most existing methods cannot properly handle sentences that contain multiple relational triples overlapping with each other.

Figure 1 illustrates three relational-triple scenarios:
Normal. The triples do not overlap: the relation between (United States, Trump) is Country_president and the relation between (Tim Cook, Apple Inc) is Company_CEO; this is the simple case.
EPO (Entity Pair Overlap). Two (or more) triples share the same entity pair: the pair (Quentin Tarantino, Django Unchained) carries both the Act_in and Direct_movie relations.
SEO (Single Entity Overlap). Two (or more) triples share exactly one entity: (Jackie, Birth, Washington) and (Washington, Capital, United States) share the entity Washington.
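To make the three overlap patterns concrete, here is a small illustrative sketch in Python; the entity and relation names simply mirror the Figure 1 examples and are only for illustration:

# Normal: the two triples share no entity
normal = [("United States", "Country_president", "Trump"),
          ("Apple Inc", "Company_CEO", "Tim Cook")]

# EPO (Entity Pair Overlap): the same (subject, object) pair takes two relations
epo = [("Quentin Tarantino", "Act_in", "Django Unchained"),
       ("Quentin Tarantino", "Direct_movie", "Django Unchained")]

# SEO (Single Entity Overlap): the two triples share exactly one entity ("Washington")
seo = [("Jackie", "Birth_place", "Washington"),
       ("Washington", "Capital_of", "United States")]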

1.2.3 Seq2Seq-based models and GCNs

1.2.3.1 Idea

Zeng et al. were among the first to consider the overlapping-triple problem in relational triple extraction. They introduced the categories of overlapping patterns shown in Figure 1 and proposed a sequence-to-sequence (Seq2Seq) model with a copy mechanism for extracting triples. Building on that Seq2Seq model, they further studied the effect of extraction order and obtained a large improvement via reinforcement learning.

Fu et al. also studied the overlapping-triple problem by modeling text as a relational graph with a graph convolutional network (GCN) based model.

1.2.3.2 Problems
• Too many negative examples: many of the extracted entity pairs do not form a valid relation, producing far too many negative examples;

• EPO (Entity Pair Overlap): when the same entity participates in multiple relations, the classifier can get confused. Without enough training examples, it struggles to tell exactly which relations the entity takes part in;

2. What This Paper Does

• Approach: implement an HBT (Hierarchical Binary Tagging) framework for the RTE task that is not troubled by the overlapping-triple problem;

• Core idea: model a Relation as a function that maps a head entity (Subject) to a tail entity (Object), instead of treating it as a label on an entity pair.

Instead of learning a relation classifier f(s, o) → r, the paper learns relation-specific taggers f_r(s) → o; each tagger recognizes the possible object(s) of a given subject under a specific relation, or returns no object, indicating that the given subject and relation yield no triple.

• Pipeline (sketched in code below):

  • First, identify all possible subjects in the sentence;

  • Then, for each subject, apply the relation-specific taggers to simultaneously identify all possible relations and the corresponding objects.
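A minimal sketch of this two-step decoding; tag_subjects and tag_objects_for are hypothetical helpers standing in for the Subject Tagger and the relation-specific Object Taggers described in Section 3:

def extract_triples(sentence, relations):
    # HBT-style decoding: subjects first, then relation-specific objects
    triples = []
    subjects = tag_subjects(sentence)              # step 1: all candidate subjects
    for s in subjects:
        for r in relations:                        # step 2: every relation-specific tagger
            for o in tag_objects_for(sentence, s, r):
                triples.append((s, r, o))          # no object found => no triple for (s, r)
    return triples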

3. The HBT Architecture

3.1 BERT Encoder Layer

BERT is used as the encoder here; in effect it serves as the embedding layer.

• Code:

Description: input_ids, attention_mask, token_type_ids, position_ids, head_mask, and inputs_embeds are passed into the BERT model, and the last layer of BERT, i.e. the hidden states, is taken as the output.

import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class REModel_sbuject_2(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        ...

    def forward(
        self, input_ids=None, attention_mask=None,
        token_type_ids=None, position_ids=None,
        head_mask=None, inputs_embeds=None,
        labels=None, subject_ids=None,
        batch_size=None, obj_labels=None,
        sub_train=False, obj_train=False
    ):
        # step 1: run BERT and take its last hidden states as the token representations
        outputs_1 = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )
        sequence_output = outputs_1[0]
        sequence_output = self.dropout(sequence_output)
        ...

3.2 Hierarchical Decoder Layer

The Hierarchical Decoder layer has two parts:

1. Subject Tagger layer: extracts the Subjects;

2. Relation-Specific Object Taggers layer: a set of relation-specific object taggers (there are multiple taggers because there are multiple possible relations);

3.2.1 Subject Tagger Layer

• Goal: detect the start and end positions of each Subject

• Method: two identical binary classifiers detect the start and the end position of every Subject;

• How:

A sigmoid is applied to the feature vectors output by BERT to compute, for each token, the probability of being the start or the end of a subject. If the probability exceeds a preset threshold, the token is tagged 1; otherwise it is tagged 0.

Here x_i is the encoded representation of the i-th token, and p_i is the probability that the i-th token is the start or the end of a subject.

To learn good weights W and biases b, the subject tagger optimizes the following likelihood:
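A sketch of the referenced formulas, in the notation of the HBT paper (the original equation images are not reproduced here):

p_i^{start_s} = \sigma(W_{start} x_i + b_{start}),    p_i^{end_s} = \sigma(W_{end} x_i + b_{end})

p_\theta(s \mid x) = \prod_{t \in \{start_s, end_s\}} \prod_{i=1}^{L} (p_i^{t})^{\mathbf{1}\{y_i^{t}=1\}} (1 - p_i^{t})^{\mathbf{1}\{y_i^{t}=0\}}

where L is the sentence length, y_i^t is the binary start/end tag of token i, and σ is the sigmoid function.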

• Code:

import torch
import torch.nn as nn
from torch.nn import BCELoss

BertLayerNorm = torch.nn.LayerNorm

class REModel_sbuject_2(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.obj_labels = 110
        self.bert = BertModel(config)
        self.linear = nn.Linear(768, 768)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.obj_classifier = nn.Linear(config.hidden_size, self.obj_labels)
        self.sub_pos_emb = nn.Embedding(256, 768)
        self.relu = nn.ReLU()
        self.init_weights()

    def forward(
        self, input_ids=None, attention_mask=None,
        token_type_ids=None, position_ids=None,
        head_mask=None, inputs_embeds=None,
        labels=None, subject_ids=None,
        batch_size=None, obj_labels=None,
        sub_train=False, obj_train=False
    ):
        ...
        # step 2: Subject Tagger layer -- predict the subject start/end positions
        if sub_train == True:
            logits = self.classifier(sequence_output)
            outputs = (logits,)  # add hidden states and attention if they are here
            loss_sig = nn.Sigmoid()
            # only keep the active parts of the loss
            active_logits = logits.view(-1, self.num_labels)
            active_logits = loss_sig(active_logits)
            active_logits = active_logits ** 2
            if labels is not None:
                active_labels = labels.view(-1, self.num_labels).float()
                loss_fct = BCELoss(reduction='none')
                loss = loss_fct(active_logits, active_labels)
                loss = loss.view(batch_size, -1, 2)
                loss = torch.mean(loss, 2)
                # mask out padding positions before averaging the loss
                loss = torch.sum(attention_mask * loss) / torch.sum(attention_mask)
                outputs = (loss,) + outputs
            else:
                outputs = active_logits
        ...

3.2.2 Relation-specific Object Taggers Layer

• Goal: detect the start and end positions of each Object

• Method: as with the subject tagger, two identical binary classifiers detect the start and end positions of each Object. The Relation-specific Object Taggers layer, however, additionally fuses in the subject features from the previous step, combined with the BERT Encoder output, to predict the object's start and end positions under a given relation. Compared with the subject tagger, the probability computation gains an extra term v:

v is the averaged vector of the k-th subject predicted by the Subject Tagger, i.e. the mean of the token representations spanning that subject.

Averaging is done so that x_i and v have the same dimensionality.

For the tagger of each relation r, the likelihood to optimize (to learn good weights W and biases b) has exactly the same form on the right-hand side as before:
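In the HBT paper's notation, these are roughly (a sketch standing in for the missing equation images):

p_i^{start_o} = \sigma(W_{start}^{r}(x_i + v_{sub}^{k}) + b_{start}^{r}),    p_i^{end_o} = \sigma(W_{end}^{r}(x_i + v_{sub}^{k}) + b_{end}^{r})

v_{sub}^{k} = \frac{1}{|s_k|} \sum_{i \in s_k} x_i    (average of the token vectors of the k-th subject span)

p_{\phi_r}(o \mid s, x) = \prod_{t \in \{start_o, end_o\}} \prod_{i=1}^{L} (p_i^{t})^{\mathbf{1}\{y_i^{t}=1\}} (1 - p_i^{t})^{\mathbf{1}\{y_i^{t}=0\}}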

• Code:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import BCELoss

BertLayerNorm = torch.nn.LayerNorm
...

def merge_function(inputs):
    '''
    Average the given vectors by simple element-wise addition.
    '''
    output = inputs[0]
    for i in range(1, len(inputs)):
        output += inputs[i]
    return output / len(inputs)


def batch_gather(data: torch.Tensor, index: torch.Tensor):
    """
    Gather the vector representations from data at the given index.
    """
    index = index.unsqueeze(-1)
    index = index.expand(data.size()[0], index.size()[1], data.size()[2])
    return torch.gather(data, 1, index)


def extrac_subject_1(output, subject_ids):
    """
    Extract the start and end vector representations of the subject
    from output according to subject_ids.
    """
    start = batch_gather(output, subject_ids[:, :1])
    end = batch_gather(output, subject_ids[:, 1:])
    return start, end


class REModel_sbuject_2(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.obj_labels = 110
        self.bert = BertModel(config)
        self.linear = nn.Linear(768, 768)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.obj_classifier = nn.Linear(config.hidden_size, self.obj_labels)
        self.sub_pos_emb = nn.Embedding(256, 768)
        self.relu = nn.ReLU()
        self.init_weights()

    def forward(
        self, input_ids=None, attention_mask=None,
        token_type_ids=None, position_ids=None,
        head_mask=None, inputs_embeds=None,
        labels=None, subject_ids=None,
        batch_size=None, obj_labels=None,
        sub_train=False, obj_train=False
    ):
        ...
        # step 3: Relation-specific Object Taggers layer -- feed in the subject, predict objects
        if obj_train == True:
            ## step 3.1: with the subject start/end known, take hidden states from deeper layers
            ##           and use subject_ids to pull out the subject's boundary vectors
            hidden_states = outputs_1[2][-2]
            hidden_states_1 = outputs_1[2][-3]
            loss_sig = nn.Sigmoid()

            ## step 3.2: extract the subject's start and end representations from different hidden layers
            sub_pos_start = self.sub_pos_emb(subject_ids[:, :1]).to(device)
            sub_pos_end = self.sub_pos_emb(subject_ids[:, 1:]).to(device)
            subject_start_last, subject_end_last = extrac_subject_1(sequence_output, subject_ids)
            subject_start_1, subject_end_1 = extrac_subject_1(hidden_states_1, subject_ids)
            subject_start, subject_end = extrac_subject_1(hidden_states, subject_ids)

            # fuse subject position embeddings with span representations from several layers
            subject = (sub_pos_start + subject_start + sub_pos_end + subject_end +
                       subject_start_last + subject_start_1 + subject_end_1 + subject_end_1).to(device)

            ## step 3.3: fuse the subject into the object prediction via (conditional) layer normalization
            batch_token_ids_obj = torch.add(hidden_states, subject)
            batch_token_ids_obj = self.LayerNorm(batch_token_ids_obj)
            batch_token_ids_obj = self.dropout(batch_token_ids_obj)
            batch_token_ids_obj = self.relu(self.linear(batch_token_ids_obj))
            batch_token_ids_obj = self.dropout(batch_token_ids_obj)
            obj_logits = self.obj_classifier(batch_token_ids_obj)

            obj_logits = loss_sig(obj_logits)
            obj_logits = obj_logits ** 4
            obj_outputs = (obj_logits,)
            ## step 3.4: compute the Object/Relation loss
            if obj_labels is not None:
                loss_obj = BCELoss(reduction='none')
                obj_loss = loss_obj(obj_logits.view(batch_size, -1, self.obj_labels // 2, 2), obj_labels.float())
                obj_loss = torch.sum(torch.mean(obj_loss, 3), 2)
                # mask padding positions in the loss
                obj_loss = torch.sum(obj_loss * attention_mask) / torch.sum(attention_mask)
                s_o_loss = torch.add(obj_loss, loss)
                outputs_obj = (s_o_loss,) + obj_outputs
            else:
                outputs_obj = obj_logits.view(batch_size, -1, self.obj_labels // 2, 2)

        if obj_train == True:
            return outputs, outputs_obj  # (loss), scores, (hidden_states), (attentions)
        else:
            return outputs

3.3 Loss Function

1. Equation 1: for each sentence x_j in the training set D and the set T_j of triples that may appear in x_j, Equation 1 maximizes the data likelihood;

2. Equation 2: applying the chain rule turns Equation 1 into Equation 2;

The subscript on the right-hand side denotes the set of triples in T_j whose subject is s; the (r, o) pairs in that set are what the last factor is computed over.

3. Equation 3: for a given subject, the number of relations it participates in within a sentence is usually small, so only some relations map it to a corresponding object (the middle part of Equation 3), yielding valid triples.

Note: for relations the subject does not participate in, the paper introduces a "null object": in that case the function maps the subject to an empty tail entity (the right-hand part of Equation 3), meaning the subject does not take part in that relation and no valid triple can be extracted.

The loss function is the negative log of this likelihood:
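The three equations are not reproduced in the original post; a sketch of them, following the HBT paper's formulation:

max  \prod_{j=1}^{|D|} \Big[ \prod_{(s,r,o) \in T_j} p\big((s,r,o) \mid x_j\big) \Big]                                                                   (1)

   = \prod_{j=1}^{|D|} \Big[ \prod_{s \in T_j} p(s \mid x_j) \prod_{(r,o) \in T_j|s} p\big((r,o) \mid s, x_j\big) \Big]                                   (2)

   = \prod_{j=1}^{|D|} \Big[ \prod_{s \in T_j} p(s \mid x_j) \prod_{r \in T_j|s} p_r(o \mid s, x_j) \prod_{r \in R \setminus T_j|s} p_r(o_\varnothing \mid s, x_j) \Big]   (3)

Taking the negative log of this product gives the training loss, i.e. the sum of the binary cross-entropy terms computed by the subject tagger and the relation-specific object taggers in the code below.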

• Code:

class REModel_sbuject_2(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.obj_labels = 110
        self.bert = BertModel(config)
        self.linear = nn.Linear(768, 768)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.obj_classifier = nn.Linear(config.hidden_size, self.obj_labels)
        self.sub_pos_emb = nn.Embedding(256, 768)
        self.relu = nn.ReLU()
        self.init_weights()

    def forward(
        self, input_ids=None, attention_mask=None,
        token_type_ids=None, position_ids=None,
        head_mask=None, inputs_embeds=None,
        labels=None, subject_ids=None,
        batch_size=None, obj_labels=None,
        sub_train=False, obj_train=False
    ):
        ...
        # step 2: Subject Tagger layer -- predict the subject
        if sub_train == True:
            ...
            ## step 2.2: subject loss (binary cross-entropy over start/end tags, masked by attention_mask)
            if labels is not None:
                active_labels = labels.view(-1, self.num_labels).float()
                loss_fct = BCELoss(reduction='none')
                loss = loss_fct(active_logits, active_labels)
                loss = loss.view(batch_size, -1, 2)
                loss = torch.mean(loss, 2)
                loss = torch.sum(attention_mask * loss) / torch.sum(attention_mask)
                outputs = (loss,) + outputs
            else:
                outputs = active_logits

        # step 3: Relation-specific Object Taggers layer -- feed in the subject, predict objects
        if obj_train == True:
            ...
            ## step 3.4: Object/Relation loss, added to the subject loss
            if obj_labels is not None:
                loss_obj = BCELoss(reduction='none')
                obj_loss = loss_obj(obj_logits.view(batch_size, -1, self.obj_labels // 2, 2), obj_labels.float())
                obj_loss = torch.sum(torch.mean(obj_loss, 3), 2)
                # mask padding positions in the loss
                obj_loss = torch.sum(obj_loss * attention_mask) / torch.sum(attention_mask)
                s_o_loss = torch.add(obj_loss, loss)
                outputs_obj = (s_o_loss,) + obj_outputs
            else:
                outputs_obj = obj_logits.view(batch_size, -1, self.obj_labels // 2, 2)

        if obj_train == True:
            return outputs, outputs_obj  # (loss), scores, (hidden_states), (attentions)
        else:
            return outputs

4. Hands-on Practice

4.1 Dataset

The dataset comes from the Baidu LIC 2020 relation extraction competition and has the following format:

Training data

{
    "text": "《步步驚心》改編自著名作家桐華的同名清穿小說(shuō)《甄嬛傳》改編自流瀲紫所著的同名小說(shuō)電視劇《何以笙簫默》改編自顧漫同名小說(shuō)《花千骨》改編自fresh果果同名小說(shuō)《裸婚時(shí)代》是月影蘭析創(chuàng)作的一部情感小說(shuō)《瑯琊榜》是根據(jù)海宴同名網(wǎng)絡(luò)小說(shuō)改編電視劇《宮鎖心玉》,又名《宮》《雪豹》,該劇改編自網(wǎng)絡(luò)小說(shuō)《特戰(zhàn)先驅(qū)》《我是特種兵》由紅遍網(wǎng)絡(luò)的小說(shuō)《最后一顆子彈留給我》改編電視劇《來(lái)不及說(shuō)我愛你》改編自匪我思存同名小說(shuō)《來(lái)不及說(shuō)我愛你》",
    "spo_list": [
        {
            "predicate": "作者",
            "object_type": {"@value": "人物"},
            "subject_type": "圖書作品",
            "object": {"@value": "顧漫"},
            "subject": "何以笙簫默"
        },
        {
            "predicate": "改編自",
            "object_type": {"@value": "作品"},
            "subject_type": "影視作品",
            "object": {"@value": "最后一顆子彈留給我"},
            "subject": "我是特種兵"
        }, ...
    ]
}
...

Schema data

          {"object_type": {"@value": "學(xué)校"}, "predicate": "畢業(yè)院校", "subject_type": "人物"}
          {"object_type": {"@value": "人物"}, "predicate": "嘉賓", "subject_type": "電視綜藝"}
          {"object_type": {"inWork": "影視作品", "@value": "人物"}, "predicate": "配音", "subject_type": "娛樂人物"}
          {"object_type": {"@value": "歌曲"}, "predicate": "主題曲", "subject_type": "影視作品"}
          {"object_type": {"@value": "人物"}, "predicate": "代言人", "subject_type": "企業(yè)/品牌"}
          {"object_type": {"@value": "音樂專輯"}, "predicate": "所屬專輯", "subject_type": "歌曲"}
          {"object_type": {"@value": "人物"}, "predicate": "父親", "subject_type": "人物"}
          {"object_type": {"@value": "人物"}, "predicate": "作者", "subject_type": "圖書作品"}
          {"object_type": {"inArea": "地點(diǎn)", "@value": "Date"}, "predicate": "上映時(shí)間", "subject_type": "影視作品"}
          {"object_type": {"@value": "人物"}, "predicate": "母親", "subject_type": "人物"}
          {"object_type": {"@value": "Text"}, "predicate": "專業(yè)代碼", "subject_type": "學(xué)科專業(yè)"}
          {"object_type": {"@value": "Number"}, "predicate": "占地面積", "subject_type": "機(jī)構(gòu)"}
          {"object_type": {"@value": "Text"}, "predicate": "郵政編碼", "subject_type": "行政區(qū)"}
          {"object_type": {"inArea": "地點(diǎn)", "@value": "Number"}, "predicate": "票房", "subject_type": "影視作品"}
          {"object_type": {"@value": "Number"}, "predicate": "注冊(cè)資本", "subject_type": "企業(yè)"}
          {"object_type": {"@value": "人物"}, "predicate": "主角", "subject_type": "文學(xué)作品"}
          {"object_type": {"@value": "人物"}, "predicate": "妻子", "subject_type": "人物"}
          {"object_type": {"@value": "人物"}, "predicate": "編劇", "subject_type": "影視作品"}
          {"object_type": {"@value": "氣候"}, "predicate": "氣候", "subject_type": "行政區(qū)"}
          {"object_type": {"@value": "人物"}, "predicate": "歌手", "subject_type": "歌曲"}
          {"object_type": {"inWork": "作品", "onDate": "Date", "@value": "獎(jiǎng)項(xiàng)", "period": "Number"}, "predicate": "獲獎(jiǎng)", "subject_type": "娛樂人物"}
          {"object_type": {"@value": "人物"}, "predicate": "校長(zhǎng)", "subject_type": "學(xué)校"}
          {"object_type": {"@value": "人物"}, "predicate": "創(chuàng)始人", "subject_type": "企業(yè)"}
          {"object_type": {"@value": "城市"}, "predicate": "首都", "subject_type": "國(guó)家"}
          {"object_type": {"@value": "人物"}, "predicate": "丈夫", "subject_type": "人物"}
          {"object_type": {"@value": "Text"}, "predicate": "朝代", "subject_type": "歷史人物"}
          {"object_type": {"inWork": "影視作品", "@value": "人物"}, "predicate": "飾演", "subject_type": "娛樂人物"}
          {"object_type": {"@value": "Number"}, "predicate": "面積", "subject_type": "行政區(qū)"}
          {"object_type": {"@value": "地點(diǎn)"}, "predicate": "總部地點(diǎn)", "subject_type": "企業(yè)"}
          {"object_type": {"@value": "地點(diǎn)"}, "predicate": "祖籍", "subject_type": "人物"}
          {"object_type": {"@value": "Number"}, "predicate": "人口數(shù)量", "subject_type": "行政區(qū)"}
          {"object_type": {"@value": "人物"}, "predicate": "制片人", "subject_type": "影視作品"}
          {"object_type": {"@value": "Number"}, "predicate": "修業(yè)年限", "subject_type": "學(xué)科專業(yè)"}
          {"object_type": {"@value": "城市"}, "predicate": "所在城市", "subject_type": "景點(diǎn)"}
          {"object_type": {"@value": "人物"}, "predicate": "董事長(zhǎng)", "subject_type": "企業(yè)"}
          {"object_type": {"@value": "人物"}, "predicate": "作詞", "subject_type": "歌曲"}
          {"object_type": {"@value": "作品"}, "predicate": "改編自", "subject_type": "影視作品"}
          {"object_type": {"@value": "企業(yè)"}, "predicate": "出品公司", "subject_type": "影視作品"}
          {"object_type": {"@value": "人物"}, "predicate": "導(dǎo)演", "subject_type": "影視作品"}
          {"object_type": {"@value": "人物"}, "predicate": "作曲", "subject_type": "歌曲"}
          {"object_type": {"@value": "人物"}, "predicate": "主演", "subject_type": "影視作品"}
          {"object_type": {"@value": "人物"}, "predicate": "主持人", "subject_type": "電視綜藝"}
          {"object_type": {"@value": "Date"}, "predicate": "成立日期", "subject_type": "機(jī)構(gòu)"}
          {"object_type": {"@value": "Text"}, "predicate": "簡(jiǎn)稱", "subject_type": "機(jī)構(gòu)"}
          {"object_type": {"@value": "Number"}, "predicate": "海拔", "subject_type": "地點(diǎn)"}
          {"object_type": {"@value": "Text"}, "predicate": "號(hào)", "subject_type": "歷史人物"}
          {"object_type": {"@value": "國(guó)家"}, "predicate": "國(guó)籍", "subject_type": "人物"}
          {"object_type": {"@value": "語(yǔ)言"}, "predicate": "官方語(yǔ)言", "subject_type": "國(guó)家"}

4.2 Data Loading

Dataset loading functions

import json

# Load the dataset: one JSON object per line, flattened into (s, p, o) triples
def load_data(filename):
    D = []
    with open(filename, 'r', encoding='utf8') as f:
        for l in f:
            l = json.loads(l)
            d = {'text': l['text'], 'spo_list': []}
            for spo in l['spo_list']:
                for k, v in spo['object'].items():
                    d['spo_list'].append(
                        (spo['subject'], spo['predicate'] + '_' + k, v)
                    )
            D.append(d)
    return D

# Read the schema and build the predicate <-> id mappings
def load_schema(schema_path):
    with open(schema_path, encoding='utf8') as f:
        id2predicate, predicate2id, n = {}, {}, 0
        predicate2type = {}
        for l in f:
            l = json.loads(l)
            predicate2type[l['predicate']] = (l['subject_type'], l['object_type'])
            for k, _ in sorted(l['object_type'].items()):
                key = l['predicate'] + '_' + k
                id2predicate[n] = key
                predicate2id[key] = n
                n += 1
    return id2predicate, predicate2id

Calling the functions

# step 2: load the datasets and the schema
train_data = load_data(config.path['train_path'])
valid_data = load_data(config.path['valid_path'])
id2predicate, predicate2id = load_schema(config.path['schema_path'])
          >>>
          train_data[0:1]:[
          {
          'text': '《步步驚心》改編自著名作家桐華的同名清穿小說(shuō)《甄嬛傳》改編自流瀲紫所著的同名小說(shuō)電視劇《何以 笙簫默》改編自顧漫同名小說(shuō)《花千骨》改編自fresh果果同名小說(shuō)《裸婚時(shí)代》是月影蘭析創(chuàng)作的一部情感小說(shuō)《瑯琊榜》是根據(jù)海宴 同名網(wǎng)絡(luò)小說(shuō)改編電視劇《宮鎖心玉》,又名《宮》《雪豹》,該劇改編自網(wǎng)絡(luò)小說(shuō)《特戰(zhàn)先驅(qū)》《我是特種兵》由紅遍網(wǎng)絡(luò)的小說(shuō)《最后一顆子彈留給我》改編電視劇《來(lái)不及說(shuō)我愛你》改編自匪我思存同名小說(shuō)《來(lái)不及說(shuō)我愛你》',
          'spo_list':
          [
          ('何以笙簫默', '作者_(dá)@value', '顧漫'),
          ('我是特種兵', '改編自_@value', '最后一顆子彈留給我'),
          ('步步驚心', '作者_(dá)@value', '桐華'),
          ('甄嬛 傳', '作者_(dá)@value', '流瀲紫'),
          ('花千骨', '作者_(dá)@value', 'fresh果果'),
          ('裸婚時(shí)代', '作者_(dá)@value', '月影蘭析'),
          ('瑯琊榜', '作者_(dá)@value', '海宴'),
          ('雪豹', '改編自_@value', '特戰(zhàn)先驅(qū)'),
          ('來(lái)不及說(shuō)我愛你', '改編自_@value', '來(lái)不及說(shuō)我愛你'),
          ('來(lái)不及說(shuō)我愛你', '作者_(dá)@value', '匪我思存')
          ]
          }
          ]

          id2predicate:{
          0: '畢業(yè)院校_@value', 1: '嘉賓_@value', 2: '配音_@value', 3: '配音_inWork',
          4: '主題曲_@value', 5: '代言人_@value', 6: '所屬專輯_@value', 7: '父親_@value',
          8: '作者_(dá)@value', 9: '上映時(shí)間_@value', 10: '上映時(shí)間_inArea',
          11: '母親_@value', 12: '專業(yè)代碼_@value', 13: '占地面積_@value',
          14: '郵政編碼_@value', 15: '票房_@value', 16: '票房_inArea',
          17: '注冊(cè)資本_@value', 18: '主角_@value', 19: '妻子_@value',
          20: '編劇_@value', 21: '氣候_@value', 22: '歌手_@value', 23: '獲獎(jiǎng)_@value',
          24: '獲獎(jiǎng)_inWork', 25: '獲獎(jiǎng)_onDate', 26: '獲獎(jiǎng)_period', 27: '校長(zhǎng)_@value',
          28: '創(chuàng)始人_@value', 29: '首都_@value', 30: '丈夫_@value', 31: '朝代_@value',
          32: '飾演_@value', 33: '飾演_inWork', 34: '面積_@value', 35: '總部地點(diǎn)_@value',
          36: '祖籍_@value', 37: '人口數(shù)量_@value', 38: '制片人_@value',
          39: '修業(yè)年限_@value', 40: '所在城市_@value', 41: '董事長(zhǎng)_@value',
          42: '作詞_@value', 43: '改編自_@value', 44: '出品公司_@value',
          45: '導(dǎo)演_@value', 46: '作曲_@value', 47: '主演_@value', 48: '主持人_@value',
          49: '成立日期_@value', 50: '簡(jiǎn)稱_@value', 51: '海拔_@value', 52: '號(hào)_@value',
          53: '國(guó)籍_@value', 54: '官方語(yǔ)言_@value'
          }

          predicate2id:{
          '畢業(yè)院校_@value': 0, '嘉賓_@value': 1, '配音_@value': 2, '配音_inWork': 3, '主題曲_@value': 4, '代言人_@value': 5, '所屬專輯_@value': 6, '父親_@value': 7, '作者_(dá)@value': 8, '上映時(shí)間_@value': 9, '上映時(shí)間_inArea': 10, '母親_@value': 11, '專業(yè)代碼_@value': 12, '占地面積_@value': 13, '郵政編碼_@value': 14, '票房_@value': 15, '票房_inArea': 16, '注冊(cè)資本_@value': 17, '主角_@value': 18, '妻子_@value': 19, '編劇_@value': 20, '氣候_@value': 21, '歌手_@value': 22, '獲獎(jiǎng)_@value': 23, '獲獎(jiǎng)_inWork': 24, '獲獎(jiǎng)_onDate': 25, '獲獎(jiǎng)_period': 26, '校長(zhǎng)_@value': 27, '創(chuàng)始人_@value': 28, '首都_@value': 29, '丈夫_@value': 30, '朝代_@value': 31, '飾演_@value': 32, '飾演_inWork': 33, '面積_@value': 34, '總部地點(diǎn)_@value': 35, '祖籍_@value': 36, '人口數(shù)量_@value': 37, '制片人_@value': 38, '修業(yè)年限_@value': 39, '所在城市_@value': 40, '董事長(zhǎng)_@value': 41, '作詞_@value': 42, '改編自_@value': 43, '出品公司_@value': 44, '導(dǎo)演_@value': 45, '作曲_@value': 46, '主演_@value': 47, '主持人_@value': 48, '成立日期_@value': 49, '簡(jiǎn)稱_@value': 50, '海拔_@value': 51, '號(hào)_@value': 52, '國(guó)籍_@value': 53, '官方語(yǔ)言_@value': 54
          }

4.3 Data Generator

Data generator definition

import numpy as np

class data_generator:
    """
    Data generator: encodes sentences and builds the subject/object tagging labels.
    """

    def __init__(self, data, batch_size=64, buffer_size=None):
        self.data = data
        self.batch_size = batch_size
        if hasattr(self.data, '__len__'):
            self.steps = len(self.data) // self.batch_size
            if len(self.data) % self.batch_size != 0:
                self.steps += 1
        else:
            self.steps = None
        self.buffer_size = buffer_size or batch_size * 1000

    def __len__(self):
        return self.steps

    def data_res(self, tokenizer, predicate2id, maxlen):
        batch_token_ids, batch_segment_ids, batch_attention_mask = [], [], []
        batch_subject_labels, batch_subject_ids, batch_object_labels = [], [], []
        # step 1: shuffle the sample indices
        indices = list(range(len(self.data)))
        np.random.shuffle(indices)
        # step 2: iterate over the data in shuffled order
        for i in indices:
            d = self.data[i]
            ## step 2.1: encode the sentence
            token = tokenizer.encode_plus(
                d['text'],
                max_length=maxlen,
                truncation=True
            )
            token_ids, segment_ids, attention_mask = token['input_ids'], token['token_type_ids'], token['attention_mask']

            # step 2.2: organize the triples as {s: [(o, p)]}
            spoes = {}
            for s, p, o in d['spo_list']:
                # step 2.2.1: encode s, p, o
                s = tokenizer.encode_plus(s)['input_ids'][1:-1]
                p = predicate2id[p]
                o = tokenizer.encode_plus(o)['input_ids'][1:-1]
                # step 2.2.2: find the sub-sequence in token_ids; returns the first index, or -1 if absent
                s_idx = search(s, token_ids)
                o_idx = search(o, token_ids)
                # step 2.2.3: if both s and o are found, record their start and end positions
                if s_idx != -1 and o_idx != -1:
                    s = (s_idx, s_idx + len(s) - 1)
                    o = (o_idx, o_idx + len(o) - 1, p)
                    if s not in spoes:
                        spoes[s] = []
                    spoes[s].append(o)

            # step 2.3: build the subject and object labels
            if spoes:
                # step 2.3.1: subject start/end labels
                subject_labels = np.zeros((len(token_ids), 2))
                for s in spoes:
                    subject_labels[s[0], 0] = 1
                    subject_labels[s[1], 1] = 1

                # step 2.3.2: randomly pick one subject
                start, end = np.array(list(spoes.keys())).T
                start = np.random.choice(start)
                end = np.random.choice(end[end >= start])
                subject_ids = (start, end)

                # step 2.3.3: object labels for the chosen subject
                object_labels = np.zeros((len(token_ids), len(predicate2id), 2))
                for o in spoes.get(subject_ids, []):
                    object_labels[o[0], o[2], 0] = 1
                    object_labels[o[1], o[2], 1] = 1

                # step 2.3.4: append the sample to the batch
                batch_token_ids.append(token_ids)
                batch_segment_ids.append(segment_ids)
                batch_subject_labels.append(subject_labels)
                batch_subject_ids.append(subject_ids)
                batch_object_labels.append(object_labels)
                batch_attention_mask.append(attention_mask)

        # step 3: pad the sequences
        batch_token_ids = sequence_padding(batch_token_ids)
        batch_segment_ids = sequence_padding(batch_segment_ids)
        batch_subject_labels = sequence_padding(
            batch_subject_labels, padding=np.zeros(2)
        )
        batch_subject_ids = np.array(batch_subject_ids)
        batch_object_labels = sequence_padding(
            batch_object_labels,
            padding=np.zeros((len(predicate2id), 2))
        )
        batch_attention_mask = sequence_padding(batch_attention_mask)
        return [
            batch_token_ids, batch_segment_ids,
            batch_subject_labels, batch_subject_ids,
            batch_object_labels, batch_attention_mask
        ]
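The generator calls two helpers, search and sequence_padding, that are not shown in the post (they follow the style of the bert4keras utilities). A minimal sketch of what they are assumed to do:

import numpy as np

def search(pattern, sequence):
    # return the first index at which the sub-list `pattern` occurs in `sequence`, or -1
    n = len(pattern)
    for i in range(len(sequence) - n + 1):
        if sequence[i:i + n] == pattern:
            return i
    return -1

def sequence_padding(inputs, length=None, padding=0):
    # pad a list of (possibly nested) sequences to the same length with `padding`
    if length is None:
        length = max(len(x) for x in inputs)
    outputs = []
    for x in inputs:
        x = list(x)[:length]
        x = x + [padding] * (length - len(x))
        outputs.append(np.array(x))
    return np.array(outputs)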

Calling the data generator

          dg = data_generator(train_data)
          dg_dev = data_generator(valid_data)
          T, S1, S2, K1, K2, M1 = dg.data_res(
          tokenizer, predicate2id, config.maxlen
          )
          >>>
          print(f"T[0:1]:{T[0:1]}")
          >>>
          T[0:1]:
          [
          [
          101 7032 2225 2209 8024 9120 118 8110 118 8149 1139 4495 754 4263
          2209 1065 6963 3377 3360 8024 6639 4413 3136 5298 102 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0
          ]
          ]
          print(f"S1[0:1]:{S1[0:1]}")
          >>>
          S1[0:1]:
          [
          [
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0
          ]
          ]
          print(f"S2[0:1]:{S2[0:1]}")
          >>>
          S2[0:1]:
          [
          [
          [0. 0.]
          [1. 0.]
          [0. 0.]
          [0. 1.]
          [0. 0.]
          [0. 0.]
          [0. 0.]
          [0. 0.]
          [0. 0.]
          [0. 0.]
          ...
          ]
          ]
          print(f"K1[0:1]:{K1[0:1]}")
          >>>
          K1[0:1]:[[1 3]]
          print(f"K2[0:1]:{K2[0:1]}")
          K2[0:1]:
          [
          [
          [
          [0. 0.]
          [0. 0.]
          [0. 0.]
          ...
          [0. 0.]
          [0. 0.]
          [0. 0.]
          ]
          [
          [0. 0.]
          [0. 0.]
          [0. 0.]
          ...
          [0. 0.]
          [0. 0.]
          [0. 0.]
          ]
          ...
          ]
          ]
          print(f"M1[0:1]:{M1[0:1]}")
          >>>
          M1[0:1]:
          [
          [
          1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0
          ]
          ]

4.4 Dataset Class and DataLoader

Dataset class definition

import torch.utils.data as Data

class Dataset(Data.Dataset):
    def __init__(self, _batch_token_ids, _batch_segment_ids, _batch_subject_labels,
                 _batch_subject_ids, _batch_obejct_labels, _batch_attention_mask):
        self.batch_token_data_ids = _batch_token_ids
        self.batch_segment_data_ids = _batch_segment_ids
        self.batch_subject_data_labels = _batch_subject_labels
        self.batch_subject_data_ids = _batch_subject_ids
        self.batch_object_data_labels = _batch_obejct_labels
        self.batch_attention_mask = _batch_attention_mask
        self.len = len(self.batch_token_data_ids)

    def __getitem__(self, index):
        return self.batch_token_data_ids[index], self.batch_segment_data_ids[index], \
               self.batch_subject_data_labels[index], self.batch_subject_data_ids[index], \
               self.batch_object_data_labels[index], self.batch_attention_mask[index]

    def __len__(self):
        return self.len

DataLoader definition

torch_dataset = Dataset(T, S1, S2, K1, K2, M1)
loader_train = Data.DataLoader(
    dataset=torch_dataset,           # torch Dataset format
    batch_size=config.batch_size,    # mini batch size
    shuffle=config.shuffle,          # random shuffle for training
    num_workers=config.num_workers,  # subprocesses for loading data
    collate_fn=collate_fn,           # see the sketch below
)
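The collate_fn passed to the DataLoader is not shown in the post. A minimal sketch of what it is assumed to do, with key names matching what the training loop below expects (an assumption, not the original implementation):

import numpy as np
import torch

def collate_fn(batch):
    # stack the already-padded numpy arrays returned by Dataset into batched tensors
    token_ids, segment_ids, sub_labels, sub_ids, obj_labels, attention_mask = zip(*batch)
    return {
        'batch_token_ids': torch.tensor(np.array(token_ids)).long(),
        'batch_segment_ids': torch.tensor(np.array(segment_ids)).long(),
        'batch_subject_labels': torch.tensor(np.array(sub_labels)),
        'batch_subject_ids': torch.tensor(np.array(sub_ids)).long(),
        'batch_object_labels': torch.tensor(np.array(obj_labels)),
        'batch_attention_mask': torch.tensor(np.array(attention_mask)),
    }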

4.5 Model Definition

model_name_or_path = config.path['model_path']
sub_model = REModel_sbuject_2.from_pretrained(model_name_or_path, num_labels=2, output_hidden_states=True)

if config.fp16 == True:
    sub_model.half()

4.6 Optimizer

Optimizer definition

from transformers import AdamW

# Build an AdamW optimizer; parameters whose names match no_decay get zero weight decay
def get_optimizer(sub_model, no_decay, learning_rate, adam_epsilon, weight_decay):
    param_optimizer = list(sub_model.named_parameters())  # (name, parameter) pairs
    optimizer_grouped_parameters = [
        # n is the parameter name, p the parameter tensor
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': weight_decay},
        # parameters listed in no_decay are excluded from weight decay
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
    return optimizer

Optimizer and learning-rate scheduler

from transformers import get_linear_schedule_with_warmup

optimizer = get_optimizer(
    sub_model, config.no_decay, config.learning_rate,
    config.adam_epsilon, config.weight_decay
)
train_steps = len(torch_dataset) // config.epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=config.warmup_steps,
    num_training_steps=train_steps
)

4.7 Training and Validation

Training function definition

import torch.nn as nn
from tqdm import tqdm

def train(
    sub_model, loader_train,
    device,
    optimizer, scheduler,
):
    sub_model.train()
    train_loss = 0.0
    for step, loader_res in tqdm(enumerate(loader_train)):
        # move the batch to the target device
        batch_token_ids = loader_res['batch_token_ids'].to(device)
        batch_segment_ids = loader_res['batch_segment_ids'].long().to(device)
        batch_subject_labels = loader_res['batch_subject_labels'].long().to(device)
        batch_subject_ids = loader_res['batch_subject_ids'].to(device)
        batch_object_labels = loader_res['batch_object_labels'].to(device)
        labels_start = batch_subject_labels[:, :, 0].to(device)
        labels_end = batch_subject_labels[:, :, 1].to(device)
        batch_attention_mask = loader_res['batch_attention_mask'].long().to(device)
        sub_out, obj_out = sub_model(
            input_ids=batch_token_ids,
            token_type_ids=batch_segment_ids,
            attention_mask=batch_attention_mask,
            labels=batch_subject_labels,
            subject_ids=batch_subject_ids,
            batch_size=batch_token_ids.size()[0],
            obj_labels=batch_object_labels,
            sub_train=True,
            obj_train=True
        )
        obj_loss, scores = obj_out[0:2]
        obj_loss.backward()
        # clip gradients after backward, before the optimizer step
        nn.utils.clip_grad_norm_(
            parameters=sub_model.parameters(),
            max_norm=1
        )
        train_loss += obj_loss.item()
        train_loss = round(train_loss, 4)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        if step % 200 == 0:
            print("loss", train_loss / (step + 1))

Validation function definition

import os

def dev(sub_model, valid_data, config):
    sub_model.eval()
    f1, precision, recall = evaluate(valid_data)  # see the evaluate sketch below
    if f1 > config.best_acc:
        print("Best F1", f1)
        print("Saving Model......")
        config.best_acc = f1
        # save only the learned parameters of the trained model
        model_to_save = sub_model.module if hasattr(sub_model, 'module') else sub_model
        output_model_file = os.path.join(config.output_dir, "pytorch_model.bin")
        torch.save(model_to_save.state_dict(), output_model_file)
    print(f1, precision, recall)
    f.write(str(epoch) + '\t' + str(f1) + '\t' + str(precision) + '\t' + str(recall) + '\n')
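The evaluate helper used above is not defined in the post. A minimal sketch, assuming a hypothetical extract_spoes(text) function that runs the two-stage decoding (subjects first, then relation-specific objects) and returns (subject, predicate, object) tuples:

def evaluate(data):
    # micro precision / recall / F1 of predicted triples against the gold spo_list
    X, Y, Z = 1e-10, 1e-10, 1e-10  # correct, predicted, gold counts
    for d in data:
        pred = set(extract_spoes(d['text']))  # hypothetical inference helper
        gold = set(d['spo_list'])
        X += len(pred & gold)
        Y += len(pred)
        Z += len(gold)
    precision, recall = X / Y, X / Z
    f1 = 2 * X / (Y + Z)
    return f1, precision, recall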

Running training and validation

for epoch in range(config.epochs):
    train(
        sub_model, loader_train,
        config.device,
        optimizer, scheduler,
    )

    dev(sub_model, valid_data, config)

5. Contributions

          1. We introduce a fresh perspective to revisit the relational triple extraction task with a principled problem formulation, which implies a general algorithmic framework that addresses the overlapping triple problem by design.

          2. We instantiate the above framework as a novel hierarchical binary tagging model on top of a Transformer encoder. This allows the model to combine the power of the novel tagging framework with the prior knowledge in pretrained large-scale language models.

          3. Extensive experiments on two public datasets show that the proposed framework overwhelmingly outperforms state-of-the-art methods, achieving 17.5 and 30.2 absolute gain in F1-score on the two datasets respectively. Detailed analyses show that our model gains consistent improvement in all scenarios.

Conclusion

In this paper, we introduce a novel hierarchical binary tagging (HBT) framework derived from a principled problem formulation for relational triple extraction. Instead of modeling relations as discrete labels of entity pairs, we model the relations as functions that map subjects to objects, which provides a fresh perspective to revisit the relational triple extraction task. As a consequence, our model can simultaneously extract multiple relational triples from sentences, without suffering from the overlapping problem. We conduct extensive experiments on two widely used datasets to validate the effectiveness of the proposed HBT framework. Experimental results show that our model overwhelmingly outperforms state-of-the-art baselines over different scenarios, especially on the extraction of overlapping relational triples.


References

          1. A Novel Hierarchical Binary Tagging Framework for Relational Triple Extraction

          2. 論文筆記:A Novel Cascade Binary Tagging Framework for Relational Triple Extraction

          3. 百度信息抽取Lic2020關(guān)系抽取

          4. bert4keras在手,baseline我有

          5. lic2020_baselines/ie

