
[Relation Extraction: HBT] The Things You Didn't Know

About 39,291 characters; roughly a 79-minute read

2021-05-18 00:22

Author: 楊夕

Project: https://github.com/km1994/nlp_paper_study

bert4keras version: https://github.com/bojone/lic2020_baselines

PyTorch version: https://github.com/powerycy/Lic2020-

Paper: A Novel Hierarchical Binary Tagging Framework for Relational Triple Extraction

[Note: the images in this post may not display when reading on a phone.]


Abstract

          Extracting relational triples from unstructured text is crucial for large-scale knowledge graph construction. However, few existing works excel in solving the overlapping triple problem where multiple relational triples in the same sentence share the same entities. In this work, we introduce a fresh perspective to revisit the relational triple extraction task and propose a novel Hierarchical Binary Tagging (HBT) framework derived from a principled problem formulation. Instead of treating relations as discrete labels as in previous works, our new framework models relations as functions that map subjects to objects in a sentence, which naturally handles the overlapping problem. Experiments show that the proposed framework already outperforms state-of-the-art methods even when its encoder module uses a randomly initialized BERT encoder, showing the power of the new tagging framework. It enjoys further performance boost when employing a pretrained BERT encoder, outperforming the strongest baseline by 17.5 and 30.2 absolute gain in F1-score on two public datasets NYT and WebNLG, respectively. In-depth analysis on different scenarios of overlapping triples shows that the method delivers consistent performance gain across all these scenarios.


1. Introduction

1.1 Background

Relational Triple Extraction (RTE), also called joint entity and relation extraction, is a classic task in information extraction. It aims to extract structured relational triples (Subject, Relation, Object) from text in order to build knowledge graphs.

In recent years, with steady progress in NLP, triple extraction in simple contexts (e.g., a sentence containing only one relational triple) already works fairly well. But in complex contexts (a sentence containing multiple relational triples, sometimes more than five), and especially when several triples overlap, many existing models start to fall short.

1.2 Previous Approaches

1.2.1 Pipeline approach

1.2.1.1 Idea

The pipeline approach splits joint entity-relation extraction into two tasks, entity extraction + relation classification:

1. Entity extraction: a named entity recognition model identifies all entities in the sentence;

2. Relation classification: a relation classification model classifies the relation for every entity pair. [This step can be viewed as a text classification task, except that relation classification must learn not only the sentence itself but also where the entity pair sits in the sentence.]

1.2.1.2 Problems
• Error propagation: because the joint task is split into entity extraction + relation classification, mistakes made during entity extraction cannot be corrected later, so this approach suffers from error propagation;

1.2.2 Feature-based and neural-network-based models

1.2.2.1 Idea

By replacing hand-crafted features with learned representations, neural-network-based models have achieved considerable success on relational triple extraction.

1.2.2.2 Problems
• Overlapping entities and relations: most existing methods cannot properly handle sentences that contain multiple relational triples overlapping with each other.

Figure 1 illustrates three relational-triple scenarios:
Normal. The triples do not overlap: the relation between (United States, Trump) is Country_president and the relation between (Tim Cook, Apple Inc) is Company_CEO; this is the simple case.
EPO (Entity Pair Overlap). Two (or more) triples share the same entity pair: the pair (Quentin Tarantino, Django Unchained) carries both the Act_in and Direct_movie relations.
SEO (Single Entity Overlap). Two (or more) triples share exactly one entity: (Jackie, Birth, Washington) and (Washington, Capital, United States) share the entity Washington.
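To make the three overlap patterns concrete, here is a small illustrative sketch in Python; the entity and relation names simply mirror the Figure 1 examples and are only for illustration:

# Normal: the two triples share no entity
normal = [("United States", "Country_president", "Trump"),
          ("Apple Inc", "Company_CEO", "Tim Cook")]

# EPO (Entity Pair Overlap): the same (subject, object) pair takes two relations
epo = [("Quentin Tarantino", "Act_in", "Django Unchained"),
       ("Quentin Tarantino", "Direct_movie", "Django Unchained")]

# SEO (Single Entity Overlap): the two triples share exactly one entity ("Washington")
seo = [("Jackie", "Birth_place", "Washington"),
       ("Washington", "Capital_of", "United States")]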

1.2.3 Seq2Seq-based models and GCNs

1.2.3.1 Idea

Zeng et al. were among the first to consider the overlapping-triple problem in relational triple extraction. They introduced the categories of overlapping patterns shown in Figure 1 and proposed a sequence-to-sequence (Seq2Seq) model with a copy mechanism for extracting triples. Building on that Seq2Seq model, they further studied the effect of extraction order and obtained a large improvement via reinforcement learning.

Fu et al. also studied the overlapping-triple problem by modeling text as a relational graph with a graph convolutional network (GCN) based model.

1.2.3.2 Problems
• Too many negative examples: many of the extracted entity pairs do not form a valid relation, producing far too many negative examples;

• EPO (Entity Pair Overlap): when the same entity participates in multiple relations, the classifier can get confused. Without enough training examples, it struggles to tell exactly which relations the entity takes part in;

2. What This Paper Does

• Approach: implement an HBT (Hierarchical Binary Tagging) framework for the RTE task that is not troubled by the overlapping-triple problem;

• Core idea: model a Relation as a function that maps a head entity (Subject) to a tail entity (Object), instead of treating it as a label on an entity pair.

Instead of learning a relation classifier f(s, o) → r, the paper learns relation-specific taggers f_r(s) → o; each tagger recognizes the possible object(s) of a given subject under a specific relation, or returns no object, indicating that the given subject and relation yield no triple.

• Pipeline (sketched in code below):

  • First, identify all possible subjects in the sentence;

  • Then, for each subject, apply the relation-specific taggers to simultaneously identify all possible relations and the corresponding objects.
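A minimal sketch of this two-step decoding; tag_subjects and tag_objects_for are hypothetical helpers standing in for the Subject Tagger and the relation-specific Object Taggers described in Section 3:

def extract_triples(sentence, relations):
    # HBT-style decoding: subjects first, then relation-specific objects
    triples = []
    subjects = tag_subjects(sentence)              # step 1: all candidate subjects
    for s in subjects:
        for r in relations:                        # step 2: every relation-specific tagger
            for o in tag_objects_for(sentence, s, r):
                triples.append((s, r, o))          # no object found => no triple for (s, r)
    return triples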

3. The HBT Architecture

3.1 BERT Encoder Layer

BERT is used as the encoder here; in effect it serves as the embedding layer.

• Code:

Description: input_ids, attention_mask, token_type_ids, position_ids, head_mask, and inputs_embeds are passed into the BERT model, and the last layer of BERT, i.e. the hidden states, is taken as the output.

import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class REModel_sbuject_2(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        ...

    def forward(
        self, input_ids=None, attention_mask=None,
        token_type_ids=None, position_ids=None,
        head_mask=None, inputs_embeds=None,
        labels=None, subject_ids=None,
        batch_size=None, obj_labels=None,
        sub_train=False, obj_train=False
    ):
        # step 1: run BERT and take its last hidden states as the token representations
        outputs_1 = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )
        sequence_output = outputs_1[0]
        sequence_output = self.dropout(sequence_output)
        ...

3.2 Hierarchical Decoder Layer

The Hierarchical Decoder layer has two parts:

1. Subject Tagger layer: extracts the Subjects;

2. Relation-Specific Object Taggers layer: a set of relation-specific object taggers (there are multiple taggers because there are multiple possible relations);

3.2.1 Subject Tagger Layer

• Goal: detect the start and end positions of each Subject

• Method: two identical binary classifiers detect the start and the end position of every Subject;

• How:

A sigmoid is applied to the feature vectors output by BERT to compute, for each token, the probability of being the start or the end of a subject. If the probability exceeds a preset threshold, the token is tagged 1; otherwise it is tagged 0.

Here x_i is the encoded representation of the i-th token, and p_i is the probability that the i-th token is the start or the end of a subject.

To learn good weights W and biases b, the subject tagger optimizes the following likelihood:
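A sketch of the referenced formulas, in the notation of the HBT paper (the original equation images are not reproduced here):

p_i^{start_s} = \sigma(W_{start} x_i + b_{start}),    p_i^{end_s} = \sigma(W_{end} x_i + b_{end})

p_\theta(s \mid x) = \prod_{t \in \{start_s, end_s\}} \prod_{i=1}^{L} (p_i^{t})^{\mathbf{1}\{y_i^{t}=1\}} (1 - p_i^{t})^{\mathbf{1}\{y_i^{t}=0\}}

where L is the sentence length, y_i^t is the binary start/end tag of token i, and σ is the sigmoid function.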

• Code:

import torch
import torch.nn as nn
from torch.nn import BCELoss

BertLayerNorm = torch.nn.LayerNorm

class REModel_sbuject_2(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.obj_labels = 110
        self.bert = BertModel(config)
        self.linear = nn.Linear(768, 768)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.obj_classifier = nn.Linear(config.hidden_size, self.obj_labels)
        self.sub_pos_emb = nn.Embedding(256, 768)
        self.relu = nn.ReLU()
        self.init_weights()

    def forward(
        self, input_ids=None, attention_mask=None,
        token_type_ids=None, position_ids=None,
        head_mask=None, inputs_embeds=None,
        labels=None, subject_ids=None,
        batch_size=None, obj_labels=None,
        sub_train=False, obj_train=False
    ):
        ...
        # step 2: Subject Tagger layer -- predict the subject start/end positions
        if sub_train == True:
            logits = self.classifier(sequence_output)
            outputs = (logits,)  # add hidden states and attention if they are here
            loss_sig = nn.Sigmoid()
            # only keep the active parts of the loss
            active_logits = logits.view(-1, self.num_labels)
            active_logits = loss_sig(active_logits)
            active_logits = active_logits ** 2
            if labels is not None:
                active_labels = labels.view(-1, self.num_labels).float()
                loss_fct = BCELoss(reduction='none')
                loss = loss_fct(active_logits, active_labels)
                loss = loss.view(batch_size, -1, 2)
                loss = torch.mean(loss, 2)
                # mask out padding positions before averaging the loss
                loss = torch.sum(attention_mask * loss) / torch.sum(attention_mask)
                outputs = (loss,) + outputs
            else:
                outputs = active_logits
        ...

3.2.2 Relation-specific Object Taggers Layer

• Goal: detect the start and end positions of each Object

• Method: as with the subject tagger, two identical binary classifiers detect the start and end positions of each Object. The Relation-specific Object Taggers layer, however, additionally fuses in the subject features from the previous step, combined with the BERT Encoder output, to predict the object's start and end positions under a given relation. Compared with the subject tagger, the probability computation gains an extra term v:

v is the averaged vector of the k-th subject predicted by the Subject Tagger, i.e. the mean of the token representations spanning that subject.

Averaging is done so that x_i and v have the same dimensionality.

For the tagger of each relation r, the likelihood to optimize (to learn good weights W and biases b) has exactly the same form on the right-hand side as before:
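In the HBT paper's notation, these are roughly (a sketch standing in for the missing equation images):

p_i^{start_o} = \sigma(W_{start}^{r}(x_i + v_{sub}^{k}) + b_{start}^{r}),    p_i^{end_o} = \sigma(W_{end}^{r}(x_i + v_{sub}^{k}) + b_{end}^{r})

v_{sub}^{k} = \frac{1}{|s_k|} \sum_{i \in s_k} x_i    (average of the token vectors of the k-th subject span)

p_{\phi_r}(o \mid s, x) = \prod_{t \in \{start_o, end_o\}} \prod_{i=1}^{L} (p_i^{t})^{\mathbf{1}\{y_i^{t}=1\}} (1 - p_i^{t})^{\mathbf{1}\{y_i^{t}=0\}}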

• Code:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import BCELoss

BertLayerNorm = torch.nn.LayerNorm
...

def merge_function(inputs):
    '''
    Average the given vectors by simple element-wise addition.
    '''
    output = inputs[0]
    for i in range(1, len(inputs)):
        output += inputs[i]
    return output / len(inputs)


def batch_gather(data: torch.Tensor, index: torch.Tensor):
    """
    Gather the vector representations from data at the given index.
    """
    index = index.unsqueeze(-1)
    index = index.expand(data.size()[0], index.size()[1], data.size()[2])
    return torch.gather(data, 1, index)


def extrac_subject_1(output, subject_ids):
    """
    Extract the start and end vector representations of the subject
    from output according to subject_ids.
    """
    start = batch_gather(output, subject_ids[:, :1])
    end = batch_gather(output, subject_ids[:, 1:])
    return start, end


class REModel_sbuject_2(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.obj_labels = 110
        self.bert = BertModel(config)
        self.linear = nn.Linear(768, 768)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.obj_classifier = nn.Linear(config.hidden_size, self.obj_labels)
        self.sub_pos_emb = nn.Embedding(256, 768)
        self.relu = nn.ReLU()
        self.init_weights()

    def forward(
        self, input_ids=None, attention_mask=None,
        token_type_ids=None, position_ids=None,
        head_mask=None, inputs_embeds=None,
        labels=None, subject_ids=None,
        batch_size=None, obj_labels=None,
        sub_train=False, obj_train=False
    ):
        ...
        # step 3: Relation-specific Object Taggers layer -- feed in the subject, predict objects
        if obj_train == True:
            ## step 3.1: with the subject start/end known, take hidden states from deeper layers
            ##           and use subject_ids to pull out the subject's boundary vectors
            hidden_states = outputs_1[2][-2]
            hidden_states_1 = outputs_1[2][-3]
            loss_sig = nn.Sigmoid()

            ## step 3.2: extract the subject's start and end representations from different hidden layers
            sub_pos_start = self.sub_pos_emb(subject_ids[:, :1]).to(device)
            sub_pos_end = self.sub_pos_emb(subject_ids[:, 1:]).to(device)
            subject_start_last, subject_end_last = extrac_subject_1(sequence_output, subject_ids)
            subject_start_1, subject_end_1 = extrac_subject_1(hidden_states_1, subject_ids)
            subject_start, subject_end = extrac_subject_1(hidden_states, subject_ids)

            # fuse subject position embeddings with span representations from several layers
            subject = (sub_pos_start + subject_start + sub_pos_end + subject_end +
                       subject_start_last + subject_start_1 + subject_end_1 + subject_end_1).to(device)

            ## step 3.3: fuse the subject into the object prediction via (conditional) layer normalization
            batch_token_ids_obj = torch.add(hidden_states, subject)
            batch_token_ids_obj = self.LayerNorm(batch_token_ids_obj)
            batch_token_ids_obj = self.dropout(batch_token_ids_obj)
            batch_token_ids_obj = self.relu(self.linear(batch_token_ids_obj))
            batch_token_ids_obj = self.dropout(batch_token_ids_obj)
            obj_logits = self.obj_classifier(batch_token_ids_obj)

            obj_logits = loss_sig(obj_logits)
            obj_logits = obj_logits ** 4
            obj_outputs = (obj_logits,)
            ## step 3.4: compute the Object/Relation loss
            if obj_labels is not None:
                loss_obj = BCELoss(reduction='none')
                obj_loss = loss_obj(obj_logits.view(batch_size, -1, self.obj_labels // 2, 2), obj_labels.float())
                obj_loss = torch.sum(torch.mean(obj_loss, 3), 2)
                # mask padding positions in the loss
                obj_loss = torch.sum(obj_loss * attention_mask) / torch.sum(attention_mask)
                s_o_loss = torch.add(obj_loss, loss)
                outputs_obj = (s_o_loss,) + obj_outputs
            else:
                outputs_obj = obj_logits.view(batch_size, -1, self.obj_labels // 2, 2)

        if obj_train == True:
            return outputs, outputs_obj  # (loss), scores, (hidden_states), (attentions)
        else:
            return outputs

3.3 Loss Function

1. Equation 1: for each sentence x_j in the training set D and the set T_j of triples that may appear in x_j, Equation 1 maximizes the data likelihood;

2. Equation 2: applying the chain rule turns Equation 1 into Equation 2;

The subscript on the right-hand side denotes the set of triples in T_j whose subject is s; the (r, o) pairs in that set are what the last factor is computed over.

3. Equation 3: for a given subject, the number of relations it participates in within a sentence is usually small, so only some relations map it to a corresponding object (the middle part of Equation 3), yielding valid triples.

Note: for relations the subject does not participate in, the paper introduces a "null object": in that case the function maps the subject to an empty tail entity (the right-hand part of Equation 3), meaning the subject does not take part in that relation and no valid triple can be extracted.

The loss function is the negative log of this likelihood:
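The three equations are not reproduced in the original post; a sketch of them, following the HBT paper's formulation:

max  \prod_{j=1}^{|D|} \Big[ \prod_{(s,r,o) \in T_j} p\big((s,r,o) \mid x_j\big) \Big]                                                                   (1)

   = \prod_{j=1}^{|D|} \Big[ \prod_{s \in T_j} p(s \mid x_j) \prod_{(r,o) \in T_j|s} p\big((r,o) \mid s, x_j\big) \Big]                                   (2)

   = \prod_{j=1}^{|D|} \Big[ \prod_{s \in T_j} p(s \mid x_j) \prod_{r \in T_j|s} p_r(o \mid s, x_j) \prod_{r \in R \setminus T_j|s} p_r(o_\varnothing \mid s, x_j) \Big]   (3)

Taking the negative log of this product gives the training loss, i.e. the sum of the binary cross-entropy terms computed by the subject tagger and the relation-specific object taggers in the code below.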

• Code:

class REModel_sbuject_2(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.obj_labels = 110
        self.bert = BertModel(config)
        self.linear = nn.Linear(768, 768)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.obj_classifier = nn.Linear(config.hidden_size, self.obj_labels)
        self.sub_pos_emb = nn.Embedding(256, 768)
        self.relu = nn.ReLU()
        self.init_weights()

    def forward(
        self, input_ids=None, attention_mask=None,
        token_type_ids=None, position_ids=None,
        head_mask=None, inputs_embeds=None,
        labels=None, subject_ids=None,
        batch_size=None, obj_labels=None,
        sub_train=False, obj_train=False
    ):
        ...
        # step 2: Subject Tagger layer -- predict the subject
        if sub_train == True:
            ...
            ## step 2.2: subject loss (binary cross-entropy over start/end tags, masked by attention_mask)
            if labels is not None:
                active_labels = labels.view(-1, self.num_labels).float()
                loss_fct = BCELoss(reduction='none')
                loss = loss_fct(active_logits, active_labels)
                loss = loss.view(batch_size, -1, 2)
                loss = torch.mean(loss, 2)
                loss = torch.sum(attention_mask * loss) / torch.sum(attention_mask)
                outputs = (loss,) + outputs
            else:
                outputs = active_logits

        # step 3: Relation-specific Object Taggers layer -- feed in the subject, predict objects
        if obj_train == True:
            ...
            ## step 3.4: Object/Relation loss, added to the subject loss
            if obj_labels is not None:
                loss_obj = BCELoss(reduction='none')
                obj_loss = loss_obj(obj_logits.view(batch_size, -1, self.obj_labels // 2, 2), obj_labels.float())
                obj_loss = torch.sum(torch.mean(obj_loss, 3), 2)
                # mask padding positions in the loss
                obj_loss = torch.sum(obj_loss * attention_mask) / torch.sum(attention_mask)
                s_o_loss = torch.add(obj_loss, loss)
                outputs_obj = (s_o_loss,) + obj_outputs
            else:
                outputs_obj = obj_logits.view(batch_size, -1, self.obj_labels // 2, 2)

        if obj_train == True:
            return outputs, outputs_obj  # (loss), scores, (hidden_states), (attentions)
        else:
            return outputs

4. Hands-on Practice

4.1 Dataset

The dataset comes from the Baidu LIC 2020 relation extraction competition and has the following format:

Training data

{
    "text": "《步步驚心》改編自著名作家桐華的同名清穿小說(shuō)《甄嬛傳》改編自流瀲紫所著的同名小說(shuō)電視劇《何以笙簫默》改編自顧漫同名小說(shuō)《花千骨》改編自fresh果果同名小說(shuō)《裸婚時(shí)代》是月影蘭析創(chuàng)作的一部情感小說(shuō)《瑯琊榜》是根據(jù)海宴同名網(wǎng)絡(luò)小說(shuō)改編電視劇《宮鎖心玉》,又名《宮》《雪豹》,該劇改編自網(wǎng)絡(luò)小說(shuō)《特戰(zhàn)先驅(qū)》《我是特種兵》由紅遍網(wǎng)絡(luò)的小說(shuō)《最后一顆子彈留給我》改編電視劇《來(lái)不及說(shuō)我愛你》改編自匪我思存同名小說(shuō)《來(lái)不及說(shuō)我愛你》",
    "spo_list": [
        {
            "predicate": "作者",
            "object_type": {"@value": "人物"},
            "subject_type": "圖書作品",
            "object": {"@value": "顧漫"},
            "subject": "何以笙簫默"
        },
        {
            "predicate": "改編自",
            "object_type": {"@value": "作品"},
            "subject_type": "影視作品",
            "object": {"@value": "最后一顆子彈留給我"},
            "subject": "我是特種兵"
        }, ...
    ]
}
...

Schema data

          {"object_type": {"@value": "學(xué)校"}, "predicate": "畢業(yè)院校", "subject_type": "人物"}
          {"object_type": {"@value": "人物"}, "predicate": "嘉賓", "subject_type": "電視綜藝"}
          {"object_type": {"inWork": "影視作品", "@value": "人物"}, "predicate": "配音", "subject_type": "娛樂人物"}
          {"object_type": {"@value": "歌曲"}, "predicate": "主題曲", "subject_type": "影視作品"}
          {"object_type": {"@value": "人物"}, "predicate": "代言人", "subject_type": "企業(yè)/品牌"}
          {"object_type": {"@value": "音樂專輯"}, "predicate": "所屬專輯", "subject_type": "歌曲"}
          {"object_type": {"@value": "人物"}, "predicate": "父親", "subject_type": "人物"}
          {"object_type": {"@value": "人物"}, "predicate": "作者", "subject_type": "圖書作品"}
          {"object_type": {"inArea": "地點(diǎn)", "@value": "Date"}, "predicate": "上映時(shí)間", "subject_type": "影視作品"}
          {"object_type": {"@value": "人物"}, "predicate": "母親", "subject_type": "人物"}
          {"object_type": {"@value": "Text"}, "predicate": "專業(yè)代碼", "subject_type": "學(xué)科專業(yè)"}
          {"object_type": {"@value": "Number"}, "predicate": "占地面積", "subject_type": "機(jī)構(gòu)"}
          {"object_type": {"@value": "Text"}, "predicate": "郵政編碼", "subject_type": "行政區(qū)"}
          {"object_type": {"inArea": "地點(diǎn)", "@value": "Number"}, "predicate": "票房", "subject_type": "影視作品"}
          {"object_type": {"@value": "Number"}, "predicate": "注冊(cè)資本", "subject_type": "企業(yè)"}
          {"object_type": {"@value": "人物"}, "predicate": "主角", "subject_type": "文學(xué)作品"}
          {"object_type": {"@value": "人物"}, "predicate": "妻子", "subject_type": "人物"}
          {"object_type": {"@value": "人物"}, "predicate": "編劇", "subject_type": "影視作品"}
          {"object_type": {"@value": "氣候"}, "predicate": "氣候", "subject_type": "行政區(qū)"}
          {"object_type": {"@value": "人物"}, "predicate": "歌手", "subject_type": "歌曲"}
          {"object_type": {"inWork": "作品", "onDate": "Date", "@value": "獎(jiǎng)項(xiàng)", "period": "Number"}, "predicate": "獲獎(jiǎng)", "subject_type": "娛樂人物"}
          {"object_type": {"@value": "人物"}, "predicate": "校長(zhǎng)", "subject_type": "學(xué)校"}
          {"object_type": {"@value": "人物"}, "predicate": "創(chuàng)始人", "subject_type": "企業(yè)"}
          {"object_type": {"@value": "城市"}, "predicate": "首都", "subject_type": "國(guó)家"}
          {"object_type": {"@value": "人物"}, "predicate": "丈夫", "subject_type": "人物"}
          {"object_type": {"@value": "Text"}, "predicate": "朝代", "subject_type": "歷史人物"}
          {"object_type": {"inWork": "影視作品", "@value": "人物"}, "predicate": "飾演", "subject_type": "娛樂人物"}
          {"object_type": {"@value": "Number"}, "predicate": "面積", "subject_type": "行政區(qū)"}
          {"object_type": {"@value": "地點(diǎn)"}, "predicate": "總部地點(diǎn)", "subject_type": "企業(yè)"}
          {"object_type": {"@value": "地點(diǎn)"}, "predicate": "祖籍", "subject_type": "人物"}
          {"object_type": {"@value": "Number"}, "predicate": "人口數(shù)量", "subject_type": "行政區(qū)"}
          {"object_type": {"@value": "人物"}, "predicate": "制片人", "subject_type": "影視作品"}
          {"object_type": {"@value": "Number"}, "predicate": "修業(yè)年限", "subject_type": "學(xué)科專業(yè)"}
          {"object_type": {"@value": "城市"}, "predicate": "所在城市", "subject_type": "景點(diǎn)"}
          {"object_type": {"@value": "人物"}, "predicate": "董事長(zhǎng)", "subject_type": "企業(yè)"}
          {"object_type": {"@value": "人物"}, "predicate": "作詞", "subject_type": "歌曲"}
          {"object_type": {"@value": "作品"}, "predicate": "改編自", "subject_type": "影視作品"}
          {"object_type": {"@value": "企業(yè)"}, "predicate": "出品公司", "subject_type": "影視作品"}
          {"object_type": {"@value": "人物"}, "predicate": "導(dǎo)演", "subject_type": "影視作品"}
          {"object_type": {"@value": "人物"}, "predicate": "作曲", "subject_type": "歌曲"}
          {"object_type": {"@value": "人物"}, "predicate": "主演", "subject_type": "影視作品"}
          {"object_type": {"@value": "人物"}, "predicate": "主持人", "subject_type": "電視綜藝"}
          {"object_type": {"@value": "Date"}, "predicate": "成立日期", "subject_type": "機(jī)構(gòu)"}
          {"object_type": {"@value": "Text"}, "predicate": "簡(jiǎn)稱", "subject_type": "機(jī)構(gòu)"}
          {"object_type": {"@value": "Number"}, "predicate": "海拔", "subject_type": "地點(diǎn)"}
          {"object_type": {"@value": "Text"}, "predicate": "號(hào)", "subject_type": "歷史人物"}
          {"object_type": {"@value": "國(guó)家"}, "predicate": "國(guó)籍", "subject_type": "人物"}
          {"object_type": {"@value": "語(yǔ)言"}, "predicate": "官方語(yǔ)言", "subject_type": "國(guó)家"}

4.2 Data Loading

Dataset loading functions

import json

# Load the dataset: one JSON object per line, flattened into (s, p, o) triples
def load_data(filename):
    D = []
    with open(filename, 'r', encoding='utf8') as f:
        for l in f:
            l = json.loads(l)
            d = {'text': l['text'], 'spo_list': []}
            for spo in l['spo_list']:
                for k, v in spo['object'].items():
                    d['spo_list'].append(
                        (spo['subject'], spo['predicate'] + '_' + k, v)
                    )
            D.append(d)
    return D

# Read the schema and build the predicate <-> id mappings
def load_schema(schema_path):
    with open(schema_path, encoding='utf8') as f:
        id2predicate, predicate2id, n = {}, {}, 0
        predicate2type = {}
        for l in f:
            l = json.loads(l)
            predicate2type[l['predicate']] = (l['subject_type'], l['object_type'])
            for k, _ in sorted(l['object_type'].items()):
                key = l['predicate'] + '_' + k
                id2predicate[n] = key
                predicate2id[key] = n
                n += 1
    return id2predicate, predicate2id

Calling the functions

# step 2: load the datasets and the schema
train_data = load_data(config.path['train_path'])
valid_data = load_data(config.path['valid_path'])
id2predicate, predicate2id = load_schema(config.path['schema_path'])
          >>>
          train_data[0:1]:[
          {
          'text': '《步步驚心》改編自著名作家桐華的同名清穿小說(shuō)《甄嬛傳》改編自流瀲紫所著的同名小說(shuō)電視劇《何以 笙簫默》改編自顧漫同名小說(shuō)《花千骨》改編自fresh果果同名小說(shuō)《裸婚時(shí)代》是月影蘭析創(chuàng)作的一部情感小說(shuō)《瑯琊榜》是根據(jù)海宴 同名網(wǎng)絡(luò)小說(shuō)改編電視劇《宮鎖心玉》,又名《宮》《雪豹》,該劇改編自網(wǎng)絡(luò)小說(shuō)《特戰(zhàn)先驅(qū)》《我是特種兵》由紅遍網(wǎng)絡(luò)的小說(shuō)《最后一顆子彈留給我》改編電視劇《來(lái)不及說(shuō)我愛你》改編自匪我思存同名小說(shuō)《來(lái)不及說(shuō)我愛你》',
          'spo_list':
          [
          ('何以笙簫默', '作者_(dá)@value', '顧漫'),
          ('我是特種兵', '改編自_@value', '最后一顆子彈留給我'),
          ('步步驚心', '作者_(dá)@value', '桐華'),
          ('甄嬛 傳', '作者_(dá)@value', '流瀲紫'),
          ('花千骨', '作者_(dá)@value', 'fresh果果'),
          ('裸婚時(shí)代', '作者_(dá)@value', '月影蘭析'),
          ('瑯琊榜', '作者_(dá)@value', '海宴'),
          ('雪豹', '改編自_@value', '特戰(zhàn)先驅(qū)'),
          ('來(lái)不及說(shuō)我愛你', '改編自_@value', '來(lái)不及說(shuō)我愛你'),
          ('來(lái)不及說(shuō)我愛你', '作者_(dá)@value', '匪我思存')
          ]
          }
          ]

          id2predicate:{
          0: '畢業(yè)院校_@value', 1: '嘉賓_@value', 2: '配音_@value', 3: '配音_inWork',
          4: '主題曲_@value', 5: '代言人_@value', 6: '所屬專輯_@value', 7: '父親_@value',
          8: '作者_(dá)@value', 9: '上映時(shí)間_@value', 10: '上映時(shí)間_inArea',
          11: '母親_@value', 12: '專業(yè)代碼_@value', 13: '占地面積_@value',
          14: '郵政編碼_@value', 15: '票房_@value', 16: '票房_inArea',
          17: '注冊(cè)資本_@value', 18: '主角_@value', 19: '妻子_@value',
          20: '編劇_@value', 21: '氣候_@value', 22: '歌手_@value', 23: '獲獎(jiǎng)_@value',
          24: '獲獎(jiǎng)_inWork', 25: '獲獎(jiǎng)_onDate', 26: '獲獎(jiǎng)_period', 27: '校長(zhǎng)_@value',
          28: '創(chuàng)始人_@value', 29: '首都_@value', 30: '丈夫_@value', 31: '朝代_@value',
          32: '飾演_@value', 33: '飾演_inWork', 34: '面積_@value', 35: '總部地點(diǎn)_@value',
          36: '祖籍_@value', 37: '人口數(shù)量_@value', 38: '制片人_@value',
          39: '修業(yè)年限_@value', 40: '所在城市_@value', 41: '董事長(zhǎng)_@value',
          42: '作詞_@value', 43: '改編自_@value', 44: '出品公司_@value',
          45: '導(dǎo)演_@value', 46: '作曲_@value', 47: '主演_@value', 48: '主持人_@value',
          49: '成立日期_@value', 50: '簡(jiǎn)稱_@value', 51: '海拔_@value', 52: '號(hào)_@value',
          53: '國(guó)籍_@value', 54: '官方語(yǔ)言_@value'
          }

          predicate2id:{
          '畢業(yè)院校_@value': 0, '嘉賓_@value': 1, '配音_@value': 2, '配音_inWork': 3, '主題曲_@value': 4, '代言人_@value': 5, '所屬專輯_@value': 6, '父親_@value': 7, '作者_(dá)@value': 8, '上映時(shí)間_@value': 9, '上映時(shí)間_inArea': 10, '母親_@value': 11, '專業(yè)代碼_@value': 12, '占地面積_@value': 13, '郵政編碼_@value': 14, '票房_@value': 15, '票房_inArea': 16, '注冊(cè)資本_@value': 17, '主角_@value': 18, '妻子_@value': 19, '編劇_@value': 20, '氣候_@value': 21, '歌手_@value': 22, '獲獎(jiǎng)_@value': 23, '獲獎(jiǎng)_inWork': 24, '獲獎(jiǎng)_onDate': 25, '獲獎(jiǎng)_period': 26, '校長(zhǎng)_@value': 27, '創(chuàng)始人_@value': 28, '首都_@value': 29, '丈夫_@value': 30, '朝代_@value': 31, '飾演_@value': 32, '飾演_inWork': 33, '面積_@value': 34, '總部地點(diǎn)_@value': 35, '祖籍_@value': 36, '人口數(shù)量_@value': 37, '制片人_@value': 38, '修業(yè)年限_@value': 39, '所在城市_@value': 40, '董事長(zhǎng)_@value': 41, '作詞_@value': 42, '改編自_@value': 43, '出品公司_@value': 44, '導(dǎo)演_@value': 45, '作曲_@value': 46, '主演_@value': 47, '主持人_@value': 48, '成立日期_@value': 49, '簡(jiǎn)稱_@value': 50, '海拔_@value': 51, '號(hào)_@value': 52, '國(guó)籍_@value': 53, '官方語(yǔ)言_@value': 54
          }

4.3 Data Generator

Data generator definition

import numpy as np

class data_generator:
    """
    Data generator: encodes sentences and builds the subject/object tagging labels.
    """

    def __init__(self, data, batch_size=64, buffer_size=None):
        self.data = data
        self.batch_size = batch_size
        if hasattr(self.data, '__len__'):
            self.steps = len(self.data) // self.batch_size
            if len(self.data) % self.batch_size != 0:
                self.steps += 1
        else:
            self.steps = None
        self.buffer_size = buffer_size or batch_size * 1000

    def __len__(self):
        return self.steps

    def data_res(self, tokenizer, predicate2id, maxlen):
        batch_token_ids, batch_segment_ids, batch_attention_mask = [], [], []
        batch_subject_labels, batch_subject_ids, batch_object_labels = [], [], []
        # step 1: shuffle the sample indices
        indices = list(range(len(self.data)))
        np.random.shuffle(indices)
        # step 2: iterate over the data in shuffled order
        for i in indices:
            d = self.data[i]
            ## step 2.1: encode the sentence
            token = tokenizer.encode_plus(
                d['text'],
                max_length=maxlen,
                truncation=True
            )
            token_ids, segment_ids, attention_mask = token['input_ids'], token['token_type_ids'], token['attention_mask']

            # step 2.2: organize the triples as {s: [(o, p)]}
            spoes = {}
            for s, p, o in d['spo_list']:
                # step 2.2.1: encode s, p, o
                s = tokenizer.encode_plus(s)['input_ids'][1:-1]
                p = predicate2id[p]
                o = tokenizer.encode_plus(o)['input_ids'][1:-1]
                # step 2.2.2: find the sub-sequence in token_ids; returns the first index, or -1 if absent
                s_idx = search(s, token_ids)
                o_idx = search(o, token_ids)
                # step 2.2.3: if both s and o are found, record their start and end positions
                if s_idx != -1 and o_idx != -1:
                    s = (s_idx, s_idx + len(s) - 1)
                    o = (o_idx, o_idx + len(o) - 1, p)
                    if s not in spoes:
                        spoes[s] = []
                    spoes[s].append(o)

            # step 2.3: build the subject and object labels
            if spoes:
                # step 2.3.1: subject start/end labels
                subject_labels = np.zeros((len(token_ids), 2))
                for s in spoes:
                    subject_labels[s[0], 0] = 1
                    subject_labels[s[1], 1] = 1

                # step 2.3.2: randomly pick one subject
                start, end = np.array(list(spoes.keys())).T
                start = np.random.choice(start)
                end = np.random.choice(end[end >= start])
                subject_ids = (start, end)

                # step 2.3.3: object labels for the chosen subject
                object_labels = np.zeros((len(token_ids), len(predicate2id), 2))
                for o in spoes.get(subject_ids, []):
                    object_labels[o[0], o[2], 0] = 1
                    object_labels[o[1], o[2], 1] = 1

                # step 2.3.4: append the sample to the batch
                batch_token_ids.append(token_ids)
                batch_segment_ids.append(segment_ids)
                batch_subject_labels.append(subject_labels)
                batch_subject_ids.append(subject_ids)
                batch_object_labels.append(object_labels)
                batch_attention_mask.append(attention_mask)

        # step 3: pad the sequences
        batch_token_ids = sequence_padding(batch_token_ids)
        batch_segment_ids = sequence_padding(batch_segment_ids)
        batch_subject_labels = sequence_padding(
            batch_subject_labels, padding=np.zeros(2)
        )
        batch_subject_ids = np.array(batch_subject_ids)
        batch_object_labels = sequence_padding(
            batch_object_labels,
            padding=np.zeros((len(predicate2id), 2))
        )
        batch_attention_mask = sequence_padding(batch_attention_mask)
        return [
            batch_token_ids, batch_segment_ids,
            batch_subject_labels, batch_subject_ids,
            batch_object_labels, batch_attention_mask
        ]
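The generator calls two helpers, search and sequence_padding, that are not shown in the post (they follow the style of the bert4keras utilities). A minimal sketch of what they are assumed to do:

import numpy as np

def search(pattern, sequence):
    # return the first index at which the sub-list `pattern` occurs in `sequence`, or -1
    n = len(pattern)
    for i in range(len(sequence) - n + 1):
        if sequence[i:i + n] == pattern:
            return i
    return -1

def sequence_padding(inputs, length=None, padding=0):
    # pad a list of (possibly nested) sequences to the same length with `padding`
    if length is None:
        length = max(len(x) for x in inputs)
    outputs = []
    for x in inputs:
        x = list(x)[:length]
        x = x + [padding] * (length - len(x))
        outputs.append(np.array(x))
    return np.array(outputs)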

Calling the data generator

          dg = data_generator(train_data)
          dg_dev = data_generator(valid_data)
          T, S1, S2, K1, K2, M1 = dg.data_res(
          tokenizer, predicate2id, config.maxlen
          )
          >>>
          print(f"T[0:1]:{T[0:1]}")
          >>>
          T[0:1]:
          [
          [
          101 7032 2225 2209 8024 9120 118 8110 118 8149 1139 4495 754 4263
          2209 1065 6963 3377 3360 8024 6639 4413 3136 5298 102 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0
          ]
          ]
          print(f"S1[0:1]:{S1[0:1]}")
          >>>
          S1[0:1]:
          [
          [
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0
          ]
          ]
          print(f"S2[0:1]:{S2[0:1]}")
          >>>
          S2[0:1]:
          [
          [
          [0. 0.]
          [1. 0.]
          [0. 0.]
          [0. 1.]
          [0. 0.]
          [0. 0.]
          [0. 0.]
          [0. 0.]
          [0. 0.]
          [0. 0.]
          ...
          ]
          ]
          print(f"K1[0:1]:{K1[0:1]}")
          >>>
          K1[0:1]:[[1 3]]
          print(f"K2[0:1]:{K2[0:1]}")
          K2[0:1]:
          [
          [
          [
          [0. 0.]
          [0. 0.]
          [0. 0.]
          ...
          [0. 0.]
          [0. 0.]
          [0. 0.]
          ]
          [
          [0. 0.]
          [0. 0.]
          [0. 0.]
          ...
          [0. 0.]
          [0. 0.]
          [0. 0.]
          ]
          ...
          ]
          ]
          print(f"M1[0:1]:{M1[0:1]}")
          >>>
          M1[0:1]:
          [
          [
          1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          0 0 0 0
          ]
          ]

4.4 Dataset Class and DataLoader

Dataset class definition

import torch.utils.data as Data

class Dataset(Data.Dataset):
    def __init__(self, _batch_token_ids, _batch_segment_ids, _batch_subject_labels,
                 _batch_subject_ids, _batch_obejct_labels, _batch_attention_mask):
        self.batch_token_data_ids = _batch_token_ids
        self.batch_segment_data_ids = _batch_segment_ids
        self.batch_subject_data_labels = _batch_subject_labels
        self.batch_subject_data_ids = _batch_subject_ids
        self.batch_object_data_labels = _batch_obejct_labels
        self.batch_attention_mask = _batch_attention_mask
        self.len = len(self.batch_token_data_ids)

    def __getitem__(self, index):
        return self.batch_token_data_ids[index], self.batch_segment_data_ids[index], \
               self.batch_subject_data_labels[index], self.batch_subject_data_ids[index], \
               self.batch_object_data_labels[index], self.batch_attention_mask[index]

    def __len__(self):
        return self.len

DataLoader definition

torch_dataset = Dataset(T, S1, S2, K1, K2, M1)
loader_train = Data.DataLoader(
    dataset=torch_dataset,           # torch Dataset format
    batch_size=config.batch_size,    # mini batch size
    shuffle=config.shuffle,          # random shuffle for training
    num_workers=config.num_workers,  # subprocesses for loading data
    collate_fn=collate_fn,           # see the sketch below
)
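The collate_fn passed to the DataLoader is not shown in the post. A minimal sketch of what it is assumed to do, with key names matching what the training loop below expects (an assumption, not the original implementation):

import numpy as np
import torch

def collate_fn(batch):
    # stack the already-padded numpy arrays returned by Dataset into batched tensors
    token_ids, segment_ids, sub_labels, sub_ids, obj_labels, attention_mask = zip(*batch)
    return {
        'batch_token_ids': torch.tensor(np.array(token_ids)).long(),
        'batch_segment_ids': torch.tensor(np.array(segment_ids)).long(),
        'batch_subject_labels': torch.tensor(np.array(sub_labels)),
        'batch_subject_ids': torch.tensor(np.array(sub_ids)).long(),
        'batch_object_labels': torch.tensor(np.array(obj_labels)),
        'batch_attention_mask': torch.tensor(np.array(attention_mask)),
    }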

4.5 Model Definition

model_name_or_path = config.path['model_path']
sub_model = REModel_sbuject_2.from_pretrained(model_name_or_path, num_labels=2, output_hidden_states=True)

if config.fp16 == True:
    sub_model.half()

4.6 Optimizer

Optimizer definition

from transformers import AdamW

# Build an AdamW optimizer; parameters whose names match no_decay get zero weight decay
def get_optimizer(sub_model, no_decay, learning_rate, adam_epsilon, weight_decay):
    param_optimizer = list(sub_model.named_parameters())  # (name, parameter) pairs
    optimizer_grouped_parameters = [
        # n is the parameter name, p the parameter tensor
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': weight_decay},
        # parameters listed in no_decay are excluded from weight decay
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
    return optimizer

Optimizer and learning-rate scheduler

from transformers import get_linear_schedule_with_warmup

optimizer = get_optimizer(
    sub_model, config.no_decay, config.learning_rate,
    config.adam_epsilon, config.weight_decay
)
train_steps = len(torch_dataset) // config.epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=config.warmup_steps,
    num_training_steps=train_steps
)

4.7 Training and Validation

Training function definition

import torch.nn as nn
from tqdm import tqdm

def train(
    sub_model, loader_train,
    device,
    optimizer, scheduler,
):
    sub_model.train()
    train_loss = 0.0
    for step, loader_res in tqdm(enumerate(loader_train)):
        # move the batch to the target device
        batch_token_ids = loader_res['batch_token_ids'].to(device)
        batch_segment_ids = loader_res['batch_segment_ids'].long().to(device)
        batch_subject_labels = loader_res['batch_subject_labels'].long().to(device)
        batch_subject_ids = loader_res['batch_subject_ids'].to(device)
        batch_object_labels = loader_res['batch_object_labels'].to(device)
        labels_start = batch_subject_labels[:, :, 0].to(device)
        labels_end = batch_subject_labels[:, :, 1].to(device)
        batch_attention_mask = loader_res['batch_attention_mask'].long().to(device)
        sub_out, obj_out = sub_model(
            input_ids=batch_token_ids,
            token_type_ids=batch_segment_ids,
            attention_mask=batch_attention_mask,
            labels=batch_subject_labels,
            subject_ids=batch_subject_ids,
            batch_size=batch_token_ids.size()[0],
            obj_labels=batch_object_labels,
            sub_train=True,
            obj_train=True
        )
        obj_loss, scores = obj_out[0:2]
        obj_loss.backward()
        # clip gradients after backward, before the optimizer step
        nn.utils.clip_grad_norm_(
            parameters=sub_model.parameters(),
            max_norm=1
        )
        train_loss += obj_loss.item()
        train_loss = round(train_loss, 4)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        if step % 200 == 0:
            print("loss", train_loss / (step + 1))

Validation function definition

import os

def dev(sub_model, valid_data, config):
    sub_model.eval()
    f1, precision, recall = evaluate(valid_data)  # see the evaluate sketch below
    if f1 > config.best_acc:
        print("Best F1", f1)
        print("Saving Model......")
        config.best_acc = f1
        # save only the learned parameters of the trained model
        model_to_save = sub_model.module if hasattr(sub_model, 'module') else sub_model
        output_model_file = os.path.join(config.output_dir, "pytorch_model.bin")
        torch.save(model_to_save.state_dict(), output_model_file)
    print(f1, precision, recall)
    f.write(str(epoch) + '\t' + str(f1) + '\t' + str(precision) + '\t' + str(recall) + '\n')
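The evaluate helper used above is not defined in the post. A minimal sketch, assuming a hypothetical extract_spoes(text) function that runs the two-stage decoding (subjects first, then relation-specific objects) and returns (subject, predicate, object) tuples:

def evaluate(data):
    # micro precision / recall / F1 of predicted triples against the gold spo_list
    X, Y, Z = 1e-10, 1e-10, 1e-10  # correct, predicted, gold counts
    for d in data:
        pred = set(extract_spoes(d['text']))  # hypothetical inference helper
        gold = set(d['spo_list'])
        X += len(pred & gold)
        Y += len(pred)
        Z += len(gold)
    precision, recall = X / Y, X / Z
    f1 = 2 * X / (Y + Z)
    return f1, precision, recall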

Running training and validation

for epoch in range(config.epochs):
    train(
        sub_model, loader_train,
        config.device,
        optimizer, scheduler,
    )

    dev(sub_model, valid_data, config)

5. Contributions

          1. We introduce a fresh perspective to revisit the relational triple extraction task with a principled problem formulation, which implies a general algorithmic framework that addresses the overlapping triple problem by design.

          2. We instantiate the above framework as a novel hierarchical binary tagging model on top of a Transformer encoder. This allows the model to combine the power of the novel tagging framework with the prior knowledge in pretrained large-scale language models.

          3. Extensive experiments on two public datasets show that the proposed framework overwhelmingly outperforms state-of-the-art methods, achieving 17.5 and 30.2 absolute gain in F1-score on the two datasets respectively. Detailed analyses show that our model gains consistent improvement in all scenarios.

Conclusion

In this paper, we introduce a novel hierarchical binary tagging (HBT) framework derived from a principled problem formulation for relational triple extraction. Instead of modeling relations as discrete labels of entity pairs, we model the relations as functions that map subjects to objects, which provides a fresh perspective to revisit the relational triple extraction task. As a consequence, our model can simultaneously extract multiple relational triples from sentences, without suffering from the overlapping problem. We conduct extensive experiments on two widely used datasets to validate the effectiveness of the proposed HBT framework. Experimental results show that our model overwhelmingly outperforms state-of-the-art baselines over different scenarios, especially on the extraction of overlapping relational triples.


References

          1. A Novel Hierarchical Binary Tagging Framework for Relational Triple Extraction

          2. 論文筆記:A Novel Cascade Binary Tagging Framework for Relational Triple Extraction

          3. 百度信息抽取Lic2020關(guān)系抽取

          4. bert4keras在手,baseline我有

          5. lic2020_baselines/ie

