
[About BERT to TextCNN] The Things You Didn't Know


          2021-04-21 11:39


Author: 楊夕

Paper: Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

Paper link: https://arxiv.org/abs/1903.12136

Project link: https://github.com/km1994/nlp_paper_study


1. Motivation

• With BERT bursting onto the scene, are the previous generation of shallower networks for language understanding (RNNs, CNNs, etc.) now obsolete?

• The BERT model is very large, and inference with it is slow.

• Can the knowledge in BERT (a state-of-the-art language representation model) be distilled into a single-layer BiLSTM or a TextCNN?

2. Approach of the Paper

1. Pick the teacher model (BERT) and the student models (TextCNN, TextRNN);

2. Distillation then involves two parts:

  1. First, add a logits-regression term to the objective function;

  2. Second, build a transfer dataset, which enlarges the training set so that knowledge can be transferred more effectively.

3. Model Framework (Using Single-Sentence Classification as an Example)

3.1 Fine-Tuning the Teacher Model (BERT)

1. Building the BERT model

Build the BERT model, then feed its pooled sentence representation through a dropout and a dense layer to obtain the logits; softmax / cross-entropy is applied on top during training.

• Code:

# Assumed imports (the original project may pin a different BERT library/version);
# this follows the pre-4.x transformers style, where the model returns tuples.
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from transformers import BertModel, BertPreTrainedModel


class BertClassification(BertPreTrainedModel):
    def __init__(self, config, num_labels=2):
        super(BertClassification, self).__init__(config)
        self.num_labels = num_labels
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)
        self.init_weights()

    def forward(self, input_ids, input_mask, label_ids):
        # pass the mask as attention_mask explicitly (the original positional call
        # assumed the older pytorch_pretrained_bert argument order)
        _, pooled_output = self.bert(input_ids, attention_mask=input_mask)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        if label_ids is not None:
            loss_fct = CrossEntropyLoss()
            return loss_fct(logits.view(-1, self.num_labels), label_ids.view(-1))
        return logits
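
The fine-tuning script in the next step also refers to a BertTextCNN teacher variant (a BERT encoder with a TextCNN head) that is never defined in this post. Below is a minimal sketch of what such a class could look like, reusing the imports above and the same tuple-returning BERT interface; the project's actual implementation may differ.

class BertTextCNN(BertPreTrainedModel):
    def __init__(self, config, hidden_size=128, num_labels=2):
        super(BertTextCNN, self).__init__(config)
        self.num_labels = num_labels
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # TextCNN head over the token-level BERT outputs (3/4/5-gram windows)
        self.conv1 = nn.Conv2d(1, hidden_size, (3, config.hidden_size))
        self.conv2 = nn.Conv2d(1, hidden_size, (4, config.hidden_size))
        self.conv3 = nn.Conv2d(1, hidden_size, (5, config.hidden_size))
        self.classifier = nn.Linear(hidden_size * 3, num_labels)
        self.init_weights()

    def forward(self, input_ids, input_mask, label_ids):
        sequence_output, _ = self.bert(input_ids, attention_mask=input_mask)
        out = self.dropout(sequence_output).unsqueeze(1)      # [B, 1, L, H]
        pooled = []
        for conv in (self.conv1, self.conv2, self.conv3):
            c = torch.relu(conv(out).squeeze(3))              # [B, hidden_size, L-k+1]
            pooled.append(torch.max_pool1d(c, c.size(2)).squeeze(2))
        logits = self.classifier(self.dropout(torch.cat(pooled, 1)))
        if label_ids is not None:
            loss_fct = CrossEntropyLoss()
            return loss_fct(logits.view(-1, self.num_labels), label_ids.view(-1))
        return logits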

2. Fine-tuning the BERT model

• Code:

import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from tqdm import tqdm, trange
from transformers import AdamW, BertTokenizer

# Processor, convert_examples_to_features and compute_metrics are data/metric
# helpers from the project repository linked above.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def main(model_type='bert', bert_model='bert-base-chinese', cache_dir=None,
         max_seq=128, batch_size=16, num_epochs=10, lr=2e-5):
    processor = Processor()
    train_examples = processor.get_train_examples('data/hotel')
    label_list = processor.get_labels()
    tokenizer = BertTokenizer.from_pretrained(bert_model, do_lower_case=True)
    if model_type == 'bert':
        model = BertClassification.from_pretrained(
            bert_model, cache_dir=cache_dir, num_labels=len(label_list))
    else:
        model = BertTextCNN.from_pretrained(
            bert_model, cache_dir=cache_dir, num_labels=len(label_list))
    model.to(device)
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer
                    if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer
                    if any(nd in n for nd in no_decay)], 'weight_decay': 0.00}]
    print('train...')
    num_train_steps = int(len(train_examples) / batch_size * num_epochs)
    optimizer = AdamW(optimizer_grouped_parameters, lr=lr)
    train_features = convert_examples_to_features(train_examples, label_list, max_seq, tokenizer)
    all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
    train_data = TensorDataset(all_input_ids, all_input_mask, all_label_ids)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
    model.train()
    for _ in trange(num_epochs, desc='Epoch'):
        tr_loss = 0
        for step, batch in enumerate(tqdm(train_dataloader, desc='Iteration')):
            input_ids, input_mask, label_ids = tuple(t.to(device) for t in batch)
            loss = model(input_ids, input_mask, label_ids)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            tr_loss += loss.item()
        print('tr_loss', tr_loss)
    print('eval...')
    eval_examples = processor.get_dev_examples('data/hotel')
    eval_features = convert_examples_to_features(eval_examples, label_list, max_seq, tokenizer)
    eval_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
    eval_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
    eval_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
    eval_data = TensorDataset(eval_input_ids, eval_input_mask, eval_label_ids)
    eval_sampler = SequentialSampler(eval_data)
    eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=batch_size)
    model.eval()
    preds = []
    for batch in tqdm(eval_dataloader, desc='Evaluating'):
        input_ids, input_mask, label_ids = tuple(t.to(device) for t in batch)
        with torch.no_grad():
            logits = model(input_ids, input_mask, None)
        preds.append(logits.detach().cpu().numpy())
    preds = np.argmax(np.vstack(preds), axis=1)
    print(compute_metrics(preds, eval_label_ids.numpy()))
    torch.save(model, f'data/cache/{model_type}_model')
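
A minimal entry point for the script above; with model_type='bert' the fine-tuned teacher is saved to data/cache/bert_model, which is the file the Teacher class in section 3.3 loads:

if __name__ == '__main__':
    main(model_type='bert')  # fine-tune and cache the BERT teacher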

3.2 Building the Student Models (TextCNN, TextRNN)

3.2.1 Building the TextRNN Model

• Model structure: sentence -> embedding layer -> word embeddings -> BiLSTM -> hidden states (both directions) -> dense -> ReLU -> dense -> logits -> softmax


[Figure] The BiLSTM model for single-sentence classification: (a) input embeddings, (b) BiLSTM, (c, d) backward and forward hidden states, (e, g) fully connected layers, (e) with ReLU, (f) hidden representation, (h) logit outputs, (i) softmax activation, (j) final probabilities.

• Code:

import torch
import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, x_dim, e_dim, h_dim, o_dim):
        super(RNN, self).__init__()
        self.h_dim = h_dim
        self.dropout = nn.Dropout(0.2)
        self.emb = nn.Embedding(x_dim, e_dim, padding_idx=0)
        self.lstm = nn.LSTM(e_dim, h_dim, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(h_dim * 2, o_dim)
        self.softmax = nn.Softmax(dim=1)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, x, lens):
        embed = self.dropout(self.emb(x))
        out, _ = self.lstm(embed)
        # classify from the BiLSTM output at the last time step
        hidden = self.fc(out[:, -1, :])
        # probabilities feed the MSE (distillation) term, log-probabilities feed NLLLoss
        return self.softmax(hidden), self.log_softmax(hidden)

3.2.2 Building the TextCNN Model

• Model structure: sentence -> embedding layer -> word embeddings -> dropout -> conv -> ReLU -> max_pool1d -> concat -> dropout -> dense -> logits -> softmax

• Code:

class CNN(nn.Module):
    def __init__(self, x_dim, e_dim, h_dim, o_dim):
        super(CNN, self).__init__()
        self.emb = nn.Embedding(x_dim, e_dim, padding_idx=0)
        self.dropout = nn.Dropout(0.2)
        self.conv1 = nn.Conv2d(1, h_dim, (3, e_dim))
        self.conv2 = nn.Conv2d(1, h_dim, (4, e_dim))
        self.conv3 = nn.Conv2d(1, h_dim, (5, e_dim))
        self.fc = nn.Linear(h_dim * 3, o_dim)
        self.softmax = nn.Softmax(dim=1)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, x, lens):
        embed = self.dropout(self.emb(x)).unsqueeze(1)
        c1 = torch.relu(self.conv1(embed).squeeze(3))
        p1 = torch.max_pool1d(c1, c1.size()[2]).squeeze(2)
        c2 = torch.relu(self.conv2(embed).squeeze(3))
        p2 = torch.max_pool1d(c2, c2.size()[2]).squeeze(2)
        c3 = torch.relu(self.conv3(embed).squeeze(3))
        p3 = torch.max_pool1d(c3, c3.size()[2]).squeeze(2)
        pool = self.dropout(torch.cat((p1, p2, p3), 1))
        hidden = self.fc(pool)
        return self.softmax(hidden), self.log_softmax(hidden)
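
A quick smoke test of the two student models, using the same layer sizes as the distillation script in section 3.3; the vocabulary size, batch size and sequence length below are purely illustrative:

v_size, batch, seq_len = 5000, 4, 50
x = torch.randint(1, v_size, (batch, seq_len))   # a batch of token ids
lens = torch.tensor([seq_len] * batch)           # lengths (not used by these forward passes)
for student in (RNN(v_size, 256, 256, 2), CNN(v_size, 256, 128, 2)):
    probs, log_probs = student(x, lens)
    print(type(student).__name__, probs.shape, log_probs.shape)  # both torch.Size([4, 2])

Both models return a (softmax, log_softmax) pair: the probabilities are compared against the teacher's soft labels with MSE, while the log-probabilities go into NLLLoss on the hard labels.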

3.3 Distillation Objective

Sections 3.1 and 3.2 introduced the teacher model (BERT) and the student models (TextCNN, TextRNN). The question now is: how do we transfer the teacher's knowledge into the student?

In the paper, the teacher's logits serve as the distillation target for the student. The training objective is a weighted combination of the usual cross-entropy on the hard labels and a regression toward the teacher's output: L = α · L_CE + (1 − α) · L_distill, where the paper defines L_distill as the mean squared error between the teacher's and the student's logits (the code below applies the MSE to the two models' softmax outputs instead).

1. The teacher model's output (softmax over its logits)

import torch
import torch.nn.functional as F
from transformers import BertTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


class Teacher(object):
    def __init__(self, bert_model='bert-base-chinese', max_seq=128):
        self.max_seq = max_seq
        self.tokenizer = BertTokenizer.from_pretrained(bert_model, do_lower_case=True)
        # load the fine-tuned teacher saved by the fine-tuning script in section 3.1
        self.model = torch.load('data/cache/bert_model')
        self.model.eval()

    def predict(self, text):
        tokens = self.tokenizer.tokenize(text)[:self.max_seq]
        input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        input_mask = [1] * len(input_ids)
        padding = [0] * (self.max_seq - len(input_ids))
        input_ids = torch.tensor([input_ids + padding], dtype=torch.long).to(device)
        input_mask = torch.tensor([input_mask + padding], dtype=torch.long).to(device)
        logits = self.model(input_ids, input_mask, None)
        # soft labels: class probabilities from the fine-tuned BERT
        return F.softmax(logits, dim=1).detach().cpu().numpy()
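
The distillation loop below consumes pre-encoded arrays (x_tr, y_tr, l_tr, t_tr and their *_de dev-set counterparts) whose construction is not shown in this post. Here is a rough sketch of that step, assuming each line of the files is a label and the review text separated by a tab (as in the samples in section 5.1), a simple character-level vocabulary, and illustrative file names and hyperparameters; the project's own preprocessing may differ:

import numpy as np

def load_split(path, teacher, vocab, max_len=50):
    # each line: 'label<TAB>text'; returns padded ids, hard labels, lengths, teacher soft labels
    xs, ys, lens, ts = [], [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            label, text = line.rstrip('\n').split('\t', 1)
            ids = [vocab.setdefault(ch, len(vocab)) for ch in text[:max_len]]
            lens.append(len(ids))
            xs.append(ids + [0] * (max_len - len(ids)))   # pad with 0 = the Embedding padding_idx
            ys.append(int(label))
            ts.append(teacher.predict(text)[0])           # soft labels from the fine-tuned BERT
    return np.array(xs), np.array(ys), np.array(lens), np.array(ts)

teacher = Teacher()
vocab = {'[PAD]': 0}
x_tr, y_tr, l_tr, t_tr = load_split('data/hotel/train.txt', teacher, vocab)  # hypothetical file names
x_de, y_de, l_de, t_de = load_split('data/hotel/dev.txt', teacher, vocab)
v_size = len(vocab)
alpha, epochs, lr, b_size, teach_on_dev = 0.5, 10, 1e-3, 64, True            # illustrative hyperparameters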

2. The distillation objective function

import torch
import torch.nn as nn
from torch import optim
from torch.autograd import Variable  # kept from the original; plain tensors also work

# tensor aliases used below (their definitions are not shown in the post)
USE_CUDA = torch.cuda.is_available()
FTensor = torch.cuda.FloatTensor if USE_CUDA else torch.FloatTensor
LTensor = torch.cuda.LongTensor if USE_CUDA else torch.LongTensor

model = RNN(v_size, 256, 256, 2)
# model = CNN(v_size, 256, 128, 2)
if USE_CUDA: model = model.cuda()
opt = optim.Adam(model.parameters(), lr=lr)
ce_loss = nn.NLLLoss()
mse_loss = nn.MSELoss()
for epoch in range(epochs):
    losses = []
    accu = []
    model.train()
    # labelled training set: hard labels (by) plus teacher soft labels (bt)
    for i in range(0, len(x_tr), b_size):
        model.zero_grad()
        bx = Variable(LTensor(x_tr[i:i + b_size]))
        by = Variable(LTensor(y_tr[i:i + b_size]))
        bl = Variable(LTensor(l_tr[i:i + b_size]))
        bt = Variable(FTensor(t_tr[i:i + b_size]))
        py1, py2 = model(bx, bl)
        loss = alpha * ce_loss(py2, by) + (1 - alpha) * mse_loss(py1, bt)  # in the paper, only the MSE term is used
        loss.backward()
        opt.step()
        losses.append(loss.item())
    # transfer set (here: the dev split): teacher soft labels only
    for i in range(0, len(x_de), b_size):
        model.zero_grad()
        bx = Variable(LTensor(x_de[i:i + b_size]))
        bl = Variable(LTensor(l_de[i:i + b_size]))
        bt = Variable(FTensor(t_de[i:i + b_size]))
        py1, py2 = model(bx, bl)
        loss = mse_loss(py1, bt)
        if teach_on_dev:
            loss.backward()
            opt.step()  # train only with the teacher's soft labels on the dev set
        losses.append(loss.item())
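
Section 5.2 reports accuracy and F1 for the distilled student, but the evaluation step itself is not shown. A minimal version using scikit-learn on the dev arrays prepared above (the project's compute_metrics helper may compute these differently):

from sklearn.metrics import accuracy_score, f1_score

model.eval()
preds = []
with torch.no_grad():
    for i in range(0, len(x_de), b_size):
        probs, _ = model(LTensor(x_de[i:i + b_size]), LTensor(l_de[i:i + b_size]))
        preds.extend(probs.argmax(dim=1).cpu().tolist())
print('acc:', accuracy_score(y_de, preds), 'f1:', f1_score(y_de, preds))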

4. Data Augmentation for Distillation

• Motivation: the previous section showed how to transfer the teacher's knowledge into the student, but with a small dataset the distillation process cannot fully convey the large model's knowledge, and the student tends to overfit. Is there a good way around this?

• Method: data augmentation, i.e., artificially enlarging the dataset to prevent overfitting.

• Strategies (a rough sketch follows this list):

  • Masking: with some probability, replace a word in the sentence with the [MASK] token;

  • POS-guided word replacement: with some probability, replace the current word with another word of the same part of speech, sampled according to the word frequencies of that POS in the original training set;

  • n-gram sampling: with some probability, replace the original sentence with an n-gram sampled from it, with n drawn from [1, 5]. This acts like dropout and is a more aggressive form of masking.
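
A rough sketch of these three rules applied to one tokenized sentence. The probabilities, the POS tagger and the pos_vocab structure (each POS tag mapped to candidate words and their training-set frequencies) are implementation choices not spelled out in this post:

import random

def augment(words, pos_tags, pos_vocab, p_mask=0.1, p_pos=0.1, p_ng=0.25):
    # words: tokenized sentence; pos_tags: one tag per token;
    # pos_vocab: {tag: ([candidate words], [frequencies])} built from the training set
    out = []
    for word, tag in zip(words, pos_tags):
        r = random.random()
        if r < p_mask:                                   # Masking
            out.append('[MASK]')
        elif r < p_mask + p_pos and tag in pos_vocab:    # POS-guided word replacement
            cands, freqs = pos_vocab[tag]
            out.append(random.choices(cands, weights=freqs, k=1)[0])
        else:
            out.append(word)
    if random.random() < p_ng and len(out) > 1:          # n-gram sampling
        n = random.randint(1, min(5, len(out)))
        start = random.randint(0, len(out) - n)
        out = out[start:start + n]
    return out

Running augment several times per original sentence and labeling each synthetic sentence with the teacher's soft output is how the transfer set is enlarged.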



5. Experimental Results on Single-Sentence Classification

5.1 Dataset

The dataset used here is a binary (positive/negative) hotel-review dataset; samples look like this (label, then review text):

          1 鬧中取靜的一個(gè)地方,在窗前能看到不錯(cuò)的風(fēng)景。酒店價(jià)格的確有些偏高
          0 房價(jià)要四百多,但感到非常失望,陳舊,臟,比錦江之星還差。以后肯定不會(huì)再去了。這樣的硬件設(shè)施和服務(wù)怎么吸引客人呢。
          1 酒店總體感覺不錯(cuò),很適合外賓入住,大堂的氛圍整個(gè)就像是一個(gè)外國人的社區(qū)。房間很舒服,攜程搞活動(dòng),還加送了紅酒和水果,很不錯(cuò),下次還會(huì)考慮入住。只是停車場比較麻煩,來賓進(jìn)停車場之前還要有狼狗繞車檢查,感覺不舒服。
          0 好小的門面,沒有電梯,房間也不是很一致!豪華房居然要400多,馬桶還是壞的!酒店太自作主張了。。。。
          0 房間以次充好,提出異議后才調(diào)整,調(diào)整后還是較差的房間
          0 面前就是高架,實(shí)在是太吵了,一晚上沒睡!
          ...

5.2 Results

Model            Acc      F1       Speed
BERT             0.9000   0.9051   0.2413 s
TextCNN          0.8264   0.8272   0.0010 s
BERT->TextCNN    0.8813   0.8884   0.0050 s

6. Summary

1. Although the BERT->TextCNN student scores lower than BERT, it scores much higher than a TextCNN trained directly on the labels;

2. Although BERT->TextCNN is slower than the plain TextCNN at inference, it is much faster than BERT.


