
NLP (Part 53): Using the English RoBERTa Model for Text Classification in Keras


2022-06-09 14:26


The English RoBERTa model was introduced by Facebook in the 2019 paper RoBERTa: A Robustly Optimized BERT Pretraining Approach. It was proposed to address some issues with the BERT model and, at the time, topped the leaderboards of a wide range of NLP tasks, achieving SOTA results. The model and code are open source and live in the fairseq project on GitHub. As is well known, the English RoBERTa model was trained with the Torch framework, so its torch version is the most common one.
Of course, a torch model can also be converted into a tensorflow model. This article explains how to convert the original torch version of the English RoBERTa model into a tensorflow model, and how to use that tensorflow model in Keras for English text classification.
The project structure is shown in the figure below:

[Figure: project structure]

Model Conversion

This project first converts the original torch version of the English RoBERTa model into a tensorflow model; this part of the code is largely based on the GitHub project keras_roberta.
First, download the roberta base model released by Facebook in the fairseq project, available at: https://github.com/pytorch/fairseq/blob/main/examples/roberta/README.md.

[Figure: RoBERTa model download]
Run the convert_roberta_to_tf.py script to convert the torch model into a tensorflow model. The conversion code itself is not reproduced here; see the GitHub project linked at the end of this article. A rough, hedged sketch of the idea is given below.
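The following is only an illustrative sketch of the conversion idea, not the project's convert_roberta_to_tf.py: it loads the fairseq weights and writes them into a TF1-style checkpoint. The name mapping, the handling of dense-kernel transposition and the output paths below are assumptions.

import tensorflow as tf
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel

tf.compat.v1.disable_eager_execution()

# load the torch weights released with fairseq
roberta = FairseqRobertaModel.from_pretrained('roberta-base')
state_dict = roberta.model.state_dict()

with tf.compat.v1.Session() as sess:
    tf_vars = []
    for torch_name, tensor in state_dict.items():
        # placeholder name mapping; the real script renames each fairseq
        # parameter to its BERT-style variable name and transposes dense kernels
        tf_name = torch_name.replace('.', '/')
        tf_vars.append(tf.Variable(tensor.detach().cpu().numpy(), name=tf_name))
    sess.run(tf.compat.v1.global_variables_initializer())
    tf.compat.v1.train.Saver(tf_vars).save(sess, 'tf_roberta_base/tf_roberta_base.ckpt')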
For the model's tokenizer, RobertaTokenizer is swapped for GPT2Tokenizer, since RobertaTokenizer inherits from GPT2Tokenizer and the two are very similar. The following code compares the behaviour of the original torch model and the converted tensorflow model (tf_roberta_demo.py):
import os
import tensorflow as tf
from keras_roberta.roberta import build_bert_model
from keras_roberta.tokenizer import RobertaTokenizer
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel
import numpy as np
import argparse


if __name__ == '__main__':
    roberta_path = 'roberta-base'
    tf_roberta_path = 'tf_roberta_base'
    tf_ckpt_name = 'tf_roberta_base.ckpt'
    vocab_path = 'keras_roberta'

    config_path = os.path.join(tf_roberta_path, 'bert_config.json')
    checkpoint_path = os.path.join(tf_roberta_path, tf_ckpt_name)
    if os.path.splitext(checkpoint_path)[-1] != '.ckpt':
        checkpoint_path += '.ckpt'

    gpt_bpe_vocab = os.path.join(vocab_path, 'encoder.json')
    gpt_bpe_merge = os.path.join(vocab_path, 'vocab.bpe')
    roberta_dict = os.path.join(roberta_path, 'dict.txt')

    tokenizer = RobertaTokenizer(gpt_bpe_vocab, gpt_bpe_merge, roberta_dict)
    model = build_bert_model(config_path, checkpoint_path, roberta=True)  # build the model and load the weights

    # encoding test
    text1 = "hello, world!"
    text2 = "This is Roberta!"
    sep = [tokenizer.sep_token]
    cls = [tokenizer.cls_token]
    # 1. first convert the text into BPE tokens with 'bpe_tokenize'
    tokens1 = cls + tokenizer.bpe_tokenize(text1) + sep
    tokens2 = sep + tokenizer.bpe_tokenize(text2) + sep
    # 2. then convert the tokens into ids
    token_ids1 = tokenizer.convert_tokens_to_ids(tokens1)
    token_ids2 = tokenizer.convert_tokens_to_ids(tokens2)
    token_ids = token_ids1 + token_ids2
    segment_ids = [0] * len(token_ids1) + [1] * len(token_ids2)
    print(token_ids)
    print(segment_ids)

    print('\n ===== tf model predicting =====\n')
    our_output = model.predict([np.array([token_ids]), np.array([segment_ids])])
    print(our_output)

    print('\n ===== torch model predicting =====\n')
    roberta = FairseqRobertaModel.from_pretrained(roberta_path)
    roberta.eval()  # disable dropout

    input_ids = roberta.encode(text1, text2).unsqueeze(0)  # batch of size 1
    print(input_ids)
    their_output = roberta.model(input_ids, features_only=True)[0]
    print(their_output)

The output is as follows:

[0, 42891, 6, 232, 328, 2, 2, 713, 16, 1738, 102, 328, 2]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

 ===== tf model predicting =====
[[[-0.01123665  0.05132651 -0.02170264 ... -0.03562857 -0.02836962
   -0.00519008]
  [ 0.04382067  0.07045364 -0.00431021 ... -0.04662359 -0.10770167
    0.1121687 ]
  [ 0.06198474  0.05240346  0.11088232 ... -0.08883709 -0.02932207
   -0.12898633]
  ...
  [-0.00229368  0.045834    0.00811818 ... -0.11751424 -0.06718166
    0.04085271]
  [-0.08509324 -0.27506304 -0.02425355 ... -0.24215901 -0.15481825
    0.17167582]
  [-0.05180666  0.06384835 -0.05997407 ... -0.09398533 -0.05159672
   -0.03988626]]]

 ===== torch model predicting =====
tensor([[    0, 42891,     6,   232,   328,     2,     2,   713,    16,  1738,
           102,   328,     2]])
tensor([[[-0.0525,  0.0818, -0.0170,  ..., -0.0546, -0.0569, -0.0099],
         [-0.0765, -0.0568, -0.1400,  ..., -0.2612, -0.0455,  0.2975],
         [-0.0142,  0.1184,  0.0530,  ..., -0.0844,  0.0199,  0.1340],
         ...,
         [-0.0019,  0.1263, -0.0787,  ..., -0.3986, -0.0626,  0.1870],
         [ 0.0127, -0.2116,  0.0696,  ..., -0.1622, -0.1265,  0.0986],
         [-0.0473,  0.0748, -0.0419,  ..., -0.0892, -0.0595, -0.0281]]],
       grad_fn=<...>)

As you can see, the token_ids produced by the two tokenizations are identical.

English Text Classification

Next, let's see how the converted tensorflow version of the RoBERTa model performs on an English text classification dataset.
Here we use SST-2 (The Stanford Sentiment Treebank) from the GLUE benchmark, a single-sentence classification task consisting of sentences from movie reviews together with human annotations of their sentiment. The task is to predict the sentiment of a given sentence; there are two classes, positive (label 1) and negative (label 0), and only sentence-level labels are used. In other words, this is a binary, sentence-level sentiment classification task. For details on the dataset, see: https://nlp.stanford.edu/sentiment/index.html.
The SST-2 dataset has 67,349 training samples, 872 validation samples and 1,820 test samples, stored as tsv files. The code for reading the data is as follows (utils/load_data.py):

def read_model_data(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = [_.strip() for _ in f.readlines()]
    for i, line in enumerate(lines):
        if i:  # skip the header line
            items = line.split('\t')
            # one-hot label: [0, 1] for positive (1), [1, 0] for negative (0)
            label = [0, 1] if int(items[1]) else [1, 0]
            data.append([label, items[0]])
    return data
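For context, the GLUE SST-2 tsv files consist of a header line followed by one tab-separated sentence/label pair per line, which is why the loop skips line 0 and reads items[0] as the text and items[1] as the label. A minimal usage sketch (the data paths below are assumptions, not taken from the project):

# hedged usage sketch; adjust the paths to wherever SST-2 is stored
train_data = read_model_data('SST-2/train.tsv')
dev_data = read_model_data('SST-2/dev.tsv')
print(len(train_data), len(dev_data))   # expected: 67349 872
print(train_data[0])                    # [one-hot label, sentence]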

For the tokenizer we use GPT2Tokenizer; the code is as follows (utils/roberta_tokenizer.py):

# roberta tokenizer function for a single text
def tokenizer_encode(tokenizer, text, max_seq_length):
    sep = [tokenizer.sep_token]
    cls = [tokenizer.cls_token]
    # 1. first convert the text into BPE tokens with 'bpe_tokenize'
    tokens1 = cls + tokenizer.bpe_tokenize(text) + sep
    # 2. then convert the tokens into ids
    token_ids = tokenizer.convert_tokens_to_ids(tokens1)
    segment_ids = [0] * len(token_ids)
    # pad or truncate to max_seq_length
    pad_length = max_seq_length - len(token_ids)
    if pad_length >= 0:
        token_ids += [0] * pad_length
        segment_ids += [0] * pad_length
    else:
        token_ids = token_ids[:max_seq_length]
        segment_ids = segment_ids[:max_seq_length]

    return token_ids, segment_ids
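A quick usage sketch, assuming a RobertaTokenizer built from the same vocabulary files as in tf_roberta_demo.py above:

# hedged usage sketch
from keras_roberta.tokenizer import RobertaTokenizer

tokenizer = RobertaTokenizer('keras_roberta/encoder.json',
                             'keras_roberta/vocab.bpe',
                             'roberta-base/dict.txt')
token_ids, segment_ids = tokenizer_encode(tokenizer, "This is Roberta!", 80)
print(len(token_ids), len(segment_ids))   # both padded/truncated to 80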

The model is built as follows (model_train.py):

# build the classification model
def create_cls_model():
    # Roberta model
    roberta_model = build_bert_model(CONFIG_FILE_PATH, CHECKPOINT_FILE_PATH, roberta=True)  # build the model and load the weights

    for layer in roberta_model.layers:
        layer.trainable = True

    cls_layer = Lambda(lambda x: x[:, 0])(roberta_model.output)    # take the vector at the [CLS] position for classification
    p = Dense(2, activation='softmax')(cls_layer)                  # softmax over the two classes

    model = Model(roberta_model.input, p)
    model.compile(
        loss='categorical_crossentropy',
        optimizer=Adam(1e-5),   # use a sufficiently small learning rate
        metrics=['accuracy']
    )

    return model

The model hyperparameters are as follows:

# model hyperparameters
EPOCH = 10              # number of training epochs
BATCH_SIZE = 64         # batch size
MAX_SEQ_LENGTH = 80     # maximum sequence length
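A hedged sketch of how these pieces might be wired together for training, assuming the hyperparameters and create_cls_model defined above are in the same script; the actual model_train.py is in the repo, and the paths, the config/checkpoint constants and the output file name below are assumptions:

# hedged training sketch -- not the project's model_train.py
import numpy as np
from keras_roberta.tokenizer import RobertaTokenizer
from utils.load_data import read_model_data
from utils.roberta_tokenizer import tokenizer_encode

CONFIG_FILE_PATH = 'tf_roberta_base/bert_config.json'          # assumed path
CHECKPOINT_FILE_PATH = 'tf_roberta_base/tf_roberta_base.ckpt'  # assumed path

tokenizer = RobertaTokenizer('keras_roberta/encoder.json',
                             'keras_roberta/vocab.bpe',
                             'roberta-base/dict.txt')

def to_arrays(dataset):
    # turn [one-hot label, sentence] pairs into model inputs and targets
    token_id_list, segment_id_list, labels = [], [], []
    for label, text in dataset:
        token_ids, segment_ids = tokenizer_encode(tokenizer, text, MAX_SEQ_LENGTH)
        token_id_list.append(token_ids)
        segment_id_list.append(segment_ids)
        labels.append(label)
    return [np.array(token_id_list), np.array(segment_id_list)], np.array(labels)

x_train, y_train = to_arrays(read_model_data('SST-2/train.tsv'))   # assumed path
x_dev, y_dev = to_arrays(read_model_data('SST-2/dev.tsv'))         # assumed path

model = create_cls_model()
model.fit(x_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=EPOCH,
          validation_data=(x_dev, y_dev))
model.save_weights('sst2_roberta_cls.h5')   # assumed file name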

After training, the model reaches an accuracy of 0.9415 and an F1 score of 0.9415 on the validation set, which is a decent result.

Model Prediction

We run the model on new samples (model_predict.py); the prediction results are shown below, followed by a hedged sketch of what such a prediction script might look like:

Awesome movie for everyone to watch. Animation was flawless.
label: 1, prob: 0.9999607

I almost balled my eyes out 5 times. Almost. Beautiful movie, very inspiring.
label: 1, prob: 0.9999519

Not even worth it. It's a movie that's too stupid for adults, and too crappy for everyone. Skip if you're not 13, or even if you are.
label: 0, prob: 0.9999864
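A minimal sketch of the prediction step, assuming the tokenizer, create_cls_model and MAX_SEQ_LENGTH from the training code above; the weight file name is an assumption, and the actual model_predict.py is in the repo:

# hedged prediction sketch -- not the project's model_predict.py
import numpy as np

model = create_cls_model()
model.load_weights('sst2_roberta_cls.h5')   # assumed file name

text = "Awesome movie for everyone to watch. Animation was flawless."
token_ids, segment_ids = tokenizer_encode(tokenizer, text, MAX_SEQ_LENGTH)
probs = model.predict([np.array([token_ids]), np.array([segment_ids])])[0]
label = int(np.argmax(probs))               # 1 = positive, 0 = negative
print(f"label: {label}, prob: {probs[label]:.7f}")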

Summary

This article described how to convert the original torch version of the English RoBERTa model into a tensorflow model and how to use that model in Keras for English text classification.
The project code has been published on GitHub at: https://github.com/percent4/keras_roberta_text_classificaiton.
Thanks for reading. If you have any questions, feel free to get in touch~

References

1. fairseq: https://github.com/pytorch/fairseq

2. GLUE tasks: https://gluebenchmark.com/tasks

3. SST-2: https://nlp.stanford.edu/sentiment/index.html

4. keras_roberta: https://github.com/midori1/keras_roberta

5. Roberta paper: https://arxiv.org/pdf/1907.11692.pdf

