
[NLP] bertorch: a PyTorch-based BERT implementation and downstream-task fine-tuning toolkit


          2022-05-31 21:06

bertorch (https://github.com/zejunwang1/bertorch) is a PyTorch-based toolkit for implementing BERT and fine-tuning it on downstream tasks. It supports common NLP tasks, including text classification, text matching, semantic understanding (sentence embeddings), and sequence labeling.

• 1. Dependencies

• 2. Text Classification

• 3. Text Matching

• 4. Semantic Understanding

  • 4.1 SimCSE

  • 4.2 In-Batch Negatives

• 5. Sequence Labeling


1. Dependencies

• Python >= 3.6

• torch >= 1.1
• argparse
• json
• loguru
• numpy
• packaging
• re

2. Text Classification

This project shows how a pretrained model such as BERT can be fine-tuned for text classification. Taking the public Chinese sentiment classification dataset ChnSentiCorp as an example, the following command runs single-machine multi-GPU distributed training with DistributedDataParallel, training on train.tsv and evaluating on dev.tsv:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_classifier.py --train_data_file ./data/ChnSentiCorp/train.tsv --dev_data_file ./data/ChnSentiCorp/dev.tsv --label_file ./data/ChnSentiCorp/labels.txt --save_best_model --epochs 3 --batch_size 32
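For context, a minimal sketch of the per-process setup that torch.distributed.launch implies (this is the standard PyTorch DDP pattern, not the repo's actual run_classifier.py; the model, dataset and collate_fn are placeholders):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup_ddp(local_rank, model, train_dataset, collate_fn):
    # torch.distributed.launch starts one process per GPU and passes --local_rank to each.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])  # gradients all-reduced across GPUs
    # DistributedSampler shards the data so each process sees a different slice per epoch.
    sampler = DistributedSampler(train_dataset)
    loader = DataLoader(train_dataset, batch_size=32, sampler=sampler, collate_fn=collate_fn)
    return model, loader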

Supported configuration options:

usage: run_classifier.py [-h] [--local_rank LOCAL_RANK]
                         [--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH]
                         [--init_from_ckpt INIT_FROM_CKPT] --train_data_file
                         TRAIN_DATA_FILE [--dev_data_file DEV_DATA_FILE]
                         --label_file LABEL_FILE [--batch_size BATCH_SIZE]
                         [--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}]
                         [--learning_rate LEARNING_RATE]
                         [--warmup_proportion WARMUP_PROPORTION] [--seed SEED]
                         [--save_steps SAVE_STEPS]
                         [--logging_steps LOGGING_STEPS]
                         [--weight_decay WEIGHT_DECAY] [--epochs EPOCHS]
                         [--max_seq_length MAX_SEQ_LENGTH]
                         [--saved_dir SAVED_DIR]
                         [--max_grad_norm MAX_GRAD_NORM] [--save_best_model]
                         [--is_text_pair]
• local_rank: Optional. Process rank for distributed training. Default: -1.

• pretrained_model_name_or_path: Optional. Name or path of a Hugging Face pretrained model. Default: bert-base-chinese.

• train_data_file: Required. Path to the training set file.

• dev_data_file: Optional. Path to the validation set file. Default: None.

• label_file: Required. Path to the class label file.

• batch_size: Optional. Batch size; adjust it to your GPU memory and lower it if you run out of memory. Default: 32.

• init_from_ckpt: Optional. Path to model parameters to load for warm-starting training. Default: None.

• scheduler: Optional. Learning rate schedule. Default: linear.

• learning_rate: Optional. Maximum learning rate of the optimizer. Default: 5e-5.

• warmup_proportion: Optional. Proportion of training steps used for learning rate warmup; with 0.1, the learning rate grows from 0 to learning_rate over the first 10% of training steps and then slowly decays. Default: 0. (See the scheduler sketch after this list.)

• weight_decay: Optional. Weight decay (regularization strength) used to reduce overfitting. Default: 0.0.

• seed: Optional. Random seed. Default: 1000.

• logging_steps: Optional. Interval, in steps, between log messages. Default: 20.

• save_steps: Optional. Interval, in steps, between checkpoint saves. Default: 100.

• epochs: Optional. Number of training epochs. Default: 3.

• max_seq_length: Optional. Maximum input sequence length for the pretrained model, at most 512. Default: 128.

• saved_dir: Optional. Directory where trained models are saved. Default: the checkpoint folder under the current directory.

• max_grad_norm: Optional. max_norm used for gradient clipping during training. Default: 1.0.

• save_best_model: Optional. Whether to save the model at the best validation metric; save_best_model is True when --save_best_model is passed, otherwise False.

• is_text_pair: Optional. Whether to classify text pairs; when --is_text_pair is passed, text-pair classification is performed, otherwise single-text classification.
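To make the scheduler and warmup_proportion options concrete, here is a small sketch using the Hugging Face transformers helpers (the step counts are made up for illustration, and `model` stands for the classifier being fine-tuned):

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

total_steps = 500 * 3          # e.g. 500 batches per epoch, 3 epochs
warmup_proportion = 0.1        # LR ramps up over the first 10% of steps, then decays linearly

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(total_steps * warmup_proportion),
    num_training_steps=total_steps,
)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()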

Sample training logs:

2022-05-25 07:22:29.403 | INFO     | __main__:train:301 - global step: 20, epoch: 1, batch: 20, loss: 0.23227, accuracy: 0.87500, speed: 2.12 step/s
2022-05-25 07:22:39.131 | INFO     | __main__:train:301 - global step: 40, epoch: 1, batch: 40, loss: 0.30054, accuracy: 0.87500, speed: 2.06 step/s
2022-05-25 07:22:49.010 | INFO     | __main__:train:301 - global step: 60, epoch: 1, batch: 60, loss: 0.23514, accuracy: 0.93750, speed: 2.02 step/s
2022-05-25 07:22:58.909 | INFO     | __main__:train:301 - global step: 80, epoch: 1, batch: 80, loss: 0.12026, accuracy: 0.96875, speed: 2.02 step/s
2022-05-25 07:23:08.804 | INFO     | __main__:train:301 - global step: 100, epoch: 1, batch: 100, loss: 0.21955, accuracy: 0.90625, speed: 2.02 step/s
2022-05-25 07:23:13.534 | INFO     | __main__:train:307 - eval loss: 0.22564, accuracy: 0.91750
2022-05-25 07:23:25.222 | INFO     | __main__:train:301 - global step: 120, epoch: 1, batch: 120, loss: 0.32157, accuracy: 0.90625, speed: 2.03 step/s
2022-05-25 07:23:35.104 | INFO     | __main__:train:301 - global step: 140, epoch: 1, batch: 140, loss: 0.20107, accuracy: 0.87500, speed: 2.02 step/s
2022-05-25 07:23:44.978 | INFO     | __main__:train:301 - global step: 160, epoch: 2, batch: 10, loss: 0.08750, accuracy: 0.96875, speed: 2.03 step/s
2022-05-25 07:23:54.869 | INFO     | __main__:train:301 - global step: 180, epoch: 2, batch: 30, loss: 0.08308, accuracy: 1.00000, speed: 2.02 step/s
2022-05-25 07:24:04.754 | INFO     | __main__:train:301 - global step: 200, epoch: 2, batch: 50, loss: 0.10256, accuracy: 0.93750, speed: 2.02 step/s
2022-05-25 07:24:09.480 | INFO     | __main__:train:307 - eval loss: 0.22497, accuracy: 0.93083
2022-05-25 07:24:21.020 | INFO     | __main__:train:301 - global step: 220, epoch: 2, batch: 70, loss: 0.23989, accuracy: 0.93750, speed: 2.03 step/s
2022-05-25 07:24:30.919 | INFO     | __main__:train:301 - global step: 240, epoch: 2, batch: 90, loss: 0.00897, accuracy: 1.00000, speed: 2.02 step/s
2022-05-25 07:24:40.777 | INFO     | __main__:train:301 - global step: 260, epoch: 2, batch: 110, loss: 0.13605, accuracy: 0.93750, speed: 2.03 step/s
2022-05-25 07:24:50.640 | INFO     | __main__:train:301 - global step: 280, epoch: 2, batch: 130, loss: 0.14508, accuracy: 0.93750, speed: 2.03 step/s
2022-05-25 07:25:00.529 | INFO     | __main__:train:301 - global step: 300, epoch: 2, batch: 150, loss: 0.04770, accuracy: 0.96875, speed: 2.02 step/s
2022-05-25 07:25:05.256 | INFO     | __main__:train:307 - eval loss: 0.23039, accuracy: 0.93500
2022-05-25 07:25:16.818 | INFO     | __main__:train:301 - global step: 320, epoch: 3, batch: 20, loss: 0.04312, accuracy: 0.96875, speed: 2.04 step/s
2022-05-25 07:25:26.700 | INFO     | __main__:train:301 - global step: 340, epoch: 3, batch: 40, loss: 0.05103, accuracy: 0.96875, speed: 2.02 step/s
2022-05-25 07:25:36.588 | INFO     | __main__:train:301 - global step: 360, epoch: 3, batch: 60, loss: 0.12114, accuracy: 0.87500, speed: 2.02 step/s
2022-05-25 07:25:46.443 | INFO     | __main__:train:301 - global step: 380, epoch: 3, batch: 80, loss: 0.01080, accuracy: 1.00000, speed: 2.03 step/s
2022-05-25 07:25:56.228 | INFO     | __main__:train:301 - global step: 400, epoch: 3, batch: 100, loss: 0.14839, accuracy: 0.96875, speed: 2.04 step/s
2022-05-25 07:26:00.953 | INFO     | __main__:train:307 - eval loss: 0.22589, accuracy: 0.94083
2022-05-25 07:26:12.483 | INFO     | __main__:train:301 - global step: 420, epoch: 3, batch: 120, loss: 0.14986, accuracy: 0.96875, speed: 2.05 step/s
2022-05-25 07:26:22.289 | INFO     | __main__:train:301 - global step: 440, epoch: 3, batch: 140, loss: 0.00687, accuracy: 1.00000, speed: 2.04 step/s

For text-pair classification, simply set is_text_pair to True. Taking the AFQMC (Ant Financial semantic similarity) dataset from CLUEbenchmark as an example, training can be run with:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_classifier.py --train_data_file ./data/AFQMC/train.txt --dev_data_file ./data/AFQMC/dev.txt --label_file ./data/AFQMC/labels.txt --is_text_pair --save_best_model --epochs 3 --batch_size 32

Training on different datasets gives the following dev-set results:

Task      ChnSentiCorp   AFQMC     TNEWS
dev-acc   0.94083        0.74305   0.56990

TNEWS is the Toutiao news classification dataset from CLUEbenchmark.
CLUEbenchmark datasets: https://github.com/CLUEbenchmark/CLUE

3. Text Matching

This project shows how to fine-tune a Sentence-BERT model for Chinese text matching. Sentence-BERT uses a twin-tower (Siamese) architecture: the query and the title are each fed into a BERT encoder with shared parameters to obtain their token embeddings. The token embeddings are then pooled (the paper uses mean pooling), giving outputs u and v. Finally, the three vectors (u, v, |u-v|) are concatenated and passed to a linear classifier.

For more details on Sentence-BERT, see the paper: https://arxiv.org/abs/1908.10084
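A minimal sketch of that forward pass, using the Hugging Face BertModel as the shared encoder (illustrative only; run_sentencebert.py's own implementation and its default linear pooler may differ in detail):

import torch
import torch.nn as nn
from transformers import BertModel

class SentenceBertClassifier(nn.Module):
    def __init__(self, num_labels, model_name="bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)   # shared by both towers
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden * 3, num_labels)    # input is (u, v, |u-v|)

    def mean_pool(self, inputs):
        out = self.encoder(**inputs).last_hidden_state          # [batch, seq_len, hidden]
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        return (out * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def forward(self, query_inputs, title_inputs):
        u = self.mean_pool(query_inputs)   # query sentence vector
        v = self.mean_pool(title_inputs)   # title sentence vector
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(features)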

Taking the Chinese text matching dataset LCQMC as an example, the following command runs single-machine multi-GPU distributed training with DistributedDataParallel, training on the training set and evaluating on the validation set:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_sentencebert.py --train_data_file ./data/LCQMC/train.txt --dev_data_file ./data/LCQMC/dev.txt --save_best_model --epochs 3 --batch_size 32

Supported configuration options:

usage: run_sentencebert.py [-h] [--local_rank LOCAL_RANK]
                           [--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH]
                           [--init_from_ckpt INIT_FROM_CKPT] --train_data_file
                           TRAIN_DATA_FILE [--dev_data_file DEV_DATA_FILE]
                           [--label_file LABEL_FILE] [--batch_size BATCH_SIZE]
                           [--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}]
                           [--learning_rate LEARNING_RATE]
                           [--warmup_proportion WARMUP_PROPORTION]
                           [--seed SEED] [--save_steps SAVE_STEPS]
                           [--logging_steps LOGGING_STEPS]
                           [--weight_decay WEIGHT_DECAY] [--epochs EPOCHS]
                           [--max_seq_length MAX_SEQ_LENGTH]
                           [--saved_dir SAVED_DIR]
                           [--max_grad_norm MAX_GRAD_NORM] [--save_best_model]
                           [--is_nli] [--pooling_mode {linear,cls,mean}]
                           [--concat_multiply]
                           [--output_emb_size OUTPUT_EMB_SIZE]

Most options are the same as in text classification; the task-specific ones are:

• is_nli: Optional. When --is_nli is passed, the model is trained on an NLI (natural language inference) dataset.

• pooling_mode: Optional. With linear, the sentence embedding is the [CLS] vector passed through a linear pooler; with cls, the raw [CLS] vector is used; with mean, the average of all token vectors is used. Default: linear. (See the pooling sketch after this list.)

• concat_multiply: Optional. When --concat_multiply is passed, (u, v, |u-v|, u*v) is used as the classifier's input features; otherwise (u, v, |u-v|) is used.

• output_emb_size: Optional. Dimension of the sentence embedding produced by the encoder; when None, the embedding dimension equals the encoder's hidden_size. Default: None.
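As an illustration of the pooling_mode and output_emb_size options, a hedged sketch of a pooling module (names and structure are illustrative, not the repo's exact code):

import torch.nn as nn

class Pooler(nn.Module):
    """Turns token embeddings into a single sentence vector."""
    def __init__(self, hidden_size, mode="linear", output_emb_size=None):
        super().__init__()
        self.mode = mode
        # With output_emb_size set, the linear pooler also projects the embedding
        # down to that dimension; otherwise it keeps the encoder's hidden_size.
        self.linear = nn.Linear(hidden_size, output_emb_size or hidden_size)

    def forward(self, last_hidden_state, attention_mask):
        if self.mode == "cls":
            return last_hidden_state[:, 0]                      # raw [CLS] vector
        if self.mode == "mean":
            mask = attention_mask.unsqueeze(-1).float()
            return (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.linear(last_hidden_state[:, 0])             # "linear" (default)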

Part of the training logs:

          ......
2022-05-24 17:07:26.672 | INFO     | __main__:train:308 - global step: 9620, epoch: 3, batch: 2158, loss: 0.16183, accuracy: 0.90625, speed: 3.38 step/s
2022-05-24 17:07:32.407 | INFO     | __main__:train:308 - global step: 9640, epoch: 3, batch: 2178, loss: 0.09866, accuracy: 0.96875, speed: 3.49 step/s
2022-05-24 17:07:38.177 | INFO     | __main__:train:308 - global step: 9660, epoch: 3, batch: 2198, loss: 0.38715, accuracy: 0.90625, speed: 3.47 step/s
2022-05-24 17:07:43.796 | INFO     | __main__:train:308 - global step: 9680, epoch: 3, batch: 2218, loss: 0.12515, accuracy: 0.93750, speed: 3.56 step/s
2022-05-24 17:07:49.740 | INFO     | __main__:train:308 - global step: 9700, epoch: 3, batch: 2238, loss: 0.03231, accuracy: 1.00000, speed: 3.37 step/s
2022-05-24 17:08:04.752 | INFO     | __main__:train:314 - eval loss: 0.38621, accuracy: 0.86549
2022-05-24 17:08:12.245 | INFO     | __main__:train:308 - global step: 9720, epoch: 3, batch: 2258, loss: 0.08337, accuracy: 0.96875, speed: 3.45 step/s
2022-05-24 17:08:18.112 | INFO     | __main__:train:308 - global step: 9740, epoch: 3, batch: 2278, loss: 0.15085, accuracy: 0.93750, speed: 3.41 step/s
2022-05-24 17:08:23.895 | INFO     | __main__:train:308 - global step: 9760, epoch: 3, batch: 2298, loss: 0.11466, accuracy: 0.93750, speed: 3.46 step/s
2022-05-24 17:08:29.703 | INFO     | __main__:train:308 - global step: 9780, epoch: 3, batch: 2318, loss: 0.04269, accuracy: 1.00000, speed: 3.44 step/s
2022-05-24 17:08:35.658 | INFO     | __main__:train:308 - global step: 9800, epoch: 3, batch: 2338, loss: 0.28312, accuracy: 0.90625, speed: 3.36 step/s
2022-05-24 17:08:50.674 | INFO     | __main__:train:314 - eval loss: 0.39262, accuracy: 0.86424
2022-05-24 17:08:56.609 | INFO     | __main__:train:308 - global step: 9820, epoch: 3, batch: 2358, loss: 0.13456, accuracy: 0.96875, speed: 3.37 step/s
2022-05-24 17:09:02.259 | INFO     | __main__:train:308 - global step: 9840, epoch: 3, batch: 2378, loss: 0.06361, accuracy: 1.00000, speed: 3.54 step/s
2022-05-24 17:09:08.120 | INFO     | __main__:train:308 - global step: 9860, epoch: 3, batch: 2398, loss: 0.09087, accuracy: 0.96875, speed: 3.41 step/s
2022-05-24 17:09:13.834 | INFO     | __main__:train:308 - global step: 9880, epoch: 3, batch: 2418, loss: 0.19537, accuracy: 0.90625, speed: 3.50 step/s
2022-05-24 17:09:19.531 | INFO     | __main__:train:308 - global step: 9900, epoch: 3, batch: 2438, loss: 0.05254, accuracy: 1.00000, speed: 3.51 step/s
2022-05-24 17:09:34.531 | INFO     | __main__:train:314 - eval loss: 0.39561, accuracy: 0.86560
2022-05-24 17:09:42.084 | INFO     | __main__:train:308 - global step: 9920, epoch: 3, batch: 2458, loss: 0.05342, accuracy: 1.00000, speed: 3.41 step/s
2022-05-24 17:09:47.781 | INFO     | __main__:train:308 - global step: 9940, epoch: 3, batch: 2478, loss: 0.22660, accuracy: 0.87500, speed: 3.51 step/s
2022-05-24 17:09:53.496 | INFO     | __main__:train:308 - global step: 9960, epoch: 3, batch: 2498, loss: 0.14745, accuracy: 0.93750, speed: 3.50 step/s
2022-05-24 17:09:59.350 | INFO     | __main__:train:308 - global step: 9980, epoch: 3, batch: 2518, loss: 0.06218, accuracy: 0.96875, speed: 3.42 step/s
2022-05-24 17:10:05.157 | INFO     | __main__:train:308 - global step: 10000, epoch: 3, batch: 2538, loss: 0.15225, accuracy: 0.96875, speed: 3.44 step/s
2022-05-24 17:10:20.159 | INFO     | __main__:train:314 - eval loss: 0.39152, accuracy: 0.86730
          ......

When training on NLI data, add the --is_nli flag and --label_file LABEL_FILE:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_sentencebert.py --train_data_file ./data/CMNLI/train.txt --dev_data_file ./data/CMNLI/dev.txt --label_file ./data/CMNLI/labels.txt --is_nli --save_best_model --epochs 3 --batch_size 32

Training on different datasets gives the following dev-set results:

Task      LCQMC     Chinese-MNLI   Chinese-SNLI
dev-acc   0.86730   0.71105        0.80567

Chinese-MNLI and Chinese-SNLI: https://github.com/zejunwang1/CSTS

4. Semantic Understanding

          4.1 SimCSE

SimCSE is well suited to matching and retrieval scenarios that lack supervised data but have plenty of unlabeled data. This project implements the unsupervised SimCSE method and trains a sentence embedding model on Chinese Wikipedia sentences.

For more details on SimCSE, see the paper: https://arxiv.org/abs/2104.08821
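The core idea is that each sentence is encoded twice; because dropout is active during training, the two passes yield slightly different embeddings that form a positive pair, while the other sentences in the batch serve as negatives. A rough sketch of that loss (the repo's margin handling is omitted, and `encode` stands for the sentence encoder with dropout enabled):

import torch
import torch.nn.functional as F

def simcse_loss(encode, batch_inputs, scale=20.0):
    # Two forward passes over the same batch; dropout makes z1 != z2.
    z1 = F.normalize(encode(batch_inputs), dim=-1)   # [batch, dim]
    z2 = F.normalize(encode(batch_inputs), dim=-1)   # [batch, dim]

    # Cosine similarity of every sentence in view 1 against every sentence in view 2.
    sim = z1 @ z2.t() * scale                        # scale sharpens the softmax (default 20)

    # The correct "match" for sentence i is its own second view, i.e. the diagonal.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)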

150,000 sentences extracted from Chinese Wikipedia are stored in wiki_sents.txt under the data/zhwiki/ folder. The following command trains with the unsupervised SimCSE method, starting from Tencent UER's pretrained model uer/chinese_roberta_L-6_H-128 (https://huggingface.co/uer/chinese_roberta_L-6_H-128), and evaluates on the Chinese-STS-B validation set (https://github.com/zejunwang1/CSTS):

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_simcse.py --pretrained_model_name_or_path uer/chinese_roberta_L-6_H-128 --train_data_file ./data/zhwiki/wiki_sents.txt --dev_data_file ./data/STS-B/sts-b-dev.txt --learning_rate 5e-5 --epochs 1 --dropout 0.1 --margin 0.2 --scale 20 --batch_size 32

Supported configuration options:

usage: run_simcse.py [-h] [--local_rank LOCAL_RANK]
                     [--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH]
                     [--init_from_ckpt INIT_FROM_CKPT] --train_data_file
                     TRAIN_DATA_FILE [--dev_data_file DEV_DATA_FILE]
                     [--batch_size BATCH_SIZE]
                     [--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}]
                     [--learning_rate LEARNING_RATE]
                     [--warmup_proportion WARMUP_PROPORTION] [--seed SEED]
                     [--save_steps SAVE_STEPS] [--logging_steps LOGGING_STEPS]
                     [--weight_decay WEIGHT_DECAY] [--epochs EPOCHS]
                     [--max_seq_length MAX_SEQ_LENGTH] [--saved_dir SAVED_DIR]
                     [--max_grad_norm MAX_GRAD_NORM] [--save_best_model]
                     [--margin MARGIN] [--scale SCALE] [--dropout DROPOUT]
                     [--pooling_mode {linear,cls,mean}]
                     [--output_emb_size OUTPUT_EMB_SIZE]

Most options are the same as in text classification; the task-specific ones are:

• margin: Optional. Target gap between the similarity of a positive pair and that of the negative pairs. Default: 0.2.

• dropout: Optional. Dropout rate used in the encoder part of the SimCSE network. Default: 0.1.

• scale: Optional. Factor by which the cosine similarities are scaled before computing the cross-entropy loss. Default: 20.

• pooling_mode: Optional. With linear, the sentence embedding is the [CLS] vector passed through a linear pooler; with cls, the raw [CLS] vector is used; with mean, the average of all token vectors is used. Default: linear.

• output_emb_size: Optional. Dimension of the sentence embedding produced by the encoder; when None, the embedding dimension equals the encoder's hidden_size. Default: None.

Part of the training logs:

2022-05-27 09:14:58.471 | INFO     | __main__:train:315 - global step: 20, epoch: 1, batch: 20, loss: 1.04241, speed: 8.45 step/s
2022-05-27 09:15:01.063 | INFO     | __main__:train:315 - global step: 40, epoch: 1, batch: 40, loss: 0.15792, speed: 7.72 step/s
2022-05-27 09:15:03.700 | INFO     | __main__:train:315 - global step: 60, epoch: 1, batch: 60, loss: 0.18357, speed: 7.58 step/s
2022-05-27 09:15:06.365 | INFO     | __main__:train:315 - global step: 80, epoch: 1, batch: 80, loss: 0.13284, speed: 7.51 step/s
2022-05-27 09:15:09.000 | INFO     | __main__:train:315 - global step: 100, epoch: 1, batch: 100, loss: 0.14146, speed: 7.59 step/s
2022-05-27 09:15:09.847 | INFO     | __main__:train:321 - spearman corr: 0.6048, pearson corr: 0.5870
2022-05-27 09:15:12.507 | INFO     | __main__:train:315 - global step: 120, epoch: 1, batch: 120, loss: 0.03073, speed: 7.74 step/s
2022-05-27 09:15:15.110 | INFO     | __main__:train:315 - global step: 140, epoch: 1, batch: 140, loss: 0.09425, speed: 7.69 step/s
2022-05-27 09:15:17.749 | INFO     | __main__:train:315 - global step: 160, epoch: 1, batch: 160, loss: 0.08629, speed: 7.58 step/s
2022-05-27 09:15:20.386 | INFO     | __main__:train:315 - global step: 180, epoch: 1, batch: 180, loss: 0.03206, speed: 7.59 step/s
2022-05-27 09:15:23.052 | INFO     | __main__:train:315 - global step: 200, epoch: 1, batch: 200, loss: 0.11463, speed: 7.50 step/s
2022-05-27 09:15:24.023 | INFO     | __main__:train:321 - spearman corr: 0.5954, pearson corr: 0.5807
          ......

The pretrained SimCSE sentence embedding model simcse_tiny_chinese_wiki, with num_hidden_layers=6 and hidden_size=128, is available at:

model_name                           link
WangZeJun/simcse-tiny-chinese-wiki   https://huggingface.co/WangZeJun/simcse-tiny-chinese-wiki
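Assuming the released checkpoint follows the standard transformers layout, sentence embeddings can presumably be computed as follows (mean pooling is shown for illustration; check the repo for the pooling the checkpoint was actually trained with):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("WangZeJun/simcse-tiny-chinese-wiki")
model = AutoModel.from_pretrained("WangZeJun/simcse-tiny-chinese-wiki")
model.eval()

sentences = ["今天天气不错", "今天天气很好"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state             # [batch, seq_len, 128]
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(1) / mask.sum(1)       # mean-pooled sentence vectors

# Cosine similarity between the two sentences
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)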

          4.2 In-Batch Negatives

All semantically similar text pairs extracted from the HIT LCQMC dataset, the Google PAWS-X dataset, and the Peking University PKU-Paraphrase-Bank paraphrase dataset (https://github.com/zejunwang1/CSTS) are used as the training set, stored in data/batchneg/paraphrase_lcqmc_semantic_pairs.txt.

The following command trains a sentence embedding model with the In-batch negatives strategy on GPUs 0,1,2,3, starting from Tencent UER's pretrained model uer/chinese_roberta_L-6_H-128, and evaluates on the Chinese-STS-B validation set:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 run_batchneg.py --pretrained_model_name_or_path uer/chinese_roberta_L-6_H-128 --train_data_file ./data/batchneg/paraphrase_lcqmc_semantic_pairs.txt --dev_data_file ./data/STS-B/sts-b-dev.txt --learning_rate 5e-5 --epochs 3 --margin 0.2 --scale 20 --batch_size 64 --mean_loss

Supported configuration options:

usage: run_batchneg.py [-h] [--local_rank LOCAL_RANK]
                       [--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH]
                       [--init_from_ckpt INIT_FROM_CKPT] --train_data_file
                       TRAIN_DATA_FILE [--dev_data_file DEV_DATA_FILE]
                       [--batch_size BATCH_SIZE]
                       [--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}]
                       [--learning_rate LEARNING_RATE]
                       [--warmup_proportion WARMUP_PROPORTION] [--seed SEED]
                       [--save_steps SAVE_STEPS]
                       [--logging_steps LOGGING_STEPS]
                       [--weight_decay WEIGHT_DECAY] [--epochs EPOCHS]
                       [--max_seq_length MAX_SEQ_LENGTH]
                       [--saved_dir SAVED_DIR] [--max_grad_norm MAX_GRAD_NORM]
                       [--save_best_model] [--margin MARGIN] [--scale SCALE]
                       [--pooling_mode {linear,cls,mean}]
                       [--output_emb_size OUTPUT_EMB_SIZE] [--mean_loss]

The options are the same as for SimCSE. Part of the training logs:

          ......
2022-05-27 13:20:48.428 | INFO     | __main__:train:318 - global step: 7220, epoch: 3, batch: 1888, loss: 0.73655, speed: 6.70 step/s
2022-05-27 13:20:51.454 | INFO     | __main__:train:318 - global step: 7240, epoch: 3, batch: 1908, loss: 0.70207, speed: 6.61 step/s
2022-05-27 13:20:54.308 | INFO     | __main__:train:318 - global step: 7260, epoch: 3, batch: 1928, loss: 1.10231, speed: 7.01 step/s
2022-05-27 13:20:57.107 | INFO     | __main__:train:318 - global step: 7280, epoch: 3, batch: 1948, loss: 0.94975, speed: 7.15 step/s
2022-05-27 13:20:59.898 | INFO     | __main__:train:318 - global step: 7300, epoch: 3, batch: 1968, loss: 0.34252, speed: 7.17 step/s
2022-05-27 13:21:00.322 | INFO     | __main__:train:324 - spearman corr: 0.6950, pearson corr: 0.6801
2022-05-27 13:21:03.168 | INFO     | __main__:train:318 - global step: 7320, epoch: 3, batch: 1988, loss: 1.10022, speed: 7.20 step/s
2022-05-27 13:21:05.929 | INFO     | __main__:train:318 - global step: 7340, epoch: 3, batch: 2008, loss: 1.00207, speed: 7.25 step/s
2022-05-27 13:21:08.687 | INFO     | __main__:train:318 - global step: 7360, epoch: 3, batch: 2028, loss: 0.72985, speed: 7.25 step/s
2022-05-27 13:21:11.372 | INFO     | __main__:train:318 - global step: 7380, epoch: 3, batch: 2048, loss: 0.88964, speed: 7.45 step/s
2022-05-27 13:21:14.090 | INFO     | __main__:train:318 - global step: 7400, epoch: 3, batch: 2068, loss: 0.70836, speed: 7.36 step/s
2022-05-27 13:21:14.520 | INFO     | __main__:train:324 - spearman corr: 0.6922, pearson corr: 0.6764
          ......

Using the model obtained above as a warm start, In-batch negatives training continues on the sentence dataset data/batchneg/domain_finetune.txt:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_batchneg.py --pretrained_model_name_or_path uer/chinese_roberta_L-6_H-128 --init_from_ckpt ./checkpoint/pytorch_model.bin --train_data_file ./data/batchneg/domain_finetune.txt --dev_data_file ./data/STS-B/sts-b-dev.txt --learning_rate 1e-5 --epochs 1 --margin 0.2 --scale 20 --batch_size 32 --mean_loss

This yields a pretrained sentence embedding model with num_hidden_layers=6 and hidden_size=128:

model_name                        link
WangZeJun/batchneg-tiny-chinese   https://huggingface.co/WangZeJun/batchneg-tiny-chinese

5. Sequence Labeling

This project shows how a pretrained model such as BERT can be fine-tuned for sequence labeling. Taking Chinese named entity recognition as an example, models are trained and tested on four datasets: msra, ontonote4, resume, and weibo. The training and validation sets of each dataset are preprocessed into the following format, where each line is a JSON string containing the text and its labels.

          {"text":?["我",?"們",?"的",?"藏",?"品",?"中",?"有",?"幾",?"十",?"冊(cè)",?"為",?"北",?"京",?"圖",?"書(shū)",?"館",?"等",?"國(guó)",?"家",?"級(jí)",?"藏",?"館",?"所",?"未",?"藏",?"。"],?"label":?["O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"B-NS",?"I-NS",?"I-NS",?"I-NS",?"I-NS",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O"]}
          {"text":?["由",?"于",?"這",?"一",?"時(shí)",?"期",?"戰(zhàn)",?"爭(zhēng)",?"頻",?"繁",?",",?"條",?"件",?"艱",?"苦",?",",?"又",?"遭",?"國(guó)",?"民",?"黨",?"毀",?"禁",?",",?"傳",?"世",?"量",?"稀",?"少",?",",?"購(gòu)",?"藏",?"不",?"易",?"。"],?"label":?["O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"B-NT",?"I-NT",?"I-NT",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O",?"O"]}

The following command runs single-machine multi-GPU distributed training with the BERT+Linear structure on the msra dataset and evaluates on the validation set:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_ner.py --train_data_file ./data/ner/msra/train.json --dev_data_file ./data/ner/msra/dev.json --label_file ./data/ner/msra/labels.txt --tag bios --learning_rate 5e-5 --save_best_model --batch_size 32

Supported configuration options:

usage: run_ner.py [-h] [--local_rank LOCAL_RANK]
                  [--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH]
                  [--init_from_ckpt INIT_FROM_CKPT] --train_data_file
                  TRAIN_DATA_FILE [--dev_data_file DEV_DATA_FILE] --label_file
                  LABEL_FILE [--tag {bios,bio}] [--batch_size BATCH_SIZE]
                  [--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}]
                  [--learning_rate LEARNING_RATE]
                  [--crf_learning_rate CRF_LEARNING_RATE]
                  [--warmup_proportion WARMUP_PROPORTION] [--seed SEED]
                  [--save_steps SAVE_STEPS] [--logging_steps LOGGING_STEPS]
                  [--weight_decay WEIGHT_DECAY] [--epochs EPOCHS]
                  [--max_seq_length MAX_SEQ_LENGTH] [--saved_dir SAVED_DIR]
                  [--max_grad_norm MAX_GRAD_NORM] [--save_best_model]
                  [--use_crf]

Most options are the same as in text classification; the task-specific ones are:

• tag: Optional. Entity tagging scheme; both bios and bio are supported. Default: bios.

• use_crf: Optional. Whether to add a CRF layer; when --use_crf is passed, the BERT+CRF model structure is used, otherwise BERT+Linear. (See the sketch after this list.)

• crf_learning_rate: Optional. Initial learning rate for the CRF parameters. Default: 5e-5.
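For reference, the BERT+Linear structure corresponds to standard token classification; a hedged sketch using the Hugging Face head (the repo implements its own model, and the BERT+CRF variant additionally decodes the emission scores with a CRF layer trained at crf_learning_rate):

import torch
from transformers import BertForTokenClassification, BertTokenizerFast

# num_labels comes from labels.txt; 7 here is just an illustrative count.
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=7)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

chars = ["我", "們", "在", "北", "京"]                     # already character-split, as in the data files
inputs = tokenizer(chars, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                        # [1, seq_len, num_labels]
pred_ids = logits.argmax(-1)[0, 1:-1]                      # strip [CLS]/[SEP]: one tag id per character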

Part of the training logs:

2022-05-27 15:56:59.043 | INFO     | __main__:train:355 - global step: 20, epoch: 1, batch: 20, loss: 0.20780, speed: 2.10 step/s
2022-05-27 15:57:08.723 | INFO     | __main__:train:355 - global step: 40, epoch: 1, batch: 40, loss: 0.09440, speed: 2.07 step/s
2022-05-27 15:57:18.001 | INFO     | __main__:train:355 - global step: 60, epoch: 1, batch: 60, loss: 0.05570, speed: 2.16 step/s
2022-05-27 15:57:27.357 | INFO     | __main__:train:355 - global step: 80, epoch: 1, batch: 80, loss: 0.02468, speed: 2.14 step/s
2022-05-27 15:57:36.994 | INFO     | __main__:train:355 - global step: 100, epoch: 1, batch: 100, loss: 0.05032, speed: 2.08 step/s
2022-05-27 15:57:53.299 | INFO     | __main__:train:362 - eval loss: 0.03203, F1: 0.86481
2022-05-27 15:58:03.264 | INFO     | __main__:train:355 - global step: 120, epoch: 1, batch: 120, loss: 0.04150, speed: 2.16 step/s
2022-05-27 15:58:12.712 | INFO     | __main__:train:355 - global step: 140, epoch: 1, batch: 140, loss: 0.04907, speed: 2.12 step/s
2022-05-27 15:58:21.959 | INFO     | __main__:train:355 - global step: 160, epoch: 1, batch: 160, loss: 0.01224, speed: 2.16 step/s
2022-05-27 15:58:31.039 | INFO     | __main__:train:355 - global step: 180, epoch: 1, batch: 180, loss: 0.01846, speed: 2.20 step/s
2022-05-27 15:58:40.542 | INFO     | __main__:train:355 - global step: 200, epoch: 1, batch: 200, loss: 0.06604, speed: 2.10 step/s
2022-05-27 15:58:56.831 | INFO     | __main__:train:362 - eval loss: 0.02589, F1: 0.89128
2022-05-27 15:59:07.813 | INFO     | __main__:train:355 - global step: 220, epoch: 1, batch: 220, loss: 0.07066, speed: 2.15 step/s
2022-05-27 15:59:16.857 | INFO     | __main__:train:355 - global step: 240, epoch: 1, batch: 240, loss: 0.03061, speed: 2.21 step/s
2022-05-27 15:59:26.240 | INFO     | __main__:train:355 - global step: 260, epoch: 1, batch: 260, loss: 0.01680, speed: 2.13 step/s
2022-05-27 15:59:35.568 | INFO     | __main__:train:355 - global step: 280, epoch: 1, batch: 280, loss: 0.01245, speed: 2.14 step/s
2022-05-27 15:59:44.684 | INFO     | __main__:train:355 - global step: 300, epoch: 1, batch: 300, loss: 0.02699, speed: 2.19 step/s
2022-05-27 16:00:00.977 | INFO     | __main__:train:362 - eval loss: 0.01928, F1: 0.92157

To train with the BERT+CRF structure, run:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_ner.py --train_data_file ./data/ner/msra/train.json --dev_data_file ./data/ner/msra/dev.json --label_file ./data/ner/msra/labels.txt --tag bios --learning_rate 5e-5 --save_best_model --batch_size 32 --use_crf --crf_learning_rate 1e-4

F1 scores on the different validation sets:

Model         Msra      Resume    Ontonote   Weibo
BERT+Linear   0.94179   0.95643   0.80206    0.70588
BERT+CRF      0.94265   0.95818   0.80257    0.72215

Msra, Resume, and Ontonote were trained for 3 epochs and Weibo for 5 epochs; logging_steps and save_steps were set to 10 for Resume, Ontonote, and Weibo; for all datasets the initial learning rate was 5e-5 for the BERT parameters and 1e-4 for the CRF parameters, with batch_size 32.

