日天天天天日,五月丁香婷婷乱伦,在线免费观看黄色视频,亚洲操操,操熟女视频,日欧免费视频,高清久久无码视频,成人在线偷拍视频

↑↑↑關(guān)注后"星標(biāo)"Datawhale

?Datawhale干貨?

作者：徐美蘭-lan，華南理工大學(xué)

摘要：這是從0實(shí)踐競賽第3篇，數(shù)據(jù)競賽對理論實(shí)踐和增加履歷有益，為了讓初學(xué)者也能入門，本文以全球人工智能技術(shù)大賽預(yù)熱賽-中文預(yù)訓(xùn)練模型賽事為背景，梳理了nlp賽事的完整實(shí)踐流程。

今天以天池中文預(yù)訓(xùn)練模型泛化能力賽事為背景，介紹一種適用入門nlp的Baseline：如何使用bert簡單、快速地完成多任務(wù)多目標(biāo)分類。具體目錄如下：

0. 賽事背景

大賽名稱：全球人工智能技術(shù)創(chuàng)新大賽【熱身賽二】-?中文預(yù)訓(xùn)練模型泛化能力賽事

大賽地址：
https://tianchi.aliyun.com/s/3bd272d942f97725286a8e44f40f3f74（或文末閱讀原文）

大賽類型：自然語言處理、預(yù)訓(xùn)練模型

1. 賽題分析

1.本次賽題為數(shù)據(jù)挖掘類型，通過預(yù)訓(xùn)練模型調(diào)優(yōu)進(jìn)行分類。

2.是一個典型的多任務(wù)多分類問題。

3.主要應(yīng)用keras_bert，以及pandas、numpy、matplotlib、seabon、sklearn、keras等數(shù)據(jù)挖掘常用庫或者框架來進(jìn)行數(shù)據(jù)挖掘任務(wù)。

4.賽題禁止人工標(biāo)注；微調(diào)階段不得使用外部數(shù)據(jù)；三個任務(wù)只能共用一個bert；只能單折訓(xùn)練。

2. 數(shù)據(jù)概況

賽題精選了3個具有代表性的任務(wù)，要求選手提交的模型能夠同時預(yù)測每個任務(wù)對應(yīng)的標(biāo)簽。數(shù)據(jù)下載地址：https://tianchi.aliyun.com/s/3bd272d942f97725286a8e44f40f3f74

數(shù)據(jù)格式：

任務(wù)1：OCNLI–中文原版自然語言推理，包含3個類別

ocnli_train.head(3)
'''
id	content1	content2	label
0	0	一月份跟二月份肯定有一個月份有.	肯定有一個月份有   0
1	1	一月份跟二月份肯定有一個月份有.	一月份有   1
2	2	一月份跟二月份肯定有一個月份有.	一月二月都沒有   2
'''
len(ocnli_train['label'].unique())
#3

任務(wù)2：OCEMOTION–中文情感分類，包含7個類別

ocemo_train.head(3)
'''
id	content	label
0	0	'你知道多倫多附近有什么嗎?哈哈有破布耶...真的書上寫的你聽哦...你家那塊破布是世界上最...	sadness
1	1	平安夜,圣誕節(jié),都過了,我很難過,和媽媽吵了兩天,以死相逼才終止戰(zhàn)爭,現(xiàn)在還處于冷戰(zhàn)中。	sadness
2	2	我只是自私了一點(diǎn),做自己想做的事情!	sadness
'''
len(ocemo_train['label'].unique())
#7

任務(wù)3：TNEWS–今日頭條新聞標(biāo)題分類，包含15個類別

times_train.head(3)
'''
id	content	label
0	0	上課時學(xué)生手機(jī)響個不停,老師一怒之下把手機(jī)摔了,家長拿發(fā)票讓老師賠,大家怎么看待這種事?	108
1	1	商贏環(huán)球股份有限公司關(guān)于延期回復(fù)上海證券交易所對公司2017年年度報告的事后審核問詢函的公告	104
2	2	通過中介公司買了二手房,首付都付了,現(xiàn)在賣家不想賣了。怎么處理?	106
'''
len(times_train['label'].unique())
#15

3. 代碼實(shí)踐

Step 1:環(huán)境準(zhǔn)備

導(dǎo)入相關(guān)包

import pandas as pd
import codecs, gc
import numpy as np
from sklearn.model_selection import KFold
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.metrics import top_k_categorical_accuracy
from keras.layers import *
from keras.callbacks import *
from keras.models import Model
import keras.backend as K
from keras.optimizers import Adam
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

如果在google colab上運(yùn)行代碼，需要先將數(shù)據(jù)上傳至driver上。執(zhí)行以下代碼掛在driver并配置相關(guān)環(huán)境。

from google.colab import drive

drive.mount('/content/drive')

'''
路徑說明：
../code #保存代碼
../data #保存數(shù)據(jù)
../subs #保存數(shù)據(jù)
../chinese_roberta_wwm_large_ext_L-24_H-1024_A-16 #bert路徑
'''

pip install keras-bert

Step 2:數(shù)據(jù)讀取

path = "/content/drive/My Drive/天池nlp預(yù)訓(xùn)練/"

#將ocnli中content1[0:maxlentext1]+content2作為ocnli任務(wù)的content
times_train = pd.read_csv(path + '/data/TNEWS_train1128.csv',  sep='\t', header=None, names=('id', 'content', 'label')).astype(str)
ocemo_train = pd.read_csv(path + '/data/OCEMOTION_train1128.csv',sep='\t', header=None, names=('id', 'content', 'label')).astype(str)
ocnli_train = pd.read_csv(path + '/data/OCNLI_train1128.csv',  sep='\t', header=None, names=('id', 'content1', 'content2', 'label')).astype(str)
ocnli_train['content'] = ocnli_train['content1'] + ocnli_train['content2']#.apply( lambda x: x[:maxlentext1] )

times_testa = pd.read_csv(path + '/data/TNEWS_a.csv',  sep='\t', header=None, names=('id', 'content')).astype(str)
ocemo_testa = pd.read_csv(path + '/data/OCEMOTION_a.csv',sep='\t', header=None, names=('id', 'content')).astype(str)
ocnli_testa = pd.read_csv(path + '/data/OCNLI_a.csv',  sep='\t', header=None, names=('id', 'content1', 'content2')).astype(str)
ocnli_testa['content'] = ocnli_testa['content1']+ ocnli_testa['content2']#.apply( lambda x: x[:maxlentext1] )

1) 數(shù)據(jù)集合并

分別將三個任務(wù)的content、label列按行concat在一起作為訓(xùn)練集和標(biāo)簽、測試集，以此簡單地將三任務(wù)轉(zhuǎn)化為單任務(wù)。

#合并三個任務(wù)的訓(xùn)練、測試數(shù)據(jù)
train_df = pd.concat([times_train, ocemo_train, ocnli_train[['id','content', 'label']]], axis=0).copy()

testa_df = pd.concat([times_testa, ocemo_testa, ocnli_testa[['id', 'content']]], axis=0).copy()

2)標(biāo)簽編碼

#LabelEncoder處理標(biāo)簽，因?yàn)閎ert輸入的label需要從0開始
encode_label = LabelEncoder()
train_df['label'] = encode_label.fit_transform(train_df['label'].apply(str))

3) 數(shù)據(jù)信息查看

train_df.info()
'''

Int64Index: 147453 entries, 0 to 48777
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   id       147453 non-null  object
 1   content  147453 non-null  object
 2   label    147453 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 4.5+ MB
'''

數(shù)據(jù)為id、content、label三列，無子句為空的行。

Step 3: 數(shù)據(jù)分析（EDA）

1) 子句長度統(tǒng)計分析

統(tǒng)計子句長度主要用于設(shè)置輸入bert的序列長度。

times_train['content'].str.len().describe(percentiles=[.95, .98, .99])\
,ocemo_train['content'].str.len().describe(percentiles=[.95, .98, .99])\
,ocnli_train['content1'].str.len().describe(percentiles=[.95, .98, .99])\
,ocnli_train['content2'].str.len().describe(percentiles=[.95, .98, .99])
'''
(count    63360.000000
 mean        22.171086
 std          7.334206
 min          2.000000
 50%         22.000000
 95%         33.000000
 98%         37.000000
 99%         39.000000
 max        145.000000
 Name: content, dtype: float64, count    35315.000000
 mean        48.214328
 std         84.391942
 min          3.000000
 50%         34.000000
 95%        134.000000
 98%        138.000000
 99%        142.000000
 max      12326.000000
 Name: content, dtype: float64, count    48778.000000
 mean        24.174607
 std         11.515428
 min          8.000000
 50%         22.000000
 95%         46.000000
 98%         49.000000
 99%         50.000000
 max         50.000000
 Name: content1, dtype: float64, count     48778.000000
 mean         15.828529
 std         977.396848
 min           2.000000
 50%          10.000000
 95%          21.000000
 98%          24.000000
 99%          27.000000
 max      215874.000000
 Name: content2, dtype: float64)
'''

從上可以看出，當(dāng)設(shè)置bert序列長度為142時即可覆蓋約99%子句的全部內(nèi)容。

2)統(tǒng)計標(biāo)簽的基本分布信息

train_df['label'].value_counts() / train_df.shape[0]
'''
1     0.113467
0     0.109940
17    0.107397
23    0.084603
21    0.060318
10    0.047771
6     0.041749
4     0.039918
13    0.039036
8     0.033292
3     0.032668
5     0.032268
11    0.029487
19    0.029481
9     0.027690
18    0.027588
12    0.027541
16    0.027460
22    0.027412
15    0.022923
7     0.016853
2     0.008993
24    0.006097
20    0.004001
14    0.002048
Name: label, dtype: float64
'''

由上可以看出，標(biāo)簽占比差距非常大。在拆分訓(xùn)練集與驗(yàn)證集時如果簡單地采用隨機(jī)拆分，可能會導(dǎo)致驗(yàn)證集不存在部分標(biāo)簽的情況。

Step 4: 預(yù)訓(xùn)練模型選擇

1)模型選擇

在眾多nlp預(yù)訓(xùn)練模型中，本文baseline選擇了哈工大與訊飛聯(lián)合發(fā)布的基于全詞遮罩（Whole Word Masking）技術(shù)的中文預(yù)訓(xùn)練模型：RoBERTa-wwm-ext-large。點(diǎn)擊以下鏈接了解更多詳細(xì)信息：

論文地址：https://arxiv.org/abs/1906.08101
開源模型地址：https://github.com/ymcui/Chinese-BERT-wwm
哈工大訊飛聯(lián)合實(shí)驗(yàn)室的項(xiàng)目介紹：https://mp.weixin.qq.com/s/EE6dEhvpKxqnVW_bBAKrnA

2)調(diào)優(yōu)參數(shù)配置

為方便調(diào)優(yōu)，在同一代碼塊中配置調(diào)優(yōu)的參數(shù)。

#一些調(diào)優(yōu)參數(shù)
er_patience = 2 #early_stopping patience
lr_patience = 5 #ReduceLROnPlateau patience
max_epochs  = 2 #epochs
lr_rate   = 2e-6#learning rate
batch_sz  = 4   #batch_size
maxlen    = 256 #設(shè)置序列長度為，base模型要保證序列長度不超過512
lr_factor = 0.85  #ReduceLROnPlateau factor
maxlentext1 = 200 #選擇ocnli子句一的長度
n_folds   = 10    #設(shè)置驗(yàn)證集的占比:1/n_folds

Step 5: 模型構(gòu)建

1)切分?jǐn)?shù)據(jù)集（Train,Val）進(jìn)行模型訓(xùn)練、評價

采用StratifiedKFold分層抽樣抽取10%的訓(xùn)練數(shù)據(jù)作為驗(yàn)證集。

###采用分層抽樣的方式，從訓(xùn)練集中抽取10%作為驗(yàn)證機(jī)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=222)

X_trn = pd.DataFrame()
X_val = pd.DataFrame()

for train_index, test_index in skf.split(train_df.copy(), train_df['label']):
  X_trn, X_val = train_df.iloc[train_index], train_df.iloc[test_index]
  break#不能多折訓(xùn)練

采用f1值做為評價指標(biāo)，當(dāng)評價指標(biāo)不在提升時，降低學(xué)習(xí)率。

from keras import backend as K

def f1(y_true, y_pred):
  def recall(y_true, y_pred):
    """Recall metric.

    Only computes a batch-wise average of recall.

    Computes the recall, a metric for multi-label classification of
    how many relevant items are selected.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

  def precision(y_true, y_pred):
    """Precision metric.

    Only computes a batch-wise average of precision.

    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision
  precision = precision(y_true, y_pred)
  recall = recall(y_true, y_pred)
  return 2*((precision*recall)/(precision+recall+K.epsilon()))

2)構(gòu)造輸入bert的數(shù)據(jù)格式

#標(biāo)簽類別個數(shù)
n_cls = len( train_df['label'].unique() )

#訓(xùn)練數(shù)據(jù)、測試數(shù)據(jù)和標(biāo)簽轉(zhuǎn)化為模型輸入格式
#訓(xùn)練集每行的content、label轉(zhuǎn)為tuple存入list，再轉(zhuǎn)為numpy array
TRN_LIST = []
for data_row in X_trn.iloc[:].itertuples():
  TRN_LIST.append((data_row.content, to_categorical(data_row.label, n_cls)))
TRN_LIST = np.array(TRN_LIST)

#驗(yàn)證集每行的content、label轉(zhuǎn)為tuple存入list，再轉(zhuǎn)為numpy array
VAL_LIST = []
for data_row in X_val.iloc[:].itertuples():
  VAL_LIST.append((data_row.content, to_categorical(data_row.label, n_cls)))
VAL_LIST = np.array(VAL_LIST)

#測試集每行的content、label轉(zhuǎn)為tuple存入list，再轉(zhuǎn)為numpy array，其中l(wèi)abel全為0
DATA_LIST_TEST = []
for data_row in testa_df.iloc[:].itertuples():
  DATA_LIST_TEST.append((data_row.content, to_categorical(0, n_cls)))
DATA_LIST_TEST = np.array(DATA_LIST_TEST)

3)模型搭建

在bert后接一層Lambda層取出[CLS]對應(yīng)的向量，再接一層Dense層用于分類輸出。

#bert模型設(shè)置
def build_bert(nclass):
  global lr_rate
  bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)  #加載預(yù)訓(xùn)練模型

  for l in bert_model.layers:
    l.trainable = True

  x1_in = Input(shape=(None,))
  x2_in = Input(shape=(None,))

  x = bert_model([x1_in, x2_in])
  x = Lambda(lambda x: x[:, 0])(x) #取出[CLS]對應(yīng)的向量用來做分類
  p = Dense(nclass, activation='softmax')(x) #直接dense層softmax輸出

  model = Model([x1_in, x2_in], p)
  model.compile(loss='categorical_crossentropy',
                optimizer=Adam(lr_rate),    #選擇優(yōu)化器并設(shè)置學(xué)習(xí)率
                metrics=['accuracy', f1])

  print(model.summary())
  return model

4)模型訓(xùn)練

使用google colab 上的V100卡訓(xùn)練一個epoch需要約1.5小時，跑兩個epoch即可。

#模型訓(xùn)練函數(shù)
def run_nocv(nfold, trn_data, val_data, data_labels, data_test, n_cls):
  global er_patience
  global lr_patience
  global max_epochs
  global f1metrics
  global lr_factor
  test_model_pred = np.zeros((len(data_test), n_cls))

  model = build_bert(n_cls)
  #下行代碼用于加載保存的權(quán)重繼續(xù)訓(xùn)練
  #model.load_weights(path + '/subs/model.epoch01_val_loss0.9911_val_acc0.6445_val_f10.6276.hdf5')
  
  early_stopping = EarlyStopping(monitor= "val_f1", patience=er_patience) #早停法，防止過擬合 #'val_accuracy'
  plateau = ReduceLROnPlateau(monitor="val_f1", verbose=1, mode='max', factor=lr_factor, patience=lr_patience) #當(dāng)評價指標(biāo)不在提升時，降低學(xué)習(xí)率 
  checkpoint = ModelCheckpoint(path + "/subs/model.epoch{epoch:02d}_val_loss{val_loss:.4f}_val_acc{val_accuracy:.4f}_val_f1{val_f1:.4f}.hdf5", monitor="val_f1", verbose=2, save_best_only=True, mode='max', save_weights_only=True) #保存val_f1最好的模型權(quán)重

  #訓(xùn)練跟驗(yàn)證集可shuffle打亂，測試集不可打亂（否則在生成結(jié)果文件的時候沒法跟ID對應(yīng)上）
  train_D = data_generator(trn_data, shuffle=True)
  valid_D = data_generator(val_data, shuffle=True)
  test_D = data_generator(data_test, shuffle=False)
  
  #模型訓(xùn)練
  model.fit_generator(
      train_D.__iter__(),
      steps_per_epoch=len(train_D),
      epochs=max_epochs,
      validation_data=valid_D.__iter__(),
      validation_steps=len(valid_D),
      callbacks=[early_stopping, plateau, checkpoint],
  )
  #模型預(yù)測
  test_model_pred = model.predict_generator(test_D.__iter__(), steps=len(test_D), verbose=1)
  train_model_pred = test_model_pred#model.predict(train_D.__iter__(), steps=len(train_D), verbose=1)
 
  del model
  gc.collect()   #清理內(nèi)存
  K.clear_session() #clear_session就是清除一個session

  return test_model_pred, train_model_pred

調(diào)用上述函數(shù)進(jìn)行訓(xùn)練與預(yù)測。

cvs = 1
#輸出為numpy array格式的25列概率
test_model_pred, train_model_pred = run_nocv(cvs, TRN_LIST, VAL_LIST, None, DATA_LIST_TEST, n_cls)

5)輸出結(jié)果

#將結(jié)果轉(zhuǎn)為DataFrame格式
preds_tst_df = pd.DataFrame(test_model_pred)

#再將range(0,25)做encode_label逆變換作為該DataFrame的列名
preds_col_names = encode_label.inverse_transform( range(0,n_cls) )
preds_tst_df.columns = preds_col_names

#從每個任務(wù)對應(yīng)的概率標(biāo)簽列中找出最大的概率對應(yīng)的列名作為預(yù)測結(jié)果
'''
如ocnli任務(wù)的預(yù)測結(jié)果只能為0、1、2，那么從preds_tst_df中選擇0-1-2三列中每行概率最大的列名作為ocnli任務(wù)的測試集預(yù)測結(jié)果，其它兩個任務(wù)依此類推。
'''
times_preds = preds_tst_df.head(times_testa.shape[0])[times_train['label'].unique().tolist()]
times_preds = times_preds.eq(times_preds.max(1), axis=0).dot(times_preds.columns)

ocemo_preds = preds_tst_df.head(times_testa.shape[0] + ocemo_testa.shape[0]).tail(ocemo_testa.shape[0])[ocemo_train['label'].unique().tolist()]
ocemo_preds = ocemo_preds.eq(ocemo_preds.max(1), axis=0).dot(ocemo_preds.columns)

ocnli_preds = preds_tst_df.tail(ocnli_testa.shape[0])[ocnli_train['label'].unique().tolist()]
ocnli_preds = ocnli_preds.eq(ocnli_preds.max(1), axis=0).dot(ocnli_preds.columns)

#輸出任務(wù)tnews的預(yù)測結(jié)果
times_sub = times_testa[['id']].copy()
times_sub['label'] = times_preds.values
times_sub.to_json(path + "/subs/tnews_predict.json", orient='records', lines=True)
#輸出任務(wù)ocemo的預(yù)測結(jié)果
ocemo_sub = ocemo_testa[['id']].copy()
ocemo_sub['label'] = ocemo_preds.values
ocemo_sub.to_json(path + "/subs/ocemotion_predict.json", orient='records', lines=True)
#輸出任務(wù)ocnli的預(yù)測結(jié)果
ocnli_sub = ocnli_testa[['id']].copy()
ocnli_sub['label'] = ocnli_preds.values
ocnli_sub.to_json(path + "/subs/ocnli_predict.json", orient='records', lines=True)

6)線上評分

第一階段線上評分：0.6342。

希望能幫助你完整實(shí)踐一場NLP賽事。

“l(fā)an哥用心分享，點(diǎn)贊三連↓

從0梳理1場NLP賽事！