The 2021 Data Mining Competition Solution Is Here!

Background and Task
The electrocardiogram (ECG) is one of the most basic clinical examinations; being safe and convenient, it has become a powerful tool for diagnosing heart disease. Because ECG data and its diagnostic criteria are highly standardized, it is relatively easy to develop intelligent diagnosis algorithms for it using AI techniques. This exercise outputs a binary (normal vs. abnormal) classification label for each ECG record.
Competition page: http://ailab.aiwin.org.cn/competitions/64
Competition Data
The data is split into two parts: a training set with visible labels and a test set whose labels are withheld. The training set provides 1600 ECG records in MAT format together with their diagnostic labels ("normal" or "abnormal", in CSV format); the test set provides 400 ECG records in MAT format.
Data directory
DATA |- trainreference.csv   labels for the files under TRAIN
     |- TRAIN                training data
     |- VAL                  test data
Data format: 12-lead recordings stored as MATLAB-format files of shape (12, 5000), i.e. 10 s of valid data sampled at 500 Hz. See the code below for how to read them. Rows 0-11 are leads I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5 and V6; the unit is mV.
import scipy.io as sio
ecgdata = sio.loadmat("TEST0001.MAT")['ecgdata']
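As a quick sanity check, the loaded array should match the stated format (this just re-inspects the ecgdata array read above):

print(ecgdata.shape)  # (12, 5000): 12 leads x (500 Hz * 10 s) samples
lead_I = ecgdata[0]   # lead I, values in mV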
Format of trainreference.csv: one record per line, as filename,LABEL (0 = normal ECG, 1 = abnormal ECG).
Approach
The TextCNN model was proposed by Yoon Kim in the 2014 paper "Convolutional Neural Networks for Sentence Classification". Because CNNs are routinely used in computer vision to extract local feature maps, and do so very effectively, the author brought them into NLP for text classification, using convolutions to capture relationships between neighbouring words.
This exercise uses a TextCNN model to classify the ECG data: a 12 x 3000 ECG crop takes the place of the word-embedding matrix, and each convolution kernel spans the full time axis and 3-5 adjacent leads (see the model code below).

Ideas for Improvement
Use k-fold cross-validation to train several models and average their predictions on the test set. Add noise when reading the data, or apply mixup augmentation (a sketch follows below). Use a stronger model; TextCNN is still rather simple for this task.
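A minimal sketch of the augmentation ideas above; add_noise, mixup, the noise scale sigma, and alpha are illustrative choices, not part of the original solution:

import numpy as np

def add_noise(ecg, sigma=0.01):
    # additive Gaussian noise (in mV); sigma is an illustrative value
    return ecg + sigma * np.random.randn(*ecg.shape).astype(np.float32)

def mixup(ecg_a, tag_a, ecg_b, tag_b, alpha=0.2):
    # mixup: a convex combination of two records and of their labels
    lam = np.random.beta(alpha, alpha)
    return lam * ecg_a + (1 - lam) * ecg_b, lam * tag_a + (1 - lam) * tag_b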
Code
Data Loading
!rm -rf val train trainreference.csv 數(shù)據(jù)說(shuō)明.txt
!unzip 2021A_T2_Task1_數(shù)據(jù)集含訓(xùn)練集和測(cè)試集.zip > out.log
import codecs, glob, os
import numpy as np
import pandas as pd
import paddle
import paddle.nn as nn
from paddle.io import DataLoader, Dataset
import paddle.optimizer as optim
from paddlenlp.data import Pad
import scipy.io as sio

# read every MAT file in sorted order and reshape to (1, 12, 5000)
train_mat = glob.glob('./train/*.mat')
train_mat.sort()
train_mat = [sio.loadmat(x)['ecgdata'].reshape(1, 12, 5000) for x in train_mat]

test_mat = glob.glob('./val/*.mat')
test_mat.sort()
test_mat = [sio.loadmat(x)['ecgdata'].reshape(1, 12, 5000) for x in test_mat]

# labels: one row per file, cast to float32 for the BCE loss
train_df = pd.read_csv('trainreference.csv')
train_df['tag'] = train_df['tag'].astype(np.float32)
class MyDataset(Dataset):
    def __init__(self, mat, label, mat_dim=3000):
        super(MyDataset, self).__init__()
        self.mat = mat
        self.label = label
        self.mat_dim = mat_dim

    def __len__(self):
        return len(self.mat)

    def __getitem__(self, index):
        # random crop: take a random mat_dim-sample window out of the 5000 samples
        idx = np.random.randint(0, 5000 - self.mat_dim)
        return paddle.to_tensor(self.mat[index][:, :, idx:idx + self.mat_dim]), self.label[index]
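A quick way to confirm the random-crop behaviour (the crop start is random, so the shape is the main thing to check):

dataset = MyDataset(train_mat, paddle.to_tensor(train_df['tag'].values))
x, y = dataset[0]
print(x.shape)  # [1, 12, 3000]: a random 3000-sample window of the 5000-sample record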
Model
class TextCNN(paddle.nn.Layer):
    def __init__(self, kernel_num=30, kernel_size=[3, 4, 5], dropout=0.5):
        super(TextCNN, self).__init__()
        self.kernel_num = kernel_num
        self.kernel_size = kernel_size
        self.dropout = dropout
        # one Conv2D per kernel height; every kernel spans the full 3000-sample time axis
        self.convs = nn.LayerList([nn.Conv2D(1, self.kernel_num, (kernel_size_, 3000))
                for kernel_size_ in self.kernel_size])
        self.dropout = nn.Dropout(self.dropout)
        self.linear = nn.Linear(3 * self.kernel_num, 1)

    def forward(self, x):
        # x: [batch, 1, 12, 3000] -> per branch [batch, kernel_num, 12 - k + 1]
        convs = [nn.ReLU()(conv(x)).squeeze(3) for conv in self.convs]
        # global max pooling over the lead axis -> [batch, kernel_num] per branch
        pool_out = [nn.MaxPool1D(block.shape[2])(block).squeeze(2) for block in convs]
        pool_out = paddle.concat(pool_out, 1)
        # dropout on the pooled features before the classifier
        pool_out = self.dropout(pool_out)
        logits = self.linear(pool_out)
        return logits
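A dummy forward pass to confirm the shapes (random values, purely for shape checking):

dummy = paddle.randn([4, 1, 12, 3000])  # [batch, channel, leads, time]
print(TextCNN()(dummy).shape)           # [4, 1]: one logit per record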
model = TextCNN()
BATCH_SIZE = 30
EPOCHS = 200
LEARNING_RATE = 0.0005
device = paddle.device.get_device()
print(device)
gpu:0
Model Training
# hold out the last 100 records for validation
Train_Loader = DataLoader(MyDataset(train_mat[:-100], paddle.to_tensor(train_df['tag'].values[:-100])), batch_size=BATCH_SIZE, shuffle=True)
Val_Loader = DataLoader(MyDataset(train_mat[-100:], paddle.to_tensor(train_df['tag'].values[-100:])), batch_size=BATCH_SIZE, shuffle=True)
model = TextCNN()
optimizer = optim.Adam(parameters=model.parameters(), learning_rate=LEARNING_RATE)
criterion = nn.BCEWithLogitsLoss()
Test_best_Acc = 0
for epoch in range(0, EPOCHS):
    Train_Loss, Test_Loss = [], []
    Train_Acc, Test_Acc = [], []
    model.train()
    for i, (x, y) in enumerate(Train_Loader):
        if device.startswith('gpu'):
            x = x.cuda()
            y = y.cuda()
        pred = model(x)
        loss = criterion(pred, y)
        Train_Loss.append(loss.item())
        # threshold the sigmoid output at 0.5 to measure accuracy
        pred = (paddle.nn.functional.sigmoid(pred) > 0.5).astype(int)
        Train_Acc.append((pred.numpy() == y.numpy()).mean())
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
    model.eval()
    for i, (x, y) in enumerate(Val_Loader):
        if device.startswith('gpu'):
            x = x.cuda()
            y = y.cuda()

        pred = model(x)
        Test_Loss.append(criterion(pred, y).item())
        pred = (paddle.nn.functional.sigmoid(pred) > 0.5).astype(int)
        Test_Acc.append((pred.numpy() == y.numpy()).mean())
    print(
        "Epoch: [{}/{}] TrainLoss/TestLoss: {:.4f}/{:.4f} TrainAcc/TestAcc: {:.4f}/{:.4f}".format(
        epoch + 1, EPOCHS,
        np.mean(Train_Loss), np.mean(Test_Loss),
        np.mean(Train_Acc), np.mean(Test_Acc)
        )
    )
    # keep the checkpoint with the best validation accuracy
    if Test_best_Acc < np.mean(Test_Acc):
        print(f'Acc improved from {Test_best_Acc} to {np.mean(Test_Acc)}, saving model...')
        paddle.save(model.state_dict(), "model.pdparams")
        Test_best_Acc = np.mean(Test_Acc)
Prediction
Test_Loader = DataLoader(MyDataset(test_mat, paddle.to_tensor([0] * len(test_mat))),
                batch_size=BATCH_SIZE, shuffle=False)
layer_state_dict = paddle.load("model.pdparams")
model.set_state_dict(layer_state_dict)
model.eval()

# test-time augmentation: each pass crops a different random window,
# so average the sigmoid outputs over 10 passes
test_pred = np.zeros(len(test_mat))
for tta in range(10):
    test_pred_list = []
    for i, (x, y) in enumerate(Test_Loader):
        if device.startswith('gpu'):
            x = x.cuda()
            y = y.cuda()

        pred = model(x)
        test_pred_list.append(
            paddle.nn.functional.sigmoid(pred).numpy()
        )
    test_pred += np.vstack(test_pred_list)[:, 0]
    print(f'Test TTA {tta}')

test_pred /= 10

# recover the file names (without the .mat extension) in sorted order
test_path = glob.glob('./val/*.mat')
test_path = [os.path.basename(x)[:-4] for x in test_path]
test_path.sort()

test_answer = pd.DataFrame({
    'name': test_path,
    'tag': (test_pred > 0.5).astype(int)
})
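Finally, write the predictions to disk; answer.csv is an assumed file name here, check the competition page for the required submission format:

test_answer.to_csv('answer.csv', index=None)  # assumed submission file name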