
[Machine Learning] Hands-On Breast Cancer Prediction and Classification with Three Tree Models

About 10,575 characters; estimated reading time: 22 minutes

2022-06-09 08:49


Official account: 尤而小屋
Author: Peter
Editor: Peter

Today we bring you a new article on modeling a UCI dataset.

Starting from exploratory data analysis, and after feature engineering and class-imbalance handling, this article uses a decision tree, a random forest, and a gradient boosting tree to analyze and build predictive models on a breast cancer dataset.

Keywords: correlation, decision tree, random forest, dimensionality reduction, one-hot encoding, breast cancer

Dataset

The data comes from the UCI repository. It is a fairly old dataset, used mainly for classification problems, and can be downloaded for your own practice:

          https://archive.ics.uci.edu/ml/datasets/breast+cancer

Import Libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import plotly_express as px
import plotly.graph_objects as go

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.tree import export_graphviz
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import metrics

from sklearn.model_selection import train_test_split

Import Data

The data comes from the UCI website and can be read directly once downloaded locally. The only catch is that it is a .data file with no header row, so we have to supply the column names ourselves (found online):

# from UCI

df = pd.read_table("breast-cancer.data",
                   sep=",",
                   names=["Class","age","menopause","tumor-size","inv-nodes",
                          "node-caps","deg-malig","breast","breast-quad","irradiat"])

df

Basic Information

          In [3]:

df.dtypes   # field types

          Out[3]:

All fields are of object type except one, deg-malig, which is int64.

Class          object
age            object
menopause      object
tumor-size     object
inv-nodes      object
node-caps      object
deg-malig       int64
breast         object
breast-quad    object
irradiat       object
dtype: object

          In [4]:

df.isnull().sum()  # missing values

          Out[4]:

The data is fairly complete: there are no null values (although, as we will see, "?" is used as a missing-value marker in two fields).

Class          0
age            0
menopause      0
tumor-size     0
inv-nodes      0
node-caps      0
deg-malig      0
breast         0
breast-quad    0
irradiat       0
dtype: int64

          In [5]:

## field descriptions

columns = df.columns.tolist()
columns

          Out[5]:

['Class',
 'age',
 'menopause',
 'tumor-size',
 'inv-nodes',
 'node-caps',
 'deg-malig',
 'breast',
 'breast-quad',
 'irradiat']

The meaning and value range of each field are listed below (a quick cross-check against the file itself follows the table):

Attribute      Meaning                       Value range
Class          recurrence or not             no-recurrence-events, recurrence-events
age            age group                     10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99
menopause      menopause status              lt40 (menopause before 40), ge40 (menopause at 40 or later), premeno (premenopausal)
tumor-size     tumor size                    0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59
inv-nodes      number of involved nodes      0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39
node-caps      node caps present             yes, no
deg-malig      degree of malignancy          1, 2, 3
breast         breast side                   left, right
breast-quad    breast quadrant               left-up, left-low, right-up, right-low, central
irradiat       irradiation received          yes, no
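To cross-check these ranges against the actual file, a quick pass over the categorical columns can be run. This is a minimal sketch that was not part of the original notebook; it will also reveal the "?" markers handled in the next section.

# List the distinct values of every non-numeric column (sketch; not in the original notebook)
for col in df.columns:
    if df[col].dtype == "object":
        print(col, sorted(df[col].unique()))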

Remove Missing Values

          In [6]:

In two of the fields, "?" is used as the missing-value marker, so we simply keep the rows that do not contain it:

df = df[(df["node-caps"] != "?") & (df["breast-quad"] != "?")]

len(df)

          Out[6]:

          277

Field Processing

          In [7]:

from sklearn.preprocessing import LabelEncoder

Age group (age)

          In [8]:

age = df["age"].value_counts().reset_index()
age.columns = ["age_group", "count"]

age

Most of the patients in the data are in the 40-59 age range. Apply one-hot encoding to the age groups:

df = df.join(pd.get_dummies(df["age"]))
df.drop("age", axis=1, inplace=True)
df.head()

Menopause (menopause)

          In [11]:

menopause = df["menopause"].value_counts().reset_index()
menopause

          Out[11]:


     index  menopause
0  premeno        149
1     ge40        123
2     lt40          5

          In [12]:

fig = px.pie(menopause, names="index", values="menopause")

fig.update_traces(
    textposition='inside',
    textinfo='percent+label')

fig.show()
df = df.join(pd.get_dummies(df["menopause"]))  # one-hot encoding

df.drop("menopause", axis=1, inplace=True)

Tumor size (tumor-size)

          In [14]:

tumor_size = df["tumor-size"].value_counts().reset_index()

tumor_size

          Out[14]:


    index  tumor-size
0   30-34          57
1   25-29          51
2   20-24          48
3   15-19          29
4   10-14          28
5   40-44          22
6   35-39          19
7     0-4           8
8   50-54           8
9     5-9           4
10  45-49           3

          In [15]:

fig = px.bar(tumor_size,
             x="index",
             y="tumor-size",
             color="tumor-size",
             text="tumor-size")

fig.show()
df = df.join(pd.get_dummies(df["tumor-size"]))

df.drop("tumor-size", axis=1, inplace=True)

          In [18]:

df = df.join(pd.get_dummies(df["inv-nodes"]))

df.drop("inv-nodes", axis=1, inplace=True)

Node caps (node-caps)

          In [19]:

          df["node-caps"].value_counts()

          Out[19]:

no     221
yes     56
Name: node-caps, dtype: int64

          In [20]:

df = df.join(pd.get_dummies(df["node-caps"]).rename(columns={"no": "node_capes_no", "yes": "node_capes_yes"}))

df.drop("node-caps", axis=1, inplace=True)

Degree of malignancy (deg-malig)

          In [21]:

          df["deg-malig"].value_counts()

          Out[21]:

2    129
3     82
1     66
Name: deg-malig, dtype: int64

Breast side (breast)

          In [22]:

          df["breast"].value_counts()

          Out[22]:

left     145
right    132
Name: breast, dtype: int64

          In [23]:

df = df.join(pd.get_dummies(df["breast"]))

df.drop("breast", axis=1, inplace=True)

          ...

Recurrence (Class)

This is the target variable we ultimately want to predict, so the text labels need to be converted into 0/1 numeric values.

          In [29]:

dic = {"no-recurrence-events": 0, "recurrence-events": 1}
df["Class"] = df["Class"].map(dic)  # apply the mapping

df

Counts of recurrent vs. non-recurrent cases:

sns.countplot(x=df['Class'], label="Count")  # newer seaborn versions require the keyword argument

plt.show()

Handling Class Imbalance

          In [31]:

# class distribution

df["Class"].value_counts()

          Out[31]:

0    196
1     81
Name: Class, dtype: int64

          In [32]:

from imblearn.over_sampling import SMOTE

          In [33]:

X = df.iloc[:, 1:]
y = df.iloc[:, 0]

y.head()

          Out[33]:

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

          In [34]:

groupby_df = df.groupby('Class').count()

# class distribution of the original dataset
groupby_df
model_smote = SMOTE()

x_smote_resampled, y_smote_resampled = model_smote.fit_resample(X, y)
x_smoted = pd.DataFrame(x_smote_resampled,
                        columns=df.columns.tolist()[1:])
y_smoted = pd.DataFrame(y_smote_resampled,
                        columns=['Class'])
df_smoted = pd.concat([x_smoted, y_smoted], axis=1)
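To confirm that the oversampling actually balanced the two classes, a quick check can be added here; this small sketch was not part of the original notebook, but after SMOTE both classes should have 196 samples each:

# Class counts after SMOTE (sketch)
print(df_smoted["Class"].value_counts())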

Modeling

Correlation

Analyze the correlation between each of the new fields and the target variable (a sketch extracting the strongest correlations follows the heatmap):

          In [36]:

corr = df_smoted.corr()
corr.head()

Plot the correlation heatmap:

fig = plt.figure(figsize=(12,8))
sns.heatmap(corr)

plt.show()
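With this many one-hot columns the heatmap is hard to read. A compact alternative, sketched here and not part of the original article, is to list the features whose correlation with Class is largest in absolute value:

# Top 10 features most correlated (in absolute value) with the target (sketch)
top_corr = corr["Class"].drop("Class").abs().sort_values(ascending=False).head(10)
print(top_corr)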

Train/Test Split

          In [38]:

X = df_smoted.iloc[:, :-1]
y = df_smoted.iloc[:, -1]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=123)

Decision Tree

          In [39]:

dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)

          Out[39]:

          DecisionTreeClassifier(max_depth=5)

          In [40]:

# prediction
y_prob = dt.predict_proba(X_test)[:, 1]

# convert predicted probabilities into 0/1 classes
y_pred = np.where(y_prob > 0.5, 1, 0)
dt.score(X_test, y_pred)  # note: this scores the model's predictions against themselves, so it is trivially 1.0

          Out[40]:

          1.0
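Because score was fed the model's own thresholded predictions, the 1.0 above says nothing about generalization. The accuracy against the true test labels can be computed as in the sketch below; it should roughly match the accuracy in the classification report further down (about 0.66):

# Accuracy against the true labels (sketch; expected to be well below 1.0)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))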

          In [41]:

# confusion matrix

confusion_matrix(y_test, y_pred)

          Out[41]:

array([[29,  8],
       [19, 23]])

          In [42]:

# classification report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.60      0.78      0.68        37
           1       0.74      0.55      0.63        42

    accuracy                           0.66        79
   macro avg       0.67      0.67      0.66        79
weighted avg       0.68      0.66      0.65        79

          In [43]:

# ROC AUC (computed from the hard 0/1 predictions)
metrics.roc_auc_score(y_test, y_pred)

          Out[43]:

          0.6657014157014157

          In [44]:

# ROC curve

from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))  # canvas
plt.title('ROC')  # title

plt.plot(false_positive_rate,  # plot the curve
         true_positive_rate,
         color='red',
         label='AUC = %0.2f' % roc_auc)

plt.legend(loc='lower right')  # legend position
plt.plot([0, 1], [0, 1], linestyle='--')  # diagonal reference line

plt.axis('tight')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Random Forest

          In [45]:

rf = RandomForestClassifier(max_depth=5)
rf.fit(X_train, y_train)
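The article jumps straight to gradient boosting here without showing the random-forest evaluation. Following the same procedure used for the decision tree, it would look roughly like the sketch below (variable names rf_prob and rf_pred are illustrative, not from the original; the summary later quotes an AUC of about 0.81 for this model):

# Evaluate the random forest the same way as the decision tree (sketch)
rf_prob = rf.predict_proba(X_test)[:, 1]
rf_pred = np.where(rf_prob > 0.5, 1, 0)

print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred))
print(metrics.roc_auc_score(y_test, rf_prob))  # probability-based AUC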

Gradient Boosting Tree

          In [50]:

from sklearn.ensemble import GradientBoostingClassifier

          In [51]:

gbc = GradientBoostingClassifier(loss='deviance',  # 'deviance' was renamed to 'log_loss' in newer scikit-learn versions
                                 learning_rate=0.1,
                                 n_estimators=5,
                                 subsample=1,
                                 min_samples_split=2,
                                 min_samples_leaf=1,
                                 max_depth=3)

gbc.fit(X_train, y_train)

          Out[51]:

GradientBoostingClassifier(n_estimators=5, subsample=1)
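The original article does not show a numeric evaluation of the gradient boosting model before plotting its ROC curve. A sketch following the same pattern as before (gbc_prob and gbc_pred are illustrative names, not from the original):

# Evaluate the gradient boosting model (sketch)
gbc_prob = gbc.predict_proba(X_test)[:, 1]
gbc_pred = np.where(gbc_prob > 0.5, 1, 0)

print(classification_report(y_test, gbc_pred))
print(metrics.roc_auc_score(y_test, gbc_prob))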

          In [55]:

# ROC curve

from sklearn.metrics import roc_curve, auc
y_prob = gbc.predict_proba(X_test)[:, 1]  # recompute probabilities from the gradient boosting model (the original reused the decision tree's y_prob)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))  # canvas
plt.title('ROC')  # title

plt.plot(false_positive_rate,  # plot the curve
         true_positive_rate,
         color='red',
         label='AUC = %0.2f' % roc_auc)

plt.legend(loc='lower right')  # legend position
plt.plot([0, 1], [0, 1], linestyle='--')  # diagonal reference line

plt.axis('tight')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

PCA Dimensionality Reduction

Dimensionality reduction process

          In [56]:

from sklearn.decomposition import PCA
pca = PCA(n_components=17)
pca.fit(X)

# explained variance ratio of each of the 17 retained components
print(pca.explained_variance_ratio_)
[0.17513053 0.12941834 0.11453698 0.07323991 0.05889187 0.05690304
 0.04869476 0.0393374  0.03703477 0.03240863 0.03062932 0.02574137
 0.01887462 0.0180381  0.01606983 0.01453912 0.01318003]

          In [57]:

sum(pca.explained_variance_ratio_)

          Out[57]:

          0.9026686181152915
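The 17 components were chosen so that the cumulative explained variance exceeds 90%. Instead of hard-coding the number, scikit-learn's PCA also accepts a fraction between 0 and 1 and picks the smallest number of components that reaches that threshold; a small sketch of that alternative (not used in the original article):

# Alternative: let PCA choose the number of components for a target variance ratio (sketch)
pca_90 = PCA(n_components=0.90)
X_90 = pca_90.fit_transform(X)
print(pca_90.n_components_, X_90.shape)  # number of retained components and the reduced shape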

Data after dimensionality reduction

          In [58]:

X_NEW = pca.transform(X)
X_NEW

          Out[58]:

array([[ 1.70510215e-01,  5.39929099e-01, -1.04314303e+00, ...,
        -2.26541223e-01, -6.39332871e-02, -8.97923150e-02],
       [-9.01105403e-01,  8.01693088e-01,  5.92260258e-01, ...,
         9.66299251e-02,  1.40755806e-03, -2.74626972e-01],
       [-6.05200264e-01,  6.08455330e-01, -1.00524376e+00, ...,
         4.11416630e-02,  4.15705282e-02, -8.46941345e-02],
       ...,
       [ 1.40652211e-02,  5.35906106e-01,  5.64150123e-02, ...,
         1.70834934e-01,  7.11616391e-02, -1.72250445e-01],
       [-4.41363597e-01,  9.11950641e-01, -4.22184256e-01, ...,
        -4.13385344e-02, -7.64405982e-02,  1.04686148e-01],
       [ 1.98533663e+00, -4.74547396e-01, -1.52557494e-01, ...,
         2.72194184e-02,  5.71553613e-02,  1.78074886e-01]])

          In [59]:

          X_NEW.shape

          Out[59]:

(392, 17)

Re-split the data

          In [60]:

X_train, X_test, y_train, y_test = train_test_split(X_NEW, y, test_size=0.20, random_state=123)

Random forest again

          In [61]:

rf = RandomForestClassifier(max_depth=5)
rf.fit(X_train, y_train)

          Out[61]:

          RandomForestClassifier(max_depth=5)

          In [62]:

# prediction
y_prob = rf.predict_proba(X_test)[:, 1]

# convert predicted probabilities into 0/1 classes
y_pred = np.where(y_prob > 0.5, 1, 0)
rf.score(X_test, y_pred)  # again trivially 1.0, since the predictions are compared with themselves

          Out[62]:

          1.0

          In [63]:

# confusion matrix

confusion_matrix(y_test, y_pred)

          Out[63]:

array([[26, 11],
       [13, 29]])

          In [64]:

# ROC AUC (computed from the hard 0/1 predictions)
metrics.roc_auc_score(y_test, y_pred)

          Out[64]:

          0.6965894465894465
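With only 392 samples after SMOTE, a single 80/20 split gives a fairly noisy estimate. A cross-validated AUC, sketched below and not part of the original article, would give a more stable picture (strictly speaking, SMOTE and PCA were fit on all the data, so some information still leaks across folds):

# 5-fold cross-validated AUC for the random forest on the PCA-reduced data (sketch)
from sklearn.model_selection import cross_val_score
cv_auc = cross_val_score(RandomForestClassifier(max_depth=5), X_NEW, y, cv=5, scoring="roc_auc")
print(cv_auc.mean(), cv_auc.std())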

          In [65]:

# ROC curve

from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title('ROC')

plt.plot(false_positive_rate,
         true_positive_rate,
         color='red',
         label='AUC = %0.2f' % roc_auc)

plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], linestyle='--')

plt.axis('tight')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Summary

Starting from data preprocessing and feature engineering, and comparing the different tree models we built, the random forest performed best, with an AUC as high as 0.81. After a simple dimensionality reduction we kept the first 17 principal components, which together explain more than 90% of the variance; rebuilding the model on this reduced data raised the AUC to 0.83.
