
WeChat public account: 尤而小屋
Author: Peter
Editor: Peter

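Today's post is another modeling walkthrough on a UCI dataset.

Starting from exploratory data analysis, then moving through feature engineering and class-imbalance handling, this article uses a decision tree, a random forest, and gradient boosted trees to analyze a female breast cancer dataset and build predictive models on it.

Keywords: correlation, decision tree, random forest, dimensionality reduction, one-hot encoding, breast cancer

Dataset

The data come from the UCI repository. It is a fairly old dataset, used mainly for classification problems, and you can download it yourself to practice.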

https://archive.ics.uci.edu/ml/datasets/breast+cancer

Importing libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import plotly.express as px
import plotly.graph_objects as go

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.tree import export_graphviz
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import metrics

from sklearn.model_selection import train_test_split

Importing the data

The data come from the UCI repository and can be read directly once downloaded. The file is a .data file with no header row, so we have to supply the column names ourselves (found through an online search).

# from UCI

df = pd.read_table("breast-cancer.data",
                   sep=",",
                   names=["Class","age","menopause","tumor-size","inv-nodes",
                          "node-caps","deg-malig","breast","breast-quad","irradiat"])

df

Basic information

In [3]:

df.dtypes   # column types

Out[3]:

Every column is of object type except one int64 column (deg-malig)

Class          object
age            object
menopause      object
tumor-size     object
inv-nodes      object
node-caps      object
deg-malig       int64
breast         object
breast-quad    object
irradiat       object
dtype: object

In [4]:

df.isnull().sum()  # missing values

Out[4]:

isnull() reports no missing values, because this file marks missing entries with "?" rather than NaN

Class          0
age            0
menopause      0
tumor-size     0
inv-nodes      0
node-caps      0
deg-malig      0
breast         0
breast-quad    0
irradiat       0
dtype: int64

In [5]:

## Field descriptions

columns = df.columns.tolist()
columns

Out[5]:

['Class',
 'age',
 'menopause',
 'tumor-size',
 'inv-nodes',
 'node-caps',
 'deg-malig',
 'breast',
 'breast-quad',
 'irradiat']

The meaning and value range of each field:

Attribute	Meaning	Values
Class	Whether the cancer recurred	no-recurrence-events, recurrence-events
age	Age	10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99
menopause	Menopausal status	lt40 (menopause before age 40), ge40 (menopause at or after age 40), premeno (premenopausal)
tumor-size	Tumor size	0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59
inv-nodes	Number of involved lymph nodes	0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39
node-caps	Whether node caps are present	yes, no
deg-malig	Degree of malignancy	1, 2, 3
breast	Tumor location (breast)	left, right
breast-quad	Breast quadrant of the tumor	left-up, left-low, right-up, right-low, central
irradiat	Whether radiotherapy was given	yes, no

Removing missing values

In [6]:

In two of the fields, "?" marks a missing value. We simply keep the rows where neither field is missing.

df = df[(df["node-caps"] != "?") & (df["breast-quad"] != "?")]

len(df)

Out[6]:
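
The filter above compares each column against "?" by hand. As a minimal sketch of an alternative, the placeholder can be converted to NaN while the file is parsed, so that isnull() and dropna() see it. The three sample rows below are hypothetical, written only to mimic the file layout, and are not taken from the real data.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout as the data file.
sample = io.StringIO(
    "no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no\n"
    "recurrence-events,50-59,ge40,25-29,3-5,?,2,right,left_up,yes\n"
    "no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,?,no\n"
)
names = ["Class", "age", "menopause", "tumor-size", "inv-nodes",
         "node-caps", "deg-malig", "breast", "breast-quad", "irradiat"]

# na_values="?" turns the placeholder into NaN during parsing.
demo = pd.read_csv(sample, names=names, na_values="?")

missing_per_column = demo.isnull().sum()
cleaned = demo.dropna()          # drop every row with at least one NaN

print(missing_per_column[missing_per_column > 0])
print(len(cleaned))
```

With this approach the missing-value check and the row filter use the same definition of "missing", which avoids the situation where isnull() reports zero while placeholders are still present.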

Field processing

In [7]:

from sklearn.preprocessing import LabelEncoder

Age group - age

In [8]:

age = df["age"].value_counts().reset_index()
age.columns = ["age_group", "count"]

age

Most patients in the data are between 40 and 59 years old. One-hot encode the age group:

df = df.join(pd.get_dummies(df["age"]))
df.drop("age", axis=1, inplace=True)
df.head()
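
The join-plus-get_dummies pattern names each new column after the raw category label, so the source field is no longer visible in the column name, and DataFrame.join raises an error if a label already exists as a column. A minimal sketch of a variant that prefixes each dummy column with its field name follows. The toy frame and the shared label "30-34" are hypothetical and chosen only to show the collision case.

```python
import pandas as pd

# Hypothetical toy frame: two binned columns that happen to share the label "30-34".
toy = pd.DataFrame({
    "age": ["30-34", "40-49", "30-34"],
    "tumor-size": ["30-34", "0-4", "10-14"],
})

# prefix= puts the source field into every dummy name, so labels cannot collide.
encoded = pd.get_dummies(toy, columns=["age", "tumor-size"], prefix=["age", "tumor"])

print(encoded.columns.tolist())
# ['age_30-34', 'age_40-49', 'tumor_0-4', 'tumor_10-14', 'tumor_30-34']
```

Passing columns= also drops the original fields in the same call, which replaces the separate drop step.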

Menopause - menopause

In [11]:

menopause = df["menopause"].value_counts().reset_index()
menopause.columns = ["index", "menopause"]  # fixed names, independent of the pandas version
menopause

Out[11]:

	index	menopause
0	premeno	149
1	ge40	123
2	lt40	5

In [12]:

fig = px.pie(menopause,names="index",values="menopause")

fig.update_traces(
    textposition='inside',
    textinfo='percent+label')

fig.show()

df = df.join(pd.get_dummies(df["menopause"]))  # one-hot encoding

df.drop("menopause",axis=1, inplace=True)

Tumor size - tumor-size

In [14]:

tumor_size = df["tumor-size"].value_counts().reset_index()
tumor_size.columns = ["index", "tumor-size"]  # fixed names, independent of the pandas version

tumor_size

Out[14]:

	index	tumor-size
0	30-34	57
1	25-29	51
2	20-24	48
3	15-19	29
4	10-14	28
5	40-44	22
6	35-39	19
7	0-4	8
8	50-54	8
9	5-9	4
10	45-49	3

In [15]:

fig = px.bar(tumor_size,
             x="index",
             y="tumor-size",
             color="tumor-size",
             text="tumor-size")

fig.show()

df = df.join(pd.get_dummies(df["tumor-size"]))

df.drop("tumor-size",axis=1, inplace=True)

In [18]:

df = df.join(pd.get_dummies(df["inv-nodes"]))

df.drop("inv-nodes",axis=1, inplace=True)

Node caps - node-caps

In [19]:

df["node-caps"].value_counts()

Out[19]:

no     221
yes     56
Name: node-caps, dtype: int64

In [20]:

df = df.join(pd.get_dummies(df["node-caps"]).rename(columns={"no":"node_capes_no", "yes":"node_capes_yes"}))

df.drop("node-caps",axis=1, inplace=True)

Degree of malignancy - deg-malig

In [21]:

df["deg-malig"].value_counts()

Out[21]:

2    129
3     82
1     66
Name: deg-malig, dtype: int64

Tumor location - breast

In [22]:

df["breast"].value_counts()

Out[22]:

left     145
right    132
Name: breast, dtype: int64

In [23]:

df = df.join(pd.get_dummies(df["breast"]))

df.drop("breast",axis=1, inplace=True)

...

Recurrence - Class

This is the dependent variable we ultimately want to predict, so the text labels need to be converted to numeric 0/1 values.

In [29]:

dic = {"no-recurrence-events":0, "recurrence-events":1}
df["Class"] = df["Class"].map(dic)  # apply the mapping

df

Counts of recurrence versus no recurrence:

sns.countplot(x=df['Class'], label="Count")

plt.show()

Handling class imbalance

In [31]:

# class distribution

df["Class"].value_counts()

Out[31]:

0    196
1     81
Name: Class, dtype: int64

In [32]:

from imblearn.over_sampling import SMOTE

In [33]:

X = df.iloc[:,1:]
y = df.iloc[:,0]

y.head()

Out[33]:

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

In [34]:

groupby_df = df.groupby('Class').count()

# show the class distribution of the original dataset
groupby_df

model_smote = SMOTE()

# oversample the minority class so both classes have the same number of rows
x_smote_resampled, y_smote_resampled = model_smote.fit_resample(X, y)
x_smoted = pd.DataFrame(x_smote_resampled,
                        columns=df.columns.tolist()[1:])
y_smoted = pd.DataFrame(y_smote_resampled,
                        columns=['Class'])
# features first, label as the last column
df_smoted = pd.concat([x_smoted, y_smoted],axis=1)
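
To make the resampling step less of a black box, here is a minimal NumPy sketch of the core SMOTE idea: each synthetic row is placed at a random point on the segment between a minority row and one of its nearest minority neighbours. This is a simplified illustration, not imblearn's implementation; the function name smote_like and the random data are hypothetical, and only the class counts (196 and 81) mirror the ones above.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances inside the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)              # a row is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical data: 196 majority rows and 81 minority rows.
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(196, 4))
X_minor = rng.normal(2.0, 1.0, size=(81, 4))

synthetic = smote_like(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, synthetic])
print(X_minor_balanced.shape)   # (196, 4)
```

Because the new rows are interpolated, one-hot columns such as the ones built earlier can end up with fractional values between 0 and 1 after resampling, which is worth keeping in mind when interpreting the resampled frame.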

Modeling

Splitting the dataset

In [38]:

X = df_smoted.iloc[:,:-1]
y = df_smoted.iloc[:,-1]

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                 test_size=0.20,
                                                 random_state=123)

Decision tree

In [39]:

dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)

Out[39]:

DecisionTreeClassifier(max_depth=5)

In [40]:

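# predict
y_prob = dt.predict_proba(X_test)[:,1]

# convert predicted probabilities to 0/1 classes
y_pred = np.where(y_prob > 0.5, 1, 0)
# score against the true labels; scoring against y_pred always returns 1.0
dt.score(X_test, y_test)

Out[40]:

0.658 (approximately; 52 of 79 test rows correct, per the confusion matrix in Out[41])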

In [41]:

# confusion matrix

confusion_matrix(y_test, y_pred)

Out[41]:

array([[29,  8],
       [19, 23]])

In [42]:

# classification report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.60      0.78      0.68        37
           1       0.74      0.55      0.63        42

    accuracy                           0.66        79
   macro avg       0.67      0.67      0.66        79
weighted avg       0.68      0.66      0.65        79
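
The numbers in the report can be reproduced by hand from the confusion matrix in Out[41]. A short worked example for class 1 (recurrence), using only the four counts shown there:

```python
# Confusion matrix from Out[41]: rows are true classes, columns are predicted classes.
cm = [[29, 8],
      [19, 23]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)      # of the predicted recurrences, how many were real
recall_1 = tp / (tp + fn)         # of the real recurrences, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.66 0.74 0.55 0.63
```

The recall of 0.55 means the tree misses 19 of the 42 recurrence cases in the test set, which matters more than overall accuracy in a medical setting.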

In [43]:

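# ROC AUC; note this is computed from the hard 0/1 predictions (y_pred), not from the probabilities (y_prob)
metrics.roc_auc_score(y_test, y_pred)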

Out[43]:

0.6657014157014157
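
The value in Out[43] was computed from y_pred, the thresholded 0/1 labels, rather than from y_prob. In that case the ROC curve has only one interior point, and the area under it equals the mean of sensitivity and specificity. The check below uses only the counts from Out[41]:

```python
# Counts taken from the confusion matrix in Out[41].
tn, fp, fn, tp = 29, 8, 19, 23

tpr = tp / (tp + fn)   # sensitivity: 23 / 42
tnr = tn / (tn + fp)   # specificity: 29 / 37

# With 0/1 predictions the ROC curve is (0,0) -> (1 - tnr, tpr) -> (1,1),
# so the trapezoidal area reduces to the mean of sensitivity and specificity.
auc_from_labels = (tpr + tnr) / 2
print(round(auc_from_labels, 4))   # 0.6657
```

This is why the figure differs from the AUC shown in the ROC plot legend, which is computed from y_prob and uses every threshold.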

In [44]:

# ROC curve

from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))  # figure
plt.title('ROC')  # title

plt.plot(false_positive_rate,  # plot the curve
         true_positive_rate,
         color='red',
         label = 'AUC = %0.2f' % roc_auc)

plt.legend(loc = 'lower right') # legend position
plt.plot([0, 1], [0, 1],linestyle='--')  # diagonal reference line

plt.axis('tight')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Random forest

In [45]:

rf = RandomForestClassifier(max_depth=5)
rf.fit(X_train, y_train)

Gradient boosted trees

In [50]:

from sklearn.ensemble import GradientBoostingClassifier

In [51]:

# 'deviance' was renamed 'log_loss' in scikit-learn 1.1 and later removed
gbc = GradientBoostingClassifier(loss='deviance',
                                 learning_rate=0.1,
                                 n_estimators=5,
                                 subsample=1,
                                 min_samples_split=2,
                                 min_samples_leaf=1,
                                 max_depth=3)

gbc.fit(X_train, y_train)

Out[51]:

GradientBoostingClassifier(n_estimators=5, subsample=1)
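
Since the same predict, score, and AUC steps are repeated for each model, they can be wrapped in one loop. The following is a sketch under stated assumptions: the data are a hypothetical synthetic set from make_classification rather than the breast cancer frame, so the printed scores say nothing about the article's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical balanced data standing in for the resampled frame.
X_demo, y_demo = make_classification(n_samples=392, n_features=20, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.20, random_state=123)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(max_depth=5, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=5, max_depth=3,
                                                    random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    prob = model.predict_proba(X_te)[:, 1]       # probability of class 1
    results[name] = {
        "accuracy": model.score(X_te, y_te),     # compared against the true labels
        "auc": roc_auc_score(y_te, prob),        # computed from probabilities
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, auc={scores['auc']:.3f}")
```

Fixing random_state on each estimator keeps the comparison repeatable between runs, which the cells above do not do.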

In [55]:

# ROC curve

from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))  # figure
plt.title('ROC')  # title

plt.plot(false_positive_rate,  # plot the curve
         true_positive_rate,
         color='red',
         label = 'AUC = %0.2f' % roc_auc)

plt.legend(loc = 'lower right') # legend position
plt.plot([0, 1], [0, 1],linestyle='--')  # diagonal reference line

plt.axis('tight')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

PCA dimensionality reduction

Reduction process

In [56]:

from sklearn.decomposition import PCA
pca = PCA(n_components=17)
pca.fit(X)

# share of the total variance explained by each of the 17 retained components
print(pca.explained_variance_ratio_)
[0.17513053 0.12941834 0.11453698 0.07323991 0.05889187 0.05690304
 0.04869476 0.0393374  0.03703477 0.03240863 0.03062932 0.02574137
 0.01887462 0.0180381  0.01606983 0.01453912 0.01318003]

In [57]:

sum(pca.explained_variance_ratio_)

Out[57]:

0.9026686181152915
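
The choice of 17 components can be checked directly from the ratios printed above: accumulate them and find the first position where the running total reaches 90%. The array below is copied from the In [56] output.

```python
import numpy as np

# Explained-variance ratios printed by In [56].
ratios = np.array([
    0.17513053, 0.12941834, 0.11453698, 0.07323991, 0.05889187, 0.05690304,
    0.04869476, 0.0393374, 0.03703477, 0.03240863, 0.03062932, 0.02574137,
    0.01887462, 0.0180381, 0.01606983, 0.01453912, 0.01318003,
])

cumulative = np.cumsum(ratios)
# Index of the first component whose running total reaches 90%, plus one.
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)

print(n_needed, round(float(cumulative[-1]), 4))   # 17 0.9027
```

Sixteen components explain about 88.9% of the variance, so 17 is the smallest number that crosses the 90% mark. scikit-learn's PCA also accepts a fraction such as n_components=0.90 and picks the count itself, which may be a convenient alternative to setting 17 by hand.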

Data after reduction

In [58]:

X_NEW = pca.transform(X)
X_NEW

Out[58]:

array([[ 1.70510215e-01,  5.39929099e-01, -1.04314303e+00, ...,
        -2.26541223e-01, -6.39332871e-02, -8.97923150e-02],
       [-9.01105403e-01,  8.01693088e-01,  5.92260258e-01, ...,
         9.66299251e-02,  1.40755806e-03, -2.74626972e-01],
       [-6.05200264e-01,  6.08455330e-01, -1.00524376e+00, ...,
         4.11416630e-02,  4.15705282e-02, -8.46941345e-02],
       ...,
       [ 1.40652211e-02,  5.35906106e-01,  5.64150123e-02, ...,
         1.70834934e-01,  7.11616391e-02, -1.72250445e-01],
       [-4.41363597e-01,  9.11950641e-01, -4.22184256e-01, ...,
        -4.13385344e-02, -7.64405982e-02,  1.04686148e-01],
       [ 1.98533663e+00, -4.74547396e-01, -1.52557494e-01, ...,
         2.72194184e-02,  5.71553613e-02,  1.78074886e-01]])

In [59]:

X_NEW.shape

Out[59]:

(392, 17)

Re-splitting the data

In [60]:

X_train,X_test,y_train,y_test = train_test_split(X_NEW,y,test_size=0.20,random_state=123)
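
In the cells above, PCA is fitted on all rows before the split, so the test rows influence the components. A possible variant, sketched here on hypothetical synthetic data rather than the article's frame, is to split first and put PCA and the classifier in one scikit-learn pipeline, so that PCA is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the encoded data: 392 rows, 40 features.
X_demo, y_demo = make_classification(n_samples=392, n_features=40,
                                     n_informative=10, random_state=123)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.20, random_state=123)

# PCA is fitted inside the pipeline, so it only ever sees the training rows.
model = make_pipeline(PCA(n_components=17),
                      RandomForestClassifier(max_depth=5, random_state=123))
model.fit(X_tr, y_tr)

test_prob = model.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_prob)
print(X_te.shape, round(test_auc, 3))
```

At prediction time the pipeline applies the already-fitted projection to the test rows and then passes them to the forest, so no separate transform call is needed.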

Random forest again

In [61]:

rf = RandomForestClassifier(max_depth=5)
rf.fit(X_train, y_train)

Out[61]:

RandomForestClassifier(max_depth=5)

In [62]:

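# predict
y_prob = rf.predict_proba(X_test)[:,1]

# convert predicted probabilities to 0/1 classes
y_pred = np.where(y_prob > 0.5, 1, 0)
# score against the true labels; scoring against y_pred always returns 1.0
rf.score(X_test, y_test)

Out[62]:

0.696 (approximately; 55 of 79 test rows correct, per the confusion matrix in Out[63])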

In [63]:

# confusion matrix

confusion_matrix(y_test, y_pred)

Out[63]:

array([[26, 11],
       [13, 29]])

In [64]:

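# ROC AUC; note this is computed from the hard 0/1 predictions (y_pred), not from the probabilities (y_prob)
metrics.roc_auc_score(y_test, y_pred)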

Out[64]:

0.6965894465894465

In [65]:

# ROC curve

from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title('ROC')

plt.plot(false_positive_rate,
         true_positive_rate,
         color='red',
         label = 'AUC = %0.2f' % roc_auc)

plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],linestyle='--')

plt.axis('tight')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

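Summary

Comparing the tree models built after data preprocessing and feature engineering, the random forest performed best, with an AUC of 0.81. After a simple dimensionality reduction we kept the first 17 principal components, which together explain just over 90% of the variance, and modeled again; the AUC then reached 0.83.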



[Machine Learning] Three Tree Models in Practice: Breast Cancer Prediction and Classification