Recommended: Machine Learning in 21 Sentences!
This article is about 9,700 words; suggested reading time: 5 minutes. Today we introduce an excellent beginner-level article on machine learning.
For programmers, the importance of machine learning needs no elaboration. Perhaps you have not started yet, or perhaps you have tried and failed before; either way, it does not matter: here you will find, or regain, your confidence. With a basic command of Python and a passing acquaintance with NumPy, read these 21 sentences carefully, type in the example code line by line, and you can enter the free kingdom of AI.
Classification makes a qualitative judgment about an individual sample, while regression makes a quantitative judgment; both belong to supervised learning, and both are based on experience. For example: an experienced teacher predicting whether a student will pass an exam is classification; predicting what score the student will get is regression. Whether predicting pass/fail or the exact score, the teacher's experience data and way of reasoning are the same; only the final expression differs.
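To make the parallel concrete, here is a minimal sketch (not from the original article) that trains a classifier and a regressor on the same "experience data"; the feature matrix and the fit calls are identical, and only the label type, hence the kind of prediction, differs. The study-hours feature, the toy scores, and the pass mark of 60 are invented for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# hypothetical "experience data": weekly study hours (single feature)
X = np.array([[2], [5], [8], [11], [14], [17]])
scores = np.array([35, 50, 62, 71, 83, 95])   # exam scores: quantitative labels
passed = (scores >= 60).astype(int)           # pass/fail: qualitative labels

clf = KNeighborsClassifier(n_neighbors=3).fit(X, passed)  # classification
reg = KNeighborsRegressor(n_neighbors=3).fit(X, scores)   # regression

print(clf.predict([[10]]))  # qualitative judgment: pass (1) or fail (0)
print(reg.predict([[10]]))  # quantitative judgment: an estimated score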
A sample set is typically represented as a two-dimensional array: one row per sample, one column per feature, as in this member-information example:

>>> import numpy as np
>>> members = np.array([
...     ['male',   '25', 185, 80, 'programmer',    35, 200,  30],
...     ['female', '23', 170, 55, 'civil servant', 15,   0,  80],
...     ['male',   '30', 180, 82, 'lawyer',        60, 260, 300],
...     ['female', '27', 168, 52, 'reporter',      20, 180, 150]
... ])  # each row is a sample, each column a feature; the last column holds securities
Standardization subtracts each feature column's mean and divides by its standard deviation, producing a column with zero mean and unit variance:

>>> security = np.float32(members[:, -1])  # extract the securities feature column
>>> security
array([ 30.,  80., 300., 150.], dtype=float32)
>>> (security - security.mean()) / security.std()  # subtract the mean, then divide by the standard deviation
array([-1.081241, -0.5897678, 1.5727142, 0.09829464], dtype=float32)
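The same standardization is available as a reusable transformer; a minimal sketch, assuming scikit-learn's StandardScaler (it also uses the population standard deviation, so the values match the manual computation above):

import numpy as np
from sklearn import preprocessing as pp

security = np.float32([30, 80, 300, 150])
scaled = pp.StandardScaler().fit_transform(security.reshape(-1, 1))  # scalers expect a 2-D column
print(scaled.ravel())  # same values as the manual (x - mean) / std computation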
Normalization centers each feature column of the sample set by subtracting that column's minimum value, then scales it by dividing by the range (the maximum minus the minimum).
>>> security = np.float32(members[:, -1])  # extract the securities feature column
>>> security
array([ 30.,  80., 300., 150.], dtype=float32)
>>> (security - security.min()) / (security.max() - security.min())  # subtract the minimum, then divide by the range
array([0., 0.18518518, 1., 0.44444445], dtype=float32)
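Likewise, normalization is wrapped as a transformer; a minimal sketch, assuming scikit-learn's MinMaxScaler:

import numpy as np
from sklearn import preprocessing as pp

security = np.float32([30, 80, 300, 150])
scaled = pp.MinMaxScaler().fit_transform(security.reshape(-1, 1))  # maps each column onto [0, 1]
print(scaled.ravel())  # matches the manual (x - min) / (max - min) computation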
Categorical features such as gender and occupation can be converted to numeric form with one-hot encoding:

>>> from sklearn import preprocessing as pp
>>> X = [['male', 'programmer'],
...      ['female', 'civil servant'],
...      ['male', 'lawyer'],
...      ['female', 'reporter']]
>>> ohe = pp.OneHotEncoder().fit(X)
>>> ohe.transform(X).toarray()
array([[0., 1., 0., 0., 1., 0.],
       [1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 1.]])
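To see which output column encodes which category, the fitted encoder exposes a categories_ attribute (one sorted array per input column, matching the column order above), and inverse_transform maps one-hot rows back to labels; a short sketch:

from sklearn import preprocessing as pp

X = [['male', 'programmer'], ['female', 'civil servant'],
     ['male', 'lawyer'], ['female', 'reporter']]
ohe = pp.OneHotEncoder().fit(X)
print(ohe.categories_)   # per-column categories; the one-hot columns follow this sorted order
print(ohe.inverse_transform([[0., 1., 0., 0., 1., 0.]]))  # recovers the row ['male', 'programmer']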
The sklearn.datasets submodule ships with loaders for many classic datasets:

datasets.load_boston([return_X_y]): load the Boston house-price dataset
datasets.load_breast_cancer([return_X_y]): load the Wisconsin breast-cancer dataset
datasets.load_diabetes([return_X_y]): load the diabetes dataset
datasets.load_digits([n_class, return_X_y]): load the handwritten-digits dataset
datasets.load_iris([return_X_y]): load the iris dataset
datasets.load_linnerud([return_X_y]): load the Linnerud physical-fitness dataset
datasets.load_wine([return_X_y]): load the wine dataset
datasets.fetch_20newsgroups([data_home, …]): fetch the 20-newsgroups text-classification dataset
datasets.fetch_20newsgroups_vectorized([…]): fetch the vectorized 20-newsgroups dataset
datasets.fetch_california_housing([…]): fetch the California housing dataset
datasets.fetch_covtype([data_home, …]): fetch the forest-covertype dataset
datasets.fetch_kddcup99([subset, data_home, …]): fetch the network-intrusion-detection dataset
datasets.fetch_lfw_pairs([subset, …]): fetch the paired-faces (LFW) dataset
datasets.fetch_lfw_people([data_home, …]): fetch the labeled-faces (LFW) dataset
datasets.fetch_olivetti_faces([data_home, …]): fetch the Olivetti faces dataset
datasets.fetch_rcv1([data_home, subset, …]): fetch the Reuters RCV1 news text-classification dataset
datasets.fetch_species_distributions([…]): fetch the species-distribution dataset
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> X.shape  # the dataset X has 150 samples and 4 feature columns
(150, 4)
>>> y.shape  # each label in y corresponds one-to-one with a sample in X
(150,)
>>> X[0], y[0]
(array([5.1, 3.5, 1.4, 0.2]), 0)
>>> iris = load_iris()
>>> iris.target_names  # inspect the label names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
>>> X = iris.data
>>> y = iris.target
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split as tsplit
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = tsplit(X, y, test_size=0.1)
>>> X_train.shape, X_test.shape
((135, 4), (15, 4))
>>> y_train.shape, y_test.shape
((135,), (15,))
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split as tsplit
>>> from sklearn.neighbors import KNeighborsClassifier  # import the k-nearest-neighbors classifier
>>> X, y = load_iris(return_X_y=True)  # load the iris dataset as samples and labels
>>> X_train, X_test, y_train, y_test = tsplit(X, y, test_size=0.1)  # split into training and test sets
>>> m = KNeighborsClassifier(n_neighbors=10)  # instantiate the model; n_neighbors sets k (default k=5)
>>> m.fit(X_train, y_train)  # train the model
KNeighborsClassifier(n_neighbors=10)
>>> m.predict(X_test)  # classify the test set
array([2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 0, 1, 0, 0, 2])
>>> y_test  # the actual classes; the prediction above got only one wrong
array([2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 0, 1, 0, 0, 2])
>>> m.score(X_test, y_test)  # model accuracy on the test set (between 0 and 1)
0.9333333333333333
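Besides hard class labels, a fitted k-NN classifier can report per-class probabilities, i.e. the fraction of the k neighbors falling in each class; a short self-contained sketch restating the setup above (predict_proba is a standard scikit-learn classifier method):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
m = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
print(m.predict_proba(X_test[:3]))  # each row: share of the 10 neighbors in each of the 3 classes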
(Note: load_boston was removed in scikit-learn 1.2; on a recent version, substitute another regression dataset such as fetch_california_housing.)

>>> from sklearn.datasets import load_boston
>>> from sklearn.model_selection import train_test_split as tsplit
>>> from sklearn.neighbors import KNeighborsRegressor
>>> X, y = load_boston(return_X_y=True)  # load the Boston house-price dataset
>>> X.shape, y.shape, y.dtype  # 506 samples, 13 feature columns; float labels, suited to a regression model
((506, 13), (506,), dtype('float64'))
>>> X_train, X_test, y_train, y_test = tsplit(X, y, test_size=0.01)  # split into training and test sets
>>> m = KNeighborsRegressor(n_neighbors=10)  # instantiate the model; n_neighbors sets k (default k=5)
>>> m.fit(X_train, y_train)  # train the model
KNeighborsRegressor(n_neighbors=10)
>>> m.predict(X_test)  # predict prices for the 6 test samples
array([27.15, 31.97, 12.68, 28.52, 20.59, 21.47])
>>> y_test  # actual prices: except for the second sample (index 1), the deviations are tolerable
array([29.1, 50. , 12.7, 22.8, 20.4, 21.5])
The metrics submodule quantifies how good (or bad) the regression is:

>>> from sklearn import metrics
>>> y_pred = m.predict(X_test)
>>> metrics.mean_squared_error(y_test, y_pred)  # mean squared error
60.27319999999995
>>> metrics.median_absolute_error(y_test, y_pred)  # median absolute error
1.0700000000000003
>>> metrics.r2_score(y_test, y_pred)  # coefficient of determination (R²)
0.5612816401629652
The same task with a decision-tree regressor:

>>> from sklearn.datasets import load_boston
>>> from sklearn.model_selection import train_test_split as tsplit
>>> from sklearn.tree import DecisionTreeRegressor
>>> from sklearn import metrics
>>> X, y = load_boston(return_X_y=True)  # load the Boston house-price dataset
>>> X_train, X_test, y_train, y_test = tsplit(X, y, test_size=0.01)  # split into training and test sets
>>> m = DecisionTreeRegressor(max_depth=10)  # instantiate the model; tree depth capped at 10
>>> m.fit(X, y)  # train; note this fits on the full set, so the test samples were seen during training
DecisionTreeRegressor(max_depth=10)
>>> y_pred = m.predict(X_test)  # predict
>>> y_test  # actual prices of the test samples
array([20.4, 21.9, 13.8, 22.4, 13.1,  7. ])
>>> y_pred  # predicted prices of the 6 test samples, very close to the actual prices
array([20.14, 22.33, 14.34, 22.4 , 14.62,  7.  ])
>>> metrics.r2_score(y_test, y_pred)  # coefficient of determination (R²)
0.9848774474870712
>>> metrics.mean_squared_error(y_test, y_pred)  # mean squared error
0.4744784865112032
>>> metrics.median_absolute_error(y_test, y_pred)  # median absolute error
0.3462962962962983
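The near-perfect scores above are inflated by that full-set fit: the test samples leaked into training. A minimal leakage-free sketch, using make_regression as a stand-in dataset so it runs on any scikit-learn version (the sizes and noise level are arbitrary); expect a more modest R²:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics

X, y = make_regression(n_samples=500, n_features=13, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
m = DecisionTreeRegressor(max_depth=10)
m.fit(X_train, y_train)        # fit on the training split only
y_pred = m.predict(X_test)     # evaluate on samples the model has never seen
print(metrics.r2_score(y_test, y_pred))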
Support-vector regression (SVR) on the diabetes dataset, comparing two values of the penalty parameter C:

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import train_test_split as tsplit
>>> from sklearn.svm import SVR
>>> from sklearn import metrics
>>> X, y = load_diabetes(return_X_y=True)
>>> X.shape, y.shape, y.dtype
((442, 10), (442,), dtype('float64'))
>>> X_train, X_test, y_train, y_test = tsplit(X, y, test_size=0.02)
>>> svr_1 = SVR(kernel='rbf', C=0.1)  # instantiate an SVR model: rbf kernel, C=0.1
>>> svr_2 = SVR(kernel='rbf', C=100)  # instantiate an SVR model: rbf kernel, C=100
>>> svr_1.fit(X_train, y_train)  # train the model
SVR(C=0.1)
>>> svr_2.fit(X_train, y_train)  # train the model
SVR(C=100)
>>> z_1 = svr_1.predict(X_test)  # predict
>>> z_2 = svr_2.predict(X_test)  # predict
>>> y_test  # actual values of the test set
array([ 49., 317.,  84., 181., 281., 198.,  84.,  52., 129.])
>>> z_1  # predictions with C=0.1: badly off
array([138.10720127, 142.1545034 , 141.25165838, 142.28652449,
       143.19648143, 143.24670732, 137.57932272, 140.51891989,
       143.24486911])
>>> z_2  # predictions with C=100: deviations clearly smaller
array([ 54.38891948, 264.1433666 , 169.71195204, 177.28782561,
       283.65199575, 196.53405477,  61.31486045, 199.30275061,
       184.94923477])
>>> metrics.mean_squared_error(y_test, z_1)  # mean squared error with C=0.1
8464.946517460194
>>> metrics.mean_squared_error(y_test, z_2)  # mean squared error with C=100
3948.37754995066
>>> metrics.r2_score(y_test, z_1)  # R² with C=0.1
0.013199351909129464
>>> metrics.r2_score(y_test, z_2)  # R² with C=100
0.5397181166871942
>>> metrics.median_absolute_error(y_test, z_1)  # median absolute error with C=0.1
57.25165837797314
>>> metrics.median_absolute_error(y_test, z_2)  # median absolute error with C=100
22.68513954888364
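Rather than hand-picking a couple of C values, scikit-learn can search over candidates with cross-validation via GridSearchCV; a minimal sketch (the candidate grid below is an arbitrary illustration, and the search refits one SVR per fold and candidate, so it takes a moment):

from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
grid = GridSearchCV(SVR(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10, 100, 1000]},  # candidate C values
                    cv=5)                                       # 5-fold cross-validation
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # best C and its mean cross-validated score (R²)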
Comparing a single decision tree against a random forest on the breast-cancer dataset with 10-fold cross-validation:

>>> from sklearn.datasets import load_breast_cancer  # import the dataset loader
>>> from sklearn.tree import DecisionTreeClassifier  # import the decision tree
>>> from sklearn.ensemble import RandomForestClassifier  # import the random forest
>>> from sklearn.model_selection import cross_val_score  # import cross-validation
>>> ds = load_breast_cancer()  # load the Wisconsin breast-cancer dataset
>>> ds.data.shape  # 569 breast-cancer samples, each with 30 features
(569, 30)
>>> dtc = DecisionTreeClassifier()  # instantiate a decision-tree classifier
>>> rfc = RandomForestClassifier()  # instantiate a random-forest classifier
>>> dtc_score = cross_val_score(dtc, ds.data, ds.target, cv=10)  # cross-validate
>>> dtc_score  # the decision tree's results over 10 folds
array([0.92982456, 0.85964912, 0.92982456, 0.89473684, 0.92982456,
       0.89473684, 0.87719298, 0.94736842, 0.92982456, 0.92857143])
>>> dtc_score.mean()  # the decision tree's mean accuracy over the 10 folds
0.9121553884711779
>>> rfc_score = cross_val_score(rfc, ds.data, ds.target, cv=10)  # cross-validate
>>> rfc_score  # the random forest's results over 10 folds
array([0.98245614, 0.89473684, 0.94736842, 0.94736842, 0.98245614,
       0.98245614, 0.94736842, 0.98245614, 0.94736842, 1.        ])
>>> rfc_score.mean()  # the random forest's mean accuracy over the 10 folds
0.9614035087719298
>>> from sklearn import datasets as dss  # the datasets module also provides sample generators
>>> from sklearn.cluster import KMeans  # import the clustering model from the cluster submodule
>>> import matplotlib.pyplot as plt
>>> X_blob, y_blob = dss.make_blobs(n_samples=[300, 400, 300], n_features=2)
>>> X_circle, y_circle = dss.make_circles(n_samples=1000, noise=0.05, factor=0.5)
>>> X_moon, y_moon = dss.make_moons(n_samples=1000, noise=0.05)
>>> y_blob_pred = KMeans(init='k-means++', n_clusters=3).fit_predict(X_blob)
>>> y_circle_pred = KMeans(init='k-means++', n_clusters=2).fit_predict(X_circle)
>>> y_moon_pred = KMeans(init='k-means++', n_clusters=2).fit_predict(X_moon)
>>> plt.subplot(131)
>>> plt.title('blob clusters')
>>> plt.scatter(X_blob[:, 0], X_blob[:, 1], c=y_blob_pred)
>>> plt.subplot(132)
>>> plt.title('ring clusters')
>>> plt.scatter(X_circle[:, 0], X_circle[:, 1], c=y_circle_pred)
>>> plt.subplot(133)
>>> plt.title('crescent clusters')
>>> plt.scatter(X_moon[:, 0], X_moon[:, 1], c=y_moon_pred)
>>> plt.show()
DBSCAN, a density-based clustering model, handles the crescent shapes that K-Means cannot; here three neighborhood radii (eps) are compared:

>>> from sklearn import datasets as dss
>>> from sklearn.cluster import DBSCAN
>>> import matplotlib.pyplot as plt
>>> X, y = dss.make_moons(n_samples=1000, noise=0.05)
>>> dbs_1 = DBSCAN()  # defaults: core-sample radius 0.5, at least 5 neighbors per core sample
>>> dbs_2 = DBSCAN(eps=0.2)  # core-sample radius 0.2, at least 5 neighbors
>>> dbs_3 = DBSCAN(eps=0.1)  # core-sample radius 0.1, at least 5 neighbors
>>> dbs_1.fit(X)
DBSCAN()
>>> dbs_2.fit(X)
DBSCAN(eps=0.2)
>>> dbs_3.fit(X)
DBSCAN(eps=0.1)
>>> plt.subplot(131)
>>> plt.title('eps=0.5')
>>> plt.scatter(X[:, 0], X[:, 1], c=dbs_1.labels_)
>>> plt.subplot(132)
>>> plt.title('eps=0.2')
>>> plt.scatter(X[:, 0], X[:, 1], c=dbs_2.labels_)
>>> plt.subplot(133)
>>> plt.title('eps=0.1')
>>> plt.scatter(X[:, 0], X[:, 1], c=dbs_3.labels_)
>>> plt.show()
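DBSCAN labels samples that belong to no cluster as -1 (noise), which gives a quick numeric way to compare neighborhood radii without plotting; a short self-contained sketch (the eps values mirror the session above):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, y = make_moons(n_samples=1000, noise=0.05)
for eps in (0.5, 0.2, 0.1):
    labels = DBSCAN(eps=eps).fit(X).labels_              # -1 marks noise samples
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(eps, n_clusters, np.sum(labels == -1))         # radius, cluster count, noise count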
Principal component analysis (PCA) on the iris dataset:

>>> from sklearn import datasets as dss
>>> from sklearn.decomposition import PCA
>>> ds = dss.load_iris()
>>> ds.data.shape  # 150 samples, 4 feature dimensions
(150, 4)
>>> m = PCA()  # instantiate the PCA class with default parameters, n_components=None
>>> m.fit(ds.data)
PCA()
>>> m.explained_variance_  # variance of each component after the orthogonal transform
array([4.22824171, 0.24267075, 0.0782095 , 0.02383509])
>>> m.explained_variance_ratio_  # each component's share of the total variance
array([0.92461872, 0.05306648, 0.01710261, 0.00521218])
Principal component analysis of the iris dataset shows one dominant component, accounting for more than 92% of the total variance, and one component with a very small variance, accounting for only 0.52% of the total. The first two components together account for more than 97.7% of the total variance, so the dataset's feature columns can be reduced from 4 to 2 without losing much useful information.
>>> m = PCA(n_components=0.97)
>>> m.fit(ds.data)
PCA(n_components=0.97)
>>> m.explained_variance_
array([4.22824171, 0.24267075])
>>> m.explained_variance_ratio_
array([0.92461872, 0.05306648])
>>> d = m.transform(ds.data)
>>> d.shape
(150, 2)
>>> import matplotlib.pyplot as plt
>>> plt.scatter(d[:, 0], d[:, 1], c=ds.target)
>>> plt.show()
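PCA can also map the reduced points back to an approximation of the original feature space with inverse_transform, losing only the discarded ~2.3% of variance; a short self-contained sketch restating the steps above:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
m = PCA(n_components=0.97).fit(X)    # keep enough components for 97% of the variance
d = m.transform(X)                   # reduced data, shape (150, 2)
X_approx = m.inverse_transform(d)    # back to shape (150, 4), an approximate reconstruction
print(X.shape, d.shape, X_approx.shape)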