[Machine Learning] Clustering Code Exercise
This is the companion exercise code for the "Clustering" chapter of the Machine Learning course on the Chinese University MOOC platform.
Course page:
https://www.icourse163.org/course/WZU-1464096179
Complete course code:
https://github.com/fengdu78/WZU-machine-learning-course
Code adapted and annotated by: Huang Haiguang (黃海廣), [email protected]
In this exercise we will implement K-means clustering and use it to compress an image. We will start with a simple 2D dataset to see how K-means works, then apply it to image compression. We will also experiment with principal component analysis and see how it can be used to find a low-dimensional representation of face images.
K-means clustering
We will implement and apply K-means to a simple 2D dataset to get some intuition for how it works. K-means is an iterative, unsupervised clustering algorithm that groups similar instances into clusters. The algorithm starts with an initial guess for each cluster's centroid, then repeatedly assigns each instance to its nearest cluster and recomputes that cluster's centroid. The first piece we implement is a function that finds the closest centroid for every instance in the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
def find_closest_centroids(X, centroids):
    m = X.shape[0]
    k = centroids.shape[0]
    idx = np.zeros(m)

    for i in range(m):
        min_dist = 1000000
        for j in range(k):
            dist = np.sum((X[i, :] - centroids[j, :])**2)
            if dist < min_dist:
                min_dist = dist
                idx[i] = j

    return idx
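As an aside, the double loop can be replaced by NumPy broadcasting. A minimal sketch (the helper name find_closest_centroids_vec is ours, not part of the exercise):
def find_closest_centroids_vec(X, centroids):
    # (m, 1, n) minus (k, n) broadcasts to (m, k, n); sum over the feature axis
    dists = np.sum((X[:, np.newaxis, :] - centroids) ** 2, axis=2)
    return np.argmin(dists, axis=1)  # index of the nearest centroid per sample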
Let's test the function to make sure it works correctly. We will use the test case provided in the exercise.
data2 = pd.read_csv('data/ex7data2.csv')
data2.head()
|   | X1 | X2 |
|---|---|---|
| 0 | 1.842080 | 4.607572 |
| 1 | 5.658583 | 4.799964 |
| 2 | 6.352579 | 3.290854 |
| 3 | 2.904017 | 4.612204 |
| 4 | 3.231979 | 4.939894 |
X = data2.values
initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])
idx = find_closest_centroids(X, initial_centroids)
idx[0:3]
array([0., 2., 1.])
The output matches the expected values in the text (remember that our arrays are zero-indexed rather than one-indexed, so the values are one lower than in the exercise). Next, we need a function to compute the centroid of a cluster. A cluster's centroid is simply the mean of all the samples currently assigned to it.
sb.set(context="notebook", style="white")
sb.lmplot(x='X1', y='X2', data=data2, fit_reg=False)
plt.show()

def compute_centroids(X, idx, k):
    m, n = X.shape
    centroids = np.zeros((k, n))

    for i in range(k):
        indices = np.where(idx == i)
        centroids[i, :] = (np.sum(X[indices, :], axis=1) /
                           len(indices[0])).ravel()

    return centroids
compute_centroids(data2.values, idx, 3)
array([[2.42830111, 3.15792418],
       [5.81350331, 2.63365645],
       [7.11938687, 3.6166844 ]])
This output also matches the expected values in the exercise. The next part involves actually running the algorithm for some number of iterations and visualizing the result. Since this step is not complicated, I will build it from scratch. To run the algorithm, we just alternate between assigning samples to their nearest cluster and recomputing the cluster centroids.
def run_k_means(X, initial_centroids, max_iters):
    m, n = X.shape
    k = initial_centroids.shape[0]
    idx = np.zeros(m)
    centroids = initial_centroids

    for i in range(max_iters):
        idx = find_closest_centroids(X, centroids)
        centroids = compute_centroids(X, idx, k)

    return idx, centroids
idx, centroids = run_k_means(X, initial_centroids, 10)
cluster1 = X[np.where(idx == 0)[0], :]
cluster2 = X[np.where(idx == 1)[0], :]
cluster3 = X[np.where(idx == 2)[0], :]

fig, ax = plt.subplots(figsize=(15, 10))
ax.scatter(cluster1[:, 0], cluster1[:, 1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:, 0], cluster2[:, 1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:, 0], cluster3[:, 1], s=30, color='b', label='Cluster 3')
ax.legend()
plt.show()

One step we skipped is the initialization of the centroids, which can affect the algorithm's convergence. Our task here is to write a function that selects random samples and uses them as the initial centroids.
def init_centroids(X, k):
    m, n = X.shape
    centroids = np.zeros((k, n))
    idx = np.random.randint(0, m, k)

    for i in range(k):
        centroids[i, :] = X[idx[i], :]

    return centroids
init_centroids(X, 3)
array([[1.52334113, 4.87916159],
       [3.06192918, 1.5719211 ],
       [1.75164337, 0.68853536]])
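Because a single random initialization can converge to a poor local optimum, a common remedy is to run K-means several times and keep the solution with the lowest distortion (the sum of squared distances from each sample to its assigned centroid). A minimal sketch built on the functions above; best_of_n_runs is a hypothetical helper, not part of the exercise:
def best_of_n_runs(X, k, max_iters=10, n_runs=5):
    best_cost, best = np.inf, None
    for _ in range(n_runs):
        # note: init_centroids may pick the same row twice; a robust version would resample
        idx, centroids = run_k_means(X, init_centroids(X, k), max_iters)
        cost = np.sum((X - centroids[idx.astype(int)]) ** 2)  # distortion
        if cost < best_cost:
            best_cost, best = cost, (idx, centroids)
    return best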
Choosing k
We use the "elbow method" to pick k.
from sklearn.cluster import KMeans

# use the SSE (sum of squared errors) to choose k
SSE = []  # holds the SSE for each value of k
for k in range(1, 9):
    estimator = KMeans(n_clusters=k)  # build the clusterer
    estimator.fit(data2)
    SSE.append(estimator.inertia_)
ks = range(1, 9)
plt.figure(figsize=(15, 10))
plt.xlabel('k')
plt.ylabel('SSE')
plt.plot(ks, SSE, 'o-')
plt.show()

The figure shows that the elbow is at k=3, so we choose k=3.
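As a cross-check on the elbow plot, we can also compare silhouette scores across candidate values of k (higher is better). A short sketch reusing the KMeans estimator from above; for this dataset we would expect the score to peak near k=3, agreeing with the elbow:
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(data2)
    print(k, round(silhouette_score(data2, labels), 4))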
Image compression with K-means
Our next task is to apply K-means to image compression. As the demo below shows, we can use clustering to find a small number of the most representative colors, then use the cluster assignments to map the original 24-bit colors into a lower-dimensional color space.
Below is the image we are going to compress.
from IPython.display import Image
Image(filename='data/bird_small.png')

The raw pixel data has been pre-loaded for us so let's pull it in.
image_data = loadmat('data/bird_small.mat')
# image_data
A = image_data['A']
A.shape
(128, 128, 3)
Now we need to apply some preprocessing to the data and feed it to the K-means algorithm.
# normalize value ranges
A = A / 255.

# reshape the array
X = np.reshape(A, (A.shape[0] * A.shape[1], A.shape[2]))
X.shape
(16384, 3)
# randomly initialize the centroids
initial_centroids = init_centroids(X, 16)

# run the algorithm
idx, centroids = run_k_means(X, initial_centroids, 10)

# get the closest centroids one last time
idx = find_closest_centroids(X, centroids)

# map each pixel to the centroid value
X_recovered = centroids[idx.astype(int), :]
X_recovered.shape
(16384, 3)
# reshape to the original dimensions
X_recovered = np.reshape(X_recovered, (A.shape[0], A.shape[1], A.shape[2]))
X_recovered.shape
(128, 128, 3)
plt.imshow(X_recovered)
plt.show()
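A quick back-of-the-envelope accounting of the savings (a sketch that ignores any file-format overhead): the original image stores 24 bits per pixel, while the compressed representation needs only a 4-bit index into a 16-color palette per pixel.
n_pixels = 128 * 128
original_bits = n_pixels * 24       # 24-bit RGB per pixel
palette_bits = 16 * 24              # the 16 centroid colors
index_bits = n_pixels * 4           # 4 bits are enough to address 16 colors
print(original_bits / (palette_bits + index_bits))  # roughly a 6x reduction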

You can see that we compressed the image, but its main features are still present. That's K-means. Next, let's implement K-means with scikit-learn.
from skimage import io

# cast to float; otherwise the colors look strange after clustering
pic = io.imread('data/bird_small.png') / 255.
io.imshow(pic)
plt.show()

pic.shape
(128, 128, 3)
# flatten the image into an (n_pixels, 3) array
data = pic.reshape(128*128, 3)
data.shape
(16384, 3)
from sklearn.cluster import KMeans  # import the KMeans estimator

model = KMeans(n_clusters=16, n_init=100)
model.fit(data)
KMeans(n_clusters=16, n_init=100)
centroids = model.cluster_centers_
print(centroids.shape)
C = model.predict(data)
print(C.shape)
(16, 3)
(16384,)
centroids[C].shape
(16384, 3)
compressed_pic = centroids[C].reshape((128, 128, 3))
fig, ax = plt.subplots(1, 2)
ax[0].imshow(pic)
ax[1].imshow(compressed_pic)
plt.show()

Density-based clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical methods, it defines a cluster as a maximal set of density-connected points, so it can carve clusters out of regions of sufficiently high density and discover clusters of arbitrary shape in noisy spatial data.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # font capable of rendering CJK labels
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
Create the sample data:
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(
    n_samples=750, centers=centers, cluster_std=0.4, random_state=0
)
Standardize the data:
X = StandardScaler().fit_transform(X)
DBSCAN uses two hyperparameters, a scan radius (eps) and a minimum number of points (minPts), to derive the number of clusters rather than guessing it:
(1) Scan radius (eps): the distance used to locate points and check the density of any point's neighborhood.
(2) Minimum number of points (minPts): the minimum number of points (a threshold) that must lie close together for a region to be considered dense.
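A common heuristic for choosing eps, sketched below under the assumption that minPts is fixed first: sort every point's distance to its minPts-th nearest neighbor and look for the "knee" of the curve.
from sklearn.neighbors import NearestNeighbors

min_pts = 10
nbrs = NearestNeighbors(n_neighbors=min_pts).fit(X)
dists, _ = nbrs.kneighbors(X)    # distances to each point's nearest neighbors
k_dist = np.sort(dists[:, -1])   # sorted distance to the min_pts-th neighbor
plt.plot(k_dist)
plt.ylabel("distance to %d-th nearest neighbor" % min_pts)
plt.show()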
We define a function plot_dbscan(MyEps, MiniSample), where MyEps stands for eps and MiniSample stands for minPts.
def plot_dbscan(MyEps, MiniSample):
    db = DBSCAN(eps=MyEps, min_samples=MiniSample).fit(X)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_

    # Number of clusters in labels, ignoring noise (label -1) if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)

    print("Estimated number of clusters: %d" % n_clusters_)
    print("Estimated number of noise points: %d" % n_noise_)
    print("Homogeneity: %0.4f" %
          metrics.homogeneity_score(labels_true, labels))
    print("Completeness: %0.4f" %
          metrics.completeness_score(labels_true, labels))
    print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
    print("ARI (Adjusted Rand Index): %0.4f" %
          metrics.adjusted_rand_score(labels_true, labels))
    print("AMI (Adjusted Mutual Information): %0.4f" %
          metrics.adjusted_mutual_info_score(labels_true, labels))
    print("Silhouette Coefficient: %0.4f" %
          metrics.silhouette_score(X, labels))

    # #############################################################################
    # Plot the result
    # Black points are noise
    unique_labels = set(labels)
    colors = [
        plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))
    ]
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = [0, 0, 0, 1]

        class_member_mask = labels == k

        xy = X[class_member_mask & core_samples_mask]
        plt.plot(
            xy[:, 0],
            xy[:, 1],
            "o",
            markerfacecolor=tuple(col),
            markeredgecolor="k",
            markersize=14,
        )

        xy = X[class_member_mask & ~core_samples_mask]
        plt.plot(
            xy[:, 0],
            xy[:, 1],
            "o",
            markerfacecolor=tuple(col),
            markeredgecolor="k",
            markersize=6,
        )

    plt.title("Number of clusters: %d" % n_clusters_, fontsize=18)
#     plt.savefig(str(MyEps) + str(MiniSample) + '.png')  # save the figure
    plt.show()
plot_dbscan(0.3, 10)
Estimated number of clusters: 3
Estimated number of noise points: 18
Homogeneity: 0.9530
Completeness: 0.8832
V-measure: 0.917
ARI (Adjusted Rand Index): 0.9517
AMI (Adjusted Mutual Information): 0.9165
Silhouette Coefficient: 0.6255

plot_dbscan(0.1, 10)
Estimated number of clusters: 12
Estimated number of noise points: 516
Homogeneity: 0.3128
Completeness: 0.2489
V-measure: 0.277
ARI (Adjusted Rand Index): 0.0237
AMI (Adjusted Mutual Information): 0.2673
Silhouette Coefficient: -0.3659

plot_dbscan(0.4, 10)
Estimated number of clusters: 1
Estimated number of noise points: 2
Homogeneity: 0.0010
Completeness: 0.0586
V-measure: 0.002
ARI (Adjusted Rand Index): -0.0000
AMI (Adjusted Mutual Information): -0.0011
Silhouette Coefficient: 0.0611

plot_dbscan(0.3, 6)
Estimated number of clusters: 2
Estimated number of noise points: 13
Homogeneity: 0.5365
Completeness: 0.8263
V-measure: 0.651
ARI (Adjusted Rand Index): 0.5414
AMI (Adjusted Mutual Information): 0.6495
Silhouette Coefficient: 0.3845

As you can see, the evaluation metrics are highest when the scan radius (eps) is 0.3 and the minimum number of points (minPts) is 10.
Hierarchical clustering
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # font capable of rendering CJK labels
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering


def plot_dendrogram(model, **kwargs):
    # Build the linkage matrix, then plot the dendrogram.

    # Count the samples under each node.
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram.
    dendrogram(linkage_matrix, **kwargs)
iris = load_iris()
X = iris.data

# Setting distance_threshold=0 ensures the full tree is computed.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model = model.fit(X)

plt.title("Hierarchical Clustering Dendrogram")
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode="level", p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis)")
plt.show()
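To obtain flat cluster labels from the same hierarchy, the estimator can simply be refit with a fixed number of clusters. A short usage sketch (n_clusters=3 matches the three iris species):
flat_model = AgglomerativeClustering(n_clusters=3).fit(X)
print(np.bincount(flat_model.labels_))  # size of each flat cluster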

References
Prof. Andrew Ng. Machine Learning. Stanford University.
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
https://scikit-learn.org/stable/auto_examples/cluster