圖異常點檢測算法OddBall(附代碼)

大家好,我是愛生活,AI風控的小伍哥,上次給大家?guī)淼谝黄恼翺ddBall,當時并未附代碼,現(xiàn)在找到一個比較好的開源代碼,分享給大家。
項目地址:https://github.com/gloooryyt/oddball_py3

我們自己構建測試數(shù)據(jù)或者讀取測試數(shù)據(jù)
import networkx as nxG = nx.Graph(date="10.11",name="無向圖")#創(chuàng)建空圖,無向圖G.add_node(1)G.add_nodes_from([1,2,3,4,5,6,7,8,9])G.nodes()G.add_weighted_edges_from([(1,2,10),(1,3,1),(1,4,1),(1,5,1),(1,6,1),(1,7,1),(1,8,1),(1,9,1),(8,9,1)])print(G.adj){1: {2: {'weight': 10}, 3: {'weight': 1}, 4: {'weight': 1}, 5: {'weight': 1}, 6: {'weight': 1}, 7: {'weight': 1}, 8: {'weight': 1}, 9: {'weight': 1}}, 2: {1: {'weight': 10}}, 3: {1: {'weight': 1}}, 4: {1: {'weight': 1}}, 5: {1: {'weight': 1}}, 6: {1: {'weight': 1}}, 7: {1: {'weight': 1}}, 8: {1: {'weight': 1}, 9: {'weight': 1}}, 9: {1: {'weight': 1}, 8: {'weight': 1}}}import?matplotlib.pyplot?as?pltedgewidth?=?[10,1,1,1,1,1,1,1,12]nx.draw(G,with_labels=True,pos=nx.spring_layout(G),width=edgewidth,node_color="lightgreen",edge_color="DeepPink")plt.show()

也可以讀取自己的數(shù)據(jù)集
#數(shù)據(jù)讀取import ospath?=?'/Users/wuzhengxiang/Downloads/OddBall10_tbox?2/oddball-edges.txt'
import?numpy?as?npimport networkx as nx#load data, a weighted undirected graphdef load_data(path):data = np.loadtxt(path).astype('int32')G = nx.Graph()for ite in data:G.add_edge(ite[0], ite[1], weight=ite[2])return Gdef get_feature(G):#feature dictionary which format is {node i's id:Ni, Ei, Wi, λw,i}featureDict = {}nodelist = list(G.nodes)for ite in nodelist:featureDict[ite] = []#the number of node i's neighborNi = G.degree(ite)featureDict[ite].append(Ni)#the set of node i's neighboriNeighbor = list(G.neighbors(ite))#the number of edges in egonet iEi = 0#sum of weights in egonet iWi = 0#the principal eigenvalue(the maximum eigenvalue with abs) of egonet i's weighted adjacency matrixLambda_w_i = 0Ei += Niegonet = nx.Graph()for nei in iNeighbor:Wi += G[nei][ite]['weight']egonet.add_edge(ite, nei, weight=G[nei][ite]['weight'])iNeighborLen = len(iNeighbor)for it1 in range(iNeighborLen):for it2 in range(it1+1, iNeighborLen):#if it1 in it2's neighbor listif iNeighbor[it1] in list(G.neighbors(iNeighbor[it2])):Ei += 1Wi += G[iNeighbor[it1]][iNeighbor[it2]]['weight']egonet.add_edge(iNeighbor[it1], iNeighbor[it2], weight=G[iNeighbor[it1]][iNeighbor[it2]]['weight'])egonet_adjacency_matrix = nx.adjacency_matrix(egonet).todense()eigenvalue, eigenvector = np.linalg.eig(egonet_adjacency_matrix)eigenvalue.sort()Lambda_w_i = max(abs(eigenvalue[0]), abs(eigenvalue[-1]))featureDict[ite].append(Ei)featureDict[ite].append(Wi)featureDict[ite].append(Lambda_w_i)??????return?featureDict
import numpy as npfrom sklearn.linear_model import LinearRegressionfrom sklearn.neighbors import LocalOutlierFactor# feature dictionary which format is {node i's id:Ni, Ei, Wi, λw,i}def star_or_clique(featureDict):N = []E = []for key in featureDict.keys():N.append(featureDict[key][0])E.append(featureDict[key][1])#E=CN^α => log on both sides => logE=logC+αlogN#regard as y=b+wx to do linear regression#here the base of log is 2y_train = np.log2(E)y_train = np.array(y_train)y_train = y_train.reshape(len(E), 1)x_train = np.log2(N)x_train = np.array(x_train)x_train = x_train.reshape(len(N), 1)model = LinearRegression()y_train)w = model.coef_[0][0]b = model.intercept_[0]C = 2**balpha = woutlineScoreDict = {}for key in featureDict.keys():yi = featureDict[key][1]xi = featureDict[key][0]outlineScore = (max(yi, C*(xi**alpha))/min(yi, C*(xi**alpha)))*np.log(abs(yi-C*(xi**alpha))+1)= outlineScorereturn outlineScoreDictdef heavy_vicinity(featureDict):W = []E = []for key in featureDict.keys():W.append(featureDict[key][2])E.append(featureDict[key][1])#W=CE^β => log on both sides => logW=logC+βlogE#regard as y=b+wx to do linear regression#here the base of log is 2y_train = np.log2(W)y_train = np.array(y_train)y_train = y_train.reshape(len(W), 1)x_train = np.log2(E)x_train = np.array(x_train)x_train = x_train.reshape(len(E), 1)model = LinearRegression()y_train)w = model.coef_[0][0]b = model.intercept_[0]C = 2**bbeta = woutlineScoreDict = {}for key in featureDict.keys():yi = featureDict[key][2]xi = featureDict[key][1]outlineScore = (max(yi, C*(xi**beta))/min(yi, C*(xi**beta)))*np.log(abs(yi-C*(xi**beta))+1)= outlineScorereturn outlineScoreDictdef dominant_edge(featureDict):Lambda_w_i = []W = []for key in featureDict.keys():Lambda_w_i.append(featureDict[key][3])W.append(featureDict[key][2])#λ=CW^γ => log on both sides => logλ=logC+γlogW#regard as y=b+wx to do linear regression#here the base of log is 2y_train = np.log2(Lambda_w_i)y_train = np.array(y_train)y_train = y_train.reshape(len(Lambda_w_i), 1)x_train = np.log2(W)x_train = np.array(x_train)x_train = x_train.reshape(len(W), 1)model = LinearRegression()y_train)w = model.coef_[0][0]b = model.intercept_[0]C = 2 ** bbeta = woutlineScoreDict = {}for key in featureDict.keys():yi = featureDict[key][3]xi = featureDict[key][2]outlineScore = (max(yi, C * (xi ** beta)) / min(yi, C * (xi ** beta))) * np.log(abs(yi - C * (xi ** beta)) + 1)= outlineScorereturn outlineScoreDictdef star_or_clique_withLOF(featureDict):N = []E = []for key in featureDict.keys():N.append(featureDict[key][0])E.append(featureDict[key][1])#E=CN^α => log on both sides => logE=logC+αlogN#regard as y=b+wx to do linear regression#here the base of log is 2y_train = np.log2(E)y_train = np.array(y_train)y_train = y_train.reshape(len(E), 1)x_train = np.log2(N)x_train = np.array(x_train)x_train = x_train.reshape(len(N), 1) #the order in x_train and y_train is the same as which in featureDict.keys() now#prepare data for LOFxAndyForLOF = []for index in range(len(N)):tempArray = np.array([x_train[index][0], y_train[index][0]])xAndyForLOF.append(tempArray)xAndyForLOF = np.array(xAndyForLOF)model = LinearRegression()y_train)w = model.coef_[0][0]b = model.intercept_[0]C = 2**balpha = w={}'.format(alpha))#LOF algorithmclf = LocalOutlierFactor(n_neighbors=20)clf.fit(xAndyForLOF)LOFScoreArray = -clf.negative_outlier_factor_outScoreDict = {}count = 0 #Used to take LOFScore in sequence from LOFScoreArray#get the maximum outLinemaxOutLine = 0for key in featureDict.keys():yi = featureDict[key][1]xi = featureDict[key][0]outlineScore = (max(yi, C*(xi**alpha))/min(yi, C*(xi**alpha)))*np.log(abs(yi-C*(xi**alpha))+1)if outlineScore > maxOutLine:maxOutLine = outlineScore={}'.format(maxOutLine))#get the maximum LOFScoremaxLOFScore = 0for ite in range(len(N)):if LOFScoreArray[ite] > maxLOFScore:maxLOFScore = LOFScoreArray[ite]={}'.format(maxLOFScore))for key in featureDict.keys():yi = featureDict[key][1]xi = featureDict[key][0]outlineScore = (max(yi, C*(xi**alpha))/min(yi, C*(xi**alpha)))*np.log(abs(yi-C*(xi**alpha))+1)LOFScore = LOFScoreArray[count]count += 1outScore = outlineScore/maxOutLine + LOFScore/maxLOFScore= outScorereturn outScoreDictdef heavy_vicinity_withLOF(featureDict):W = []E = []for key in featureDict.keys():W.append(featureDict[key][2])E.append(featureDict[key][1])#W=CE^β => log on both sides => logW=logC+βlogE#regard as y=b+wx to do linear regression#here the base of log is 2y_train = np.log2(W)y_train = np.array(y_train)y_train = y_train.reshape(len(W), 1)x_train = np.log2(E)x_train = np.array(x_train)x_train = x_train.reshape(len(E), 1) #the order in x_train and y_train is the same as which in featureDict.keys() now#prepare data for LOFxAndyForLOF = []for index in range(len(W)):tempArray = np.array([x_train[index][0], y_train[index][0]])xAndyForLOF.append(tempArray)xAndyForLOF = np.array(xAndyForLOF)model = LinearRegression()y_train)w = model.coef_[0][0]b = model.intercept_[0]C = 2**bbeta = w={}'.format(beta))#LOF algorithmclf = LocalOutlierFactor(n_neighbors=20)clf.fit(xAndyForLOF)LOFScoreArray = -clf.negative_outlier_factor_outScoreDict = {}count = 0 #Used to take LOFScore in sequence from LOFScoreArray#get the maximum outLinemaxOutLine = 0for key in featureDict.keys():yi = featureDict[key][2]xi = featureDict[key][1]outlineScore = (max(yi, C*(xi**beta))/min(yi, C*(xi**beta)))*np.log(abs(yi-C*(xi**beta))+1)if outlineScore > maxOutLine:maxOutLine = outlineScore={}'.format(maxOutLine))#get the maximum LOFScoremaxLOFScore = 0for ite in range(len(W)):if LOFScoreArray[ite] > maxLOFScore:maxLOFScore = LOFScoreArray[ite]={}'.format(maxLOFScore))for key in featureDict.keys():yi = featureDict[key][2]xi = featureDict[key][1]outlineScore = (max(yi, C*(xi**beta))/min(yi, C*(xi**beta)))*np.log(abs(yi-C*(xi**beta))+1)LOFScore = LOFScoreArray[count]count += 1outScore = outlineScore/maxOutLine + LOFScore/maxLOFScore= outScorereturn outScoreDictdef dominant_edge_withLOF(featureDict):Lambda_w_i = []W = []for key in featureDict.keys():Lambda_w_i.append(featureDict[key][3])W.append(featureDict[key][2])#λ=CW^γ => log on both sides => logλ=logC+γlogW#regard as y=b+wx to do linear regression#here the base of log is 2y_train = np.log2(Lambda_w_i)y_train = np.array(y_train)y_train = y_train.reshape(len(Lambda_w_i), 1)x_train = np.log2(W)x_train = np.array(x_train)x_train = x_train.reshape(len(W), 1) #the order in x_train and y_train is the same as which in featureDict.keys() now#prepare data for LOFxAndyForLOF = []for index in range(len(W)):tempArray = np.array([x_train[index][0], y_train[index][0]])xAndyForLOF.append(tempArray)xAndyForLOF = np.array(xAndyForLOF)model = LinearRegression()y_train)w = model.coef_[0][0]b = model.intercept_[0]C = 2**bgamma = w={}'.format(gamma))#LOF algorithmclf = LocalOutlierFactor(n_neighbors=20)clf.fit(xAndyForLOF)LOFScoreArray = -clf.negative_outlier_factor_outScoreDict = {}count = 0 #Used to take LOFScore in sequence from LOFScoreArray#get the maximum outLinemaxOutLine = 0for key in featureDict.keys():yi = featureDict[key][3]xi = featureDict[key][2]outlineScore = (max(yi, C*(xi**gamma))/min(yi, C*(xi**gamma)))*np.log(abs(yi-C*(xi**gamma))+1)if outlineScore > maxOutLine:maxOutLine = outlineScore={}'.format(maxOutLine))#get the maximum LOFScoremaxLOFScore = 0for ite in range(len(W)):if LOFScoreArray[ite] > maxLOFScore:maxLOFScore = LOFScoreArray[ite]={}'.format(maxLOFScore))for key in featureDict.keys():yi = featureDict[key][3]xi = featureDict[key][2]outlineScore = (max(yi, C*(xi**gamma))/min(yi, C*(xi**gamma)))*np.log(abs(yi-C*(xi**gamma))+1)LOFScore = LOFScoreArray[count]count += 1outScore = outlineScore/maxOutLine + LOFScore/maxLOFScore= outScore????return?outScoreDict
一、概? ?述
基于圖的異常檢測分為 孤立點檢測 和 異常群簇檢測,本文是孤立點檢測中較經(jīng)典的論文,通過研究Ego-net總結幾種異常模型及提供度量方式:
異常結構 | 含義 | 度量方式? |
CliqueStar | 呈星狀或者團狀結構 | 邊數(shù)~節(jié)點鄰居數(shù) |
HeavyVicinity | 總邊權重異常大 | 總邊權重~邊數(shù) |
DominantPair | 存在某條權重異常大的邊 | 主特征值~總邊權重 |
文章調查了Ego-net中存在的異常模式,并給出了檢測異常模式的依據(jù)基于上述模式,提出了OddBall,一種用于異常點檢測的無監(jiān)督方法,將OddBall應用于真實數(shù)據(jù)集,并驗證了算法的有效性
論文名稱:OddBall: Spotting Anomalies in Weighted Graphs
論文地址:http://www.cs.cmu.edu/~mmcgloho/pubs/pakdd10.pdf
代碼地址:https://www.andrew.cmu.edu/user/lakoglu/pubs.html#code
?
二、Ego-net(中心節(jié)點)
以中心節(jié)點(ego)及其鄰居組成的子圖,一般用于研究個體性質以及局部社區(qū)發(fā)現(xiàn),本文僅考慮一階鄰居,這是為了減少計算量并提和高可解釋性。
三、Ego-net模式及度量方法
1 、CliqueStar(基于密度)
基于密度的方法可以識別出下面兩種Ego-net的異常結構:
Near-Star:在正常的社交網(wǎng)絡中,我們通常認為朋友之間可能會相互認識,因此一階Ego-net中的鄰居之間沒有任何關聯(lián)是非常可疑的,近似星型,鄰居之間很少聯(lián)系(如通話關系網(wǎng)絡中的中介、電催人員、營銷號碼,他們大量的聯(lián)系別人,然而聯(lián)系人中之間幾乎沒啥聯(lián)系),這種結構的Ego-net被稱為star,如下圖所示,中心節(jié)點與大量節(jié)點存在關聯(lián),但是鄰居之間無聯(lián)系或者聯(lián)系很少。

Near-Clique:與上述相反,鄰居之間存在大量關聯(lián)也是非??梢傻模@種結構的Ego-net被稱為cliques。正如下圖所示,中心節(jié)點與大量節(jié)點存在關聯(lián),鄰居之間的聯(lián)系非常密集,近似環(huán)狀,鄰居之間聯(lián)系緊密(如某個討論組、恐怖組織)。

度量方法:邊數(shù)~鄰居數(shù)

如下圖所示,可以看出大多數(shù)節(jié)點Ego-net中邊數(shù) E 與鄰居數(shù) N 服從冪律分布(對數(shù)坐標后呈線性)、給定某節(jié)點i對應的 Ei 、Ni ,求出冪律系數(shù)?α?,若:
α?接近1(黑色虛線),節(jié)點i的Ego-net呈現(xiàn)Near-Clique?
α 接近2(藍色虛線),節(jié)點i的Ego-net呈現(xiàn)Near-Star
紅線是擬合中位數(shù),藍色和黑色虛線是邊界線。
?
大多數(shù)Graph都遵循該模式:


?
?
2、HeavyVicinity(權重)
HeavyVicinity指“較重的鄰居“,Ego-net中邊數(shù)一定時,總邊權重異常大(如騙貸者通過頻繁撥打電話偽造通話記錄),中心節(jié)點與一小部分節(jié)點之間存在權重非常大的關聯(lián)也是可疑的,如騙貸者通過頻繁撥打電話偽造通話記錄。正如下圖所示,中心節(jié)點與少部分節(jié)點之間的連接權重非常大。

?
度量方法:總邊權重~邊數(shù)
大多數(shù)節(jié)點Ego-net中總邊權重~邊數(shù)也服從冪律分布(對數(shù)坐標),?β?越高表示越異常

圖(a)選舉中,民主黨(DNC)的大量的資金給為數(shù)不多的候選者
?

?

?

?
?
?
3 、DominantPair(主導邊)
Dominant heavy links指“主導的邊”,Ego-Net中存在某條邊權重異常大(如學者投稿會議網(wǎng)絡中,“Toshio Fukuda” 擁有115篇papers,投稿了17個會議,但其中87篇pager投稿了一個ICRA):

?
度量方法:主特征值~總權重
大多數(shù)節(jié)點Ego-net對應帶權鄰接矩陣中主特征值(principal eigenvalue,即最大特征值)~總邊權重也服從冪律分布,其中系數(shù) λ 表示Ego-net中邊權均勻分布,?λ 接近1表示存在DominantPair的情況。

?

四、OddBall異常檢測算法
OddBall由out-line(i)和out-lof(i)兩部分組成:
out-line:計算實際點與擬合直線(紅線)的偏離程度。
out-lof:但out-line但會存在“缺陷是無法識別離正常點很遠,但與擬合直線很近的異常點”的缺陷,故結合傳統(tǒng)基于密度的方法LOF(也可以選其他的)。
二者集成方式先求出兩個score,然后歸一化(除以最大值)后求和:
out-score(i)=out-line(i)+out-lof(i)
1、out-line
為實際值,
?為在擬合直線(正常點)上的預測值,二者相減為偏離程度/異常程度取log是為了平滑
?為懲罰系數(shù):實際值偏離正常的倍數(shù)
2、out-lof
outline的缺陷:無法識別紅框內的節(jié)點,故引入LOF,詳情可參考:https://zhuanlan.zhihu.com/p/28178476

五、相關思考
本文中僅考慮了節(jié)點的一階子圖,將子圖范圍擴展到二階或者是更大的局部子圖是否會效果更好?檢測模式依賴的特征是否具有魯棒性?
?
