大家好，我是愛生活，AI風控的小伍哥，上次給大家?guī)淼谝黄恼翺ddBall，當時并未附代碼，現(xiàn)在找到一個比較好的開源代碼，分享給大家。

項目地址：https://github.com/gloooryyt/oddball_py3

我們自己構建測試數(shù)據(jù)或者讀取測試數(shù)據(jù)

import networkx as nxG = nx.Graph(date="10.11",name="無向圖")#創(chuàng)建空圖，無向圖G.add_node(1)G.add_nodes_from([1,2,3,4,5,6,7,8,9])G.nodes()G.add_weighted_edges_from([(1,2,10),(1,3,1),(1,4,1),(1,5,1),(1,6,1),(1,7,1),(1,8,1),(1,9,1),(8,9,1)])print(G.adj){1: {2: {'weight': 10}, 3: {'weight': 1}, 4: {'weight': 1}, 5: {'weight': 1}, 6: {'weight': 1}, 7: {'weight': 1}, 8: {'weight': 1}, 9: {'weight': 1}}, 2: {1: {'weight': 10}}, 3: {1: {'weight': 1}}, 4: {1: {'weight': 1}}, 5: {1: {'weight': 1}}, 6: {1: {'weight': 1}}, 7: {1: {'weight': 1}}, 8: {1: {'weight': 1}, 9: {'weight': 1}}, 9: {1: {'weight': 1}, 8: {'weight': 1}}}
import?matplotlib.pyplot?as?pltedgewidth?=?[10,1,1,1,1,1,1,1,12]nx.draw(G,with_labels=True,        pos=nx.spring_layout(G),        width=edgewidth,        node_color="lightgreen",        edge_color="DeepPink")plt.show()

也可以讀取自己的數(shù)據(jù)集

#數(shù)據(jù)讀取import ospath?=?'/Users/wuzhengxiang/Downloads/OddBall10_tbox?2/oddball-edges.txt'

下面????是正式的代碼

import?numpy?as?npimport networkx as nx#load data, a weighted undirected graphdef load_data(path):      data = np.loadtxt(path).astype('int32')      G = nx.Graph()      for ite in data:          G.add_edge(ite[0], ite[1], weight=ite[2])      return Gdef get_feature(G):      #feature dictionary which format is {node i's id:Ni, Ei, Wi, λw,i}      featureDict = {}      nodelist = list(G.nodes)      for ite in nodelist:          featureDict[ite] = []          #the number of node i's neighbor          Ni = G.degree(ite)          featureDict[ite].append(Ni)          #the set of node i's neighbor          iNeighbor = list(G.neighbors(ite))          #the number of edges in egonet i          Ei = 0          #sum of weights in egonet i          Wi = 0          #the principal eigenvalue(the maximum eigenvalue with abs) of egonet i's weighted adjacency matrix          Lambda_w_i = 0          Ei += Ni          egonet = nx.Graph()          for nei in iNeighbor:              Wi += G[nei][ite]['weight']              egonet.add_edge(ite, nei, weight=G[nei][ite]['weight'])          iNeighborLen = len(iNeighbor)          for it1 in range(iNeighborLen):              for it2 in range(it1+1, iNeighborLen):                  #if it1 in it2's neighbor list                  if iNeighbor[it1] in list(G.neighbors(iNeighbor[it2])):                      Ei += 1                      Wi += G[iNeighbor[it1]][iNeighbor[it2]]['weight']                      egonet.add_edge(iNeighbor[it1], iNeighbor[it2], weight=G[iNeighbor[it1]][iNeighbor[it2]]['weight'])          egonet_adjacency_matrix = nx.adjacency_matrix(egonet).todense()          eigenvalue, eigenvector = np.linalg.eig(egonet_adjacency_matrix)          eigenvalue.sort()          Lambda_w_i = max(abs(eigenvalue[0]), abs(eigenvalue[-1]))          featureDict[ite].append(Ei)          featureDict[ite].append(Wi)          featureDict[ite].append(Lambda_w_i)??????return?featureDict

import numpy as npfrom sklearn.linear_model import LinearRegressionfrom sklearn.neighbors import LocalOutlierFactor  # feature dictionary which format is {node i's id:Ni, Ei, Wi, λw,i}def star_or_clique(featureDict):    N = []    E = []    for key in featureDict.keys():        N.append(featureDict[key][0])        E.append(featureDict[key][1])    #E=CN^α => log on both sides => logE=logC+αlogN    #regard as y=b+wx to do linear regression    #here the base of log is 2    y_train = np.log2(E)    y_train = np.array(y_train)    y_train = y_train.reshape(len(E), 1)    x_train = np.log2(N)    x_train = np.array(x_train)    x_train = x_train.reshape(len(N), 1)    model = LinearRegression()    model.fit(x_train, y_train)    w = model.coef_[0][0]    b = model.intercept_[0]    C = 2**b    alpha = w    outlineScoreDict = {}    for key in featureDict.keys():        yi = featureDict[key][1]        xi = featureDict[key][0]        outlineScore = (max(yi, C*(xi**alpha))/min(yi, C*(xi**alpha)))*np.log(abs(yi-C*(xi**alpha))+1)        outlineScoreDict[key] = outlineScore    return outlineScoreDict

def heavy_vicinity(featureDict):    W = []    E = []    for key in featureDict.keys():        W.append(featureDict[key][2])        E.append(featureDict[key][1])    #W=CE^β => log on both sides => logW=logC+βlogE    #regard as y=b+wx to do linear regression    #here the base of log is 2    y_train = np.log2(W)    y_train = np.array(y_train)    y_train = y_train.reshape(len(W), 1)    x_train = np.log2(E)    x_train = np.array(x_train)    x_train = x_train.reshape(len(E), 1)    model = LinearRegression()    model.fit(x_train, y_train)    w = model.coef_[0][0]    b = model.intercept_[0]    C = 2**b    beta = w    outlineScoreDict = {}    for key in featureDict.keys():        yi = featureDict[key][2]        xi = featureDict[key][1]        outlineScore = (max(yi, C*(xi**beta))/min(yi, C*(xi**beta)))*np.log(abs(yi-C*(xi**beta))+1)        outlineScoreDict[key] = outlineScore    return outlineScoreDict

def dominant_edge(featureDict):    Lambda_w_i = []    W = []    for key in featureDict.keys():        Lambda_w_i.append(featureDict[key][3])        W.append(featureDict[key][2])    #λ=CW^γ => log on both sides => logλ=logC+γlogW    #regard as y=b+wx to do linear regression    #here the base of log is 2    y_train = np.log2(Lambda_w_i)    y_train = np.array(y_train)    y_train = y_train.reshape(len(Lambda_w_i), 1)    x_train = np.log2(W)    x_train = np.array(x_train)    x_train = x_train.reshape(len(W), 1)    model = LinearRegression()    model.fit(x_train, y_train)    w = model.coef_[0][0]    b = model.intercept_[0]    C = 2 ** b    beta = w    outlineScoreDict = {}    for key in featureDict.keys():        yi = featureDict[key][3]        xi = featureDict[key][2]        outlineScore = (max(yi, C * (xi ** beta)) / min(yi, C * (xi ** beta))) * np.log(abs(yi - C * (xi ** beta)) + 1)        outlineScoreDict[key] = outlineScore    return outlineScoreDict

def star_or_clique_withLOF(featureDict):    N = []    E = []    for key in featureDict.keys():        N.append(featureDict[key][0])        E.append(featureDict[key][1])    #E=CN^α => log on both sides => logE=logC+αlogN    #regard as y=b+wx to do linear regression    #here the base of log is 2    y_train = np.log2(E)    y_train = np.array(y_train)    y_train = y_train.reshape(len(E), 1)    x_train = np.log2(N)    x_train = np.array(x_train)    x_train = x_train.reshape(len(N), 1)    #the order in x_train and y_train is the same as which in featureDict.keys() now
    #prepare data for LOF    xAndyForLOF = []    for index in range(len(N)):        tempArray = np.array([x_train[index][0], y_train[index][0]])        xAndyForLOF.append(tempArray)    xAndyForLOF = np.array(xAndyForLOF)
    model = LinearRegression()    model.fit(x_train, y_train)    w = model.coef_[0][0]    b = model.intercept_[0]    C = 2**b    alpha = w    print('alpha={}'.format(alpha))
    #LOF algorithm    clf = LocalOutlierFactor(n_neighbors=20)    clf.fit(xAndyForLOF)    LOFScoreArray = -clf.negative_outlier_factor_
    outScoreDict = {}    count = 0   #Used to take LOFScore in sequence from LOFScoreArray
    #get the maximum outLine    maxOutLine = 0    for key in featureDict.keys():        yi = featureDict[key][1]        xi = featureDict[key][0]        outlineScore = (max(yi, C*(xi**alpha))/min(yi, C*(xi**alpha)))*np.log(abs(yi-C*(xi**alpha))+1)        if outlineScore > maxOutLine:            maxOutLine = outlineScore
    print('maxOutLine={}'.format(maxOutLine))
    #get the maximum LOFScore    maxLOFScore = 0    for ite in range(len(N)):        if LOFScoreArray[ite] > maxLOFScore:            maxLOFScore = LOFScoreArray[ite]
    print('maxLOFScore={}'.format(maxLOFScore))
    for key in featureDict.keys():        yi = featureDict[key][1]        xi = featureDict[key][0]        outlineScore = (max(yi, C*(xi**alpha))/min(yi, C*(xi**alpha)))*np.log(abs(yi-C*(xi**alpha))+1)        LOFScore = LOFScoreArray[count]        count += 1        outScore = outlineScore/maxOutLine + LOFScore/maxLOFScore        outScoreDict[key] = outScore    return outScoreDict

def heavy_vicinity_withLOF(featureDict):    W = []    E = []    for key in featureDict.keys():        W.append(featureDict[key][2])        E.append(featureDict[key][1])    #W=CE^β => log on both sides => logW=logC+βlogE    #regard as y=b+wx to do linear regression    #here the base of log is 2    y_train = np.log2(W)    y_train = np.array(y_train)    y_train = y_train.reshape(len(W), 1)    x_train = np.log2(E)    x_train = np.array(x_train)    x_train = x_train.reshape(len(E), 1)    #the order in x_train and y_train is the same as which in featureDict.keys() now
    #prepare data for LOF    xAndyForLOF = []    for index in range(len(W)):        tempArray = np.array([x_train[index][0], y_train[index][0]])        xAndyForLOF.append(tempArray)    xAndyForLOF = np.array(xAndyForLOF)
    model = LinearRegression()    model.fit(x_train, y_train)    w = model.coef_[0][0]    b = model.intercept_[0]    C = 2**b    beta = w    print('beta={}'.format(beta))
    #LOF algorithm    clf = LocalOutlierFactor(n_neighbors=20)    clf.fit(xAndyForLOF)    LOFScoreArray = -clf.negative_outlier_factor_
    outScoreDict = {}    count = 0   #Used to take LOFScore in sequence from LOFScoreArray
    #get the maximum outLine    maxOutLine = 0    for key in featureDict.keys():        yi = featureDict[key][2]        xi = featureDict[key][1]        outlineScore = (max(yi, C*(xi**beta))/min(yi, C*(xi**beta)))*np.log(abs(yi-C*(xi**beta))+1)        if outlineScore > maxOutLine:            maxOutLine = outlineScore
    print('maxOutLine={}'.format(maxOutLine))
    #get the maximum LOFScore    maxLOFScore = 0    for ite in range(len(W)):        if LOFScoreArray[ite] > maxLOFScore:            maxLOFScore = LOFScoreArray[ite]
    print('maxLOFScore={}'.format(maxLOFScore))
    for key in featureDict.keys():        yi = featureDict[key][2]        xi = featureDict[key][1]        outlineScore = (max(yi, C*(xi**beta))/min(yi, C*(xi**beta)))*np.log(abs(yi-C*(xi**beta))+1)        LOFScore = LOFScoreArray[count]        count += 1        outScore = outlineScore/maxOutLine + LOFScore/maxLOFScore        outScoreDict[key] = outScore    return outScoreDict
def dominant_edge_withLOF(featureDict):    Lambda_w_i = []    W = []    for key in featureDict.keys():        Lambda_w_i.append(featureDict[key][3])        W.append(featureDict[key][2])    #λ=CW^γ => log on both sides => logλ=logC+γlogW    #regard as y=b+wx to do linear regression    #here the base of log is 2    y_train = np.log2(Lambda_w_i)    y_train = np.array(y_train)    y_train = y_train.reshape(len(Lambda_w_i), 1)    x_train = np.log2(W)    x_train = np.array(x_train)    x_train = x_train.reshape(len(W), 1)    #the order in x_train and y_train is the same as which in featureDict.keys() now
    #prepare data for LOF    xAndyForLOF = []    for index in range(len(W)):        tempArray = np.array([x_train[index][0], y_train[index][0]])        xAndyForLOF.append(tempArray)    xAndyForLOF = np.array(xAndyForLOF)
    model = LinearRegression()    model.fit(x_train, y_train)    w = model.coef_[0][0]    b = model.intercept_[0]    C = 2**b    gamma = w    print('gamma={}'.format(gamma))
    #LOF algorithm    clf = LocalOutlierFactor(n_neighbors=20)    clf.fit(xAndyForLOF)    LOFScoreArray = -clf.negative_outlier_factor_
    outScoreDict = {}    count = 0   #Used to take LOFScore in sequence from LOFScoreArray
    #get the maximum outLine    maxOutLine = 0    for key in featureDict.keys():        yi = featureDict[key][3]        xi = featureDict[key][2]        outlineScore = (max(yi, C*(xi**gamma))/min(yi, C*(xi**gamma)))*np.log(abs(yi-C*(xi**gamma))+1)        if outlineScore > maxOutLine:            maxOutLine = outlineScore
    print('maxOutLine={}'.format(maxOutLine))
    #get the maximum LOFScore    maxLOFScore = 0    for ite in range(len(W)):        if LOFScoreArray[ite] > maxLOFScore:            maxLOFScore = LOFScoreArray[ite]
    print('maxLOFScore={}'.format(maxLOFScore))
    for key in featureDict.keys():        yi = featureDict[key][3]        xi = featureDict[key][2]        outlineScore = (max(yi, C*(xi**gamma))/min(yi, C*(xi**gamma)))*np.log(abs(yi-C*(xi**gamma))+1)        LOFScore = LOFScoreArray[count]        count += 1        outScore = outlineScore/maxOutLine + LOFScore/maxLOFScore        outScoreDict[key] = outScore????return?outScoreDict

一、概? ?述

基于圖的異常檢測分為孤立點檢測和異常群簇檢測，本文是孤立點檢測中較經(jīng)典的論文，通過研究Ego-net總結幾種異常模型及提供度量方式：

異常結構	含義	度量方式?
CliqueStar	呈星狀或者團狀結構	邊數(shù)~節(jié)點鄰居數(shù)
HeavyVicinity	總邊權重異常大	總邊權重~邊數(shù)
DominantPair	存在某條權重異常大的邊	主特征值~總邊權重

文章調查了Ego-net中存在的異常模式，并給出了檢測異常模式的依據(jù)基于上述模式，提出了OddBall，一種用于異常點檢測的無監(jiān)督方法，將OddBall應用于真實數(shù)據(jù)集，并驗證了算法的有效性

論文名稱：OddBall: Spotting Anomalies in Weighted Graphs

論文地址：http://www.cs.cmu.edu/~mmcgloho/pubs/pakdd10.pdf

代碼地址：https://www.andrew.cmu.edu/user/lakoglu/pubs.html#code

二、Ego-net（中心節(jié)點）

以中心節(jié)點(ego)及其鄰居組成的子圖，一般用于研究個體性質以及局部社區(qū)發(fā)現(xiàn)，本文僅考慮一階鄰居，這是為了減少計算量并提和高可解釋性。

三、Ego-net模式及度量方法

1 、CliqueStar（基于密度）

基于密度的方法可以識別出下面兩種Ego-net的異常結構：

Near-Star：在正常的社交網(wǎng)絡中，我們通常認為朋友之間可能會相互認識，因此一階Ego-net中的鄰居之間沒有任何關聯(lián)是非常可疑的，近似星型，鄰居之間很少聯(lián)系（如通話關系網(wǎng)絡中的中介、電催人員、營銷號碼，他們大量的聯(lián)系別人，然而聯(lián)系人中之間幾乎沒啥聯(lián)系），這種結構的Ego-net被稱為star，如下圖所示，中心節(jié)點與大量節(jié)點存在關聯(lián)，但是鄰居之間無聯(lián)系或者聯(lián)系很少。

Near-Clique：與上述相反，鄰居之間存在大量關聯(lián)也是非?？梢傻模@種結構的Ego-net被稱為cliques。正如下圖所示，中心節(jié)點與大量節(jié)點存在關聯(lián)，鄰居之間的聯(lián)系非常密集，近似環(huán)狀，鄰居之間聯(lián)系緊密（如某個討論組、恐怖組織）。

度量方法：邊數(shù)～鄰居數(shù)

如下圖所示，可以看出大多數(shù)節(jié)點Ego-net中邊數(shù) E 與鄰居數(shù) N 服從冪律分布（對數(shù)坐標后呈線性）、給定某節(jié)點i對應的 Ei 、Ni ，求出冪律系數(shù)?α?，若：

α?接近1（黑色虛線），節(jié)點i的Ego-net呈現(xiàn)Near-Clique?

α 接近2（藍色虛線），節(jié)點i的Ego-net呈現(xiàn)Near-Star

紅線是擬合中位數(shù)，藍色和黑色虛線是邊界線。

大多數(shù)Graph都遵循該模式：

2、HeavyVicinity(權重)

HeavyVicinity指“較重的鄰居“，Ego-net中邊數(shù)一定時，總邊權重異常大（如騙貸者通過頻繁撥打電話偽造通話記錄），中心節(jié)點與一小部分節(jié)點之間存在權重非常大的關聯(lián)也是可疑的，如騙貸者通過頻繁撥打電話偽造通話記錄。正如下圖所示，中心節(jié)點與少部分節(jié)點之間的連接權重非常大。