
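1 Introduction

.What is outlier detection?

Outlier detection is also known as anomaly detection, noise detection, deviation detection, or exception mining. There is no universally accepted definition. An early definition (Grubbs, 1969) is: an outlying observation, or outlier, is one that appears to deviate markedly from the other members of the sample in which it occurs. A more recent definition (Barnett and Lewis, 1994) is: an observation that appears to be inconsistent with the remainder of that set of data.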

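.Causes

The most common causes of outliers are:

Data entry errors (human error)
Measurement errors (instrument error)
Experimental errors (errors in data extraction or in experiment planning or execution)
Intentional (dummy outliers created to test outlier detection methods)
Data processing errors (data manipulation or unintended mutations of the dataset)
Sampling errors (extracting or mixing data from wrong or various sources)
Natural (not an error, but novelty in the data that arises from its diversity)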

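.Applications

Outlier detection has a wide range of applications, for example:

Fraud detection: detecting fraudulent use of credit cards or phone cards.
Loan application processing: detecting fraudulent applications or potentially problematic customers.
Intrusion detection: detecting unauthorized access in computer networks.
Activity monitoring: detecting mobile phone fraud by monitoring phone activity, or detecting suspicious trades in the stock market.
Network performance: monitoring the performance of computer networks, for example to detect bottlenecks.
Fault diagnosis: detecting faults in, for example, motors, generators, pipelines, or space instruments on a space shuttle.
Structural defect detection: detecting defects and flaws on production lines.
Satellite image analysis: identifying novel features or misclassified features.
Detecting novelty in images: for robot neotaxis or surveillance systems.
Motion segmentation: detecting image features that move independently of the background.
Time-series monitoring: monitoring safety-critical applications such as drilling or high-speed milling.
Medical condition monitoring: for example, heart-rate monitors.
Pharmaceutical research: identifying novel molecular structures.
Detecting novelty in text: detecting the onset of news stories, performing topic detection and tracking, or helping traders pinpoint equity, commodity, and foreign-exchange trading stories and outperforming or underperforming commodities.
Detecting unexpected entries in databases: used in data mining to detect errors, fraud, or valid but unusual records.
Detecting mislabeled data in a training dataset.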

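.Approaches

There are three classes of outlier detection approaches:

Determine the outliers with no prior knowledge of the data. This is analogous to unsupervised clustering.
Model both normality and abnormality. This is analogous to supervised classification and requires labeled data.
Model only normality. This is called novelty detection and is analogous to semi-supervised recognition. It requires labeled data that belongs to the normal class.

I will deal with the first approach, which is also the most common case: most datasets have no labels that identify the outliers.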

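.Taxonomy of methods

Outlier detection methods can be divided into univariate methods and multivariate methods.
They can also be divided into parametric (statistical) methods, which assume that the underlying distribution of the observations is known, and nonparametric methods, such as distance-based methods and clustering methods.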
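To make the parametric/nonparametric distinction concrete, here is a minimal univariate sketch. It is not part of the original article: the function names, the thresholds (3 standard deviations, 1.5 IQR) and the synthetic sample are assumptions chosen for illustration. The z-score rule assumes the data are roughly normal, while the IQR rule (Tukey's fences) makes no distributional assumption.

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Parametric rule: assumes x is roughly normal and flags |z| > threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x, k=1.5):
    """Nonparametric rule (Tukey's fences): flags values far outside the quartiles."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# synthetic sample (an assumption, not the Pokemon data) with one planted extreme value
rng = np.random.default_rng(0)
sample = np.append(rng.normal(loc=70, scale=10, size=200), [250.0])

print("z-score rule flags:", zscore_outliers(sample).sum())
print("IQR rule flags:", iqr_outliers(sample).sum())
```

Both rules look at one variable at a time; the algorithms used in the rest of the article are multivariate.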

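.Outlier detection algorithms

The algorithms used in this article are:

Isolation Forest
Extended Isolation Forest
Local Outlier Factor
DBSCAN
One-Class SVM
An ensemble of the above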

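2 Practice

.Dataset

I will use the Pokemon^[1] dataset and run outlier detection on two columns, ['HP', 'Speed']. The dataset has few observations (800 rows), so the computations are fast. Only two columns (two dimensions) were chosen for the sake of visualization, but the methods apply to multi-dimensional data as well.

.The code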

import numpy as np
import pandas as pd
from scipy import stats
import eif as iso
from sklearn import svm
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

import matplotlib.dates as md
from scipy.stats import norm
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")  # possible choices: white, dark, whitegrid, darkgrid, ticks


import matplotlib.pyplot as plt
plt.style.use('ggplot')
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# full option names: short forms such as 'max_columns' can be ambiguous in recent pandas
pd.set_option('display.float_format', '{:f}'.format)
pd.set_option('display.max_columns', 250)
pd.set_option('display.max_rows', 150)

data = pd.read_csv('Pokemon.csv')

data.head().T

# the two features used for outlier detection
x1 = 'HP'; x2 = 'Speed'
X = data[[x1, x2]]
X.shape

(800, 2)

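.Isolation Forest

Isolation Forest, like any tree ensemble method, is built on decision trees. In these trees, partitions are created by first selecting a feature at random and then selecting a random split value between the minimum and maximum of that feature.
To create a branch in a tree, a random feature is selected first. Then a random split value (between the feature's minimum and maximum) is chosen. An observation whose value for this feature is lower than the split value follows the left branch; otherwise it follows the right branch. The process continues until a single point is isolated or a specified maximum depth is reached.
In principle, outliers are less frequent than regular observations and differ from them in their values (they lie farther from the regular observations in feature space). That is why, under such random partitioning, outliers tend to be isolated closer to the root of the tree, after fewer splits (a shorter average path length, i.e. the number of edges from the root to the leaf node).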
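The path-length idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the sklearn implementation: the helper names `isolation_depth` and `mean_depth`, the synthetic data and the planted outlier are all assumptions. It repeatedly applies random axis-parallel splits and counts how many are needed before one chosen point is alone in its partition.

```python
import numpy as np

def isolation_depth(X, idx, rng, max_depth=50):
    """Count the random axis-parallel splits needed to isolate row `idx` of X."""
    rows = np.arange(len(X))              # points still sharing a partition with the target
    depth = 0
    while len(rows) > 1 and depth < max_depth:
        spans = X[rows].max(axis=0) - X[rows].min(axis=0)
        candidates = np.flatnonzero(spans > 0)        # features that can still be split
        if candidates.size == 0:                      # only duplicates left: cannot isolate
            break
        feature = rng.choice(candidates)              # random feature
        lo, hi = X[rows, feature].min(), X[rows, feature].max()
        split = rng.uniform(lo, hi)                   # random split value between min and max
        go_left = X[idx, feature] < split
        # keep only the points that fall on the same side as the target point
        rows = rows[(X[rows, feature] < split) == go_left]
        depth += 1
    return depth

def mean_depth(X, idx, rng, n_trials=200):
    """Average isolation depth of one point over many random partitionings."""
    return float(np.mean([isolation_depth(X, idx, rng) for _ in range(n_trials)]))

rng = np.random.default_rng(42)
# 200 regular points plus one planted outlier in the last row (synthetic data)
X_demo = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])

depth_outlier = mean_depth(X_demo, 200, rng)
depth_regular = mean_depth(X_demo, 0, rng)
print("mean depth, planted outlier:", depth_outlier)
print("mean depth, regular point:", depth_regular)
```

The planted outlier should need noticeably fewer splits on average than a point inside the cloud, which is exactly the signal Isolation Forest turns into a score.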

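I will use IsolationForest from the sklearn library. When defining the algorithm, there is an important parameter called contamination: the proportion of observations that the algorithm expects to be outliers. We fit the algorithm on X (the two features HP and Speed) and call fit_predict on X, which returns plain labels (-1 for outliers, 1 for inliers). We can also use decision_function to get the score that Isolation Forest assigns to each sample.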

clf = IsolationForest(max_samples='auto', random_state = 1, contamination= 0.02)

preds = clf.fit_predict(X)

data['isoletionForest_outliers'] = preds
data['isoletionForest_outliers'] = data['isoletionForest_outliers'].astype(str)
data['isoletionForest_scores'] = clf.decision_function(X)

print(data['isoletionForest_outliers'].value_counts())
data[152:156]

1     785
-1     15
Name: isoletionForest_outliers, dtype: int64

Let's plot the results.

fig = px.scatter(data, x=x1, y=x2, color='isoletionForest_outliers', hover_name='Name')
fig.update_layout(title='Isolation Forest Outlier Detection', title_x=0.5, yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()


fig = px.scatter(data, x=x1, y=x2, color="isoletionForest_scores")
fig.update_layout(title='Isolation Forest Outlier Detection (scores)', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

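Visually, these 15 points lie outside the main body of the data, so flagging them as outliers seems reasonable.

A more advanced visualization can show the decision boundary of the Isolation Forest in addition to the outliers and inliers.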

data['isoletionForest_outliers']=='1'

0      True
1      True
2      True
3      True
4      True
       ... 
795    True
796    True
797    True
798    True
799    True
Name: isoletionForest_outliers, Length: 800, dtype: bool

X_inliers = data.loc[data['isoletionForest_outliers']=='1'][[x1,x2]]
X_outliers = data.loc[data['isoletionForest_outliers']=='-1'][[x1,x2]]

xx, yy = np.meshgrid(np.linspace(X.iloc[:, 0].min(), X.iloc[:, 0].max(), 50), np.linspace(X.iloc[:, 1].min(), X.iloc[:, 1].max(), 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

fig, ax = plt.subplots(figsize=(15, 7))
plt.title("Isolation Forest Outlier Detection with Outlier Areas", fontsize = 15, loc='center')
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

inl = plt.scatter(X_inliers.iloc[:, 0], X_inliers.iloc[:, 1], c='white', s=20, edgecolor='k')
outl = plt.scatter(X_outliers.iloc[:, 0], X_outliers.iloc[:, 1], c='red',s=20, edgecolor='k')

plt.axis('tight')
plt.xlim((X.iloc[:, 0].min(), X.iloc[:, 0].max()))
plt.ylim((X.iloc[:, 1].min(), X.iloc[:, 1].max()))
plt.legend([inl, outl],["normal observations", "abnormal observations"],loc="upper left");
# plt.show()

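The darker the color, the more outlying the region. The code below plots the distribution of the scores.

fig, ax = plt.subplots(figsize=(20, 7))
ax.set_title('Distribution of Isolation Forest Scores', fontsize = 15, loc='center')
# distplot is deprecated since seaborn 0.11; histplot/displot are its replacements
sns.distplot(data['isoletionForest_scores'],color='#3366ff',label='if',hist_kws = {"alpha": 0.35});

The distribution is important: it helps us choose a suitable contamination value for our case. If we change the contamination value, isoletionForest_scores will change, but only by a constant shift, so the shape of the distribution stays the same. The algorithm simply moves the outlier cutoff along the distribution.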
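The effect of contamination can be checked directly. The sketch below uses synthetic data as a stand-in for the HP/Speed columns (an assumption; the variable names are illustrative). In sklearn, decision_function equals score_samples minus the fitted offset_, and only offset_ depends on contamination, so the two forests produce the same raw scores and differ only in where the cutoff sits.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))   # synthetic stand-in, not the Pokemon data

# same random_state, different contamination
forests = {c: IsolationForest(random_state=1, contamination=c).fit(X_demo)
           for c in (0.02, 0.10)}

for c, forest in forests.items():
    raw = forest.score_samples(X_demo)           # independent of contamination
    shifted = forest.decision_function(X_demo)   # raw score minus offset_
    share_flagged = (forest.predict(X_demo) == -1).mean()
    print(f"contamination={c}: offset_={forest.offset_:.4f}, flagged={share_flagged:.3f}")

raw_low = forests[0.02].score_samples(X_demo)
raw_high = forests[0.10].score_samples(X_demo)
```

A histogram of the raw scores is therefore a reasonable place to look for a natural gap before fixing contamination.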

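.Extended Isolation Forest

Isolation Forest has a drawback: its decision boundaries are either vertical or horizontal. Because the cuts can only be parallel to the axes, some regions are crossed by many branch cuts yet contain only a few observations or a single one, which leads to incorrect anomaly scores for some observations.

Install with: pip install git+https://github.com/sahandha/eif.git

Extended Isolation Forest instead selects:

1) a random slope for the branch cut, and

2) a random intercept chosen from the range of values available in the training data. Together these two terms define a straight line, just like the slope and intercept of a linear regression line.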
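A single extended split can be sketched as follows. This is an illustration of the idea rather than the eif library's code: the function name `random_hyperplane_split` and the synthetic data are assumptions. The random slope is represented by a random normal vector and the random intercept by a point drawn uniformly from the bounding box of the data; each observation goes left or right depending on which side of the resulting line (a hyperplane in higher dimensions) it falls.

```python
import numpy as np

def random_hyperplane_split(X, rng):
    """Split the rows of X with a randomly oriented line/hyperplane."""
    normal = rng.normal(size=X.shape[1])                     # random slope (direction of the cut)
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))    # random point inside the data range
    left = (X - intercept) @ normal <= 0                     # side of the cut for every row
    return left, normal, intercept

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 2))    # synthetic data for illustration
left, normal, intercept = random_hyperplane_split(X_demo, rng)

print("points sent left:", int(left.sum()), "of", len(X_demo))
print("cut direction:", normal, "passing through:", intercept)
```

Because both components of the normal vector are generally non-zero, the cut is not parallel to either axis, which is what removes the axis-aligned artifacts of the standard Isolation Forest.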

X_data = X.values.astype('double')
F1 = iso.iForest(X_data, ntrees=100, sample_size=256, ExtensionLevel=X.shape[1]-1)  # X needs to be a numpy array
# calculate anomaly scores
anomaly_scores = F1.compute_paths(X_in = X_data)
data['extendedIsoletionForest_scores'] = -anomaly_scores
# determine lowest 2% as outliers
threshold = data['extendedIsoletionForest_scores'].quantile(0.02)  # computed once instead of once per row
data['extendedIsoletionForest_outliers'] = data['extendedIsoletionForest_scores'].apply(lambda x: '-1' if x<=threshold else '1')
print(data['extendedIsoletionForest_outliers'].value_counts())

1     784
-1     16
Name: extendedIsoletionForest_outliers, dtype: int64

fig = px.scatter(data, x=x1, y=x2, color='extendedIsoletionForest_outliers', hover_name='Name')
fig.update_layout(title='Extended Isolation Forest Outlier Detection', title_x=0.5, yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()


fig = px.scatter(data, x=x1, y=x2, color="extendedIsoletionForest_scores")
fig.update_layout(title='Extended Isolation Forest Outlier Detection (scores)', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

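Extended Isolation Forest does not provide plain outlier and inlier labels (such as -1 and 1). We created them ourselves by marking the lowest-scoring 2% as outliers. Its scores also differ from those of the basic Isolation Forest: because we negated the anomaly scores, all of them are negative.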

X_inliers = data.loc[data['extendedIsoletionForest_outliers']=='1'][[x1,x2]]
X_outliers = data.loc[data['extendedIsoletionForest_outliers']=='-1'][[x1,x2]]

xx, yy = np.meshgrid(np.linspace(X.iloc[:, 0].min(), X.iloc[:, 0].max(), 50), np.linspace(X.iloc[:, 1].min()-30, X.iloc[:, 1].max()+30, 50))

S1 = F1.compute_paths(X_in=np.c_[xx.ravel(), yy.ravel()])
S1 = S1.reshape(xx.shape)

fig, ax = plt.subplots(figsize=(15, 7))
plt.title("Extended Isolation Forest Outlier Detection with Outlier Areas", fontsize = 15, loc='center')
levels = np.linspace(np.min(S1),np.max(S1),50)
CS = ax.contourf(xx, yy, S1, levels, cmap=plt.cm.Blues)

inl = plt.scatter(X_inliers.iloc[:, 0], X_inliers.iloc[:, 1], c='white', s=20, edgecolor='k')
outl = plt.scatter(X_outliers.iloc[:, 0], X_outliers.iloc[:, 1], c='red',s=20, edgecolor='k')

plt.axis('tight')
plt.xlim((X.iloc[:, 0].min(), X.iloc[:, 0].max()))
plt.ylim((X.iloc[:, 1].min()-30, X.iloc[:, 1].max()+30))
plt.legend([inl, outl],["normal observations", "abnormal observations"],loc="upper left")
# plt.show()


fig, ax = plt.subplots(figsize=(20, 7))
ax.set_title('Distribution of Extended Isolation Scores', fontsize = 15, loc='center')
sns.distplot(data['extendedIsoletionForest_scores'],color='red',label='eif',hist_kws = {"alpha": 0.5});

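.Local Outlier Factor (LOF)

This method looks at the neighbors of a point to estimate its density and then compares that density with the densities of the neighbors.
The LOF of a point is the ratio of the average density of its neighbors to its own density. If the density of a point is much smaller than the densities of its neighbors (LOF ≫ 1), the point is far from dense regions and is therefore an outlier.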
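The density ratio can be written out explicitly. The sketch below is an illustrative from-scratch computation, assuming synthetic data with no duplicate points; the function name `lof_scores` is hypothetical. It follows the usual definition (k-distance, reachability distance, local reachability density) and can be compared against sklearn's negative_outlier_factor_, which stores the same quantity with the sign flipped.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

def lof_scores(X, k):
    """Local outlier factor of every row of X, based on its k nearest neighbours."""
    # ask for k + 1 neighbours because each point is returned as its own nearest neighbour
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]
    k_distance = dist[:, -1]                          # distance to the k-th neighbour
    reach_dist = np.maximum(dist, k_distance[idx])    # reachability distance to each neighbour
    lrd = 1.0 / reach_dist.mean(axis=1)               # local reachability density
    return lrd[idx].mean(axis=1) / lrd                # neighbours' density relative to own density

rng = np.random.default_rng(0)
# 200 regular points plus one planted outlier in the last row (synthetic data)
X_demo = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])

lof = lof_scores(X_demo, k=11)
sk_lof = -LocalOutlierFactor(n_neighbors=11).fit(X_demo).negative_outlier_factor_

print("LOF of planted outlier:", lof[-1])
print("median LOF of all points:", np.median(lof))
```

Points inside the cloud have LOF close to 1, while the planted point has a much larger value.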

clf = LocalOutlierFactor(n_neighbors=11)
y_pred = clf.fit_predict(X)

data['localOutlierFactor_outliers'] = y_pred.astype(str)
print(data['localOutlierFactor_outliers'].value_counts())
data['localOutlierFactor_scores'] = clf.negative_outlier_factor_

1     779
-1     21
Name: localOutlierFactor_outliers, dtype: int64

The most important parameter is n_neighbors. The default of 20 gave 45 outliers here. I changed it to 11 to get fewer outliers, close to 2%.

fig = px.scatter(data, x=x1, y=x2, color='localOutlierFactor_outliers', hover_name='Name')
fig.update_layout(title='Local Outlier Factor Outlier Detection', title_x=0.5, yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()


fig = px.scatter(data, x=x1, y=x2, color="localOutlierFactor_scores", hover_name='Name')
fig.update_layout(title='Local Outlier Factor Outlier Detection', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

We can create another interesting plot in which the larger the local outlier factor of a point, the larger the circle drawn around it.

fig, ax = plt.subplots(figsize=(15, 7.5))
ax.set_title('Local Outlier Factor Scores Outlier Detection', fontsize = 15, loc='center')

plt.scatter(X.iloc[:, 0], X.iloc[:, 1], color='k', s=3., label='Data points')
# normalise the scores to [0, 1] so that stronger outliers get larger circles
radius = (data['localOutlierFactor_scores'].max() - data['localOutlierFactor_scores']) / (data['localOutlierFactor_scores'].max() - data['localOutlierFactor_scores'].min())
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], s=2000 * radius, edgecolors='r', facecolors='none', label='Outlier scores')
plt.axis('tight')
legend = plt.legend(loc='upper left')
# legendHandles was renamed legend_handles in Matplotlib 3.7
handles = legend.legend_handles if hasattr(legend, 'legend_handles') else legend.legendHandles
handles[0]._sizes = [10]
handles[1]._sizes = [20]
plt.show();

fig, ax = plt.subplots(figsize=(20, 7))
ax.set_title('Distribution of Local Outlier Factor Scores', fontsize = 15, loc='center')
sns.distplot(data['localOutlierFactor_scores'],color='red',label='lof',hist_kws = {"alpha": 0.5});

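This algorithm is quite different from the previous ones and finds outliers in a different way.

.DBSCAN

A classic clustering algorithm that works as follows:

Randomly select a point that has not yet been assigned to a cluster or designated as an outlier. Determine whether it is a core point by checking whether at least min_samples points lie within epsilon distance of it.
Form a cluster from the core point and all points directly reachable from it (those within epsilon distance).
Find all points within epsilon distance of each core point in the cluster and add them to the cluster. Then find all points within epsilon distance of the newly added core points and add those as well. Repeat until the cluster stops growing. Points that are never reached from any core point stay unclustered and are treated as outliers.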
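For outlier detection, only the last consequence of these steps matters: which points end up in no cluster. That part can be sketched without running the full clustering. The function name `dbscan_noise_mask`, the synthetic data and the parameter values below are assumptions for illustration; the neighbourhood count includes the point itself, which matches how sklearn interprets min_samples.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def dbscan_noise_mask(X, eps, min_samples):
    """True for the points that DBSCAN would leave unclustered (noise)."""
    # indices of all points within eps of each point (the point itself included)
    neighbourhoods = NearestNeighbors(radius=eps).fit(X).radius_neighbors(X, return_distance=False)
    is_core = np.array([len(nb) >= min_samples for nb in neighbourhoods])
    # a non-core point still joins a cluster (as a border point) if a core point lies within eps
    near_core = np.array([is_core[nb].any() for nb in neighbourhoods])
    return ~(is_core | near_core)

rng = np.random.default_rng(3)
# a dense cloud plus a few scattered far-away points (synthetic data)
X_demo = np.vstack([rng.normal(0, 1, size=(300, 2)), rng.uniform(6, 12, size=(5, 2))])

noise = dbscan_noise_mask(X_demo, eps=0.5, min_samples=5)
sk_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

print("noise points (sketch):", int(noise.sum()))
print("noise points (sklearn):", int((sk_labels == -1).sum()))
```

The label -1 in sklearn's output corresponds to exactly these unreached points.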

from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(eps = 20, metric='euclidean', min_samples = 5,n_jobs = -1)
clusters = outlier_detection.fit_predict(X)

data['dbscan_outliers'] = clusters
data['dbscan_outliers'] = data['dbscan_outliers'].apply(lambda x: str(1) if x>-1 else str(-1))
print(data['dbscan_outliers'].value_counts())

1     787
-1     13
Name: dbscan_outliers, dtype: int64

The most important parameter to tune is eps.
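One common heuristic for choosing eps, not used in the original article, is to look at the sorted distance from each point to its min_samples-th nearest point and pick a value near the "knee" of that curve. The sketch below only computes the curve; the function name `sorted_k_distances`, the synthetic data and the trial value of eps are assumptions. A point is a core point exactly when this distance is at most eps, so the curve also shows how many core points a given eps would produce.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def sorted_k_distances(X, min_samples):
    """Distance from each point to its min_samples-th nearest point (itself included), ascending."""
    dist, _ = NearestNeighbors(n_neighbors=min_samples).fit(X).kneighbors(X)
    return np.sort(dist[:, -1])

rng = np.random.default_rng(5)
X_demo = rng.normal(0, 1, size=(300, 2))   # synthetic data for illustration

k_dist = sorted_k_distances(X_demo, min_samples=5)
eps_trial = 0.5                             # hypothetical candidate value
n_core_from_curve = int((k_dist <= eps_trial).sum())
n_core_sklearn = len(DBSCAN(eps=eps_trial, min_samples=5).fit(X_demo).core_sample_indices_)

print("90th/95th/99th percentile of k-distance:", np.percentile(k_dist, [90, 95, 99]))
print("core points at eps_trial:", n_core_from_curve)
```

Plotting `k_dist` and reading off where it starts to rise sharply gives a starting point for eps, which can then be adjusted toward the desired share of outliers.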

fig = px.scatter(data, x=x1, y=x2, color="dbscan_outliers", hover_name='Name')
fig.update_layout(title='DBSCAN Outlier Detection', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

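.One-Class SVM

A one-class classifier is fit on a training set that contains only normal points, but it can then be applied to all data. Once trained, the model classifies new examples as either inliers or outliers.
The main difference from a standard SVM is that it is fit in an unsupervised manner and offers no hyperparameter C to tune the margin. Instead, it provides a hyperparameter nu that controls the sensitivity of the support vectors and should be tuned to the approximate proportion of outliers in the data.

For more information on One-Class SVM, see:

Outlier Detection with One-Class SVMs^[2]
One-Class Classification Algorithms for Imbalanced Datasets^[3]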

clf = svm.OneClassSVM(nu=0.08, kernel='rbf', gamma='auto')
outliers = clf.fit_predict(X)
data['ocsvm_outliers'] = outliers
data['ocsvm_outliers'] = data['ocsvm_outliers'].apply(lambda x: str(-1) if x==-1 else str(1))
data['ocsvm_scores'] = clf.score_samples(X)
print(data['ocsvm_outliers'].value_counts())

-1    481
1     319
Name: ocsvm_outliers, dtype: int64

fig = px.scatter(data, x=x1, y=x2, color="ocsvm_outliers", hover_name='Name')
fig.update_layout(title='One Class SVM Outlier Detection', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

I could not find a better nu for this data; the parameter does not seem to work in this example. For other values of nu, the outliers outnumbered the inliers by an even wider margin.
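One possible explanation, which the original article does not test and which is offered here only as an assumption, is that the RBF kernel is applied to unscaled features with gamma='auto' (1 / n_features). With raw values spread over tens of units, most pairs of points then look almost completely dissimilar to the kernel. The sketch below compares the unscaled setup with a standardized pipeline on synthetic data of a similar spread; it does not use the Pokemon data, and all variable names are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# synthetic stand-in with a spread similar to raw stat columns (not the Pokemon data)
X_demo = rng.normal(loc=70, scale=25, size=(800, 2))

# setup comparable to the article: raw features, gamma='auto'
raw_model = OneClassSVM(nu=0.02, kernel='rbf', gamma='auto')
# alternative: standardize first and let gamma adapt to the feature variance
scaled_model = make_pipeline(StandardScaler(),
                             OneClassSVM(nu=0.02, kernel='rbf', gamma='scale'))

frac_raw = (raw_model.fit_predict(X_demo) == -1).mean()
frac_scaled = (scaled_model.fit_predict(X_demo) == -1).mean()

print("share flagged, raw features:", frac_raw)
print("share flagged, standardized:", frac_scaled)
```

If the standardized model flags a share of points close to nu on the real data as well, One-Class SVM could be brought back into the ensemble; whether that holds for the Pokemon columns would need to be checked.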

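.Ensemble

Finally, let's combine the algorithms above into a more robust one. I will simply add up the outlier columns, in which -1 marks an outlier and 1 an inlier.

One-Class SVM is left out because it performed poorly on this example, so 4 of the 5 algorithms are combined.

data['outliers_sum'] = data['isoletionForest_outliers'].astype(int)+data['extendedIsoletionForest_outliers'].astype(int)+data['localOutlierFactor_outliers'].astype(int)+data['dbscan_outliers'].astype(int)

data['outliers_sum'].value_counts()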

 3    774
 1     11
-3      8
-1      7
Name: outliers_sum, dtype: int64

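fig = px.scatter(data, x=x1, y=x2, color="outliers_sum", hover_name='Name')
fig.update_layout(title='Ensemble Outlier Detection', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

An observation with outliers_sum = 4 means that all 4 algorithms agree it is an inlier, while full agreement on an outlier gives a sum of -4.

First, let's look at the observations that all algorithms consider outliers; then we will label the observations with sum = 4 as inliers and all the rest as outliers.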

data.loc[data['outliers_sum']==-4]['Name']

121              Chansey
155              Snorlax
217            Wobbuffet
261              Blissey
313              Slaking
316             Shedinja
431    DeoxysSpeed Forme
495             Munchlax
Name: Name, dtype: object

data['outliers_sum'] = data['outliers_sum'].apply(lambda x: str(1) if x==4 else str(-1))

fig = px.scatter(data, x=x1, y=x2, color="outliers_sum", hover_name='Name')
fig.update_layout(title='Ensemble Outlier Detection', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()
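The rule used here is strict: a point counts as an inlier only if every detector says so. A softer alternative is a majority vote. The sketch below contrasts the two on a tiny hand-made table of votes; the detector names and values are hypothetical, and treating a tie as an inlier is an arbitrary choice made for the example.

```python
import pandas as pd

# hypothetical detector outputs: -1 = outlier, 1 = inlier
votes = pd.DataFrame({
    'detector_a': [1, -1, -1,  1],
    'detector_b': [1, -1,  1,  1],
    'detector_c': [1, -1, -1,  1],
    'detector_d': [1, -1,  1, -1],
})

vote_sum = votes.sum(axis=1)

# strict rule (as in the article): inlier only if all detectors agree
strict = vote_sum.apply(lambda s: 1 if s == votes.shape[1] else -1)
# majority rule: outlier only if more detectors vote outlier than inlier (ties count as inlier)
majority = vote_sum.apply(lambda s: -1 if s < 0 else 1)

print(pd.DataFrame({'sum': vote_sum, 'strict': strict, 'majority': majority}))
```

The strict rule flags more points and suits cases where missing an outlier is costly; the majority rule flags fewer and suits cases where false alarms are the bigger concern.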

References

[1]

Pokemon: https://www.kaggle.com/abcsds/pokemon

[2]

Outlier Detection with One-Class SVMs: https://towardsdatascience.com/outlier-detection-with-one-class-svms-5403a1a1878c

[3]

One-Class Classification Algorithms for Imbalanced Datasets: https://machinelearningmastery.com/one-class-classification-algorithms/

[4]

Original article: https://towardsdatascience.com/outlier-detection-theory-visualizations-and-code-a4fd39de540c
