Feature Selection Based on Stochastic Optimization Algorithms

First, we define a small synthetic dataset using the make_classification() function: 1,000 rows and five input variables, two of which are informative (the remaining three are redundant). The example below defines the dataset and summarizes its shape.
# define a small classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)
(1000, 5) (1000,)
We will use a DecisionTreeClassifier as the model, because its performance is quite sensitive to the exact choice of input variables. We evaluate the model using good practice: repeated stratified k-fold cross-validation with three repeats and 10 folds. The complete example is listed below.
# evaluate a decision tree on the entire small dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Mean Accuracy: 0.805 (0.030)
Each candidate subset of features can be represented as a sequence of boolean values, one per column: True to include the column, False to exclude it. For example, with five input features, the sequence [True, True, True, True, True] uses all inputs, while [True, False, False, False, False] uses only the first. We can enumerate all boolean sequences of length 5 with the product() function, specifying the allowed values [True, False] and a repeat count equal to the number of input variables. The function returns an iterable that we can loop over directly. The fragment below enumerates every non-empty subset, evaluates a decision tree on the selected columns, and tracks the best-scoring subset.
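To make the encoding concrete, here is a small illustration of what product() generates (a sketch using repeat=3 rather than 5, so the output stays short):
# demonstrate itertools.product on length-3 boolean masks
from itertools import product
for subset in product([True, False], repeat=3):
    print(subset)
This prints all 2^3 = 8 tuples, beginning with (True, True, True) and ending with (False, False, False); with five features there are 2^5 = 32 sequences, 31 of them non-empty.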
# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):
    # convert into column indexes
    ix = [i for i, x in enumerate(subset) if x]
    # check for no columns selected (all False)
    if len(ix) == 0:
        continue
    # select columns
    X_new = X[:, ix]
    # define model
    model = DecisionTreeClassifier()
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # summarize scores
    result = mean(scores)
    # check if it is better than the best so far
    if result >= best_score:
        # better result
        best_subset, best_score = ix, result
Tying this together, the complete example of feature selection by exhaustively enumerating all possible subsets is listed below.
# feature selection by enumerating all possible subsets of features
from itertools import product
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):
    # convert into column indexes
    ix = [i for i, x in enumerate(subset) if x]
    # check for no columns selected (all False)
    if len(ix) == 0:
        continue
    # select columns
    X_new = X[:, ix]
    # define model
    model = DecisionTreeClassifier()
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # summarize scores
    result = mean(scores)
    # report progress
    print('>f(%s) = %f' % (ix, result))
    # check if it is better than the best so far
    if result >= best_score:
        # better result
        best_subset, best_score = ix, result
# report best
print('Done!')
print('f(%s) = %f' % (best_subset, best_score))
Running the example reports the mean classification accuracy for every evaluated subset (results may vary slightly given the stochastic nature of the evaluation). In this case, the best subset found used the features at columns [0, 3, 4], with a mean classification accuracy of about 83.0 percent, which is better than the result reported earlier using all input features.
>f([0, 1, 2, 3, 4]) = 0.813667
>f([0, 1, 2, 3]) = 0.827667
>f([0, 1, 2, 4]) = 0.815333
>f([0, 1, 2]) = 0.824000
>f([0, 1, 3, 4]) = 0.821333
>f([0, 1, 3]) = 0.825667
>f([0, 1, 4]) = 0.807333
>f([0, 1]) = 0.817667
>f([0, 2, 3, 4]) = 0.830333
>f([0, 2, 3]) = 0.819000
>f([0, 2, 4]) = 0.828000
>f([0, 2]) = 0.818333
>f([0, 3, 4]) = 0.830333
>f([0, 3]) = 0.821333
>f([0, 4]) = 0.816000
>f([0]) = 0.639333
>f([1, 2, 3, 4]) = 0.823667
>f([1, 2, 3]) = 0.821667
>f([1, 2, 4]) = 0.823333
>f([1, 2]) = 0.818667
>f([1, 3, 4]) = 0.818000
>f([1, 3]) = 0.820667
>f([1, 4]) = 0.809000
>f([1]) = 0.797000
>f([2, 3, 4]) = 0.827667
>f([2, 3]) = 0.755000
>f([2, 4]) = 0.827000
>f([2]) = 0.516667
>f([3, 4]) = 0.824000
>f([3]) = 0.514333
>f([4]) = 0.777667
Done!
f([0, 3, 4]) = 0.830333
This approach works on a small dataset, but what about a dataset with many more features? Next, we define a much larger dataset with 10,000 rows and 500 input variables, only 10 of which are informative.
# define a large classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)
(10000, 500) (10000,)
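Exhaustive enumeration is no longer an option at this scale. A quick back-of-the-envelope check shows the size of the search space:
# the number of candidate feature subsets doubles with each added column
print('2^500 = %.2e possible subsets' % (2 ** 500))  # about 3.27e+150
With that many subsets, even a very fast evaluation of each candidate would exceed any practical budget, which is what motivates a stochastic search.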
As before, we first establish a baseline by evaluating a decision tree on all 500 features. Because the dataset is much larger, we switch to a faster evaluation procedure: stratified 3-fold cross-validation without repeats.
# evaluate a decision tree on the entire larger dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = StratifiedKFold(n_splits=3)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Mean Accuracy: 0.913 (0.001)
Instead of enumeration, we can frame feature selection as a stochastic optimization problem and search the space of subsets with a hill climbing algorithm. The search needs an objective function that takes a candidate solution (a sequence of booleans), decodes it into column indexes, evaluates a decision tree on the selected columns, and returns the mean accuracy. The objective() function below implements this, returning both the score and the decoded subset of columns to aid reporting.
# objective function
def objective(X, y, subset):
    # convert into column indexes
    ix = [i for i, x in enumerate(subset) if x]
    # check for no columns selected (all False)
    if len(ix) == 0:
        return 0.0, ix
    # select columns
    X_new = X[:, ix]
    # define model
    model = DecisionTreeClassifier()
    # evaluate model
    scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)
    # summarize scores
    result = mean(scores)
    return result, ix
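As a quick sanity check (a minimal sketch; it assumes X and y from the large dataset above are in scope), the function can be called directly with a hand-built boolean mask:
# evaluate a mask that selects only the first three columns
mask = [True, True, True] + [False] * (X.shape[1] - 3)
score, ix = objective(X, y, mask)
print(ix, score)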
The search also needs a way to take steps in the space: a mutation operator. Given a candidate solution (a sequence of booleans) and a mutation hyperparameter, the mutate() function creates and returns a modified copy of the solution, flipping each boolean independently with probability p_mutate. The larger the p_mutate value (in the range 0 to 1), the larger the step taken in the search space.
# mutation operator
def mutate(solution, p_mutate):
    # make a copy
    child = solution.copy()
    for i in range(len(child)):
        # check for a mutation
        if rand() < p_mutate:
            # flip the inclusion of this column
            child[i] = not child[i]
    return child
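A small demonstration of the operator (the 10-element mask and the p_mutate value here are purely illustrative): with p_mutate=0.2, each of the ten positions flips independently with probability 0.2, so about two entries change on average per call.
# hypothetical example: mutate a 10-element mask with flip probability 0.2
from numpy.random import rand
mask = [True] * 10
print(mutate(mask, 0.2))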
The hill climbing procedure first generates a random initial solution and evaluates it, then repeatedly mutates the current solution and keeps the candidate whenever it performs at least as well:
# generate an initial point
solution = choice([True, False], size=X.shape[1])
# evaluate the initial point
solution_eval, solution_ix = objective(X, y, solution)
# run the hill climb
for i in range(n_iter):
    # take a step
    candidate = mutate(solution, p_mutate)
    # evaluate candidate point
    candidate_eval, ix = objective(X, y, candidate)
    # check if we should keep the new point
    if candidate_eval >= solution_eval:
        # store the new point
        solution, solution_eval, solution_ix = candidate, candidate_eval, ix
    # report progress
    print('>%d f(%d) = %f' % (i+1, len(solution_ix), solution_eval))
The hillclimbing() function below implements this procedure, taking the dataset, the objective function, and the hyperparameters as arguments, and returning the best subset of dataset columns found along with the estimated performance of the model.
# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, p_mutate):
    # generate an initial point
    solution = choice([True, False], size=X.shape[1])
    # evaluate the initial point
    solution_eval, solution_ix = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = mutate(solution, p_mutate)
        # evaluate candidate point
        candidate_eval, ix = objective(X, y, candidate)
        # check if we should keep the new point
        if candidate_eval >= solution_eval:
            # store the new point
            solution, solution_eval, solution_ix = candidate, candidate_eval, ix
        # report progress
        print('>%d f(%d) = %f' % (i+1, len(solution_ix), solution_eval))
    return solution, solution_eval
We can then configure and run the search on the larger dataset. With p_mut set to 10.0/500.0, on average about 10 of the 500 columns are flipped at each step.
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))
Tying this together, the complete example of stochastic optimization for feature selection is listed below.
# stochastic optimization for feature selection
from numpy import mean
from numpy.random import rand
from numpy.random import choice
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# objective function
def objective(X, y, subset):
    # convert into column indexes
    ix = [i for i, x in enumerate(subset) if x]
    # check for no columns selected (all False)
    if len(ix) == 0:
        return 0.0, ix
    # select columns
    X_new = X[:, ix]
    # define model
    model = DecisionTreeClassifier()
    # evaluate model
    scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)
    # summarize scores
    result = mean(scores)
    return result, ix

# mutation operator
def mutate(solution, p_mutate):
    # make a copy
    child = solution.copy()
    for i in range(len(child)):
        # check for a mutation
        if rand() < p_mutate:
            # flip the inclusion of this column
            child[i] = not child[i]
    return child

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, p_mutate):
    # generate an initial point
    solution = choice([True, False], size=X.shape[1])
    # evaluate the initial point
    solution_eval, solution_ix = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = mutate(solution, p_mutate)
        # evaluate candidate point
        candidate_eval, ix = objective(X, y, candidate)
        # check if we should keep the new point
        if candidate_eval >= solution_eval:
            # store the new point
            solution, solution_eval, solution_ix = candidate, candidate_eval, ix
        # report progress
        print('>%d f(%d) = %f' % (i+1, len(solution_ix), solution_eval))
    return solution, solution_eval

# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))
Running the example reports the size and score of the best subset at each iteration; the tail of one run is shown below (results will vary given the stochastic nature of the algorithm).
>80 f(240) = 0.918099
>81 f(236) = 0.918099
>82 f(238) = 0.918099
>83 f(236) = 0.918099
>84 f(239) = 0.918099
>85 f(240) = 0.918099
>86 f(239) = 0.918099
>87 f(245) = 0.918099
>88 f(241) = 0.918099
>89 f(239) = 0.918099
>90 f(239) = 0.918099
>91 f(241) = 0.918099
>92 f(243) = 0.918099
>93 f(245) = 0.918099
>94 f(239) = 0.918099
>95 f(245) = 0.918099
>96 f(244) = 0.918099
>97 f(242) = 0.918099
>98 f(238) = 0.918099
>99 f(248) = 0.918099
>100 f(238) = 0.918099
Done!
Best: f(239) = 0.918099
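In this run the search settled on a subset of 239 of the 500 columns with an estimated accuracy of about 91.8 percent, modestly better than the 91.3 percent achieved using all 500 features. As a follow-up (a sketch, not part of the listing above; it assumes X, y, and ix from the run are still in scope), the discovered subset could be re-checked with the slower but more reliable repeated cross-validation procedure used earlier:
# re-evaluate the discovered subset with repeated stratified k-fold CV
from numpy import mean, std
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X[:, ix], y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Subset accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))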
Further reading:
Recursive Feature Elimination (RFE) for Feature Selection in Python: https://machinelearningmastery.com/rfe-feature-selection-in-python/
How to Choose a Feature Selection Method for Machine Learning: https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
sklearn.datasets.make_classification API: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
itertools.product API: https://docs.python.org/3/library/itertools.html#itertools.product

Author: 沂水寒城, CSDN blog expert. Research interests: machine learning, deep learning, NLP, CV.
Blog: http://yishuihancheng.blog.csdn.net