
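Feature selection removes features that are irrelevant to the target variable or redundant with other features. This reduces the number of features, with the aim of improving model accuracy and cutting run time.
Keeping only the truly relevant features also simplifies the model. A frequently quoted saying sums up the importance of feature engineering and feature selection:
Data and features determine the upper bound of machine learning; models and algorithms merely approach that bound.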

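This post uses four different methods to rank feature importance in machine learning and compares how strongly each feature affects the target variable. The four methods are:

Recursive feature elimination
Linear models
Random forest
Correlation coefficients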

Reference blog post: http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/

Import libraries

In [1]:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.feature_selection import RFE, f_regression
from sklearn.linear_model import (LinearRegression, Ridge, Lasso)                                 
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor

Load the data

In [2]:

house = pd.read_csv("kc_house_data.csv")
house

Out[2]:

Basic information

In [3]:

# Shape of the data
house.shape

Out[3]:

(21613, 21)

In [4]:

# Missing values per column
house.isnull().sum()

Out[4]:

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [5]:

house.isnull().any() # no column has missing values

In [6]:

# Column dtypes
house.dtypes

Out[6]:

id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

Drop unused columns

The id and date columns are dropped outright:

In [7]:

house = house.drop(["id", "date"],axis=1)

Pairplot Visualisation

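The "pair" in pairplot means that variables are plotted in pairs: the figure shows the relationship between every two variables.

Linear relationships, nonlinear relationships, and the absence of any clear relationship are all visible. The example below shows how to inspect the relationships between several features: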

In [8]:

fig = sns.pairplot(house[['sqft_lot','sqft_above','price','sqft_living','bedrooms']],
             hue="bedrooms",
             palette="tab20",
             height=2  # `size` was renamed to `height` in seaborn 0.9
            )

fig.set(xticklabels=[])

plt.show()

Attribute correlation heatmap

Correlation is only computed for numeric columns, so we first exclude string-typed attributes.

In [9]:

# # Method 1: find the string-typed attributes
# str_list = []

# for name, value in house.items():  # iteritems() was removed in pandas 2.0
#     if type(value[1]) == str:
#         str_list.append(name)
# str_list

In [10]:

# Method 2
# house.select_dtypes(include="object")

Here we directly select the non-string columns:

In [11]:

house_num = house.select_dtypes(exclude="object")

Compute the correlations and draw the heatmap:

corr = house_num.astype(float).corr()

# Draw the heatmap

f, ax = plt.subplots(figsize=(16,12))

plt.title("Pearson Correlation of 19 Features")

sns.heatmap(corr, # data
            linewidths=0.25,  # line width
            vmax=1.0,  # maximum value of the color scale
            square=True,  # draw square cells
            linecolor="k",  # line color
            annot=True  # annotate each cell with its value
           )

plt.show()
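
As a reminder of what each cell of the heatmap contains, here is a minimal pure-Python sketch of the Pearson correlation coefficient. The function name `pearson` and the toy numbers are illustrative assumptions, not part of the original notebook, which relies on `DataFrame.corr()`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations (unnormalised std devs)
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical toy values, not taken from kc_house_data.csv
sqft_living = [1000, 1500, 2000, 2500]
price = [200, 310, 390, 500]
r = pearson(sqft_living, price)  # close to +1: strongly linear, increasing
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 only rules out a linear one; a nonlinear dependence can still be present.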

Next we explore feature importance with the other three methods. First, separate the features from the target.

Splitting the data

In [14]:

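# 1. Extract the target variable first
y = house.price.values   # target variable
X = house.drop("price", axis=1)  # features
col_names = X.columns  # feature names, used by the ranking code below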

Computing feature importance

# 2. Define the ranking helper

ranks = {}

def ranking(ranks, names, order=1):
    mm = MinMaxScaler()  # min-max scaler instance
    ranks = mm.fit_transform(order * np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x,2), ranks)  
    
    return dict(zip(names, ranks))
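
To make the role of `order` concrete, the following pure-Python sketch reproduces the same min-max logic without scikit-learn. The name `min_max_scores` and the toy ranks are assumptions made for illustration. With `order=-1` the values are negated before scaling, so an RFE rank of 1 (best) becomes a score of 1.0 and the worst rank becomes 0.0.

```python
def min_max_scores(values, names, order=1):
    """Scale values to [0, 1]; order=-1 flips them so small inputs score high."""
    signed = [order * v for v in values]
    lo, hi = min(signed), max(signed)
    # Assumes at least two distinct values, otherwise hi - lo would be zero
    return {name: round((v - lo) / (hi - lo), 2)
            for name, v in zip(names, signed)}

# Hypothetical RFE ranks for four made-up features (1 = most important)
rfe_ranks = [1, 3, 2, 4]
names = ["a", "b", "c", "d"]

scores = min_max_scores(rfe_ranks, names, order=-1)
# scores == {"a": 1.0, "b": 0.33, "c": 0.67, "d": 0.0}
```

For coefficient magnitudes and random forest importances, where larger already means more important, the default `order=1` keeps the direction unchanged.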

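RFE-based feature ranking

RFE: Recursive Feature Elimination.

Rough idea: repeatedly fit a model such as linear regression or an SVM, read each feature's importance from the coef_ or feature_importances_ attribute, rank the features by that importance, remove the least important ones from the current feature set, and repeat.

Recursive Feature Elimination, or RFE, uses a model (e.g. linear regression or SVM) to select either the best- or worst-performing feature, and then excludes this feature.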
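
The elimination loop described above can be sketched by hand with NumPy least squares. This is an illustrative sketch, not scikit-learn's implementation: the function name `rfe_ranking` and the synthetic data are assumptions, and it drops exactly one feature per round using the absolute coefficient as the importance measure.

```python
import numpy as np

def rfe_ranking(X, y):
    """Rank features by repeatedly dropping the one with the smallest |coef|."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    remaining = list(range(X.shape[1]))
    ranking = np.ones(X.shape[1], dtype=int)  # the last survivor keeps rank 1
    while len(remaining) > 1:
        # Fit ordinary least squares with an intercept on the remaining features
        design = np.column_stack([X[:, remaining], np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        # The feature with the smallest absolute coefficient is eliminated
        weakest = remaining[int(np.argmin(np.abs(coef[:-1])))]
        ranking[weakest] = len(remaining)  # eliminated earlier = larger rank
        remaining.remove(weakest)
    return ranking

# Synthetic data: the target depends strongly on x0, weakly on x1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + 2 * X_demo[:, 2]

demo_ranking = rfe_ranking(X_demo, y_demo)  # rank 1 is the most important
```

Because the size of a raw coefficient depends on the units of its feature, a ranking like this is sensitive to feature scale when the features are not standardized.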

In [17]:

lr = LinearRegression()  # the `normalize` argument was removed in scikit-learn 1.2
lr.fit(X,y)

# Refit using RFE
rfe = RFE(lr, n_features_to_select=1,verbose=3)
rfe.fit(X,y)

ranks["RFE"] = ranking(list(map(float, rfe.ranking_)),
                       col_names,
                       order=-1
                      )

ranks  # features and their scores

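The output above shows the score of each feature; the actual ranks can be read from the ranking_ attribute:

Feature ranking based on linear models

Next we rank the features with three linear models.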

In [20]:

# 1. Linear regression
lr = LinearRegression()  # the `normalize` argument was removed in scikit-learn 1.2
lr.fit(X,y)
ranks["LinReg"] = ranking(np.abs(lr.coef_), col_names)

# 2. Ridge regression
ridge = Ridge(alpha=7)
ridge.fit(X,y)
ranks["Ridge"] = ranking(np.abs(ridge.coef_), col_names)

# 3. Lasso regression
lasso = Lasso(alpha=0.05)
lasso.fit(X,y)
ranks["Lasso"] = ranking(np.abs(lasso.coef_), col_names)

Part of the data newly added to ranks:

Feature ranking based on RandomForest

A random forest reports how important each feature is through the model's feature_importances_ attribute.

In [22]:

rf = RandomForestRegressor(n_jobs=-1,
                           n_estimators=50,
                           verbose=3
                          )

rf.fit(X,y)

ranks["RF"] = ranking(rf.feature_importances_, col_names)

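Building the feature ranking matrix

We combine the features and scores obtained from each method above into a single feature ranking matrix.

Generating the feature matrix

Finally, the correlation between each feature and the target variable is added for comparison: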
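
The code for these two steps is not shown, so the following is only a sketch of one way to do them. It assumes `ranks` is a dict of the form {method: {feature: score}}, as built by `ranking` above; the toy scores, the toy data frame, and the choice to min-max scale the absolute Pearson correlation for the `Corr` column are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy scores standing in for the `ranks` dict built earlier
ranks = {
    "RFE": {"lat": 1.0, "grade": 0.5, "sqft_lot": 0.0},
    "RF": {"lat": 0.4, "grade": 1.0, "sqft_lot": 0.0},
}

# Outer keys become columns (methods), inner keys become the index (features)
ranks_df = pd.DataFrame(ranks)

# Hypothetical toy data standing in for `house`
toy = pd.DataFrame({
    "lat": [47.1, 47.5, 47.3, 47.7],
    "grade": [6, 8, 7, 10],
    "sqft_lot": [5000, 4000, 9000, 6000],
    "price": [200.0, 450.0, 300.0, 800.0],
})

# Absolute correlation of every feature with the target, scaled to [0, 1]
corr = toy.drop(columns="price").corrwith(toy["price"]).abs()
ranks_df["Corr"] = ((corr - corr.min()) / (corr.max() - corr.min())).round(2)
```

Assigning the Series aligns on the feature names in the index, so the row order of `ranks_df` does not need to match the column order of the data.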

Computing the mean

Compute the mean score across all methods:

In [27]:

ranks_df["Mean"] = ranks_df.mean(axis=1)
ranks_df

Heatmap display

In [28]:

import seaborn as sns

cm = sns.light_palette("red", as_cmap=True)
s = ranks_df.style.background_gradient(cmap=cm)
s

Out[28]:

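Comparing the results

RFE's importance scores are high overall; its top two features are waterfront and lat
The three regression models produce similar scores, and their top two features are similar to RFE's, probably because RFE uses linear regression as its base model
The random forest gives high scores to three features: grade, sqft_living, and lat
Correlation coefficients: the resulting order of scores is close to that of the random forest

Finally, the ranking by Mean: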

f = plt.figure(figsize=(12,8))

sns.barplot(y=df1.index.tolist(), 
            x=df1["Mean"].tolist()
           )

plt.show()
