This article briefly introduces some common data preprocessing steps in Python: loading data, handling missing values, handling outliers, converting descriptive variables to numeric ones, splitting into training and test sets, and data normalization.

1. Loading data

1.1 Reading data

Data comes in many formats. This section covers reading the common CSV, TXT and Excel files, as well as tables in a MySQL database.

import pandas as pd

data = pd.read_csv(r'../filename.csv')     # read a CSV file
data = pd.read_table(r'../filename.txt')   # read a TXT file (tab-separated by default)
data = pd.read_excel(r'../filename.xlsx')  # read an Excel file

# Fetch data from a database
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='12345', database='mydb')  # connect; change these to your own database settings
cur = conn.cursor()  # create a cursor
cur.execute("select * from train_data limit 100")  # train_data is the table to read
data = cur.fetchall()   # fetch the rows
cols = cur.description  # column metadata; the first item of each entry is the column name
cur.close()   # close the cursor
conn.close()  # close the database connection
col = [i[0] for i in cols]
data = pd.DataFrame(list(map(list, data)), columns=col)
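
The cursor-to-DataFrame steps above can usually be shortened with pandas' own SQL reader. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the MySQL server, and the table contents are made up for the demo.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the MySQL server (assumption for this demo)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train_data (id INTEGER, score REAL)")
conn.executemany("INSERT INTO train_data VALUES (?, ?)", [(1, 0.5), (2, 0.8), (3, 0.1)])
conn.commit()

# read_sql_query runs the query and builds the DataFrame, column names included
df = pd.read_sql_query("SELECT * FROM train_data LIMIT 100", conn)
conn.close()

print(df.columns.tolist())
print(df.shape)
```

With this approach there is no need to collect the column names from cur.description by hand.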

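1.2 Merging CSV files

Real data is often spread across many small CSV or TXT files, while modelling may need all of it at once. In that case the small files have to be merged into a single file.

# Merge several CSV files into one file
import glob
import pandas as pd

# Merge
def hebing():
    csv_list = [f for f in glob.glob('*.csv') if f != 'result.csv']  # CSV files in the current folder, excluding the output file
    print('Found %s CSV files' % len(csv_list))
    print('Processing............')
    for i in csv_list:  # loop over the CSV files in the folder
        with open(i, 'rb') as src:
            fr = src.read()
        with open('result.csv', 'ab') as f:  # append to result.csv
            f.write(fr)  # note: every file's header row is copied as well
    print('Merge complete!')

# Remove duplicates
def quchong(file):
    df = pd.read_csv(file, header=0)
    datalist = df.drop_duplicates()
    datalist.to_csv(file, index=False)  # index=False avoids writing an extra index column

if __name__ == '__main__':
    hebing()
    quchong("result.csv")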
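
Because the byte-level merge above also copies each file's header row into the result, a pandas-based merge can be easier to reason about. The following is a minimal sketch under that assumption; the function name merge_csv_files and the two demo files are hypothetical.

```python
import glob
import os
import tempfile
import pandas as pd

def merge_csv_files(folder):
    """Read every CSV in `folder` and return one de-duplicated DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(p) for p in paths]  # each header is parsed, not copied as data
    merged = pd.concat(frames, ignore_index=True)
    return merged.drop_duplicates().reset_index(drop=True)

# Demo with two small hypothetical files that share one row
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"id": [1, 2], "v": [10, 20]}).to_csv(os.path.join(tmp, "a.csv"), index=False)
    pd.DataFrame({"id": [2, 3], "v": [20, 30]}).to_csv(os.path.join(tmp, "b.csv"), index=False)
    merged = merge_csv_files(tmp)

print(merged)
```

This variant loads all files into memory, so it suits many small files rather than a few very large ones.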

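1.3 Splitting a CSV file

Some files are too large to read or open directly. Here is one way to split such a file, which makes it easier to inspect the data layout and to read part of the data.

# The CSV is too large to open, so cut it into small files to look at the data
f = open('NEW_Data.csv', 'r')  # open the large file
i = 0  # line counter

# 1234567 is the number of lines in the file; if it is unknown, use another stopping condition such as the line length
while i < 1234567:
    with open('newfile' + str(i), 'w') as f1:
        for j in range(0, 10000):  # number of lines per sub-file
            if i < 1234567:  # check whether the file has ended, otherwise the last chunk may raise an error
                f1.writelines(f.readline())
                i = i + 1
            else:
                break
f.close()  # close the large file when done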
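
When the total line count is unknown, reading in chunks avoids hard-coding it. The sketch below assumes pandas is acceptable for this step and uses a small in-memory string as a stand-in for the large file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: a header plus 25 data rows
big_csv = io.StringIO("id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(25)))

chunk_sizes = []
for chunk in pd.read_csv(big_csv, chunksize=10):  # each chunk is a DataFrame of up to 10 rows
    chunk_sizes.append(len(chunk))
    # chunk.to_csv(f"part_{len(chunk_sizes)}.csv", index=False) would write this piece out

print(chunk_sizes)
```

Each piece written this way carries its own header row, so the sub-files can be read back independently.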

1.4 Inspecting the data

Before analysing the data, take a look at its overall shape to get a high-level picture.

data.head()      # show the first five rows
data.tail()      # show the last five rows
data.info()      # show information about each field
data.shape       # number of rows and columns; data.shape[0] is the row count, data.shape[1] is the column count
data.describe()  # summary of the data: mean, min/max, quantiles...
data.columns.tolist()  # column names as a list

2. Missing values

Real-world data is often missing or incomplete (be glad there is data at all; expecting it to be complete is asking a lot). For better analysis, these missing values are usually identified and handled.

2.1 Inspecting missing values

print(data.isnull().sum())  # number of missing values in each column
missing_col = data.columns[data.isnull().any()].tolist()  # columns that contain missing values

import numpy as np

# Share of missing values for each variable
def CountNA(data):
    cols = data.columns.tolist()  # all column names of data
    n_df = data.shape[0]  # number of rows
    for col in cols:
        missing = np.count_nonzero(data[col].isnull().values)  # number of missing values in col
        mis_perc = float(missing) / n_df * 100
        print("Missing ratio of {col}: {miss}%".format(col=col, miss=mis_perc))
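
The same per-column ratio can also be computed without an explicit loop. A minimal sketch on a made-up DataFrame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["A", "B", None, "D"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# isnull() gives a boolean frame; the column mean of booleans is the missing fraction
missing_pct = demo.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

Sorting the result puts the columns with the most missing data first, which helps when deciding between deleting and filling.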

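2.2 Handling missing values

There are generally three ways to deal with missing values: leave them, delete them, or fill them.

2.2.1 Leaving them

Some algorithms (Bayesian methods, xgboost, neural networks, etc.) are relatively insensitive to missing values, although many implementations still reject NaN input, so check the library you use. Some fields also matter little for the analysis. In those cases there is no need to spend time and effort on the missing values.

2.2.2 Deleting

Deleting is an option when the dataset is large, or when a record is missing several fields and is hard to fill in.

data.dropna(axis=0, how="any", inplace=True)  # axis=0 means rows; 'any' drops a row with any missing value, 'all' drops it only when every value is missing
data.dropna(axis=0, inplace=True)  # drop rows that contain missing values
data.dropna(axis=1, inplace=True)  # drop columns that contain missing values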

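2.2.3 Filling

When data is scarce, imputing missing values with their most likely values loses less information than deleting every incomplete sample.

2.2.3.1 Filling with a fixed value

data = data.fillna(0)  # fill all missing values with 0
data['col_name'] = data['col_name'].fillna('UNKNOWN')  # fill one column's missing values with a fixed value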

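2.2.3.2 Filling with the most frequent value

This is mode imputation. It works for both discrete and continuous data and suits nominal variables such as gender.

freq_port = data.col_name.dropna().mode()[0]  # mode() returns the most frequent value(s); col_name is the column name
data['col_name'] = data['col_name'].fillna(freq_port)  # fill with the most frequent value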

2.2.3.3 Median / mean imputation

data['col_name'] = data['col_name'].fillna(data['col_name'].median())  # median imputation, suited to skewed distributions or ones with outliers
data['col_name'] = data['col_name'].fillna(data['col_name'].mean())    # mean imputation, suited to normal distributions

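2.2.3.4 Filling with neighbouring values

# fillna(method='pad') and fillna(method='bfill') are deprecated in recent pandas versions
data['col_name'] = data['col_name'].ffill()  # fill with the previous value
data['col_name'] = data['col_name'].bfill()  # fill with the next value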

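2.2.3.5 Lagrange interpolation

Generally used for ordered data, such as a dataset with a time column, where the missing values are continuous numbers and occur in small batches.

from scipy.interpolate import lagrange

# Interpolation for one column: s is the column, n is the position to fill, k is the number of points taken before and after (default 5)
def ployinterp_columns(s, n, k=5):
    y = s.reindex(list(range(n-k, n)) + list(range(n+1, n+1+k)))  # take the neighbours; positions outside the index become NaN
    y = y[y.notnull()]  # drop missing values
    return lagrange(y.index, list(y))(n)  # interpolate and return the result

# Check every element and interpolate where needed (assumes the default 0..len-1 index and numeric columns)
for i in data.columns:
    for j in range(len(data)):
        if data[i].isnull()[j]:  # interpolate if the value is missing
            data.loc[j, i] = ployinterp_columns(data[i], j)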
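
To see what the interpolation step does on its own, here is a small worked sketch. The data is made up: points on y = x**2 with the value at x = 3 treated as missing, so the estimate should come out close to 9.

```python
import numpy as np
from scipy.interpolate import lagrange

# Known points of y = x**2, with the value at x = 3 treated as missing
x_known = np.array([0, 1, 2, 4, 5])
y_known = x_known ** 2

poly = lagrange(x_known, y_known)  # polynomial passing through all five points
estimate = poly(3)
print(round(float(estimate), 6))
```

SciPy's documentation warns that this implementation is numerically unstable for more than about 20 points, which is one reason to keep the window k small.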

2.2.3.6 Other imputation methods

Nearest-neighbour imputation, regression methods, Newton interpolation, random-forest imputation, and so on.
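
As one illustration of the nearest-neighbour approach, scikit-learn provides KNNImputer. This is only a sketch on a made-up array, and the choice of n_neighbors=2 is an assumption for the demo, not a recommendation.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Observed values are left unchanged; only the NaN entries are replaced.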

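3. Outliers

Outliers are individual values in a sample that deviate markedly from the rest of the observations. Sometimes they are bad data caused by recording errors or similar problems, and sometimes they are valid values that represent rare cases.

3.1 Identifying outliers

3.1.1 Descriptive statistics

# Data that contradicts the business context or common sense, such as a negative age

neg_list = ['col_name_1', 'col_name_2', 'col_name_3']
for item in neg_list:
    neg_item = data[item] < 0
    print(item + ' has ' + str(neg_item.sum()) + ' values below 0')

# Drop the records with values below 0
for item in neg_list:
    data = data[(data[item] >= 0)]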

3.1.2 Three-sigma rule

When data follows a normal distribution, about 99.7% of the values lie within 3 standard deviations of the mean, i.e. P(|x - μ| > 3σ) ≤ 0.003.

# Values beyond this distance can be treated as outliers
for item in neg_list:
    data[item + '_zscore'] = (data[item] - data[item].mean()) / data[item].std()
    z_abnormal = abs(data[item + '_zscore']) > 3
    print(item + ' has ' + str(z_abnormal.sum()) + ' outliers')
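
A small worked sketch of the three-sigma check on made-up readings: 20 ordinary values plus one extreme value.

```python
import pandas as pd

# 20 ordinary readings plus one extreme value (hypothetical data)
values = pd.Series([10, 11, 9, 10, 12, 8, 10, 11, 9, 10] * 2 + [100])

zscore = (values - values.mean()) / values.std()  # pandas std() uses ddof=1
outliers = values[zscore.abs() > 3]
print(outliers.tolist())
```

The rule needs a reasonably large sample: with the sample standard deviation, the largest possible |z| among n points is (n-1)/sqrt(n), so with 10 or fewer points no value can ever exceed 3.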

3.1.3 Box plot

# IQR (interquartile range) = U (upper quartile) - L (lower quartile)
# upper bound = U + 1.5 * IQR
# lower bound = L - 1.5 * IQR
for item in neg_list:
    IQR = data[item].quantile(0.75) - data[item].quantile(0.25)
    q_abnormal_L = data[item] < data[item].quantile(0.25) - 1.5 * IQR
    q_abnormal_U = data[item] > data[item].quantile(0.75) + 1.5 * IQR
    print(item + ' has ' + str(q_abnormal_L.sum() + q_abnormal_U.sum()) + ' outliers')
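
The bounds are easy to check by hand on a tiny made-up series. In the sketch below the quartiles are 3.25 and 7.75, so the IQR is 4.5 and the bounds are -3.5 and 14.5.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # hypothetical data

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```

Unlike the three-sigma rule, this check does not assume a normal distribution and works on small samples.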

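3.1.4 Other methods

Cluster-based detection, density-based outlier detection, proximity-based outlier detection, and so on.

3.2 Handling outliers

Outliers can be deleted, left untouched, or treated as missing values and handled accordingly.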
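
The third option, treating outliers as missing values, can be sketched as follows. The validity rule (an age between 0 and 120) and the data are assumptions made up for the demo.

```python
import pandas as pd

demo = pd.DataFrame({"age": [23, 35, -1, 41, 29, 250]})  # hypothetical data

# Assumed validity rule for this column: 0 <= age <= 120
invalid = (demo["age"] < 0) | (demo["age"] > 120)

demo["age"] = demo["age"].mask(invalid)                 # outliers become NaN
demo["age"] = demo["age"].fillna(demo["age"].median())  # then fill like any other missing value
print(demo["age"].tolist())
```

Once the outliers are NaN, any of the filling methods from section 2.2.3 can be applied in place of the median.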

4. Converting descriptive variables to numeric

Most machine learning algorithms require numeric input and cannot take strings, so descriptive variables in the data (such as gender) have to be converted to numeric data.

# Find the descriptive variables and store them in the list cat_vars
cat_vars = []
print('\nDescriptive variables:')
cols = data.columns.tolist()
for col in cols:
    if data[col].dtype == 'object':
        print(col)
        cat_vars.append(col)

## If the variables are ordinal ##
print('\nConverting descriptive variables...')
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Convert each descriptive variable to numeric codes and attach the result to the original data
for col in cat_vars:
    data['num_' + col] = le.fit_transform(data[col].tolist())  # direct assignment avoids index misalignment
    print('{col} converted to {num_col}'.format(col=col, num_col='num_' + col))
    del data[col]  # drop the original column

## If the variables are nominal (unordered): use this loop instead of the one above ##
# Note that one-hot encoding can cause a dimension explosion
for col in cat_vars:
    onehot_tran = pd.get_dummies(data[col], prefix=col)
    data = data.join(onehot_tran)  # add the one-hot columns to data
    del data[col]  # drop the original column
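
One caveat for ordinal variables: LabelEncoder assigns codes in sorted (alphabetical) order, which may not match the intended ranking. A minimal sketch of an explicit mapping, using a hypothetical size column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo = pd.DataFrame({"size": ["small", "large", "medium", "small"]})  # hypothetical column

# LabelEncoder sorts the classes, so the codes follow alphabetical order
alphabetical = LabelEncoder().fit_transform(demo["size"])

# An explicit mapping keeps the intended order small < medium < large
order = {"small": 0, "medium": 1, "large": 2}
demo["num_size"] = demo["size"].map(order)

print(list(alphabetical))
print(demo["num_size"].tolist())
```

Here the encoder gives "large" the lowest code, while the explicit mapping preserves the ranking.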

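5. Splitting into training and test sets

Before modelling, the data usually has to be split into a training set and a test set. Two ways of doing this are shown here.

Method 1: call the train_test_split function directly

from sklearn.model_selection import train_test_split
X = data.drop('target_col', axis=1)  # X holds the feature columns; 'target_col' stands for the name of the target column
y = data['target_col']  # y is the target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Method 2: random sampling

# Randomly pick rows as the test set
test_data = data.sample(frac=0.3, replace=False, random_state=123, axis=0)
# frac=0.3 samples 30% of the data; replace sets whether sampling is with replacement (replace=True means with replacement); axis=0 samples rows, axis=1 samples columns
# Remove test_data from data; the remaining rows form the training set
train_data = data.drop(test_data.index)  # dropping by index also works when data contains duplicate rows (assumes unique index labels)
X_train = train_data.drop('target_col', axis=1)
X_test = test_data.drop('target_col', axis=1)
y_train = train_data['target_col']
y_test = test_data['target_col']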
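
Both methods can be checked quickly on a toy DataFrame. The sketch below uses made-up data and a fixed random_state so the split is reproducible; the column name target_col is a placeholder.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

demo = pd.DataFrame({"x": range(10), "target_col": [0, 1] * 5})  # hypothetical data

# Method 1 with a fixed seed so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    demo.drop("target_col", axis=1), demo["target_col"], test_size=0.3, random_state=123
)

# Method 2: sample the test rows, then drop them by index to get the training rows
test_part = demo.sample(frac=0.3, random_state=123)
train_part = demo.drop(test_part.index)

print(len(X_tr), len(X_te), len(train_part), len(test_part))
```

In both cases the training and test rows are disjoint and together cover the whole dataset.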

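6. Data normalization

Normalization scales data proportionally so that it falls into a small, specific range. It is often used when comparing or combining indicators: it removes the units and turns the values into dimensionless numbers, so that indicators with different units or magnitudes can be compared and weighted. Algorithms for which normalization is commonly applied: LR, SVM, KNN, KMeans, GBDT, AdaBoost, neural networks, etc. (distance- and gradient-based models such as KNN, KMeans, SVM and neural networks benefit the most; tree-based models are much less sensitive to feature scale).

6.1 Min-max normalization

A linear transformation of the original data onto the interval [0,1]. The formula is: x* = (x-x.min)/(x.max-x.min)

from sklearn.preprocessing import MinMaxScaler

x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()

# Scale the features: fit on the training set only, then reuse the same scaler on the test set
x_train_sca = x_scaler.fit_transform(X_train)
x_test_sca = x_scaler.transform(X_test)
y_train_sca = y_scaler.fit_transform(pd.DataFrame(y_train))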
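
The formula can be verified against MinMaxScaler on a tiny made-up feature. The sketch also shows that test values outside the training range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_col = np.array([[2.0], [4.0], [6.0], [10.0]])  # hypothetical training feature
test_col = np.array([[8.0], [12.0]])                 # hypothetical test feature

scaler = MinMaxScaler().fit(train_col)

# Manual application of x* = (x - x.min) / (x.max - x.min)
manual = (train_col - train_col.min()) / (train_col.max() - train_col.min())
print(np.allclose(scaler.transform(train_col), manual))

# The test set reuses the training min and max, so values can leave [0, 1]
test_scaled = scaler.transform(test_col).ravel()
print(test_scaled)
```

Here 12 is larger than the training maximum of 10, so its scaled value is above 1.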

6.2 Zero-mean normalization

A linear transformation of the original data after which the data has mean 0 and standard deviation 1. Each feature value has the mean subtracted and is then divided by the standard deviation. The formula is: x* = (x-x.mean)/σ

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set and use the same scaler on the test set.
# Standardizing train and test together is also common, but it lets test-set statistics leak into training.
scaler = StandardScaler()
train = scaler.fit_transform(train)
test = scaler.transform(test)
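
A quick check of the formula on a made-up feature. Note that StandardScaler uses the population standard deviation (ddof=0), whereas pandas' std() defaults to ddof=1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_col = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # hypothetical training feature

scaled = StandardScaler().fit_transform(train_col)

# Manual application of x* = (x - x.mean) / sigma with the population standard deviation
manual = (train_col - train_col.mean()) / train_col.std(ddof=0)
print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

The scaled column has mean 0 and standard deviation 1 up to floating-point error.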

Author: GC_AIDM

https://www.cnblogs.com/shenggang/p/12133278.html



[Python Basics Series] Common Data Preprocessing Methods (with Code)

1. Loading data