
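Hi everyone, I'm Dongge.

This is part 24 of the pandas tricks series: automatically optimizing data types to save a huge amount of memory!

For the rest of the series, see the "pandas tricks" topic; subscribe and new articles will be pushed to you as soon as they are published. The content is also mirrored on my GitHub, stars welcome!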

https://github.com/xiaoyusmd/PythonDataScience

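At work I often hear the people around me say: "Damn, out of memory again!"

I have heard this no fewer than a hundred times. Because of that, when resources are limited we try every trick we can to cut memory usage. Some common approaches: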

Reclaiming memory with gc.collect and del
Using alternatives to CSV, such as feather and Parquet
Optimizing the code: prefer NumPy array operations over for loops and apply
...
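As a small illustration of the third item, here is a minimal sketch. The DataFrame and its column names are made up for the example. The row-wise apply calls a Python function once per row, while the vectorized version runs one NumPy operation over whole columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-wise apply: a Python function call for every row.
total_apply = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: a single NumPy multiplication over the two columns.
total_vec = df["price"].to_numpy() * df["qty"].to_numpy()

print(np.allclose(total_apply.to_numpy(), total_vec))
```

Both versions compute the same totals; the vectorized one just skips the per-row Python overhead.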

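This time I'm sharing another trick: shrinking memory usage by changing data types. I introduced the category type in an earlier post, which can also reduce memory usage. The same idea can be extended to every data type.

By default, pandas assigns a data type to each column automatically. The most annoying and most memory-hungry of these is object (O), which also happens to limit some pandas functionality. Below is a list of pandas, Python, and NumPy data types; compare them and you will see that the pandas defaults leave a lot of room for optimization.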
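As a quick reminder of the category idea, here is a minimal sketch. The city names and row count are made up; the point is only that a text column with few distinct values takes less memory as category than as object.

```python
import pandas as pd

# Hypothetical column: 30,000 rows but only three distinct values.
cities = pd.Series(["Beijing", "Shanghai", "Shenzhen"] * 10_000, dtype=object)

# deep=True also counts the Python string objects behind an object column.
as_object = cities.memory_usage(deep=True)
as_category = cities.astype("category").memory_usage(deep=True)

print(f"object:   {as_object / 1024:.0f} KB")
print(f"category: {as_category / 1024:.0f} KB")
```

The saving depends on how many distinct values the column has; a column of mostly unique strings gains little or nothing from category.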

Pandas dtype	Python type	NumPy type	Usage
object	str	string_,unicode	Text
int64	int	int_,int8,int16,int32,int64,uint8,uint16,uint32,uint64	Integer numbers
float64	float	float,float16,float32,float64	Floating point numbers
bool	bool	bool_	True/False values
datetime64	NA	datetime64[ns]	Date and time values
timedelta[ns]	NA	NA	Differences between two datetimes
category	NA	NA	Finite list of text values

Source: http://pbpython.com/pandas_dtypes.html

Many of the default data types take up far more memory than the data actually needs. We can compress them down to the smallest subtype that still fits.
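To see the defaults from the table in practice, here is a minimal sketch with a made-up DataFrame. Even tiny integers get int64, floats get float64, and text ends up as object here because the example requests it explicitly.

```python
import pandas as pd

# Hypothetical columns, only for illustration.
df = pd.DataFrame({
    "small_int": [1, 2, 3],
    "ratio": [0.1, 0.2, 0.3],
    "flag": [True, False, True],
    "label": pd.Series(["a", "b", "c"], dtype=object),
})

print(df.dtypes)

# Shallow usage counts only the array of pointers for object columns;
# deep=True also counts the Python strings they point to.
print(df.memory_usage())
print(df.memory_usage(deep=True))
```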

Data type	Description
bool_	Boolean(True or False) stored as a byte
int_	Default integer type (same as C long; normally either int64 or int32)
intc	Identical to C int (normally int32 or int64)
intp	Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8	Byte (-128 to 127)
int16	Integer(-32768 to 32767)
int32	Integer(-2147483648 to 2147483647)
int64	Integer(-9223372036854775808 to 9223372036854775807)
uint8	Unsigned integer(0 to 255)
uint16	Unsigned integer(0 to 65535)
uint32	Unsigned integer(0 to 4294967295)
uint64	Unsigned integer(0 to 18446744073709551615)
float_	Shorthand for float64.
float16	Half precision float: sign bit,5 bits exponent,10 bits mantissa
float32	Single precision float: sign bit,8 bits exponent,23 bits mantissa
float64	Double precision float: sign bit,11 bits exponent,52 bits mantissa
complex_	Shorthand for complex128.
complex64	Complex number, represented by two 32-bit floats(real and imaginary components)
complex128	Complex number, represented by two 64-bit floats(real and imaginary components)

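Source: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html

The table above lists the NumPy data types from the documentation hosted on docs.scipy.org, from simple to complex. We want to simplify the types to save memory: for example, convert floats to float16/32, convert columns containing both positive and negative integers to int8/16/32, convert boolean values to uint8, and even use unsigned types for columns that hold only non-negative integers to cut memory further.

Based on this idea of simplifying column types, we can write a function that converts automatically. Following the table above, it downcasts floats and integers to the smallest subtype that fits: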
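The ranges in the table do not have to be memorized; NumPy exposes them through np.iinfo and np.finfo. The sketch below prints a few of them and shows the range check behind a downcast. The sample array is made up, and the last lines are a reminder that float16 keeps only about three decimal digits.

```python
import numpy as np

# Integer ranges, read directly from NumPy.
for t in (np.int8, np.int16, np.int32, np.int64, np.uint8):
    info = np.iinfo(t)
    print(f"{t.__name__:>6}: {info.min} to {info.max}")

# Float limits and approximate decimal precision.
for t in (np.float16, np.float32, np.float64):
    info = np.finfo(t)
    print(f"{t.__name__:>8}: max {info.max}, about {info.precision} decimal digits")

# Hypothetical data: non-negative values that fit into a single byte.
values = np.array([0, 100, 255])
fits_uint8 = (values.min() >= np.iinfo(np.uint8).min
              and values.max() <= np.iinfo(np.uint8).max)
small = values.astype(np.uint8) if fits_uint8 else values
print(values.nbytes, "->", small.nbytes, "bytes")

# float16 is compact but coarse: this value is not stored exactly.
rounded = float(np.float16(1234.5678))
print(rounded)
```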

import numpy as np


def reduce_memory_usage(df, verbose=True):
    """Downcast the numeric columns of df (modified in place) to the smallest fitting subtype."""
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    # Memory usage before conversion, in MB
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                # Integers: pick the narrowest signed type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Floats: only the range is checked; float16 keeps about 3 decimal digits,
                # so skip this branch if the column needs more precision
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    # Memory usage after conversion, in MB
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

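Of course, this function is not set in stone. I'm only offering a template; feel free to copy it and adapt it to your own habits.

Now let's see how much memory this conversion function actually saves. Here I used a dataset that takes up 2.2GB of memory once loaded. After running reduce_memory_usage, the result looks like this.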
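One possible variation, as a sketch only: pandas ships pd.to_numeric with a downcast argument that does a similar job. The helper name downcast_numeric and the sample columns below are made up for illustration. Unlike the template above, downcast="float" never goes below float32.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with numeric columns downcast."""
    out = df.copy()
    # Integer columns: smallest signed integer type that holds the values.
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Float columns: float32 at the smallest, never float16.
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Made-up data to show the effect.
raw = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1000, 2000, 70000],
    "c": [0.5, 1.5, 2.5],
})
small = downcast_numeric(raw)
print(small.dtypes)
```

This version returns a new DataFrame instead of modifying the input, which is a design choice for the sketch rather than a requirement.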

>>> reduce_memory_usage(tps_october)
Mem. usage decreased to 509.26 Mb (76.9% reduction)

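The dataset's memory footprint drops from 2.2GB to about 510MB. Don't underestimate this saving: during analysis or modeling we run many processing steps, so the dataset gets reused and copied many times. If the starting dataset is large, every later step is large too, and the total footprint is multiplied several times over.

One caveat: although memory usage drops while the program is running, the effect is lost when the data is saved. A format such as CSV does not store dtypes, so the columns return to their default types when the file is read back. Disk space is usually plentiful, though, so this matters less.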
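The caveat can be seen with a CSV round trip. This is a minimal sketch with made-up columns, using an in-memory buffer in place of a file. Passing dtype to read_csv is one way to get the compact types back on load.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical, already downcast data.
df = pd.DataFrame({
    "count": np.array([1, 2, 3], dtype=np.int8),
    "score": np.array([0.5, 1.5, 2.5], dtype=np.float32),
})

# Write to an in-memory CSV; the text format keeps values but not dtypes.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Plain reload: pandas infers the default types again.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)

# Reload with explicit dtypes to restore the compact types.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"count": "int8", "score": "float32"})
print(restored.dtypes)
```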


Slash memory usage! A pandas trick for automatic data type optimization