
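Introduction: This article is a set of summary notes I made while studying the official pandas documentation. Treat it as an overview of which common file formats pandas can read and write, and of the most frequently used reading and writing methods. The article is long, so consider bookmarking it.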

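Sparse matrix: coo_matrix

In numerical analysis, a sparse matrix is a matrix in which most elements are zero. Conversely, if most elements are nonzero, the matrix is dense. Large sparse matrices often arise in science and engineering when solving linear models.

When sparse matrices are stored and manipulated on a computer, standard algorithms often have to be modified to exploit the sparse structure. Because most entries are zero, compressed storage greatly reduces the memory cost. More importantly, some sparse matrices are so large that standard dense algorithms cannot handle them at all.

coo_matrix

The SciPy documentation describes it as "A sparse matrix in COOrdinate format".

Sparse matrices can be used in arithmetic operations: they support addition, subtraction, multiplication, division, and matrix power.

Advantages of the COO format: it facilitates fast conversion among sparse formats, it permits duplicate entries, and conversion to and from the CSR/CSC formats is very fast. Its drawback is that it does not directly support arithmetic operations or slicing.
The intended use of COO is as a fast format for constructing sparse matrices. Once a matrix has been constructed, convert it to CSR or CSC format for fast arithmetic and matrix-vector operations.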
Creating a coo_matrix

import numpy as np
from scipy.sparse import coo_matrix

row  = np.array([0, 3, 1, 0])
col  = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
coo_matrix((data, (row, col)), shape=(4, 4)).toarray()

array([[4, 0, 9, 0],
       [0, 7, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 5]])

coo_matrix((data, (i, j)), [shape=(M, N)])

This form takes three arrays:
data[:] holds the entries of the matrix, for example 4, 5, 7, 9 above.
i[:] holds the row indices; for example, element 0 of row above is 0, so the first value in data lies in row 0.
j[:] holds the column indices; for example, element 0 of col above is 0, so the first value in data lies in column 0.
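To make the (data, (i, j)) triples concrete, here is a minimal NumPy-only sketch of how they map onto a dense array. It is not how SciPy is implemented; it only mimics the documented behaviour that duplicate (i, j) entries are summed when a COO matrix is converted. The fifth triple, which repeats position (0, 0), is an addition for illustration.

```python
import numpy as np

# Same triples as above, plus one duplicate entry at (0, 0) for illustration.
row = np.array([0, 3, 1, 0, 0])
col = np.array([0, 3, 1, 2, 0])
data = np.array([4, 5, 7, 9, 1])

dense = np.zeros((4, 4), dtype=data.dtype)
# np.add.at performs an unbuffered add, so repeated positions accumulate
# instead of overwriting each other.
np.add.at(dense, (row, col), data)
print(dense)
```

Position (0, 0) ends up holding 4 + 1 = 5, which is why COO is convenient for building a matrix incrementally from many small contributions.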

Reading

DataFrame.sparse.from_spmatrix(data,
            index=None, columns=None)

Parameters
data: scipy.sparse.spmatrix; must be convertible to CSC format.
index, columns: optional row and column labels for the resulting DataFrame. Defaults to a RangeIndex.
Each column of the returned DataFrame is stored as an arrays.SparseArray.

import pandas as pd
import scipy.sparse
mat = scipy.sparse.eye(3)
pd.DataFrame.sparse.from_spmatrix(mat)

     0    1    2
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0

Writing

DataFrame.sparse.to_coo()

Returns the contents of the DataFrame as a sparse SciPy matrix in COO (coordinate) format.

Returns coo_matrix: scipy.sparse.spmatrix

Dict

In a Python dictionary, every element consists of a key and a value, written key: value. Elements are separated by commas, and the whole dictionary is enclosed in braces {}.

Reading

Construct a DataFrame from a dict of array-likes or a dict of dicts. The DataFrame is created by columns or by index, and a dtype can be specified.

DataFrame.from_dict(data, orient='columns',
                    dtype=None, columns=None)

The important parameter here is orient, which decides whether the dictionary keys become the DataFrame's columns or its index. The two examples below show each case.

orient='columns'

By default, the dictionary keys become the DataFrame columns.

data = {'col_1': [3, 2, 1, 0],
        'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data,
                       orient='columns')

   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

orient='index'

The dictionary keys are used as the DataFrame rows.

data = {'row_1': [3, 2, 1, 0],
        'row_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data,
                       orient='index')

       0  1  2  3
row_1  3  2  1  0
row_2  a  b  c  d
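With orient='index' the columns are numbered 0 to 3 by default. The signature above also has a columns parameter, and a small sketch of using it to name those columns follows; the labels "A" to "D" are made up for illustration.

```python
import pandas as pd

data = {"row_1": [3, 2, 1, 0],
        "row_2": ["a", "b", "c", "d"]}

# Dictionary keys become row labels; `columns` names the positions in each list.
df = pd.DataFrame.from_dict(data, orient="index",
                            columns=["A", "B", "C", "D"])
print(df)
```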

Writing

DataFrame.to_dict(orient='dict', into=<class 'dict'>)

Writing has the same important orient parameter as reading. Reading offers two values, whereas writing offers six, so it is more flexible.

The orient parameter is one of the strings {'dict', 'list', 'series', 'split', 'records', 'index'}
and determines the type of the dictionary values.
'dict' (default): dict shaped like {column : {index : value}}
'list': dict shaped like {column : [values]}
'series': dict shaped like {column : Series(values)}
'split': dict shaped like {index : [index], columns : [columns], data : [values]}
'records': list shaped like [{column : value}, … , {column : value}]
'index': dict shaped like {index : {column : value}}
Older pandas releases also accepted abbreviations, such as 's' for 'series' and 'sp' for 'split'. That shorthand was deprecated and no longer works in pandas 2.0 and later, so spell the value out.

>>> df = pd.DataFrame({'col1': [1, 2],
...                    'col2': [0.5, 0.75]},
...                    index=['row1', 'row2'])
>>> df
      col1  col2
row1     1  0.50
row2     2  0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
>>> df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
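The session above only shows the 'dict' and 'records' orients. As a small supplementary sketch, the same DataFrame can be exported with three of the other orients listed earlier; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [0.5, 0.75]},
                  index=["row1", "row2"])

as_list = df.to_dict(orient="list")    # {column: [values]}
as_split = df.to_dict(orient="split")  # separate index, columns and data
as_index = df.to_dict(orient="index")  # {index: {column: value}}

print(as_list)
print(as_split)
print(as_index)
```

'list' is handy for feeding plotting or JSON code, while 'split' keeps the labels apart from the values, which avoids repeating column names for every row.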

Records

Reading

Convert a structured or record ndarray to a DataFrame.

DataFrame.from_records(data, index=None,
                       exclude=None,
                       columns=None,
                       coerce_float=False,
                       nrows=None)

Records given as dicts, tuples, or an ndarray

import numpy as np
import pandas as pd

# Form 1: a list of dicts
data = [{'col_1': 3, 'col_2': 'a'},
        {'col_1': 2, 'col_2': 'b'}]
# Form 2: a structured ndarray
data = np.array([(3, 'a'), (2, 'b')],
                dtype=[('col_1', 'i4'),
                       ('col_2', 'U1')])
# Forms 1 and 2 carry their own column names
pd.DataFrame.from_records(data)
# Form 3: a list of tuples, which needs explicit column names
data = [(3, 'a'), (2, 'b')]
pd.DataFrame.from_records(data, columns=['col_1', 'col_2'])

   col_1 col_2
0      3     a
1      2     b
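The signature above also lists index and exclude, which the example does not use. A minimal sketch of both follows, assuming a list of dict records; the field names "id", "name" and "tmp" are invented for illustration.

```python
import pandas as pd

records = [
    {"id": 10, "name": "a", "tmp": 0.1},
    {"id": 20, "name": "b", "tmp": 0.2},
]

# Use the "id" field as the row index and drop the scratch column "tmp".
df = pd.DataFrame.from_records(records, index="id", exclude=["tmp"])
print(df)
```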

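Clipboard

The clipboard is a software facility, usually provided by the operating system, that uses copy and paste operations to hold data briefly and to move it between documents or applications.

Sometimes data is awkward to obtain and has to be copied and pasted. The usual approach is to paste it into a file first, such as a txt or Excel file, and then read that file.

If we only need the information temporarily, that is more trouble than it is worth. The method below reads data straight from the clipboard instead.

Reading

Read text from the clipboard and pass it to read_csv.

pandas.read_clipboard(sep='\\s+', **kwargs)

Parameters
sep: a string or regular-expression delimiter. The default '\s+' means one or more whitespace characters.
The method first reads the text from the clipboard and then hands it to read_csv, so all other parameters are those of read_csv.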
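Clipboard access depends on the desktop environment, so it is hard to demonstrate portably. Because read_clipboard passes the text on to read_csv, the sketch below mimics the parsing step with read_csv on an in-memory string and the same default separator. The "copied" text is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Pretend this text was copied from a web page or a terminal.
copied_text = "name  score\nalice 90\nbob   85\n"

# Same whitespace separator that read_clipboard uses by default.
df = pd.read_csv(StringIO(copied_text), sep=r"\s+")
print(df)
```

If this parses as expected, pd.read_clipboard() should produce the same DataFrame once the text is actually on the clipboard.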

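Writing

Copy the object to the system clipboard.

DataFrame.to_clipboard(excel=True,
                       sep=None, **kwargs)

Parameters
excel: bool, default True
Produce output in CSV format so that it pastes easily into Excel.
True: write CSV text using the provided separator.
False: write the string representation of the object to the clipboard.

This writes a text representation of the object to the system clipboard, which can then be pasted into Excel, for example.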

CSV

Reading

The most commonly used reader is probably read_csv. I covered it in an earlier article, so I only reference it briefly here.

pandas.read_csv(filepath_or_buffer, ...)

Read a comma-separated values (csv) file into a DataFrame.

pd.read_csv('data.csv')

Writing

DataFrame.to_csv()

Write the object to a comma-separated values (csv) file.

Saving as csv

df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                   'mask': ['red', 'purple'],
                   'weapon': ['sai', 'bo staff']})
# With no path argument, to_csv returns the CSV text as a string
df.to_csv(index=False)

Saving as zip

compression_opts = dict(method='zip',
                        archive_name='out.csv')
df.to_csv('out.zip', index=False,
          compression=compression_opts)
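As a quick check that to_csv and read_csv are inverses for simple data, here is a small round-trip sketch that stays in memory instead of touching the disk. It reuses the example DataFrame; the StringIO wrapper stands in for a file.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"name": ["Raphael", "Donatello"],
                   "mask": ["red", "purple"],
                   "weapon": ["sai", "bo staff"]})

csv_text = df.to_csv(index=False)           # no path: returns a string
restored = pd.read_csv(StringIO(csv_text))  # parse the text back

print(csv_text)
print(restored)
```

Passing index=False matters here: without it the row index is written as an extra unnamed column and comes back as data.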

Excel

Reading

Read an Excel file into a pandas DataFrame.

pandas.read_excel(io, sheet_name=0, header=0,
                  names=None, index_col=None, ...)

Supports the xls, xlsx, xlsm, xlsb, odf, ods, and odt file extensions, read from a local filesystem or a URL. Supports reading a single sheet or a list of sheets.

pd.read_excel(open('tmp.xlsx', 'rb'),
              sheet_name='Sheet3')
pd.read_excel('tmp.xlsx', index_col=None,
              header=None)
pd.read_excel('tmp.xlsx', index_col=0,
              dtype={'Name': str,
                     'Value': float})

Writing

Write the object to an Excel sheet.

DataFrame.to_excel(excel_writer,
                   sheet_name='Sheet1')

Common parameters
excel_writer: the target file path or an ExcelWriter object.
sheet_name: name of the sheet that will hold the DataFrame.
na_rep: representation of missing values; can be set to a string.
columns: the columns to write.
header: whether to write the column names, default True; a list of strings is treated as aliases for the column names.
index: default True, which writes the row index; with index=False the row labels are not written.
index_label: label for the index column.

To write a single object to an Excel .xlsx file, it is enough to specify the target file name. To write to multiple sheets, create an ExcelWriter object with the target file name and specify the sheet to write to.

Multiple sheets can be written by giving each a unique sheet_name. Once all data has been written, the changes must be saved; using ExcelWriter in a with block does this on exit.

Create, write to, and save a workbook

df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
                   index=['row 1', 'row 2'],
                   columns=['col 1', 'col 2'])
df1.to_excel("output.xlsx")

To write to more than one sheet in the workbook, specify an ExcelWriter object.

df2 = df1.copy()
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer,
                 sheet_name='Sheet_name_1')
    df2.to_excel(writer,
                 sheet_name='Sheet_name_2')

ExcelWriter can also be used to append to an existing Excel file.

with pd.ExcelWriter('output.xlsx',
                    mode='a') as writer:
    df1.to_excel(writer,
                 sheet_name='Sheet_name_3')

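Feather

Feather is a niche but very practical file format. In one sentence: a compressed binary file format that reads and writes at high speed.

Feather is a data format that is part of the Apache Arrow project. It provides binary columnar serialization for DataFrames. It is designed to make reading and writing data frames efficient and to make it easy to share data across data-analysis languages.

Reading

Load a Feather-format object from a file path.

pandas.read_feather(path, columns=None,
                    use_threads=True,
                    storage_options=None)

Parameters:
path: str, path object, or file-like object.
use_threads: bool, default True; whether to parallelize reading with multiple threads.

# The slow route: parse the CSV file
train_data = pd.read_csv("train.csv")
# The fast route, once train.feather has been written with to_feather
train_data = pd.read_feather("train.feather")

Writing

Call to_feather to save a DataFrame that has been read as a Feather file.

In machine-learning competitions you sometimes have to read tabular data of several hundred MB or even several GB. A common way to speed this up is to convert the .csv file to .feather once and read it with read_feather from then on, which can be much faster.

DataFrame.to_feather(path, **kwargs)

path: path of the Feather file to write.
compression: whether and how to compress; the three options are {'zstd', 'uncompressed', 'lz4'}.
compression_level: the compression level; note that lz4 does not support this parameter.

train_data.to_feather("train.feather")

Feather has clear performance advantages over csv:

It suits medium-sized data (data measured in GB); for example, a 4 GB csv file may take only about 700 MB as a Feather file.
It reads and writes far faster than csv, and it is more portable than a database, so it works well as an intermediate format for moving data around, for example when exporting part of a large database.
Feather can also read only the required columns from a file, which reduces memory use. This is very useful when analysing medium-sized (GB) data.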

Google BigQuery

DataFrame.to_gbq(destination_table, ...)

Write a DataFrame to a Google BigQuery table.

This function requires the pandas_gbq package. The package wraps Google's BigQuery analytics web service and simplifies retrieving results from BigQuery tables with SQL-like queries.

Result sets are parsed into a pandas.DataFrame whose shape and data types are derived from the source table. In addition, DataFrames can be inserted into new BigQuery tables or appended to existing ones.

See the official documentation for usage details.

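HDF files

HDF5 (Hierarchical Data Format) is a well-suited format for storing large-scale numerical data. Files usually carry the .h5 extension, reads and writes are very fast, and data is stored inside the file in an explicit hierarchy. A single HDF5 file can be thought of as a highly integrated folder that can hold different types of data.

There are two main ways to work with HDF5 files in Python:

The first is to use the HDF5-related methods built into pandas to save pandas data structures to an HDF5 file;
The second is to use the h5py module to save native Python data structures in HDF5 format.

The pandas HDFStore class stores DataFrames in an HDF5 file so that they can be accessed efficiently while column types and other metadata are preserved. It is a dict-like class, so you can read and write it much as you would a Python dict object.

pandas.HDFStore(path, mode='a',
                complevel=0, complib=None,
                fletcher32=False)

HDFStore() creates the object that manages I/O on an HDF5 file.

Its main parameters are:
path: a string naming the h5 file (include the full path if the file is not in the current working directory).
mode: the I/O mode, consistent with the mode argument of Python's built-in open().
The default is 'a': if the file already exists its data is kept and new data can be written; if it does not exist, a new file is created;
'r': read-only mode;
'w': create a new file (an existing file of the same name is overwritten);
'r+': similar to 'a', but the file must already exist.
complevel: int controlling the compression level of the h5 file, from 0 to 9. A higher level compresses more and takes less space, at the cost of more decompression time when reading. The default 0 means no compression.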

import numpy as np
import pandas as pd
# Open an hdf file
hdf = pd.HDFStore('test.hdf', 'w')
df1 = pd.DataFrame(np.random.standard_normal((3, 2)),
                   columns=['A', 'B'])
hdf.put(key='key1', value=df1,
        format='table', data_columns=True)
print(hdf.keys())

['/key1']

There are two main ways to read an HDF5 file in pandas. The first is to create an I/O object connected to the local h5 file as described above, and then either index it by key or pass the key to the store's get() method:

print(hdf['key1'])      # method 1
print(hdf.get('key1'))  # method 2

          A         B
0  0.257239  1.684300
1  0.076235 -0.071744
2 -0.266105 -0.874081

Storing with compression

large_data = pd.DataFrame(np.random.standard_normal((90000000, 4)))
# Plain storage:
hdf1 = pd.HDFStore('test1.h5', 'w')
hdf1.put(key='data', value=large_data)
hdf1.close()
# Compressed storage
hdf2 = pd.HDFStore('test2.h5', 'w', complevel=4, complib='blosc')
hdf2.put(key='data', value=large_data)
hdf2.close()

In this run, test2.h5 came out about 700 MB smaller than test1.h5, saving storage space.

Reading

The second way to read data from an h5 file is the pandas function read_hdf().

pandas.read_hdf(path_or_buf,
                key=None, mode='r', ...)

Parameters:
path_or_buf: str or pandas.HDFStore; a file path or an HDFStore object.
key: str; the identifier of the group in the store.
mode: {'r', 'r+', 'a'}, default 'r'; the mode used to open the file, with the same meanings as for the HDFStore class.

df = pd.DataFrame([[1, 1.0, 'a']],
                  columns=['x', 'y', 'z'])
df.to_hdf('./store.h5', key='data')
reread = pd.read_hdf('./store.h5')

Writing

Write the contained data to an HDF5 file using HDFStore.

DataFrame.to_hdf(path_or_buf,
                 key, mode='a', ...)

The Hierarchical Data Format (HDF) is self-describing, so an application can interpret the structure and contents of a file without outside information. One HDF file can hold a mix of related objects that can be accessed as a group or as individual objects.

To add another DataFrame or Series to an existing HDF file, use append mode and a different key.

df_tl = pd.DataFrame({"A": list(range(5)),
                      "B": list(range(5))})
df_tl.to_hdf("store_tl.h5", key="table",
             append=True)
pd.read_hdf("store_tl.h5",
            key="table", where=["index>2"])

   A  B
3  3  3
4  4  4

Json

JSON (JavaScript Object Notation) is a syntax for storing and exchanging text data, similar to XML. JSON is smaller and faster than XML and easier to parse, and pandas handles JSON data conveniently.

Reading

pandas.read_json(path_or_buf=None,
                 orient=None, ...)

Convert a JSON string to a pandas object.

Parameters
path_or_buf: a valid JSON string, path object, or file-like object.
orient: the expected format of the JSON string. Compatible JSON strings can be produced by to_json() with the corresponding orient value. The possible orients are:
'split': dict like {index: [index], columns: [columns], data: [values]}
'records': list like [{column: value}, ... , {column: value}]
'index': dict like {index: {column: value}}
'columns': dict like {column: {index: value}}
'values': just the values array.
The allowed values and the default depend on the value of the typ parameter.
When typ == 'series':
the allowed orients are {'split', 'records', 'index'};
the default is 'index';
the Series index must be unique for orient 'index'.
When typ == 'frame':
the allowed orients are {'split', 'records', 'index', 'columns', 'values', 'table'};
the default is 'columns';
the DataFrame index must be unique for orients 'index' and 'columns';
the DataFrame columns must be unique for orients 'index', 'columns', and 'records'.
keep_default_dates: bool, default True
When dates are parsed (convert_dates is not False), pandas tries to parse the default date-like columns. A column label is date-like if:
it ends with '_at',
it ends with '_time',
it begins with 'timestamp',
it is 'modified', or it is 'date'.

df = pd.DataFrame([['a', 'b'], ['c', 'd']],
                  index=['row 1', 'row 2'],
                  columns=['col 1', 'col 2'])
df.to_json(orient='split')

'{"columns":["col 1","col 2"],
  "index":["row 1","row 2"],
  "data":[["a","b"],["c","d"]]}'

# _ is the interactive session's previous result, here the JSON string above
pd.read_json(_, orient='split')

      col 1 col 2
row 1     a     b
row 2     c     d

Encoding and decoding a DataFrame with 'index'

df.to_json(orient='index')

'{"row 1":{"col 1":"a","col 2":"b"},
  "row 2":{"col 1":"c","col 2":"d"}}'

pd.read_json(_, orient='index')

      col 1 col 2
row 1     a     b
row 2     c     d

Encoding and decoding a DataFrame with 'records'

df.to_json(orient='records')

'[{"col 1":"a","col 2":"b"},
  {"col 1":"c","col 2":"d"}]'

pd.read_json(_, orient='records')

  col 1 col 2
0     a     b
1     c     d
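The interactive snippets above rely on `_`, which only exists in a REPL. A script-friendly sketch of the same round trip follows. It wraps the JSON text in io.StringIO, since recent pandas releases (2.1 and later, to my knowledge) warn when a literal JSON string is passed directly. The round_trips dictionary is just a holder for illustration.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame([["a", "b"], ["c", "d"]],
                  index=["row 1", "row 2"],
                  columns=["col 1", "col 2"])

round_trips = {}
# These three orients store the row labels, so the index survives the trip.
for orient in ("split", "index", "columns"):
    text = df.to_json(orient=orient)
    round_trips[orient] = pd.read_json(StringIO(text), orient=orient)

# 'records' drops the row labels, so the index comes back as 0, 1, ...
records_back = pd.read_json(StringIO(df.to_json(orient="records")),
                            orient="records")
print(records_back)
```

The practical rule is to read with the same orient that was used for writing, and to avoid 'records' when the index carries information.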

Writing

DataFrame.to_json(path_or_buf=None,
                  orient=None, ...)

The orient parameter indicates the format of the resulting JSON string.
It accepts the same string formats as for reading; the defaults and allowed values depend on the object type:
Series:
the default is 'index';
the allowed values are {'split', 'records', 'index', 'table'}.
DataFrame:
the default is 'columns';
the allowed values are {'split', 'records', 'index', 'columns', 'values', 'table'}.

import json
df = pd.DataFrame(
    [["a", "b"], ["c", "d"]],
    index=["row 1", "row 2"],
    columns=["col 1", "col 2"],
)
result = df.to_json(orient="split")
parsed = json.loads(result)
print(json.dumps(parsed, indent=4))

{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}

Encoding a DataFrame as 'records'-formatted JSON

result = df.to_json(orient="records")
parsed = json.loads(result)
print(json.dumps(parsed, indent=4))

[
    {
        "col 1": "a",
        "col 2": "b"
    },
    {
        "col 1": "c",
        "col 2": "d"
    }
]

Encoding a DataFrame as 'columns'-formatted JSON

result = df.to_json(orient="columns")
parsed = json.loads(result)
print(json.dumps(parsed, indent=4))

{
    "col 1": {
        "row 1": "a",
        "row 2": "c"
    },
    "col 2": {
        "row 1": "b",
        "row 2": "d"
    }
}

Html

Reading

pandas.read_html(io, match='.+', ...)

Parameters
io: a string, path object, or file-like object.
match: a string or compiled regular expression.
The set of tables containing text that matches this regex or string is returned. The default is '.+' (match any non-empty string), which returns every table on the page.

The io parameter of read_html() accepts several kinds of input, and a URL is one of them. By default the function then uses lxml to parse the td cells inside each table tag and builds a list of DataFrame objects. Pick the DataFrame you want by indexing into that list.

read_html returns a list of DataFrame objects, even if the HTML content contains only a single table.

>>> url = ("https://raw.githubusercontent.com/pandas-dev/pandas/master/"
...        "pandas/tests/io/data/html/spam.html")
>>> pd.read_html(url)
[      Nutrient        Unit Value per 100.0g
 0   Proximates  Proximates       Proximates
 1        Water           g            51.70
 2       Energy        kcal              315
 ..         ...         ...              ...
[37 rows x 6 columns]]

Read in the content of the "banklist.html" file and pass it to read_html as a string.

from io import StringIO

# file_path points at a local copy of banklist.html
with open(file_path, "r") as f:
    dfs = pd.read_html(f.read())
# A StringIO instance can be passed as well
with open(file_path, "r") as f:
    sio = StringIO(f.read())
dfs = pd.read_html(sio)

If bs4 and html5lib are installed and flavor=['lxml', 'bs4'] is passed, the parse will most likely succeed, because bs4 is tried when lxml fails.

dfs = pd.read_html(url, match="Metcalf Bank",
                   index_col=0,
                   flavor=["lxml", "bs4"])

read_html() only parses static pages. For a dynamically loaded page, obtain the rendered HTML by other means and pass the resulting response.text to read_html() to extract the tables.

Writing

DataFrame objects have an instance method to_html, which renders the contents of the DataFrame as an HTML table.

DataFrame.to_html(buf=None, ...)

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(2, 2))

          0         1
0 -0.184744  0.496971
1 -0.856240  1.857977

print(df.to_html())  # raw html

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>-0.184744</td>
      <td>0.496971</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-0.856240</td>
      <td>1.857977</td>
    </tr>
  </tbody>
</table>

Rendered HTML:

	0	1
0	-0.184744	0.496971
1	-0.856240	1.857977

float_format takes a Python callable that controls the precision of floating-point values:

df.to_html(float_format="{0:.10f}".format)

Parameters
classes: adds CSS classes to the resulting HTML table.
render_links=True: turns cells that contain URLs into hyperlinks.
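A short sketch combining the classes and float_format parameters described above follows. The class name "results" and the column names are invented for illustration, and index=False is an extra to_html option used here to leave out the row labels.

```python
import pandas as pd

df = pd.DataFrame({"site": ["a", "b"], "ratio": [0.5, 1.23456]})

html = df.to_html(
    classes="results",               # extra CSS class on the <table> tag
    float_format="{:.2f}".format,    # callable applied to every float
    index=False,                     # omit the row index column
)
print(html)
```

The returned value is an ordinary string, so it can be written to a file or embedded in a template.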

String

DataFrame.to_string()

Convert the DataFrame to a string.

d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(d)
print(df.to_string())

   col1  col2
0     1     4
1     2     5
2     3     6

Pickle

Python's pickle module serializes and deserializes data. An important use of serialization is that it makes data easy to store.

Serialization converts an object into a binary byte stream and preserves its data type. For example, if you have to leave in the middle of processing some data, you can serialize it to local disk as it is. Whatever type the data has is saved with it, and when you load it again it has the same type, so you do not need to start processing from scratch.

To store data, use pickle.dump(obj, file[, protocol]) to save the object obj to the file file. Use pickle.load(file) to read the bytes back from file and rebuild the original Python object, which is the deserialization step.

Pickling is even simpler with pandas

Use pd.read_pickle from pandas to read pickled data.

read_pickle(), DataFrame.to_pickle(), and Series.to_pickle() can read and write compressed pickle files. The gzip, bz2, and xz compression types are supported for both reading and writing. The zip file format only supports reading and must contain only one data file to be read.

The compression type can be given explicitly or inferred from the file extension. With "infer", gzip, bz2, zip, or xz is used when the file name ends in ".gz", ".bz2", ".zip", or ".xz" respectively.

Reading

pandas.read_pickle(filepath_or_buffer,
                   compression='infer',
                   storage_options=None)

Parameter compression: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
It can also be a dict that passes options to the compression protocol; in that case the compression method must be one of {'zip', 'gzip', 'bz2'}.
If the mode is 'infer' and path_or_buf is path-like, the compression is detected from the extensions '.gz', '.bz2', '.zip', or '.xz' (otherwise no decompression is applied).
If a dict is given and the mode is 'zip' or is inferred as 'zip', the other entries are passed as additional compression options.

The pandas read_pickle function can load any pickled pandas object from a file.

df

c1         a   
c2         b  d
lvl1 lvl2      
a    c     1  5
     d     2  6
b    c     3  7
     d     4  8

df.to_pickle("foo.pkl")
pd.read_pickle("foo.pkl")

c1         a   
c2         b  d
lvl1 lvl2      
a    c     1  5
     d     2  6
b    c     3  7
     d     4  8

Writing

Calling a DataFrame's to_pickle method produces a pickle file that stores the data persistently.

All pandas objects have a to_pickle method, which uses Python's pickle module to save the data structure to disk in pickle format.

DataFrame.to_pickle(path,
                    compression='infer',
                    protocol=5,
                    storage_options=None)

Parameter compression: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
A string naming the compression to use in the output file. By default it is inferred from the file extension in the given path. The compression mode can be any of {'infer', 'gzip', 'bz2', 'zip', 'xz', None}.
If the mode is 'infer' and path_or_buf is path-like, the compression is detected from the extensions '.gz', '.bz2', '.zip', or '.xz' (otherwise no compression is applied).
If a dict is given and the mode is 'zip' or is inferred as 'zip', the other entries are passed as additional compression options.
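To illustrate compression='infer', here is a minimal sketch that writes a pickle to a temporary directory with a '.gz' extension and reads it back. The file name and the small DataFrame are made up for illustration, and checking the first two bytes is just one simple way to confirm that gzip was actually used.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # The '.gz' suffix lets the default compression='infer' select gzip.
    path = os.path.join(tmp_dir, "frame.pkl.gz")
    df.to_pickle(path)

    with open(path, "rb") as fh:
        magic = fh.read(2)  # gzip streams begin with the bytes 1f 8b

    restored = pd.read_pickle(path)

print(magic == b"\x1f\x8b", restored.equals(df))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.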

SQL

Reading

pandas.read_sql(sql, con, index_col=None,
                coerce_float=True, params=None,
                parse_dates=None, columns=None,
                chunksize=None)

Parameters
sql: str or SQLAlchemy Selectable; the SQL query to execute or a table name.
con: SQLAlchemy connectable, str, or sqlite3 connection. With SQLAlchemy, any database supported by that library can be used.
index_col: the column to use as the index.
coerce_float: very useful; tries to convert non-string, non-numeric values (such as decimal.Decimal) to floating point.
parse_dates: converts a column of date strings to datetime data, much like pd.to_datetime. You can give just the names of the columns to convert with the default date parsing, or a dict of column names and date formats, such as {column_name: format string} (format string: "%Y:%m:%H:%M:%S").
columns: the columns to select. This is rarely needed, because the SQL statement usually names the columns already.
chunksize: if an integer is given, a generator is returned that yields that many rows at a time.
Two ways to create the database connection for the con parameter

1. Build the connection with sqlalchemy

import pandas as pd
import sqlalchemy
from sqlalchemy import create_engine

# Build the database engine with sqlalchemy.
# The values passed to format() are placeholders for your own settings.
connect_info = 'mysql+pymysql://{}:{}@{}:{}/{}?charset=utf8'.format(
    'username', 'password', 'localhost', 3306, 'dbname')
engine = create_engine(connect_info)
# SQL statement (my_table is a placeholder table name)
sql_cmd = "SELECT * FROM my_table"
df = pd.read_sql(sql=sql_cmd, con=engine)

2. Build the connection with DBAPI

import pandas as pd
import pymysql
# SQL statement (my_table is a placeholder table name)
sql_cmd = "SELECT * FROM my_table"
# Build the database connection with DBAPI; the values are placeholders
con = pymysql.connect(host='localhost', user='username',
                      password='password', database='dbname',
                      charset='utf8', use_unicode=True)
df = pd.read_sql(sql_cmd, con)

Reading a SQL query or database table into a DataFrame

This function is a convenience wrapper around read_sql_table and read_sql_query. It dispatches to the specific function depending on the input: a SQL query is routed to read_sql_query, while a database table name is routed to read_sql_table.

import pandas as pd
from sqlite3 import connect
conn = connect(':memory:')
df = pd.DataFrame(data=[[0, '10/11/12'],
                        [1, '12/11/10']],
                  columns=['int_column',
                           'date_column'])
df.to_sql('test_data', conn)
pd.read_sql('SELECT int_column, date_column FROM test_data', conn)

   int_column date_column
0           0    10/11/12
1           1    12/11/10

Apply date parsing to columns through the parse_dates parameter. By default the strings are read month first.

pd.read_sql('SELECT int_column, date_column FROM test_data',
            conn,
            parse_dates=["date_column"])

   int_column date_column
0           0  2012-10-11
1           1  2010-12-11

Apply a custom format when parsing the values of "date_column". With the day-first format below, '10/11/12' is read as 2012-11-10.

pd.read_sql('SELECT int_column, date_column FROM test_data',
            conn,
            parse_dates={"date_column": {"format": "%d/%m/%y"}})

Apply the dayfirst date parsing order to the values of "date_column".

pd.read_sql('SELECT int_column, date_column FROM test_data',
            conn,
            parse_dates={"date_column": {"dayfirst": True}})
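The parse_dates and chunksize parameters can be tried without any database server by using an in-memory SQLite database. The sketch below assumes the same test_data table as above; the variable names parsed and chunks are only for illustration.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"int_column": [0, 1],
              "date_column": ["10/11/12", "12/11/10"]}
             ).to_sql("test_data", conn, index=False)

query = "SELECT int_column, date_column FROM test_data"

# Day-first format: '10/11/12' is read as 10 November 2012.
parsed = pd.read_sql(query, conn,
                     parse_dates={"date_column": {"format": "%d/%m/%y"}})

# With chunksize, read_sql yields DataFrames of at most that many rows.
chunks = list(pd.read_sql(query, conn, chunksize=1))

conn.close()
print(parsed)
print(len(chunks))
```

Iterating over chunks keeps memory use bounded when a query returns more rows than comfortably fit in memory.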

Writing

DataFrame.to_sql(name, con, schema=None, if_exists='fail',
                 index=True, index_label=None, chunksize=None,
                 dtype=None, method=None)

Write the records stored in a DataFrame to a SQL database.

Databases supported by SQLAlchemy are supported. Tables can be newly created, appended to, or overwritten.

from sqlalchemy import create_engine, text
engine = create_engine('sqlite://', echo=False)
df = pd.DataFrame({'name' : ['User 1', 'User 2', 'User 3']})
df.to_sql('users', con=engine)
# engine.execute() was removed in SQLAlchemy 2.0; run queries on a connection
with engine.connect() as conn:
    conn.execute(text("SELECT * FROM users")).fetchall()

[(0, 'User 1'), (1, 'User 2'), (2, 'User 3')]
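The if_exists parameter in the signature above decides what happens when the target table is already there. Here is a small sketch of the three settings, using the standard-library sqlite3 driver instead of SQLAlchemy so that it runs without extra dependencies; the variable names are illustrative.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
users = pd.DataFrame({"name": ["User 1", "User 2", "User 3"]})

users.to_sql("users", conn, index=False)  # table does not exist yet: created

# 'append' adds rows to the existing table.
users.head(1).to_sql("users", conn, index=False, if_exists="append")
after_append = pd.read_sql("SELECT name FROM users", conn)

# 'replace' drops the table and writes the new rows.
users.head(2).to_sql("users", conn, index=False, if_exists="replace")
after_replace = pd.read_sql("SELECT name FROM users", conn)

# The default 'fail' raises ValueError when the table already exists.
try:
    users.to_sql("users", conn, index=False)
    raised = False
except ValueError:
    raised = True

conn.close()
print(len(after_append), len(after_replace), raised)
```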

Stata

pandas can also read and write the .dta statistics files exported by the Stata program.

Reading

Read a Stata file into a DataFrame.

pandas.read_stata(filepath_or_buffer, ...)

Parameters
filepath_or_buffer: str or Path; the file path.
columns: list-like, optional. The columns to retain; if None, all columns are returned.
convert_categoricals: bool, default True. Convert categorical columns to pd.Categorical.

df = pd.read_stata('filename.dta')

Writing

DataFrame.to_stata(path, ...)

Write the DataFrame to a Stata dataset file.

df = pd.DataFrame({'animal': ['falcon', 'parrot',
                              'falcon', 'parrot'],
                   'speed': [350, 18, 361, 15]})
df.to_stata('animals.dta')

Often, when a file is read with pandas.read_stata(), Chinese text comes out garbled. When that happens, the approach below can help: load_large_dta reads the Stata file in chunks, and decode_str re-decodes the Chinese strings.

def load_large_dta(fname):
    """Read a Stata file in chunks of 100,000 rows."""
    import sys
    reader = pd.read_stata(fname, iterator=True)
    chunks = []
    try:
        chunk = reader.get_chunk(100 * 1000)
        while len(chunk) > 0:
            chunks.append(chunk)
            chunk = reader.get_chunk(100 * 1000)
            print('.', end='')
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        pass
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
    print('\nloaded {} rows'.format(len(df)))
    return df

def decode_str(string):
    """
    Re-decode a string from a dta file that was read as latin-1
    but actually holds UTF-8 bytes, to repair garbled text.
    """
    return string.encode('latin-1').decode('utf-8')

Markdown

DataFrame.to_markdown(buf=None, mode='wt', index=True,
                      storage_options=None, **kwargs)

Print the DataFrame in Markdown format. This method requires the tabulate package.

>>> s = pd.Series(["elk", "pig", "dog", "quetzal"], name="animal")
>>> print(s.to_markdown())
|    | animal   |
|---:|:---------|
|  0 | elk      |
|  1 | pig      |
|  2 | dog      |
|  3 | quetzal  |

# Output using a tabulate option
>>> print(s.to_markdown(tablefmt="grid"))
+----+----------+
|    | animal   |
+====+==========+
|  0 | elk      |
+----+----------+
|  1 | pig      |
+----+----------+
|  2 | dog      |
+----+----------+
|  3 | quetzal  |
+----+----------+

