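This article picks up where the previous one, "It Took Me Days: Ten Minutes to pandas (Part 1)", left off. The source code, data, and PDF for this series are available through the method described at the end of the article. If you find it useful, likes and shares are appreciated.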

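Merging

Concat

pandas provides many ways to combine Series, DataFrame, and Panel objects according to different kinds of set logic. (Panel has since been removed from pandas; Series and DataFrame are the objects you will combine in practice.)

Use **concat()** to concatenate pandas objects:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4))
df

# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces

pd.concat(pieces)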
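As a small deterministic illustration of the same call, the sketch below concatenates two made-up frames (`a` and `b` are hypothetical names, not part of the article's data). It shows that row labels are kept as-is by default and are renumbered only when `ignore_index=True` is passed.

```python
import pandas as pd

# Two hypothetical frames with the same columns
a = pd.DataFrame({"x": [1, 2], "y": [10, 20]})
b = pd.DataFrame({"x": [3], "y": [30]})

# Default: the original row labels are kept, so label 0 appears twice
stacked = pd.concat([a, b])

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Renumbering is usually what you want when the original labels carry no meaning, as with the default integer index used in this section.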

Join

SQL-style joins are done with merge():

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

	key	lval
0	foo	1
1	foo	2

right = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [4, 5]})
right

	key	lval
0	foo	4
1	foo	5

pd.merge(left, right, on='key')

	key	lval_x	lval_y
0	foo	1	4
1	foo	1	5
2	foo	2	4
3	foo	2	5
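The result above has four rows because every 'foo' row on the left matches every 'foo' row on the right. For contrast, here is a minimal sketch with unique keys, using hypothetical frames `left_u` and `right_u`, that also shows how the `how` argument changes which keys survive.

```python
import pandas as pd

# Hypothetical frames whose keys are unique within each frame
left_u = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_u = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# Default (inner) join: only keys present in both frames
inner = pd.merge(left_u, right_u, on="key")

# Left join: every key from the left frame, NaN where the right has no match
left_join = pd.merge(left_u, right_u, on="key", how="left")

print(inner)
print(left_join)
```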

Append

Append rows to the end of a DataFrame:

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df

s = df.iloc[3]
s

A    0.163904
B    1.324567
C   -0.768324
D   -0.205520
Name: 3, dtype: float64

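# DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result
pd.concat([df, s.to_frame().T], ignore_index=True)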

Grouping

By "group by" we mean a process involving one or more of the following steps:

Splitting: dividing the data into groups based on some criteria
Applying: running a function on each group independently
Combining: collecting the results into a single data structure

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

Group by column A and apply sum() to each group:

# select the numeric columns explicitly; recent pandas would otherwise also "sum" the strings in B
df.groupby('A')[['C', 'D']].sum()

	C	D
A
bar	-0.565344	1.886637
foo	2.226542	2.122855
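To make the three steps concrete, the sketch below performs them by hand on a tiny made-up frame (`df_demo` is a hypothetical name) and compares the result with the one-line groupby. It is only an illustration of the idea, not how pandas implements it internally.

```python
import pandas as pd

# Small hypothetical dataset
df_demo = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                        "C": [1.0, 2.0, 3.0, 4.0]})

# Split: one sub-frame per distinct value of A
groups = {name: part for name, part in df_demo.groupby("A")}

# Apply: run a function on each group independently
sums = {name: part["C"].sum() for name, part in groups.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(sums, name="C").sort_index()

# The one-liner performs all three steps at once
builtin = df_demo.groupby("A")["C"].sum()

print(manual)
print(builtin)
```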

Grouping by multiple columns forms a hierarchical index, and the function is then applied to each group:

df.groupby(['A', 'B']).sum()

Reshaping

Stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df2 = df[:4]
df2

The **stack()** method "compresses" a level in the DataFrame's columns:

stacked = df2.stack()
stacked

For a "stacked" DataFrame or Series (one with a MultiIndex as its index), the inverse of stack() is unstack(), which by default unstacks the last level:

stacked.unstack()

stacked.unstack(1)

stacked.unstack(0)
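Because the frames above hold random numbers, the effect of each call can be hard to follow. The sketch below repeats the round trip on a tiny frame with fixed values (`small` and the other variable names are made up for illustration).

```python
import pandas as pd

# Hypothetical 2x2 frame with a two-level row index
idx = pd.MultiIndex.from_tuples([("bar", "one"), ("bar", "two")],
                                names=["first", "second"])
small = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=idx)

# stack(): the column labels A/B become a third, innermost index level
stacked_small = small.stack()

# unstack() with no argument moves the last level back into the columns
restored = stacked_small.unstack()

# unstack(1) moves the "second" level into the columns instead
by_second = stacked_small.unstack(1)

print(stacked_small)
print(by_second)
```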

Pivot tables

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df

We can easily produce a pivot table from this data:

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
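When several rows fall into the same cell, pivot_table aggregates them, using the mean unless told otherwise. The sketch below uses a made-up `sales` frame (the column names are hypothetical) to show the default next to an explicit `aggfunc="sum"`.

```python
import pandas as pd

# Hypothetical sales records; east/a and west/a each appear twice
sales = pd.DataFrame({"region": ["east", "east", "west", "west", "east"],
                      "product": ["a", "b", "a", "a", "a"],
                      "amount": [10, 20, 30, 50, 30]})

# Rows = region, columns = product, cells = mean amount (the default aggfunc)
mean_table = pd.pivot_table(sales, values="amount",
                            index="region", columns="product")

# aggfunc="sum" totals each cell instead of averaging it
sum_table = pd.pivot_table(sales, values="amount",
                           index="region", columns="product", aggfunc="sum")

print(mean_table)
print(sum_table)
```

Combinations that never occur in the data, such as west/b here, show up as NaN.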

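Time series

pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting secondly data into 5-minute data). This is very common in financial applications, but not limited to them.

# 'S' means one-second frequency (newer pandas releases prefer the lowercase alias 's')
rng = pd.date_range('1/1/2012', periods=100, freq='S')
# look at the first three entries of the DatetimeIndex
rng[0:3]

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
# look at the first three values of the Series
ts[0:3]

ts.resample('5Min').sum()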

2012-01-01    26203
Freq: 5T, dtype: int32
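The output above has a single row because 100 seconds fit inside one 5-minute bin. As a sketch of what happens with more data, the hypothetical series below holds one reading per second for ten minutes, each equal to 1, so every bin total is easy to check by hand. The lowercase aliases "s" and "5min" are the spellings newer pandas releases prefer.

```python
import pandas as pd

# Hypothetical series: one reading per second for 10 minutes, each equal to 1
idx = pd.date_range("2012-01-01", periods=600, freq="s")
readings = pd.Series(1, index=idx)

# Downsample to 5-minute bins; each bin sums the 300 readings it covers
per_5min = readings.resample("5min").sum()

print(per_5min)
```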

Time zone representation

rng = pd.date_range('3/6/2012', periods=5, freq='D')
rng

DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')

ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-03-06    0.523781
2012-03-07   -0.670822
2012-03-08    0.934826
2012-03-09    0.002239
2012-03-10   -0.091952
Freq: D, dtype: float64

ts_utc = ts.tz_localize('UTC')
ts_utc

2012-03-06 00:00:00+00:00    0.523781
2012-03-07 00:00:00+00:00   -0.670822
2012-03-08 00:00:00+00:00    0.934826
2012-03-09 00:00:00+00:00    0.002239
2012-03-10 00:00:00+00:00   -0.091952
Freq: D, dtype: float64

Time zone conversion

ts_utc.tz_convert('US/Eastern')

2012-03-05 19:00:00-05:00    0.523781
2012-03-06 19:00:00-05:00   -0.670822
2012-03-07 19:00:00-05:00    0.934826
2012-03-08 19:00:00-05:00    0.002239
2012-03-09 19:00:00-05:00   -0.091952
Freq: D, dtype: float64
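The difference between the two calls is easy to miss: tz_localize attaches a zone to naive timestamps, while tz_convert re-expresses the same instants in another zone. A minimal sketch on a single timestamp (the variable names are illustrative):

```python
import pandas as pd

# A naive timestamp has no time zone attached
naive = pd.Timestamp("2012-03-06 00:00")

# tz_localize attaches a zone without changing the wall-clock reading
utc = naive.tz_localize("UTC")

# tz_convert keeps the same instant but shows it on another zone's clock
eastern = utc.tz_convert("US/Eastern")

print(utc)
print(eastern)
```

Both values compare equal because they refer to the same moment; only the displayed clock time differs.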

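Converting between time span representations

# 'M' means month-end frequency (newer pandas releases spell this alias 'ME' in date_range)
rng = pd.date_range('1/1/2012', periods=5, freq='M')
rng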

DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
               '2012-05-31'],
              dtype='datetime64[ns]', freq='M')

ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-01-31    1.296132
2012-02-29    1.023936
2012-03-31   -0.249774
2012-04-30    1.007810
2012-05-31   -0.051413
Freq: M, dtype: float64

ps = ts.to_period()
ps

2012-01    1.296132
2012-02    1.023936
2012-03   -0.249774
2012-04    1.007810
2012-05   -0.051413
Freq: M, dtype: float64

ps.to_timestamp()

2012-01-01    1.296132
2012-02-01    1.023936
2012-03-01   -0.249774
2012-04-01    1.007810
2012-05-01   -0.051413
Freq: MS, dtype: float64
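Note how the round trip above moved each label from the end of the month to the start: a period stands for the whole month, and to_timestamp() maps it to the period's start by default. The sketch below repeats this with fixed values (`monthly` and the other names are hypothetical).

```python
import pandas as pd

# Hypothetical month-end observations
stamps = pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31"])
monthly = pd.Series([1.0, 2.0, 3.0], index=stamps)

# Timestamps -> monthly periods: each label now stands for the whole month
as_periods = monthly.to_period("M")

# Periods -> timestamps: by default each period maps to its start
as_stamps = as_periods.to_timestamp()

print(as_periods.index)
print(as_stamps.index)
```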

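Converting between periods and timestamps makes some convenient arithmetic possible. In the example below, quarterly data whose fiscal year ends in November is re-indexed to 9 a.m. on the first day of the last month of each quarter:

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng

ts = pd.Series(np.random.randn(len(prng)), index=prng)
# look at the first three rows
ts[0:3]

# last month of each quarter -> first hour of that month -> plus 9 hours
ts.index = prng.asfreq('M', 'end').asfreq('H', 'start') + 9
# look at the first three rows
ts[0:3]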
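The chained asfreq calls are easier to follow on a single period. The sketch below walks through the same steps for the first quarter only. The intermediate variable names are made up, and it uses the lowercase "h" alias that newer pandas releases prefer over 'H'.

```python
import pandas as pd

# First quarter of fiscal year 1990, where the fiscal year ends in November,
# so this quarter covers December 1989 through February 1990
q = pd.Period("1990Q1", freq="Q-NOV")

last_month = q.asfreq("M", "end")             # last month of the quarter
first_hour = last_month.asfreq("h", "start")  # first hour of that month
nine_am = first_hour + 9                      # nine hours later

print(last_month, first_hour, nine_am)
```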

Categoricals

Since version 0.15, pandas can include categorical data in a DataFrame.

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'e', 'e']})
df

Convert raw_grade to a categorical data type:

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    e
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

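Rename the categories to more meaningful names:

# assigning to .cat.categories no longer works in pandas 2.0+; rename_categories() is the supported way
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

Reorder the categories and add the missing ones at the same time:

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]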

0    very good
1         good
2         good
3    very good
4     very bad
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

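Sorting follows the order of the categories, not lexical order:

df.sort_values(by="grade")

Grouping by a categorical column also shows empty categories:

# observed=False keeps empty categories regardless of the pandas version's default
df.groupby("grade", observed=False).size()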

grade
very bad     2
bad          0
medium       0
good         2
very good    2
dtype: int64
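To see the difference between category order and lexical order side by side, here is a small sketch with made-up shirt sizes (the `sizes` series is hypothetical). The list passed to set_categories defines the order used when sorting.

```python
import pandas as pd

# Hypothetical shirt sizes; "XL" is a valid category that never occurs
sizes = pd.Series(["M", "L", "S", "M"]).astype("category")
sizes = sizes.cat.set_categories(["S", "M", "L", "XL"])

# Sorting the categorical follows the category order S < M < L < XL
by_category = sizes.sort_values().tolist()

# Sorting the plain strings follows lexical order instead
by_string = sizes.astype(str).sort_values().tolist()

# Counting per category also reports the unused "XL"
counts = sizes.value_counts(sort=False)

print(by_category)
print(by_string)
print(counts)
```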

Plotting

import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

For a DataFrame, **plot()** is a convenient way to plot all of the columns with labels:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

Getting data in/out

CSV

Write to a CSV file:

df.to_csv('data/foo.csv')

Read from a CSV file:

df1 = pd.read_csv('data/foo.csv')
# look at the first three rows
df1.head(3)
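One detail worth knowing: to_csv() writes the index as the first column, and read_csv() does not turn it back into an index unless asked. The sketch below shows this with a made-up frame and an in-memory buffer instead of a file on disk, so the names `frame`, `plain`, and `restored` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical frame with a meaningful (non-default) index
frame = pd.DataFrame({"A": [1.5, 2.5]}, index=["r1", "r2"])

# With no path argument, to_csv() returns the CSV text as a string
csv_text = frame.to_csv()

# Without index_col, the saved index comes back as an ordinary column
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores it as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(restored.index.tolist())
```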

HDF5

Reading and writing HDFStores.

Write to an HDF5 store:

df.to_hdf('data/foo.h5', key='df')

Read from an HDF5 store:

df1 = pd.read_hdf('data/foo.h5', key='df')
# look at the first three rows
df1.head(3)

Excel

Reading and writing MS Excel files.

Write to an Excel file:

df.to_excel('data/foo.xlsx', sheet_name='Sheet1')

Read from an Excel file:

df1 = pd.read_excel('data/foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
# look at the first three rows
df1.head(3)

See you next time!

If you need all of the code and data from this article, scan the QR code below to add me on WeChat and reply with: 10pandas
