pandas in Twenty Minutes (message me if you get stuck!)


#coding:utf8
%matplotlib inline

This is a short introduction to pandas for newcomers. For more complex recipes, see the Cookbook.

We usually start by importing the following libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Object creation

Create a Series by passing a list; pandas creates a default integer index:

s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

	A	B	C	D
2013-01-01	0.194873	0.298287	0.073043	-0.681957
2013-01-02	-0.679429	0.397972	0.887388	1.187169
2013-01-03	0.782244	-0.251828	-0.736243	1.752973
2013-01-04	-1.877066	-0.060967	-2.103769	-1.634272
2013-01-05	0.405459	-1.151989	-1.309286	1.023221
2013-01-06	-0.445221	-1.520998	0.717376	1.657476

Create a DataFrame by passing a dict of objects that can be converted to something Series-like:

df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3]*4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2

	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

The columns of the resulting DataFrame have these dtypes:

df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing data

View the first and last rows of the frame:

df.head()

	A	B	C	D
2013-01-01	0.194873	0.298287	0.073043	-0.681957
2013-01-02	-0.679429	0.397972	0.887388	1.187169
2013-01-03	0.782244	-0.251828	-0.736243	1.752973
2013-01-04	-1.877066	-0.060967	-2.103769	-1.634272
2013-01-05	0.405459	-1.151989	-1.309286	1.023221

df.tail(3)

	A	B	C	D
2013-01-04	-1.877066	-0.060967	-2.103769	-1.634272
2013-01-05	0.405459	-1.151989	-1.309286	1.023221
2013-01-06	-0.445221	-1.520998	0.717376	1.657476

Display the index, the columns, and the underlying NumPy data:

df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

df.values

array([[ 0.19487255,  0.29828663,  0.07304296, -0.68195723],
       [-0.67942918,  0.39797196,  0.88738797,  1.18716883],
       [ 0.78224424, -0.25182784, -0.73624252,  1.75297344],
       [-1.8770662 , -0.06096652, -2.10376905, -1.63427152],
       [ 0.40545892, -1.15198867, -1.30928606,  1.02322057],
       [-0.44522115, -1.52099782,  0.71737636,  1.65747636]])

describe() shows a quick statistical summary of the data:

df.describe()

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	-0.269857	-0.381587	-0.411915	0.550768
std	0.955046	0.785029	1.180805	1.385087
min	-1.877066	-1.520998	-2.103769	-1.634272
25%	-0.620877	-0.926948	-1.166025	-0.255663
50%	-0.125174	-0.156397	-0.331600	1.105195
75%	0.352812	0.208473	0.556293	1.539899
max	0.782244	0.397972	0.887388	1.752973

Transpose the data:

df.T

	2013-01-01	2013-01-02	2013-01-03	2013-01-04	2013-01-05	2013-01-06
A	0.194873	-0.679429	0.782244	-1.877066	0.405459	-0.445221
B	0.298287	0.397972	-0.251828	-0.060967	-1.151989	-1.520998
C	0.073043	0.887388	-0.736243	-2.103769	-1.309286	0.717376
D	-0.681957	1.187169	1.752973	-1.634272	1.023221	1.657476

Sort by an axis (axis=1 sorts by the column labels; axis=0 sorts by the row labels):

df.sort_index(axis=1, ascending=False)

	D	C	B	A
2013-01-01	-0.681957	0.073043	0.298287	0.194873
2013-01-02	1.187169	0.887388	0.397972	-0.679429
2013-01-03	1.752973	-0.736243	-0.251828	0.782244
2013-01-04	-1.634272	-2.103769	-0.060967	-1.877066
2013-01-05	1.023221	-1.309286	-1.151989	0.405459
2013-01-06	1.657476	0.717376	-1.520998	-0.445221

Sort by values:

df.sort_values(by='D')

	A	B	C	D
2013-01-04	-1.877066	-0.060967	-2.103769	-1.634272
2013-01-01	0.194873	0.298287	0.073043	-0.681957
2013-01-05	0.405459	-1.151989	-1.309286	1.023221
2013-01-02	-0.679429	0.397972	0.887388	1.187169
2013-01-06	-0.445221	-1.520998	0.717376	1.657476
2013-01-03	0.782244	-0.251828	-0.736243	1.752973

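Selection

Note: standard Python/NumPy expressions for selecting and setting do work, but we still recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc. (Older tutorials also list .ix, which was deprecated in pandas 0.20 and removed in 1.0.)

Getting

Selecting a single column yields a Series, equivalent to df.A: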

df['A']

2013-01-01    0.194873
2013-01-02   -0.679429
2013-01-03    0.782244
2013-01-04   -1.877066
2013-01-05    0.405459
2013-01-06   -0.445221
Freq: D, Name: A, dtype: float64

Slice rows with []:

df[0:4]

	A	B	C	D
2013-01-01	0.194873	0.298287	0.073043	-0.681957
2013-01-02	-0.679429	0.397972	0.887388	1.187169
2013-01-03	0.782244	-0.251828	-0.736243	1.752973
2013-01-04	-1.877066	-0.060967	-2.103769	-1.634272

df['20130102':'20130104']

	A	B	C	D
2013-01-02	-0.679429	0.397972	0.887388	1.187169
2013-01-03	0.782244	-0.251828	-0.736243	1.752973
2013-01-04	-1.877066	-0.060967	-2.103769	-1.634272

Selection by label

Get a cross section using a label:

dates[0]

Timestamp('2013-01-01 00:00:00', freq='D')

df.loc[dates[0]]

A    0.194873
B    0.298287
C    0.073043
D   -0.681957
Name: 2013-01-01 00:00:00, dtype: float64

Select on multiple axes by label:

df.loc[:,['A','B']]

	A	B
2013-01-01	0.194873	0.298287
2013-01-02	-0.679429	0.397972
2013-01-03	0.782244	-0.251828
2013-01-04	-1.877066	-0.060967
2013-01-05	0.405459	-1.151989
2013-01-06	-0.445221	-1.520998

df.loc[:,['A','B']][:3]

	A	B
2013-01-01	0.194873	0.298287
2013-01-02	-0.679429	0.397972
2013-01-03	0.782244	-0.251828

Slice by label; both endpoints are included:

df.loc['20130102':'20130104',['A','B']]

	A	B
2013-01-02	-0.679429	0.397972
2013-01-03	0.782244	-0.251828
2013-01-04	-1.877066	-0.060967

Reduce the dimensions of the returned object:

df.loc['20130102',['A','B']]

A   -0.679429
B    0.397972
Name: 2013-01-02 00:00:00, dtype: float64

Get a scalar value:

df.loc[dates[0],'A']

0.19487255317338711

Get fast access to a scalar (equivalent to the previous method):

df.at[dates[0],'A']

0.19487255317338711

Selection by position

Select via the position of the passed integer:

df.iloc[3]

A   -1.877066
B   -0.060967
C   -2.103769
D   -1.634272
Name: 2013-01-04 00:00:00, dtype: float64

Select by integer slices, which behave like Python/NumPy slices:

df.iloc[3:5,0:2]

	A	B
2013-01-04	-1.877066	-0.060967
2013-01-05	0.405459	-1.151989

Slice rows only:

df.iloc[1:3,:]

	A	B	C	D
2013-01-02	-0.679429	0.397972	0.887388	1.187169
2013-01-03	0.782244	-0.251828	-0.736243	1.752973

Slice columns only:

df.iloc[:,1:3]

	B	C
2013-01-01	0.298287	0.073043
2013-01-02	0.397972	0.887388
2013-01-03	-0.251828	-0.736243
2013-01-04	-0.060967	-2.103769
2013-01-05	-1.151989	-1.309286
2013-01-06	-1.520998	0.717376

Get a single value:

df.iloc[1,1]

0.39797195640479976

Get fast access to a single value (equivalent to the previous method):

df.iat[1,1]

0.39797195640479976

Boolean indexing

Use a single column's values to select data:

df[df.A > 0]

	A	B	C	D
2013-01-01	0.194873	0.298287	0.073043	-0.681957
2013-01-03	0.782244	-0.251828	-0.736243	1.752973
2013-01-05	0.405459	-1.151989	-1.309286	1.023221

Use a where operation to select data; values that fail the condition become NaN:

df[df > 0]

	A	B	C	D
2013-01-01	0.194873	0.298287	0.073043	NaN
2013-01-02	NaN	0.397972	0.887388	1.187169
2013-01-03	0.782244	NaN	NaN	1.752973
2013-01-04	NaN	NaN	NaN	NaN
2013-01-05	0.405459	NaN	NaN	1.023221
2013-01-06	NaN	NaN	0.717376	1.657476
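
A minimal sketch of what the masking above does, on a small hand-made frame (the name `demo` and its values are made up for illustration, they are not the article's random data). Indexing with a boolean DataFrame keeps values where the condition holds and puts NaN everywhere else, which should match calling `DataFrame.where` directly:

```python
import numpy as np
import pandas as pd

# Hypothetical demo data, chosen so the result is easy to check by eye.
demo = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

masked = demo[demo > 0]        # boolean-frame indexing
same = demo.where(demo > 0)    # explicit where() call on the same condition

print(masked)
print(masked.equals(same))
```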

Use the **isin()** method to filter data:

df2 = df.copy()

df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

	A	B	C	D	E
2013-01-01	0.194873	0.298287	0.073043	-0.681957	one
2013-01-02	-0.679429	0.397972	0.887388	1.187169	one
2013-01-03	0.782244	-0.251828	-0.736243	1.752973	two
2013-01-04	-1.877066	-0.060967	-2.103769	-1.634272	three
2013-01-05	0.405459	-1.151989	-1.309286	1.023221	four
2013-01-06	-0.445221	-1.520998	0.717376	1.657476	three

df2[df2['E'].isin(['two', 'four'])]

	A	B	C	D	E
2013-01-03	0.782244	-0.251828	-0.736243	1.752973	two
2013-01-05	0.405459	-1.151989	-1.309286	1.023221	four

Setting

Setting a new column automatically aligns the data by index:

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102',periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

df['F'] = s1
df

	A	B	C	D	F
2013-01-01	0.194873	0.298287	0.073043	-0.681957	NaN
2013-01-02	-0.679429	0.397972	0.887388	1.187169	1.0
2013-01-03	0.782244	-0.251828	-0.736243	1.752973	2.0
2013-01-04	-1.877066	-0.060967	-2.103769	-1.634272	3.0
2013-01-05	0.405459	-1.151989	-1.309286	1.023221	4.0
2013-01-06	-0.445221	-1.520998	0.717376	1.657476	5.0
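
To make the alignment easier to see, here is a small sketch with fixed numbers (the names `frame` and `extra` are made up for this example). The Series starts one day later than the frame, so the first row gets NaN and the Series value for the extra day is simply not used:

```python
import pandas as pd

# Hypothetical demo data: the frame covers Jan 1-3, the Series covers Jan 2-4.
frame = pd.DataFrame({"A": [10, 20, 30]},
                     index=pd.date_range("2013-01-01", periods=3))
extra = pd.Series([1, 2, 3], index=pd.date_range("2013-01-02", periods=3))

# Values are matched by index label, not by position.
frame["F"] = extra

print(frame)
```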

Set values by label:

df.at[dates[0], 'A'] = 0
df

	A	B	C	D	F
2013-01-01	0.000000	0.298287	0.073043	-0.681957	NaN
2013-01-02	-0.679429	0.397972	0.887388	1.187169	1.0
2013-01-03	0.782244	-0.251828	-0.736243	1.752973	2.0
2013-01-04	-1.877066	-0.060967	-2.103769	-1.634272	3.0
2013-01-05	0.405459	-1.151989	-1.309286	1.023221	4.0
2013-01-06	-0.445221	-1.520998	0.717376	1.657476	5.0

Set values by position:

df.iat[0,1] = 8888
df

	A	B	C	D	F
2013-01-01	0.000000	8888.000000	0.073043	-0.681957	NaN
2013-01-02	-0.679429	0.397972	0.887388	1.187169	1.0
2013-01-03	0.782244	-0.251828	-0.736243	1.752973	2.0
2013-01-04	-1.877066	-0.060967	-2.103769	-1.634272	3.0
2013-01-05	0.405459	-1.151989	-1.309286	1.023221	4.0
2013-01-06	-0.445221	-1.520998	0.717376	1.657476	5.0

Set values by assigning a NumPy array:

df.loc[:,'D'] = np.array([5] * len(df))
df

	A	B	C	D	F
2013-01-01	0.000000	8888.000000	0.073043	5	NaN
2013-01-02	-0.679429	0.397972	0.887388	5	1.0
2013-01-03	0.782244	-0.251828	-0.736243	5	2.0
2013-01-04	-1.877066	-0.060967	-2.103769	5	3.0
2013-01-05	0.405459	-1.151989	-1.309286	5	4.0
2013-01-06	-0.445221	-1.520998	0.717376	5	5.0

Set values with a where operation:

df2 = df.copy()
df2[df2 > 0] = -df2
df2

	A	B	C	D	F
2013-01-01	0.000000	-8888.000000	-0.073043	-5	NaN
2013-01-02	-0.679429	-0.397972	-0.887388	-5	-1.0
2013-01-03	-0.782244	-0.251828	-0.736243	-5	-2.0
2013-01-04	-1.877066	-0.060967	-2.103769	-5	-3.0
2013-01-05	-0.405459	-1.151989	-1.309286	-5	-4.0
2013-01-06	-0.445221	-1.520998	-0.717376	-5	-5.0

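Missing data

pandas primarily uses np.nan to represent missing data. By default, these values are excluded from computations.

reindex() lets you change, add, or delete the index on a specified axis, and returns a copy of the data.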

dates[0]

Timestamp('2013-01-01 00:00:00', freq='D')

df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

	A	B	C	D	F	E
2013-01-01	0.000000	8888.000000	0.073043	5	NaN	1.0
2013-01-02	-0.679429	0.397972	0.887388	5	1.0	1.0
2013-01-03	0.782244	-0.251828	-0.736243	5	2.0	NaN
2013-01-04	-1.877066	-0.060967	-2.103769	5	3.0	NaN

Drop any rows that have missing data:

df1.dropna(how='any')  # how='any' drops a row if any of its values is missing, so only the complete row remains

	A	B	C	D	F	E
2013-01-02	-0.679429	0.397972	0.887388	5	1.0	1.0

Fill missing data:

df1.fillna(value=5)

	A	B	C	D	F	E
2013-01-01	0.000000	8888.000000	0.073043	5	5.0	1.0
2013-01-02	-0.679429	0.397972	0.887388	5	1.0	1.0
2013-01-03	0.782244	-0.251828	-0.736243	5	2.0	5.0
2013-01-04	-1.877066	-0.060967	-2.103769	5	3.0	5.0

Get a boolean mask that marks where values are NaN:

pd.isnull(df1)

	A	B	C	D	F	E
2013-01-01	False	False	False	False	True	False
2013-01-02	False	False	False	False	False	False
2013-01-03	False	False	False	False	False	True
2013-01-04	False	False	False	False	False	True
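
A short sketch of the two points made in this section, on made-up data (the frame `demo` is hypothetical): how the `how` argument of `dropna` changes which rows survive, and how NaN is skipped by default in a computation such as `mean`.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data: row 0 is complete, row 1 is partly missing,
# row 2 is entirely missing.
demo = pd.DataFrame({"x": [1.0, np.nan, np.nan],
                     "y": [4.0, 5.0, np.nan]})

any_dropped = demo.dropna(how="any")   # keeps only rows with no NaN at all
all_dropped = demo.dropna(how="all")   # drops only rows that are entirely NaN

print(len(any_dropped), len(all_dropped))
print(demo["y"].mean())                # NaN is skipped: (4 + 5) / 2
print(demo["y"].mean(skipna=False))    # with skipna=False the NaN propagates
```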

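Operations

Stats

Operations in general exclude missing data.

Perform a descriptive statistic:

df.mean()  # axis defaults to 0: one mean per column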

A      -0.302336
B    1480.902032
C      -0.411915
D       5.000000
F       3.000000
dtype: float64

Perform the same operation on the other axis:

df.mean(1)  # axis=1: one mean per row

2013-01-01    2223.268261
2013-01-02       1.321186
2013-01-03       1.358835
2013-01-04       0.791640
2013-01-05       1.388837
2013-01-06       1.750231
Freq: D, dtype: float64

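Operating on objects with different dimensionality requires alignment. In addition, pandas automatically broadcasts along the specified dimension.

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
# shift(2) moves every value down two positions, so the first two entries become NaN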
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

df

	A	B	C	D	F
2013-01-01	0.000000	8888.000000	0.073043	5	NaN
2013-01-02	-0.679429	0.397972	0.887388	5	1.0
2013-01-03	0.782244	-0.251828	-0.736243	5	2.0
2013-01-04	-1.877066	-0.060967	-2.103769	5	3.0
2013-01-05	0.405459	-1.151989	-1.309286	5	4.0
2013-01-06	-0.445221	-1.520998	0.717376	5	5.0

df.sub(s, axis='index')
# Subtract s from df element by element: each value in df minus the value of s at the same row label.

	A	B	C	D	F
2013-01-01	NaN	NaN	NaN	NaN	NaN
2013-01-02	NaN	NaN	NaN	NaN	NaN
2013-01-03	-0.217756	-1.251828	-1.736243	4.0	1.0
2013-01-04	-4.877066	-3.060967	-5.103769	2.0	0.0
2013-01-05	-4.594541	-6.151989	-6.309286	0.0	-1.0
2013-01-06	NaN	NaN	NaN	NaN	NaN
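
The same broadcasting can be checked by hand on a tiny example. This is only a sketch; `frame` and `offsets` are made-up names and values. With `axis='index'` the Series is matched against the row labels and then subtracted from every column, and a NaN in the Series turns the whole row into NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical demo data with string row labels.
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]},
                     index=["r0", "r1", "r2"])
offsets = pd.Series([1.0, np.nan, 3.0], index=["r0", "r1", "r2"])

# Match offsets to rows by label, then broadcast across all columns.
result = frame.sub(offsets, axis="index")

print(result)
```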

Apply

Apply functions to the data with apply():

df.apply(np.cumsum)
# df.apply(np.cumsum) applies np.cumsum to each column of the DataFrame.
# np.cumsum computes a running total, adding up the elements of each column in order.

	A	B	C	D	F
2013-01-01	0.000000	8888.000000	0.073043	5	NaN
2013-01-02	-0.679429	8888.397972	0.960431	10	1.0
2013-01-03	0.102815	8888.146144	0.224188	15	3.0
2013-01-04	-1.774251	8888.085178	-1.879581	20	6.0
2013-01-05	-1.368792	8886.933189	-3.188867	25	10.0
2013-01-06	-1.814013	8885.412191	-2.471490	30	15.0

df.apply(lambda x:x.max()-x.min())
# The lambda is called once per column and returns that column's maximum minus its minimum.

A       2.659310
B    8889.520998
C       2.991157
D       0.000000
F       4.000000
dtype: float64
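
By default `apply` works column by column. A small sketch with fixed numbers shows the difference between the default and `axis=1` (the helper `value_range` and the frame are made up for this example):

```python
import pandas as pd

# Hypothetical demo data.
frame = pd.DataFrame({"A": [1, 5, 3], "B": [10, 40, 20]})

def value_range(values):
    """Return max - min of the Series passed in by apply()."""
    return values.max() - values.min()

per_column = frame.apply(value_range)        # axis=0 (default): one result per column
per_row = frame.apply(value_range, axis=1)   # axis=1: one result per row

print(per_column)
print(per_row)
```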

Histogramming (value counts)

s = pd.Series(np.random.randint(0, 7, size=10))
s

0    2
1    5
2    1
3    3
4    2
5    0
6    0
7    1
8    4
9    1
dtype: int32

s.value_counts()

1    3
2    2
0    2
5    1
3    1
4    1
dtype: int64
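
Because the Series above is random, here is a reproducible sketch on made-up data. It also shows the `normalize=True` option, which returns shares instead of counts:

```python
import pandas as pd

# Hypothetical demo data with known frequencies: a x3, b x2, c x1.
grades = pd.Series(["a", "b", "a", "c", "a", "b"])

counts = grades.value_counts()                 # sorted by count, largest first
shares = grades.value_counts(normalize=True)   # relative frequencies

print(counts)
print(shares)
```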

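String methods

Series is equipped with a set of string processing methods in its str attribute that make it easy to operate on each element of the array, as in the code below. Note that pattern matching in str generally uses regular expressions by default.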

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
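
The note about regular expressions matters in practice. As a small sketch on made-up strings, `str.contains` treats `.` as "any character" unless `regex=False` is passed; `na=False` is used here so that the missing entry simply counts as no match:

```python
import numpy as np
import pandas as pd

# Hypothetical demo data.
words = pd.Series(["A.b", "Axb", np.nan, "dog"])

# Default: the pattern is a regular expression, so '.' matches any character.
as_regex = words.str.contains("A.b", na=False)

# regex=False: the pattern is a plain substring, so '.' must be a literal dot.
as_literal = words.str.contains("A.b", regex=False, na=False)

print(list(as_regex))
print(list(as_literal))
```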

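Merge

Concat

pandas provides various facilities for easily combining Series and DataFrame objects with various kinds of set logic. (Older versions of pandas also had a Panel object, which has since been removed.)

Concatenate pandas objects with **concat()**: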

df = pd.DataFrame(np.random.randn(10,4))
df

	0	1	2	3
0	0.639168	0.881631	0.198015	0.393510
1	0.503544	0.155373	1.218277	0.128893
2	-0.661486	1.365067	-0.010755	0.058110
3	0.185698	-0.750695	-0.637134	-1.811947
4	-0.493348	-0.246197	-0.700524	0.692042
5	-2.280015	0.986806	1.297614	-0.749969
6	0.688663	0.088751	-0.164766	-0.165378
7	-0.382894	-0.157371	0.000836	-1.947379
8	-1.618486	0.804667	1.919125	-0.290719
9	0.392898	-0.264556	0.817233	0.680797

#break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces

[          0         1         2         3
 0  0.639168  0.881631  0.198015  0.393510
 1  0.503544  0.155373  1.218277  0.128893
 2 -0.661486  1.365067 -0.010755  0.058110,
           0         1         2         3
 3  0.185698 -0.750695 -0.637134 -1.811947
 4 -0.493348 -0.246197 -0.700524  0.692042
 5 -2.280015  0.986806  1.297614 -0.749969
 6  0.688663  0.088751 -0.164766 -0.165378,
           0         1         2         3
 7 -0.382894 -0.157371  0.000836 -1.947379
 8 -1.618486  0.804667  1.919125 -0.290719
 9  0.392898 -0.264556  0.817233  0.680797]

pd.concat(pieces)

	0	1	2	3
0	0.639168	0.881631	0.198015	0.393510
1	0.503544	0.155373	1.218277	0.128893
2	-0.661486	1.365067	-0.010755	0.058110
3	0.185698	-0.750695	-0.637134	-1.811947
4	-0.493348	-0.246197	-0.700524	0.692042
5	-2.280015	0.986806	1.297614	-0.749969
6	0.688663	0.088751	-0.164766	-0.165378
7	-0.382894	-0.157371	0.000836	-1.947379
8	-1.618486	0.804667	1.919125	-0.290719
9	0.392898	-0.264556	0.817233	0.680797

Join

SQL-style merges:

left = pd.DataFrame({'key':['foo', 'foo'], 'lval':[1,2]})
left

	key	lval
0	foo	1
1	foo	2

right = pd.DataFrame({'key':['foo', 'foo'], 'lval':[4,5]})
right

	key	lval
0	foo	4
1	foo	5

pd.merge(left, right, on='key')

	key	lval_x	lval_y
0	foo	1	4
1	foo	1	5
2	foo	2	4
3	foo	2	5
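
The four rows above come from the duplicated key: every 'foo' row on the left is paired with every 'foo' row on the right. A sketch comparing this with unique keys (the column name `rval` and the second pair of frames are made up for this example):

```python
import pandas as pd

# Duplicate keys on both sides: 2 x 2 = 4 result rows.
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
many_to_many = pd.merge(left, right, on="key")

# Unique keys on both sides: each key matches exactly once.
left_u = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_u = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
one_to_one = pd.merge(left_u, right_u, on="key")

print(many_to_many)
print(one_to_one)
```

Since the value columns have different names here, no _x/_y suffixes are added.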

Append

Append rows to a DataFrame:

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df

	A	B	C	D
0	-1.526419	0.868844	-1.379758	0.498004
1	-0.917867	-0.137874	-0.909232	-0.523873
2	1.370409	-0.948766	1.728098	0.361813
3	-1.274621	-1.224051	-0.749470	-2.712027
4	-0.303875	-0.177942	0.496359	0.048004
5	-0.941436	0.044570	-0.229654	0.092941
6	0.465798	-0.835244	0.131745	2.219413
7	0.875844	0.243440	-1.050471	1.761330

s = df.iloc[3]
s

A   -1.274621
B   -1.224051
C   -0.749470
D   -2.712027
Name: 3, dtype: float64

# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat gives the same result.
pd.concat([df, s.to_frame().T], ignore_index=True)

	A	B	C	D
0	-1.526419	0.868844	-1.379758	0.498004
1	-0.917867	-0.137874	-0.909232	-0.523873
2	1.370409	-0.948766	1.728098	0.361813
3	-1.274621	-1.224051	-0.749470	-2.712027
4	-0.303875	-0.177942	0.496359	0.048004
5	-0.941436	0.044570	-0.229654	0.092941
6	0.465798	-0.835244	0.131745	2.219413
7	0.875844	0.243440	-1.050471	1.761330
8	-1.274621	-1.224051	-0.749470	-2.712027

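Grouping

By "group by" we mean a process involving one or more of the following steps:

Splitting: the data is divided into groups based on some criteria
Applying: a function is applied to each group independently
Combining: the results are combined into a data structure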
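
The three steps can be sketched by hand and compared with the built-in call. This is only an illustration on made-up data (`sales`, `shop` and `amount` are hypothetical names), not how pandas implements it internally:

```python
import pandas as pd

# Hypothetical demo data.
sales = pd.DataFrame({"shop": ["foo", "bar", "foo", "bar"],
                      "amount": [1, 2, 3, 4]})

# Split: iterating over a groupby yields one sub-frame per distinct 'shop'.
# Apply: sum the 'amount' column of each sub-frame.
# Combine: collect the per-group results into one Series.
by_hand = pd.Series({name: group["amount"].sum()
                     for name, group in sales.groupby("shop")})

# groupby(...).sum() performs the same three steps in one call.
built_in = sales.groupby("shop")["amount"].sum()

print(by_hand)
print(built_in)
```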

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three', 
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

	A	B	C	D
0	foo	one	-0.491461	-0.550970
1	bar	one	-0.468956	0.584847
2	foo	two	0.461989	0.372785
3	bar	three	0.290600	-2.142788
4	foo	two	0.448364	-2.036729
5	bar	two	0.639793	0.577440
6	foo	one	1.335309	-0.582853
7	bar	three	0.629100	0.223494

Group, then apply sum() to each group. Only the numeric columns C and D are selected, because column B holds strings:

df.groupby('A')[['C', 'D']].sum()

	C	D
A
bar	1.090537	-0.757006
foo	1.754201	-2.797767

Grouping by multiple columns forms a hierarchical index, and the function is then applied to each group:

df.groupby(['A','B']).sum()

		C	D
A	B
bar	one	-0.468956	0.584847
	three	0.919700	-1.919294
	two	0.639793	0.577440
foo	one	0.843848	-1.133823
foo	two	0.910352	-1.663943

Reshaping

Stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df2 = df[:4]
df2

		A	B
first	second
bar	one	0.138584	0.661800
bar	two	0.206128	0.668363
baz	one	-0.531729	-0.695009
baz	two	0.746672	-0.735620

The **stack()** method "compresses" a level in the DataFrame's columns:

stacked = df2.stack()
stacked

first  second   
bar    one     A    0.138584
               B    0.661800
       two     A    0.206128
               B    0.668363
baz    one     A   -0.531729
               B   -0.695009
       two     A    0.746672
               B   -0.735620
dtype: float64

For a "stacked" DataFrame or Series (one with a MultiIndex as its index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

stacked.unstack()

		A	B
first	second
bar	one	0.138584	0.661800
bar	two	0.206128	0.668363
baz	one	-0.531729	-0.695009
baz	two	0.746672	-0.735620

stacked.unstack(1)

	second	one	two
first
bar	A	0.138584	0.206128
bar	B	0.661800	0.668363
baz	A	-0.531729	0.746672
baz	B	-0.695009	-0.735620

stacked.unstack(0)

	first	bar	baz
second
one	A	0.138584	-0.531729
one	B	0.661800	-0.695009
two	A	0.206128	0.746672
two	B	0.668363	-0.735620
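
A reproducible sketch of the round trip with fixed integers instead of random numbers (the names `wide`, `long`, `restored` and `by_first` are made up for this example). Levels can be given by position, as above, or by name:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"])

# Hypothetical demo data.
wide = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=index)

long = wide.stack()                 # column labels become the innermost index level
restored = long.unstack()           # default: move the innermost level back to columns
by_first = long.unstack("first")    # move the level named 'first' to the columns instead

print(long)
print(restored)
print(by_first)
```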

Pivot tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

	A	B	C	D	E
0	one	A	foo	0.014262	-0.732936
1	one	B	foo	0.853949	1.719331
2	two	C	foo	-0.344640	0.765832
3	three	A	bar	1.908338	-1.447838
4	one	B	bar	-1.469287	0.954153
5	one	C	bar	-1.341587	-0.606839
6	two	A	foo	0.961927	0.054527
7	three	B	foo	0.829778	-0.581417
8	one	C	foo	0.623418	-0.456098
9	one	A	bar	1.745817	-0.684403
10	two	B	bar	-1.457706	-0.272665
11	three	C	bar	1.336044	2.080870

We can easily produce a pivot table from this data:

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

	C	bar	foo
A	B
one	A	1.745817	0.014262
	B	-1.469287	0.853949
	C	-1.341587	0.623418
three	A	1.908338	NaN
	B	NaN	0.829778
	C	1.336044	NaN
two	A	NaN	0.961927
	B	-1.457706	NaN
	C	NaN	-0.344640
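
In the table above every (A, B, C) combination occurs at most once, so each cell just shows the original D value. When several rows fall into the same cell, `pivot_table` aggregates them, using the mean unless `aggfunc` says otherwise. A small sketch on made-up data (`records` is a hypothetical frame):

```python
import pandas as pd

# Hypothetical demo data: ('one', 'foo') appears twice.
records = pd.DataFrame({"A": ["one", "one", "one", "two"],
                        "C": ["foo", "foo", "bar", "foo"],
                        "D": [1.0, 3.0, 5.0, 7.0]})

# Default aggregation is the mean of the rows that share a cell.
mean_table = pd.pivot_table(records, values="D", index="A", columns="C")

# aggfunc switches to another aggregation, here the sum.
sum_table = pd.pivot_table(records, values="D", index="A", columns="C",
                           aggfunc="sum")

print(mean_table)
print(sum_table)
```

Cells with no matching rows, such as ('two', 'bar') in the mean table, are filled with NaN.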

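Time series

pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting data sampled every second into data sampled every 5 minutes). This is very common in financial applications, but not limited to them.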

rng = pd.date_range('1/1/2012', periods=100, freq='S')
rng[:5]

DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04'],
              dtype='datetime64[ns]', freq='S')

ts = pd.Series(np.random.randint(0,500,len(rng)), index=rng)
ts

2012-01-01 00:00:00    193
2012-01-01 00:00:01    447
2012-01-01 00:00:02    407
2012-01-01 00:00:03    450
2012-01-01 00:00:04    368
                      ... 
2012-01-01 00:01:35    102
2012-01-01 00:01:36    446
2012-01-01 00:01:37    290
2012-01-01 00:01:38    256
2012-01-01 00:01:39    154
Freq: S, Length: 100, dtype: int32

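ts.resample('5Min') resamples the time series ts, changing the frequency of its index.

Specifically:

ts: a time series whose index is of datetime type.
'5Min': the new frequency, one bin every 5 minutes.

resample() groups the original samples into 5-minute bins whose labels fall on 5-minute boundaries. It does not compute anything by itself: it returns a Resampler object, and an aggregation such as .sum() or .mean() then reduces each bin to a single value. No samples are thrown away; every sample lands in exactly one bin.

ts.resample('5Min')
# Called on its own, this only returns the Resampler object shown below.
# To aggregate, chain a method, for example: ts.resample('5Min').sum()
# (older pandas versions wrote this as ts.resample('5Min', how='sum'); the how= keyword is no longer supported)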

<pandas.core.resample.DatetimeIndexResampler object at 0x0000017E3D5A86D0>
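
To get actual numbers out of a Resampler, chain an aggregation. The following is a sketch on made-up data (600 one-second samples, each equal to 1, under the hypothetical name `ticks`), chosen so the bin sums are easy to predict:

```python
import numpy as np
import pandas as pd

# Hypothetical demo data: ten minutes of one-second samples, all equal to 1.
rng = pd.date_range("2012-01-01", periods=600, freq=pd.Timedelta(seconds=1))
ticks = pd.Series(np.ones(len(rng), dtype=int), index=rng)

# Each 5-minute bin collects 300 samples; sum() reduces every bin to one value.
per_5min = ticks.resample("5min").sum()

print(per_5min)
```

The total is preserved: the bin sums add up to the sum of the original series.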

Time zone representation

rng = pd.date_range('3/6/2012', periods=5, freq='D')
rng

DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')

ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-03-06   -0.967180
2012-03-07   -0.917108
2012-03-08    0.252346
2012-03-09    0.461718
2012-03-10   -0.931543
Freq: D, dtype: float64

ts_utc = ts.tz_localize('UTC')
ts_utc

2012-03-06 00:00:00+00:00   -0.967180
2012-03-07 00:00:00+00:00   -0.917108
2012-03-08 00:00:00+00:00    0.252346
2012-03-09 00:00:00+00:00    0.461718
2012-03-10 00:00:00+00:00   -0.931543
Freq: D, dtype: float64

Converting to another time zone

ts_utc.tz_convert('US/Eastern')

2012-03-05 19:00:00-05:00   -0.967180
2012-03-06 19:00:00-05:00   -0.917108
2012-03-07 19:00:00-05:00    0.252346
2012-03-08 19:00:00-05:00    0.461718
2012-03-09 19:00:00-05:00   -0.931543
Freq: D, dtype: float64

Converting between time span representations

rng = pd.date_range('1/1/2012', periods=5, freq='M')
rng

DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
               '2012-05-31'],
              dtype='datetime64[ns]', freq='M')

ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-01-31   -0.841095
2012-02-29   -1.548459
2012-03-31   -0.473997
2012-04-30   -0.602313
2012-05-31   -0.519119
Freq: M, dtype: float64

ps = ts.to_period()
ps

2012-01   -0.841095
2012-02   -1.548459
2012-03   -0.473997
2012-04   -0.602313
2012-05   -0.519119
Freq: M, dtype: float64

ps.to_timestamp()

2012-01-01   -0.841095
2012-02-01   -1.548459
2012-03-01   -0.473997
2012-04-01   -0.602313
2012-05-01   -0.519119
Freq: MS, dtype: float64
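
The same conversion on fixed values, as a sketch (the dates and numbers are made up). It also shows the `how` argument of `to_timestamp`, which picks the start or the end of each period:

```python
import pandas as pd

# Hypothetical demo data indexed by month-end dates.
month_ends = pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31"])
ts = pd.Series([1.0, 2.0, 3.0], index=month_ends)

ps = ts.to_period("M")                  # timestamps -> monthly periods
back = ps.to_timestamp()                # periods -> timestamps at the period start
at_end = ps.to_timestamp(how="end")     # periods -> timestamps at the period end

print(ps)
print(back)
print(at_end)
```

Note that the round trip does not restore the original month-end dates by default: to_timestamp() lands on the first day of each month.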

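Converting between period and timestamp makes some convenient arithmetic functions available. In the following example, quarterly data with the fiscal year ending in November is converted to 9 a.m. on the first day of the month in which each quarter ends: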

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng

PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]')

ts = pd.Series(np.random.randn(len(prng)), index = prng)
ts[:5]

1990Q1   -0.316562
1990Q2   -0.055698
1990Q3   -0.267225
1990Q4    0.514381
1991Q1   -0.716024
Freq: Q-NOV, dtype: float64

ts.index = prng.asfreq('M', 'end').asfreq('H', 'start') + 9
ts[:6]

1990-02-01 09:00   -0.316562
1990-05-01 09:00   -0.055698
1990-08-01 09:00   -0.267225
1990-11-01 09:00    0.514381
1991-02-01 09:00   -0.716024
1991-05-01 09:00    1.530323
Freq: H, dtype: float64

Categoricals

Since version 0.15, pandas can include categorical data in a DataFrame.

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'e', 'e']})
df

	id	raw_grade
0	1	a
1	2	b
2	3	b
3	4	a
4	5	e
5	6	e

Convert raw_grade to a categorical data type:

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    e
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

Rename the categories to more meaningful names (assigning to .cat.categories in place is deprecated, so rename_categories is used instead):

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

Reorder the categories and add the missing categories at the same time:

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4     very bad
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

Sorting follows the order of the categories, not lexical order:

df.sort_values(by="grade")

	id	raw_grade	grade
4	5	e	very bad
5	6	e	very bad
1	2	b	good
2	3	b	good
0	1	a	very good
3	4	a	very good

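Grouping by a categorical column also shows empty categories (observed=False asks for this explicitly, because newer pandas versions change the default):

df.groupby("grade", observed=False).size()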

grade
very bad     2
bad          0
medium       0
good         2
very good    2
dtype: int64
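
A compact sketch of both categorical behaviours on a plain Series (the data is made up for this example): sorting follows the category order rather than the alphabet, and unused categories still show up with a count of zero.

```python
import pandas as pd

# Hypothetical demo data.
grades = pd.Series(["good", "very bad", "very good", "good"])

# Convert to categorical and fix the category order, including unused levels.
as_cat = grades.astype("category").cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])

by_category = list(as_cat.sort_values())   # follows the category order
by_text = list(grades.sort_values())       # plain strings: lexical order
counts = as_cat.value_counts(sort=False)   # unused categories appear with 0

print(by_category)
print(by_text)
print(counts)
```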

Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()


For a DataFrame, **plot()** is a convenient way to plot all of the columns with their labels:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')


Getting data in and out

CSV

Write to a csv file:

df.to_csv('data/foo.csv')

Read from a csv file:

pd.read_csv('data/foo.csv')

	Unnamed: 0	A	B	C	D
0	2000-01-01	-0.055622	1.889998	-1.094622	0.591568
1	2000-01-02	-0.408948	3.388888	-0.550835	1.362462
2	2000-01-03	-0.041353	1.880637	-2.417740	0.383862
3	2000-01-04	-0.442517	0.807854	-2.409478	1.418281
4	2000-01-05	0.442292	-0.735053	-2.970804	1.435721
...	...	...	...	...	...
995	2002-09-22	-28.470361	14.628723	-68.388462	-37.630887
996	2002-09-23	-28.995055	14.517615	-68.247657	-36.363608
997	2002-09-24	-28.756586	15.852227	-68.328996	-36.032141
998	2002-09-25	-29.331636	15.557680	-68.450159	-35.825427
999	2002-09-26	-31.065863	16.141154	-68.852564	-36.462825

1000 rows × 5 columns
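
The "Unnamed: 0" column above is the saved index, read back as an ordinary column. One way to restore it, sketched here on a tiny made-up frame and an in-memory buffer instead of a file, is to pass `index_col=0` and `parse_dates=True` to `read_csv`:

```python
import io
import pandas as pd

# Hypothetical demo data with a datetime index.
frame = pd.DataFrame({"A": [1, 2, 3]},
                     index=pd.date_range("2000-01-01", periods=3))

csv_text = frame.to_csv()   # no path given: the CSV is returned as a string

# Without index_col, the saved index comes back as a column named 'Unnamed: 0'.
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index; parse_dates=True parses it as dates.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(plain.columns.tolist())
print(restored)
```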

HDF5

Reading and writing HDFStores

Write to an HDF5 Store:

df.to_hdf('data/foo.h5', key='df')

Read from an HDF5 Store:

pd.read_hdf('data/foo.h5', 'df')

	A	B	C	D
2000-01-01	-0.055622	1.889998	-1.094622	0.591568
2000-01-02	-0.408948	3.388888	-0.550835	1.362462
2000-01-03	-0.041353	1.880637	-2.417740	0.383862
2000-01-04	-0.442517	0.807854	-2.409478	1.418281
2000-01-05	0.442292	-0.735053	-2.970804	1.435721
...	...	...	...	...
2002-09-22	-28.470361	14.628723	-68.388462	-37.630887
2002-09-23	-28.995055	14.517615	-68.247657	-36.363608
2002-09-24	-28.756586	15.852227	-68.328996	-36.032141
2002-09-25	-29.331636	15.557680	-68.450159	-35.825427
2002-09-26	-31.065863	16.141154	-68.852564	-36.462825

1000 rows × 4 columns

Excel

Reading and writing MS Excel files

Write to an Excel file:

df.to_excel('data/foo.xlsx', sheet_name='Sheet1')

Read from an Excel file:

pd.read_excel('data/foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

	Unnamed: 0	A	B	C	D
0	2000-01-01	-0.055622	1.889998	-1.094622	0.591568
1	2000-01-02	-0.408948	3.388888	-0.550835	1.362462
2	2000-01-03	-0.041353	1.880637	-2.417740	0.383862
3	2000-01-04	-0.442517	0.807854	-2.409478	1.418281
4	2000-01-05	0.442292	-0.735053	-2.970804	1.435721
...	...	...	...	...	...
995	2002-09-22	-28.470361	14.628723	-68.388462	-37.630887
996	2002-09-23	-28.995055	14.517615	-68.247657	-36.363608
997	2002-09-24	-28.756586	15.852227	-68.328996	-36.032141
998	2002-09-25	-29.331636	15.557680	-68.450159	-35.825427
999	2002-09-26	-31.065863	16.141154	-68.852564	-36.462825