A Twenty-Minute Introduction to pandas
#coding:utf8
%matplotlib inline
 
This is a short introduction to pandas, geared toward new users; for more advanced material, see the Cookbook.
Customarily, we begin by importing the following libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
Creating objects
Create a Series by passing a list of values; pandas creates a default integer index:
s = pd.Series([1,3,5,np.nan,6,8])
s
 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
 
Create a DataFrame by passing a numpy array, a datetime index, and labeled columns:
dates = pd.date_range('20130101', periods=6)
dates
 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
 
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
 
| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.194873 | 0.298287 | 0.073043 | -0.681957 | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 1.187169 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -1.634272 | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 1.023221 | 
| 2013-01-06 | -0.445221 | -1.520998 | 0.717376 | 1.657476 | 
Create a DataFrame by passing a dict of objects that can be converted to series-like structures:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3]*4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2
 
| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo | 
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo | 
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo | 
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo | 
The columns of the resulting DataFrame have distinct dtypes:
df2.dtypes
 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
 
Viewing data
View the top and bottom rows of the frame:
df.head()
 
| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.194873 | 0.298287 | 0.073043 | -0.681957 | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 1.187169 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -1.634272 | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 1.023221 | 
df.tail(3)
 
| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -1.634272 | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 1.023221 | 
| 2013-01-06 | -0.445221 | -1.520998 | 0.717376 | 1.657476 | 
Display the index, the columns, and the underlying numpy data:
df.index
 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
 
df.columns
 
Index(['A', 'B', 'C', 'D'], dtype='object')
 
df.values
 
array([[ 0.19487255,  0.29828663,  0.07304296, -0.68195723],
       [-0.67942918,  0.39797196,  0.88738797,  1.18716883],
       [ 0.78224424, -0.25182784, -0.73624252,  1.75297344],
       [-1.8770662 , -0.06096652, -2.10376905, -1.63427152],
       [ 0.40545892, -1.15198867, -1.30928606,  1.02322057],
       [-0.44522115, -1.52099782,  0.71737636,  1.65747636]])
 
describe() shows a quick statistical summary of the data:
df.describe()
 
| | A | B | C | D |
|---|---|---|---|---|
| count | 6.000000 | 6.000000 | 6.000000 | 6.000000 | 
| mean | -0.269857 | -0.381587 | -0.411915 | 0.550768 | 
| std | 0.955046 | 0.785029 | 1.180805 | 1.385087 | 
| min | -1.877066 | -1.520998 | -2.103769 | -1.634272 | 
| 25% | -0.620877 | -0.926948 | -1.166025 | -0.255663 | 
| 50% | -0.125174 | -0.156397 | -0.331600 | 1.105195 | 
| 75% | 0.352812 | 0.208473 | 0.556293 | 1.539899 | 
| max | 0.782244 | 0.397972 | 0.887388 | 1.752973 | 
Transpose the data:
df.T
 
| | 2013-01-01 | 2013-01-02 | 2013-01-03 | 2013-01-04 | 2013-01-05 | 2013-01-06 |
|---|---|---|---|---|---|---|
| A | 0.194873 | -0.679429 | 0.782244 | -1.877066 | 0.405459 | -0.445221 | 
| B | 0.298287 | 0.397972 | -0.251828 | -0.060967 | -1.151989 | -1.520998 | 
| C | 0.073043 | 0.887388 | -0.736243 | -2.103769 | -1.309286 | 0.717376 | 
| D | -0.681957 | 1.187169 | 1.752973 | -1.634272 | 1.023221 | 1.657476 | 
Sort along an axis (axis=1 sorts the columns; axis=0, the default, sorts the index):
df.sort_index(axis=1, ascending=False)
 
| | D | C | B | A |
|---|---|---|---|---|
| 2013-01-01 | -0.681957 | 0.073043 | 0.298287 | 0.194873 | 
| 2013-01-02 | 1.187169 | 0.887388 | 0.397972 | -0.679429 | 
| 2013-01-03 | 1.752973 | -0.736243 | -0.251828 | 0.782244 | 
| 2013-01-04 | -1.634272 | -2.103769 | -0.060967 | -1.877066 | 
| 2013-01-05 | 1.023221 | -1.309286 | -1.151989 | 0.405459 | 
| 2013-01-06 | 1.657476 | 0.717376 | -1.520998 | -0.445221 | 
Sort by values:
df.sort_values(by='D')
 
| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -1.634272 | 
| 2013-01-01 | 0.194873 | 0.298287 | 0.073043 | -0.681957 | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 1.023221 | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 1.187169 | 
| 2013-01-06 | -0.445221 | -1.520998 | 0.717376 | 1.657476 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | 
Selection
Note: while standard Python/NumPy expressions for selecting and setting work here, for production code we recommend the optimized pandas data-access methods .at, .iat, .loc, and .iloc (the older .ix accessor is deprecated and has been removed from recent pandas versions).
Getting
Selecting a single column yields a Series, equivalent to df.A:
df['A']
 
2013-01-01    0.194873
2013-01-02   -0.679429
2013-01-03    0.782244
2013-01-04   -1.877066
2013-01-05    0.405459
2013-01-06   -0.445221
Freq: D, Name: A, dtype: float64
 
Selecting via [] slices the rows:
df[0:4]
 
| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.194873 | 0.298287 | 0.073043 | -0.681957 | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 1.187169 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -1.634272 | 
df['20130102':'20130104']
 
| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 1.187169 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -1.634272 | 
Selection by label
Get a cross-section using a label:
dates[0]
 
Timestamp('2013-01-01 00:00:00', freq='D')
 
df.loc[dates[0]]
 
A    0.194873
B    0.298287
C    0.073043
D   -0.681957
Name: 2013-01-01 00:00:00, dtype: float64
 
Select on multiple axes by label:
df.loc[:,['A','B']]
 
| | A | B |
|---|---|---|
| 2013-01-01 | 0.194873 | 0.298287 | 
| 2013-01-02 | -0.679429 | 0.397972 | 
| 2013-01-03 | 0.782244 | -0.251828 | 
| 2013-01-04 | -1.877066 | -0.060967 | 
| 2013-01-05 | 0.405459 | -1.151989 | 
| 2013-01-06 | -0.445221 | -1.520998 | 
df.loc[:,['A','B']][:3]
 
| | A | B |
|---|---|---|
| 2013-01-01 | 0.194873 | 0.298287 | 
| 2013-01-02 | -0.679429 | 0.397972 | 
| 2013-01-03 | 0.782244 | -0.251828 | 
Label slicing includes both endpoints:
df.loc['20130102':'20130104',['A','B']]
 
| | A | B |
|---|---|---|
| 2013-01-02 | -0.679429 | 0.397972 | 
| 2013-01-03 | 0.782244 | -0.251828 | 
| 2013-01-04 | -1.877066 | -0.060967 | 
Reduce the dimensions of the returned object:
df.loc['20130102',['A','B']]
 
A   -0.679429
B    0.397972
Name: 2013-01-02 00:00:00, dtype: float64
 
Get a scalar value:
df.loc[dates[0],'A']
 
0.19487255317338711
 
Fast access to a scalar (equivalent to the method above):
df.at[dates[0],'A']
 
0.19487255317338711
 
Selection by position
Select via the position of the passed integers:
df.iloc[3]
 
A   -1.877066
B   -0.060967
C   -2.103769
D   -1.634272
Name: 2013-01-04 00:00:00, dtype: float64
 
Select by integer slices, acting similarly to python/numpy:
df.iloc[3:5,0:2]
 
| | A | B |
|---|---|---|
| 2013-01-04 | -1.877066 | -0.060967 | 
| 2013-01-05 | 0.405459 | -1.151989 | 
Slice rows only:
df.iloc[1:3,:]
 
| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 1.187169 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | 
Slice columns only:
df.iloc[:,1:3]
 
| | B | C |
|---|---|---|
| 2013-01-01 | 0.298287 | 0.073043 | 
| 2013-01-02 | 0.397972 | 0.887388 | 
| 2013-01-03 | -0.251828 | -0.736243 | 
| 2013-01-04 | -0.060967 | -2.103769 | 
| 2013-01-05 | -1.151989 | -1.309286 | 
| 2013-01-06 | -1.520998 | 0.717376 | 
Get a value explicitly:
df.iloc[1,1]
 
0.39797195640479976
 
Fast access to a scalar (equivalent to the method above):
df.iat[1,1]
 
0.39797195640479976
 
Boolean indexing
Use a single column's values to select data:
df[df.A > 0]
 
| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.194873 | 0.298287 | 0.073043 | -0.681957 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 1.023221 | 
Select values with a where condition:
df[df > 0]
 
| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.194873 | 0.298287 | 0.073043 | NaN | 
| 2013-01-02 | NaN | 0.397972 | 0.887388 | 1.187169 | 
| 2013-01-03 | 0.782244 | NaN | NaN | 1.752973 | 
| 2013-01-04 | NaN | NaN | NaN | NaN | 
| 2013-01-05 | 0.405459 | NaN | NaN | 1.023221 | 
| 2013-01-06 | NaN | NaN | 0.717376 | 1.657476 | 
Filter with the **isin()** method:
df2 = df.copy()
 
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
 
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-01-01 | 0.194873 | 0.298287 | 0.073043 | -0.681957 | one | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 1.187169 | one | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | two | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -1.634272 | three | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 1.023221 | four | 
| 2013-01-06 | -0.445221 | -1.520998 | 0.717376 | 1.657476 | three | 
df2[df2['E'].isin(['two', 'four'])]
 
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | two | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 1.023221 | four | 
Setting
Setting a new column automatically aligns the data by index:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102',periods=6))
s1
 
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
 
df['F'] = s1
df
 
| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | 0.194873 | 0.298287 | 0.073043 | -0.681957 | NaN | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 1.187169 | 1.0 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | 2.0 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -1.634272 | 3.0 | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 1.023221 | 4.0 | 
| 2013-01-06 | -0.445221 | -1.520998 | 0.717376 | 1.657476 | 5.0 | 
Set values by label:
df.at[dates[0], 'A'] = 0
df
 
| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 0.298287 | 0.073043 | -0.681957 | NaN | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 1.187169 | 1.0 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | 2.0 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -1.634272 | 3.0 | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 1.023221 | 4.0 | 
| 2013-01-06 | -0.445221 | -1.520998 | 0.717376 | 1.657476 | 5.0 | 
Set values by position:
df.iat[0,1] = 8888
df
 
| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 8888.000000 | 0.073043 | -0.681957 | NaN | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 1.187169 | 1.0 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 1.752973 | 2.0 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -1.634272 | 3.0 | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 1.023221 | 4.0 | 
| 2013-01-06 | -0.445221 | -1.520998 | 0.717376 | 1.657476 | 5.0 | 
Set values by assigning a numpy array:
df.loc[:,'D'] = np.array([5] * len(df))
df
 
| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 8888.000000 | 0.073043 | 5 | NaN | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 5 | 1.0 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 5 | 2.0 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | 5 | 3.0 | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 5 | 4.0 | 
| 2013-01-06 | -0.445221 | -1.520998 | 0.717376 | 5 | 5.0 | 
Set values with a where operation:
df2 = df.copy()
df2[df2 > 0] = -df2
df2
 
| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | -8888.000000 | -0.073043 | -5 | NaN | 
| 2013-01-02 | -0.679429 | -0.397972 | -0.887388 | -5 | -1.0 | 
| 2013-01-03 | -0.782244 | -0.251828 | -0.736243 | -5 | -2.0 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | -5 | -3.0 | 
| 2013-01-05 | -0.405459 | -1.151989 | -1.309286 | -5 | -4.0 | 
| 2013-01-06 | -0.445221 | -1.520998 | -0.717376 | -5 | -5.0 | 
Missing data
pandas uses np.nan to represent missing data; such values are excluded from computations by default.
reindex() lets you change, add, or delete the index on a specified axis, returning a copy of the data.
dates[0]
 
Timestamp('2013-01-01 00:00:00', freq='D')
 
df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
 
| | A | B | C | D | F | E |
|---|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 8888.000000 | 0.073043 | 5 | NaN | 1.0 | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 5 | 1.0 | 1.0 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 5 | 2.0 | NaN | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | 5 | 3.0 | NaN | 
Drop any rows that have missing data:
df1.dropna(how='any')  # how='any' drops a row if it contains any NaN
 
| | A | B | C | D | F | E |
|---|---|---|---|---|---|---|
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 5 | 1.0 | 1.0 | 
Fill in missing data:
df1.fillna(value=5)
 
| | A | B | C | D | F | E |
|---|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 8888.000000 | 0.073043 | 5 | 5.0 | 1.0 | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 5 | 1.0 | 1.0 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 5 | 2.0 | 5.0 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | 5 | 3.0 | 5.0 | 
Get a boolean mask of where values are NaN:
pd.isnull(df1)
 
| | A | B | C | D | F | E |
|---|---|---|---|---|---|---|
| 2013-01-01 | False | False | False | False | True | False | 
| 2013-01-02 | False | False | False | False | False | False | 
| 2013-01-03 | False | False | False | False | False | True | 
| 2013-01-04 | False | False | False | False | False | True | 
Operations
Stats
Operations in general exclude missing data.
Perform a descriptive statistic:
df.mean()  # column-wise (axis=0) by default
 
A      -0.302336
B    1480.902032
C      -0.411915
D       5.000000
F       3.000000
dtype: float64
 
The same operation on the other axis:
df.mean(1)  # axis=1: the mean of each row
 
2013-01-01    2223.268261
2013-01-02       1.321186
2013-01-03       1.358835
2013-01-04       0.791640
2013-01-05       1.388837
2013-01-06       1.750231
Freq: D, dtype: float64
 
Operating with objects that have different dimensionality requires alignment; in addition, pandas automatically broadcasts along the specified dimension.
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
# shift(2) moves the values down two positions; the first two become NaN
s
 
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
 
df
 
| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 8888.000000 | 0.073043 | 5 | NaN | 
| 2013-01-02 | -0.679429 | 0.397972 | 0.887388 | 5 | 1.0 | 
| 2013-01-03 | 0.782244 | -0.251828 | -0.736243 | 5 | 2.0 | 
| 2013-01-04 | -1.877066 | -0.060967 | -2.103769 | 5 | 3.0 | 
| 2013-01-05 | 0.405459 | -1.151989 | -1.309286 | 5 | 4.0 | 
| 2013-01-06 | -0.445221 | -1.520998 | 0.717376 | 5 | 5.0 | 
df.sub(s, axis='index')
# Subtract s from df, aligning on the index: each value in df has the
# s value for the matching date subtracted from it.
 
| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | NaN | NaN | NaN | NaN | NaN | 
| 2013-01-02 | NaN | NaN | NaN | NaN | NaN | 
| 2013-01-03 | -0.217756 | -1.251828 | -1.736243 | 4.0 | 1.0 | 
| 2013-01-04 | -4.877066 | -3.060967 | -5.103769 | 2.0 | 0.0 | 
| 2013-01-05 | -4.594541 | -6.151989 | -6.309286 | 0.0 | -1.0 | 
| 2013-01-06 | NaN | NaN | NaN | NaN | NaN | 
Apply
Apply a function to the data:
df.apply(np.cumsum)
# np.cumsum computes a running total; apply runs it down each column.
 
| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 8888.000000 | 0.073043 | 5 | NaN | 
| 2013-01-02 | -0.679429 | 8888.397972 | 0.960431 | 10 | 1.0 | 
| 2013-01-03 | 0.102815 | 8888.146144 | 0.224188 | 15 | 3.0 | 
| 2013-01-04 | -1.774251 | 8888.085178 | -1.879581 | 20 | 6.0 | 
| 2013-01-05 | -1.368792 | 8886.933189 | -3.188867 | 25 | 10.0 | 
| 2013-01-06 | -1.814013 | 8885.412191 | -2.471490 | 30 | 15.0 | 
df.apply(lambda x: x.max() - x.min())
# For each column, the lambda subtracts the column minimum from the
# column maximum, i.e. it computes the range of the column.
 
A       2.659310
B    8889.520998
C       2.991157
D       0.000000
F       4.000000
dtype: float64
 
Value counts
s = pd.Series(np.random.randint(0, 7, size=10))
s
 
0    2
1    5
2    1
3    3
4    2
5    0
6    0
7    1
8    4
9    1
dtype: int32
 
s.value_counts()
 
1    3
2    2
0    2
5    1
3    1
4    1
dtype: int64
 
String methods
Series is equipped with a set of string-processing methods in the str attribute that make it easy to operate on each element of the array, as in the snippet below. Note that pattern matching in str generally uses regular expressions by default.
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
 
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
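Because matching in .str defaults to regular expressions, filtering by pattern is a one-liner. A minimal sketch (the pattern '^A' and the na=False choice are illustrative, not from the original text):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'Aaba', 'Baca', np.nan, 'cat'])
# '^A' is a regex anchored at the start of the string;
# na=False maps the missing value to False instead of NaN
mask = s.str.contains('^A', na=False)
print(mask.tolist())  # [True, True, False, False, False]
```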
 
Merging
Concat
pandas provides various facilities for easily combining Series and DataFrame objects with different kinds of set logic for the indexes.
Concatenate pandas objects together with **concat()**:
df = pd.DataFrame(np.random.randn(10,4))
df
 
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.639168 | 0.881631 | 0.198015 | 0.393510 | 
| 1 | 0.503544 | 0.155373 | 1.218277 | 0.128893 | 
| 2 | -0.661486 | 1.365067 | -0.010755 | 0.058110 | 
| 3 | 0.185698 | -0.750695 | -0.637134 | -1.811947 | 
| 4 | -0.493348 | -0.246197 | -0.700524 | 0.692042 | 
| 5 | -2.280015 | 0.986806 | 1.297614 | -0.749969 | 
| 6 | 0.688663 | 0.088751 | -0.164766 | -0.165378 | 
| 7 | -0.382894 | -0.157371 | 0.000836 | -1.947379 | 
| 8 | -1.618486 | 0.804667 | 1.919125 | -0.290719 | 
| 9 | 0.392898 | -0.264556 | 0.817233 | 0.680797 | 
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces
 
[          0         1         2         3
 0  0.639168  0.881631  0.198015  0.393510
 1  0.503544  0.155373  1.218277  0.128893
 2 -0.661486  1.365067 -0.010755  0.058110,
           0         1         2         3
 3  0.185698 -0.750695 -0.637134 -1.811947
 4 -0.493348 -0.246197 -0.700524  0.692042
 5 -2.280015  0.986806  1.297614 -0.749969
 6  0.688663  0.088751 -0.164766 -0.165378,
           0         1         2         3
 7 -0.382894 -0.157371  0.000836 -1.947379
 8 -1.618486  0.804667  1.919125 -0.290719
 9  0.392898 -0.264556  0.817233  0.680797]
 
pd.concat(pieces)
 
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.639168 | 0.881631 | 0.198015 | 0.393510 | 
| 1 | 0.503544 | 0.155373 | 1.218277 | 0.128893 | 
| 2 | -0.661486 | 1.365067 | -0.010755 | 0.058110 | 
| 3 | 0.185698 | -0.750695 | -0.637134 | -1.811947 | 
| 4 | -0.493348 | -0.246197 | -0.700524 | 0.692042 | 
| 5 | -2.280015 | 0.986806 | 1.297614 | -0.749969 | 
| 6 | 0.688663 | 0.088751 | -0.164766 | -0.165378 | 
| 7 | -0.382894 | -0.157371 | 0.000836 | -1.947379 | 
| 8 | -1.618486 | 0.804667 | 1.919125 | -0.290719 | 
| 9 | 0.392898 | -0.264556 | 0.817233 | 0.680797 | 
Join
SQL-style merges:
left = pd.DataFrame({'key':['foo', 'foo'], 'lval':[1,2]})
left
 
| | key | lval |
|---|---|---|
| 0 | foo | 1 | 
| 1 | foo | 2 | 
right = pd.DataFrame({'key':['foo', 'foo'], 'lval':[4,5]})
right
 
| | key | lval |
|---|---|---|
| 0 | foo | 4 | 
| 1 | foo | 5 | 
pd.merge(left, right, on='key')
 
| | key | lval_x | lval_y |
|---|---|---|---|
| 0 | foo | 1 | 4 | 
| 1 | foo | 1 | 5 | 
| 2 | foo | 2 | 4 | 
| 3 | foo | 2 | 5 | 
Append
Append rows to a DataFrame:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
 
| | A | B | C | D |
|---|---|---|---|---|
| 0 | -1.526419 | 0.868844 | -1.379758 | 0.498004 | 
| 1 | -0.917867 | -0.137874 | -0.909232 | -0.523873 | 
| 2 | 1.370409 | -0.948766 | 1.728098 | 0.361813 | 
| 3 | -1.274621 | -1.224051 | -0.749470 | -2.712027 | 
| 4 | -0.303875 | -0.177942 | 0.496359 | 0.048004 | 
| 5 | -0.941436 | 0.044570 | -0.229654 | 0.092941 | 
| 6 | 0.465798 | -0.835244 | 0.131745 | 2.219413 | 
| 7 | 0.875844 | 0.243440 | -1.050471 | 1.761330 | 
s = df.iloc[3]
s
 
A   -1.274621
B   -1.224051
C   -0.749470
D   -2.712027
Name: 3, dtype: float64
 
df.append(s, ignore_index=True)
 
C:\Users\48125\AppData\Local\Temp\ipykernel_17504\1384017944.py:3: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df.append(s, ignore_index=True)
 
| | A | B | C | D |
|---|---|---|---|---|
| 0 | -1.526419 | 0.868844 | -1.379758 | 0.498004 | 
| 1 | -0.917867 | -0.137874 | -0.909232 | -0.523873 | 
| 2 | 1.370409 | -0.948766 | 1.728098 | 0.361813 | 
| 3 | -1.274621 | -1.224051 | -0.749470 | -2.712027 | 
| 4 | -0.303875 | -0.177942 | 0.496359 | 0.048004 | 
| 5 | -0.941436 | 0.044570 | -0.229654 | 0.092941 | 
| 6 | 0.465798 | -0.835244 | 0.131745 | 2.219413 | 
| 7 | 0.875844 | 0.243440 | -1.050471 | 1.761330 | 
| 8 | -1.274621 | -1.224051 | -0.749470 | -2.712027 | 
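As the warning above notes, DataFrame.append is deprecated. A sketch of the pandas.concat replacement, converting the row Series to a one-row frame first (variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]
# s.to_frame().T turns the row Series into a one-row DataFrame;
# ignore_index=True renumbers the result 0..8, as append did
out = pd.concat([df, s.to_frame().T], ignore_index=True)
print(out.shape)  # (9, 4)
```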
Grouping
By "group by" we mean a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
 
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three', 
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
 
| | A | B | C | D |
|---|---|---|---|---|
| 0 | foo | one | -0.491461 | -0.550970 | 
| 1 | bar | one | -0.468956 | 0.584847 | 
| 2 | foo | two | 0.461989 | 0.372785 | 
| 3 | bar | three | 0.290600 | -2.142788 | 
| 4 | foo | two | 0.448364 | -2.036729 | 
| 5 | bar | two | 0.639793 | 0.577440 | 
| 6 | foo | one | 1.335309 | -0.582853 | 
| 7 | bar | three | 0.629100 | 0.223494 | 
Group, then apply the sum() function to the resulting groups:
df.groupby('A').sum()
 
C:\Users\48125\AppData\Local\Temp\ipykernel_17504\1885751491.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  df.groupby('A').sum()
 
| A | C | D |
|---|---|---|
| bar | 1.090537 | -0.757006 | 
| foo | 1.754201 | -2.797767 | 
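The numeric_only warning above can be silenced by selecting the numeric columns explicitly before aggregating (or by passing numeric_only=True); a sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three'],
                   'C': [1.0, 2.0, 3.0, 4.0],
                   'D': [10, 20, 30, 40]})
# Selecting ['C', 'D'] first means the string column B
# never reaches sum(), so no warning is raised
result = df.groupby('A')[['C', 'D']].sum()
print(result.loc['foo', 'C'])  # 4.0
```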
Grouping by multiple columns forms a hierarchical index, to which we again apply a function:
df.groupby(['A','B']).sum()
 
| A | B | C | D |
|---|---|---|---|
| bar | one | -0.468956 | 0.584847 |
| | three | 0.919700 | -1.919294 |
| | two | 0.639793 | 0.577440 |
| foo | one | 0.843848 | -1.133823 |
| | two | 0.910352 | -1.663943 |
Reshaping
Stack
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
 
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
 
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
 
df2 = df[:4]
df2
 
| first | second | A | B |
|---|---|---|---|
| bar | one | 0.138584 | 0.661800 |
| | two | 0.206128 | 0.668363 |
| baz | one | -0.531729 | -0.695009 |
| | two | 0.746672 | -0.735620 |
The **stack()** method "compresses" a level in the DataFrame's columns:
stacked = df2.stack()
stacked
 
first  second   
bar    one     A    0.138584
               B    0.661800
       two     A    0.206128
               B    0.668363
baz    one     A   -0.531729
               B   -0.695009
       two     A    0.746672
               B   -0.735620
dtype: float64
 
With a "stacked" DataFrame or Series (having a MultiIndex as its index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
stacked.unstack()
 
| first | second | A | B |
|---|---|---|---|
| bar | one | 0.138584 | 0.661800 |
| | two | 0.206128 | 0.668363 |
| baz | one | -0.531729 | -0.695009 |
| | two | 0.746672 | -0.735620 |
stacked.unstack(1)
 
| first | | one | two |
|---|---|---|---|
| bar | A | 0.138584 | 0.206128 |
| | B | 0.661800 | 0.668363 |
| baz | A | -0.531729 | 0.746672 |
| | B | -0.695009 | -0.735620 |
stacked.unstack(0)
 
| second | | bar | baz |
|---|---|---|---|
| one | A | 0.138584 | -0.531729 |
| | B | 0.661800 | -0.695009 |
| two | A | 0.206128 | 0.746672 |
| | B | 0.668363 | -0.735620 |
Pivot tables
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
 
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | one | A | foo | 0.014262 | -0.732936 | 
| 1 | one | B | foo | 0.853949 | 1.719331 | 
| 2 | two | C | foo | -0.344640 | 0.765832 | 
| 3 | three | A | bar | 1.908338 | -1.447838 | 
| 4 | one | B | bar | -1.469287 | 0.954153 | 
| 5 | one | C | bar | -1.341587 | -0.606839 | 
| 6 | two | A | foo | 0.961927 | 0.054527 | 
| 7 | three | B | foo | 0.829778 | -0.581417 | 
| 8 | one | C | foo | 0.623418 | -0.456098 | 
| 9 | one | A | bar | 1.745817 | -0.684403 | 
| 10 | two | B | bar | -1.457706 | -0.272665 | 
| 11 | three | C | bar | 1.336044 | 2.080870 | 
We can produce pivot tables from this data very easily:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
 
| A | B | bar | foo |
|---|---|---|---|
| one | A | 1.745817 | 0.014262 |
| | B | -1.469287 | 0.853949 |
| | C | -1.341587 | 0.623418 |
| three | A | 1.908338 | NaN |
| | B | NaN | 0.829778 |
| | C | 1.336044 | NaN |
| two | A | NaN | 0.961927 |
| | B | -1.457706 | NaN |
| | C | NaN | -0.344640 |
Time series
pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
rng = pd.date_range('1/1/2012', periods=100, freq='S')
rng[:5]
 
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04'],
              dtype='datetime64[ns]', freq='S')
 
ts = pd.Series(np.random.randint(0,500,len(rng)), index=rng)
ts
 
2012-01-01 00:00:00    193
2012-01-01 00:00:01    447
2012-01-01 00:00:02    407
2012-01-01 00:00:03    450
2012-01-01 00:00:04    368
                      ... 
2012-01-01 00:01:35    102
2012-01-01 00:01:36    446
2012-01-01 00:01:37    290
2012-01-01 00:01:38    256
2012-01-01 00:01:39    154
Freq: S, Length: 100, dtype: int32
 
ts.resample('5Min') resamples the time series ts, changing the frequency of its index.
Specifically:
ts: a time series whose index is of datetime type.
'5Min': the new bin size, one bin every five minutes.
 
resample() groups the observations into 5-minute bins whose edges fall on the 5-minute marks; on its own it returns a lazy Resampler object, and an aggregation must be chained onto it to produce values.
ts.resample('5Min')
# Returns a Resampler; the old how='sum' keyword is gone, so write
# ts.resample('5Min').sum() to aggregate each bin
 
<pandas.core.resample.DatetimeIndexResampler object at 0x0000017E3D5A86D0>
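The Resampler above only describes the binning; chaining an aggregation produces actual values. A sketch with a deterministic series (the integer data is illustrative):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.arange(100), index=rng)
# All 100 seconds fall into one 5-minute bin starting at midnight,
# so the result is a single row: 0 + 1 + ... + 99 = 4950
five_min = ts.resample('5Min').sum()
print(five_min.iloc[0])  # 4950
```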
 
Time zone representation
rng = pd.date_range('3/6/2012', periods=5, freq='D')
rng
 
DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')
 
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
 
2012-03-06   -0.967180
2012-03-07   -0.917108
2012-03-08    0.252346
2012-03-09    0.461718
2012-03-10   -0.931543
Freq: D, dtype: float64
 
ts_utc = ts.tz_localize('UTC')
ts_utc
 
2012-03-06 00:00:00+00:00   -0.967180
2012-03-07 00:00:00+00:00   -0.917108
2012-03-08 00:00:00+00:00    0.252346
2012-03-09 00:00:00+00:00    0.461718
2012-03-10 00:00:00+00:00   -0.931543
Freq: D, dtype: float64
 
Converting to another time zone:
ts_utc.tz_convert('US/Eastern')
 
2012-03-05 19:00:00-05:00   -0.967180
2012-03-06 19:00:00-05:00   -0.917108
2012-03-07 19:00:00-05:00    0.252346
2012-03-08 19:00:00-05:00    0.461718
2012-03-09 19:00:00-05:00   -0.931543
Freq: D, dtype: float64
 
Converting between time span representations:
rng = pd.date_range('1/1/2012', periods=5, freq='M')
rng
 
DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
               '2012-05-31'],
              dtype='datetime64[ns]', freq='M')
 
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
 
2012-01-31   -0.841095
2012-02-29   -1.548459
2012-03-31   -0.473997
2012-04-30   -0.602313
2012-05-31   -0.519119
Freq: M, dtype: float64
 
ps = ts.to_period()
ps
 
2012-01   -0.841095
2012-02   -1.548459
2012-03   -0.473997
2012-04   -0.602313
2012-05   -0.519119
Freq: M, dtype: float64
 
ps.to_timestamp()
 
2012-01-01   -0.841095
2012-02-01   -1.548459
2012-03-01   -0.473997
2012-04-01   -0.602313
2012-05-01   -0.519119
Freq: MS, dtype: float64
 
Converting between period and timestamp enables some convenient arithmetic functions. In the following example, we convert quarterly data with a year ending in November to 9 AM on the first day of each quarter's ending month:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng
 
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]')
 
ts = pd.Series(np.random.randn(len(prng)), index = prng)
ts[:5]
 
1990Q1   -0.316562
1990Q2   -0.055698
1990Q3   -0.267225
1990Q4    0.514381
1991Q1   -0.716024
Freq: Q-NOV, dtype: float64
 
ts.index = prng.asfreq('M', 'end').asfreq('H', 'start') + 9
ts[:6]
 
1990-02-01 09:00   -0.316562
1990-05-01 09:00   -0.055698
1990-08-01 09:00   -0.267225
1990-11-01 09:00    0.514381
1991-02-01 09:00   -0.716024
1991-05-01 09:00    1.530323
Freq: H, dtype: float64
 
Categoricals
Since version 0.15, pandas has included categorical data in DataFrames.
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'e', 'e']})
df
 
| | id | raw_grade |
|---|---|---|
| 0 | 1 | a | 
| 1 | 2 | b | 
| 2 | 3 | b | 
| 3 | 4 | a | 
| 4 | 5 | e | 
| 5 | 6 | e | 
Convert raw_grade to a categorical data type:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
 
0    a
1    b
2    b
3    a
4    e
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
 
Rename the categories to more meaningful names:
df["grade"].cat.categories = ["very good", "good", "very bad"]
 
C:\Users\48125\AppData\Local\Temp\ipykernel_17504\2985790766.py:1: FutureWarning: Setting categories in-place is deprecated and will raise in a future version. Use rename_categories instead.
  df["grade"].cat.categories = ["very good", "good", "very bad"]
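As the warning suggests, the non-deprecated spelling uses Series.cat.rename_categories; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "raw_grade": ['a', 'b', 'e']})
df["grade"] = df["raw_grade"].astype("category")
# rename_categories returns a new Series rather than mutating in place;
# the list maps onto the existing categories ['a', 'b', 'e'] in order
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
print(df["grade"].tolist())  # ['very good', 'good', 'very bad']
```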
 
Reorder the categories and simultaneously add the missing categories:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
 
0    very good
1         good
2         good
3    very good
4     very bad
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
 
Sorting is per the order of the categories, not lexical order:
df.sort_values(by="grade")
 
| | id | raw_grade | grade |
|---|---|---|---|
| 4 | 5 | e | very bad | 
| 5 | 6 | e | very bad | 
| 1 | 2 | b | good | 
| 2 | 3 | b | good | 
| 0 | 1 | a | very good | 
| 3 | 4 | a | very good | 
Grouping by a categorical column also shows empty categories:
df.groupby("grade").size()
 
grade
very bad     2
bad          0
medium       0
good         2
very good    2
dtype: int64
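If the empty categories are unwanted in the result, groupby accepts an observed flag; a sketch (the grades here are hypothetical):

```python
import pandas as pd

grades = pd.Categorical(['e', 'e', 'b', 'a'],
                        categories=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame({"grade": grades})
# observed=True keeps only the categories that actually occur,
# dropping the empty 'c' and 'd' groups from the result
counts = df.groupby("grade", observed=True).size()
print(counts.index.tolist())  # ['a', 'b', 'e']
```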
 
Plotting
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
 

On a DataFrame, **plot()** is a convenience for plotting all of the columns with labels:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
 

Getting data in/out
CSV
Writing to a csv file:
df.to_csv('data/foo.csv')
 
Reading from a csv file:
pd.read_csv('data/foo.csv')
 
| | Unnamed: 0 | A | B | C | D |
|---|---|---|---|---|---|
| 0 | 2000-01-01 | -0.055622 | 1.889998 | -1.094622 | 0.591568 | 
| 1 | 2000-01-02 | -0.408948 | 3.388888 | -0.550835 | 1.362462 | 
| 2 | 2000-01-03 | -0.041353 | 1.880637 | -2.417740 | 0.383862 | 
| 3 | 2000-01-04 | -0.442517 | 0.807854 | -2.409478 | 1.418281 | 
| 4 | 2000-01-05 | 0.442292 | -0.735053 | -2.970804 | 1.435721 | 
| ... | ... | ... | ... | ... | ... | 
| 995 | 2002-09-22 | -28.470361 | 14.628723 | -68.388462 | -37.630887 | 
| 996 | 2002-09-23 | -28.995055 | 14.517615 | -68.247657 | -36.363608 | 
| 997 | 2002-09-24 | -28.756586 | 15.852227 | -68.328996 | -36.032141 | 
| 998 | 2002-09-25 | -29.331636 | 15.557680 | -68.450159 | -35.825427 | 
| 999 | 2002-09-26 | -31.065863 | 16.141154 | -68.852564 | -36.462825 | 
1000 rows × 5 columns
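The Unnamed: 0 column above is just the index that to_csv wrote out; telling read_csv to use it as the index avoids the extra column. A sketch writing to a temporary directory (the file name foo.csv is illustrative):

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2), columns=['A', 'B'],
                  index=pd.date_range('2000-01-01', periods=3))
path = os.path.join(tempfile.mkdtemp(), 'foo.csv')
df.to_csv(path)
# index_col=0 restores the saved index column; parse_dates turns it
# back into a DatetimeIndex instead of plain strings
df2 = pd.read_csv(path, index_col=0, parse_dates=True)
print(df2.columns.tolist())  # ['A', 'B']
```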
HDF5
Reading and writing to HDFStores
Writing to an HDF5 Store:
df.to_hdf('data/foo.h5', 'df')
 
Reading from an HDF5 Store:
pd.read_hdf('data/foo.h5', 'df')
 
| | A | B | C | D |
|---|---|---|---|---|
| 2000-01-01 | -0.055622 | 1.889998 | -1.094622 | 0.591568 | 
| 2000-01-02 | -0.408948 | 3.388888 | -0.550835 | 1.362462 | 
| 2000-01-03 | -0.041353 | 1.880637 | -2.417740 | 0.383862 | 
| 2000-01-04 | -0.442517 | 0.807854 | -2.409478 | 1.418281 | 
| 2000-01-05 | 0.442292 | -0.735053 | -2.970804 | 1.435721 | 
| ... | ... | ... | ... | ... | 
| 2002-09-22 | -28.470361 | 14.628723 | -68.388462 | -37.630887 | 
| 2002-09-23 | -28.995055 | 14.517615 | -68.247657 | -36.363608 | 
| 2002-09-24 | -28.756586 | 15.852227 | -68.328996 | -36.032141 | 
| 2002-09-25 | -29.331636 | 15.557680 | -68.450159 | -35.825427 | 
| 2002-09-26 | -31.065863 | 16.141154 | -68.852564 | -36.462825 | 
1000 rows × 4 columns
Excel
Reading and writing to MS Excel
Writing to an excel file:
df.to_excel('data/foo.xlsx', sheet_name='Sheet1')
 
Reading from an excel file:
pd.read_excel('data/foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
 
| | Unnamed: 0 | A | B | C | D |
|---|---|---|---|---|---|
| 0 | 2000-01-01 | -0.055622 | 1.889998 | -1.094622 | 0.591568 | 
| 1 | 2000-01-02 | -0.408948 | 3.388888 | -0.550835 | 1.362462 | 
| 2 | 2000-01-03 | -0.041353 | 1.880637 | -2.417740 | 0.383862 | 
| 3 | 2000-01-04 | -0.442517 | 0.807854 | -2.409478 | 1.418281 | 
| 4 | 2000-01-05 | 0.442292 | -0.735053 | -2.970804 | 1.435721 | 
| ... | ... | ... | ... | ... | ... | 
| 995 | 2002-09-22 | -28.470361 | 14.628723 | -68.388462 | -37.630887 | 
| 996 | 2002-09-23 | -28.995055 | 14.517615 | -68.247657 | -36.363608 | 
| 997 | 2002-09-24 | -28.756586 | 15.852227 | -68.328996 | -36.032141 | 
| 998 | 2002-09-25 | -29.331636 | 15.557680 | -68.450159 | -35.825427 | 
| 999 | 2002-09-26 | -31.065863 | 16.141154 | -68.852564 | -36.462825 | 
1000 rows × 5 columns
Pandas exercises index
1. Getting and knowing
- Chipotle
- Occupation
- World Food Facts

2. Filtering and Sorting
- Chipotle
- Euro12
- Fictional Army

3. Grouping
- Alcohol Consumption
- Occupation
- Regiment

4. Apply
- Students
- Alcohol Consumption
- US_Crime_Rates

5. Merge
- Auto_MPG
- Fictitious Names
- House Market

6. Stats
- US_Baby_Names
- Wind_Stats

7. Visualization
- Chipotle
- Titanic Disaster
- Scores
- Online Retail
- Tips

8. Creating Series and DataFrames
- Pokemon

9. Time Series
- Apple_Stock
- Getting_Financial_Data
- Investor_Flow_of_Funds_US

10. Deleting
- Iris
- Wine
 
How to use
Each exercise folder contains three types of files:
1. Exercises.ipynb
The exercise notebook without solution code; this is where you work.
2. Solutions.ipynb
The executed reference results (do not modify).
3. Exercise_with_Solutions.ipynb
The solution code with explanatory comments.
Write your code in Exercises.ipynb and check whether your output matches Solutions.ipynb; only consult Exercise_with_Solutions.ipynb if you are truly stuck.