[Case Studies + Hands-on Demos] A 20-Minute Introduction to Pandas, the Data-Analysis Module, with over a Hundred Practice Exercises (Answers Included)


#coding:utf8
%matplotlib inline

This is a short introduction to pandas, geared mainly at new users; for more advanced recipes, see the Cookbook.

Customarily, we begin by importing the following libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Creating objects

Creating a Series by passing a list of values; pandas creates a default integer index:

s = pd.Series([1,3,5,np.nan,6,8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
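Note the float64 dtype: the integer values were upcast because np.nan is itself a floating-point value. A minimal sketch of this behavior (an illustration added here, not part of the original listing):

```python
import numpy as np
import pandas as pd

# Without NaN, a list of Python ints keeps an integer dtype...
s_int = pd.Series([1, 3, 5, 6, 8])

# ...but a single np.nan (a float) forces the whole Series to float64.
s_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s_int.dtype)  # int64
print(s_nan.dtype)  # float64
```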

Creating a DataFrame by passing a NumPy array, with a datetime index and labelled columns:

dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
                   A         B         C         D
2013-01-01  0.194873  0.298287  0.073043 -0.681957
2013-01-02 -0.679429  0.397972  0.887388  1.187169
2013-01-03  0.782244 -0.251828 -0.736243  1.752973
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272
2013-01-05  0.405459 -1.151989 -1.309286  1.023221
2013-01-06 -0.445221 -1.520998  0.717376  1.657476

Creating a DataFrame by passing a dict of objects that can be converted to series-like:

df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3]*4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The columns of the resulting DataFrame have different dtypes:

df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing data

Viewing the top and bottom rows of the frame:

df.head()
                   A         B         C         D
2013-01-01  0.194873  0.298287  0.073043 -0.681957
2013-01-02 -0.679429  0.397972  0.887388  1.187169
2013-01-03  0.782244 -0.251828 -0.736243  1.752973
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272
2013-01-05  0.405459 -1.151989 -1.309286  1.023221
df.tail(3)
                   A         B         C         D
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272
2013-01-05  0.405459 -1.151989 -1.309286  1.023221
2013-01-06 -0.445221 -1.520998  0.717376  1.657476

Displaying the index, the columns, and the underlying NumPy data:

df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[ 0.19487255,  0.29828663,  0.07304296, -0.68195723],
       [-0.67942918,  0.39797196,  0.88738797,  1.18716883],
       [ 0.78224424, -0.25182784, -0.73624252,  1.75297344],
       [-1.8770662 , -0.06096652, -2.10376905, -1.63427152],
       [ 0.40545892, -1.15198867, -1.30928606,  1.02322057],
       [-0.44522115, -1.52099782,  0.71737636,  1.65747636]])

describe() shows a quick statistical summary of your data:

df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.269857 -0.381587 -0.411915  0.550768
std    0.955046  0.785029  1.180805  1.385087
min   -1.877066 -1.520998 -2.103769 -1.634272
25%   -0.620877 -0.926948 -1.166025 -0.255663
50%   -0.125174 -0.156397 -0.331600  1.105195
75%    0.352812  0.208473  0.556293  1.539899
max    0.782244  0.397972  0.887388  1.752973

Transposing your data:

df.T
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.194873   -0.679429    0.782244   -1.877066    0.405459   -0.445221
B    0.298287    0.397972   -0.251828   -0.060967   -1.151989   -1.520998
C    0.073043    0.887388   -0.736243   -2.103769   -1.309286    0.717376
D   -0.681957    1.187169    1.752973   -1.634272    1.023221    1.657476

Sorting by an axis (axis=1 sorts the columns; axis=0, the default, sorts the rows):

df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2013-01-01 -0.681957  0.073043  0.298287  0.194873
2013-01-02  1.187169  0.887388  0.397972 -0.679429
2013-01-03  1.752973 -0.736243 -0.251828  0.782244
2013-01-04 -1.634272 -2.103769 -0.060967 -1.877066
2013-01-05  1.023221 -1.309286 -1.151989  0.405459
2013-01-06  1.657476  0.717376 -1.520998 -0.445221

Sorting by values:

df.sort_values(by='D')
                   A         B         C         D
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272
2013-01-01  0.194873  0.298287  0.073043 -0.681957
2013-01-05  0.405459 -1.151989 -1.309286  1.023221
2013-01-02 -0.679429  0.397972  0.887388  1.187169
2013-01-06 -0.445221 -1.520998  0.717376  1.657476
2013-01-03  0.782244 -0.251828 -0.736243  1.752973

Data selection

Note: while standard Python/NumPy expressions for selecting and setting data work here, for production code we recommend the optimized pandas access methods .at, .iat, .loc and .iloc (the older .ix accessor is deprecated and has been removed from recent pandas versions).

Getting

Selecting a single column yields a Series, equivalent to df.A:

df['A']
2013-01-01    0.194873
2013-01-02   -0.679429
2013-01-03    0.782244
2013-01-04   -1.877066
2013-01-05    0.405459
2013-01-06   -0.445221
Freq: D, Name: A, dtype: float64

Selecting via [ ], which slices the rows:

df[0:4]
                   A         B         C         D
2013-01-01  0.194873  0.298287  0.073043 -0.681957
2013-01-02 -0.679429  0.397972  0.887388  1.187169
2013-01-03  0.782244 -0.251828 -0.736243  1.752973
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272
df['20130102':'20130104']
                   A         B         C         D
2013-01-02 -0.679429  0.397972  0.887388  1.187169
2013-01-03  0.782244 -0.251828 -0.736243  1.752973
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272

Selection by label

Getting a cross section using a label:

dates[0]
Timestamp('2013-01-01 00:00:00', freq='D')
df.loc[dates[0]]
A    0.194873
B    0.298287
C    0.073043
D   -0.681957
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

df.loc[:,['A','B']]
                   A         B
2013-01-01  0.194873  0.298287
2013-01-02 -0.679429  0.397972
2013-01-03  0.782244 -0.251828
2013-01-04 -1.877066 -0.060967
2013-01-05  0.405459 -1.151989
2013-01-06 -0.445221 -1.520998
df.loc[:,['A','B']][:3]
                   A         B
2013-01-01  0.194873  0.298287
2013-01-02 -0.679429  0.397972
2013-01-03  0.782244 -0.251828

Showing label slicing, with both endpoints included:

df.loc['20130102':'20130104',['A','B']]
                   A         B
2013-01-02 -0.679429  0.397972
2013-01-03  0.782244 -0.251828
2013-01-04 -1.877066 -0.060967
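Unlike ordinary Python slices, a label slice includes both endpoints. A minimal sketch contrasting this with positional slicing, using a hypothetical frame of the same shape as above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# .loc label slicing includes BOTH endpoints: 2013-01-02 through 2013-01-04.
by_label = df.loc['20130102':'20130104', ['A', 'B']]

# .iloc positional slicing excludes the stop, like Python lists,
# so positions 1, 2 and 3 select the same three dates.
by_position = df.iloc[1:4, 0:2]

print(by_label.equals(by_position))  # True
```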

Reduction in the dimensions of the returned object:

df.loc['20130102',['A','B']]
A   -0.679429
B    0.397972
Name: 2013-01-02 00:00:00, dtype: float64

Getting a scalar value:

df.loc[dates[0],'A']
0.19487255317338711

Getting fast access to a scalar (equivalent to the prior method):

df.at[dates[0],'A']
0.19487255317338711

Selection by position

Selecting via the position of the passed integers:

df.iloc[3]
A   -1.877066
B   -0.060967
C   -2.103769
D   -1.634272
Name: 2013-01-04 00:00:00, dtype: float64

Selecting by integer slices, acting similarly to Python/NumPy:

df.iloc[3:5,0:2]
                   A         B
2013-01-04 -1.877066 -0.060967
2013-01-05  0.405459 -1.151989

Slicing rows only:

df.iloc[1:3,:]
                   A         B         C         D
2013-01-02 -0.679429  0.397972  0.887388  1.187169
2013-01-03  0.782244 -0.251828 -0.736243  1.752973

Slicing columns only:

df.iloc[:,1:3]
                   B         C
2013-01-01  0.298287  0.073043
2013-01-02  0.397972  0.887388
2013-01-03 -0.251828 -0.736243
2013-01-04 -0.060967 -2.103769
2013-01-05 -1.151989 -1.309286
2013-01-06 -1.520998  0.717376

Getting a value explicitly:

df.iloc[1,1]
0.39797195640479976

Getting fast access to a scalar (equivalent to the prior method):

df.iat[1,1]
0.39797195640479976

Boolean indexing

Using a single column's values to select data:

df[df.A > 0]
                   A         B         C         D
2013-01-01  0.194873  0.298287  0.073043 -0.681957
2013-01-03  0.782244 -0.251828 -0.736243  1.752973
2013-01-05  0.405459 -1.151989 -1.309286  1.023221

Selecting values with a where condition:

df[df > 0]
                   A         B         C         D
2013-01-01  0.194873  0.298287  0.073043       NaN
2013-01-02       NaN  0.397972  0.887388  1.187169
2013-01-03  0.782244       NaN       NaN  1.752973
2013-01-04       NaN       NaN       NaN       NaN
2013-01-05  0.405459       NaN       NaN  1.023221
2013-01-06       NaN       NaN  0.717376  1.657476
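The mask expression above behaves like DataFrame.where(), which additionally accepts a replacement value. A brief sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

# df[df > 0] keeps the positive cells and fills the rest with NaN,
# which is exactly what where() does with no second argument.
masked = df[df > 0]
assert masked.equals(df.where(df > 0))

# where() can also substitute a value for the non-matching cells:
non_negative = df.where(df > 0, 0)
print((non_negative >= 0).all().all())  # True
```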

Filtering with the **isin()** method:

df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
                   A         B         C         D      E
2013-01-01  0.194873  0.298287  0.073043 -0.681957    one
2013-01-02 -0.679429  0.397972  0.887388  1.187169    one
2013-01-03  0.782244 -0.251828 -0.736243  1.752973    two
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272  three
2013-01-05  0.405459 -1.151989 -1.309286  1.023221   four
2013-01-06 -0.445221 -1.520998  0.717376  1.657476  three
df2[df2['E'].isin(['two', 'four'])]
                   A         B         C         D     E
2013-01-03  0.782244 -0.251828 -0.736243  1.752973   two
2013-01-05  0.405459 -1.151989 -1.309286  1.023221  four

Setting

Setting a new column automatically aligns the data by index:

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102',periods=6))
s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
df['F'] = s1
df
                   A         B         C         D    F
2013-01-01  0.194873  0.298287  0.073043 -0.681957  NaN
2013-01-02 -0.679429  0.397972  0.887388  1.187169  1.0
2013-01-03  0.782244 -0.251828 -0.736243  1.752973  2.0
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272  3.0
2013-01-05  0.405459 -1.151989 -1.309286  1.023221  4.0
2013-01-06 -0.445221 -1.520998  0.717376  1.657476  5.0

Setting values by label:

df.at[dates[0], 'A'] = 0
df
                   A         B         C         D    F
2013-01-01  0.000000  0.298287  0.073043 -0.681957  NaN
2013-01-02 -0.679429  0.397972  0.887388  1.187169  1.0
2013-01-03  0.782244 -0.251828 -0.736243  1.752973  2.0
2013-01-04 -1.877066 -0.060967 -2.103769 -1.634272  3.0
2013-01-05  0.405459 -1.151989 -1.309286  1.023221  4.0
2013-01-06 -0.445221 -1.520998  0.717376  1.657476  5.0

Setting values by position:

df.iat[0,1] = 8888
df
                   A            B         C         D    F
2013-01-01  0.000000  8888.000000  0.073043 -0.681957  NaN
2013-01-02 -0.679429     0.397972  0.887388  1.187169  1.0
2013-01-03  0.782244    -0.251828 -0.736243  1.752973  2.0
2013-01-04 -1.877066    -0.060967 -2.103769 -1.634272  3.0
2013-01-05  0.405459    -1.151989 -1.309286  1.023221  4.0
2013-01-06 -0.445221    -1.520998  0.717376  1.657476  5.0

Setting by assigning a NumPy array:

df.loc[:,'D'] = np.array([5] * len(df))
df
                   A            B         C  D    F
2013-01-01  0.000000  8888.000000  0.073043  5  NaN
2013-01-02 -0.679429     0.397972  0.887388  5  1.0
2013-01-03  0.782244    -0.251828 -0.736243  5  2.0
2013-01-04 -1.877066    -0.060967 -2.103769  5  3.0
2013-01-05  0.405459    -1.151989 -1.309286  5  4.0
2013-01-06 -0.445221    -1.520998  0.717376  5  5.0

Setting values with a where operation:

df2 = df.copy()
df2[df2 > 0] = -df2
df2
                   A            B         C  D    F
2013-01-01  0.000000 -8888.000000 -0.073043 -5  NaN
2013-01-02 -0.679429    -0.397972 -0.887388 -5 -1.0
2013-01-03 -0.782244    -0.251828 -0.736243 -5 -2.0
2013-01-04 -1.877066    -0.060967 -2.103769 -5 -3.0
2013-01-05 -0.405459    -1.151989 -1.309286 -5 -4.0
2013-01-06 -0.445221    -1.520998 -0.717376 -5 -5.0

Missing data

pandas uses np.nan to represent missing data; by default it is excluded from computations.

reindex() lets you change, add or delete the index on a specified axis, and returns a copy of the data.

dates[0]
Timestamp('2013-01-01 00:00:00', freq='D')
df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
                   A            B         C  D    F    E
2013-01-01  0.000000  8888.000000  0.073043  5  NaN  1.0
2013-01-02 -0.679429     0.397972  0.887388  5  1.0  1.0
2013-01-03  0.782244    -0.251828 -0.736243  5  2.0  NaN
2013-01-04 -1.877066    -0.060967 -2.103769  5  3.0  NaN

Dropping any rows that have missing data:

df1.dropna(how='any')  # how='any' drops a row if any of its values is NaN
                   A         B         C  D    F    E
2013-01-02 -0.679429  0.397972  0.887388  5  1.0  1.0

Filling missing data:

df1.fillna(value=5)
                   A            B         C  D    F    E
2013-01-01  0.000000  8888.000000  0.073043  5  5.0  1.0
2013-01-02 -0.679429     0.397972  0.887388  5  1.0  1.0
2013-01-03  0.782244    -0.251828 -0.736243  5  2.0  5.0
2013-01-04 -1.877066    -0.060967 -2.103769  5  3.0  5.0

Getting the boolean mask of where values are NaN:

pd.isnull(df1)
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

Operations

Statistics

Operations in general exclude missing data.

Performing a descriptive statistic:

df.mean()  # the default axis is 0, i.e. the mean of each column
A      -0.302336
B    1480.902032
C      -0.411915
D       5.000000
F       3.000000
dtype: float64
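As stated above, missing data is excluded by default; skipna=False propagates it instead. A minimal illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# NaN is skipped by default, so the mean is (1.0 + 3.0) / 2.
print(s.mean())              # 2.0

# With skipna=False the missing value propagates into the result.
print(s.mean(skipna=False))  # nan
```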

The same operation on the other axis:

df.mean(1)  # axis=1: the mean of each row
2013-01-01    2223.268261
2013-01-02       1.321186
2013-01-03       1.358835
2013-01-04       0.791640
2013-01-05       1.388837
2013-01-06       1.750231
Freq: D, dtype: float64

Operating between objects with different dimensionality requires alignment; in addition, pandas automatically broadcasts along the specified dimension.

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
# shift(2) moves the values down two positions; the first two entries become NaN
s
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
df
                   A            B         C  D    F
2013-01-01  0.000000  8888.000000  0.073043  5  NaN
2013-01-02 -0.679429     0.397972  0.887388  5  1.0
2013-01-03  0.782244    -0.251828 -0.736243  5  2.0
2013-01-04 -1.877066    -0.060967 -2.103769  5  3.0
2013-01-05  0.405459    -1.151989 -1.309286  5  4.0
2013-01-06 -0.445221    -1.520998  0.717376  5  5.0
df.sub(s, axis='index')
# subtract s from df element-wise, aligning on the row index;
# rows whose index has no corresponding value in s become NaN
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -0.217756 -1.251828 -1.736243  4.0  1.0
2013-01-04 -4.877066 -3.060967 -5.103769  2.0  0.0
2013-01-05 -4.594541 -6.151989 -6.309286  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN

Apply

Applying a function to the data with apply():

df.apply(np.cumsum)
# apply() runs np.cumsum on each column of the DataFrame,
# accumulating a running total down every column.
                   A            B         C   D     F
2013-01-01  0.000000  8888.000000  0.073043   5   NaN
2013-01-02 -0.679429  8888.397972  0.960431  10   1.0
2013-01-03  0.102815  8888.146144  0.224188  15   3.0
2013-01-04 -1.774251  8888.085178 -1.879581  20   6.0
2013-01-05 -1.368792  8886.933189 -3.188867  25  10.0
2013-01-06 -1.814013  8885.412191 -2.471490  30  15.0
df.apply(lambda x: x.max() - x.min())
# the lambda is called once per column and returns its range (max minus min)
A       2.659310
B    8889.520998
C       2.991157
D       0.000000
F       4.000000
dtype: float64

Histogramming

s = pd.Series(np.random.randint(0, 7, size=10))
s
0    2
1    5
2    1
3    3
4    2
5    0
6    0
7    1
8    4
9    1
dtype: int32
s.value_counts()
1    3
2    2
0    2
5    1
3    1
4    1
dtype: int64

String methods

Series is equipped with a set of string processing methods in its str attribute, which make it easy to operate on each element of the array, as in the snippet below. Note that pattern matching in str generally uses regular expressions by default.

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
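Because str pattern matching defaults to regular expressions, a pattern such as 'a|b' is an alternation, not a literal. A small sketch, with data adapted from the series above:

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'Aaba', 'Baca', np.nan, 'dog'])

# By default str.contains interprets the pattern as a regex, so 'a|b'
# matches any element containing a lowercase 'a' OR 'b'.
print(s.str.contains('a|b'))

# regex=False matches the literal three-character string 'a|b' instead.
print(s.str.contains('a|b', regex=False))
```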

Merging

Concat

pandas provides various facilities for easily combining Series and DataFrame objects (and, in older versions, Panel) with various kinds of set logic for the indexes.

Concatenating pandas objects together with **concat()**:

df = pd.DataFrame(np.random.randn(10,4))
df
          0         1         2         3
0  0.639168  0.881631  0.198015  0.393510
1  0.503544  0.155373  1.218277  0.128893
2 -0.661486  1.365067 -0.010755  0.058110
3  0.185698 -0.750695 -0.637134 -1.811947
4 -0.493348 -0.246197 -0.700524  0.692042
5 -2.280015  0.986806  1.297614 -0.749969
6  0.688663  0.088751 -0.164766 -0.165378
7 -0.382894 -0.157371  0.000836 -1.947379
8 -1.618486  0.804667  1.919125 -0.290719
9  0.392898 -0.264556  0.817233  0.680797
#break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces
[          0         1         2         3
 0  0.639168  0.881631  0.198015  0.393510
 1  0.503544  0.155373  1.218277  0.128893
 2 -0.661486  1.365067 -0.010755  0.058110,
           0         1         2         3
 3  0.185698 -0.750695 -0.637134 -1.811947
 4 -0.493348 -0.246197 -0.700524  0.692042
 5 -2.280015  0.986806  1.297614 -0.749969
 6  0.688663  0.088751 -0.164766 -0.165378,
           0         1         2         3
 7 -0.382894 -0.157371  0.000836 -1.947379
 8 -1.618486  0.804667  1.919125 -0.290719
 9  0.392898 -0.264556  0.817233  0.680797]
pd.concat(pieces)
          0         1         2         3
0  0.639168  0.881631  0.198015  0.393510
1  0.503544  0.155373  1.218277  0.128893
2 -0.661486  1.365067 -0.010755  0.058110
3  0.185698 -0.750695 -0.637134 -1.811947
4 -0.493348 -0.246197 -0.700524  0.692042
5 -2.280015  0.986806  1.297614 -0.749969
6  0.688663  0.088751 -0.164766 -0.165378
7 -0.382894 -0.157371  0.000836 -1.947379
8 -1.618486  0.804667  1.919125 -0.290719
9  0.392898 -0.264556  0.817233  0.680797

Join

SQL-style merges.

left = pd.DataFrame({'key':['foo', 'foo'], 'lval':[1,2]})
left
   key  lval
0  foo     1
1  foo     2
right = pd.DataFrame({'key':['foo', 'foo'], 'lval':[4,5]})
right
   key  lval
0  foo     4
1  foo     5
pd.merge(left, right, on='key')
   key  lval_x  lval_y
0  foo       1       4
1  foo       1       5
2  foo       2       4
3  foo       2       5
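merge() defaults to an inner join on the shared key; the how parameter selects the other SQL-style joins. A sketch with hypothetical frames whose keys only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# Inner join (the default): only keys present in BOTH frames survive.
inner = pd.merge(left, right, on='key')

# Outer join: every key from either frame, with NaN filling the gaps.
outer = pd.merge(left, right, on='key', how='outer')

print(len(inner), len(outer))  # 1 3
```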

Append

Appending rows to a DataFrame:

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
          A         B         C         D
0 -1.526419  0.868844 -1.379758  0.498004
1 -0.917867 -0.137874 -0.909232 -0.523873
2  1.370409 -0.948766  1.728098  0.361813
3 -1.274621 -1.224051 -0.749470 -2.712027
4 -0.303875 -0.177942  0.496359  0.048004
5 -0.941436  0.044570 -0.229654  0.092941
6  0.465798 -0.835244  0.131745  2.219413
7  0.875844  0.243440 -1.050471  1.761330
s = df.iloc[3]
s
A   -1.274621
B   -1.224051
C   -0.749470
D   -2.712027
Name: 3, dtype: float64
import warnings

df.append(s, ignore_index=True)
C:\Users\48125\AppData\Local\Temp\ipykernel_17504\1384017944.py:3: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df.append(s, ignore_index=True)
          A         B         C         D
0 -1.526419  0.868844 -1.379758  0.498004
1 -0.917867 -0.137874 -0.909232 -0.523873
2  1.370409 -0.948766  1.728098  0.361813
3 -1.274621 -1.224051 -0.749470 -2.712027
4 -0.303875 -0.177942  0.496359  0.048004
5 -0.941436  0.044570 -0.229654  0.092941
6  0.465798 -0.835244  0.131745  2.219413
7  0.875844  0.243440 -1.050471  1.761330
8 -1.274621 -1.224051 -0.749470 -2.712027

Grouping

By "group by" we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three', 
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
     A      B         C         D
0  foo    one -0.491461 -0.550970
1  bar    one -0.468956  0.584847
2  foo    two  0.461989  0.372785
3  bar  three  0.290600 -2.142788
4  foo    two  0.448364 -2.036729
5  bar    two  0.639793  0.577440
6  foo    one  1.335309 -0.582853
7  bar  three  0.629100  0.223494

Grouping and then applying the sum() function to the resulting groups:

df.groupby('A').sum()
C:\Users\48125\AppData\Local\Temp\ipykernel_17504\1885751491.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  df.groupby('A').sum()
            C         D
A
bar  1.090537 -0.757006
foo  1.754201 -2.797767

Grouping by multiple columns forms a hierarchical index, to which we can again apply a function:

df.groupby(['A','B']).sum()
                  C         D
A   B
bar one   -0.468956  0.584847
    three  0.919700 -1.919294
    two    0.639793  0.577440
foo one    0.843848 -1.133823
    two    0.910352 -1.663943
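The FutureWarning shown earlier comes from summing the non-numeric columns; selecting the numeric columns explicitly avoids it. A minimal sketch with deterministic values:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [0.0, 1.0, 2.0, 3.0],
                   'D': [0.0, 10.0, 20.0, 30.0]})

# Selecting the numeric columns before aggregating sidesteps the
# numeric_only FutureWarning raised when string columns are summed.
sums = df.groupby('A')[['C', 'D']].sum()
print(sums)
```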

Reshaping

Stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
                     A         B
first second
bar   one     0.138584  0.661800
      two     0.206128  0.668363
baz   one    -0.531729 -0.695009
      two     0.746672 -0.735620

The **stack()** method "compresses" a level in the DataFrame's columns:

stacked = df2.stack()
stacked
first  second   
bar    one     A    0.138584
               B    0.661800
       two     A    0.206128
               B    0.668363
baz    one     A   -0.531729
               B   -0.695009
       two     A    0.746672
               B   -0.735620
dtype: float64

With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

stacked.unstack()
                     A         B
first second
bar   one     0.138584  0.661800
      two     0.206128  0.668363
baz   one    -0.531729 -0.695009
      two     0.746672 -0.735620
stacked.unstack(1)
second        one       two
first
bar   A  0.138584  0.206128
      B  0.661800  0.668363
baz   A -0.531729  0.746672
      B -0.695009 -0.735620
stacked.unstack(0)
first          bar       baz
second
one    A  0.138584 -0.531729
       B  0.661800 -0.695009
two    A  0.206128  0.746672
       B  0.668363 -0.735620

Pivot tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
        A  B    C         D         E
0     one  A  foo  0.014262 -0.732936
1     one  B  foo  0.853949  1.719331
2     two  C  foo -0.344640  0.765832
3   three  A  bar  1.908338 -1.447838
4     one  B  bar -1.469287  0.954153
5     one  C  bar -1.341587 -0.606839
6     two  A  foo  0.961927  0.054527
7   three  B  foo  0.829778 -0.581417
8     one  C  foo  0.623418 -0.456098
9     one  A  bar  1.745817 -0.684403
10    two  B  bar -1.457706 -0.272665
11  three  C  bar  1.336044  2.080870

We can produce pivot tables from this data very easily:

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C             bar       foo
A     B
one   A  1.745817  0.014262
      B -1.469287  0.853949
      C -1.341587  0.623418
three A  1.908338       NaN
      B       NaN  0.829778
      C  1.336044       NaN
two   A       NaN  0.961927
      B -1.457706       NaN
      C       NaN -0.344640

Time series

pandas has simple, powerful and efficient functionality for performing resampling operations during frequency conversion (e.g. converting secondly data into 5-minute data). This is extremely common in, but not limited to, financial applications.

rng = pd.date_range('1/1/2012', periods=100, freq='S')
rng[:5]
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04'],
              dtype='datetime64[ns]', freq='S')
ts = pd.Series(np.random.randint(0,500,len(rng)), index=rng)
ts
2012-01-01 00:00:00    193
2012-01-01 00:00:01    447
2012-01-01 00:00:02    407
2012-01-01 00:00:03    450
2012-01-01 00:00:04    368
                      ... 
2012-01-01 00:01:35    102
2012-01-01 00:01:36    446
2012-01-01 00:01:37    290
2012-01-01 00:01:38    256
2012-01-01 00:01:39    154
Freq: S, Length: 100, dtype: int32

ts.resample('5Min') resamples the time series ts to a new frequency.

Specifically:

ts: a time series whose index is of datetime type.
'5Min': the target bin size, one bucket per five minutes.

resample() groups the original observations into 5-minute bins and returns a Resampler object; an aggregation method must then be applied to produce the resampled series.

ts.resample('5Min')
# this alone only builds the Resampler; chain an aggregation, e.g.
# ts.resample('5Min').sum()  (the old how='sum' keyword has been removed)
<pandas.core.resample.DatetimeIndexResampler object at 0x0000017E3D5A86D0>
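To actually produce the lower-frequency series, chain an aggregation onto the Resampler. A sketch; note that the 100 one-second samples above all fall into a single 5-minute bin:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

# resample() only builds the grouping; sum() aggregates each 5-minute
# bin into one value. 100 seconds fit inside one bin, so the result
# has a single entry equal to the total of all samples.
five_min = ts.resample('5Min').sum()
print(len(five_min))  # 1
```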

Time zone representation

rng = pd.date_range('3/6/2012', periods=5, freq='D')
rng
DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-03-06   -0.967180
2012-03-07   -0.917108
2012-03-08    0.252346
2012-03-09    0.461718
2012-03-10   -0.931543
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00   -0.967180
2012-03-07 00:00:00+00:00   -0.917108
2012-03-08 00:00:00+00:00    0.252346
2012-03-09 00:00:00+00:00    0.461718
2012-03-10 00:00:00+00:00   -0.931543
Freq: D, dtype: float64

Converting to another time zone

ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00   -0.967180
2012-03-06 19:00:00-05:00   -0.917108
2012-03-07 19:00:00-05:00    0.252346
2012-03-08 19:00:00-05:00    0.461718
2012-03-09 19:00:00-05:00   -0.931543
Freq: D, dtype: float64

Converting between time span representations

rng = pd.date_range('1/1/2012', periods=5, freq='M')
rng
DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
               '2012-05-31'],
              dtype='datetime64[ns]', freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31   -0.841095
2012-02-29   -1.548459
2012-03-31   -0.473997
2012-04-30   -0.602313
2012-05-31   -0.519119
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01   -0.841095
2012-02   -1.548459
2012-03   -0.473997
2012-04   -0.602313
2012-05   -0.519119
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01   -0.841095
2012-02-01   -1.548459
2012-03-01   -0.473997
2012-04-01   -0.602313
2012-05-01   -0.519119
Freq: MS, dtype: float64

Converting between period and timestamp allows some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the first day of each quarter's end month:

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]')
ts = pd.Series(np.random.randn(len(prng)), index = prng)
ts[:5]
1990Q1   -0.316562
1990Q2   -0.055698
1990Q3   -0.267225
1990Q4    0.514381
1991Q1   -0.716024
Freq: Q-NOV, dtype: float64
ts.index = prng.asfreq('M', 'end').asfreq('H', 'start') + 9
ts[:6]
1990-02-01 09:00   -0.316562
1990-05-01 09:00   -0.055698
1990-08-01 09:00   -0.267225
1990-11-01 09:00    0.514381
1991-02-01 09:00   -0.716024
1991-05-01 09:00    1.530323
Freq: H, dtype: float64

Categoricals

Since version 0.15, pandas can include categorical data in a DataFrame.

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'e', 'e']})
df
   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         e
5   6         e

Converting raw_grade to a categorical data type:

df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    e
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

Renaming the categories to more meaningful names:

df["grade"].cat.categories = ["very good", "good", "very bad"]
C:\Users\48125\AppData\Local\Temp\ipykernel_17504\2985790766.py:1: FutureWarning: Setting categories in-place is deprecated and will raise in a future version. Use rename_categories instead.
  df["grade"].cat.categories = ["very good", "good", "very bad"]

Reordering the categories and simultaneously adding the missing categories:

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0    very good
1         good
2         good
3    very good
4     very bad
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

Sorting is per order in the categories, not lexical order:

df.sort_values(by="grade")
   id raw_grade      grade
4   5         e   very bad
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good

Grouping by a categorical column also shows empty categories:

df.groupby("grade").size()
grade
very bad     2
bad          0
medium       0
good         2
very good    2
dtype: int64

Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

[figure: line plot of the cumulative-sum time series]

On a DataFrame, **plot()** is a convenience to plot all of the columns with labels:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

[figure: line plot of the four cumulative-sum columns with a legend]

Getting data in/out

CSV

Writing to a csv file:

df.to_csv('data/foo.csv')

Reading from a csv file:

pd.read_csv('data/foo.csv')
     Unnamed: 0          A          B          C          D
0    2000-01-01  -0.055622   1.889998  -1.094622   0.591568
1    2000-01-02  -0.408948   3.388888  -0.550835   1.362462
2    2000-01-03  -0.041353   1.880637  -2.417740   0.383862
3    2000-01-04  -0.442517   0.807854  -2.409478   1.418281
4    2000-01-05   0.442292  -0.735053  -2.970804   1.435721
..          ...        ...        ...        ...        ...
995  2002-09-22 -28.470361  14.628723 -68.388462 -37.630887
996  2002-09-23 -28.995055  14.517615 -68.247657 -36.363608
997  2002-09-24 -28.756586  15.852227 -68.328996 -36.032141
998  2002-09-25 -29.331636  15.557680 -68.450159 -35.825427
999  2002-09-26 -31.065863  16.141154 -68.852564 -36.462825

1000 rows × 5 columns

HDF5

Reading and writing to HDFStores.

Writing to an HDF5 Store:

df.to_hdf('data/foo.h5', 'df')

Reading from an HDF5 Store:

pd.read_hdf('data/foo.h5', 'df')
                    A          B          C          D
2000-01-01  -0.055622   1.889998  -1.094622   0.591568
2000-01-02  -0.408948   3.388888  -0.550835   1.362462
2000-01-03  -0.041353   1.880637  -2.417740   0.383862
2000-01-04  -0.442517   0.807854  -2.409478   1.418281
2000-01-05   0.442292  -0.735053  -2.970804   1.435721
...               ...        ...        ...        ...
2002-09-22 -28.470361  14.628723 -68.388462 -37.630887
2002-09-23 -28.995055  14.517615 -68.247657 -36.363608
2002-09-24 -28.756586  15.852227 -68.328996 -36.032141
2002-09-25 -29.331636  15.557680 -68.450159 -35.825427
2002-09-26 -31.065863  16.141154 -68.852564 -36.462825

1000 rows × 4 columns

Excel

Reading and writing to MS Excel.

Writing to an excel file:

df.to_excel('data/foo.xlsx', sheet_name='Sheet1')

Reading from an excel file:

pd.read_excel('data/foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
     Unnamed: 0          A          B          C          D
0    2000-01-01  -0.055622   1.889998  -1.094622   0.591568
1    2000-01-02  -0.408948   3.388888  -0.550835   1.362462
2    2000-01-03  -0.041353   1.880637  -2.417740   0.383862
3    2000-01-04  -0.442517   0.807854  -2.409478   1.418281
4    2000-01-05   0.442292  -0.735053  -2.970804   1.435721
..          ...        ...        ...        ...        ...
995  2002-09-22 -28.470361  14.628723 -68.388462 -37.630887
996  2002-09-23 -28.995055  14.517615 -68.247657 -36.363608
997  2002-09-24 -28.756586  15.852227 -68.328996 -36.032141
998  2002-09-25 -29.331636  15.557680 -68.450159 -35.825427
999  2002-09-26 -31.065863  16.141154 -68.852564 -36.462825

1000 rows × 5 columns

Pandas practice exercises: catalogue

1.Getting and knowing

  • Chipotle
  • Occupation
  • World Food Facts

2.Filtering and Sorting

  • Chipotle
  • Euro12
  • Fictional Army

3.Grouping

  • Alcohol Consumption
  • Occupation
  • Regiment

4.Apply

  • Students
  • Alcohol Consumption
  • US_Crime_Rates

5.Merge

  • Auto_MPG
  • Fictitious Names
  • House Market

6.Stats

  • US_Baby_Names
  • Wind_Stats

7.Visualization

  • Chipotle
  • Titanic Disaster
  • Scores
  • Online Retail
  • Tips

8.Creating Series and DataFrames

  • Pokemon

9.Time Series

  • Apple_Stock
  • Getting_Financial_Data
  • Investor_Flow_of_Funds_US

10.Deleting

  • Iris
  • Wine

How to use the exercises

Each exercise folder contains three different kinds of files:

1.Exercises.ipynb

The exercise notebook without answer code; this is the one you work through.

2.Solutions.ipynb

The executed solutions with their output (do not modify).

3.Exercise_with_Solutions.ipynb

The answer code, with comments.

Type your code into Exercises.ipynb and check whether your results match those in Solutions.ipynb; only if you really cannot complete an exercise should you consult the answers in Exercise_with_Solutions.ipynb.


This article was contributed by an internet user and represents the author's own views. If reproduced, please credit the source: http://www.coloradmin.cn/o/1010724.html
