Pandas入门

导入库

1
import pandas as pd

创建数据

在Pandas中有两种核心数据:DataFrameSeries

创建DataFrame

DataFrame是一个表格,包含多个独立条目的序列,每个条目下面是一列数据。每一行叫record

1
2
pd.DataFrame({'Yes':[50,21], 'No':[131,2]})
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

行标被称为Index

1
2
3
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])

创建Series

1
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

读取数据

1
2
wine_reviews = pd.read_csv("filename.csv")
wine_reviews = pd.read_csv("filename.csv",index_col=0) # 使用Excel文档中已有的index

观察数据

看形状、看头看尾

1
2
3
wine_reviews.shape
wine_reviews.head()
wine_reviews.tail()

保存数据

1
wine_reviews.to_csv('savefilename.csv')

索引

基本索引方式

1
2
reviews.country
reviews['country']

pandas中有lociloc方法,都是先行row后列column,与基本方式相反

基于index的索引

1
2
reviews.iloc[:, 0] # 获取第一列
reviews.iloc[-5:] # 最后5行

基于标签的索引

1
2
reviews.loc[0, 'country'] # 获取第一行,country列的元素

使用loc索引0:10就是0,…,10
使用iloc索引0:10就是0,…,9

基于标签判断的缩影

1
2
italian_wines = reviews[reviews.country == 'Italy']
top_oceania_wines = reviews.loc[(reviews.country.isin(['Australia','New Zealand'])) & (reviews.points>=95)]

不重复元素

1
reviews.country.unique()

元素计数

1
reviews.country.value_count()

使用map()

1
reviews.points.map(lambda p: p - review_points_mean)

传给map()的function每次接收Series中的一个值,返回一个新的Series包含所有被function转换后的值

同样的操作可以用apply()实现

1
2
3
4
5
def remean_points(row):
row.points = row.points - review_points_mean
return row

reviews.apply(remean_points, axis='columns') # axis=index 时则应用函数到每列

idxmax()

1
2
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx,'title']

grouping 和 sorting

1
2
reviews.groupby('points').points.count()
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])

agg()函数可以实现同时运行多种函数

1
reviews.groupby(['country']).price.agg([len, min, max])

多索引

1
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])

复原

1
countries_reviewed.reset_index()

Sorting

1
2
3
4
5
6
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len') # 默认ascending = True
# 使用index排序
countries_reviewed.sort_index()
# 使用两个同时排序
countries_reviewed.sort_values(by=['country', 'len'])

计数

1
2
3
reviews_written = reviews.groupby('taster_twitter_handle').size()

reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()

数据类型和缺失值

数据类型

1
2
3
reviews.price.dtype
# 更改类型
reviews.points.astype('float64')

丢失值NaN,类型是float64

1
reviews[pd.isnull(reviews.country)]

替换丢失值是常见操作,例如替换为Unknown

1
reviews.region_2.fillna('Unknown')

替换正常值

1
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

重命名

1
2
3
4
5
# 给column重命名
reviews.rename(columns={'points': 'score'})

# 给index重命名
reviews.rename(index={0:'FirstEntry',1:'SecondEntry'})

行和列都可以有自己的名字属性

1
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')

组合数据的三种方式

concat(),join(),merge(),后者与join类似

1
2
3
4
5
6
7
8
9
10
11
12
# concat() 组合具有相同column的数据
canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])

# join() 组合具有相同index的数据
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK') # 后两个参数是后缀必须