导入库
创建数据
在Pandas中有两种核心数据:DataFrame
和Series
创建DataFrame
DataFrame
是一个表格,包含多个独立条目的序列,每个条目下面是一列数据。每一行叫record
1 2
| pd.DataFrame({'Yes':[50,21], 'No':[131,2]}) pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
|
行标被称为Index
1 2 3
| pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']}, index=['Product A', 'Product B'])
|
创建Series
1
| pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
|
读取数据
1 2
| wine_reviews = pd.read_csv("filename.csv") wine_reviews = pd.read_csv("filename.csv",index_col=0)
|
观察数据
看形状、看头看尾
1 2 3
| wine_reviews.shape wine_reviews.head() wine_reviews.tail()
|
保存数据
1
| wine_reviews.to_csv('savefilename.csv')
|
索引
基本索引方式
1 2
| reviews.country reviews['country']
|
pandas中有loc
和iloc
方法,都是先行row后列column,与基本方式相反
基于index的索引
1 2
| reviews.iloc[:, 0] reviews.iloc[-5:]
|
基于标签的索引
1 2
| reviews.loc[0, 'country']
|
使用loc
索引0:10就是0,…,10
使用iloc
索引0:10就是0,…,9
基于标签判断的缩影
1 2
| italian_wines = reviews[reviews.country == 'Italy'] top_oceania_wines = reviews.loc[(reviews.country.isin(['Australia','New Zealand'])) & (reviews.points>=95)]
|
不重复元素
1
| reviews.country.unique()
|
元素计数
1
| reviews.country.value_count()
|
使用map()
1
| reviews.points.map(lambda p: p - review_points_mean)
|
传给map()
的function每次接收Series中的一个值,返回一个新的Series包含所有被function转换后的值
同样的操作可以用apply()
实现
1 2 3 4 5
| def remean_points(row): row.points = row.points - review_points_mean return row
reviews.apply(remean_points, axis='columns')
|
idxmax()
1 2
| bargain_idx = (reviews.points / reviews.price).idxmax() bargain_wine = reviews.loc[bargain_idx,'title']
|
grouping 和 sorting
1 2
| reviews.groupby('points').points.count() reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
|
agg()
函数可以实现同时运行多种函数
1
| reviews.groupby(['country']).price.agg([len, min, max])
|
多索引
1
| countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
|
复原
1
| countries_reviewed.reset_index()
|
Sorting
1 2 3 4 5 6
| countries_reviewed = countries_reviewed.reset_index() countries_reviewed.sort_values(by='len')
countries_reviewed.sort_index()
countries_reviewed.sort_values(by=['country', 'len'])
|
计数
1 2 3
| reviews_written = reviews.groupby('taster_twitter_handle').size()
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
|
数据类型和缺失值
数据类型
1 2 3
| reviews.price.dtype
reviews.points.astype('float64')
|
丢失值NaN
,类型是float64
1
| reviews[pd.isnull(reviews.country)]
|
替换丢失值是常见操作,例如替换为Unknown
1
| reviews.region_2.fillna('Unknown')
|
替换正常值
1
| reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
|
重命名
1 2 3 4 5
| reviews.rename(columns={'points': 'score'})
reviews.rename(index={0:'FirstEntry',1:'SecondEntry'})
|
行和列都可以有自己的名字属性
1
| reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
|
组合数据的三种方式
concat()
,join()
,merge()
,后者与join类似
1 2 3 4 5 6 7 8 9 10 11 12
| canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv") british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")
pd.concat([canadian_youtube, british_youtube])
left = canadian_youtube.set_index(['title', 'trending_date']) right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
|