Introduction to Pandas library with Python
This notebook has been taken from the following link. Please follow that link to find more such notebooks on different topics.
Ref : https://sebastianraschka.com/notebooks/python-notebooks.html
Pandas is a popular library compatible with python, for Machine Learning and Data Analysis. This notebook aims to
introduce you to a couple of basic functionalities of Pandas.
At the end of this notebook, you should have a fair understanding of the functions available as part of pandas, and
what you can do with them.
Sections
Loading Some Example Data
In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/rasbt/python_reference/master/Data/some_soccer_data.csv')
df
Out[1]:
PLAYER | SALARY | GP | G | A | SOT | PPG | P | |
---|---|---|---|---|---|---|---|---|
0 | Sergio Agüero\n Forward — Manchester City | $19.2m | 16.0 | 14 | 3.0 | 34 | 13.12 | 209.98 |
1 | Eden Hazard\n Midfield — Chelsea | $18.9m | 21.0 | 8 | 4.0 | 17 | 13.05 | 274.04 |
2 | Alexis Sánchez\n Forward — Arsenal | $17.6m | NaN | 12 | 7.0 | 29 | 11.19 | 223.86 |
3 | Yaya Touré\n Midfield — Manchester City | $16.6m | 18.0 | 7 | 1.0 | 19 | 10.99 | 197.91 |
4 | Ángel Di María\n Midfield — Manchester United | $15.0m | 13.0 | 3 | NaN | 13 | 10.17 | 132.23 |
5 | Santiago Cazorla\n Midfield — Arsenal | $14.8m | 20.0 | 4 | NaN | 20 | 9.97 | NaN |
6 | David Silva\n Midfield — Manchester City | $14.3m | 15.0 | 6 | 2.0 | 11 | 10.35 | 155.26 |
7 | Cesc Fàbregas\n Midfield — Chelsea | $14.0m | 20.0 | 2 | 14.0 | 10 | 10.47 | 209.49 |
8 | Saido Berahino\n Forward — West Brom | $13.8m | 21.0 | 9 | 0.0 | 20 | 7.02 | 147.43 |
9 | Steven Gerrard\n Midfield — Liverpool | $13.8m | 20.0 | 5 | 1.0 | 11 | 7.50 | 150.01 |
Renaming Columns
Converting Column Names to Lowercase
In [2]:
# Converting column names to lowercase
df.columns = [c.lower() for c in df.columns]
# or
# df.rename(columns=lambda x : x.lower())
df.tail(3)
Out[2]:
player | salary | gp | g | a | sot | ppg | p | |
---|---|---|---|---|---|---|---|---|
7 | Cesc Fàbregas\n Midfield — Chelsea | $14.0m | 20.0 | 2 | 14.0 | 10 | 10.47 | 209.49 |
8 | Saido Berahino\n Forward — West Brom | $13.8m | 21.0 | 9 | 0.0 | 20 | 7.02 | 147.43 |
9 | Steven Gerrard\n Midfield — Liverpool | $13.8m | 20.0 | 5 | 1.0 | 11 | 7.50 | 150.01 |
Renaming Particular Columns
In [3]:
df = df.rename(columns={'p': 'points',
'gp': 'games',
'sot': 'shots_on_target',
'g': 'goals',
'ppg': 'points_per_game',
'a': 'assists',})
df.tail(3)
Out[3]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | |
---|---|---|---|---|---|---|---|---|
7 | Cesc Fàbregas\n Midfield — Chelsea | $14.0m | 20.0 | 2 | 14.0 | 10 | 10.47 | 209.49 |
8 | Saido Berahino\n Forward — West Brom | $13.8m | 21.0 | 9 | 0.0 | 20 | 7.02 | 147.43 |
9 | Steven Gerrard\n Midfield — Liverpool | $13.8m | 20.0 | 5 | 1.0 | 11 | 7.50 | 150.01 |
Applying Computations Rows-wise
Changing Values in a Column
In [4]:
# Processing `salary` column
df['salary'] = df['salary'].apply(lambda x: x.strip('$m'))
df.tail()
Out[4]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | |
---|---|---|---|---|---|---|---|---|
5 | Santiago Cazorla\n Midfield — Arsenal | 14.8 | 20.0 | 4 | NaN | 20 | 9.97 | NaN |
6 | David Silva\n Midfield — Manchester City | 14.3 | 15.0 | 6 | 2.0 | 11 | 10.35 | 155.26 |
7 | Cesc Fàbregas\n Midfield — Chelsea | 14.0 | 20.0 | 2 | 14.0 | 10 | 10.47 | 209.49 |
8 | Saido Berahino\n Forward — West Brom | 13.8 | 21.0 | 9 | 0.0 | 20 | 7.02 | 147.43 |
9 | Steven Gerrard\n Midfield — Liverpool | 13.8 | 20.0 | 5 | 1.0 | 11 | 7.50 | 150.01 |
Adding a New Column
In [5]:
df['team'] = pd.Series('', index=df.index)
# or
df.insert(loc=8, column='position', value='')
df.tail(3)
Out[5]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
7 | Cesc Fàbregas\n Midfield — Chelsea | 14.0 | 20.0 | 2 | 14.0 | 10 | 10.47 | 209.49 | ||
8 | Saido Berahino\n Forward — West Brom | 13.8 | 21.0 | 9 | 0.0 | 20 | 7.02 | 147.43 | ||
9 | Steven Gerrard\n Midfield — Liverpool | 13.8 | 20.0 | 5 | 1.0 | 11 | 7.50 | 150.01 |
In [6]:
# Processing `player` column
def process_player_col(text):
name, rest = text.split('\n')
position, team = [x.strip() for x in rest.split(' — ')]
return pd.Series([name, team, position])
df[['player', 'team', 'position']] = df.player.apply(process_player_col)
# modified after tip from reddit.com/user/hharison
#
# Alternative (inferior) approach:
#
#for idx,row in df.iterrows():
# name, position, team = process_player_col(row['player'])
# df.ix[idx, 'player'], df.ix[idx, 'position'], df.ix[idx, 'team'] = name, position, team
df.tail(3)
Out[6]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
7 | Cesc Fàbregas | 14.0 | 20.0 | 2 | 14.0 | 10 | 10.47 | 209.49 | Midfield | Chelsea |
8 | Saido Berahino | 13.8 | 21.0 | 9 | 0.0 | 20 | 7.02 | 147.43 | Forward | West Brom |
9 | Steven Gerrard | 13.8 | 20.0 | 5 | 1.0 | 11 | 7.50 | 150.01 | Midfield | Liverpool |
Applying Functions to Multiple Columns
In [7]:
cols = ['player', 'position', 'team']
df[cols] = df[cols].applymap(lambda x: x.lower())
df.head()
Out[7]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
0 | sergio agüero | 19.2 | 16.0 | 14 | 3.0 | 34 | 13.12 | 209.98 | forward | manchester city |
1 | eden hazard | 18.9 | 21.0 | 8 | 4.0 | 17 | 13.05 | 274.04 | midfield | chelsea |
2 | alexis sánchez | 17.6 | NaN | 12 | 7.0 | 29 | 11.19 | 223.86 | forward | arsenal |
3 | yaya touré | 16.6 | 18.0 | 7 | 1.0 | 19 | 10.99 | 197.91 | midfield | manchester city |
4 | ángel di maría | 15.0 | 13.0 | 3 | NaN | 13 | 10.17 | 132.23 | midfield | manchester united |
Missing Values aka NaNs
Counting Rows with NaNs
In [8]:
nans = df.shape[0] - df.dropna().shape[0]
print('%d rows have missing values' % nans)
3 rows have missing values
Selecting NaN Rows
In [9]:
# Selecting all rows that have NaNs in the `assists` column
df[df['assists'].isnull()]
Out[9]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
4 | ángel di maría | 15.0 | 13.0 | 3 | NaN | 13 | 10.17 | 132.23 | midfield | manchester united |
5 | santiago cazorla | 14.8 | 20.0 | 4 | NaN | 20 | 9.97 | NaN | midfield | arsenal |
Selecting non-NaN Rows
In [10]:
df[df['assists'].notnull()]
Out[10]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
0 | sergio agüero | 19.2 | 16.0 | 14 | 3.0 | 34 | 13.12 | 209.98 | forward | manchester city |
1 | eden hazard | 18.9 | 21.0 | 8 | 4.0 | 17 | 13.05 | 274.04 | midfield | chelsea |
2 | alexis sánchez | 17.6 | NaN | 12 | 7.0 | 29 | 11.19 | 223.86 | forward | arsenal |
3 | yaya touré | 16.6 | 18.0 | 7 | 1.0 | 19 | 10.99 | 197.91 | midfield | manchester city |
6 | david silva | 14.3 | 15.0 | 6 | 2.0 | 11 | 10.35 | 155.26 | midfield | manchester city |
7 | cesc fàbregas | 14.0 | 20.0 | 2 | 14.0 | 10 | 10.47 | 209.49 | midfield | chelsea |
8 | saido berahino | 13.8 | 21.0 | 9 | 0.0 | 20 | 7.02 | 147.43 | forward | west brom |
9 | steven gerrard | 13.8 | 20.0 | 5 | 1.0 | 11 | 7.50 | 150.01 | midfield | liverpool |
Filling NaN Rows
In [11]:
# Filling NaN cells with default value 0
df.fillna(value=0, inplace=True)
df
Out[11]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
0 | sergio agüero | 19.2 | 16.0 | 14 | 3.0 | 34 | 13.12 | 209.98 | forward | manchester city |
1 | eden hazard | 18.9 | 21.0 | 8 | 4.0 | 17 | 13.05 | 274.04 | midfield | chelsea |
2 | alexis sánchez | 17.6 | 0.0 | 12 | 7.0 | 29 | 11.19 | 223.86 | forward | arsenal |
3 | yaya touré | 16.6 | 18.0 | 7 | 1.0 | 19 | 10.99 | 197.91 | midfield | manchester city |
4 | ángel di maría | 15.0 | 13.0 | 3 | 0.0 | 13 | 10.17 | 132.23 | midfield | manchester united |
5 | santiago cazorla | 14.8 | 20.0 | 4 | 0.0 | 20 | 9.97 | 0.00 | midfield | arsenal |
6 | david silva | 14.3 | 15.0 | 6 | 2.0 | 11 | 10.35 | 155.26 | midfield | manchester city |
7 | cesc fàbregas | 14.0 | 20.0 | 2 | 14.0 | 10 | 10.47 | 209.49 | midfield | chelsea |
8 | saido berahino | 13.8 | 21.0 | 9 | 0.0 | 20 | 7.02 | 147.43 | forward | west brom |
9 | steven gerrard | 13.8 | 20.0 | 5 | 1.0 | 11 | 7.50 | 150.01 | midfield | liverpool |
Appending Rows to a DataFrame
In [12]:
# Adding an "empty" row to the DataFrame
import numpy as np
df = df.append(pd.Series(
[np.nan]*len(df.columns), # Fill cells with NaNs
index=df.columns),
ignore_index=True)
df.tail(3)
Out[12]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
8 | saido berahino | 13.8 | 21.0 | 9.0 | 0.0 | 20.0 | 7.02 | 147.43 | forward | west brom |
9 | steven gerrard | 13.8 | 20.0 | 5.0 | 1.0 | 11.0 | 7.50 | 150.01 | midfield | liverpool |
10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
In [13]:
# Filling cells with data
df.loc[df.index[-1], 'player'] = 'new player'
df.loc[df.index[-1], 'salary'] = 12.3
df.tail(3)
Out[13]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
8 | saido berahino | 13.8 | 21.0 | 9.0 | 0.0 | 20.0 | 7.02 | 147.43 | forward | west brom |
9 | steven gerrard | 13.8 | 20.0 | 5.0 | 1.0 | 11.0 | 7.50 | 150.01 | midfield | liverpool |
10 | new player | 12.3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Sorting and Reindexing DataFrames
In [14]:
# Sorting the DataFrame by a certain column (from highest to lowest)
df.sort_values('goals', ascending=False, inplace=True)
df.head()
Out[14]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
0 | sergio agüero | 19.2 | 16.0 | 14.0 | 3.0 | 34.0 | 13.12 | 209.98 | forward | manchester city |
2 | alexis sánchez | 17.6 | 0.0 | 12.0 | 7.0 | 29.0 | 11.19 | 223.86 | forward | arsenal |
8 | saido berahino | 13.8 | 21.0 | 9.0 | 0.0 | 20.0 | 7.02 | 147.43 | forward | west brom |
1 | eden hazard | 18.9 | 21.0 | 8.0 | 4.0 | 17.0 | 13.05 | 274.04 | midfield | chelsea |
3 | yaya touré | 16.6 | 18.0 | 7.0 | 1.0 | 19.0 | 10.99 | 197.91 | midfield | manchester city |
In [15]:
# Optional reindexing of the DataFrame after sorting
df.index = range(1,len(df.index)+1)
df.head()
Out[15]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
1 | sergio agüero | 19.2 | 16.0 | 14.0 | 3.0 | 34.0 | 13.12 | 209.98 | forward | manchester city |
2 | alexis sánchez | 17.6 | 0.0 | 12.0 | 7.0 | 29.0 | 11.19 | 223.86 | forward | arsenal |
3 | saido berahino | 13.8 | 21.0 | 9.0 | 0.0 | 20.0 | 7.02 | 147.43 | forward | west brom |
4 | eden hazard | 18.9 | 21.0 | 8.0 | 4.0 | 17.0 | 13.05 | 274.04 | midfield | chelsea |
5 | yaya touré | 16.6 | 18.0 | 7.0 | 1.0 | 19.0 | 10.99 | 197.91 | midfield | manchester city |
Updating Columns
In [16]:
# Creating a dummy DataFrame with changes in the `salary` column
df_2 = df.copy()
df_2.loc[0:2, 'salary'] = [20.0, 15.0]
df_2.head(3)
Out[16]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
1 | sergio agüero | 20 | 16.0 | 14.0 | 3.0 | 34.0 | 13.12 | 209.98 | forward | manchester city |
2 | alexis sánchez | 15 | 0.0 | 12.0 | 7.0 | 29.0 | 11.19 | 223.86 | forward | arsenal |
3 | saido berahino | 13.8 | 21.0 | 9.0 | 0.0 | 20.0 | 7.02 | 147.43 | forward | west brom |
In [17]:
# Temporarily use the `player` columns as indices to
# apply the update functions
df.set_index('player', inplace=True)
df_2.set_index('player', inplace=True)
df.head(3)
Out[17]:
salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|
player | |||||||||
sergio agüero | 19.2 | 16.0 | 14.0 | 3.0 | 34.0 | 13.12 | 209.98 | forward | manchester city |
alexis sánchez | 17.6 | 0.0 | 12.0 | 7.0 | 29.0 | 11.19 | 223.86 | forward | arsenal |
saido berahino | 13.8 | 21.0 | 9.0 | 0.0 | 20.0 | 7.02 | 147.43 | forward | west brom |
In [18]:
# Update the `salary` column
df.update(other=df_2['salary'], overwrite=True)
df.head(3)
Out[18]:
salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|
player | |||||||||
sergio agüero | 20 | 16.0 | 14.0 | 3.0 | 34.0 | 13.12 | 209.98 | forward | manchester city |
alexis sánchez | 15 | 0.0 | 12.0 | 7.0 | 29.0 | 11.19 | 223.86 | forward | arsenal |
saido berahino | 13.8 | 21.0 | 9.0 | 0.0 | 20.0 | 7.02 | 147.43 | forward | west brom |
In [19]:
# Reset the indices
df.reset_index(inplace=True)
df.head(3)
Out[19]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
0 | sergio agüero | 20 | 16.0 | 14.0 | 3.0 | 34.0 | 13.12 | 209.98 | forward | manchester city |
1 | alexis sánchez | 15 | 0.0 | 12.0 | 7.0 | 29.0 | 11.19 | 223.86 | forward | arsenal |
2 | saido berahino | 13.8 | 21.0 | 9.0 | 0.0 | 20.0 | 7.02 | 147.43 | forward | west brom |
Chaining Conditions – Using Bitwise Operators
In [20]:
# Selecting only those players that either playing for Arsenal or Chelsea
df[ (df['team'] == 'arsenal') | (df['team'] == 'chelsea') ]
Out[20]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
1 | alexis sánchez | 15 | 0.0 | 12.0 | 7.0 | 29.0 | 11.19 | 223.86 | forward | arsenal |
3 | eden hazard | 18.9 | 21.0 | 8.0 | 4.0 | 17.0 | 13.05 | 274.04 | midfield | chelsea |
7 | santiago cazorla | 14.8 | 20.0 | 4.0 | 0.0 | 20.0 | 9.97 | 0.00 | midfield | arsenal |
9 | cesc fàbregas | 14.0 | 20.0 | 2.0 | 14.0 | 10.0 | 10.47 | 209.49 | midfield | chelsea |
In [21]:
# Selecting forwards from Arsenal only
df[ (df['team'] == 'arsenal') & (df['position'] == 'forward') ]
Out[21]:
player | salary | games | goals | assists | shots_on_target | points_per_game | points | position | team | |
---|---|---|---|---|---|---|---|---|---|---|
1 | alexis sánchez | 15 | 0.0 | 12.0 | 7.0 | 29.0 | 11.19 | 223.86 | forward | arsenal |
Column Types
Printing Column Types
In [22]:
types = df.columns.to_series().groupby(df.dtypes).groups
types
Out[22]:
{dtype('float64'): Index(['games', 'goals', 'assists', 'shots_on_target', 'points_per_game', 'points'], dtype='object'), dtype('O'): Index(['player', 'salary', 'position', 'team'], dtype='object')}
Selecting by Column Type
In [23]:
# select string columns
df.loc[:, (df.dtypes == np.dtype('O')).values].head()
Out[23]:
player | salary | position | team | |
---|---|---|---|---|
0 | sergio agüero | 20 | forward | manchester city |
1 | alexis sánchez | 15 | forward | arsenal |
2 | saido berahino | 13.8 | forward | west brom |
3 | eden hazard | 18.9 | midfield | chelsea |
4 | yaya touré | 16.6 | midfield | manchester city |
Converting Column Types
In [24]:
df['salary'] = df['salary'].astype(float)
In [25]:
types = df.columns.to_series().groupby(df.dtypes).groups
types
Out[25]:
{dtype('float64'): Index(['salary', 'games', 'goals', 'assists', 'shots_on_target', 'points_per_game', 'points'], dtype='object'), dtype('O'): Index(['player', 'position', 'team'], dtype='object')}
If-tests
In [26]:
int(True)
Out[26]:
1
In [27]:
import pandas as pd
a = [[2., .3, 4., 5.], [.8, .03, 0.02, 5.]]
df = pd.DataFrame(a)
df
Out[27]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 2.0 | 0.30 | 4.00 | 5.0 |
1 | 0.8 | 0.03 | 0.02 | 5.0 |
In [28]:
df = df <= 0.05
df
Out[28]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | False | False | False | False |
1 | False | True | True | False |
In [29]:
df.astype(int)
Out[29]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 0 | 0 | 0 | 0 |
1 | 0 | 1 | 1 | 0 |