Introduction to Pandas library with Python

This notebook has been taken from the following link. Please follow that link to find more such notebooks on different topics.
Ref : https://sebastianraschka.com/notebooks/python-notebooks.html

Pandas is a popular library compatible with python, for Machine Learning and Data Analysis. This notebook aims to
introduce you to a couple of basic functionalities of Pandas.

At the end of this notebook, you should have a fair understanding of the functions available as part of pandas, and
what you can do with them.

Sections

Loading Some Example Data

In [1]:

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/rasbt/python_reference/master/Data/some_soccer_data.csv')
df

Out[1]:

	PLAYER	SALARY	GP	G	A	SOT	PPG	P
0	Sergio Agüero\n Forward — Manchester City	$19.2m	16.0	14	3.0	34	13.12	209.98
1	Eden Hazard\n Midfield — Chelsea	$18.9m	21.0	8	4.0	17	13.05	274.04
2	Alexis Sánchez\n Forward — Arsenal	$17.6m	NaN	12	7.0	29	11.19	223.86
3	Yaya Touré\n Midfield — Manchester City	$16.6m	18.0	7	1.0	19	10.99	197.91
4	Ángel Di María\n Midfield — Manchester United	$15.0m	13.0	3	NaN	13	10.17	132.23
5	Santiago Cazorla\n Midfield — Arsenal	$14.8m	20.0	4	NaN	20	9.97	NaN
6	David Silva\n Midfield — Manchester City	$14.3m	15.0	6	2.0	11	10.35	155.26
7	Cesc Fàbregas\n Midfield — Chelsea	$14.0m	20.0	2	14.0	10	10.47	209.49
8	Saido Berahino\n Forward — West Brom	$13.8m	21.0	9	0.0	20	7.02	147.43
9	Steven Gerrard\n Midfield — Liverpool	$13.8m	20.0	5	1.0	11	7.50	150.01

Renaming Columns

Converting Column Names to Lowercase

In [2]:

# Converting column names to lowercase

df.columns = [c.lower() for c in df.columns]

# or
# df.rename(columns=lambda x : x.lower())

df.tail(3)

Out[2]:

	player	salary	gp	g	a	sot	ppg	p
7	Cesc Fàbregas\n Midfield — Chelsea	$14.0m	20.0	2	14.0	10	10.47	209.49
8	Saido Berahino\n Forward — West Brom	$13.8m	21.0	9	0.0	20	7.02	147.43
9	Steven Gerrard\n Midfield — Liverpool	$13.8m	20.0	5	1.0	11	7.50	150.01

Renaming Particular Columns

In [3]:

df = df.rename(columns={'p': 'points', 
                        'gp': 'games',
                        'sot': 'shots_on_target',
                        'g': 'goals',
                        'ppg': 'points_per_game',
                        'a': 'assists',})

df.tail(3)

Out[3]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points
7	Cesc Fàbregas\n Midfield — Chelsea	$14.0m	20.0	2	14.0	10	10.47	209.49
8	Saido Berahino\n Forward — West Brom	$13.8m	21.0	9	0.0	20	7.02	147.43
9	Steven Gerrard\n Midfield — Liverpool	$13.8m	20.0	5	1.0	11	7.50	150.01

Applying Computations Rows-wise

Changing Values in a Column

In [4]:

# Processing `salary` column

df['salary'] = df['salary'].apply(lambda x: x.strip('$m'))
df.tail()

Out[4]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points
5	Santiago Cazorla\n Midfield — Arsenal	14.8	20.0	4	NaN	20	9.97	NaN
6	David Silva\n Midfield — Manchester City	14.3	15.0	6	2.0	11	10.35	155.26
7	Cesc Fàbregas\n Midfield — Chelsea	14.0	20.0	2	14.0	10	10.47	209.49
8	Saido Berahino\n Forward — West Brom	13.8	21.0	9	0.0	20	7.02	147.43
9	Steven Gerrard\n Midfield — Liverpool	13.8	20.0	5	1.0	11	7.50	150.01

Adding a New Column

In [5]:

df['team'] = pd.Series('', index=df.index)

# or
df.insert(loc=8, column='position', value='') 

df.tail(3)

Out[5]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points
7	Cesc Fàbregas\n Midfield — Chelsea	14.0	20.0	2	14.0	10	10.47	209.49
8	Saido Berahino\n Forward — West Brom	13.8	21.0	9	0.0	20	7.02	147.43
9	Steven Gerrard\n Midfield — Liverpool	13.8	20.0	5	1.0	11	7.50	150.01

In [6]:

# Processing `player` column

def process_player_col(text):
    name, rest = text.split('\n')
    position, team = [x.strip() for x in rest.split(' — ')]
    return pd.Series([name, team, position])

df[['player', 'team', 'position']] = df.player.apply(process_player_col)

# modified after tip from reddit.com/user/hharison
#
# Alternative (inferior) approach:
#
#for idx,row in df.iterrows():
#    name, position, team = process_player_col(row['player'])
#    df.ix[idx, 'player'], df.ix[idx, 'position'], df.ix[idx, 'team'] = name, position, team
    
df.tail(3)

Out[6]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
7	Cesc Fàbregas	14.0	20.0	2	14.0	10	10.47	209.49	Midfield	Chelsea
8	Saido Berahino	13.8	21.0	9	0.0	20	7.02	147.43	Forward	West Brom
9	Steven Gerrard	13.8	20.0	5	1.0	11	7.50	150.01	Midfield	Liverpool

Applying Functions to Multiple Columns

In [7]:

cols = ['player', 'position', 'team']
df[cols] = df[cols].applymap(lambda x: x.lower())
df.head()

Out[7]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
0	sergio agüero	19.2	16.0	14	3.0	34	13.12	209.98	forward	manchester city
1	eden hazard	18.9	21.0	8	4.0	17	13.05	274.04	midfield	chelsea
2	alexis sánchez	17.6	NaN	12	7.0	29	11.19	223.86	forward	arsenal
3	yaya touré	16.6	18.0	7	1.0	19	10.99	197.91	midfield	manchester city
4	ángel di maría	15.0	13.0	3	NaN	13	10.17	132.23	midfield	manchester united

Missing Values aka NaNs

Counting Rows with NaNs

In [8]:

nans = df.shape[0] - df.dropna().shape[0]

print('%d rows have missing values' % nans)

3 rows have missing values

Selecting NaN Rows

In [9]:

# Selecting all rows that have NaNs in the `assists` column

df[df['assists'].isnull()]

Out[9]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
4	ángel di maría	15.0	13.0	3	NaN	13	10.17	132.23	midfield	manchester united
5	santiago cazorla	14.8	20.0	4	NaN	20	9.97	NaN	midfield	arsenal

Selecting non-NaN Rows

In [10]:

df[df['assists'].notnull()]

Out[10]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
0	sergio agüero	19.2	16.0	14	3.0	34	13.12	209.98	forward	manchester city
1	eden hazard	18.9	21.0	8	4.0	17	13.05	274.04	midfield	chelsea
2	alexis sánchez	17.6	NaN	12	7.0	29	11.19	223.86	forward	arsenal
3	yaya touré	16.6	18.0	7	1.0	19	10.99	197.91	midfield	manchester city
6	david silva	14.3	15.0	6	2.0	11	10.35	155.26	midfield	manchester city
7	cesc fàbregas	14.0	20.0	2	14.0	10	10.47	209.49	midfield	chelsea
8	saido berahino	13.8	21.0	9	0.0	20	7.02	147.43	forward	west brom
9	steven gerrard	13.8	20.0	5	1.0	11	7.50	150.01	midfield	liverpool

Filling NaN Rows

In [11]:

# Filling NaN cells with default value 0

df.fillna(value=0, inplace=True)
df

Out[11]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
0	sergio agüero	19.2	16.0	14	3.0	34	13.12	209.98	forward	manchester city
1	eden hazard	18.9	21.0	8	4.0	17	13.05	274.04	midfield	chelsea
2	alexis sánchez	17.6	0.0	12	7.0	29	11.19	223.86	forward	arsenal
3	yaya touré	16.6	18.0	7	1.0	19	10.99	197.91	midfield	manchester city
4	ángel di maría	15.0	13.0	3	0.0	13	10.17	132.23	midfield	manchester united
5	santiago cazorla	14.8	20.0	4	0.0	20	9.97	0.00	midfield	arsenal
6	david silva	14.3	15.0	6	2.0	11	10.35	155.26	midfield	manchester city
7	cesc fàbregas	14.0	20.0	2	14.0	10	10.47	209.49	midfield	chelsea
8	saido berahino	13.8	21.0	9	0.0	20	7.02	147.43	forward	west brom
9	steven gerrard	13.8	20.0	5	1.0	11	7.50	150.01	midfield	liverpool

Appending Rows to a DataFrame

In [12]:

# Adding an "empty" row to the DataFrame

import numpy as np

df = df.append(pd.Series(
                [np.nan]*len(df.columns), # Fill cells with NaNs
                index=df.columns),    
                ignore_index=True)

df.tail(3)

Out[12]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
8	saido berahino	13.8	21.0	9.0	0.0	20.0	7.02	147.43	forward	west brom
9	steven gerrard	13.8	20.0	5.0	1.0	11.0	7.50	150.01	midfield	liverpool
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

In [13]:

# Filling cells with data

df.loc[df.index[-1], 'player'] = 'new player'
df.loc[df.index[-1], 'salary'] = 12.3
df.tail(3)

Out[13]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
8	saido berahino	13.8	21.0	9.0	0.0	20.0	7.02	147.43	forward	west brom
9	steven gerrard	13.8	20.0	5.0	1.0	11.0	7.50	150.01	midfield	liverpool
10	new player	12.3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Sorting and Reindexing DataFrames

In [14]:

# Sorting the DataFrame by a certain column (from highest to lowest)

df.sort_values('goals', ascending=False, inplace=True)
df.head()

Out[14]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
0	sergio agüero	19.2	16.0	14.0	3.0	34.0	13.12	209.98	forward	manchester city
2	alexis sánchez	17.6	0.0	12.0	7.0	29.0	11.19	223.86	forward	arsenal
8	saido berahino	13.8	21.0	9.0	0.0	20.0	7.02	147.43	forward	west brom
1	eden hazard	18.9	21.0	8.0	4.0	17.0	13.05	274.04	midfield	chelsea
3	yaya touré	16.6	18.0	7.0	1.0	19.0	10.99	197.91	midfield	manchester city

In [15]:

# Optional reindexing of the DataFrame after sorting

df.index = range(1,len(df.index)+1)
df.head()

Out[15]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
1	sergio agüero	19.2	16.0	14.0	3.0	34.0	13.12	209.98	forward	manchester city
2	alexis sánchez	17.6	0.0	12.0	7.0	29.0	11.19	223.86	forward	arsenal
3	saido berahino	13.8	21.0	9.0	0.0	20.0	7.02	147.43	forward	west brom
4	eden hazard	18.9	21.0	8.0	4.0	17.0	13.05	274.04	midfield	chelsea
5	yaya touré	16.6	18.0	7.0	1.0	19.0	10.99	197.91	midfield	manchester city

Updating Columns

In [16]:

# Creating a dummy DataFrame with changes in the `salary` column

df_2 = df.copy()
df_2.loc[0:2, 'salary'] = [20.0, 15.0]
df_2.head(3)

Out[16]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
1	sergio agüero	20	16.0	14.0	3.0	34.0	13.12	209.98	forward	manchester city
2	alexis sánchez	15	0.0	12.0	7.0	29.0	11.19	223.86	forward	arsenal
3	saido berahino	13.8	21.0	9.0	0.0	20.0	7.02	147.43	forward	west brom

In [17]:

# Temporarily use the `player` columns as indices to 
# apply the update functions

df.set_index('player', inplace=True)
df_2.set_index('player', inplace=True)
df.head(3)

Out[17]:

	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
player
sergio agüero	19.2	16.0	14.0	3.0	34.0	13.12	209.98	forward	manchester city
alexis sánchez	17.6	0.0	12.0	7.0	29.0	11.19	223.86	forward	arsenal
saido berahino	13.8	21.0	9.0	0.0	20.0	7.02	147.43	forward	west brom

In [18]:

# Update the `salary` column
df.update(other=df_2['salary'], overwrite=True)
df.head(3)

Out[18]:

	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
player
sergio agüero	20	16.0	14.0	3.0	34.0	13.12	209.98	forward	manchester city
alexis sánchez	15	0.0	12.0	7.0	29.0	11.19	223.86	forward	arsenal
saido berahino	13.8	21.0	9.0	0.0	20.0	7.02	147.43	forward	west brom

In [19]:

# Reset the indices
df.reset_index(inplace=True)
df.head(3)

Out[19]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
0	sergio agüero	20	16.0	14.0	3.0	34.0	13.12	209.98	forward	manchester city
1	alexis sánchez	15	0.0	12.0	7.0	29.0	11.19	223.86	forward	arsenal
2	saido berahino	13.8	21.0	9.0	0.0	20.0	7.02	147.43	forward	west brom

Chaining Conditions – Using Bitwise Operators

In [20]:

# Selecting only those players that either playing for Arsenal or Chelsea

df[ (df['team'] == 'arsenal') | (df['team'] == 'chelsea') ]

Out[20]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
1	alexis sánchez	15	0.0	12.0	7.0	29.0	11.19	223.86	forward	arsenal
3	eden hazard	18.9	21.0	8.0	4.0	17.0	13.05	274.04	midfield	chelsea
7	santiago cazorla	14.8	20.0	4.0	0.0	20.0	9.97	0.00	midfield	arsenal
9	cesc fàbregas	14.0	20.0	2.0	14.0	10.0	10.47	209.49	midfield	chelsea

In [21]:

# Selecting forwards from Arsenal only

df[ (df['team'] == 'arsenal') & (df['position'] == 'forward') ]

Out[21]:

	player	salary	games	goals	assists	shots_on_target	points_per_game	points	position	team
1	alexis sánchez	15	0.0	12.0	7.0	29.0	11.19	223.86	forward	arsenal

Column Types

Printing Column Types

In [22]:

types = df.columns.to_series().groupby(df.dtypes).groups
types

Out[22]:

{dtype('float64'): Index(['games', 'goals', 'assists', 'shots_on_target', 'points_per_game',
        'points'],
       dtype='object'),
 dtype('O'): Index(['player', 'salary', 'position', 'team'], dtype='object')}

Selecting by Column Type

In [23]:

# select string columns
df.loc[:, (df.dtypes == np.dtype('O')).values].head()

Out[23]:

	player	salary	position	team
0	sergio agüero	20	forward	manchester city
1	alexis sánchez	15	forward	arsenal
2	saido berahino	13.8	forward	west brom
3	eden hazard	18.9	midfield	chelsea
4	yaya touré	16.6	midfield	manchester city

Converting Column Types

In [24]:

df['salary'] = df['salary'].astype(float)

In [25]:

types = df.columns.to_series().groupby(df.dtypes).groups
types

Out[25]:

{dtype('float64'): Index(['salary', 'games', 'goals', 'assists', 'shots_on_target',
        'points_per_game', 'points'],
       dtype='object'),
 dtype('O'): Index(['player', 'position', 'team'], dtype='object')}

If-tests

In [26]:

int(True)

Out[26]:

In [27]:

import pandas as pd

a = [[2., .3, 4., 5.], [.8, .03, 0.02, 5.]]
df = pd.DataFrame(a)
df

Out[27]:

	0	1	2	3
0	2.0	0.30	4.00	5.0
1	0.8	0.03	0.02	5.0

In [28]:

df = df <= 0.05
df

Out[28]:

	0	1	2	3
0	False	False	False	False
1	False	True	True	False

In [29]:

df.astype(int)

Out[29]:

	0	1	2	3
0	0	0	0	0
1	0	1	1	0

solidfish

Python Pandas