Pandas

Pandas is one of the most popular software (Python) library that can be used by the Data Analyst and Data Scientist for data manipulation and data analysis. Pandas can be used for the management of two-dimensional data tables. Pandas can be said to be the SQL of python though pandas have additional features as well. In this blog, I will show the most commonly used features of Pandas by a data analyst and data scientist.

The blog is done in Jupyter notebook. Installing python 3 and Jupyter notebook is not part of this blog. So, we can assume that these are already installed and we are ready to go.

Import Pandas

Importing Pandas is one of the first step whenever we work in pandas. When we write ‘import pandas as pd’, it means that Jupyter notebook understands that when you write pd, it is referring to Pandas library.

import pandas as pd

Read from the file

Any type of file compatible to work in Pandas like csv, excel etc can be read in Pandas.

df = pd.read_csv("Customers.csv")

Get top 3 rows

head(n) gives the first n number of rows where n = 3 above

df.head(3)
Customer ID Name Surname Gender Age Region Job Classification Date Joined Balance
0 100000001 Simon Walsh Male 21 England White Collar 05.Jan.15 113810.15
1 400000002 Jasmine Miller Female 34 Northern Ireland Blue Collar 06.Jan.15 36919.73
2 100000003 Liam Brown Male 46 England White Collar 07.Jan.15 101536.83

Get bottom 2 rows

tail(n) gives the last n number of rows where n = 3 above

df.tail(3)
Customer ID Name Surname Gender Age Region Job Classification Date Joined Balance
4011 200004012 Hannah Springer Female 50 Scotland Other 30.Dec.15 59477.82
4012 200004013 Christian Reid Male 51 Scotland Blue Collar 30.Dec.15 239.45
4013 300004014 Stephen May Male 33 Wales Blue Collar 30.Dec.15 30293.19

Summary of the data

describe() gives total number of count, mean, standard deviation, minimum, maximum, 25%, 50% and 75% of value

df.describe()
Customer ID Age Balance
count 4.014000e+03 4014.000000 4014.000000
mean 1.696831e+08 38.611111 39766.448274
std 8.865374e+07 9.819121 29859.489192
min 1.000000e+08 15.000000 11.520000
25% 1.000020e+08 31.000000 16115.367500
50% 1.000038e+08 37.000000 33567.330000
75% 2.000031e+08 45.000000 57533.930000
max 4.000038e+08 64.000000 183467.700000

Name of the Columns

df.columns
Index(['Customer ID', 'Name', 'Surname', 'Gender', 'Age', 'Region',
       'Job Classification', 'Date Joined', 'Balance'],
      dtype='object')

Creating new dataframe which has the subset of the columns

New data frame includes only three columns ‘Customer ID’, ‘Name’, ‘Gender’ from the old dataframe.

new_df = df[['Customer ID','Name', 'Gender']]
print(type(df)) # original dataframe
print(type(new_df)) # newly created dataframe
new_df.head()
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
Customer ID Name Gender
0 100000001 Simon Male
1 400000002 Jasmine Female
2 100000003 Liam Male
3 300000004 Trevor Male
4 100000005 Deirdre Female

Filter the rows based on the condition

It checks the condition ands gives values if true.

df1 = df[df['Balance'] >150000]
df1
Customer ID Name Surname Gender Age Region Job Classification Date Joined Balance
1306 100001307 Dorothy Jackson Female 31 England White Collar 22.Jul.15 183467.70
2009 200002010 Connor North Male 50 Scotland Other 11.Sep.15 172085.48
3449 100003450 Carl Fraser Male 35 England White Collar 30.Nov.15 181680.99
3831 100003832 Sebastian Arnold Male 44 England Blue Collar 20.Dec.15 161517.82

Merge in Pandas

Merge two dataframes

df1= pd.DataFrame({ "Employee_id":[100,120,130],"Salary":[2000,5000,5000],"Year_entered":[2001, 2002,2003]})
df2=pd.DataFrame({ "Employee_id":[130,140,150],"Salary":[5000,8000,6000],"Year_entered":[2003, 2005,2007]})
merged= pd.merge(df1,df2)

display(df1)
display(df2)
display(merged)
Employee_id Salary Year_entered
0 100 2000 2001
1 120 5000 2002
2 130 5000 2003
Employee_id Salary Year_entered
0 130 5000 2003
1 140 8000 2005
2 150 6000 2007
Employee_id Salary Year_entered
0 130 5000 2003

Merge two dataframes along the ‘Employee_id’ value

merged= pd.merge(df1,df2,left_on = 'Employee_id', right_on = 'Employee_id')
display(merged)

Employee_id Salary_x Year_entered_x Salary_y Year_entered_y
0 130 5000 2003 5000 2003

Merge two dataframes along the ‘Employee_id’ value

merged= pd.merge(df1,df2,on = 'Employee_id')
display(merged)
Employee_id Salary_x Year_entered_x Salary_y Year_entered_y
0 130 5000 2003 5000 2003

Merge two dataframes along the ‘Employee_id’ value using left join

merged= pd.merge(df1,df2, on = 'Employee_id', how = 'left')
display(merged)
Employee_id Salary_x Year_entered_x Salary_y Year_entered_y
0 100 2000 2001 NaN NaN
1 120 5000 2002 NaN NaN
2 130 5000 2003 5000.0 2003.0

Merge two dataframes along the ‘Employee_id’ value using right join

merged= pd.merge(df1,df2,on = 'Employee_id', how = 'right')
display(merged)
Employee_id Salary_x Year_entered_x Salary_y Year_entered_y
0 130 5000.0 2003.0 5000 2003
1 140 NaN NaN 8000 2005
2 150 NaN NaN 6000 2007

Merge two dataframes along the ‘Employee_id’ value using outer join

merged= pd.merge(df1,df2,on = 'Employee_id', how = 'outer')
display(merged)
Employee_id Salary_x Year_entered_x Salary_y Year_entered_y
0 100 2000.0 2001.0 NaN NaN
1 120 5000.0 2002.0 NaN NaN
2 130 5000.0 2003.0 5000.0 2003.0
3 140 NaN NaN 8000.0 2005.0
4 150 NaN NaN 6000.0 2007.0

Concatination

Join the two dataframes along rows

concat= pd.concat([df1,df2])
display(concat)
Employee_id Salary Year_entered
0 100 2000 2001
1 120 5000 2002
2 130 5000 2003
0 130 5000 2003
1 140 8000 2005
2 150 6000 2007

Sort by the column

Sort the df by ‘Year_entered’ (ascending order).

df1.sort_values(by='Year_entered')
Employee_id Salary Year_entered
0 100 2000 2001
1 120 5000 2002
2 130 5000 2003

Aggregation in Pandas

Some of the aggregation function includes number of rows, minimum, maximum value, mean, median, mode of the particular columns

Count()

df1.count()
Employee_id     3
Salary          3
Year_entered    3
dtype: int64
df1[['Salary']].count()
Salary    3
dtype: int64

min() and max()

df1[['Salary']].min()
Salary    2000
dtype: int64
df1[['Salary']].max()
Salary    5000
dtype: int64

mean(), median() and mode()

df1[['Salary']].mean()
Salary    4000.0
dtype: float64
df1[['Salary']].median()
Salary    5000.0
dtype: float64
df1[['Salary']].mode()
Salary
0 5000

Grouping in Pandas

Group the dataframe by ‘Year_entered’

df1.groupby('Year_entered')
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001F0C729A2E8>

Counts the ‘Employee_id’ and ‘Salary’ by grouping the dataframe based on ‘Year_entered’

df1.groupby('Year_entered').count()
Employee_id Salary
Year_entered
2001 1 1
2002 1 1
2003 1 1

Provides mean of the ‘Salary’ of each group by grouping the dataframe based on ‘Year_entered’

df1.groupby('Year_entered').mean()[['Salary']]
Salary
Year_entered
2001 2000
2002 5000
2003 5000