Pandas

Pandas is one of the most popular software (Python) library that can be used by the Data Analyst and Data Scientist for data manipulation and data analysis. Pandas can be used for the management of two-dimensional data tables. Pandas can be said to be the SQL of python though pandas have additional features as well. In this blog, I will show the most commonly used features of Pandas by a data analyst and data scientist.

The blog is done in Jupyter notebook. Installing python 3 and Jupyter notebook is not part of this blog. So, we can assume that these are already installed and we are ready to go.

Import Pandas

Importing Pandas is one of the first step whenever we work in pandas. When we write ‘import pandas as pd’, it means that Jupyter notebook understands that when you write pd, it is referring to Pandas library.

import pandas as pd

Read from the file

Any type of file compatible to work in Pandas like csv, excel etc can be read in Pandas.

df = pd.read_csv("Customers.csv")

Get top 3 rows

head(n) gives the first n number of rows where n = 3 above

df.head(3)

	Customer ID	Name	Surname	Gender	Age	Region	Job Classification	Date Joined	Balance
0	100000001	Simon	Walsh	Male	21	England	White Collar	05.Jan.15	113810.15
1	400000002	Jasmine	Miller	Female	34	Northern Ireland	Blue Collar	06.Jan.15	36919.73
2	100000003	Liam	Brown	Male	46	England	White Collar	07.Jan.15	101536.83

Get bottom 2 rows

tail(n) gives the last n number of rows where n = 3 above

df.tail(3)

	Customer ID	Name	Surname	Gender	Age	Region	Job Classification	Date Joined	Balance
4011	200004012	Hannah	Springer	Female	50	Scotland	Other	30.Dec.15	59477.82
4012	200004013	Christian	Reid	Male	51	Scotland	Blue Collar	30.Dec.15	239.45
4013	300004014	Stephen	May	Male	33	Wales	Blue Collar	30.Dec.15	30293.19

Summary of the data

describe() gives total number of count, mean, standard deviation, minimum, maximum, 25%, 50% and 75% of value

df.describe()

	Customer ID	Age	Balance
count	4.014000e+03	4014.000000	4014.000000
mean	1.696831e+08	38.611111	39766.448274
std	8.865374e+07	9.819121	29859.489192
min	1.000000e+08	15.000000	11.520000
25%	1.000020e+08	31.000000	16115.367500
50%	1.000038e+08	37.000000	33567.330000
75%	2.000031e+08	45.000000	57533.930000
max	4.000038e+08	64.000000	183467.700000

Name of the Columns

df.columns

Index(['Customer ID', 'Name', 'Surname', 'Gender', 'Age', 'Region',
       'Job Classification', 'Date Joined', 'Balance'],
      dtype='object')

Creating new dataframe which has the subset of the columns

New data frame includes only three columns ‘Customer ID’, ‘Name’, ‘Gender’ from the old dataframe.

new_df = df[['Customer ID','Name', 'Gender']]
print(type(df)) # original dataframe
print(type(new_df)) # newly created dataframe
new_df.head()

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

	Customer ID	Name	Gender
0	100000001	Simon	Male
1	400000002	Jasmine	Female
2	100000003	Liam	Male
3	300000004	Trevor	Male
4	100000005	Deirdre	Female

Filter the rows based on the condition

It checks the condition ands gives values if true.

df1 = df[df['Balance'] >150000]
df1

	Customer ID	Name	Surname	Gender	Age	Region	Job Classification	Date Joined	Balance
1306	100001307	Dorothy	Jackson	Female	31	England	White Collar	22.Jul.15	183467.70
2009	200002010	Connor	North	Male	50	Scotland	Other	11.Sep.15	172085.48
3449	100003450	Carl	Fraser	Male	35	England	White Collar	30.Nov.15	181680.99
3831	100003832	Sebastian	Arnold	Male	44	England	Blue Collar	20.Dec.15	161517.82

Merge in Pandas

Merge two dataframes

df1= pd.DataFrame({ "Employee_id":[100,120,130],"Salary":[2000,5000,5000],"Year_entered":[2001, 2002,2003]})
df2=pd.DataFrame({ "Employee_id":[130,140,150],"Salary":[5000,8000,6000],"Year_entered":[2003, 2005,2007]})
merged= pd.merge(df1,df2)

display(df1)
display(df2)
display(merged)

	Employee_id	Salary	Year_entered
0	100	2000	2001
1	120	5000	2002
2	130	5000	2003

	Employee_id	Salary	Year_entered
0	130	5000	2003
1	140	8000	2005
2	150	6000	2007

	Employee_id	Salary	Year_entered
0	130	5000	2003

Merge two dataframes along the ‘Employee_id’ value

merged= pd.merge(df1,df2,left_on = 'Employee_id', right_on = 'Employee_id')
display(merged)

	Employee_id	Salary_x	Year_entered_x	Salary_y	Year_entered_y
0	130	5000	2003	5000	2003

Merge two dataframes along the ‘Employee_id’ value

merged= pd.merge(df1,df2,on = 'Employee_id')
display(merged)

	Employee_id	Salary_x	Year_entered_x	Salary_y	Year_entered_y
0	130	5000	2003	5000	2003

Merge two dataframes along the ‘Employee_id’ value using left join

merged= pd.merge(df1,df2, on = 'Employee_id', how = 'left')
display(merged)

	Employee_id	Salary_x	Year_entered_x	Salary_y	Year_entered_y
0	100	2000	2001	NaN	NaN
1	120	5000	2002	NaN	NaN
2	130	5000	2003	5000.0	2003.0

Merge two dataframes along the ‘Employee_id’ value using right join

merged= pd.merge(df1,df2,on = 'Employee_id', how = 'right')
display(merged)

	Employee_id	Salary_x	Year_entered_x	Salary_y	Year_entered_y
0	130	5000.0	2003.0	5000	2003
1	140	NaN	NaN	8000	2005
2	150	NaN	NaN	6000	2007

Merge two dataframes along the ‘Employee_id’ value using outer join

merged= pd.merge(df1,df2,on = 'Employee_id', how = 'outer')
display(merged)

	Employee_id	Salary_x	Year_entered_x	Salary_y	Year_entered_y
0	100	2000.0	2001.0	NaN	NaN
1	120	5000.0	2002.0	NaN	NaN
2	130	5000.0	2003.0	5000.0	2003.0
3	140	NaN	NaN	8000.0	2005.0
4	150	NaN	NaN	6000.0	2007.0

Concatination

Join the two dataframes along rows

concat= pd.concat([df1,df2])
display(concat)

	Employee_id	Salary	Year_entered
0	100	2000	2001
1	120	5000	2002
2	130	5000	2003
0	130	5000	2003
1	140	8000	2005
2	150	6000	2007

Sort by the column

Sort the df by ‘Year_entered’ (ascending order).

df1.sort_values(by='Year_entered')

	Employee_id	Salary	Year_entered
0	100	2000	2001
1	120	5000	2002
2	130	5000	2003

Aggregation in Pandas

Some of the aggregation function includes number of rows, minimum, maximum value, mean, median, mode of the particular columns

Count()

df1.count()

Employee_id     3
Salary          3
Year_entered    3
dtype: int64

df1[['Salary']].count()

Salary    3
dtype: int64

min() and max()

df1[['Salary']].min()

Salary    2000
dtype: int64

df1[['Salary']].max()

Salary    5000
dtype: int64

mean(), median() and mode()

df1[['Salary']].mean()

Salary    4000.0
dtype: float64

df1[['Salary']].median()

Salary    5000.0
dtype: float64

df1[['Salary']].mode()

	Salary
0	5000

Grouping in Pandas

Group the dataframe by ‘Year_entered’

df1.groupby('Year_entered')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001F0C729A2E8>

Counts the ‘Employee_id’ and ‘Salary’ by grouping the dataframe based on ‘Year_entered’

df1.groupby('Year_entered').count()

	Employee_id	Salary
Year_entered
2001	1	1
2002	1	1
2003	1	1

Provides mean of the ‘Salary’ of each group by grouping the dataframe based on ‘Year_entered’

df1.groupby('Year_entered').mean()[['Salary']]

	Salary
Year_entered
2001	2000
2002	5000
2003	5000