Pandas
Pandas
Pandas is one of the most popular software (Python) library that can be used by the Data Analyst and Data Scientist for data manipulation and data analysis. Pandas can be used for the management of two-dimensional data tables. Pandas can be said to be the SQL of python though pandas have additional features as well. In this blog, I will show the most commonly used features of Pandas by a data analyst and data scientist.
The blog is done in Jupyter notebook. Installing python 3 and Jupyter notebook is not part of this blog. So, we can assume that these are already installed and we are ready to go.
Import Pandas
Importing Pandas is one of the first step whenever we work in pandas. When we write ‘import pandas as pd’, it means that Jupyter notebook understands that when you write pd, it is referring to Pandas library.
import pandas as pd
Read from the file
Any type of file compatible to work in Pandas like csv, excel etc can be read in Pandas.
df = pd.read_csv("Customers.csv")
Get top 3 rows
head(n) gives the first n number of rows where n = 3 above
df.head(3)
Customer ID | Name | Surname | Gender | Age | Region | Job Classification | Date Joined | Balance | |
---|---|---|---|---|---|---|---|---|---|
0 | 100000001 | Simon | Walsh | Male | 21 | England | White Collar | 05.Jan.15 | 113810.15 |
1 | 400000002 | Jasmine | Miller | Female | 34 | Northern Ireland | Blue Collar | 06.Jan.15 | 36919.73 |
2 | 100000003 | Liam | Brown | Male | 46 | England | White Collar | 07.Jan.15 | 101536.83 |
Get bottom 2 rows
tail(n) gives the last n number of rows where n = 3 above
df.tail(3)
Customer ID | Name | Surname | Gender | Age | Region | Job Classification | Date Joined | Balance | |
---|---|---|---|---|---|---|---|---|---|
4011 | 200004012 | Hannah | Springer | Female | 50 | Scotland | Other | 30.Dec.15 | 59477.82 |
4012 | 200004013 | Christian | Reid | Male | 51 | Scotland | Blue Collar | 30.Dec.15 | 239.45 |
4013 | 300004014 | Stephen | May | Male | 33 | Wales | Blue Collar | 30.Dec.15 | 30293.19 |
Summary of the data
describe() gives total number of count, mean, standard deviation, minimum, maximum, 25%, 50% and 75% of value
df.describe()
Customer ID | Age | Balance | |
---|---|---|---|
count | 4.014000e+03 | 4014.000000 | 4014.000000 |
mean | 1.696831e+08 | 38.611111 | 39766.448274 |
std | 8.865374e+07 | 9.819121 | 29859.489192 |
min | 1.000000e+08 | 15.000000 | 11.520000 |
25% | 1.000020e+08 | 31.000000 | 16115.367500 |
50% | 1.000038e+08 | 37.000000 | 33567.330000 |
75% | 2.000031e+08 | 45.000000 | 57533.930000 |
max | 4.000038e+08 | 64.000000 | 183467.700000 |
Name of the Columns
df.columns
Index(['Customer ID', 'Name', 'Surname', 'Gender', 'Age', 'Region',
'Job Classification', 'Date Joined', 'Balance'],
dtype='object')
Creating new dataframe which has the subset of the columns
New data frame includes only three columns ‘Customer ID’, ‘Name’, ‘Gender’ from the old dataframe.
new_df = df[['Customer ID','Name', 'Gender']]
print(type(df)) # original dataframe
print(type(new_df)) # newly created dataframe
new_df.head()
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
Customer ID | Name | Gender | |
---|---|---|---|
0 | 100000001 | Simon | Male |
1 | 400000002 | Jasmine | Female |
2 | 100000003 | Liam | Male |
3 | 300000004 | Trevor | Male |
4 | 100000005 | Deirdre | Female |
Filter the rows based on the condition
It checks the condition ands gives values if true.
df1 = df[df['Balance'] >150000]
df1
Customer ID | Name | Surname | Gender | Age | Region | Job Classification | Date Joined | Balance | |
---|---|---|---|---|---|---|---|---|---|
1306 | 100001307 | Dorothy | Jackson | Female | 31 | England | White Collar | 22.Jul.15 | 183467.70 |
2009 | 200002010 | Connor | North | Male | 50 | Scotland | Other | 11.Sep.15 | 172085.48 |
3449 | 100003450 | Carl | Fraser | Male | 35 | England | White Collar | 30.Nov.15 | 181680.99 |
3831 | 100003832 | Sebastian | Arnold | Male | 44 | England | Blue Collar | 20.Dec.15 | 161517.82 |
Merge in Pandas
Merge two dataframes
df1= pd.DataFrame({ "Employee_id":[100,120,130],"Salary":[2000,5000,5000],"Year_entered":[2001, 2002,2003]})
df2=pd.DataFrame({ "Employee_id":[130,140,150],"Salary":[5000,8000,6000],"Year_entered":[2003, 2005,2007]})
merged= pd.merge(df1,df2)
display(df1)
display(df2)
display(merged)
Employee_id | Salary | Year_entered | |
---|---|---|---|
0 | 100 | 2000 | 2001 |
1 | 120 | 5000 | 2002 |
2 | 130 | 5000 | 2003 |
Employee_id | Salary | Year_entered | |
---|---|---|---|
0 | 130 | 5000 | 2003 |
1 | 140 | 8000 | 2005 |
2 | 150 | 6000 | 2007 |
Employee_id | Salary | Year_entered | |
---|---|---|---|
0 | 130 | 5000 | 2003 |
Merge two dataframes along the ‘Employee_id’ value
merged= pd.merge(df1,df2,left_on = 'Employee_id', right_on = 'Employee_id')
display(merged)
Employee_id | Salary_x | Year_entered_x | Salary_y | Year_entered_y | |
---|---|---|---|---|---|
0 | 130 | 5000 | 2003 | 5000 | 2003 |
Merge two dataframes along the ‘Employee_id’ value
merged= pd.merge(df1,df2,on = 'Employee_id')
display(merged)
Employee_id | Salary_x | Year_entered_x | Salary_y | Year_entered_y | |
---|---|---|---|---|---|
0 | 130 | 5000 | 2003 | 5000 | 2003 |
Merge two dataframes along the ‘Employee_id’ value using left join
merged= pd.merge(df1,df2, on = 'Employee_id', how = 'left')
display(merged)
Employee_id | Salary_x | Year_entered_x | Salary_y | Year_entered_y | |
---|---|---|---|---|---|
0 | 100 | 2000 | 2001 | NaN | NaN |
1 | 120 | 5000 | 2002 | NaN | NaN |
2 | 130 | 5000 | 2003 | 5000.0 | 2003.0 |
Merge two dataframes along the ‘Employee_id’ value using right join
merged= pd.merge(df1,df2,on = 'Employee_id', how = 'right')
display(merged)
Employee_id | Salary_x | Year_entered_x | Salary_y | Year_entered_y | |
---|---|---|---|---|---|
0 | 130 | 5000.0 | 2003.0 | 5000 | 2003 |
1 | 140 | NaN | NaN | 8000 | 2005 |
2 | 150 | NaN | NaN | 6000 | 2007 |
Merge two dataframes along the ‘Employee_id’ value using outer join
merged= pd.merge(df1,df2,on = 'Employee_id', how = 'outer')
display(merged)
Employee_id | Salary_x | Year_entered_x | Salary_y | Year_entered_y | |
---|---|---|---|---|---|
0 | 100 | 2000.0 | 2001.0 | NaN | NaN |
1 | 120 | 5000.0 | 2002.0 | NaN | NaN |
2 | 130 | 5000.0 | 2003.0 | 5000.0 | 2003.0 |
3 | 140 | NaN | NaN | 8000.0 | 2005.0 |
4 | 150 | NaN | NaN | 6000.0 | 2007.0 |
Concatination
Join the two dataframes along rows
concat= pd.concat([df1,df2])
display(concat)
Employee_id | Salary | Year_entered | |
---|---|---|---|
0 | 100 | 2000 | 2001 |
1 | 120 | 5000 | 2002 |
2 | 130 | 5000 | 2003 |
0 | 130 | 5000 | 2003 |
1 | 140 | 8000 | 2005 |
2 | 150 | 6000 | 2007 |
Sort by the column
Sort the df by ‘Year_entered’ (ascending order).
df1.sort_values(by='Year_entered')
Employee_id | Salary | Year_entered | |
---|---|---|---|
0 | 100 | 2000 | 2001 |
1 | 120 | 5000 | 2002 |
2 | 130 | 5000 | 2003 |
Aggregation in Pandas
Some of the aggregation function includes number of rows, minimum, maximum value, mean, median, mode of the particular columns
Count()
df1.count()
Employee_id 3
Salary 3
Year_entered 3
dtype: int64
df1[['Salary']].count()
Salary 3
dtype: int64
min() and max()
df1[['Salary']].min()
Salary 2000
dtype: int64
df1[['Salary']].max()
Salary 5000
dtype: int64
mean(), median() and mode()
df1[['Salary']].mean()
Salary 4000.0
dtype: float64
df1[['Salary']].median()
Salary 5000.0
dtype: float64
df1[['Salary']].mode()
Salary | |
---|---|
0 | 5000 |
Grouping in Pandas
Group the dataframe by ‘Year_entered’
df1.groupby('Year_entered')
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001F0C729A2E8>
Counts the ‘Employee_id’ and ‘Salary’ by grouping the dataframe based on ‘Year_entered’
df1.groupby('Year_entered').count()
Employee_id | Salary | |
---|---|---|
Year_entered | ||
2001 | 1 | 1 |
2002 | 1 | 1 |
2003 | 1 | 1 |
Provides mean of the ‘Salary’ of each group by grouping the dataframe based on ‘Year_entered’
df1.groupby('Year_entered').mean()[['Salary']]
Salary | |
---|---|
Year_entered | |
2001 | 2000 |
2002 | 5000 |
2003 | 5000 |