EDA and V-EDA Tutorial 2: Exploratory Data Analysis using Pandas

I know this pandemic is a difficult time for everybody but I hope you are safe and well? Kindly adhere to the directives given by the experts and we will come out of this unscathed. Please, don’t forget to wash your hands regularly with soap under running water, sanitize when there is no water, practice social distancing, and always wear a mask.

Introduction

As earlier stated in my previous post, this is going to be a series of tutorials to analyze the COVID-19 data set.

In this second tutorial of the series, we will work on answering the questions from the Jupyter Notebook of the research paper using Exploratory Data Analysis (EDA).

Installations

As usual the prerequisite you need to get started is Ananconda, which should come with almost all the libraries you will need for the analysis.

Import libraries
import pandas as pd

Exploratory Data Analysis

Prior to answering these questions, there are two things I would like you to note;

  • What we are doing today is a build-up from my previous post on Data Collection and Preprocessing.
  • The technique used in most of the EDA was pivot_table() and you can learn more about it here

Now let’s get right into the Notebook and answer each of the questions;

For the purposes of this tutorial I will be using two dataframes.

1. df – This first dataframe is from merging the three dataframes (confirmed, recovered and death) hence what we have below;

df = df_confirmed.merge(df_deaths, on=['Province/State', 'Country/Region', 'Lat', 'Long', 'Date'], how='left')
df = df.merge(df_recovered, on=['Province/State', 'Country/Region', 'Lat', 'Long', 'Date'], how='left')
df

2. df_pivot – The data of the three columns (confirmed, recovered and death) is cumulative, hence all we had to do was pick the maximum which is always going to be the very recent number or figure.

df_pivot = pd.pivot_table(df, values=['Confirmed', 'Death', 'Recovered'], index=['Country_Region'], 
               aggfunc={'Confirmed': 'max','Death': 'max', 'Recovered':'max'}).reset_index()

df_pivot
1. Number Of Countries/Regions To Which COVID-19 Has Spread

By getting the count of unique countries, we get the number of countries affected by the pandemic.

df['Country_Region'].nunique() 
2. Provinces In China With Most Reported Cases

This is basically saying give me all the provinces in China, where the value is number of confirmed cases in that province, and the index, you guessed right, is the Province.

reported_province_cases = pd.pivot_table(df[(df['Country_Region'] == 'China')], values=['Confirmed'], index=['Province_State'], 
               aggfunc={'Confirmed': 'max'}).reset_index()

(reported_province_cases.sort_values('Confirmed', ascending=False)
.style
.highlight_min(color='lightgreen')
.highlight_max(color='#cd4f39'))
3. Number Of Provinces/States In China To Which COVID-19 Has Spread

The beautiful thing about pivot tables is that, whatever your output is becomes a dataframe on its own. Hence when I am asked about the province with confirmed cases, all I did was find the length of the “reported_province_cases” dataframe to get the number of provinces affected by the pandemic.

len(reported_province_cases)
4. Countries With Deaths Reported

Just like the previous question, you need to know what column you need to make the value and index. There was a bit of sorting done in descending order and some styling also done.

countries_reported_death = pd.pivot_table(df, values=['Death'], index=['Country_Region'], 
               aggfunc={'Death': 'max'}).reset_index()

(countries_reported_death.sort_values('Death',ascending=False)
.style
.bar(subset=["Death",], color='#cd4f39')
.highlight_min(color='lightgreen')
)
5. Deaths To Recovered Ratios

The values parameter had an array of Death and Recovered, and each values aggfunc needs to be declared. Some styling was added here as well.

death_to_recovered = pd.pivot_table(df, values=['Death', 'Recovered'], index=['Country_Region'], 
               aggfunc={'Death': 'max', 'Recovered':'max'}).reset_index()

death_to_recovered.sort_values('Death', ascending=False).style.highlight_max(color='#cd4f39').bar(subset=["Recovered"], color='lightgreen').background_gradient(cmap='Reds')
6. Countries With No Recovered Cases

Please note here that we are using the df_pivot dataframe. The query is pretty simple; it gives us all countries where there has been no recovery.

no_recovery = pd.pivot_table(df_pivot[(df_pivot['Recovered'] == 0)], values=['Recovered'], index=['Country_Region'], 
               aggfunc={'Recovered': 'sum'}).reset_index()
no_recovery.style.highlight_max(color='#cd4f39')
7. Countries With No Affected Cases Anymore

The query is pretty simple as well, we are saying give us all the countries where the Recovered cases is equal to the Confirmed cases. Technically this means there are no COVID-19 cases, because all affected people have recovered.

no_cases = pd.pivot_table(df_pivot[(df_pivot['Recovered'] == df_pivot['Confirmed'])], values=['Confirmed', 'Death','Recovered'], index=['Country_Region'], 
               aggfunc={'Recovered': 'sum', 'Confirmed':'sum','Death':'sum'}).reset_index()
no_cases.style.highlight_max(color='lightgreen')
Last Words

These questions from the research paper were garnered at the preliminary stages of the pandemic, hence are inadequate. I am open to more questions you’ll need me to answer for you. Kindly leave them in the comment section and I will attend to them as early as possible.

Kindly find the Jupyter Notebook of this tutorial on Github.

Next week, we will be visualizing some of these results on a map and you don’t wanna miss out.

See you, take care and be safe.

Spread the love

You may also like

2 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *