Hello everyone, I hope you are all doing well and staying safe. The support has been incredible and I appreciate each and every one of you.
Background
If you have been following this tutorial series, you already know the background. We are analyzing the COVID-19 data set based on an HTML Jupyter Notebook of a research paper by Assistant Professor Samrat Kumar Dey.
In this tutorial, we are going to perform some comparative analysis and plot our COVID-19 data on charts.
Prerequisite
To better understand and follow this tutorial, kindly go through the prior tutorials, as this one continues from them.
In my previous tutorial about data visualization, I explained why it is important to perform data visualization and why my library of choice was Plotly.
Today, however, we will compare and delve into:
- Number of Countries of COVID-19 spread over time
- Number of Provinces in China COVID-19 spread over time
- Comparative analysis of cases in Hubei, Other provinces of China, and worldwide
- Number of Cases in China and Outside China
Import Libraries
As always, we start by importing all the libraries to be used for our analysis.
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import timedelta,date
import plotly.express as px
Please note: the dataframe used for the analysis below (df), how it was derived, and the datasets used to derive it are covered in my previous tutorial listed above. That will not be discussed here; we will go ahead and use it for our analysis.
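For readers who do not have the earlier notebook at hand, a small stand-in dataframe can help when trying out the snippets. This is only a sketch: the column names are the ones this tutorial's code relies on (Date, Country_Region, Province_State, Confirmed, Death, Recovered), but the name sample_df, the rows and the numbers are made up for illustration and are not real COVID-19 figures.

```python
import pandas as pd

# Hypothetical stand-in for the real df built in the previous tutorial.
# All values are invented purely so the later steps have something to run on.
sample_df = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-22", "2020-01-22",
             "2020-01-23", "2020-01-23", "2020-01-23"],
    "Country_Region": ["China", "China", "Japan", "China", "China", "Japan"],
    "Province_State": ["Hubei", "Beijing", None, "Hubei", "Beijing", None],
    "Confirmed": [10, 0, 0, 15, 2, 1],
    "Death": [1, 0, 0, 2, 0, 0],
    "Recovered": [0, 0, 0, 1, 0, 0],
})

print(sample_df.head())
```

Assigning it with df = sample_df would let the rest of the walkthrough run end to end on toy data.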
1. Number of Countries of COVID-19 Spread
In this analysis we want to check the number of countries that recorded COVID-19 per date. So basically, for each date we want the number of countries with at least one confirmed case recorded on that date. Let's get right into it:
Line 1: Instantiate a dictionary.
Line 3: Loop through all the rows in the dataframe using the itertuples() method.
Line 4: While looping through the dataframe, check whether the current row.Date has NOT already been added to the dictionary.
Line 5: If it has not, add the row.Date value as a key in the dictionary; otherwise the date is skipped. As we well know, dictionaries store items as key:value pairs. In view of that, the date is added as the key and an empty list as the corresponding value.
Line 6: We check whether the Confirmed count is greater than zero.
Lines 7-8: If row.Country_Region is NOT already in the list stored in finalData under that date (the key), we append the country, since its number of confirmed cases is greater than zero. I hope this is not confusing? 🥺
finalData = {}

for row in df.itertuples():
    if row.Date not in finalData:
        finalData[row.Date] = []
    if row.Confirmed > 0:
        if row.Country_Region not in finalData[row.Date]:
            finalData[row.Date].append(row.Country_Region)
finalData
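The same bookkeeping can be sketched with the standard library alone. The version below is an illustrative variant, not the notebook's code: it uses a defaultdict of sets, so the "is this date already a key?" and "is this country already listed?" checks happen implicitly. The Row tuples are hypothetical sample data standing in for what df.itertuples() yields.

```python
from collections import defaultdict, namedtuple

# Hypothetical rows mimicking what df.itertuples() yields.
Row = namedtuple("Row", ["Date", "Country_Region", "Confirmed"])
rows = [
    Row("2020-01-22", "China", 10),
    Row("2020-01-22", "China", 5),   # a second row for the same country
    Row("2020-01-22", "Japan", 0),   # no confirmed cases yet, so not counted
    Row("2020-01-23", "China", 20),
    Row("2020-01-23", "Japan", 1),
]

countries_by_date = defaultdict(set)
for row in rows:
    # Looking up the key creates an empty set, so every date appears in the result.
    countries = countries_by_date[row.Date]
    if row.Confirmed > 0:
        countries.add(row.Country_Region)  # a set silently ignores duplicates

print(dict(countries_by_date))
```

One trade-off to keep in mind: a set does not remember the order in which countries were added, whereas the list used in the tutorial does.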
Here again,
- We instantiate a dictionary.
- Loop through the key and value of the items in our finalData dictionary.
- The length of the values of each item in the dictionary is then set as the corresponding value of the key in our new count_countries dictionary.
count_countries = {}
for key, values in finalData.items():
    count_countries[key] = len(values)
count_countries
Way to go! Now we get to create our dataframe. Whew! This is pretty self-explanatory: the items in our count_countries dictionary are our data, and we pass in the names of our columns. Note that the Country_Region column here holds the number of countries, not their names.
all_countries = pd.DataFrame(list(count_countries.items()),
                             columns=['Date', 'Country_Region'])
all_countries
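The loop, the counting dictionary and the dataframe construction above could also be collapsed into a single pandas expression. This is a sketch of an alternative, not the approach the notebook takes; toy_df and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the real df comes from the previous tutorial.
toy_df = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"],
    "Country_Region": ["China", "China", "Japan", "China", "Japan"],
    "Confirmed": [10, 5, 0, 20, 1],
})

# Keep rows with at least one confirmed case, then count distinct countries per date.
counts = (
    toy_df[toy_df["Confirmed"] > 0]
    .groupby("Date")["Country_Region"]
    .nunique()
    .reset_index()
)

print(counts)
```

One difference from the loop: a date on which no country has any confirmed case is dropped by the filter here, while the dictionary version keeps it with a count of zero.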
Plotting our results using Plotly
When plotting, first pass the dataframe (all_countries), then the columns to use on the X and Y axes, and finally add a suitable title.
fig = px.line(all_countries, x="Date", y="Country_Region",
              title='Number of Countries/Regions to which COVID-19 spread over the time')
fig.show()
2. Number of Provinces in China COVID-19 spread over time
Because we want to analyze provinces in China, we select the rows of our dataframe where Country_Region is China.
df_china = df.loc[(df['Country_Region'] == 'China')]
We practically go through what we did for Number of Countries of COVID-19 Spread above. The only change is that instead of the country we want the Province_State.
chinaFinalData = {}
for row in df_china.itertuples():
    if row.Date not in chinaFinalData:
        chinaFinalData[row.Date] = []
    if row.Confirmed > 0:
        if row.Province_State not in chinaFinalData[row.Date]:
            chinaFinalData[row.Date].append(row.Province_State)
chinaFinalData
count_provinces = {}
for key, values in chinaFinalData.items():
    count_provinces[key] = len(values)
count_provinces
provinces_china = pd.DataFrame(list(count_provinces.items()),
                               columns=['Date', 'Province_State'])
provinces_china
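Since the country and province counts follow the same steps, they could be wrapped in one small function. The helper below, count_unique_per_date, is hypothetical (it is not in the original notebook), and toy_china holds made-up numbers; treat it as a sketch of how the repetition might be factored out.

```python
import pandas as pd

def count_unique_per_date(frame, column):
    """Count distinct values of `column` with Confirmed > 0 for each Date.

    Hypothetical helper, not part of the original notebook.
    """
    seen = {}
    for row in frame.itertuples():
        # setdefault adds the date with an empty list the first time it is seen.
        values = seen.setdefault(row.Date, [])
        value = getattr(row, column)
        if row.Confirmed > 0 and value not in values:
            values.append(value)
    counts = {date: len(values) for date, values in seen.items()}
    return pd.DataFrame(list(counts.items()), columns=["Date", column])

# Made-up sample data for China only.
toy_china = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"],
    "Province_State": ["Hubei", "Beijing", "Hubei", "Beijing"],
    "Confirmed": [10, 0, 15, 2],
})

provinces_per_date = count_unique_per_date(toy_china, "Province_State")
print(provinces_per_date)
```

With such a helper, count_unique_per_date(df, "Country_Region") would cover the first analysis and count_unique_per_date(df_china, "Province_State") the second.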
Plotting our chart
fig = px.line(provinces_china, x="Date", y="Province_State",
              title='Number of Provinces/States/Regions of China to which COVID-19 spread over the time')
fig.show()
3. Comparative case analysis of Hubei, Other provinces of China, and worldwide
We want to compare the total cases in Hubei, in the other provinces of China, and in the rest of the world outside China. Why Hubei, you may be asking? Hubei is the province in China where COVID-19 first started and, you guessed right, Wuhan is the capital of Hubei. So you see where we are going with this…? Let's head right into it.
Here we create a pivot table, filtering on Country_Region equal to China and taking the maximum value of each case type (Confirmed, Death and Recovered) for each province. This becomes the dataframe we use for our comparative analysis relating to China.
reported_province_cases = pd.pivot_table(
    df[(df['Country_Region'] == 'China')],
    values=['Confirmed', 'Death', 'Recovered'], index=['Province_State'],
    aggfunc={'Confirmed': 'max', 'Death': 'max', 'Recovered': 'max'}).reset_index()
reported_province_cases
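To see what this pivot table does, here is a small sketch on made-up numbers. It also shows the same aggregation written as a groupby, which some readers may find easier to read; toy_china and both result names are illustrative only.

```python
import pandas as pd

# Made-up counts for two provinces over two dates.
toy_china = pd.DataFrame({
    "Province_State": ["Hubei", "Hubei", "Beijing", "Beijing"],
    "Confirmed": [10, 15, 0, 2],
    "Death": [1, 2, 0, 0],
    "Recovered": [0, 1, 0, 0],
})

# One row per province, keeping the largest value seen in each column.
by_pivot = pd.pivot_table(
    toy_china,
    values=["Confirmed", "Death", "Recovered"],
    index=["Province_State"],
    aggfunc="max",
).reset_index()

# The same aggregation expressed as a groupby.
by_groupby = (
    toy_china.groupby("Province_State")[["Confirmed", "Death", "Recovered"]]
    .max()
    .reset_index()
)

print(by_pivot)
print(by_groupby)
```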
a. Hubei Province
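Using the reported_province_cases dataframe we created above, we select the row where Province_State is equal to Hubei. We pass drop=True to reset_index so that the old index is not kept as an extra column, which would otherwise leak into the later melt step.
hubei_df = reported_province_cases.loc[(reported_province_cases['Province_State'] == 'Hubei')].reset_index(drop=True)
hubei_df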
b. Other Provinces in China, Not Hubei
Here we do the opposite of the above: we want all the rows in the reported_province_cases dataframe EXCEPT the one where Province_State is equal to Hubei. Next we sum the numbers for the various case types.
not_hubei = reported_province_cases.loc[(reported_province_cases['Province_State'] != 'Hubei')]
not_hubei_death = not_hubei['Death'].sum()
not_hubei_recovered = not_hubei['Recovered'].sum()
not_hubei_confirmed = not_hubei['Confirmed'].sum()
Again, a dictionary holds key-value pairs, so we create one where the label "Province_State" is the key and "Outside_Hubei" is the value. We do the same for the different case types.
As always, we create a dataframe from our dictionary (not_hubei), passing the column names.
not_hubei = {'Province_State': ['Outside_Hubei'],
             'Confirmed': [not_hubei_confirmed],
             'Death': [not_hubei_death],
             'Recovered': [not_hubei_recovered]
             }
not_hubei_df = pd.DataFrame(not_hubei,
                            columns=['Province_State', 'Confirmed', 'Death', 'Recovered'])
not_hubei_df
Remember the Hubei dataframe hubei_df we created up there? We want to add it as a new row to not_hubei_df. We do that with pd.concat, since DataFrame.append was removed in pandas 2.0.
new_plot_df = pd.concat([not_hubei_df, hubei_df], ignore_index=True)
new_plot_df
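A small sketch of how this row-wise stacking behaves, using made-up totals: pd.concat does the job of adding one dataframe's rows to another, matching columns by name. The names outside, hubei and stacked are illustrative only.

```python
import pandas as pd

# Made-up totals, only to illustrate the stacking step.
outside = pd.DataFrame({"Province_State": ["Outside_Hubei"], "Confirmed": [2],
                        "Death": [0], "Recovered": [0]})
hubei = pd.DataFrame({"Province_State": ["Hubei"], "Confirmed": [15],
                      "Death": [2], "Recovered": [1]})

# Columns are matched by name; ignore_index=True renumbers the rows 0, 1, ...
stacked = pd.concat([outside, hubei], ignore_index=True)
print(stacked)
```

Without ignore_index=True both rows would keep their original index label 0, which is harmless for plotting but can be confusing when inspecting the result.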
c. Other Countries in the World Not China
We are doing the opposite of what we did earlier. Here we create a pivot table, filtering on Country_Region NOT equal to China and taking the maximum value of each case type (Confirmed, Death and Recovered) for each country. This becomes the dataframe we use for our comparative analysis of the countries outside China.
We then go through the same steps as before: get the sum of each case type.
other_countries = pd.pivot_table(
    df[(df['Country_Region'] != 'China')],
    values=['Confirmed', 'Death', 'Recovered'], index=['Country_Region'],
    aggfunc={'Confirmed': 'max', 'Death': 'max', 'Recovered': 'max'}).reset_index()
other_countries_confirmed = other_countries['Confirmed'].sum()
other_countries_recovered = other_countries['Recovered'].sum()
other_countries_death = other_countries['Death'].sum()
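Why take the maximum per country before summing? Assuming the counts in df are cumulative (the use of 'max' suggests so, but it is an assumption here), each country's largest value is its latest total, so summing those maxima gives the overall total. Summing every row instead would count the same cases once per date. A toy illustration with made-up numbers:

```python
import pandas as pd

# Made-up cumulative counts for two countries over two dates.
toy = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-23", "2020-01-22", "2020-01-23"],
    "Country_Region": ["Japan", "Japan", "Italy", "Italy"],
    "Confirmed": [1, 3, 0, 2],
})

# Latest (largest) cumulative value per country, then the total across countries.
latest_per_country = toy.groupby("Country_Region")["Confirmed"].max()
total_from_max = latest_per_country.sum()

# Summing all rows counts Japan's first case on both dates.
naive_total = toy["Confirmed"].sum()

print(total_from_max, naive_total)
```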
Then we create a dictionary and a dataframe.
other_countries = {'Province_State': ['Rest_of_World'],
                   'Confirmed': [other_countries_confirmed],
                   'Death': [other_countries_death],
                   'Recovered': [other_countries_recovered]
                   }
other_countries_df = pd.DataFrame(other_countries,
                                  columns=['Province_State', 'Confirmed', 'Death', 'Recovered'])
other_countries_df
Remember new_plot_df, the dataframe we got by combining hubei_df and not_hubei_df? Yes, that! We are going to add a new row to it from our dataframe other_countries_df. Got it?
new_plot_df = pd.concat([new_plot_df, other_countries_df], ignore_index=True)
new_plot_df
I won't explain melt in detail because I did that in my previous tutorial, but I can show you what happens to the dataframe after melting, as seen here.
new_plot_df = new_plot_df.melt(id_vars=['Province_State'])
new_plot_df = new_plot_df.rename(columns={'variable': 'Cases', 'value': 'Count'})
new_plot_df
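To make the reshaping concrete, here is a minimal sketch of melt on a two-row table with made-up numbers. The names wide and long_df are illustrative only.

```python
import pandas as pd

# Made-up wide table with one row per region.
wide = pd.DataFrame({
    "Province_State": ["Hubei", "Rest_of_World"],
    "Confirmed": [15, 5],
    "Death": [2, 0],
})

# Each (region, case type) pair becomes its own row.
long_df = wide.melt(id_vars=["Province_State"]).rename(
    columns={"variable": "Cases", "value": "Count"}
)
print(long_df)
```

The two wide rows become four long rows, one per region and case type, which is the shape px.bar needs in order to colour the bars by case type.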
Plotting our data
fig = px.bar(new_plot_df, x="Count", y='Province_State', color='Cases',
             orientation='h', barmode='group', text="Count",
             hover_data=["Province_State", "Cases", "Count"],
             height=800,
             title='Comparative case analysis of Hubei, Other provinces of China, and worldwide')
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.show()
4. Number of Cases in China and Outside China
a. Cases in China
Remember the reported_province_cases dataframe we created in section 3? We are going to use it here. Instead of repeating the filter df.loc[(df['Country_Region'] == 'China')], we pass reported_province_cases as the data argument of the pivot table.
After that, we melt the dataframe and sort the values. It's as simple as that.
provinces_in_china = pd.pivot_table(
    reported_province_cases,
    values=['Confirmed', "Recovered", "Death"], index=["Province_State"],
    aggfunc={'Confirmed': 'max', "Recovered": 'max', "Death": 'max'}).reset_index()
provinces_in_china = provinces_in_china.melt(id_vars=["Province_State"])
provinces_in_china = provinces_in_china.sort_values(by='value', ascending=True)
Plot our data
fig = px.bar(provinces_in_china,
             x="value",
             y='Province_State',
             color='variable',
             orientation='h',
             hover_data=["Province_State", "variable", "value"],
             height=800,
             title='Number of Cases inside China')
fig.update_traces(textposition='outside')  # place any bar text labels outside the bars
fig.show()
b. Cases Outside China
Here we do the opposite of the above, that is, we take the countries other than China. We then melt and sort the dataset for plotting.
outside_china = pd.pivot_table(
    country_codes_df.loc[(country_codes_df['Country_Region'] != 'China')],
    values=['Confirmed', "Recovered", "Death"], index=["Country_Region"],
    aggfunc={'Confirmed': 'max', "Recovered": 'max', "Death": 'max'}).reset_index()
outside_china = outside_china.melt(id_vars=["Country_Region"])
outside_china = outside_china.sort_values(by='value', ascending=True)
outside_china
Plot that data
fig = px.bar(outside_china, x="value", y='Country_Region', color='variable', orientation='h',
             hover_data=["Country_Region", "variable", "value"],
             height=2000,
             title='Number of Cases outside China')
fig.show()
Final Thoughts
Being able to implement these solutions is great, and being able to visualize these datasets has its own perks, but most importantly, being able to understand and interpret these plots and tell a good story is a different ball game altogether.
I believe we didn't really interpret our results, and maybe that should be something we focus on after completing all these analyses.
As always, we don't leave you hanging: kindly find the Jupyter Notebook of this tutorial on GitHub.