EDA and V-EDA Tutorial 1: Data Collection and Preprocessing

I trust you are safe? Despite this unpleasant coronavirus period we are all in, we need not lose hope, and we should keep doing our best.

Today, I come your way with yet another interesting tutorial.

Background

This is going to be a series in which I analyze the COVID-19 data set following a research paper I found. In the paper, the data set was analyzed using Exploratory Data Analysis (EDA) methods and a visualization model, in the hope of increasing situational awareness among the general public.

The EDA and visualization EDA (V-EDA) were performed by Assistant Professor Samrat Kumar Dey.

The Problem

The paper's HTML Jupyter Notebook has the code folded away but displays the expected results.

My hope is to replicate the work by writing the code that answers the questions posed during the EDA, and in effect reproduce the results shown in the research.

A little side note here: the answers we derive may differ from those in the research, because the paper was written on 03 March 2020 and the COVID-19 data set changes every single day.
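One way to get closer to the paper's snapshot is to filter the tidied data to dates on or before the paper's cut-off. This is a minimal sketch on toy values, assuming a datetime `Date` column like the one we build later in this tutorial:

```python
import pandas as pd

# Toy frame standing in for the tidied data built later in this tutorial.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-02-28', '2020-03-03', '2020-04-01']),
    'Confirmed_Cases': [10, 20, 30],
})

# Keep only observations up to the paper's publication date (03 March 2020).
snapshot = df[df['Date'] <= '2020-03-03']
print(len(snapshot))  # 2
```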

In this first part of the series, we will work on getting the data in the best format to help us with an efficient EDA and also replicate what we have in the research paper using techniques like melting, concatenating, etc.

Installations

Data Set Credit

The data set is from the Johns Hopkins University CSSE COVID-19 repository.

Prerequisite

Before we get started, you should have installed the following packages/software:

Anaconda – this installation comes with almost everything we will need for this task.

We will be working with the Python programming language for the purpose of this tutorial.

Before we get into any installations, create a virtual environment.

python -m venv research_paper_analysis
source research_paper_analysis/bin/activate

Now let’s install the other tools that are not included in Anaconda. You can do this by running the following command in the virtual environment created.

pip install wordcloud

Data Collection and Preprocessing

Now that we have all our installations, we can begin with our first stage in the EDA process.

  1. Import Libraries
  2. Import data set or read data sets
  3. Melt dataframe
  4. Merge dataframes

Start your Jupyter Notebook and create a new Python3 Project.

1. Import Libraries

import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

2. Read Data set

There are three data sets: confirmed, death, and recovered cases. We are going to read each of them.

Confirmed Cases
confirmed_cases = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
confirmed_cases = pd.read_csv(confirmed_cases)
confirmed_cases
Death Cases
confirmed_deaths = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
confirmed_deaths = pd.read_csv(confirmed_deaths)
confirmed_deaths
Recovered Cases
confirmed_recovered = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv"
confirmed_recovered = pd.read_csv(confirmed_recovered)
confirmed_recovered

You will notice that the data is in a pretty untidy format because of the long run of date columns.

Note: The number of columns will keep increasing each day.

Tidy Data Concept

In his paper Tidy Data, Hadley Wickham proposed an ideal format and structure for a data set:

  • Each variable is a column,
  • Each observation is a row, and
  • Each type of observational unit is a table. 

Obviously our dataframe doesn’t follow the above principles.
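To see the contrast concretely, here is a minimal sketch on a toy frame in the same wide layout as our data (the country names and values are made up for illustration):

```python
import pandas as pd

# Toy frame mimicking the Johns Hopkins layout: an identifier column
# followed by one column per date — one variable (cases) spread
# across many columns.
wide = pd.DataFrame({
    'Country/Region': ['Ghana', 'China'],
    '1/22/20': [0, 444],
    '1/23/20': [0, 549],
})

# Tidy form: each date column becomes rows, so each row is one observation.
tidy = wide.melt(id_vars=['Country/Region'], var_name='Date', value_name='Cases')
print(tidy)
```

After the melt, every row is a single (country, date, cases) observation, which is exactly the shape the research paper works with.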

Why Tidy the data?

Simple: Tidy datasets are easy to manipulate, model and visualize.

Tidy Data From The Research Paper

From the research paper, you can see that the data was tidied by switching the date columns into rows in order to make each type (case in a province) an observational unit.

Two things need to happen to achieve the results in the second dataframe:

  1. Convert all Date columns to rows – Melt()
  2. Join the three dataframes into one to have Confirmed, Death and Recovered Cases – Merge

3. Melt() for tidying Data

The melt function's signature is as below:

confirmed_cases.melt(
    id_vars=None,
    value_vars=None,
    var_name=None,
    value_name='value',
    col_level=None, 
)

According to the documentation, here is what each parameter means;

Parameters
----------
frame : DataFrame
id_vars : Column(s) to use as identifier variables.
value_vars : Column(s) to unpivot. If not specified, uses all columns that are not set as `id_vars`.
var_name : Name to use for the 'variable' column. If None, it uses `frame.columns.name` or 'variable'.
value_name : Name to use for the 'value' column.
col_level : If columns are a MultiIndex then use this level to melt.
Confirmed Cases Tidied Data

As part of tidying our data, every row without a Province/State value is filled with the country name, and the Date column is converted to a datetime type.

df_confirmed = confirmed_cases.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='Date', value_name='Confirmed_Cases')
df_confirmed['Province/State'] = df_confirmed['Province/State'].fillna(df_confirmed['Country/Region'])
df_confirmed['Date'] = pd.to_datetime(df_confirmed['Date'])

df_confirmed.sample(5)
Death Cases Tidied Data

As part of tidying our data, every row without a Province/State value is filled with the country name, and the Date column is converted to a datetime type.

df_deaths = confirmed_deaths.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='Date', value_name='Death_Cases')
df_deaths['Province/State'] = df_deaths['Province/State'].fillna(df_deaths['Country/Region'])
df_deaths['Date'] = pd.to_datetime(df_deaths['Date'])

df_deaths.sample(5)
Recovered Cases Tidied Data

As part of tidying our data, every row without a Province/State value is filled with the country name, and the Date column is converted to a datetime type.

df_recovered = confirmed_recovered.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='Date', value_name='Recovered_Cases')
df_recovered['Province/State'] = df_recovered['Province/State'].fillna(df_recovered['Country/Region'])
df_recovered['Date'] = pd.to_datetime(df_recovered['Date'])

df_recovered.sample(5)

4. Merge() – Joining All DataFrames

  1. Combine the 3 datasets into one where we have confirmed, deaths, and recovered cases.
  2. We’ll use left joins to combine the 3 datasets, starting from confirmed. This is because the confirmed cases data set should hold at least as many records as the deaths/recovered sets, so the dataframe with the confirmed cases will be our central dataframe.
df = df_confirmed.merge(df_deaths, on=['Province/State', 'Country/Region', 'Lat', 'Long', 'Date'], how='left')
df = df.merge(df_recovered, on=['Province/State', 'Country/Region', 'Lat', 'Long', 'Date'], how='left')
df = df.rename(columns={'Province/State': 'Province_State', 'Country/Region': 'Country_Region'})
df.head()
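The left-join behaviour can be sketched on toy frames (toy countries and counts, not the real data): every row of the left frame is kept, and keys missing from the right frame come back as NaN.

```python
import pandas as pd

# Toy stand-ins for df_confirmed and df_deaths, keyed the same way.
confirmed = pd.DataFrame({'Country': ['Ghana', 'China'],
                          'Date': ['1/22/20', '1/22/20'],
                          'Confirmed_Cases': [0, 444]})
deaths = pd.DataFrame({'Country': ['China'],
                       'Date': ['1/22/20'],
                       'Death_Cases': [17]})

# Left join keeps every row of `confirmed`; countries missing from
# `deaths` get NaN in Death_Cases.
merged = confirmed.merge(deaths, on=['Country', 'Date'], how='left')
print(merged)
```

The same check applies to our real merge: the final dataframe should have exactly as many rows as the confirmed-cases dataframe it started from.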

Bonus – Word Cloud

wc = WordCloud(max_font_size=250, collocations=False,
               max_words=1000, width=1600, height=800,
               background_color="white").generate(' '.join(df['Country_Region'].replace(' ', '_', regex=True)))
plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)
plt.savefig('Country_Cloud.png')  # save before show(), which clears the figure
plt.show()

The word cloud counts the occurrences of each country in the dataframe and derives its font size from that frequency.

In this chart, the most affected countries are most easily visible.

Our chart is a bit flawed in the sense that the US should be about as visible as China, since it is currently at the epicenter of the pandemic. This flaw is probably due to some Province/State values not being populated during the data collection phase.
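You can inspect the same frequencies directly with `value_counts`. A sketch on a toy column standing in for `df['Country_Region']` (one row per province per date, so countries with many provinces repeat often):

```python
import pandas as pd

# Toy stand-in for df['Country_Region']: China appears three times
# because it has many provinces, each contributing its own rows.
countries = pd.Series(['China', 'China', 'China', 'US', 'Ghana'])
print(countries.value_counts())
```

Running this on the real `df['Country_Region']` column shows exactly which countries dominate the cloud, and makes the US/China imbalance easy to verify.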

We have come to the end of this tutorial. You’ll realize that we’ve achieved similar results to the results from the Data Collection and Preprocessing from the research paper.

Next week, we will get into the EDA by answering the questions in the research paper.

Haven’t checked out my tutorial on Web Scraping yet? Let’s get started!

Kindly find the Jupyter Notebook for this tutorial on Github.

Stay Safe <3

Don’t forget to leave your comments, questions and feedback below.
