Introduction to Web Scraping with BeautifulSoup, Using a Covid-19 Website

“Data is the new oil”

~ Somebody Wise

One thing I have been learning a lot about over the past few years is the importance of data in our world today. From running a small business to predicting future events, with the right set of data you can attain your desired results. The cool thing about data is that it is always available and always growing, and that alone makes it a field worth considering.

During this ongoing coronavirus pandemic, most of us have wanted to know what is happening around us and have visited websites to get the latest numbers.

For our very first lesson, we will be scraping data on the pandemic from the popular website Worldometers.info. In subsequent posts, we will analyze the data we scraped and present it visually.

What is Web Scraping?

Web scraping is a technique used to collect large amounts of data from a website. The data on the source website is in a human-readable but unstructured format; unstructured data found on websites includes audio, video and photos, just to name a few. This data is scraped and parsed into a structured format in another program, where, for example, machine learning algorithms can be applied to extract meaning from it.

In this tutorial, we are going to learn how to scrape and get data from the web. And as a bonus, we will save the scraped data into a database.

Prerequisite

Before we get started, you should have installed the following packages/software:

Anaconda – this installation comes with almost everything we will need for this task.

We will be working with the Python programming language for the purpose of this tutorial. There are other tools available in other languages like JavaScript.

Before we get into any installations, create a virtual environment.

python -m venv webscrape
source webscrape/bin/activate
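If you are on Windows, the activation command is webscrape\Scripts\activate instead.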

Now let’s install the other tools that are not included in Anaconda. You can do this by running the following commands inside the virtual environment you just created.

pip install beautifulsoup4
pip install lxml
pip install requests
pip install psycopg2
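A quick note: if psycopg2 fails to install (it compiles against the PostgreSQL client libraries), the pre-built psycopg2-binary package is a common drop-in alternative:

pip install psycopg2-binary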

Web scraping

Now that we have all our installations, we can get right into scraping our data from the website.

Start your Jupyter Notebook and create a new Python 3 notebook.

Then go ahead and add these lines of code.

In [1]:

These lines import all the libraries we need:

import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:

These lines fetch the HTML page of the website (and echo the URL we requested):

results = requests.get('https://www.worldometers.info/coronavirus/')
results.url
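Before parsing anything, it is good practice to confirm the request actually succeeded. This little check is my addition, not part of the original notebook:

results.raise_for_status()   # raises an exception if the request failed (e.g. a 404 or 500)
print(results.status_code)   # 200 means everything went fine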

In [3]:

The HTML content needs to be parsed into Python objects (since we’ll be working in Python). This can be done with the BeautifulSoup constructor, which accepts the HTML content and ‘lxml’ (the parser) as arguments. It transforms the objects in the HTML content (such as tags, navigable strings, and comments) into a tree of Python objects.

soup = BeautifulSoup(results.content, 'lxml')
soup
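Once you have the soup object, you can poke around the parse tree directly. Here is a quick sanity check (my addition) to confirm we grabbed the right page:

print(soup.title.text)               # the text of the page's <title> tag
print(len(soup.find_all('table')))   # how many tables the page contains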

In [4]:

table = soup.table          # one way to retrieve the first table


table = soup.find('table')  # or use this equivalent approach

table
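If a page has several tables, you can also target one by its attributes. For illustration, I am assuming the main Worldometers table uses the id ‘main_table_countries_today’; check the page source and substitute the actual id if it differs:

table = soup.find('table', id='main_table_countries_today')  # assumed id -- verify in the page source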

We will be going through two ways to scrape a table from a website.

1. Using Prettify

Since we are scraping tables, using prettify() is the easiest way to go, and it’s “pretty” straightforward, pun intended. prettify() renders the table as a neatly formatted HTML string, and pandas’ read_html() parses every table in that string and returns a list of DataFrames, which is why we select the first element below. With just a few lines of code, you end up with a beautiful DataFrame to get you started on your analysis.

In [5]:

df = pd.read_html(table.prettify())  # returns a list of DataFrames, one per table found
df = df[0]                           # select the first (and only) table
df

Out [5]: (the scraped table rendered as a pandas DataFrame)
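As a side note, read_html() can also parse the raw page directly, skipping BeautifulSoup entirely. This shortcut is not the approach we use in this post, but it is worth knowing:

df_list = pd.read_html(results.text)  # parses every table found on the page
df = df_list[0]                       # select the first table (verify it is the one you want)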

2. Using Conditionals

This approach is a bit dicey in the sense that, as the table on the website is modified, the current structure may not hold. I have been monitoring this website closely for some time now, and it initially didn’t have certain tables and columns it currently has; for example, the section for the different continents.

Hence, the downside of this approach compared to the Prettify one is that any modification to the structure of the tables on the website needs to be catered for here in our script as well.

In [6]:

table_rows = table.find_all('tr')  # find all the <tr> (table row) tags in the table
table_rows

In [7]:

rows = []                      # declare an empty list to collect the rows
for tr in table_rows:
    td = tr.find_all('td')     # find all the <td> (table data) tags in this row
    row = [i.text for i in td] # iterate over the table data and retrieve the text
    print(row)
    rows.append(row)           # append each row to the rows list

Out [7]: (each row printed as a list of cell values)
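Notice that the loop only reads <td> cells, so the header row (which uses <th> tags) comes through as an empty list. That is why we hard-code the column names below and drop the leading rows afterwards. If you would rather pull the headers from the page itself, a sketch like this (my addition, assuming the first <tr> holds the headers) would do it:

header_row = [th.text.strip() for th in table.find('tr').find_all('th')]
print(header_row)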

Next, we create a DataFrame, passing in each of the column header names:

In [8]:

df = pd.DataFrame(rows, columns=["Country,Other", "TotalCases", "NewCases", "TotalDeaths", "NewDeaths", "TotalRecovered", "ActiveCases", "Serious,Critical", "TotalCases/1M pop", "Deaths/1M pop", "Reported1st case", "Tests", "Continent"])

df

Out [8]: (the DataFrame with named columns)

In [9]:

df.drop(df.index[:8], inplace=True)  # drop the rows at positions 0 to 7: the empty header row and the continent totals at the top of the table, which we don't need now

df

Out [9]: (the DataFrame with the leading continent rows removed)

This approach is a bit messier than the Prettify one. When you check the last 10 records in the DataFrame, you realize it also contains the collated totals for the respective continents at the bottom.

df.tail(10)  # list the last 10 records in the DataFrame

And we need to get rid of them.

df.drop(df.index[213:221], inplace=True)  # drop the continent-total rows at the specified positions; these will shift as the site changes, so verify them first
df.tail(10)
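Two optional clean-up steps, both my additions, can make the upcoming analysis easier: renumber the index after all that dropping, and convert the count columns (which are still strings containing commas) into real numbers.

df.reset_index(drop=True, inplace=True)  # renumber the index from 0 after the drops

# turn text like "1,234,567" into a number; repeat for the other count columns
df['TotalCases'] = pd.to_numeric(df['TotalCases'].str.replace(',', ''), errors='coerce')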

Bonus

Let’s save the data we have into a database. This assumes you have a PostgreSQL server running locally; adjust the connection string below to match your own credentials.

from sqlalchemy import create_engine
import psycopg2


engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')  #postgresql://username:password@host:port/database

engine

df.to_sql('covid19', con=engine, if_exists='replace')  # save the DataFrame to a table named covid19, replacing it if it already exists

Read the saved data from the database.

data = pd.read_sql('SELECT * FROM covid19', engine)  # read the data we saved to the database
data
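Since the data now lives in a real database, you can also filter it with SQL. Here is a small illustrative query (my addition; note that our mixed-case column names need double quotes in PostgreSQL):

africa = pd.read_sql("""SELECT * FROM covid19 WHERE "Continent" = 'Africa'""", engine)
africa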

You can find the Jupyter Notebook file in this GitHub repo.

In subsequent posts, we will be heading into some Exploratory Data Analysis.

I have done other web scraping projects similar to this one, on the Tonaton and Meqasa websites. If you want me to cover those in subsequent posts, kindly indicate it in the comment section below.

Thank you very much and I hope you are keeping yourself safe during this pandemic.
