“Data is the new oil”
~ Somebody Wise
One thing I have been learning a lot about over the past few years is the importance of data in our world today. From running a small business to predicting future events, with the right set of data you can attain your desired results. The cool thing about data is that it is always available and always growing, and that alone makes it a field worth considering.
During this ongoing coronavirus pandemic, most of us have wanted to know what is happening around us and have visited websites to get the latest numbers.
For our very first lesson, we will be scraping data on the pandemic from the popular website Worldometers.info. In subsequent posts, we will analyze the data we scraped and present it in visual form.
What is Web Scraping?
Web scraping is a technique used to collect large amounts of data from websites. The data on the source website is in a human-readable but unstructured format; unstructured data found on websites includes audio, video and photos, just to name a few. This data is scraped and parsed into a structured format in another program, where machine learning algorithms can be applied to extract meaning from it.
In this tutorial, we are going to learn how to scrape and get data from the web. And as a bonus, we will save the scraped data into a database.
Prerequisites
Before we get started, you should have installed the following packages/software:
Anaconda – this installation comes with almost everything we will need for this task.
We will be working with the Python programming language for the purpose of this tutorial. There are other tools available in other languages like JavaScript.
Before we get into any installations, create a virtual environment.
python -m venv webscrape
source webscrape/bin/activate
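If you are on Windows, the activation command is slightly different:

webscrape\Scripts\activate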
Now let’s install the other tools that are not included in Anaconda. You can do this by running the following commands in the virtual environment created.
pip install beautifulsoup4
pip install lxml
pip install requests
pip install psycopg2
Web scraping
Now that we have all our installations, we can get right into scraping our data from the website.
Start your Jupyter Notebook and create a new Python 3 notebook.
Then go ahead and add these lines of code.
In [1] :
These lines import all the libraries we need
import re
import datetime
import requests
import pandas as pd
from bs4 import BeautifulSoup
In [2] :
These lines fetch the HTML page of the website
results = requests.get('https://www.worldometers.info/coronavirus/')
results.url
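Before parsing, it is worth confirming that the request actually succeeded. This check is not in the original notebook, but it is a useful habit:

results.raise_for_status()  # raises an HTTPError if the server returned a 4xx/5xx status
print(results.status_code)  # 200 means the page was fetched successfully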
In [3] :
The HTML content needs to be parsed into Python objects (since we’ll be working in Python). This can be done with BeautifulSoup, which accepts the HTML content and ‘lxml’ (the parser) as parameters. It transforms the HTML objects in the content (such as tags, navigable strings, or comments) into Python objects.
soup = BeautifulSoup(results.content, 'lxml')
soup
In [4] :
table = soup.table          # one way to retrieve the table
table = soup.find('table')  # or you can use this approach
table
We will be going through two ways to scrape a table from a website.
1. Using Prettify
Since we are scraping tables, using Prettify is the easiest way to go and it’s “pretty” straightforward, pun intended. With just a few lines of code, you end up getting a beautiful dataframe to get you started with your analysis.
In [5] :
df = pd.read_html(table.prettify())
df = df[0]
df
Out [5]:
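As a side note (not part of the original walkthrough), pandas can also parse the raw HTML of the whole page directly and return every table it finds. Assuming the main statistics table is the first one on the page, the sketch below would give the same starting point:

all_tables = pd.read_html(results.text)  # parse every <table> on the page
df_alt = all_tables[0]                   # assumption: the main statistics table is the first one found
df_alt.head()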
2. Using Conditionals
This approach is a bit dicey in the sense that, as the table on the website is modified, the current structure may not hold. I have been monitoring this website closely for some time now, and it initially didn’t have certain tables and columns it currently has. For example, the section for all the different continents.
Hence, the downside of this approach compared to the Prettify approach is that, if there is any modification to the structure of the tables on the website, the same modification needs to be catered for here in our script.
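One way to soften that downside (a sketch of my own, not part of the original notebook) is to read the column names from the table’s own header row instead of hardcoding them, so a renamed or added column is at least visible:

header_cells = table.find('tr').find_all('th')  # assumption: the first row of the table holds the <th> header cells
column_names = [th.get_text(strip=True) for th in header_cells]
print(column_names)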
In [6] :
table_rows = table.find_all('tr') #find all the table rows in the table
table_rows
In [7]:
headers = [] #declare an empty headers variable
for tr in table_rows:
    td = tr.find_all('td') #find all the table data in the table row tag
    row = [i.text for i in td] #iterate over the table data and retrieve the "text"
    print(row)
    headers.append(row) #append all the rows to the headers variable declared
Out [7]:
We create a dataframe and pass in the names of the column headers
In [8]:
df = pd.DataFrame(headers, columns=["Country,Other", "TotalCases", "NewCases", "TotalDeaths",
                                    "NewDeaths", "TotalRecovered", "ActiveCases", "Serious,Critical",
                                    "TotalCases/1M pop", "Deaths/1M pop", "Reported1st case",
                                    "Tests", "Continent"])
df
Out [8]:
In [9]:
df.drop(df.index[:8], inplace=True) #drop records from index 0 to index 7. These are the records of the continents in the table, which we don't need now
df
Out[9]:
This approach is a bit all over the place compared to the Prettify approach. When you check the last 10 records in the dataframe, you realize that it also contains the collated totals for the respective continents.
df.tail(10) #list the last 10 records in the dataframe
And we need to get rid of them.
df.drop(df.index[213:221], inplace=True) #drop records from the index specified
df.tail(10)
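The fixed index range above is brittle for the same reason mentioned earlier. A more defensive sketch (assuming the summary rows carry these labels in the Country,Other column) would drop them by name instead:

summary_labels = {'Asia', 'Europe', 'North America', 'South America',
                  'Africa', 'Oceania', 'World', 'Total:'}  # assumed labels of the summary rows
df = df[~df['Country,Other'].str.strip().isin(summary_labels)]
df.tail(10)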
Bonus
Let’s save the data we have into a database.
from sqlalchemy import create_engine
import psycopg2

engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres') #postgresql://username:password@host:port/database
engine
df.to_sql('covid19', con=engine, if_exists='replace') #save the dataframe to the database
Read the saved data from the database.
data = pd.read_sql('SELECT * FROM covid19', engine) #read the data we saved to the database
data
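If you only need part of the saved table, you can also push the selection into the SQL query itself. A small sketch (assuming the covid19 table was saved as above; note the double quotes PostgreSQL needs around mixed-case column names):

subset = pd.read_sql('SELECT "Country,Other", "TotalCases" FROM covid19', engine)
subset.head()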
You can find the Jupyter Notebook file in this GitHub repo.
In subsequent posts, we will be heading into some Exploratory Data Analysis.
I have done other web scraping projects similar to this one, on the Tonaton and Meqasa websites. If you want me to cover those in subsequent posts, kindly indicate it in the comments section below.
Thank you very much and I hope you are keeping yourself safe during this pandemic.