Analysing COVID-19 Data with Pandas


In this tutorial, we will analyse COVID-19 data with Pandas using a Jupyter notebook. We will be using data from covid19india.org.

Prerequisites:

  1. Hands-on experience with Python.
  2. Pandas installed on your system.

Getting our Dataset:

The first step is to get our dataset from https://api.covid19india.org/raw_data3.json. Navigate to this URL and download the data, saving it as rawdata.json (the file name the code below expects). The records are located under the raw_data key, which holds a list of dictionaries (an array of objects). Each dictionary has the following format:

 

{
	"agebracket": "",
	"contractedfromwhichpatientsuspected": "",
	"currentstatus": "Hospitalized",
	"dateannounced": "27/04/2020",
	"detectedcity": "",
	"detecteddistrict": "",
	"detectedstate": "West Bengal",
	"entryid": "1",
	"gender": "",
	"nationality": "",
	"notes": "Details awaited",
	"numcases": "38",
	"patientnumber": "27892",
	"source1": "mohfw.gov.in",
	"source2": "",
	"source3": "",
	"statecode": "WB",
	"statepatientnumber": "",
	"statuschangedate": "",
	"typeoftransmission": ""
},
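To make the nesting concrete, here is a minimal sketch that parses a trimmed-down payload of the same shape. The sample_text string is an assumption built from the record above with most fields dropped; it is not the full API response.

```python
import json

# Hypothetical, trimmed-down payload with the same shape as the API response:
# a top-level "raw_data" key holding a list of records.
sample_text = """
{
  "raw_data": [
    {"currentstatus": "Hospitalized", "dateannounced": "27/04/2020",
     "detectedstate": "West Bengal", "numcases": "38"}
  ]
}
"""

payload = json.loads(sample_text)
records = payload["raw_data"]  # list of dicts, one per entry

print(type(records).__name__)                 # container type of the records
print(sorted(records[0].keys()))              # field names of the first record
print(type(records[0]["numcases"]).__name__)  # numeric fields arrive as strings
```

Note that even numeric fields such as numcases are delivered as strings, which matters later when we start summing them.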

Constructing our DataFrame:

Since our data sits under the raw_data key, the structure is nested, and we will normalize it into a flat one. The following code snippets are run in a Jupyter notebook, and at the end of this tutorial I will add a link to download the notebook.

Here we will construct our DataFrame from a Python dict, using the from_dict method of the DataFrame class.

import json
import pandas as pd

with open('rawdata.json', 'r') as json_file:
    data = json.load(json_file)

df = pd.DataFrame.from_dict(data)
df.head()

df.head() will give us the following output:

Output of df.head()

We can see that our DataFrame has been created, but it is not in the correct format: each row holds a whole dict object, whereas we want the dict fields as columns. To achieve that, we have to extract the data from the raw_data key. Pandas has a function, json_normalize, that flattens such nested structures into columns.

import json
import pandas as pd

with open('rawdata.json', 'r') as json_file:
    data = json.load(json_file)

# json_normalize is available at the top level since pandas 1.0;
# the older pandas.io.json import path is deprecated.
df = pd.json_normalize(data['raw_data'])
df.head()

Running df.head() will give us the first 5 rows of our DataFrame.

output after json_normalize()
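The difference between the two approaches can be reproduced without the downloaded file. The sketch below uses a hypothetical two-record sample shaped like the API response, and assumes pandas 1.0 or later, where json_normalize is exposed as pd.json_normalize.

```python
import pandas as pd

# Hypothetical two-record sample shaped like the API response.
sample = {
    "raw_data": [
        {"detectedstate": "West Bengal", "currentstatus": "Hospitalized", "numcases": "38"},
        {"detectedstate": "Kerala", "currentstatus": "Recovered", "numcases": "5"},
    ]
}

# Building the frame from the outer dict keeps each record as one dict cell.
nested_df = pd.DataFrame.from_dict(sample)
print(nested_df.columns.tolist())  # a single 'raw_data' column

# Normalizing the inner list spreads the record fields into columns.
flat_df = pd.json_normalize(sample["raw_data"])
print(flat_df.columns.tolist())
print(flat_df.shape)
```

The first frame has one column of dicts; the second has one column per field, which is the layout we need for grouping.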

Now our DataFrame renders the data in the correct format. We can get the same output by fetching the data directly from the URL instead of downloading it to our local machine: the read_json function reads JSON from a URL, and we pass its raw_data column to json_normalize.

import pandas as pd

# read_json returns a DataFrame whose raw_data column holds one dict per row
data = pd.read_json('https://api.covid19india.org/raw_data3.json')
df = pd.json_normalize(data['raw_data'])
df.head()

We can get the total number of records by running df.shape, which returned (10008, 20) at the time of writing. Since this is a live database, you may get a different number of records. df.shape is a tuple of length 2: the first element is the number of rows and the second is the number of columns. So right now there are 10008 rows and 20 columns in our DataFrame.
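As a small illustration of how the shape tuple can be used, here is a sketch on a hypothetical three-row frame (the real DataFrame is far larger):

```python
import pandas as pd

# Hypothetical small frame standing in for the real dataset.
demo_df = pd.DataFrame(
    {"detectedstate": ["Kerala", "Goa", "Delhi"], "numcases": ["1", "2", "3"]}
)

# shape is an attribute (no parentheses) and unpacks into rows and columns.
n_rows, n_cols = demo_df.shape
print(n_rows, n_cols)

# len() on a DataFrame gives the row count only.
print(len(demo_df))
```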


Grouping data State-wise:

To get the total number of cases for a state, we group our data by state name; the column holding the state is detectedstate. With the right column identified, we can group the data with the groupby method and sum the number of cases for each state. To achieve that, we execute:

df.groupby('detectedstate')['numcases'].sum()

The above command returns:

 

This is not the output we wanted: we tried to sum the number of cases for each state, but the result is incorrect. Let's debug the issue by checking the datatype of the values in the numcases column. We can simply check the type of the value in the first row.

type(df['numcases'][0])

It returned str, so now it makes sense why we got an incorrect result. When the datatype is str, adding two strings returns a new string ('1' + '1' = '11'), i.e. string concatenation. We need to convert those strings to numbers, and for that we will use the to_numeric function.

df['numcases'] = pd.to_numeric(df['numcases'])

If we run type(df['numcases'][0]) again, we get numpy.float64, which indicates that we can now run mathematical operations on the column. Running df.groupby('detectedstate')['numcases'].sum() again gives the following output.

Grouping data state-wise
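The string-versus-number behaviour described above can be reproduced on a tiny frame. This is a sketch on hypothetical sample rows, one of which has an empty numcases value; errors="coerce" is an addition not used in the tutorial, included here so that unparseable values explicitly become NaN. A column containing NaN is stored as float64, which is one plausible reason the real column ends up as floats rather than ints.

```python
import pandas as pd

# Hypothetical sample; numcases arrives as strings, one of them empty.
demo_df = pd.DataFrame(
    {
        "detectedstate": ["Kerala", "Kerala", "Goa", "Goa"],
        "numcases": ["38", "12", "7", ""],
    }
).astype({"numcases": object})

# Summing strings concatenates them instead of adding them.
print(demo_df.groupby("detectedstate")["numcases"].sum())

# Convert to numbers; errors="coerce" turns unparseable values into NaN.
demo_df["numcases"] = pd.to_numeric(demo_df["numcases"], errors="coerce")
print(demo_df["numcases"].dtype)

# With a numeric column, sum() adds the values and skips NaN.
totals = demo_df.groupby("detectedstate")["numcases"].sum()
print(totals)
```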

The output has the desired result, but it is not sorted. To sort it, we can call sort_values() after grouping our data. We pass ascending=False to sort_values so that the totals are in descending order.

df.groupby('detectedstate')['numcases'].sum().sort_values(ascending=False)

This gives us the states sorted by total number of cases in descending order:

Sorted state-wise data
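If only the top few states are of interest, Series.nlargest is an alternative to sorting the whole result. The sketch below uses made-up totals with placeholder state names, not figures from the dataset.

```python
import pandas as pd

# Made-up totals with placeholder names, standing in for the grouped result.
totals = pd.Series({"State A": 70.0, "State B": 990.0, "State C": 450.0}, name="numcases")

ranked = totals.sort_values(ascending=False)  # full ranking, largest first
top_two = totals.nlargest(2)                  # only the two largest entries

print(ranked)
print(top_two)
```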

Getting percentage of current status:

Now we will compute the percentage of "Hospitalized", "Recovered" and "Deceased" cases. In our dataset, the status is stored in the currentstatus column, so we group our data by that column.

df.groupby('currentstatus')['numcases'].sum().sort_values(ascending=False).to_dict()

We have used to_dict() to get the output in dict format. The output of the above command will be:

{'Hospitalized': 34977.0, 'Recovered': 12814.0, 'Deceased': 1220.0}

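We will store the output in patient_stats_dict and use it to create a new DataFrame, which we will call stats_df. We also name its columns so that we can refer to them later; patient_count is the name used below, while status is simply our choice of label for the first column.

patient_stats_dict = df.groupby('currentstatus')['numcases'].sum().sort_values(ascending=False).to_dict()
stats_df = pd.DataFrame(patient_stats_dict.items(), columns=['status', 'patient_count'])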

Running stats_df.head() will show our new DataFrame:

output of stats_df

Now we will add a new column, percentage, to our newly created DataFrame; it holds each status count as a percentage of the total. To add the new column, we use the following command:

stats_df['percentage'] = (stats_df['patient_count'] * 100) / stats_df['patient_count'].sum()

Here we added a new column whose value is derived from the patient_count column of each row. It is a simple mathematical operation: multiply the current value by 100, then divide by the total. The total number of patients comes from stats_df['patient_count'].sum(). If we run stats_df.head(), we will get our desired output.

stats_df with percentage
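The same percentages can also be computed on a Series, without building an intermediate DataFrame. This is a sketch of an alternative route rather than the method used above; it reuses the totals from the to_dict() output shown earlier.

```python
import pandas as pd

# Totals taken from the to_dict() output shown earlier in this post.
patient_stats_dict = {"Hospitalized": 34977.0, "Recovered": 12814.0, "Deceased": 1220.0}

counts = pd.Series(patient_stats_dict, name="patient_count")

# Each status as a share of the total, expressed as a percentage.
percentage = counts * 100 / counts.sum()

print(percentage.round(2))
```

Because the division is applied element-wise, the shares always add up to 100, which makes a handy sanity check.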

 

These were some of the operations we can perform on the data from the API. Other operations are possible as well, such as getting date-wise counts or the data for a particular week. The purpose of this tutorial was to help you get started with Pandas.
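As a starting point for the date-wise count mentioned above, here is a minimal sketch on hypothetical sample rows. It assumes the dateannounced column uses the day/month/year format seen in the sample record ("27/04/2020") and that numcases has already been converted to numbers.

```python
import pandas as pd

# Hypothetical sample rows; dates use the day/month/year format from the dataset.
demo_df = pd.DataFrame(
    {
        "dateannounced": ["27/04/2020", "27/04/2020", "28/04/2020"],
        "numcases": [38, 2, 5],
    }
)

# Parse the strings into real dates so they sort chronologically.
demo_df["dateannounced"] = pd.to_datetime(demo_df["dateannounced"], format="%d/%m/%Y")

# Total number of cases announced on each date.
daily = demo_df.groupby("dateannounced")["numcases"].sum()
print(daily)
```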

 

Bottom Line

We saw some basic uses of pandas and its capability to manipulate data. You can find the code snippet on GitHub. What do you think we should add in our next post? If you liked this post, I'd be very grateful if you'd share it with a friend or on social media. Feel free to provide your feedback. Thank you!

 
