Mapping the Spread of COVID-19 with Python
The purpose of this post isn’t to add new insight into the spread of the coronavirus - there are plenty of experts out there more qualified. Instead, our goal is to highlight how to construct simple, interactive visualisations using live data such as:
Getting the tools
Folium is a Python library that allows you to create different types of interactive Leaflet maps. Here, we are using the Novel Corona Virus 2019 Dataset to demonstrate how to make a choropleth (map) with a timeslider.
We will be using the following libraries:
- branca.colormap, a utility module for dealing with colourmaps
- folium, a tool to visualize Python data on an interactive Leaflet map
- GeoPandas, an open source project to make working with geospatial data - in Python easier
- NumPy, the fundamental package for scientific computing with Python
- pandas, a fast, powerful, flexible and easy to use data analysis and manipulation tool
import branca.colormap as cm
import folium
import geopandas as gpd
import numpy as np
import pandas as pd
Preparing the data
I’ve stored all my data in a directory called “python”. Using the pandas library, we can read in the data via
corona_df = pd.read_csv("python/covid_19_data.csv")
SNo | ObservationDate | Province.State | Country.Region | Last.Update | Confirmed |
---|---|---|---|---|---|
1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 17:00 | 1 |
2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 17:00 | 14 |
3 | 01/22/2020 | Chongqing | Mainland China | 1/22/2020 17:00 | 6 |
4 | 01/22/2020 | Fujian | Mainland China | 1/22/2020 17:00 | 1 |
5 | 01/22/2020 | Gansu | Mainland China | 1/22/2020 17:00 | 0 |
6 | 01/22/2020 | Guangdong | Mainland China | 1/22/2020 17:00 | 26 |
In order to map our data, we need a shapefile. A shapefile is a geospatial vector data format. The shapefile format can spatially describe vector features such as points, lines, and polygons. In our case, countries are represented as polygons. GeoPandas provides an inbuilt dataset of country shapes ('naturalearth_lowres'
), however this is missing some of the smaller countries that we require, such as Andorra. You can download the shapefiles that we used from here. We can read this file using GeoPandas:
countries = gpd.read_file('python/Countries_WGS84.shp')
The countries
object looks very similar to a standard pandas dataFrame. The only addition is a geometry
column containing (you guessed it) spatial information for that row. We can see here for Aruba, we have a polygon made up of several pairs of longitude and latitude.
X | OBJECTID | CNTRY_NAME | geometry |
---|---|---|---|
0 | 1 | Aruba | POLYGON ((-69.8822326660156 12.4111099243165, -69.9469451904296 12.4366655349731, -70.0590362548828 12.5402078628541, -70.0596618652343 12.6277761459352, -70.0331954956055 12.6183319091797, -69.93223571777339 12.5280551910401, -69.89695739746089 12.4808330535889, -69.8914031982421 12.4722213745117, -69.88555908203131 12.4577770233155, -69.87486267089839 12.4152765274048, -69.8822326660156 12.4111099243165)) |
To be able to join the shapefile and the coronavirus data, we need to do edit some of the information in both files. Unfortunately, this is a rather manual process. First, we need to make sure that the country names match between the data and the shapefiles:
corona_df = corona_df.replace({'Country/Region' :
dict.fromkeys(['Taiwan',
'Mainland China',
'Hong Kong',
'Macau'],
'China')})
corona_df = corona_df.replace({'Country/Region' : 'US'},
'United States')
corona_df = corona_df.replace({'Country/Region' : 'UK'},
'United Kingdom')
corona_df = corona_df.replace({'Country/Region' : 'North Ireland'},
'United Kingdom')
corona_df = corona_df.replace({'Country/Region' : 'Republic of Ireland'},
'Ireland')
corona_df = corona_df.replace({'Country/Region' : 'Vatican City'},
'Italy')
countries = countries.replace({'CNTRY_NAME' : 'Byelarus'},
'Belarus')
countries = countries.replace({'CNTRY_NAME' : 'Macedonia'},
'North Macedonia')
We also need to make sure that the country column has the same name in both files:
countries = countries.rename(columns={'CNTRY_NAME': 'Country/Region'})
Some countries are included in the data despite having zero confirmed cases. So we remove these:
corona_df = corona_df[corona_df.Confirmed != 0]
We then sort our data by country and reset the index:
sorted_df = corona_df.sort_values(['Country/Region',
'ObservationDate']).reset_index(drop=True)
Some countries, such as China, are split into different provinces/states. Since we just want the total number of cases per country, we get the sum for each country at each date:
sum_df = sorted_df.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
Now we can join the data and the shapefile together:
joined_df = sum_df.merge(countries, on='Country/Region')
We are going to plot the log of the number of confirmed cases for each country, as there are a couple of countries, such as China and Italy, with a lot more cases compared to other countries.
joined_df['log_Confirmed'] = np.log10(joined_df['Confirmed'])
We also need to convert the ObservationDate to unix time in nanoseconds:
joined_df['date_sec'] = pd.to_datetime(joined_df['ObservationDate']).astype(int) / 10**9
joined_df['date_sec'] = joined_df['date_sec'].astype(int).astype(str)
We can now select the columns needed for the map and discard the others:
joined_df = joined_df[['Country/Region', 'date_sec', 'log_Confirmed', 'geometry']]
Time to map
A choropleth is a type of map where regions are shaded or patterned proportionally to a data variable. We are going to make a choropleth with a timeslider, to show the spread of COVID-19 over time. The TimeSliderChoropleth
class needs at least two arguments: a GeoJSON file containing the features (in this case, the countries) and a style dictionary. The style dictionary should have the following form:
styledict = {
: {
: {'color': , 'opacity': }
: {'color': , 'opacity': }
...
},
...,
: {
: {'color': , 'opacity': }
: {'color': , 'opacity': }
...
}
}
In this case, the keys are feature (country) ids. So for each country, we have a dictionary where the timestamps are the keys and the values are the colour and opacity of the country at that time.
We have to first initialise the map. Folium allows the use of different map tiles. If we do not specify a map, it defaults to OpenStreetMap. Here, we will use 'cartodbpositron'
:
mymap = folium.Map(tiles='cartodbpositron')
mymap.save(outfile='infinite_scroll.html')
Now we have a map of the world. However, there are a couple of problems: the continents are continually repeated and the map can be panned endlessly from either side. In order to prevent this from happening, we set a minimum zoom and set max_bounds=True
:
mymap_fix_boundary = folium.Map(min_zoom=2, max_bounds=True, tiles='cartodbpositron')
mymap_fix_boundary.save(outfile='fix_boundary.html')
Much better. You might need to change the value of min_zoom
depending on your platform. Now we define a colour map in terms of the log of the number of confirmed cases:
max_colour = max(joined_df['log_Confirmed'])
min_colour = min(joined_df['log_Confirmed'])
cmap = cm.linear.YlOrRd_09.scale(min_colour, max_colour)
joined_df['colour'] = joined_df['log_Confirmed'].map(cmap)
Next, we construct our style dictionary:
country_list = joined_df['Country/Region'].unique().tolist()
country_idx = range(len(country_list))
style_dict = {}
for i in country_idx:
country = country_list[i]
result = joined_df[joined_df['Country/Region'] == country]
inner_dict = {}
for _, r in result.iterrows():
inner_dict[r['date_sec']] = {'color': r['colour'], 'opacity': 0.7}
style_dict[str(i)] = inner_dict
Then we need to make a dataframe containing the features for each country:
countries_df = joined_df[['geometry']]
countries_gdf = gpd.GeoDataFrame(countries_df)
countries_gdf = countries_gdf.drop_duplicates().reset_index()
Finally, we create our map and add a colourbar:
from folium.plugins import TimeSliderChoropleth
slider_map = folium.Map(min_zoom=2, max_bounds=True,tiles='cartodbpositron')
_ = TimeSliderChoropleth(
data=countries_gdf.to_json(),
styledict=style_dict,
).add_to(slider_map)
_ = cmap.add_to(slider_map)
cmap.caption = "Log of number of confirmed cases"
slider_map.save(outfile='TimeSliderChoropleth.html')
By the time this blog post is published, this data will more than likely be out of date. For up to date information on COVID-19 in your area, the World Heath Orginisation and hub.arcgis.com are a great place to start. The Washington Post also produced a great article, using data simulation to show how social distancing is important in tackling COVID-19.