How to Work With Johns Hopkins University COVID-19 Data in MongoDB Atlas
Aaron Bassett, Joe Karlsson, Mark Smith, Maxime Beugnet • 8 min read • Published Feb 17, 2022 • Updated Sep 09, 2024
Our MongoDB cluster is running version 7.0.3.
You can connect to it using MongoDB Compass, the Mongo Shell, SQL or any MongoDB driver supporting at least MongoDB 7.0
with the following URI:
mongodb+srv://readonly:readonly@covid-19.hip2i.mongodb.net/covid19
readonly is both the username and the password; they are not meant to be replaced.
News
- Johns Hopkins University (JHU) stopped collecting data on March 10th, 2023.
- First data entry is 2020-01-22, last one is 2023-03-09.
- Cluster now running on 7.0.3
- Removed the database covid19jhu with the raw data. Use the much better database covid19.
- BI Tools access is now disabled.
- Upgraded the cluster to 4.4.
- Improved the Python data import script to calculate the daily values using the existing cumulative values with an Aggregation Pipeline.
  - confirmed_daily.
  - deaths_daily.
  - recovered_daily.
- Renamed the field "city" to "county" and "cities" to "counties" where appropriate. They contain the data from the column "Admin2" in JHU CSVs.
- The covid19.statistics collection is renamed covid19.global_and_us for more clarity.
- The dataset is updated hourly so any commit done by JHU will be reflected at most one hour later in our cluster.
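The daily fields mentioned in the changelog above (confirmed_daily, deaths_daily and recovered_daily) are derived from the cumulative counts. The import script does this with an Aggregation Pipeline that is not reproduced here; the plain-Python sketch below only illustrates the arithmetic, assuming that the first day's daily value equals its cumulative value. The function name is hypothetical.

```python
from typing import List


def daily_from_cumulative(cumulative: List[int]) -> List[int]:
    """Turn a cumulative series into day-by-day differences.

    Assumption: the first entry has no predecessor, so its daily value
    is the cumulative value itself.
    """
    daily = []
    previous = 0
    for total in cumulative:
        daily.append(total - previous)
        previous = total
    return daily


# Hypothetical cumulative counts for four consecutive days.
print(daily_from_cumulative([1, 3, 6, 6]))  # [1, 2, 3, 0]
```

Because the values in this sketch are simple differences, a downward correction in a cumulative series would show up as a negative daily value.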
As the COVID-19 pandemic has swept the globe, the work of JHU (Johns Hopkins University) and
its COVID-19 dashboard has become vitally important in keeping people informed
about the progress of the virus in their communities, in their countries, and in the world.
JHU not only publishes their dashboard,
but they make the data powering it freely available for anyone to use.
However, their data is delivered as flat CSV files, which you need to download again each time you want to query them. We set out to
make that up-to-date data more accessible so people can build other analyses and applications directly on top of the
data set.
We are now hosting a service with a frequently updated copy of the JHU data in MongoDB Atlas, our database in the cloud.
This data is free for anyone to query using the MongoDB Query language and/or SQL. We also support
a variety of BI tools directly, so you can query the data with Tableau,
Qlik and Excel.
With the MongoDB COVID-19 dataset there will be no more manual downloads and no more frequent format changes. This
service delivers a consistent JSON and SQL view every day with no
downstream ETL required.
None of the actual data is modified. It is simply structured to make it easier to query by placing it within
a MongoDB Atlas cluster and by creating some convenient APIs.
All the data we use to create the MongoDB COVID-19 dataset comes from the JHU dataset. JHU,
in turn, draws on the following sources:
- the World Health Organization,
- the National Health Commission of the People's Republic of China,
- the United States Centers for Disease Control and Prevention,
- the Australian Government Department of Health,
- the European Centre for Disease Prevention and Control,
- and many others.
Using the CSV files they provide, we produce the covid19 database in our cluster (the covid19jhu database with the raw data has been removed).
covid19 contains the same dataset, but with a clean MongoDB schema design that follows the good practices we recommend.
Here is an example of a document in the covid19 database:
{
  "_id" : ObjectId("5e957bfcbd78b2f11ba349bf"),
  "uid" : 312,
  "country_iso2" : "GP",
  "country_iso3" : "GLP",
  "country_code" : 312,
  "state" : "Guadeloupe",
  "country" : "France",
  "combined_name" : "Guadeloupe, France",
  "population" : 400127,
  "loc" : {
    "type" : "Point",
    "coordinates" : [ -61.551, 16.265 ]
  },
  "date" : ISODate("2020-04-13T00:00:00Z"),
  "confirmed" : 143,
  "deaths" : 8,
  "recovered" : 67
}
The document above was obtained by joining together the file UID_ISO_FIPS_LookUp_Table.csv and the CSV time series
files you can find in this folder.
Some fields might not exist in all the documents because they are not relevant or are just not provided
by JHU. If you want more details, run a schema analysis
with MongoDB Compass on the different collections available.
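Because fields can be missing, client code should not assume that every key is present. Here is a minimal sketch of defensive access in Python, using the field names from the example document above. The helper name per_100k is hypothetical, and returning None for missing inputs is an assumption made for this example, not something the dataset prescribes.

```python
from typing import Any, Dict, Optional


def per_100k(doc: Dict[str, Any], field: str = "confirmed") -> Optional[float]:
    """Return doc[field] per 100,000 inhabitants, or None if it cannot be computed."""
    value = doc.get(field)
    population = doc.get("population")
    # Either key may be absent (or the population may be 0) for some places.
    if value is None or not population:
        return None
    return value / population * 100_000


# Trimmed copy of the example document shown above.
guadeloupe = {"combined_name": "Guadeloupe, France", "population": 400127, "confirmed": 143}
# Hypothetical document without a "population" field.
no_population = {"combined_name": "Somewhere", "confirmed": 10}

print(per_100k(guadeloupe))     # a float, roughly 35.7
print(per_100k(no_population))  # None
```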
If you prefer to host the data yourself, the scripts required to download and transform the JHU data are
open-source. You
can view them and instructions for how to use them on our GitHub repository.
In the covid19 database, you will find 5 collections which are detailed in
our GitHub repository README.md file.
- metadata
- global (the data from the time series global files)
- us_only (the data from the time series US files)
- global_and_us (the most complete one)
- countries_summary (same as global but countries are grouped in a single doc for each date)
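To make the idea behind countries_summary concrete, here is a sketch of an aggregation pipeline that groups documents by country and date and sums their counters. This is an illustration only: the function name is hypothetical, and the actual pipeline and output fields used to build countries_summary may differ (the README.md file describes the real schema).

```python
from typing import Any, Dict, List


def group_by_country_and_date() -> List[Dict[str, Any]]:
    """Build a pipeline producing one document per (country, date) pair."""
    return [
        {
            "$group": {
                # One output document for each distinct country/date combination.
                "_id": {"country": "$country", "date": "$date"},
                "confirmed": {"$sum": "$confirmed"},
                "deaths": {"$sum": "$deaths"},
                "recovered": {"$sum": "$recovered"},
            }
        },
        # Oldest dates first, countries in alphabetical order within a date.
        {"$sort": {"_id.date": 1, "_id.country": 1}},
    ]
```

With pymongo, the returned list can be passed to Collection.aggregate, for example db["global"].aggregate(group_by_country_and_date()). The bracket syntax is needed because global is a reserved word in Python.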
You can begin exploring the data right away without any MongoDB or programming experience
using MongoDB Charts
or MongoDB Compass.
In the following sections, we will also show you how to consume this dataset using the Java, Node.js and Python drivers.
We will show you how to perform the following queries in each language:
- Retrieve the last 5 days of data for a given place,
- Retrieve all the data for the last day,
- Make a geospatial query to retrieve data within a certain distance of a given place.
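As a rough preview, the three queries listed above could be sketched in Python as follows. The sketch assumes a pymongo collection handle for covid19.global_and_us and the field names from the example document (country, state, date, loc). The function names are hypothetical, and using $geoWithin with $centerSphere for the distance query is one possible choice, not necessarily the one used in the original examples.

```python
from typing import Any, Dict, List, Optional

EARTH_RADIUS_KM = 6378.1  # converts kilometres to radians for $centerSphere


def place_filter(country: str, state: Optional[str] = None) -> Dict[str, Any]:
    """Filter matching one place by country and, optionally, state."""
    query: Dict[str, Any] = {"country": country}
    if state is not None:
        query["state"] = state
    return query


def within_km_filter(longitude: float, latitude: float, distance_km: float) -> Dict[str, Any]:
    """Filter matching documents whose "loc" lies within distance_km of a point."""
    radius_in_radians = distance_km / EARTH_RADIUS_KM
    return {"loc": {"$geoWithin": {"$centerSphere": [[longitude, latitude], radius_in_radians]}}}


def run_queries(collection) -> Dict[str, List[Dict[str, Any]]]:
    """Run the three queries against a pymongo collection such as covid19.global_and_us."""
    # 1. The 5 most recent documents for one place (-1 sorts newest first).
    last_five = list(
        collection.find(place_filter("France", "Guadeloupe")).sort("date", -1).limit(5)
    )

    # 2. All documents for the most recent date present in the collection.
    newest = collection.find_one(sort=[("date", -1)])
    last_day = list(collection.find({"date": newest["date"]}))

    # 3. Documents for that same date located within 500 km of Guadeloupe.
    nearby_filter = within_km_filter(-61.551, 16.265, 500)
    nearby_filter["date"] = newest["date"]
    nearby = list(collection.find(nearby_filter))

    return {"last_five": last_five, "last_day": last_day, "nearby": nearby}
```

The filter builders are kept separate from run_queries so they can be inspected or reused without a database connection.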
With Charts, you can create visualisations of the data using any of the
pre-built graphs and charts. You can
then arrange this into a unique dashboard,
or embed the charts in your pages or blogs.
If you want to create your own MongoDB Charts dashboard, you will need to set up your
own free MongoDB Atlas cluster and import the dataset into it, using either
the import scripts, mongoexport & mongoimport, or mongodump & mongorestore. See this section for more
details: Take a copy of the data.
Compass allows you to dig deeper into the data using
the MongoDB Query Language or via
the Aggregation Pipeline visual editor. Perform a range of
operations on the
data, including mathematical, comparison and grouping operations.
Create documents that provide unique insights and interpretations. You can use the output from your pipelines
as data sources for your Charts.
For MongoDB Compass or your driver, you can use this connection string.
mongodb+srv://readonly:readonly@covid-19.hip2i.mongodb.net/covid19
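For a driver, a minimal Python sketch of using this connection string might look like the following. It assumes pymongo is installed; the constant and function names are hypothetical, and the 5-second server selection timeout is an arbitrary choice for the example.

```python
from typing import Dict, List, Optional
from urllib.parse import urlsplit

COVID19_URI = "mongodb+srv://readonly:readonly@covid-19.hip2i.mongodb.net/covid19"


def describe_uri(uri: str) -> Dict[str, Optional[str]]:
    """Split a connection string into the parts discussed in this article."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "database": parts.path.lstrip("/"),
    }


def list_collections(uri: str = COVID19_URI) -> List[str]:
    """Connect with pymongo and return the collection names of the URI's database."""
    # Imported lazily so describe_uri stays usable without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    try:
        # get_default_database() returns the database named in the URI (covid19).
        return sorted(client.get_default_database().list_collection_names())
    finally:
        client.close()
```

Calling list_collections() requires network access. mongodb+srv URIs rely on DNS SRV lookups, which pymongo performs through the dnspython package; depending on the pymongo version, this may require installing the pymongo[srv] extra.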
Because we store the data in MongoDB, you can also access it via
the MongoDB Shell or
using any of our drivers. We've limited access to these collections to 'read-only'.
You can find the connection strings for the shell and Compass below, as well as driver examples
for Java, Node.js,
and Python to get you started.
mongosh "mongodb+srv://covid-19.hip2i.mongodb.net/covid19" --username readonly --password readonly
The sample code shows how to install pymongo and use it to connect to the MongoDB COVID-19 dataset. There are some
example queries which show how to query the data and display it in the notebook, and the last example demonstrates how
to display a chart using Pandas & Matplotlib!
If you want to modify the notebook, you can take a copy by selecting "Save a copy in Drive ..." from the "File" menu,
and then you'll be free to edit the copy.
You can get lots of value from the dataset without any programming at all. We've enabled
the Atlas BI Connector (not anymore, see News section), which exposes
an SQL interface to MongoDB's document structure. This means you can use data analysis and dashboarding tools
like Tableau, Qlik Sense,
and even MySQL Workbench to analyze, visualise and extract understanding
from the data.
Here's an example of a visualisation produced in a few clicks with Tableau:
Tableau is a powerful data visualisation and dashboard tool, and can be connected to our COVID-19 data in a few steps.
We've written a short tutorial
to get you up and running.
As mentioned above, the Atlas BI Connector is activated (not anymore, see News section), so you can
connect any SQL tool to this cluster using the following connection information:
- Server: covid-19-biconnector.hip2i.mongodb.net,
- Port: 27015,
- Database: covid19,
- Username: readonly or readonly?source=admin,
- Password: readonly.
Accessing our copy of this data in a read-only database is useful, but it won't be enough if you want to integrate it
with other data within a single MongoDB cluster. You can obtain a copy of the database, either to use offline using a
different tool outside of MongoDB, or to load into your own MongoDB instance.
mongoexport is a command-line tool that
produces a JSONL or CSV export of data stored in a MongoDB instance. First, follow
these instructions to install the MongoDB Database Tools.
Now you can run the following in your console to download the metadata and global_and_us collections as jsonl files in
your current directory:
mongoexport --collection='global_and_us' --out='global_and_us.jsonl' --uri="mongodb+srv://readonly:readonly@covid-19.hip2i.mongodb.net/covid19"
mongoexport --collection='metadata' --out='metadata.jsonl' --uri="mongodb+srv://readonly:readonly@covid-19.hip2i.mongodb.net/covid19"
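Once exported, the files can be read with nothing more than the Python standard library. The sketch below assumes mongoexport's default relaxed Extended JSON output, in which a date appears as {"$date": "2020-04-13T00:00:00Z"}. The function name is hypothetical, and only the top-level date field is converted.

```python
import json
from datetime import datetime
from typing import Any, Dict, Iterable, Iterator


def iter_export(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty JSONL line, converting "date" to a datetime."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        date = doc.get("date")
        # Assumed shape of an exported date: {"$date": "<ISO-8601 string ending in Z>"}.
        if isinstance(date, dict) and isinstance(date.get("$date"), str):
            doc["date"] = datetime.fromisoformat(date["$date"].replace("Z", "+00:00"))
        yield doc


# One line in the assumed export format, based on the example document above.
sample = (
    '{"country": "France", "state": "Guadeloupe", '
    '"date": {"$date": "2020-04-13T00:00:00Z"}, "confirmed": 143}'
)
docs = list(iter_export([sample, ""]))
print(docs[0]["date"].isoformat(), docs[0]["confirmed"])
```

To read a real export, pass an open file object, for example iter_export(open("global_and_us.jsonl")). If pymongo is installed, bson.json_util.loads can parse Extended JSON values such as ObjectId and dates directly.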
Documentation for all the features of mongoexport is available on
the MongoDB website and with the command mongoexport --help.
Once you have the data on your computer, you can use it directly with local tools, or load it into your own MongoDB
instance using mongoimport.
mongoimport --collection='global_and_us' --uri="mongodb+srv://<user>:<password>@<your-cluster>.mongodb.net/covid19" global_and_us.jsonl
mongoimport --collection='metadata' --uri="mongodb+srv://<user>:<password>@<your-cluster>.mongodb.net/covid19" metadata.jsonl
Note that you cannot run these commands against our cluster because the user we gave you (readonly:readonly) doesn't
have write permission on this cluster.
Read our Getting Your Free MongoDB Atlas Cluster blog post if you want to know more.
Another smart way to duplicate the dataset in your own cluster would be to use mongodump and mongorestore. Apart
from being more efficient, it will also grab the index definitions along with the data.
mongodump --uri="mongodb+srv://readonly:readonly@covid-19.hip2i.mongodb.net/covid19"
mongorestore --drop --uri="<YOUR_URI>"
We see the value and importance of making this data as readily available to everyone as possible, so we're not stopping
here. Over the coming days, we'll be adding a GraphQL and REST API, as well as making the data available within Excel
and Google Sheets.
We've also launched an Atlas credits program for
anyone working on detecting, understanding, and stopping the spread of COVID-19.
If you are having any problems accessing the data or have other data sets you would like to host, please contact us
on the MongoDB community. We would also love to showcase any services you build on top
of this data set. Finally, please send in PRs for any code changes you would like to make to the examples.
You can also reach out to the authors
directly (Aaron Bassett, Joe Karlsson, Mark Smith,
and Maxime Beugnet) on Twitter.