An Introduction to GDELT Data

Mark Smith • 5 min read • Published Apr 12, 2022 • Updated May 24, 2022

(and How to Work with It and MongoDB)

Hey there!

There's a good chance that if you're reading this, it's because you're planning to enter the MongoDB "Data as News" Hackathon! If not, well, go ahead and sign up here!

Now that that's over with, let's get to the first question you probably have:

What is GDELT?

GDELT is an acronym, standing for "Global Database of Events, Language and Tone". It's a database of geopolitical event data, automatically derived and translated in real time from hundreds of news sources in 65 languages. It's around two terabytes of data, so it's really quite big!

Each event contains the following data:

- Details of one or more actors, usually countries or political entities.
- The type of event that has occurred, such as "appeal for judicial cooperation".
- The positive or negative sentiment perceived towards the event, with scores usually falling between -10 (very negative) and +10 (very positive).
- An "impact score" on the Goldstein Scale, indicating the theoretical potential impact that type of event will have on the stability of a country.

But what does it look like?

The raw data GDELT provides is hosted as CSV files, with a new zipped file uploaded every 15 minutes since February 2015. A row in the CSV files contains data that looks a bit like this:

Field Name	Value
_id	1037207900
Day	20210401
MonthYear	202104
Year	2021
FractionDate	2021.2493
Actor1Code	USA
Actor1Name	NORTH CAROLINA
Actor1CountryCode	USA
IsRootEvent	1
EventCode	43
EventBaseCode	43
EventRootCode	4
QuadClass	1
GoldsteinScale	2.8
NumMentions	10
NumSources	1
NumArticles	10
AvgTone	1.548672566
Actor1Geo_Type	3
Actor1Geo_Fullname	Albemarle, North Carolina, United States
Actor1Geo_CountryCode	US
Actor1Geo_ADM1Code	USNC
Actor1Geo_ADM2Code	NC021
Actor1Geo_Lat	35.6115
Actor1Geo_Long	-82.5426
Actor1Geo_FeatureID	1017529
Actor2Geo_Type	0
ActionGeo_Type	3
ActionGeo_Fullname	Albemarle, North Carolina, United States
ActionGeo_CountryCode	US
ActionGeo_ADM1Code	USNC
ActionGeo_ADM2Code	NC021
ActionGeo_Lat	35.6115
ActionGeo_Long	-82.5426
ActionGeo_FeatureID	1017529
DateAdded	2022-04-01T15:15:00Z
SourceURL	https://www.dailyadvance.com/news/local/museum-to-host-exhibit-exploring-change-in-rural-us/article_42fd837e-c5cf-5478-aec3-aa6bd53566d8.html
downloadId	20220401151500

This event encodes Actor1 (North Carolina) hosting a visit (CAMEO code 043). The details of the visit aren't included in the event itself; in this case, it's an "exhibit exploring change in the Rural US." You can click through the SourceURL link to read further details.
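If you read rows like the one above from a CSV file, every value arrives as a string. Here's a minimal sketch of converting a few of the fields into richer Python types. The function name `parse_event` is just for illustration, and the date formats are inferred from the sample row rather than taken from a formal spec:

```python
from datetime import date, datetime, timezone


def parse_event(row: dict) -> dict:
    """Return a copy of `row` with a few fields converted from strings."""
    parsed = dict(row)
    # Day looks like an eight-digit YYYYMMDD value, e.g. 20210401.
    parsed["Day"] = datetime.strptime(str(row["Day"]), "%Y%m%d").date()
    # DateAdded looks like a UTC timestamp, e.g. 2022-04-01T15:15:00Z.
    parsed["DateAdded"] = datetime.strptime(
        row["DateAdded"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    # The two scores are plain decimal numbers.
    parsed["GoldsteinScale"] = float(row["GoldsteinScale"])
    parsed["AvgTone"] = float(row["AvgTone"])
    return parsed


sample = {
    "Day": "20210401",
    "DateAdded": "2022-04-01T15:15:00Z",
    "GoldsteinScale": "2.8",
    "AvgTone": "1.548672566",
}
event = parse_event(sample)
```

Storing real dates and numbers (instead of strings) makes range queries and sorting behave the way you'd expect once the data is in MongoDB.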

Every event looks like this. One or two actors, possibly some "action" detail, and then a verb, encoded using the CAMEO verb encoding. CAMEO is short for "Conflict and Mediation Event Observations", and you can find the full verb listing in this PDF. If you need a more "computer readable" version of the CAMEO verbs, one is hosted here.
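One thing to watch for: in the sample row above, the CAMEO code 043 shows up as 43, because the leading zero is lost when the code is stored as a number. Here's a small sketch of restoring it, assuming you have both EventRootCode and EventCode as integers. The helper name `cameo_code` is hypothetical, and the approach relies on CAMEO root codes being two digits long:

```python
def cameo_code(root_code: int, event_code: int) -> str:
    """Rebuild the zero-padded CAMEO code string from integer fields."""
    root = f"{root_code:02d}"  # root codes are two digits, e.g. 4 -> "04"
    code = str(event_code)
    # A full CAMEO code starts with its root code. If it doesn't,
    # a leading zero was dropped when the value was stored as a number.
    if not code.startswith(root):
        code = "0" + code
    return code


print(cameo_code(4, 43))  # prints 043
```

With the padded string in hand, you can look the verb up in the CAMEO listing, where 043 is the "host a visit" code used in the example above.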

What's So Interesting About an Enormous Table of Geopolitical Data?

We think that there are a bunch of different ways to think about the data encoded in the GDELT dataset.

Firstly, it's a longitudinal dataset, going back through time. Data in GDELT v2 goes from the present day back to 2015, providing a huge amount of event data for the past 7 years. But the GDELT v1 dataset, which is less rich, goes back to 1979! This gives an unparalleled opportunity to study the patterns and trends of geopolitics for the past 43 years.

More than just a historical dataset, however, GDELT is a living dataset, updated every 15 minutes. This means it can also be considered an event system for understanding the world right now. How you use this ability is up to you, but it shouldn't be ignored!
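If you want to follow the 15-minute updates, it helps to work out which update window a given moment falls into. This is a rough sketch that rounds a timestamp down to the quarter hour and formats it like the downloadId value in the sample row (20220401151500). The function name is made up for this example, and it says nothing about whether that batch has actually been published yet:

```python
from datetime import datetime, timezone


def latest_batch_id(now: datetime) -> str:
    """Round `now` down to a 15-minute boundary, formatted as YYYYMMDDHHMMSS."""
    floored = now.replace(
        minute=now.minute - now.minute % 15, second=0, microsecond=0
    )
    return floored.strftime("%Y%m%d%H%M%S")


moment = datetime(2022, 4, 1, 15, 22, 41, tzinfo=timezone.utc)
print(latest_batch_id(moment))  # prints 20220401151500
```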

GDELT is also a geographical dataset. Each event encodes one or more points of its actors and actions, so the data can be analysed from a GIS standpoint. But more than all of this, GDELT models human interactions at a large scale. The Goldstein (impact) score (GoldsteinScale), and the sentiment score (AvgTone) provide the human impact of the events being encoded.
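To use the geographical side of the data in MongoDB, you'll probably want the latitude and longitude as a GeoJSON point, which is the shape MongoDB's geospatial queries and 2dsphere indexes work with. Here's a minimal sketch using the ActionGeo_Lat and ActionGeo_Long fields from the sample row. The function name `action_point` is just an illustration. Note that GeoJSON puts longitude first:

```python
from typing import Optional


def action_point(event: dict) -> Optional[dict]:
    """Build a GeoJSON Point from an event's action location, if it has one."""
    lat = event.get("ActionGeo_Lat")
    lon = event.get("ActionGeo_Long")
    # Not every event is geocoded, so the fields may be missing or empty.
    if lat in (None, "") or lon in (None, ""):
        return None
    # GeoJSON coordinate order is [longitude, latitude].
    return {"type": "Point", "coordinates": [float(lon), float(lat)]}


point = action_point({"ActionGeo_Lat": 35.6115, "ActionGeo_Long": -82.5426})
```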

Whether you choose to explore one of the axes above, using ML, or visualisation; whether you choose to use GDELT data on its own, or combine it with another data source; whether you choose to home in on specific events in the recent past; we're sure that you'll discover new understandings of the world around you by analysing the news data it contains.

How To Work with GDELT?

Over the next few weeks we're going to be publishing blog posts, hosting live streams and AMA (ask me anything) sessions to help you with your GDELT and MongoDB journey. In the meantime, you have a couple of options: You can work with our existing GDELT data cluster (containing the entirety of last year's GDELT data), or you can load a subset of the GDELT data into your own cluster.

Work With Our Hosted GDELT Cluster

We currently host the past year's GDELT data in a cluster called GDELT2. You can access it read-only using Compass, or any of the MongoDB drivers, with the following connection string:

mongodb+srv://readonly:readonly@gdelt2.rgl39.mongodb.net/GDELT?retryWrites=true&w=majority

The raw data is contained in a collection called "eventsCSV", and a slightly massaged copy of the data (with Actors and Actions broken down into subdocuments) is contained in a collection called "recentEvents".
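As a starting point for exploring the raw collection, here's a sketch of an aggregation pipeline that groups events for one country by EventRootCode and averages their tone. It assumes the documents in "eventsCSV" use the same field names as the sample row earlier in this post, which you should check against a real document first. The function name and output field names are made up for this example:

```python
def tone_by_root_code(country_code: str) -> list:
    """Build a pipeline: average AvgTone per EventRootCode for one country."""
    return [
        # Keep only events whose first actor belongs to the given country.
        {"$match": {"Actor1CountryCode": country_code}},
        # Group by the top-level CAMEO verb and summarise each group.
        {
            "$group": {
                "_id": "$EventRootCode",
                "avgTone": {"$avg": "$AvgTone"},
                "events": {"$sum": 1},
            }
        },
        # Most frequent event types first.
        {"$sort": {"events": -1}},
    ]


pipeline = tone_by_root_code("USA")
```

With PyMongo installed, you could pass this to `MongoClient(uri)["GDELT"]["eventsCSV"].aggregate(pipeline)`, where `uri` is the connection string above. The collection is large, so it may be worth narrowing the `$match` stage further while you experiment.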

We're still making changes to this cluster, and plan to load more data in as time goes on (as well as keeping up-to-date with the 15-minute updates to GDELT!), so keep an eye out for updates to this blog post!

How to Get GDELT into Your Own MongoDB Cluster

There's a high likelihood that you can't work with the data in its raw form. For one reason or another you need the data in a different format, or filtered in some way to work with it efficiently. In that case, I highly recommend you follow Adrienne's advice in her GDELT Primer README.
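As one example of "a different format", here's a sketch of reshaping a flat row into a document with subdocuments for each actor and for the action, similar in spirit to the "recentEvents" collection described above. The exact shape of that collection isn't documented in this post, so treat the field layout here as hypothetical:

```python
PREFIXES = ("Actor1", "Actor2", "Action")


def nest_event(row: dict) -> dict:
    """Move Actor1*, Actor2* and Action* fields into their own subdocuments."""
    nested = {}
    for key, value in row.items():
        for prefix in PREFIXES:
            if key.startswith(prefix) and len(key) > len(prefix):
                # e.g. Actor1Geo_Lat becomes nested["Actor1"]["Geo_Lat"]
                nested.setdefault(prefix, {})[key[len(prefix):]] = value
                break
        else:
            # No prefix matched, so the field stays at the top level.
            nested[key] = value
    return nested


flat = {
    "_id": 1037207900,
    "Actor1Code": "USA",
    "Actor1Geo_Lat": 35.6115,
    "ActionGeo_Type": 3,
    "AvgTone": 1.548672566,
}
doc = nest_event(flat)
```

Grouping related fields like this tends to make documents easier to read and query, but the right shape depends on what your application needs to ask of the data.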

In the next few days we'll be publishing a tool to efficiently load the data you want into a MongoDB cluster. In the meantime, read up on GDELT, have a look at the sample data, and find some teammates to build with!

What next?

We hope the above gives you some insight into this fascinating dataset. We’ve chosen it as the theme, "Data as News", for this year's MongoDB World Hackathon due to its size, longevity, currency and global relevance. If you fancy exploring the GDELT dataset more, as well as learning MongoDB, and competing for some one-of-a-kind prizes, well, go ahead and sign up here to the Hackathon! We’d be glad to have you!
