An Introduction to GDELT Data

Mark SmithPublished Apr 12, 2022 • Updated May 24, 2022
An Introduction to GDELT Data

(and How to Work with It and MongoDB)
Hey there!
There's a good chance that if you're reading this, it's because you're planning to enter the MongoDB "Data as News" Hackathon! If not, well, go ahead and sign up here!
Now that that's over with, let's get to the first question you probably have:
What is GDELT?
GDELT is an acronym, standing for "Global Database of Events, Language and Tone". It's a database of geopolitical event data, automatically derived and translated in real time from hundreds of news sources in 65 languages. It's around two terabytes of data, so it's really quite big!
Each event contains the following data:
  • Details of the one or more actors, usually countries or political entities.
  • The type of event that has occurred, such as "appeal for judicial cooperation".
  • The positive or negative sentiment perceived towards the event, on a scale of -10 (very negative) to +10 (very positive).
  • An "impact score" on the Goldstein Scale, indicating the theoretical potential impact that type of event will have on the stability of a country.
But what does it look like?
The raw data GDELT provides is hosted as zipped CSV files, with a new file uploaded every 15 minutes since February 2015. A row in the CSV files contains data that looks a bit like this:
Field Name               Value
_id                      1037207900
Day                      20210401
MonthYear                202104
Year                     2021
FractionDate             2021.2493
Actor1Code               USA
Actor1Name               NORTH CAROLINA
Actor1CountryCode        USA
IsRootEvent              1
EventCode                43
EventBaseCode            43
EventRootCode            4
QuadClass                1
GoldsteinScale           2.8
NumMentions              10
NumSources               1
NumArticles              10
AvgTone                  1.548672566
Actor1Geo_Type           3
Actor1Geo_Fullname       Albemarle, North Carolina, United States
Actor1Geo_CountryCode    US
Actor1Geo_ADM1Code       USNC
Actor1Geo_ADM2Code       NC021
Actor1Geo_Lat            35.6115
Actor1Geo_Long           -82.5426
Actor1Geo_FeatureID      1017529
Actor2Geo_Type           0
ActionGeo_Type           3
ActionGeo_Fullname       Albemarle, North Carolina, United States
ActionGeo_CountryCode    US
ActionGeo_ADM1Code       USNC
ActionGeo_ADM2Code       NC021
ActionGeo_Lat            35.6115
ActionGeo_Long           -82.5426
ActionGeo_FeatureID      1017529
DateAdded                2022-04-01T15:15:00Z
SourceURL                https://www.dailyadvance.com/news/local/museum-to-host-exhibit-exploring-change-in-rural-us/article_42fd837e-c5cf-5478-aec3-aa6bd53566d8.html
downloadId               20220401151500
This event encodes Actor1 (North Carolina) hosting a visit (CAMEO code 043, which appears in the EventCode field without its leading zero, as 43). In this case, the details of the visit aren't included in the event itself: it's an "exhibit exploring change in the Rural US." You can click through the SourceURL link to read further details.
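As a minimal sketch of how a row like the one above might be given proper types before it is loaded into MongoDB, the snippet below converts a flat mapping of raw strings into typed values. The helper name `to_document` and the choice of which fields count as integers, floats, or dates are assumptions made for illustration; they are not part of any GDELT or MongoDB tooling.

```python
from datetime import datetime, timezone

# Assumed typing for a handful of fields from the sample row.
# Any field not listed here is kept as a string.
INT_FIELDS = {"Year", "IsRootEvent", "QuadClass",
              "NumMentions", "NumSources", "NumArticles"}
FLOAT_FIELDS = {"GoldsteinScale", "AvgTone", "ActionGeo_Lat", "ActionGeo_Long"}


def to_document(row: dict) -> dict:
    """Convert a flat {field name: raw string} mapping into typed values."""
    doc = {}
    for name, raw in row.items():
        if raw == "":
            continue  # drop empty columns instead of storing empty strings
        if name in INT_FIELDS:
            doc[name] = int(raw)
        elif name in FLOAT_FIELDS:
            doc[name] = float(raw)
        elif name == "Day":
            # "20210401" -> timezone-aware datetime at midnight UTC
            doc[name] = datetime.strptime(raw, "%Y%m%d").replace(tzinfo=timezone.utc)
        else:
            doc[name] = raw
    return doc


# A subset of the sample row, as raw strings.
sample = {
    "Day": "20210401",
    "Actor1Code": "USA",
    "Actor2Code": "",
    "GoldsteinScale": "2.8",
    "NumMentions": "10",
    "AvgTone": "1.548672566",
}
event = to_document(sample)
print(event)
```

Storing numbers and dates with their real types, rather than as strings, is what makes range queries and sorting on fields such as AvgTone behave as expected later on.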
Every event looks like this. One or two actors, possibly some "action" detail, and then a verb, encoded using the CAMEO verb encoding. CAMEO is short for "Conflict and Mediation Event Observations", and you can find the full verb listing in this PDF. If you need a more "computer readable" version of the CAMEO verbs, one is hosted here.
What's So Interesting About an Enormous Table of Geopolitical Data?
We think that there are a bunch of different ways to think about the data encoded in the GDELT dataset.
Firstly, it's a longitudinal dataset, going back through time. Data in GDELT v2 goes from the present day back to 2015, providing a huge amount of event data for the past 7 years. But the GDELT v1 dataset, which is less rich, goes back to 1979! This gives an unparalleled opportunity to study the patterns and trends of geopolitics over the past 43 years.
More than just a historical dataset, however, GDELT is a living dataset, updated every 15 minutes. This means it can also be considered an event system for understanding the world right now. How you use this ability is up to you, but it shouldn't be ignored!
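To make the 15-minute cadence concrete, here is a small sketch that floors a timestamp to the most recent 15-minute boundary and formats it in the same YYYYMMDDHHMMSS shape as the downloadId value in the sample row (20220401151500). The function name `latest_update_id` is hypothetical, and treating the boundary as UTC is an assumption.

```python
from datetime import datetime, timezone


def latest_update_id(now: datetime) -> str:
    """Floor a datetime to the previous 15-minute boundary as YYYYMMDDHHMMSS."""
    floored = now.replace(minute=now.minute - now.minute % 15,
                          second=0, microsecond=0)
    return floored.strftime("%Y%m%d%H%M%S")


update_id = latest_update_id(datetime(2022, 4, 1, 15, 22, 41, tzinfo=timezone.utc))
print(update_id)  # prints 20220401151500
```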
GDELT is also a geographical dataset. Each event encodes one or more geographic points for its actors and actions, so the data can be analysed from a GIS standpoint. But more than all of this, GDELT models human interactions at a large scale. The Goldstein (impact) score (GoldsteinScale) and the sentiment score (AvgTone) quantify the human impact of the events being encoded.
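For the GIS angle, one possible approach is to turn an event's latitude and longitude fields into a GeoJSON Point, which is the shape MongoDB's geospatial (2dsphere) indexes and queries work with. The sketch below assumes the flat field names from the sample row; the helper name `action_point` is made up for this example.

```python
def action_point(event: dict):
    """Build a GeoJSON Point from ActionGeo_Lat/ActionGeo_Long, or None if absent."""
    lat = event.get("ActionGeo_Lat")
    lon = event.get("ActionGeo_Long")
    if lat in (None, "") or lon in (None, ""):
        return None  # many events carry no location for the action
    # GeoJSON orders coordinates as [longitude, latitude], not [lat, long].
    return {"type": "Point", "coordinates": [float(lon), float(lat)]}


point = action_point({"ActionGeo_Lat": "35.6115", "ActionGeo_Long": "-82.5426"})
print(point)
```

The longitude-first ordering is the detail most worth double-checking: swapping the two values produces points that are still valid but land in the wrong place.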
Whether you choose to explore one of the axes above, using ML, or visualisation; whether you choose to use GDELT data on its own, or combine it with another data source; whether you choose to home in on specific events in the recent past; we're sure that you'll discover new understandings of the world around you by analysing the news data it contains.
How to Work with GDELT
Over the next few weeks we're going to be publishing blog posts, hosting live streams and AMA (ask me anything) sessions to help you with your GDELT and MongoDB journey. In the meantime, you have a couple of options: You can work with our existing GDELT data cluster (containing the entirety of last year's GDELT data), or you can load a subset of the GDELT data into your own cluster.
Work With Our Hosted GDELT Cluster
We currently host the past year's GDELT data in a cluster called GDELT2. You can access it read-only using Compass, or any of the MongoDB drivers, with the following connection string:
The raw data is contained in a collection called "eventsCSV", and a slightly massaged copy of the data (with Actors and Actions broken down into subdocuments) is contained in a collection called "recentEvents".
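The exact shape of the documents in "recentEvents" isn't described here, so the following is only a guess at what "broken down into subdocuments" could look like: flat fields are grouped under Actor1, Actor2, and Action by their name prefix. The function name `nest_event` and the resulting key names are assumptions for illustration.

```python
PREFIXES = ("Actor1", "Actor2", "Action")


def nest_event(flat: dict) -> dict:
    """Group prefixed fields into subdocuments; leave other fields at top level."""
    nested = {}
    for name, value in flat.items():
        for prefix in PREFIXES:
            if name.startswith(prefix):
                # e.g. "Actor1Geo_Lat" -> nested["Actor1"]["Geo_Lat"]
                nested.setdefault(prefix, {})[name[len(prefix):]] = value
                break
        else:
            # No prefix matched, so keep the field where it is.
            nested[name] = value
    return nested


nested = nest_event({
    "Actor1Code": "USA",
    "Actor1Geo_Lat": 35.6115,
    "ActionGeo_Lat": 35.6115,
    "EventCode": 43,
})
print(nested)
```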
We're still making changes to this cluster, and plan to load more data in as time goes on (as well as keeping up-to-date with the 15-minute updates to GDELT!), so keep an eye out for updates to this blog post!
How to Get GDELT into Your Own MongoDB Cluster
There's a high likelihood that you can't work with the data in its raw form. For one reason or another, you may need the data in a different format, or filtered in some way, to work with it efficiently. In that case, I highly recommend you follow Adrienne's advice in her GDELT Primer README.
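As a rough illustration of what "filtered in some way" might mean in practice, the sketch below builds a MongoDB query document that selects events in one country with a strongly negative tone and a minimum number of mentions. The helper name `build_event_filter` and the thresholds are hypothetical, and the field names assume the flat layout of the sample row, with numeric fields stored as numbers rather than strings.

```python
def build_event_filter(country_code: str, max_tone: float, min_mentions: int = 1) -> dict:
    """Build a MongoDB query document for events in one country below a tone threshold."""
    return {
        "ActionGeo_CountryCode": country_code,
        "AvgTone": {"$lte": max_tone},          # at or below the tone threshold
        "NumMentions": {"$gte": min_mentions},  # at least this many mentions
    }


query = build_event_filter("US", max_tone=-5.0, min_mentions=10)
print(query)
```

With PyMongo, a dictionary like this can be passed as the filter argument to a collection's `find()` method; no connection is made here so that the example runs on its own.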
In the next few days we'll be publishing a tool to efficiently load the data you want into a MongoDB cluster. In the meantime, read up on GDELT, have a look at the sample data, and find some teammates to build with!
Further Reading
The following documents contain most of the official documentation you'll need for working with GDELT. We've summarized much of it here, but it's always good to check the source, and you'll need the CAMEO encoding listing!
What next?
We hope the above gives you some insight into this fascinating dataset. We’ve chosen it for "Data as News", the theme of this year's MongoDB World Hackathon, because of its size, longevity, currency and global relevance. If you fancy exploring the GDELT dataset more, as well as learning MongoDB and competing for some one-of-a-kind prizes, go ahead and sign up here for the Hackathon! We’d be glad to have you!
