#An Introduction to GDELT Data
#(and How to Work with It and MongoDB)
There's a good chance that if you're reading this, it's because you're planning to enter the MongoDB "Data as News" Hackathon! If not, well, go ahead and sign up here!
Now that that's over with, let's get to the first question you probably have:
#What is GDELT?
GDELT is an acronym, standing for "Global Database of Events, Language and Tone". It's a database of geopolitical event data, automatically derived and translated in real time from hundreds of news sources in 65 languages. It's around two terabytes of data, so it's really quite big!
Each event contains the following data:
- Details of the one or more actors - usually countries or political entities.
- The type of event that has occurred, such as "appeal for judicial cooperation".
- The positive or negative sentiment perceived towards the event, on a scale of -10 (very negative) to +10 (very positive).
- An "impact score" on the Goldstein Scale, indicating the theoretical potential impact that type of event will have on the stability of a country.
#But what does it look like?
The raw data GDELT provides is hosted as zipped CSV files, with a new file published every 15 minutes since February 2015. A row in the CSV files contains data that looks a bit like this:
|Field|Value|
|---|---|
|Actor1Geo_Fullname|Albemarle, North Carolina, United States|
|ActionGeo_Fullname|Albemarle, North Carolina, United States|
This event encodes Actor1 (North Carolina) hosting a visit (CAMEO code 043) … The details of the visit aren't included in the row itself; in this case, it's an "exhibit exploring change in the Rural US." You can click through the SourceURL link to read further details.
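If you'd like to poke at a raw file yourself, here's a minimal Python sketch of reading rows like the one above. It assumes the files are tab-delimited with no header row (despite the CSV name), so you have to supply the column names yourself. The `COLUMNS` list is a shortened, illustrative subset rather than the real layout, and the sample row is made up for the example, so check the official codebook for the full column order.

```python
import csv
import io

# Illustrative subset only (an assumption for this sketch); real files have many more columns.
COLUMNS = ["GlobalEventID", "EventCode", "GoldsteinScale", "AvgTone",
           "ActionGeo_Fullname", "SOURCEURL"]
NUMERIC = {"GoldsteinScale", "AvgTone"}


def parse_events(text, columns=COLUMNS):
    """Yield one dict per tab-delimited row, converting the numeric fields to floats."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        event = dict(zip(columns, row))
        for name in NUMERIC:
            if name in event:
                # Empty strings become None rather than raising on float("")
                event[name] = float(event[name]) if event[name] else None
        yield event


# A made-up row matching the illustrative columns above.
sample = ("1\t043\t2.8\t1.5\t"
          "Albemarle, North Carolina, United States\thttps://example.com/story\n")
events = list(parse_events(sample))
print(events[0]["EventCode"], events[0]["AvgTone"])
```

Note that `EventCode` is deliberately left as a string: converting it to a number would drop the leading zero.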
Every event looks like this. One or two actors, possibly some "action" detail, and then a verb, encoded using the CAMEO verb encoding. CAMEO is short for "Conflict and Mediation Event Observations", and you can find the full verb listing in this PDF. If you need a more "computer readable" version of the CAMEO verbs, one is hosted here.
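CAMEO codes are hierarchical: the first two digits give the broad category, and each extra digit narrows it down. Here's a rough sketch of splitting a code into those levels. The `CAMEO_LABELS` table is a tiny hand-entered excerpt for illustration (043 is the "host a visit" code from the example above, which sits under the 04 "Consult" category), and the helper names are our own invention, not part of any GDELT tooling.

```python
# Tiny hand-entered excerpt for illustration; use the full CAMEO listing in practice.
CAMEO_LABELS = {"04": "Consult", "043": "Host a visit"}


def cameo_levels(event_code):
    """Split a CAMEO code into root (2 digits), base (up to 3 digits) and full code."""
    # Codes must stay strings: the integer 43 would lose the leading zero of "043".
    code = str(event_code).strip()
    if not code.isdigit() or not 2 <= len(code) <= 4:
        raise ValueError(f"not a CAMEO event code: {event_code!r}")
    return {"root": code[:2], "base": code[:3], "full": code}


def describe(event_code):
    """Return the most specific label we know, falling back to broader levels."""
    levels = cameo_levels(event_code)
    for key in (levels["full"], levels["base"], levels["root"]):
        if key in CAMEO_LABELS:
            return CAMEO_LABELS[key]
    return "unknown"


print(describe("043"))  # Host a visit
print(describe("042"))  # Consult (only the root category is in our excerpt)
```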
#What's So Interesting About an Enormous Table of Geopolitical Data?
We think that there are a bunch of different ways to think about the data encoded in the GDELT dataset.
Firstly, it's a longitudinal dataset, going back through time. Data in GDELT v2 goes from the present day back to 2015, providing a huge amount of event data for the past 7 years. But the GDELT v1 dataset, which is less rich, goes all the way back to 1979! This gives an unparalleled opportunity to study the patterns and trends of geopolitics for the past 43 years.
More than just a historical dataset, however, GDELT is a living dataset, updated every 15 minutes. This means it can also be considered an event system for understanding the world right now. How you use this ability is up to you, but it shouldn't be ignored!
GDELT is also a geographical dataset. Each event encodes one or more geographic points for its actors and actions, so the data can be analysed from a GIS standpoint. But more than all of this, GDELT models human interactions at a large scale. The Goldstein (impact) score (GoldsteinScale) and the sentiment score (AvgTone) capture the human impact of the events being encoded.
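To make the geographical angle concrete, here's a small sketch that turns an event's latitude and longitude into a GeoJSON point, which is the shape MongoDB's geospatial queries expect. It assumes the coordinates live in fields named `ActionGeo_Lat` and `ActionGeo_Long` (as in the GDELT column names); the `location` field name and the approximate sample coordinates are our own choices for the example.

```python
def to_geojson_point(event, lat_field="ActionGeo_Lat", lon_field="ActionGeo_Long"):
    """Build a GeoJSON Point from an event, or return None if coordinates are unusable."""
    try:
        lat = float(event[lat_field])
        lon = float(event[lon_field])
    except (KeyError, TypeError, ValueError):
        return None  # field missing, None, or not a number (e.g. an empty string)
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    # GeoJSON puts longitude first, then latitude.
    return {"type": "Point", "coordinates": [lon, lat]}


# Approximate, illustrative coordinates for the example event.
event = {
    "ActionGeo_Fullname": "Albemarle, North Carolina, United States",
    "ActionGeo_Lat": "35.35",
    "ActionGeo_Long": "-80.2",
}
event["location"] = to_geojson_point(event)
print(event["location"])
```

Once documents carry a field like this, a `2dsphere` index on it (for example `collection.create_index([("location", "2dsphere")])` in PyMongo) should let you run queries such as "events near this point".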
Whether you choose to explore one of the axes above using ML or visualisation; whether you choose to use GDELT data on its own or combine it with another data source; whether you choose to home in on specific events in the recent past, we're sure that you'll discover new understandings of the world around you by analysing the news data it contains.
#How To Work with GDELT?
Over the next few weeks we're going to be publishing blog posts, hosting live streams and AMA (ask me anything) sessions to help you with your GDELT and MongoDB journey. In the meantime, you have a couple of options: You can work with our existing GDELT data cluster (containing the entirety of last year's GDELT data), or you can load a subset of the GDELT data into your own cluster.
#Work With Our Hosted GDELT Cluster
We currently host the past year's GDELT data in a cluster called GDELT2. You can access it read-only using Compass, or any of the MongoDB drivers, with the following connection string:
The raw data is contained in a collection called "eventsCSV", and a slightly massaged copy of the data (with Actors and Actions broken down into subdocuments) is contained in a collection called "recentEvents".
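As a starting point, here's a hedged sketch of an aggregation you might run against the "eventsCSV" collection with PyMongo. It assumes the documents use the GDELT column names (`NumMentions`, `EventRootCode`, `AvgTone`, `GoldsteinScale`) and store the scores as numbers, which you should verify in Compass first. The `uri` and `db_name` arguments are placeholders you need to fill in yourself.

```python
def tone_by_root_code(min_mentions=1, limit=10):
    """Build a pipeline: average tone and Goldstein score per CAMEO root code."""
    return [
        # Field names are assumed to mirror the GDELT column names.
        {"$match": {"NumMentions": {"$gte": min_mentions}}},
        {"$group": {
            "_id": "$EventRootCode",
            "events": {"$sum": 1},
            "avgTone": {"$avg": "$AvgTone"},
            "avgGoldstein": {"$avg": "$GoldsteinScale"},
        }},
        {"$sort": {"events": -1}},
        {"$limit": limit},
    ]


def run(uri, db_name, collection="eventsCSV"):
    """Run the pipeline; uri and db_name are placeholders you must supply."""
    # Imported lazily so that building the pipeline doesn't require the driver.
    from pymongo import MongoClient

    with MongoClient(uri) as client:
        return list(client[db_name][collection].aggregate(tone_by_root_code()))


print(tone_by_root_code(min_mentions=5)[0])
```

Building the pipeline as plain Python data keeps it easy to inspect, or to paste into Compass's aggregation builder, before running it against a year's worth of events.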
We're still making changes to this cluster, and plan to load more data in as time goes on (as well as keeping up-to-date with the 15-minute updates to GDELT!), so keep an eye out for updates to this blog post!
#How to Get GDELT into Your Own MongoDB Cluster
There's a high likelihood that you can't work with the data in its raw form. For one reason or another, you may need the data in a different format, or filtered in some way, to work with it efficiently. In that case, I highly recommend you follow Adrienne's advice in her GDELT Primer README.
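To give a flavour of what "a different format" might mean, here's a minimal sketch that reshapes a flat event into a document with one subdocument per actor and action. The grouping rule (splitting on the `Actor1`, `Actor2` and `Action` prefixes) is our own assumption for illustration and isn't necessarily how the hosted "recentEvents" collection was built.

```python
PREFIXES = ("Actor1", "Actor2", "Action")


def nest_event(flat):
    """Group prefixed fields into subdocuments and drop empty values."""
    nested = {}
    for key, value in flat.items():
        if value in ("", None):
            continue  # raw rows are sparse; skip empty fields
        for prefix in PREFIXES:
            if key.startswith(prefix) and len(key) > len(prefix):
                # e.g. "Actor1Geo_Fullname" -> nested["Actor1"]["Geo_Fullname"]
                nested.setdefault(prefix, {})[key[len(prefix):]] = value
                break
        else:
            nested[key] = value  # no prefix matched: keep at the top level
    return nested


# A made-up flat event using field names from the example row.
flat = {
    "GlobalEventID": "1",
    "EventCode": "043",
    "Actor1Geo_Fullname": "Albemarle, North Carolina, United States",
    "Actor2Code": "",
    "ActionGeo_Fullname": "Albemarle, North Carolina, United States",
}
doc = nest_event(flat)
print(doc)
```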
In the next few days we'll be publishing a tool to efficiently load the data you want into a MongoDB cluster. In the meantime, read up on GDELT, have a look at the sample data, and find some teammates to build with!
The following documents contain most of the official documentation you'll need for working with GDELT. We've summarized much of it here, but it's always good to check the source, and you'll need the CAMEO encoding listing!
We hope the above gives you some insight into this fascinating dataset. We’ve chosen it for the theme of this year's MongoDB World Hackathon, "Data as News", due to its size, longevity, currency and global relevance. If you fancy exploring the GDELT dataset more, as well as learning MongoDB and competing for some one-of-a-kind prizes, well, go ahead and sign up here for the Hackathon! We’d be glad to have you!