The GDELT Dataset Primer (for use in the hackathon)

The GDELT Dataset

For this hackathon you will be working with the GDELT Project dataset. The GDELT (Global Database of Events, Language, and Tone) Project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages, and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day.

How To Work with GDELT?

Over the next few weeks we’re going to be publishing blog posts, hosting live streams and AMA (ask me anything) sessions to help you with your GDELT and MongoDB journey. In the meantime, you have a couple of options: You can work with our existing GDELT data cluster (containing the entirety of last year’s GDELT data), or you can load a subset of the GDELT data into your own cluster.

Work With Our Hosted GDELT Cluster

We currently host the past year’s GDELT data in a cluster called GDELT2. Once you have an Atlas account set up, you can access it read-only using Compass, or any of the MongoDB drivers, with the following connection string:


The raw data is contained in a collection called “eventsCSV”, and a slightly massaged copy of the data (with Actors and Actions broken down into subdocuments) is contained in a collection called “recentEvents”.
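To give a sense of what that restructuring might look like, here is a minimal Python sketch that groups a flat event record into Actor and Action subdocuments. The field names (GLOBALEVENTID, Actor1Code, EventCode, and so on) come from the published GDELT event schema, but the exact subdocument layout used in the hosted recentEvents collection is an assumption here, so treat this as illustrative only:

```python
# Illustrative sketch: reshape a flat GDELT event record into subdocuments,
# similar in spirit to the "recentEvents" collection. The exact layout used
# in the hosted cluster may differ.

def reshape_event(flat):
    """Group Actor1*/Actor2* fields and action fields into subdocuments."""
    doc = {"GLOBALEVENTID": flat["GLOBALEVENTID"]}
    for actor in ("Actor1", "Actor2"):
        # Strip the "Actor1"/"Actor2" prefix and drop empty values.
        doc[actor] = {
            k[len(actor):]: v
            for k, v in flat.items()
            if k.startswith(actor) and v != ""
        }
    doc["Action"] = {
        "EventCode": flat["EventCode"],
        "EventRootCode": flat["EventRootCode"],
        "GoldsteinScale": float(flat["GoldsteinScale"]),
        "AvgTone": float(flat["AvgTone"]),
    }
    return doc

# A fabricated sample record with only a handful of the real schema's fields.
flat_event = {
    "GLOBALEVENTID": "1037322766",
    "Actor1Code": "USA",
    "Actor1Name": "UNITED STATES",
    "Actor2Code": "",
    "Actor2Name": "",
    "EventCode": "042",
    "EventRootCode": "04",
    "GoldsteinScale": "1.9",
    "AvgTone": "-2.5",
}

event = reshape_event(flat_event)
print(event["Actor1"])  # {'Code': 'USA', 'Name': 'UNITED STATES'}
```

A shape like this makes queries such as “all events where Actor1.Code is USA” much more natural than matching on prefixed flat field names.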

We’re still making changes to this cluster, and plan to load more data in as time goes on (as well as keeping up-to-date with the 15-minute updates to GDELT!), so keep an eye out for the updates!

How to Get GDELT into Your Own MongoDB Cluster

There’s a good chance you won’t be able to work with the data in its raw form: you may need it in a different format, or filtered down in some way, to work with it efficiently. In that case, I highly recommend you follow Adrienne’s advice in her GDELT Primer README.

In the next few days we’ll be publishing a tool to efficiently load the data you want into a MongoDB cluster - bear with us. In the meantime, read up on GDELT, have a look at the sample data, and find some teammates to build with!
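If you do want to pre-filter a downloaded export file yourself in the meantime, plain Python is enough: GDELT 2.0 export files are tab-delimited with no header row. A minimal sketch is below; note that the column position of EventRootCode is an assumption based on the published schema, so verify it against the official codebook before relying on it:

```python
# Illustrative sketch: filter a header-less, tab-delimited GDELT export
# file down to protest events (CAMEO root code "14") before loading it.
# EVENT_ROOT_CODE_COL is an assumption -- verify it against the official
# GDELT 2.0 event codebook.
import csv
import io

EVENT_ROOT_CODE_COL = 28  # assumed position of EventRootCode (verify!)

def filter_rows(lines, root_code="14", col=EVENT_ROOT_CODE_COL):
    """Yield only the rows whose CAMEO event root code matches."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) > col and row[col] == root_code:
            yield row

# Tiny fabricated sample: two rows with just enough columns to demonstrate.
sample = ("\t".join(["1"] + [""] * 27 + ["14"]) + "\n"
          + "\t".join(["2"] + [""] * 27 + ["04"]) + "\n")

kept = list(filter_rows(io.StringIO(sample)))
print(len(kept))  # 1
```

The same generator works unchanged on a real file object, so you can stream a large export through it without holding the whole file in memory.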

Further Reading

The following documents contain most of the official documentation you’ll need for working with GDELT. We’ve summarized much of it here, but it’s always good to check the source, and you’ll need the CAMEO encoding listing!
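As a taste of that CAMEO encoding: event codes are hierarchical, and the first two digits give one of twenty broad root categories. A few of them are shown in this small Python lookup (the descriptions are paraphrased from the CAMEO codebook; consult the official listing for the full taxonomy):

```python
# A few of CAMEO's twenty top-level event root codes, paraphrased from the
# CAMEO codebook -- see the official listing for the complete taxonomy.
CAMEO_ROOT = {
    "01": "Make public statement",
    "04": "Consult",
    "07": "Provide aid",
    "14": "Protest",
    "19": "Fight",
}

def describe(event_code):
    """Map a full CAMEO event code (e.g. '1411') to its root category."""
    return CAMEO_ROOT.get(event_code[:2], "Unknown root code")

print(describe("1411"))  # Protest
print(describe("0871"))  # Unknown root code (08 not in this partial table)
```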

Please reply below with any questions you may have regarding GDELT and we’ll endeavour to answer them as quickly as we can.


We are continuing to improve the gdelttools Python package. The package mainly provides a command-line program called gdeltloader (which should be on your PATH after installation).

To install this package run:

pip install gdelttools

To get the real value out of the package you should also clone the repo that this package is generated from. It contains a script that @Mark_Smith has done some sterling work on, improving how it loads large collections of input files.

You should definitely check both items out.

Here is the help for gdeltloader:

% gdeltloader -h
usage: gdeltloader [-h] [--host HOST] [--database DATABASE]
                   [--collection COLLECTION] [--master] [--update]
                   [--local LOCAL] [--overwrite] [--download] [--importdata]
                   [--metadata] [--filter {all,gkg,mentions,export}]
                   [--last LAST] [--version]

optional arguments:
  -h, --help            show this help message and exit
  --host HOST           MongoDB URI
  --database DATABASE   Default database for loading [GDELT2]
  --collection COLLECTION
                        Default collection for loading [eventscsv]
  --master              GDELT master file [False]
  --update              GDELT update file [False]
  --local LOCAL         load data from local list of zips
  --overwrite           Overwrite files when they exist already
  --download            download zip files from master or local file
  --importdata          Import files into MongoDB
  --metadata            grab meta data files
  --filter {all,gkg,mentions,export}
                        download a subset of the data, the default is all data
                        [export, mentions, gkg, all]
  --last LAST           how many recent files to download default : [0]
                        implies all files
  --version             show program's version number and exit

Version: 0.07b2 More info :

This version also has early support for loading the downloaded files directly, but right now that approach is much slower than the script.


Hello @Shane_McAllister, which tool were you referring to for efficiently loading the data we want into a MongoDB cluster?

This one, located here: gdelttools · PyPI - the GDELT Tools Python Package