The GDELT dataset Primer (for use in hackathon)

Shane_McAllister · April 11, 2022, 4:46pm

The GDELT Dataset

For this hackathon you will be working with the GDELT Project dataset. The GDELT (Global Database of Events, Language, and Tone) Project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day.

How To Work with GDELT?

Over the next few weeks we’ll be publishing blog posts and hosting live streams and AMA (ask me anything) sessions to help you with your GDELT and MongoDB journey. In the meantime, you have a couple of options: you can work with our existing GDELT data cluster (containing the entirety of last year’s GDELT data), or you can load a subset of the GDELT data into your own cluster.

Work With Our Hosted GDELT Cluster

We currently host the past year’s GDELT data in a cluster called GDELT2. Once you have an Atlas account set up, you can access it read-only using Compass, or any of the MongoDB drivers, with the following connection string:

mongodb+srv://readonly:readonly@gdelt2.rgl39.mongodb.net/GDELT?retryWrites=true&w=majority

The raw data is contained in a collection called “eventsCSV”, and a slightly massaged copy of the data (with Actors and Actions broken down into subdocuments) is contained in a collection called “recentEvents”.
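If you'd rather poke at the hosted cluster from code than from Compass, here is a minimal Python sketch. It uses the connection string and collection names quoted above; the helper name `fetch_sample` is just made up for illustration, and it assumes the cluster is still reachable and that you have installed the driver with `pip install "pymongo[srv]"`. It deliberately fetches one arbitrary document without a filter, since the field names aren't listed in this post.

```python
"""Minimal sketch: read one document from the hosted GDELT cluster."""

# Connection string quoted from the post (read-only credentials).
GDELT_URI = (
    "mongodb+srv://readonly:readonly@gdelt2.rgl39.mongodb.net/"
    "GDELT?retryWrites=true&w=majority"
)
DATABASE = "GDELT"  # database named in the connection string
COLLECTIONS = ("eventsCSV", "recentEvents")  # collections named in the post


def fetch_sample(collection_name="recentEvents", uri=GDELT_URI):
    """Return one arbitrary document from the chosen collection.

    Hypothetical helper for illustration; not part of any GDELT tooling.
    """
    if collection_name not in COLLECTIONS:
        raise ValueError(f"unknown collection: {collection_name!r}")

    # Imported lazily so this file can be loaded without the driver installed.
    from pymongo import MongoClient

    # Give up after 10 seconds if no server can be selected.
    client = MongoClient(uri, serverSelectionTimeoutMS=10_000)
    try:
        # No filter: we only want to see what a document looks like.
        return client[DATABASE][collection_name].find_one()
    finally:
        client.close()
```

Calling `fetch_sample()` returns a document from “recentEvents”, and `fetch_sample("eventsCSV")` returns one from the raw collection, which is a quick way to compare the two shapes before you write any real queries.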

We’re still making changes to this cluster, and plan to load more data in as time goes on (as well as keeping up-to-date with the 15-minute updates to GDELT!), so keep an eye out for the updates!

How to Get GDELT into Your Own MongoDB Cluster

There’s a high likelihood that you can’t work with the data in its raw form. For one reason or another you may need the data in a different format, or filtered in some way, to work with it efficiently. In that case, I highly recommend you follow Adrienne’s advice in her GDELT Primer README.

In the next few days we’ll be publishing a tool to efficiently load the data you want into a MongoDB cluster - bear with us. In the meantime, read up on GDELT, have a look at the sample data, and find some teammates to build with!

% gdeltloader -h
usage: gdeltloader [-h] [--host HOST] [--database DATABASE]
                   [--collection COLLECTION] [--master] [--update]
                   [--local LOCAL] [--overwrite] [--download] [--importdata]
                   [--metadata] [--filter {all,gkg,mentions,export}]
                   [--last LAST] [--version]

optional arguments:
  -h, --help            show this help message and exit
  --host HOST           MongoDB URI
  --database DATABASE   Default database for loading [GDELT2]
  --collection COLLECTION
                        Default collection for loading [eventscsv]
  --master              GDELT master file [False]
  --update              GDELT update file [False]
  --local LOCAL         load data from local list of zips
  --overwrite           Overwrite files when they exist already
  --download            download zip files from master or local file
  --importdata          Import files into MongoDB
  --metadata            grab meta data files
  --filter {all,gkg,mentions,export}
                        download a subset of the data, the default is all data
                        [export, mentions gkg, all]
  --last LAST           how many recent files to download default : [0]
                        implies all files
  --version             show program's version number and exit

Version: 0.07b2 More info : https://github.com/jdrumgoole/gdelttools

This version also has latent support for loading the downloaded files directly, but right now that approach is much slower than the mongoimport.sh script.
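As a rough illustration of how the flags in the help output above might be combined, here is a small Python sketch that builds a gdeltloader command line and runs it. The flag names come straight from the help text; the particular combination (master file, download, a filter, and a limit on recent files) is my assumption about a sensible first run, and `build_command` and `run` are hypothetical helper names, not part of gdelttools.

```python
"""Minimal sketch: build and run a gdeltloader download command."""
import shutil
import subprocess

# Choices listed for --filter in the help output.
FILTERS = ("all", "gkg", "mentions", "export")


def build_command(data_filter="export", last=20, overwrite=False):
    """Return the argv list for a gdeltloader download (hypothetical helper)."""
    if data_filter not in FILTERS:
        raise ValueError(f"filter must be one of {FILTERS}")
    if last < 0:
        raise ValueError("last must be zero or positive")
    cmd = [
        "gdeltloader",
        "--master",             # use the GDELT master file
        "--download",           # download the zip files it lists
        "--filter", data_filter,
        "--last", str(last),    # per the help text, 0 implies all files
    ]
    if overwrite:
        cmd.append("--overwrite")  # replace files that already exist
    return cmd


def run(cmd):
    """Run the command, failing early if gdeltloader is not on PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)
```

For example, `run(build_command(last=5))` would try to download only the five most recent export files, which keeps a first experiment small. Check the tool's own documentation before relying on this exact flag combination.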

Ayo_Exbizy · May 14, 2022, 4:43pm

Hello @Shane_McAllister, which tool were you referring to for efficiently loading the data we want into a MongoDB cluster?

Shane_McAllister · May 15, 2022, 10:42pm

This one, located here: gdelttools on PyPI, the GDELT Tools Python package.
