GDELT, Missiles, and Image Collection


The Global Database of Events, Language, and Tone, or GDELT, is “a realtime network diagram and database of global human society for open research”.

A firehose stream of global data offers tantalizing possibilities for research, but concrete examples of work that go beyond simple evaluations of the database’s capabilities are notably absent…

Let’s see how well we can scoop up a set of ballistic missile images using a combination of packages:

  • gdeltr2: a package that is decidedly buggy, but works for these purposes
  • tidyverse suite:
    • dplyr for data carpentry
    • stringr for string processing
    • purrr for functional enhancements and simplified error-handling
  • knitr for nicely rendered tabular data
  • imager for slick image processing tools

library(gdeltr2)       # devtools::install_github("abresler/gdeltr2")
library(tidyverse)     # install.packages("tidyverse")
library(knitr)
library(imager)        # install.packages("imager")

Before we start extracting any data, let’s refine our search as much as possible by assigning a handful of variables that we can use in the API call.

GDELT uses a set of codebooks that can be referenced with get_gdelt_codebook_ft_api().
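
Each codebook is queried by name and comes back as a data frame. A minimal sketch of the pattern, using only the function and the three code_book values demonstrated later in this post:

get_gdelt_codebook_ft_api(code_book = "languages")   # article languages
get_gdelt_codebook_ft_api(code_book = "imagetags")   # image tag vocabulary
get_gdelt_codebook_ft_api(code_book = "imageweb")    # web-derived image tag vocabulary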

Query Variables

Languages

gdeltr2 only queries English-language articles by default, but we don’t really care about the language of the article, or even the article text.

We’ll pull() all the languages from code_book = "languages" into a vector variable that we can use as a search argument like so:

langs <- get_gdelt_codebook_ft_api(code_book = "languages") %>%
  pull(value)

langs
##  [1] "Afrikaans"        "Albanian"         "Arabic"          
##  [4] "Armenian"         "Azerbaijani"      "Bengali"         
##  [7] "Bosnian"          "Bulgarian"        "Catalan"         
## [10] "Chinese"          "Croatian"         "Czech"           
## [13] "Danish"           "Dutch"            "Estonian"        
## [16] "Finnish"          "French"           "Galician"        
## [19] "Georgian"         "German"           "Greek"           
## [22] "Gujarati"         "Hebrew"           "Hindi"           
## [25] "Hungarian"        "Icelandic"        "Indonesian"      
## [28] "Italian"          "Japanese"         "Kannada"         
## [31] "Kazakh"           "Korean"           "Latvian"         
## [34] "Lithuanian"       "Macedonian"       "Malay"           
## [37] "Malayalam"        "Marathi"          "Mongolian"       
## [40] "Nepali"           "Norwegian"        "NorwegianNynorsk"
## [43] "Persian"          "Polish"           "Portuguese"      
## [46] "Punjabi"          "Romanian"         "Russian"         
## [49] "Serbian"          "Sinhalese"        "Slovak"          
## [52] "Slovenian"        "Somali"           "Spanish"         
## [55] "Swahili"          "Swedish"          "Tamil"           
## [58] "Telugu"           "Thai"             "Tibetan"         
## [61] "Turkish"          "Ukrainian"        "Urdu"            
## [64] "Vietnamese"

Image Tags

Since we’re looking specifically for imagery, we can query the relevant codebooks with code_book = "imagetags" and code_book = "imageweb" like so:

get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
  head() %>%
  kable()

| nameCodebook | idImageTag | value    |
|:-------------|:-----------|---------:|
| IMAGETAGS    | person     | 65791693 |
| IMAGETAGS    | profession | 33553949 |
| IMAGETAGS    | vehicle    | 25342998 |
| IMAGETAGS    | sports     | 17180851 |
| IMAGETAGS    | speech     | 16976988 |
| IMAGETAGS    | people     | 13358317 |

get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
  head() %>%
  kable()

| nameCodebook | idImageWeb               | value   |
|:-------------|:-------------------------|--------:|
| IMAGEWEB     | Image                    | 2198300 |
| IMAGEWEB     | News                     | 2136894 |
| IMAGEWEB     | Photograph               | 1027341 |
| IMAGEWEB     | United States of America |  659847 |
| IMAGEWEB     | Speech                   |  649292 |
| IMAGEWEB     | Car                      |  621304 |

We’ll filter() the tags to retain only those that explicitly reference “missile” with a regex.

We also want to handle a bug in gdeltr2’s query functions where a large amount of incorrect information sometimes makes it into tag lists. Fortunately, we can omit that by excluding results containing blocks of multiple digits.

tag_regex <- "\\b[Mm]issile\\b"

bind_rows(
  get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
    filter(str_detect(idImageTag, tag_regex),
           !str_detect(idImageTag, "\\d{2,}")),
  
  get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
    filter(str_detect(idImageWeb, tag_regex),
           !str_detect(idImageWeb, "\\d{2,}"))
  ) %>%
  head() %>%
  kable()

| nameCodebook | idImageTag                  | value  | idImageWeb |
|:-------------|:----------------------------|-------:|:-----------|
| IMAGETAGS    | missile                     | 247486 | NA         |
| IMAGETAGS    | guided missile destroyer    | 194660 | NA         |
| IMAGETAGS    | missile boat                | 147549 | NA         |
| IMAGETAGS    | ballistic missile submarine |  55996 | NA         |
| IMAGETAGS    | cruise missile submarine    |  11508 | NA         |
| IMAGEWEB     | NA                          |  77637 | Missile    |

We’ll refine our results by excluding some of the tags that have a tendency to return less relevant images.

  • vehicle terms tend to emphasize the vehicle itself, rather than weapon systems
    • "boat"
    • "submarine"
    • "tank"
    • "destroyer"
  • Missile "defense" emphasizes politics over hardware
  • specific "system" tags are all in reference to surface-to-air platforms
    • S-300 missile system
    • S-400 missile system
    • Buk missile system
  • the generalized surface-to-"air" tag doesn’t seem to ever reference ballistic missiles

We’ll use another regex to omit those tags, along with the multiple-digit pattern used earlier to exclude the buggy data that may leak into our results.

Junk Tag Filtering

junk_tag_regex <- c("boat", "[Ss]ubmarine", "tank", "destroyer",
                    "defense",
                    "system",
                    "air") %>%
  paste0("\\b", ., "\\b") %>%
  str_c(collapse = "|") %>%
  paste0("|\\d{2,}")

junk_tag_regex
## [1] "\\bboat\\b|\\b[Ss]ubmarine\\b|\\btank\\b|\\bdestroyer\\b|\\bdefense\\b|\\bsystem\\b|\\bair\\b|\\d{2,}"

With some parameters in mind and filtering variables assigned, let’s pull() the desired tags from each codebook into a pair of variables which we will use to query GDELT’s API.

image_tags <- get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
  filter(str_detect(idImageTag, tag_regex),
         !str_detect(idImageTag, junk_tag_regex)) %>%
  pull(idImageTag)

imageweb_tags <- get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
  filter(str_detect(idImageWeb, tag_regex),
         !str_detect(idImageWeb, junk_tag_regex)) %>%
  pull(idImageWeb)

combine(image_tags, imageweb_tags)
##  [1] "missile"                           
##  [2] "Missile"                           
##  [3] "Ballistic missile"                 
##  [4] "Cruise missile"                    
##  [5] "Intercontinental ballistic missile"
##  [6] "Anti-ballistic missile"            
##  [7] "Missile launch facility"           
##  [8] "Medium-range ballistic missile"    
##  [9] "Land-attack missile"               
## [10] "Short-range ballistic missile"     
## [11] "Surface-to-surface missile"

Dates

We’ll specify a time period using gdeltr2::generate_dates(). For this example, we’ll select September 22-23 of 2017 to see if we can capture coverage of an Iranian military parade.

target_dates <- generate_dates(start_date = "2017-09-22",
                               end_date = "2017-09-23")

API Call

With all of our query variables prepared, we’ll call GDELT’s API using get_data_ft_v2_api(). Since duplicate articles are commonly published across many venues, we’ll reduce the results to distinct() titleArticles.

articles_df <- get_data_ft_v2_api(images_tag = image_tags,
                                  images_web_tag = imageweb_tags,
                                  search_language = langs,
                                  dates = target_dates, 
                                  visualize_results = FALSE) %>%
  distinct(titleArticle, .keep_all = TRUE)

Query Results

Here’s a summary of what we get back.

articles_df %>% 
  glimpse()
## Observations: 393
## Variables: 16
## $ modeSearch           "ArtList", "ArtList", "ArtList", "ArtList"...
## $ imagewebtagSearch    "Missile", "Missile", "Missile", "Missile"...
## $ datetimeStartSearch  "2017-09-22 12:00:00", "2017-09-22 12:00:0...
## $ datetimeEndSearch    "2017-09-23 11:59:59", "2017-09-23 11:59:5...
## $ imagetagSearch       NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ isOR                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ countMaximumRecords  250, 250, 250, 250, 250, 250, 250, 250, 25...
## $ urlGDELTV2FTAPI      "https://api.gdeltproject.org/api/v2/doc/d...
## $ urlArticle           "http://www.iranherald.com/news/254804227/...
## $ urlArticleMobile     NA, "http://m.alarab.com/Article/825920", ...
## $ titleArticle         "Iran Herald", "     ...
## $ datetimeArticle      2017-09-23 05:00:00, 2017-09-23 05:00:00,...
## $ urlImage             "http://cdn.bignewsnetwork.com/voa15061647...
## $ domainArticle        "iranherald.com", "alarab.com", "baomoi.co...
## $ languageArticle      "English", "Arabic", "Vietnamese", "Nepali...
## $ countryArticle       "Iran", "Israel", "Vietnam", "Tuvalu", "Un...

Extracting Images

Now that we have a data frame of articles that includes a column of image URLs, we can download the data.

Directory

Let’s assign a variable for our directory.

dir <- "data/missile_images/"

Then we’ll actually create the directory.

dir.create(dir, recursive = TRUE)  # recursive = TRUE also creates the parent "data/" directory if needed

Workflow

We’re going to take advantage of the magic of the purrr package in several ways to stabilize our workflow.

Error Handling

The Internet is littered with broken links and webpages, which becomes more likely the further back in time we go. We’ll use one of purrr’s adverbs, safely(), to handle the inevitable download errors that will occur by creating a new function called safe_download().

safe_download <- safely(download.file)

We’re also going to create safe versions of functions we’ll use for loading and plotting images. Although most of the valid URLs will link to clean images, it’s not uncommon for otherwise successful downloads to actually come from already corrupted sources.

To handle this, we’ll create safe_image() and safe_plot().

safe_image <- safely(load.image)
safe_plot <- safely(plot)
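
Each safely() wrapper returns a two-element list instead of throwing an error: $result holds the wrapped function’s value (or NULL on failure) and $error holds the error condition (or NULL on success). A minimal sketch, assuming a file path that doesn’t exist:

bad <- safe_image("not/a/real/file.jpg")
bad$result   # NULL, because nothing could be loaded
bad$error    # the captured error condition, rather than a crash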

Download Images

  1. filter() images using a regex that confirms either a .jpg or .png extension and simultaneously validates a URL sequence that we can use for each image’s eventual file path.
  2. select a sample of 100 random rows sample_n(100)
  3. pull() the urlImage column into a vector
  4. iterate through each item of the vector with walk()
    • safe_download() each image’s binary format (mode = "wb")
      • and write it to dir using its match to valid_path_regex

valid_path_regex <- "/[A-Za-z0-9_-]+\\.(jpg|png)$"

articles_df %>%
  filter(str_detect(urlImage, valid_path_regex)) %>%
  sample_n(100) %>%
  pull(urlImage) %>%
  walk(~
         safe_download(.x,
                       paste0(dir, 
                              str_extract(.x, valid_path_regex)),
                       mode = "wb")
         )
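
Because walk() is used purely for side effects, the safe_download() results are discarded. Here is a hedged variation if you’d rather keep them to count failures; urls below is a hypothetical stand-in for the pull()ed urlImage vector above:

download_log <- urls %>%
  map(~
        safe_download(.x,
                      paste0(dir,
                             str_extract(.x, valid_path_regex)),
                      mode = "wb"))

# Each element's $error slot is NULL on success, so compact() keeps only failures
download_log %>%
  map("error") %>%
  compact() %>%
  length()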

Inspect Images

Let’s inspect a sample of the downloaded images.

Clearly the results are not perfect: there are images without anything resembling a missile, as well as several duplicate or near-duplicate images. That said, manually renaming files makes it straightforward to filter out the useless images.

This is a quick proof of concept that sets us up well for enhancing data sets established through other methods.

More importantly, it demonstrates a basic workflow for bulk image processing that can easily be expanded to iteratively prepare a large dataset for many kinds of analysis.

We can take a look at our results with the following:

  1. list.files() the full paths of all the files in dir
  2. iterate through the resulting vector, reading each file with safe_image() and map()ping the results to a list
  3. remove a layer of the list hierarchy by flatten()ing it
  4. omit any resulting NULL values by compact()ing the list
  5. subset a sample() of half the images, just for demonstration
  6. walk() through the list, plotting each image

par(mfrow = c(5, 2))

list.files(dir, full.names = TRUE) %>%
  map(safe_image) %>%
  flatten() %>%
  compact() %>%
  sample(50) %>%
  walk(~ 
         safe_plot(.x, 
                   axes = FALSE, ann = FALSE)
       )

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2       imager_0.41.1        magrittr_1.5        
##  [4] knitr_1.20.8         forcats_0.3.0        stringr_1.3.1       
##  [7] dplyr_0.7.6          purrr_0.2.5          readr_1.1.1         
## [10] tidyr_0.8.1          tibble_1.4.2.9004    ggplot2_3.0.0.9000  
## [13] tidyverse_1.2.1.9000 gdeltr2_0.3.11702   
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137            xts_0.10-2             
##  [3] lubridate_1.7.4         webshot_0.5.0          
##  [5] progress_1.2.0          httr_1.3.1             
##  [7] tools_3.5.1             backports_1.1.2        
##  [9] utf8_1.1.4              R6_2.2.2               
## [11] lazyeval_0.2.1          colorspace_1.3-2       
## [13] withr_2.1.2             readbitmap_0.1.5       
## [15] tidyselect_0.2.4        prettyunits_1.0.2      
## [17] mnormt_1.5-5            curl_3.2               
## [19] compiler_3.5.1          cli_1.0.0              
## [21] rvest_0.3.2             xml2_1.2.0             
## [23] bookdown_0.7            triebeard_0.3.0        
## [25] scales_0.5.0.9000       checkmate_1.8.5        
## [27] psych_1.8.4             RApiDatetime_0.0.3     
## [29] trelliscopejs_0.1.13    digest_0.6.15          
## [31] tiff_0.1-5              foreign_0.8-70         
## [33] rmarkdown_1.10.7        base64enc_0.1-3        
## [35] jpeg_0.1-8              pkgconfig_2.0.1        
## [37] htmltools_0.3.6         highcharter_0.5.0      
## [39] highr_0.7               htmlwidgets_1.2        
## [41] rlang_0.2.1             ggthemes_3.5.0         
## [43] readxl_1.1.0            TTR_0.23-3             
## [45] htmldeps_0.1.0          rstudioapi_0.7         
## [47] quantmod_0.4-13         bindr_0.1.1            
## [49] zoo_1.8-2               jsonlite_1.5           
## [51] mclust_5.4.1            rlist_0.4.6.1          
## [53] fansi_0.2.3             Rcpp_0.12.17           
## [55] munsell_0.5.0           purrrlyr_0.0.3         
## [57] stringi_1.2.3           yaml_2.1.19            
## [59] plyr_1.8.4              grid_3.5.1             
## [61] parallel_3.5.1          crayon_1.3.4           
## [63] lattice_0.20-35         haven_1.1.2            
## [65] hms_0.4.2               anytime_0.3.1          
## [67] pillar_1.3.0.9000       igraph_1.2.1           
## [69] reshape2_1.4.3          codetools_0.2-15       
## [71] glue_1.2.0              evaluate_0.10.1        
## [73] blogdown_0.7.1          DistributionUtils_0.5-1
## [75] bmp_0.3                 data.table_1.11.5      
## [77] modelr_0.1.2            png_0.1-7              
## [79] urltools_1.7.0          cellranger_1.1.0       
## [81] gtable_0.2.0            assertthat_0.2.0       
## [83] xfun_0.3                broom_0.4.5            
## [85] autocogs_0.0.1          wordcloud2_0.2.1
