This post briefly describes methods for quantifying the political bias of online news media based on the media-sharing habits of US lawmakers on Twitter. I have discussed this set of methods in a previous post. Here, the focus is on a more streamlined (and multi-threaded) approach to resolving shortened URLs via the quicknews package. We also present unsupervised methods for visualizing media bias in two-dimensional space via tSNE, and compare results to the manually curated fact- and bias-checking online resource Media Bias/Fact Check (MB/FC). The two line up fairly well.
library(tidyverse)
localdir <- '/home/jtimm/jt_work/GitHub/data_sets'
## devtools::install_github("jaytimm/quicknews")
The tweet-set used here was accessed via the GWU Library, and subsequently “hydrated” using the Hydrator desktop application. Tweets were generated by members of the 116th House from 3 Jan 2019 to 7 May 2020. Subsequent analyses are based on a sample of 500 tweets/lawmaker containing shared URLs.
setwd(localdir)
house_tweets <- readRDS('house116-sample-urls.rds') %>%
  filter(urls != '')
Media bias data set
Media Bias/Fact Check is a fact-checking organization that classifies online news sources along two dimensions: (1) political bias and (2) factuality. These two scores (for ~850 sources) have been extracted by Baly et al. (2020), and made available in tabular format here.
setwd('/home/jtimm/jt_work/GitHub/packages/quicknews/data-raw')
## emnlp18 <- read.csv('emnlp18-corpus.tsv', sep = '\t')
acl2020 <- read.csv('acl2020-corpus.tsv', sep = '\t')
A sample of this data set is presented below.
set.seed(221)
acl2020 %>%
  group_by(fact, bias) %>%
  sample_n(1) %>% # ungroup() %>%
  select(source_url_normalized, fact, bias) %>% # spread(bias, source_url_normalized) %>%
  knitr::kable()
Resolving shortened URLs
The quicknews package is a collection of tools for navigating the online news landscape; here, we detail a simple workflow for multi-threaded URL un-shortening. It is a three-step process: (1) identify URLs that have been shortened via qnews_clean_urls, (2) split the vector of URLs into multiple batches via qnews_split_batches for distribution across multiple cores, and (3) resolve shortened URLs via qnews_unshorten_urls.
## step 1
shortened_urls <- quicknews::qnews_clean_urls(url = house_tweets$urls) %>%
  filter(is_short == 1)

## step 2
batch_urls <- shortened_urls %>%
  quicknews::qnews_split_batches(n = 12)

## step 3
unshortened_urls <- parallel::mclapply(lapply(batch_urls, "[[", 1),
                                       quicknews::qnews_unshorten_urls,
                                       seconds = 10,
                                       mc.cores = 12)

unshortened_urls1 <- data.table::rbindlist(unshortened_urls)
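For readers who want to see the shape of steps 1 and 2 without installing quicknews, here is a minimal base R sketch. The helpers `is_short_url` and `split_batches` are hypothetical stand-ins written for this illustration; they are not the quicknews functions, and the list of shortener domains is an assumption, not an exhaustive inventory.

```r
# Hypothetical helper (not part of quicknews): flag URLs whose host is a
# known link shortener. The shortener list is illustrative only.
is_short_url <- function(urls,
                         shorteners = c("bit.ly", "t.co", "ow.ly",
                                        "tinyurl.com", "goo.gl")) {
  # Drop the scheme, an optional "www.", and everything after the host
  host <- sub("^https?://(www\\.)?([^/]+).*$", "\\2", urls)
  host %in% shorteners
}

# Hypothetical helper: split a vector into (at most) n roughly equal batches
split_batches <- function(x, n) {
  unname(split(x, ceiling(seq_along(x) * n / length(x))))
}

# Made-up URLs for illustration
urls <- c("https://bit.ly/2abc",
          "https://www.nytimes.com/2020/05/01/us/story.html",
          "https://t.co/xyz123",
          "http://ow.ly/AbC",
          "https://tinyurl.com/y7",
          "https://goo.gl/q1")

short_urls <- urls[is_short_url(urls)]
batches <- split_batches(short_urls, n = 2)
lengths(batches)
```

Each element of `batches` is a plain character vector, so it can be handed to a resolver function one batch per core, in the same way the workflow above passes batches to `parallel::mclapply`.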
Media bias & tSNE
To aggregate these data, we build a simple domain-lawmaker matrix, in which each domain/news organization is represented by the number of times each lawmaker has shared one of its news stories.
ft1 <- filt.tweets %>%
  group_by(user_screen_name, source) %>%
  count() %>%
  filter(source %in% share.summary$source) %>%
  tidytext::cast_sparse(row = 'source',
                        column = 'user_screen_name',
                        value = n)

ft2 <- as.matrix(ft1) #%>% Rtsne::normalize_input()
ft2[1:5, 1:5]
##                   AUSTINSCOTTGA08 BENNIEGTHOMPSON BETTYMCCOLLUM04 BILLPASCRELL
## abcnews.go.com                  1               4               0            3
## airforcetimes.com               1               0               0            0
## ajc.com                         6               0               0            0
## bloomberg.com                   2               3               0            5
## c-span.org                      2               1               4            3
##                   BOBBYSCOTT
## abcnews.go.com             0
## airforcetimes.com          0
## ajc.com                    0
## bloomberg.com              2
## c-span.org                 1
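The post does not show how the `source` column is derived from resolved URLs, so the following is only a sketch of one way to get from URLs to a domain-lawmaker count matrix in base R. The data frame `shares`, the screen names, and the URLs are all made up for illustration.

```r
# Toy data: one row per shared (already resolved) URL. All values are made up.
shares <- data.frame(
  user_screen_name = c("LAWMAKER_A", "LAWMAKER_A", "LAWMAKER_B",
                       "LAWMAKER_B", "LAWMAKER_C"),
  url = c("https://www.example-news.com/politics/1",
          "https://www.example-news.com/politics/2",
          "https://example-news.com/world/3",
          "http://another-outlet.org/story",
          "https://another-outlet.org/opinion/9"),
  stringsAsFactors = FALSE
)

# Strip the scheme, an optional "www.", and the path to leave the bare domain
shares$source <- sub("^https?://(www\\.)?([^/]+).*$", "\\2", shares$url)

# Cross-tabulate: rows are domains, columns are lawmakers, cells are share counts
dl_matrix <- unclass(table(shares$source, shares$user_screen_name))
dl_matrix
```

A dense cross-tabulation like this is fine for a toy example; with many domains and lawmakers, a sparse representation such as the one produced by `tidytext::cast_sparse` above is the more economical choice.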
set.seed(77) ## 9
tsne <- Rtsne::Rtsne(X = ft2, check_duplicates = FALSE)

tsne_clean <- data.frame(descriptor_name = rownames(ft1), tsne$Y) %>%
  #mutate(screen_name = toupper(descriptor_name)) %>%
  left_join(acl2020, by = c('descriptor_name' = 'source_url_normalized')) %>%
  replace(is.na(.), 'x')
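The labelling step above (a left join followed by filling unmatched rows with 'x') can be sketched in base R as follows. The coordinates, domains, and labels are toy values invented for this example, and the column names simply mirror those used in the post.

```r
# Toy t-SNE coordinates and toy bias/factuality labels; all values are made up.
coords <- data.frame(
  descriptor_name = c("outlet-a.com", "outlet-b.com", "outlet-c.com"),
  X1 = c(-4.2, 0.3, 5.1),
  X2 = c(1.0, -2.5, 0.7),
  stringsAsFactors = FALSE
)
labels <- data.frame(
  source_url_normalized = c("outlet-a.com", "outlet-c.com"),
  fact = c("high", "mixed"),
  bias = c("left", "right"),
  stringsAsFactors = FALSE
)

# Left join: keep every domain in coords, whether or not it has a label
joined <- merge(coords, labels,
                by.x = "descriptor_name", by.y = "source_url_normalized",
                all.x = TRUE)

# Mark unlabelled domains with a placeholder value, in the label columns only
for (col in c("fact", "bias")) {
  joined[[col]][is.na(joined[[col]])] <- "x"
}
joined
```

Restricting the fill to the two label columns is a design choice made for this sketch: it keeps the coordinate columns numeric no matter what they contain.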
As the figure below shows, the first dimension of the tSNE plot does a fairly good job of capturing differences in bias classifications as presented by Media Bias/Fact Check, and the results are generally intuitive. The factors underlying variation along the second dimension, however, are less clear, and they do not appear to capture factuality in this case. Note: news organizations marked with orange Xs are not included in the MB/FC data set.
split_pal <- c('#3c811a', '#395f81', '#9e5055', '#e37e00')

tsne_clean %>%
  ggplot(aes(X1, X2)) +
  geom_point(aes(col = bias, shape = fact), size = 3) +
  geom_text(aes(label = descriptor_name, col = bias, shape = fact), # size = 3,
            check_overlap = TRUE) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_manual(values = split_pal) +
  xlab('Dimension 1') + ylab('Dimension 2') +
  labs(title = "Measuring political bias")
Bias score distributions
tsne_clean %>%
  ggplot() +
  geom_density(aes(X1, fill = bias), alpha = .4) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_manual(values = split_pal) +
  ggtitle('Media bias scores by MB/FC bias classification')
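A numeric companion to the density plot is to summarize the first-dimension scores within each bias class. The sketch below uses made-up scores in a hypothetical data frame called `scores`; it shows the mechanics only and does not reproduce any result from the analysis above.

```r
# Toy first-dimension scores by bias class; all values are made up.
scores <- data.frame(
  X1 = c(-6.1, -4.8, -5.5, -0.4, 0.9, 0.2, 4.7, 5.9, 5.2),
  bias = rep(c("left", "center", "right"), each = 3),
  stringsAsFactors = FALSE
)

# Median score per bias class, ordered along the dimension
med <- tapply(scores$X1, scores$bias, median)
med[order(med)]
```

If the class medians are well separated and ordered left, center, right (or the reverse), the dimension is tracking the bias labels. The direction itself carries no meaning, since the orientation of a tSNE embedding is arbitrary.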
Baly, Ramy, Georgi Karadzhov, Jisun An, Haewoon Kwak, Yoan Dinkov, Ahmed Ali, James Glass, and Preslav Nakov. 2020. “What Was Written Vs. Who Read It: News Media Profiling Using Text Analysis and Social Media Context.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ACL ’20.