Deceit in Politics: An Analysis of PolitiFact Data

[This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers].

Naturally, both Hillary Clinton and Donald Trump have been accused of lying. If I had told you in 2012 that the candidates of both major parties would be accused of lying, you would likely have given me a blank, uninterested stare; that alone is not shocking. What is shocking, though, is the level of deceit and how central a theme it was to this campaign season.

Some accuse the other side of being the liars, the other side counters with a similar accusation, and those not committed to a side like to lazily declare both sides to be equally guilty of lying. It does not take much thought, though, to realize there is no reason why both sides should be equally guilty of lying, and that is especially true for this election.

Donald Trump takes lying to a new level, living in his own invented reality and inviting the rest of America to participate in his nightmarish hallucination that only he can save us from. The media has struggled to handle this. They worry about appearing unfairly biased, and in the past, perhaps behaving as if both sides were equally guilty of lying seemed a good enough proxy for reality to avoid coming across as biased. But Donald Trump lies so much it’s thrown them off balance (if you don’t believe me, read on; I have evidence later), and his candidacy has sparked a conversation in the media world about how to handle a candidate so casually dishonest he himself may not know what is true and what is not.

Here, I’m going to use R to dig deeper into the question of how honest our politicians are, and whether one party lies more than another. All of my data was scraped from the website of PolitiFact, a popular and well-known fact checker with an excellent categorization system that makes analyzing their data easier. I present various graphics and tables showing who lies more, and what they lie about.

Data Extraction and Analysis

Before scraping, I use MySQL to create a database that will hold the data I scrape from PolitiFact. The SQL statements that define the tables in the database politifactscraper are shown below.

/* PolitiFactScraper_define.sql
 *
 * Defines tables used to hold information for my PolitiFact web scraper.
 */

/* Rating: contains rating schema
    aid: The rating id
    label: The name of the rating
*/
create table Rating (
    aid int auto_increment primary key,
    label char(15)
) auto_increment = 0;

/* Party: contains (political) party schema
    rid: The id of the party
    name: Contains the name of the party
*/
create table Party (
    rid int auto_increment primary key,
    name char(50)
) auto_increment = 0;

/* Speaker: Contains details about speakers for which statements exist
    pid: The id of the speaker
    name: The name of the speaker
    rid: The id of the party with whom the speaker is affiliated
*/
create table Speaker (
    pid int auto_increment primary key,
    name varchar(140) not null,
    rid int,
    foreign key (rid) references Party (rid) on update cascade on delete set null
) auto_increment = 1;

/* Stmnt: Contains statements made
    sid: The id of the statement
    aid: The id of the rating given to the statement
    text: The statement's text
    s_date: The date on which the statement was made
    pid: The id of the speaker who made the statement
*/
create table Stmnt (
    sid int auto_increment primary key,
    aid int,
    text varchar(2000) not null,
    s_date date,
    pid int not null,
    foreign key (pid) references Speaker (pid) on update cascade on delete cascade,
    foreign key (aid) references Rating (aid) on update cascade on delete set null
) auto_increment = 1;

/* About: Contains relations identifying an individual about whom the statement was made
    sid: The id of the statement
    pid: The id of the person about whom the statement was made

    NOTE: This table is not used currently
*/
create table About (
    sid int not null,
    pid int not null,
    primary key (sid, pid),
    foreign key (sid) references Stmnt (sid) on update cascade on delete cascade,
    foreign key (pid) references Speaker (pid) on update cascade on delete cascade
);

I wrote a Python program to scrape the data from PolitiFact and save the data in the database politifactscraper. The code for the program is listed below:

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import *
from datetime import datetime
import time
import html
import dateparser
import requests
import string
import pymysql as sql

def get_statements(people_dict):
    """
    :param people_dict: dict; A dictionary object with names for indices and a string of the form "/personalities/im-a-person/" that will be appended to the end of a PolitiFact URL

    :return: dict; A dict containing two entries: "Statements", with a list containing tuples of the form (name, url, page number, rating, date, text); and "Errors", a list of URLs that failed to be scraped

    This function scrapes PolitiFact's website, extracting information for all persons in people_dict and returning a list with tuples with the scraped information.
    """

    # Prepare session
    session = requests.Session()
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
               "Accept-Language":"en-US,en;q=0.8",
               "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
    waittime = 10     # Don't want to go too fast (preferred time according to site's robots.txt)
    politifact_base = "http://www.politifact.com"
    statements = set()
    error_pages = list()

    for name, link in people_dict.items():
        page = 0
        time.sleep(waittime + 1)     # Don't go too fast
        try:
            # Get page and process via BeautifulSoup
            src = session.get(politifact_base + link + "statements/?page=1", headers = headers)
            statement_pg = BeautifulSoup(src.text, "lxml")
            # Below's latter condition checks to see if there are more pages to "click"
            while page == 0 or statement_pg.find("", {"title": "Next"}) != None:
                page += 1     # Update page number
                try:
                    if page > 1:
                        # If we're on a new page, read it
                        time.sleep(waittime + 1)
                        src = session.get(politifact_base + link + "statements/?page=" + str(page), headers = headers)
                    statement_pg = BeautifulSoup(src.text, "lxml")
                    # Begin processing statements
                    for s in statement_pg.findAll("", {"class": "statement"}):
                        if s.find("div", {"class":"statement__source"}).a.get("href") == link:
                            # This statement was made by the individual whose page we are on
                            try:
                                statements.add((name.replace('  ', ' '),
                                                politifact_base + link,
                                                page,
                                                s.find("div", {"class":"meter"}).img.get("alt"),
                                                dateparser.parse(s.find("span", {"class":"article__meta"}).get_text()[3:]),
                                                s.find("p", {"class":"statement__text"}).a.get_text().strip().replace('\xa0', ' ')))
                            except Exception:
                                # If something bad happens, add a blank entry
                                statements.add((name.replace('  ', ' '),
                                                politifact_base + link,
                                                page,
                                                None,
                                                None,
                                                None))
                except (URLError, requests.exceptions.RequestException) as e:
                    # Print errors and add to list of error pages
                    print(e)
                    error_pages.append((name, politifact_base + link + "statements/?page=" + str(page)))
        except (URLError, requests.exceptions.RequestException) as e:
            print(e)
            error_pages.append((name, politifact_base + link + "statements/?page=1"))

    return {"Statements": list(statements), "Errors": error_pages}

def speakers_to_db(people_dict, party, cur):
    """
    :param people_dict: dict; A dictionary object with names for indices and a string of the form "/personalities/im-a-person/" that will be appended to the end of a PolitiFact URL
    :param party: string; An identifier for the political party associated with the individuals in people_dict
    :param cur: cursor; A pymysql cursor object

    This function enters the people in people_dict into the speaker table in the MySQL data base connected to by cur. Note that party must be an existing entry in the party table in the data base.
    """

    # Get party rid identifier
    res = cur.execute("select rid from party where name=\"" + party + "\";")
    if res == 1:
        rid = cur.fetchone()[0]
    else:
        raise RuntimeError("When I looked up party " + party + " in database I did not get exactly one result!")

    # Populate database
    for name in people_dict.keys():
        res = cur.execute("select name, rid from speaker where name='" + name.replace("'", "''") + "' and rid=" + str(rid) + ";")
        if res == 0:
            cur.execute("insert into speaker (name, rid) values ('" + name.replace("'", "''") +"'," + str(rid) + ");")

    cur.connection.commit()

def statements_to_db(stmnt_list, party, cur):
    """
    :param stmnt_list: list; A list containing statements to be entered into the data base connected to by cur
    :param party: string; A string identifying the party associated with the statements
    :param cur: cursor; A pymysql cursor object

    This function enters the statements in stmnt_list into the stmnt table in the data base connected to by cur. Be sure that all speakers in the list are already included in the speaker table in the data base.
    """

    # Create table of statement rating aid values
    res = cur.execute("select * from rating;")
    # There should be nine rating values
    if res == 9:
        # Table is a reverse crosswalk; given name of rating, it gives the aid value
        aid_table = dict()
        for _ in range(res):
            row = cur.fetchone()
            aid_table[row[1]] = row[0]
    else:
        raise RuntimeError("Something is wrong with the rating table in database; does not have appropriate number of entries!")

    # Create table for speakers, giving their pid numbers, all from selected party
    res = cur.execute("select pid, name from speaker where rid = (select rid from party where name = \"" + party + "\");")
    # The table should have at least one person in it
    if res > 0:
        # Table is a reverse crosswalk; given name of speaker, it gives the pid value
        pid_table = dict()
        for _ in range(res):
            row = cur.fetchone()
            pid_table[row[1].replace("  ", " ")] = row[0]
    else:
        raise RuntimeError("You should populate the table of speakers before trying to add statements to database!")

    # We can now finally start adding statements to the table
    for name, _, __, rating, date, text in stmnt_list["Statements"]:
        cur.execute("insert into stmnt (pid, aid, s_date, text) values (" +
                    str(pid_table[name.replace("  ", " ")]) + ", " + str(aid_table[rating]) + ", '" +
                    date.strftime('%Y%m%d') + "', '" + text.replace("'", "''") + "');")

    cur.connection.commit()

def main():
    conn = sql.connect(host='localhost', user = "root", passwd=my_pass, db="mysql", charset = "utf8")
    cur = conn.cursor()
    cur.execute("use politifactscraper;")

    politifact_base = "http://www.politifact.com"
    html = urlopen(politifact_base + "/personalities/")
    bsObj = BeautifulSoup(html.read(), "lxml")

    people = bsObj.findAll("li", {"class": "az-list__item"})

    people_affil = [p for p in people if (p.find("span", {"class":"people-party"}) != None)]

    republicans = [p for p in people_affil if (p.find("span", {"class":"people-party"}).get_text() == "Republican")]
    democrats = [p for p in people_affil if (p.find("span", {"class":"people-party"}).get_text() == "Democrat")]
    independents = [p for p in people_affil if (p.find("span", {"class":"people-party"}).get_text() == "Independent")]
    libertarians = [p for p in people_affil if (p.find("span", {"class":"people-party"}).get_text() == "Libertarian")]

    r_links = dict([(p.a.get_text(), p.a.get("href")) for p in republicans])
    d_links = dict([(p.a.get_text(), p.a.get("href")) for p in democrats])
    i_links = dict([(p.a.get_text(), p.a.get("href")) for p in independents])
    l_links = dict([(p.a.get_text(), p.a.get("href")) for p in libertarians])

    # Manually add notorious individuals
    r_links.update({
            "Mitch McConnel": "/personalities/mitch-mcconnell/",
            "Paul Ryan": "/personalities/paul-ryan/"
        })
    d_links.update({
            "Barack Obama": "/personalities/barack-obama/",
            "Joe Biden": "/personalities/joe-biden/",
            "Nancy Pelosi": "/personalities/nancy-pelosi/",
            "Harry Reid": "/personalities/harry-reid/"
        })

    # Links for select individuals
    u_links = {
        "Bloggers": "/personalities/blog-posting/",
        "Facebook posts": "/personalities/facebook-posts/",
        "Chain email": "/personalities/chain-email/"
    }

    speakers_to_db(l_links,"Libertarian",cur)
    speakers_to_db(r_links,"Republican",cur)
    speakers_to_db(d_links,"Democrat",cur)
    speakers_to_db(i_links,"Independent",cur)
    speakers_to_db(u_links,"Unaffiliated",cur)

    l_statements = get_statements(l_links)
    d_statements = get_statements(d_links)
    r_statements = get_statements(r_links)
    i_statements = get_statements(i_links)
    u_statements = get_statements(u_links)

    statements_to_db(l_statements, "Libertarian", cur)
    statements_to_db(d_statements, "Democrat", cur)
    statements_to_db(r_statements, "Republican", cur)
    statements_to_db(i_statements, "Independent", cur)
    statements_to_db(u_statements, "Unaffiliated", cur)

    cur.close()
    conn.close()

if __name__ == '__main__':
    main()
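
The lookups and inserts in speakers_to_db build their SQL by splicing names into the query string and escaping quotes by hand. A minimal sketch of the same "insert unless present" step with placeholders follows. It uses the standard-library sqlite3 module and an in-memory table purely so that it runs anywhere; the add_speaker helper and the toy table are hypothetical, not part of the scraper. With pymysql the idea is the same, except the placeholder is %s rather than ?.

```python
import sqlite3

def add_speaker(cur, name, rid):
    """Insert a speaker unless (name, rid) already exists; return True if a row was added."""
    # Placeholders let the driver handle quoting, so a name such as
    # "Martin O'Malley" needs no manual escaping.
    cur.execute("select pid from speaker where name = ? and rid = ?", (name, rid))
    if cur.fetchone() is None:
        cur.execute("insert into speaker (name, rid) values (?, ?)", (name, rid))
        return True
    return False

# In-memory stand-in for the speaker table (hypothetical, for illustration only)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table speaker (pid integer primary key, name text not null, rid integer)")

added_first = add_speaker(cur, "Martin O'Malley", 2)
added_again = add_speaker(cur, "Martin O'Malley", 2)
conn.commit()
print(added_first, added_again)
```

The second call adds nothing because the first one already stored the row, which mirrors the check-then-insert logic of speakers_to_db.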

Now that the data is in a database, it’s easy to access and process it using either R or Python. I will be using both languages for analyzing this data.

In R, I first assess how honest individuals who associate with a party are (I also include the “Unaffiliated” group, which consists not of politicians but of the unending stream of bloggers and Facebook/e-mail spam that we see on a daily basis). PolitiFact ratings can be seen as ordinal data, which means that metrics based on order, such as the median, are well-defined. I judge honesty by the median first, then break ties using the mean (which is not well-defined for ordinal data, so it serves only as a rough tie-breaker). In the database, I conveniently assigned the values in the aid column of the rating table so that they reflect the order of “honesty” for each possible rating, with 0 for “Pants on Fire!” and 5 for “True” (the flip-flopping scores have values 6 for “Full Flop” to 8 for “No Flip”, but are excluded from further analysis). The result is shown below:

library(dplyr)
library(magrittr)
library(htmlTable)
library(reshape2)
library(vcd)

# Get access to data base
db <- src_mysql("politifactscraper", password = my_pswd)
# Begin working with data base with dplyr
db %>%
    # Get stmnt table
    tbl("stmnt") %>%
    # Need to join with speaker table to get party id
    left_join(tbl(db, "speaker"), by = "pid") %>%
    as.data.frame %>%
    # Exclude flops
    filter(aid <= 5) %>%
    group_by(rid) %>%
    # Compute desired metrics per party
    summarize(med = median(aid), avg = mean(aid), num = length(aid)) %>%
    # Get party names
    left_join(tbl(db, "party") %>% as.data.frame, by = "rid") %>%
    arrange(desc(med), desc(avg)) %>%
    select(name, med, avg, num) %>%
    # Get rating names
    left_join(tbl(db, "rating") %>% as.data.frame, by = c("med" = "aid")) %>%
    mutate(avg = round(avg, digits = 2)) %$%
    # Finally put in a pretty HTML table for display in markdown
    htmlTable(select(., "Median Honesty" = label, "Average Honesty Score" = avg, "Rated Statements" = num), rnames = name, caption = "Party Honesty", tfoot = "Data source: PolitiFact")
Party Honesty
              Median Honesty  Average Honesty Score  Rated Statements
Independent   Mostly True     3.25                   181
Democrat      Half-True       3.07                   4026
Libertarian   Half-True       2.92                   147
Republican    Half-True       2.57                   5480
Unaffiliated  Pants on Fire!  1.06                   350
Data source: PolitiFact

Among the political parties, Independents, in the aggregate, seem to tell the truth the most, and Republicans the least. While Democrats, Libertarians, and Republicans all have median honesty scores of “Half-True”, the tie-breaker suggests Democrats are the most honest of the three and Republicans the least. Interestingly, PolitiFact rates Republicans more often than Democrats.

As for the Unaffiliated group, take anything you see on Facebook, on some random dude’s blog, or in a chain e-mail with a grain of salt; their median honesty is “Pants on Fire!”
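
The ordering rule used for the table (median first, mean only as a tie-breaker) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline I ran: the rank_groups helper and the three toy groups are made up, and only the 0 to 5 scoring scale comes from the rating table.

```python
from statistics import mean, median

# Rating labels indexed by score, following the 0-5 scale of the rating table
LABELS = ["Pants on Fire!", "False", "Mostly False", "Half-True", "Mostly True", "True"]

def rank_groups(scores_by_group):
    """Return (group, median, mean, count) tuples, most honest first."""
    rows = [(g, median(s), mean(s), len(s)) for g, s in scores_by_group.items()]
    # Sort by median first; the mean is consulted only when medians tie
    return sorted(rows, key=lambda r: (r[1], r[2]), reverse=True)

# Made-up scores for three hypothetical groups
toy = {
    "A": [3, 3, 4, 5, 2],   # median 3, mean 3.4
    "B": [3, 3, 3, 1, 0],   # median 3, mean 2.0
    "C": [4, 4, 5, 1, 0],   # median 4, mean 2.8
}
ranking = rank_groups(toy)
for group, med, avg, n in ranking:
    print(group, LABELS[int(med)], round(avg, 2), n)
```

Group C ranks first even though its mean is lower than A's, because the median decides the order; A then beats B on the mean.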

Okay, so that’s the parties. What about individuals? I repeat the above pipeline to see the top 20 most honest individuals on PolitiFact. Because some people have only a couple of ratings on file, I require that PolitiFact have rated more than 15 statements made by the individual for them to be included in the following lists.

db %>%
    # Get stmnt table
    tbl("stmnt") %>%
    as.data.frame %>%
    # Exclude flops
    filter(aid <= 5) %>%
    group_by(pid) %>%
    # Compute desired metrics per person (round ratings up, if needed)
    summarize(med = ceiling(median(aid)), avg = mean(aid), num = length(aid)) %>%
    filter(num > 15) %>%
    arrange(desc(med), desc(avg)) %>%
    slice(1:20) %>%
    # Need to join with speaker table to get speaker information
    left_join(tbl(db, "speaker") %>% as.data.frame, by = "pid") %>%
    # Get party names
    left_join(tbl(db, "party") %>% rename("p_name" = name) %>% as.data.frame, by = "rid") %>%
    select(name, p_name, med, avg, num) %>%
    # # Get rating names
    left_join(tbl(db, "rating") %>% as.data.frame, by = c("med" = "aid")) %>%
    mutate(avg = round(avg, digits = 2)) %$%
    # # Finally put in a pretty HTML table for display in markdown
    htmlTable(select(., "Party" = p_name, "Median Honesty" = label, "Average Honesty Score" = avg, "Rated Statements" = num), rnames = name, caption = "Top 20 Honest Entities", tfoot = "Data source: PolitiFact")
Top 20 Honest Entities
                    Party        Median Honesty  Average Honesty Score  Rated Statements
Alex Sink           Democrat     True            4                      19
Dennis Kucinich     Democrat     Mostly True     3.84                   25
Sheldon Whitehouse  Democrat     Mostly True     3.79                   24
Cory Booker         Democrat     Mostly True     3.6                    20
Mark Warner         Democrat     Mostly True     3.6                    20
Gina Raimondo       Democrat     Mostly True     3.59                   17
Sherrod Brown       Democrat     Mostly True     3.59                   34
Rob Portman         Republican   Mostly True     3.57                   47
Bill Nelson         Democrat     Mostly True     3.48                   25
David Axelrod       Democrat     Mostly True     3.39                   18
Hillary Clinton     Democrat     Mostly True     3.34                   292
Bernie Sanders      Independent  Mostly True     3.25                   107
John Kasich         Republican   Mostly True     3.25                   64
Fred Thompson       Republican   Mostly True     3.25                   16
Bill Richardson     Democrat     Mostly True     3.24                   17
Alan Grayson        Democrat     Mostly True     3.18                   34
Bill White          Democrat     Mostly True     3.08                   26
Tim Kaine           Democrat     Half-True       3.38                   50
Nathan Deal         Republican   Half-True       3.37                   49
Bill Clinton        Democrat     Half-True       3.29                   41
Data source: PolitiFact

Alex Sink, a Florida Democrat who ran for governor and the House of Representatives (and lost both races), has the highest rating. Rob Portman, the junior Senator from Ohio, is the most honest Republican, and Bernie Sanders the most honest Independent. Of those who ran for President in the 2016 election, Hillary Clinton, according to PolitiFact’s ratings, was the most honest candidate (Bernie Sanders was a close second), and John Kasich was the most honest Republican (in fact, tied with Bernie Sanders). Barack Obama, interestingly, does not appear on this list.
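
Two details of the per-person pipeline are easy to miss: ceiling(median(aid)) rounds a median that falls between two ratings up to the next rating, and filter(num > 15) keeps only speakers with 16 or more rated statements. A small Python sketch of both steps, with hypothetical helper names and made-up scores:

```python
import math
from statistics import median

def median_label_score(scores):
    """Median of ordinal scores, rounded up when it falls between two ratings."""
    return math.ceil(median(scores))

def eligible(scores_by_speaker, min_count=16):
    """Keep speakers with at least min_count rated statements (that is, more than 15)."""
    return {k: v for k, v in scores_by_speaker.items() if len(v) >= min_count}

even_case = [2, 2, 3, 3]             # the median 2.5 sits between two ratings
rounded = median_label_score(even_case)

toy = {"X": [3] * 16, "Y": [5] * 15}  # Y has too few ratings to be listed
kept = sorted(eligible(toy))
print(rounded, kept)
```

Because the median is compared before the mean, Tim Kaine (median “Half-True”, mean 3.38) sits below Bill White (median “Mostly True”, mean 3.08) in the table above.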

And now the list of shame: The most dishonest individuals.

db %>%
    # Get stmnt table
    tbl("stmnt") %>%
    as.data.frame %>%
    # Exclude flops
    filter(aid <= 5) %>%
    group_by(pid) %>%
    # Compute desired metrics per person (round ratings up, if needed)
    summarize(med = ceiling(median(aid)), avg = mean(aid), num = length(aid)) %>%
    filter(num > 15) %>%
    arrange(med, avg) %>%
    slice(1:20) %>%
    # Need to join with speaker table to get speaker information
    left_join(tbl(db, "speaker") %>% as.data.frame, by = "pid") %>%
    # Get party names
    left_join(tbl(db, "party") %>% rename("p_name" = name) %>% as.data.frame, by = "rid") %>%
    select(name, p_name, med, avg, num) %>%
    # # Get rating names
    left_join(tbl(db, "rating") %>% as.data.frame, by = c("med" = "aid")) %>%
    mutate(avg = round(avg, digits = 2)) %$%
    # # Finally put in a pretty HTML table for display in markdown
    htmlTable(select(., "Party" = p_name, "Median Honesty" = label, "Average Honesty Score" = avg, "Rated Statements" = num), rnames = name, caption = "Top 20 Dishonest Entities", tfoot = "Data source: PolitiFact")
Top 20 Dishonest Entities
                                             Party         Median Honesty  Average Honesty Score  Rated Statements
Chain email                                  Unaffiliated  Pants on Fire!  0.78                   178
Bloggers                                     Unaffiliated  Pants on Fire!  0.92                   72
Democratic Party of Wisconsin                Democrat      False           1.5                    24
Ben Carson                                   Republican    False           1.54                   28
Michele Bachmann                             Republican    False           1.59                   61
Facebook posts                               Unaffiliated  False           1.65                   100
Herman Cain                                  Republican    False           1.77                   26
Donald Trump                                 Republican    False           1.82                   327
Ken Cuccinelli                               Republican    False           2.2                    20
Democratic Congressional Campaign Committee  Democrat      Mostly False    1.44                   34
National Republican Senatorial Committee     Republican    Mostly False    1.87                   30
Allen West                                   Republican    Mostly False    1.88                   26
Paul Broun                                   Republican    Mostly False    2                      19
National Republican Congressional Committee  Republican    Mostly False    2.08                   53
Reince Priebus                               Republican    Mostly False    2.12                   24
Ted Cruz                                     Republican    Mostly False    2.2                    116
Tommy Thompson                               Republican    Mostly False    2.26                   27
Newt Gingrich                                Republican    Mostly False    2.27                   77
Republican Party of Florida                  Republican    Mostly False    2.29                   34
Rick Santorum                                Republican    Mostly False    2.32                   59
Data source: PolitiFact

I allowed chain e-mails, bloggers, and Facebook posts to appear in this list just to make the following point: they’re full of shit. Go to legitimate news sources to get your information. (In my defense as a blogger, I try to be pretty transparent; judge my honesty as you will.) The two “Democrats” that appear on this list are organizations, the Democratic Party of Wisconsin (is this why Scott Walker is a thing?) and the DCCC. Ben Carson is the most dishonest individual on this list and thus the most dishonest person who ran for President in the 2016 election season, according to PolitiFact. (In Dr. Carson’s defense, though, I don’t know if it’s “dishonesty” per se or just ignorance/stupidity.) Donald Trump, according to PolitiFact, is extremely dishonest, yet somehow Hillary is the corrupt liar.

In PolitiFact’s data, 98 speakers had more than 15 ratings, so here is the full list, with rankings provided:

db %>%
    # Get stmnt table
    tbl("stmnt") %>%
    as.data.frame %>%
    # Exclude flops
    filter(aid <= 5) %>%
    group_by(pid) %>%
    # Compute desired metrics per person (round ratings up, if needed)
    summarize(med = ceiling(median(aid)), avg = mean(aid), num = length(aid)) %>%
    filter(num > 15) %>%
    arrange(desc(med), desc(avg)) %>%
    # Need to join with speaker table to get speaker information
    left_join(tbl(db, "speaker") %>% as.data.frame, by = "pid") %>%
    # Get party names
    left_join(tbl(db, "party") %>% rename("p_name" = name) %>% as.data.frame, by = "rid") %>%
    select(name, p_name, med, avg, num) %>%
    # # Get rating names
    left_join(tbl(db, "rating") %>% as.data.frame, by = c("med" = "aid")) %>%
    mutate(avg = round(avg, digits = 2), rank = row_number()) %$%
    # # Finally put in a pretty HTML table for display in markdown
    htmlTable(select(., "Party" = p_name, "Rank" = rank, "Median Honesty" = label, "Average Honesty Score" = avg, "Rated Statements" = num), rnames = name, caption = "Honesty of Politically Active Entities", tfoot = "Data source: PolitiFact")
Honesty of Politically Active Entities
                                             Party         Rank  Median Honesty  Average Honesty Score  Rated Statements
Alex Sink                                    Democrat      1     True            4                      19
Dennis Kucinich                              Democrat      2     Mostly True     3.84                   25
Sheldon Whitehouse                           Democrat      3     Mostly True     3.79                   24
Cory Booker                                  Democrat      4     Mostly True     3.6                    20
Mark Warner                                  Democrat      5     Mostly True     3.6                    20
Gina Raimondo                                Democrat      6     Mostly True     3.59                   17
Sherrod Brown                                Democrat      7     Mostly True     3.59                   34
Rob Portman                                  Republican    8     Mostly True     3.57                   47
Bill Nelson                                  Democrat      9     Mostly True     3.48                   25
David Axelrod                                Democrat      10    Mostly True     3.39                   18
Hillary Clinton                              Democrat      11    Mostly True     3.34                   292
Bernie Sanders                               Independent   12    Mostly True     3.25                   107
John Kasich                                  Republican    13    Mostly True     3.25                   64
Fred Thompson                                Republican    14    Mostly True     3.25                   16
Bill Richardson                              Democrat      15    Mostly True     3.24                   17
Alan Grayson                                 Democrat      16    Mostly True     3.18                   34
Bill White                                   Democrat      17    Mostly True     3.08                   26
Tim Kaine                                    Democrat      18    Half-True       3.38                   50
Nathan Deal                                  Republican    19    Half-True       3.37                   49
Bill Clinton                                 Democrat      20    Half-True       3.29                   41
Barack Obama                                 Democrat      21    Half-True       3.28                   596
Jeb Bush                                     Republican    22    Half-True       3.24                   79
John Cornyn                                  Republican    23    Half-True       3.23                   26
Barbara Buono                                Democrat      24    Half-True       3.12                   16
Charlie Crist                                Democrat      25    Half-True       3.12                   80
George LeMieux                               Republican    26    Half-True       3.12                   17
Wendy Davis                                  Democrat      27    Half-True       3.07                   27
David Cicilline                              Democrat      28    Half-True       3.07                   29
Gary Johnson                                 Libertarian   29    Half-True       3.06                   51
Kay Bailey Hutchison                         Republican    30    Half-True       3.06                   17
Rand Paul                                    Republican    31    Half-True       3.04                   51
Joe Biden                                    Democrat      32    Half-True       2.99                   75
George Allen                                 Republican    33    Half-True       2.96                   26
Paul Ryan                                    Republican    34    Half-True       2.95                   65
Martin O’Malley                              Democrat      35    Half-True       2.94                   18
Tim Pawlenty                                 Republican    36    Half-True       2.94                   17
Chris Christie                               Republican    37    Half-True       2.93                   102
Bob McDonnell                                Republican    38    Half-True       2.91                   35
Jon Huntsman                                 Republican    39    Half-True       2.89                   18
Tammy Baldwin                                Democrat      40    Half-True       2.88                   25
John McCain                                  Republican    41    Half-True       2.88                   183
Greg Abbott                                  Republican    42    Half-True       2.86                   43
Ron Paul                                     Republican    43    Half-True       2.85                   40
Marco Rubio                                  Republican    44    Half-True       2.84                   148
Gwen Moore                                   Democrat      45    Half-True       2.84                   19
Kendrick Meek                                Democrat      46    Half-True       2.84                   19
Lincoln Chafee                               Democrat      47    Half-True       2.83                   18
Russ Feingold                                Democrat      48    Half-True       2.81                   21
Debbie Wasserman Schultz                     Democrat      49    Half-True       2.81                   47
Mitch McConnel                               Republican    50    Half-True       2.79                   28
Rick Scott                                   Republican    51    Half-True       2.77                   142
Karl Rove                                    Republican    52    Half-True       2.76                   17
Ron Johnson                                  Republican    53    Half-True       2.73                   44
Mitt Romney                                  Republican    54    Half-True       2.7                    206
David Perdue                                 Republican    55    Half-True       2.69                   16
Republican National Committee                Republican    56    Half-True       2.68                   34
Scott Walker                                 Republican    57    Half-True       2.67                   172
Mary Burke                                   Democrat      58    Half-True       2.65                   34
Florida Democratic Party                     Democrat      59    Half-True       2.64                   25
Rudy Giuliani                                Republican    60    Half-True       2.6                    47
Rick Perry                                   Republican    61    Half-True       2.59                   169
Mike Pence                                   Republican    62    Half-True       2.58                   38
Tom Barrett                                  Democrat      63    Half-True       2.52                   25
Republican Governors Association             Republican    64    Half-True       2.5                    18
Harry Reid                                   Democrat      65    Half-True       2.5                    24
Ted Strickland                               Democrat      66    Half-True       2.48                   21
Nancy Pelosi                                 Democrat      67    Half-True       2.38                   29
John Boehner                                 Republican    68    Mostly False    2.64                   69
Mike Huckabee                                Republican    69    Mostly False    2.63                   41
Eric Cantor                                  Republican    70    Mostly False    2.56                   34
Dick Cheney                                  Republican    71    Mostly False    2.53                   17
Carly Fiorina                                Republican    72    Mostly False    2.45                   22
Dan Patrick                                  Republican    73    Mostly False    2.41                   22
Terry McAuliffe                              Democrat      74    Mostly False    2.4                    30
Josh Mandel                                  Republican    75    Mostly False    2.39                   28
Crossroads GPS                               Republican    76    Mostly False    2.37                   19
David Dewhurst                               Republican    77    Mostly False    2.35                   40
Sarah Palin                                  Republican    78    Mostly False    2.33                   39
Rick Santorum                                Republican    79    Mostly False    2.32                   59
Republican Party of Florida                  Republican    80    Mostly False    2.29                   34
Newt Gingrich                                Republican    81    Mostly False    2.27                   77
Tommy Thompson                               Republican    82    Mostly False    2.26                   27
Ted Cruz                                     Republican    83    Mostly False    2.2                    116
Reince Priebus                               Republican    84    Mostly False    2.12                   24
National Republican Congressional Committee  Republican    85    Mostly False    2.08                   53
Paul Broun                                   Republican    86    Mostly False    2                      19
Allen West                                   Republican    87    Mostly False    1.88                   26
National Republican Senatorial Committee     Republican    88    Mostly False    1.87                   30
Democratic Congressional Campaign Committee  Democrat      89    Mostly False    1.44                   34
Ken Cuccinelli                               Republican    90    False           2.2                    20
Donald Trump                                 Republican    91    False           1.82                   327
Herman Cain                                  Republican    92    False           1.77                   26
Facebook posts                               Unaffiliated  93    False           1.65                   100
Michele Bachmann                             Republican    94    False           1.59                   61
Ben Carson                                   Republican    95    False           1.54                   28
Democratic Party of Wisconsin                Democrat      96    False           1.5                    24
Bloggers                                     Unaffiliated  97    Pants on Fire!  0.92                   72
Chain email                                  Unaffiliated  98    Pants on Fire!  0.78                   178
Data source: PolitiFact

Barack Obama appears 21st on this list, followed by Vice President Joe Biden (32), Speaker of the House Paul Ryan (34), Senate Majority Leader Mitch McConnell (50), Senate Minority Leader Harry Reid (65), and House Minority Leader Nancy Pelosi (67). Finally, it seems the most dishonest Democratic individual is Terry McAuliffe, the Governor of Virginia.

We can also see which parties account for more honesty/deceit. Below I create a table showing, for each rating, the percentage of statements with that rating that each party is responsible for.

# A base data frame used throughout
base_df = db %>%
    # Get stmnt table
    tbl("stmnt") %>%
    # Need to join with speaker table to get party id
    left_join(tbl(db, "speaker"), by = "pid") %>%
    as.data.frame %>%
    # Exclude flops
    filter(aid <= 5)

# Count statements by rating and party
r_count = base_df %>%
    group_by(aid, rid) %>%
    summarize(value = n()) %>%
    left_join(tbl(db, "party") %>% as.data.frame, by = "rid") %>%
    left_join(tbl(db, "rating") %>% as.data.frame, by = "aid") %>%
    select(name, label, value) %>%
    dcast(label ~ name)


## Adding missing grouping variables: `aid`


# Convert to matrix, and get proportions
r_mat = matrix(r_count[,-1] %>% as.matrix, nrow = 6, dimnames = list(r_count$label, names(r_count)[-1]))
# Order rows correctly
r_mat = r_mat[c("Pants on Fire!", "False", "Mostly False", "Half-True", "Mostly True", "True"),]
(r_mat / rowSums(r_mat)) %>%
    round(digits = 2) %>%
    `*`(100) %>%
    htmlTable(caption = "Proportion of Statements of Certain Truthfulness Made by Members of Parties", tfoot = "Data source: PolitiFact")


##                Democrat Independent Libertarian Republican Unaffiliated
## Pants on Fire!       22           0           1         54           22
## False                30           1           1         63            4
## Mostly False         34           2           1         61            2
## Half-True            43           1           2         52            1
## Mostly True          49           3           2         45            1
## True                 48           2           2         48            1
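
Note that the percentages above are row-wise: r_mat is divided by rowSums(r_mat), so each row adds up to roughly 100 and answers "of all statements with this rating, what share came from each party?" A minimal Python sketch of that arithmetic, using made-up counts and a hypothetical row_percentages helper:

```python
def row_percentages(counts):
    """Convert each row of counts to whole-number percentages of the row total."""
    result = []
    for row in counts:
        total = sum(row)
        # Round the proportion to two decimals first, then scale to percent
        result.append([round(round(c / total, 2) * 100) for c in row])
    return result

# Hypothetical counts: rows are ratings, columns are parties
toy_counts = [
    [10, 30, 10],
    [25, 25, 50],
]
percentages = row_percentages(toy_counts)
print(percentages)
```

Since each row is normalized separately, a party with more rated statements overall (Republicans have 5480 to the Democrats' 4026) will tend to hold a larger share of every row, so the interesting signal is how a party's share changes from one rating to the next.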

Republicans account for a large share of false statements, much more than Democrats. In fact, a chi-square test rejects the null hypothesis that statement rating and political party are independent.

# Chi-Square Test for Independence
chisq.test(r_mat)


## 
##  Pearson's Chi-squared test
## 
## data:  r_mat
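## X-squared = 1260.8, df = 20, p-value < 2.2e-16

# Data frame for the mosaic plot: one row per statement, with party and rating labels
mos_df = base_df %>%
    left_join(tbl(db, "party") %>% as.data.frame, by = "rid") %>%
    left_join(tbl(db, "rating") %>% as.data.frame, by = "aid") %>%
    select(Rating = label, Party = name.y)

mos_df$Rating %<>% factor(levels = c("Pants on Fire!", "False", "Mostly False", "Half-True", "Mostly True", "True"))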
mosaic(~ Party + Rating, data = mos_df, shade = TRUE, gp = shading_hsv,  labeling_args = list(abbreviate_labs = c(Rating = TRUE, Party = TRUE)))
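
For readers more at home in Python, here is a minimal sketch of what the row percentages and the independence test above compute. The counts are invented for illustration (they are not PolitiFact counts), and the names row_percentages and chi_square are mine, not part of any library. The sketch stops at the test statistic and degrees of freedom; turning those into a p-value needs a chi-square distribution function, which R's chisq.test supplies.

```python
# Hypothetical counts, NOT PolitiFact data.
# Rows: ratings ("False", "Half-True", "True"); columns: two parties.
counts = [
    [30, 10],
    [25, 25],
    [20, 40],
]

def row_percentages(table):
    """Each cell as a percentage of its row total."""
    return [[100 * cell / sum(row) for cell in row] for row in table]

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rating and party were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

stat, df = chi_square(counts)
print(row_percentages(counts))
print(round(stat, 2), df)
```

The degrees of freedom follow the same rule as in the R output: a table with 6 ratings and 5 parties gives (6 - 1) × (5 - 1) = 20.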

I would also like to see what people are lying about (as opposed to how much they lie). For this, I'm simply going to make a word cloud, using the Python package wordcloud (read more about it here). The code for creating the word clouds is listed below:

from os import path
# Note: scipy.misc.imread was removed in SciPy 1.2; imageio.imread is a common replacement
from scipy.misc import imread
import matplotlib.pyplot as plt
from pylab import rcParams
import pymysql as sql
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# This line is necessary for the plot to appear in a Jupyter notebook
get_ipython().magic('matplotlib inline')
# Control the default size of figures in this Jupyter notebook
rcParams['figure.figsize'] = 30, 18

d = path.dirname(my_pics)

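def get_party_statements(name, cur):
    """
    :param name: string; The name of the party for which to look up data
    :param cur: cursor; A pymysql cursor object

    :return: dict; Contains two strings, keyed "truth" and "lie", each consisting of the concatenated statements of all members of the political party identified by name

    This function extracts from a database (connected to via cur) statements made by members of a party identified by name. True statements are those with at least a rating of "half-true"; all else are treated as false.
    """

    party_true = ""
    party_lie = ""

    # stmnt carries no party id of its own, so party members are looked up through the speaker table
    members = "pid in (select pid from speaker where rid = (select rid from party where name = \"" + name + "\"))"

    # Get all true statements by party
    nstate = cur.execute("select text from stmnt where aid >= 3 and aid <= 5 and " + members)
    for i in range(nstate):
        party_true += " " + cur.fetchone()[0]

    # Get all false statements by party
    nstate = cur.execute("select text from stmnt where aid >= 0 and aid <= 2 and " + members)
    for i in range(nstate):
        party_lie += " " + cur.fetchone()[0]

    return {"truth": party_true, "lie": party_lie}

def get_people_statements(name, cur):
    """
    Like get_party_statements, but for the statements of the single speaker identified by name.
    """

    person_true = ""
    person_lie = ""

    # Get all true statements by person
    nstate = cur.execute("select text from stmnt where aid >= 3 and aid <= 5 and pid = (select pid from speaker where name = \"" + name + "\")")
    for i in range(nstate):
        person_true += " " + cur.fetchone()[0]

    # Get all false statements by person
    nstate = cur.execute("select text from stmnt where aid >= 0 and aid <= 2 and pid = (select pid from speaker where name = \"" + name + "\")")
    for i in range(nstate):
        person_lie += " " + cur.fetchone()[0]

    return {"truth": person_true, "lie": person_lie}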

def generate_wordcloud(text, logo, font):
    """
    :param text: string; A text string that will be turned into a word cloud
    :param logo: array; An object representing the image used for masking (implicitly created by imread in scipy)
    :param font: string; An identifier for the font to be used when creating the word cloud

    Creates a word cloud from text using the font specified by parameter font and the mask specified by logo.
    """

    wc = WordCloud(font_path = font, background_color="white", max_words=200, mask=logo,
               # set.update() returns None, so build the extended stopword set with a union instead
               stopwords=STOPWORDS | {"said", "say", "says"})
    # generate word cloud
    wc.generate(text)

    # create coloring from image
    image_colors = ImageColorGenerator(logo)

    # show
    # recolor wordcloud and show
    # we could also give color_func=image_colors directly in the constructor
    plt.imshow(wc.recolor(color_func=image_colors))
    plt.axis("off")
    plt.show()

def main():
    hillary_logo = imread(path.join(d, "Hillary_for_America_2016_logo.png"))
    democrat_logo = imread(path.join(d, "Democratslogo.png"))
    republican_logo = imread(path.join(d, "Republicanlogo.png"))
    trump_face = imread(path.join(d, "DonaldTrump_bp.png"))

    conn = sql.connect(host='localhost', user = "root", passwd=my_pass, db="mysql", charset = "utf8")
    cur = conn.cursor()
    cur.execute("use politifactscraper;")

    r_statements = get_party_statements("Republican", cur)
    d_statements = get_party_statements("Democrat", cur)

    u_statements = get_party_statements("Unaffiliated", cur)

    hillary_statements = get_people_statements("Hillary Clinton", cur)
    trump_statements = get_people_statements("Donald Trump", cur)

    neutral_font = "OldNewspaperTypes"
    true_font = "BodoniXT"
    lie_font = "DK Coal Brush"

    # Pictures!
    generate_wordcloud(d_statements["truth"] + " " + d_statements["lie"], democrat_logo, neutral_font)
    generate_wordcloud(d_statements["truth"], democrat_logo, true_font)
    generate_wordcloud(d_statements["lie"], democrat_logo, lie_font)
    generate_wordcloud(r_statements["truth"] + " " + r_statements["lie"], republican_logo, neutral_font)
    generate_wordcloud(r_statements["truth"], republican_logo, true_font)
    generate_wordcloud(r_statements["lie"], republican_logo, lie_font)
    generate_wordcloud(hillary_statements["truth"] + " " + hillary_statements["lie"], hillary_logo, neutral_font)
    generate_wordcloud(hillary_statements["truth"], hillary_logo, true_font)
    generate_wordcloud(hillary_statements["lie"], hillary_logo, lie_font)
    generate_wordcloud(trump_statements["truth"] + " " + trump_statements["lie"], trump_face, neutral_font)
    generate_wordcloud(trump_statements["truth"], trump_face, true_font)
    generate_wordcloud(trump_statements["lie"], trump_face, lie_font)
    generate_wordcloud(u_statements["truth"] + " " + u_statements["lie"], internet_logo, neutral_font)
    generate_wordcloud(u_statements["truth"] , internet_logo, true_font)
    generate_wordcloud(u_statements["lie"], internet_logo, lie_font)

    cur.close()
    conn.close()

if __name__ == '__main__':
    main()
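
The truth/lie split described in the docstring above can be tried without a MySQL server. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module instead of pymysql, a toy schema that borrows the table and column names appearing in this post (stmnt, speaker, party; aid, pid, rid), invented rows, and a hypothetical helper split_by_rating. It also assumes, as the listing does, that rating ids of 3 and above mean "at least half-true".

```python
import sqlite3

# Toy schema modeled on the table and column names used in this post.
# All rows below are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    create table party   (rid integer, name text);
    create table speaker (pid integer, rid integer, name text);
    create table stmnt   (aid integer, pid integer, text text);
    insert into party   values (1, 'Party A');
    insert into speaker values (10, 1, 'Speaker One'), (11, 1, 'Speaker Two');
    insert into stmnt   values (0, 10, 'claim zero'), (2, 11, 'claim two'),
                               (3, 10, 'claim three'), (5, 11, 'claim five');
""")

def split_by_rating(cur, party_name, threshold=3):
    """Concatenate a party's statements into "truth" (aid >= threshold) and "lie" (aid < threshold)."""
    query = """
        select s.aid, s.text
        from stmnt s
        join speaker p on p.pid = s.pid
        join party   r on r.rid = p.rid
        where r.name = ?
        order by s.aid
    """
    truth, lie = [], []
    # The party name is passed as a bound parameter rather than pasted into the SQL string
    for aid, text in cur.execute(query, (party_name,)):
        (truth if aid >= threshold else lie).append(text)
    return {"truth": " ".join(truth), "lie": " ".join(lie)}

result = split_by_rating(cur, "Party A")
print(result)
conn.close()
```

Passing the name as a bound parameter avoids quoting problems with names that contain quotation marks. sqlite3 uses ? as its placeholder, while pymysql uses %s.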

In the process of making these word clouds, I used the fonts Old Newspaper Types, Bodoni XT, and DK Coal Brush. The source for the Donald Trump mask is here (after some editing); the WordPress, Gmail, and Facebook logos were found via Google search, and the rest are from Wikimedia.

Mosaic Plot

I first show the mosaic plot I created earlier. As you can see, Republicans account for more falsehoods than Democrats, and more than we would expect if there were no relationship between political party and truthfulness. (Blue indicates more than expected, red less than expected.)
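
To make "more than expected" and "less than expected" concrete: as far as I know, vcd's shaded mosaics color each tile by its Pearson residual, the gap between the observed count and the count expected under independence, scaled by the square root of the expected count. The sketch below computes those residuals in plain Python. The counts are hypothetical (not PolitiFact data) and pearson_residuals is my own helper name.

```python
from math import sqrt

def pearson_residuals(table):
    """(observed - expected) / sqrt(expected) per cell, with expected counts taken under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            out.append((observed - expected) / sqrt(expected))
        residuals.append(out)
    return residuals

# Hypothetical counts, NOT PolitiFact data. Rows: parties; columns: "False", "True".
toy = [[30, 20],
       [10, 40]]

for party, row in zip(["Party A", "Party B"], pearson_residuals(toy)):
    # A positive residual means more statements in the cell than independence predicts
    print(party, ["more than expected" if r > 0 else "less than expected" for r in row])
```

Squaring the residuals and summing them over all cells gives the chi-square statistic, so the plot and the test are two views of the same departure from independence.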

All Statements Word Clouds

First, I show a word cloud for all statements made by the entities considered that were rated by PolitiFact, true or not.





True Statements Word Clouds

Next I show the word clouds formed by the statements made by the entities considered that PolitiFact has rated as being at least “half-true”.





False Statements Word Clouds

Finally, I show the word clouds formed by the statements made by the entities considered that PolitiFact has rated as being at least as dishonest as “mostly false”.





What do we see? All of this is in the eye of the beholder (word clouds are not “scientific”), but I noticed some patterns. The presidential candidates mention their opponents frequently, and both say many things, true or false, that involve their opponents. Meanwhile, the political party word clouds are more focused on policy. The words “Scott”, “Walker”, and “Wisconsin” appear a lot in the Democrats’ “lie” cloud (perhaps thanks to the Wisconsin Democratic Party, the most dishonest entity in my earlier lists), and Republicans lie a lot about one person: Obama. As for Facebook posts, chain e-mails, and bloggers, they like to talk about Obama, and my guess is that lies propagated through these channels generally suggest that Obama will do something new this year that will hurt people.

Thoughts on PolitiFact

All of the data used in this analysis comes from one source: PolitiFact. There are those who believe that this makes these results questionable. I would not be a good statistician if I did not disclose potential problems, so here I discuss those my analysis faces.

Political Bias

Upon seeing a result saying that PolitiFact rates Republicans as more dishonest than Democrats, conservatives may immediately reach for the “biased” argument: that the people responsible for maintaining PolitiFact (the Tampa Bay Times) have a political agenda that is reflected in their scores, and thus we should not trust their data.

Sure, “bias” (in this typical sense of the word) is a possibility. I will acknowledge the possibility. Unfortunately, though, many have taken to accusing supposedly authoritative sources of being “biased” whenever those sources contradict their existing beliefs. Trump, and especially Trump supporters, have turned this into an art, but Fox News has been laying the foundation for this line of attack for decades and many on the left now resort to it as well (I felt the need to have this discussion after a Bernie Sanders supporter, in a Facebook argument, accused PolitiFact of being biased against Bernie Sanders in favor of Hillary Clinton). It’s a wonderful line of attack, because there’s usually no way for the victim to prove she is not “biased”, or the accuser cares little for whatever proof she finds.

Accusing PolitiFact of being “biased” (along with many other traditional, well-reputed media outlets) quickly angers me. It fuels tribalism by eroding our “common ground” information sources that we accept as reporting an objective, baseline “truth” from which we can then debate the approach to take to society’s problems. Continuing along this path allows one to eventually paint reality as he hallucinates it to be, and anyone who disagrees with that vision with an argument as simple as “that’s not true” is “biased”. If we continue to refuse to accept opposing views because the source is “biased”, we will no longer have a functioning democracy; minds will never change, we will become locked into our tribes of yes-men with common broken-clock media, and voting will simply become a battle of wills between two equally mad fictions.

So if you’re going to criticize these results because PolitiFact is “biased”, you’re hurtling us further toward post-truth politics. I will resist this dangerous trend.

Curation Bias

While I will simply refuse to entertain the possibility of political bias, there are other ways this data could become “biased” that have little to do with any political beliefs by the individuals at PolitiFact. Fact checking is not a science, and other sources of bias could be introduced.

When I mention “curation bias”, I reference the fact that fact-checkers must decide what to fact-check. They may be drawn to fact-check statements that:

  • Are hot button issues (as a former intern in a lobbying firm, I can promise you that a lot more goes on in Washington than just the headline makers, much of which is important but not nearly as exciting or “sexy” to discuss)
  • Are made by prominent individuals (Barack Obama has a lot of his statements rated, unlike Rep. Rob Bishop, who’s been in Congress for fourteen years and has not a single statement on record; I was also sad to see Jill Stein did not have a file)
  • Sound like they may be false (so there is a greater propensity to show that a statement is false than to show that it is true; additionally, the writer may have some initial idea about the truth of the statement, so if a statement with many technical details about an esoteric topic is made, the fact-checker may not consider it)

These are more serious threats to the quality of this analysis, ones for which there is no fix and for which the effect is unclear. I won’t dare to claim that my analysis is immune to them and can’t be tainted by bias. Nevertheless, I believe that if we want an idea of honesty in politics, we can’t do much better than use this data.
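
To see how the third kind of selection could distort the numbers, consider a deliberately crude sketch. It assumes that a speaker's statements are false at some underlying rate and that false and true statements are picked for checking with different probabilities. Every number is hypothetical and chosen only to show the mechanism, and observed_false_share is my own helper name.

```python
def observed_false_share(true_false_rate, p_check_false, p_check_true):
    """Share of *checked* statements that are false, given how likely false and true statements are to be checked."""
    checked_false = true_false_rate * p_check_false
    checked_true = (1 - true_false_rate) * p_check_true
    return checked_false / (checked_false + checked_true)

# Hypothetical numbers, not estimates of anything PolitiFact does.
equal_scrutiny = observed_false_share(0.30, 0.20, 0.20)
skewed_scrutiny = observed_false_share(0.30, 0.60, 0.20)
print(equal_scrutiny, skewed_scrutiny)
```

With equal scrutiny, the checked sample reproduces the underlying 30 percent. If false statements are three times as likely to be checked, the same speaker appears to be false about 56 percent of the time. This is why the falsehood rates in this analysis describe the statements PolitiFact chose to rate, not everything a politician says.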

Conclusion

On Tuesday, this nightmare of an election will be over. Americans will cast their votes, and hopefully they make the right decision. I will not lie: I am very nervous about this election. As of this writing, FiveThirtyEight is showing a close election. I hope that this analysis will highlight the source of dishonesty in this election, and people will not judge both sides as equally bad or equally guilty. That is simply not the state of reality, and if we are going to move past the problems we have seen in our democracy, we will need to come to terms with reality.


To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.
