Who are the Swedish radio P1 summer guests? Answer via Wikidata

[This article was first published on Maëlle, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This week, a very promising new R blog was launched, namely the blog of Eric Persson a.k.a as expersso on Twitter. I had really been looking forward to this because expersso’s code screenshots have always been quite cool, so seeing his no longer being limited to them is awesome! His first articles series is about a game, you should really check it out. (PSA: if you post screenshots of R code on Twitter, have a look at Sean Kross’ codefinch package!).

Because I’m a nosy person I asked Eric whether he was Swedish, his last name being quite Swedish-looking in my opinion. He is, which made me wonder about Swedish blog topics and actually decided to use one Swedish topic I came up with, the summer guests of the Swedish radio P1! Every summer since 1959, P1 selects a bunch of famous or interesting people and have them record a bit more than one hour program where they’re free to discuss what they want (important events of their life for instance) and to choose the musical breaks (which you don’t get to listen entirely to in the online version because of copyright stuff). The program is then broadcasted in the summer, one guest a day from the end of June to the beginning of August. There’s even a winter version now but I’ll ignore it because it’s too hot here in Barcelona to even think about winter.

It’s a very cool radio program in my opinion. I discovered it at the end of my 5-month research stay in Gothenburg in 2010 and decided it’d be one way to keep my Swedish skills up to date (my other methods include listening to ABBA in Swedish and reading Camilla Läckberg’s novels). I haven’t listened to that many guests but I really enjoy it when I do, and I like how diverse the list of guests is. In this post, I’ll actually try to have a look at the occupations of the guests via Wikidata!

How to get the data?

In order to answer my question I needed a list of all summer guests, and their occupations. I first thought I’d resort to webscraping P1 website, which would have been a good way to provide expersso with an opportunity to give me some constructive feedback. However after websurfing a bit to assess my options, I realized I could use Wikipedia data obtained through APIs rather than webscraping as in my posts about famous dead people. Here was my strategy.

Getting the list of all summer guests from 1959

I downloaded the table of all summer guests from 1959 from this page, which gave me their names and the dates of their episode(s) – yeah some people are invited back.

# all summer guests
sommargaester <- readr::read_csv("data/p1sommar.csv", col_names = FALSE, 
                                 locale = readr::locale(encoding = "latin1"))
knitr::kable(sommargaester[1:10,])
X1X2X3X4X5X6X7
Aaro,Lars-Eric110714NANANANANA
Abdelhadi,Magdi070809120716NANANANA
Abdulle,Sherihan ‘Cherrie’170706NANANANANA
Abele,Anton130725NANANANANA
Abenius,Folke800627NANANANANA
Abrahamson,Kjell-Albin880809NANANANANA
Ackebo,Lena970708NANANANANA
Adami,Zanyar060707NANANANANA
Adamo,Amelia860806870718080701NANANA
Adams,Maud920703070805NANANANA
# get their names
sommargaester_names <- unique(sommargaester$X1)

# for putting names in the right order for later queries
transform_name <- function(name){
  paste(stringr::str_split(name, ",",
                           simplify = TRUE)[2],
        stringr::str_split(name, ",",
                           simplify = TRUE)[1])
}

pretty_sommargaester_names <- purrr::map_chr(sommargaester_names, transform_name)
sommargaester_names <- tibble::tibble(name = pretty_sommargaester_names,
                                      X1 = sommargaester_names)
sommargaester <- dplyr::left_join(sommargaester, sommargaester_names,
                                  by = "X1")

sommargaester <- dplyr::select(sommargaester, - X1)
# transform the date
sommargaester <- tidyr::gather(sommargaester, "rep", "date", X2:X7)
sommargaester <- dplyr::group_by(sommargaester, name)
sommargaester <- dplyr::mutate(sommargaester, rep = 1:n())
sommargaester <- dplyr::ungroup(sommargaester)

sommargaester <- dplyr::filter(sommargaester, !is.na(date))
# remove winter
sommargaester <- dplyr::filter(sommargaester, !stringr::str_detect(date, "V"))
# remove repeat episodes
sommargaester <- dplyr::filter(sommargaester, !stringr::str_detect(date, "R"))


# transform the date to a format with a non ambiguous year
sommargaester <- dplyr::mutate(sommargaester, date = as.numeric(date))
sommargaester <- dplyr::mutate(sommargaester, date = ifelse(date > 180000, paste0("19", date),
                                                            ifelse(date < 100000,
                                                                   paste0("200", date),
                                                                   paste0("20", date))))
sommargaester <- dplyr::mutate(sommargaester, pretty_date = lubridate::ymd(date))

knitr::kable(sommargaester[1:10,])
namerepdatepretty_date
Lars-Eric Aaro1201107142011-07-14
Magdi Abdelhadi1200708092007-08-09
Sherihan ‘Cherrie’ Abdulle1201707062017-07-06
Anton Abele1201307252013-07-25
Folke Abenius1198006271980-06-27
Kjell-Albin Abrahamson1198808091988-08-09
Lena Ackebo1199707081997-07-08
Zanyar Adami1200607072006-07-07
Amelia Adamo1198608061986-08-06
Maud Adams1199207031992-07-03

Getting Wikidata about the summer guests

Then I read the list of Wikipedia R packages put together by Mikhail Popov, data scientist at the Wikimedia foundation with whom I exchanged a few messages after my Wikipedia deaths posts. These packages were either created by him or by Oliver Keyes and really give you access to plenty of structured data so I was happy to at last give the list a try!

I used WikidataQueryServiceR to access Wikidata via the query language SPARQL. Luckily I didn’t need to actually learn SPARQL, I used the first example of the documentation, getting Douglas Adams’ data and played with this online query tool to see how I could modify it to 1) obtain data in Swedish because Swedish Wikipedia is probably more complete about Swedish people and 2) obtain data about the summer guests, not Douglas Adams. For that second point I needed the item ID of each summer guests which I queried via another R package, WikidataR. I could have modified the SPARQL even more since I wouldn’t be using picture information but I wasn’t in a very perfectionist mood.

Doing all of this I only scraped the surface of the possibilities offered by the Wikipedia-related R packages and can already tell one could do a bunch of exciting analyses with them!

Note that I used Bob Rudis’ code as an example of how to insert a progress bar which made me feel quite cool. I also inserted a pause of 1 second between calls to the APIs in order to be a good person. My function name however shows how uninspired I was for naming things.

# function for getting someone's data
get_someone_data <- function(name, pb = NULL){
  if (!is.null(pb)) pb$tick()$print()
  Sys.sleep(1)
  item <- WikidataR::find_item(name, language = "sv")
  # sometimes people have no Wikidata entry so I need this condition
  if(length(item) > 0){
    entity_code <- item[[1]]$id
    query <-  paste0("PREFIX entity: 
                     #partial results
                     
                     SELECT ?propUrl ?propLabel ?valUrl ?valLabel ?picture
                     WHERE
                     {
                     hint:Query hint:optimizer 'None' .
                     {	BIND(entity:",entity_code," AS ?valUrl) .
                     BIND(\"N/A\" AS ?propUrl ) .
                     BIND(\"identity\"@sv AS ?propLabel ) .
                     }
                     UNION
                     {	entity:", entity_code," ?propUrl ?valUrl .
                     ?property ?ref ?propUrl .
                     ?property rdf:type wikibase:Property .
                     ?property rdfs:label ?propLabel
                     }
                     
                     ?valUrl rdfs:label ?valLabel
                     FILTER (LANG(?valLabel) = 'sv') .
                     OPTIONAL{ ?valUrl wdt:P18 ?picture .}
                     FILTER (lang(?propLabel) = 'sv' )
                     }
                     ORDER BY ?propUrl ?valUrl
                     LIMIT 200")
    results <- WikidataQueryServiceR::query_wikidata(query)
    results$name<- name
    results
  }else{
    NULL
  }
   
  }

pb <- dplyr::progress_estimated(length(unique(sommargaester$name)))
sommargaester_wiki <- purrr::map_df(unique(sommargaester$name),
                                    get_someone_data, pb=pb)

sommargaester <- dplyr::left_join(sommargaester, sommargaester_wiki, by = "name")

readr::write_csv(sommargaester, path = "data/p1_wiki_data.csv")
knitr::kable(sommargaester[1:10,])
namereppretty_datepropLabelvalLabel
Lars-Eric Aaro12011-07-14arbetsgivareLKAB
Lars-Eric Aaro12011-07-14födelseortVittangi
Lars-Eric Aaro12011-07-14könman
Lars-Eric Aaro12011-07-14nationalitetSverige
Lars-Eric Aaro12011-07-14instans avmänniska
Lars-Eric Aaro12011-07-14alma materLuleå tekniska universitet
Lars-Eric Aaro12011-07-14medlem avKungliga Ingenjörsvetenskapsakademien
Lars-Eric Aaro12011-07-14förnamnLars
Lars-Eric Aaro12011-07-14identityLars-Eric Aaro
Magdi Abdelhadi12007-08-09sysselsättningjournalist

I get one line per guest and propriety value, sometimes some propriety labels have more than one value for the same person, e.g. if the person has several occupations (occupation = sysselsättning).

Translation the occupations

This is the point at which I realized that maybe doing the search in English in the first place wouldn’t have made me get less data? Oh well. In any case for each possible occupation I can get an English translation which is awesome.

occupation <- dplyr::filter(sommargaester, propLabel == "sysselsättning")
occupations <- unique(occupation$valLabel)
translate_occupation <- function(occupation){
  job <- WikidataR::find_item(occupation, language = "sv")[[1]]$label
  if(is.null(job)){
    job <- ""
  }
  return(job)
}
english_occupations <- purrr::map_chr(occupations, translate_occupation)
translations <- data.frame(sysselsaettning = occupations,
                           occupation = english_occupations)
occupation <- dplyr::left_join(occupation, translations, by = c("valLabel" = "sysselsaettning"))

I guess it’s good to know that Wikidata has data in many languages because I can totally see it being useful as a sort of translation service.

Profiling the summer guests

How much data?

withoccupation <- dplyr::filter(sommargaester, !is.na(propLabel))
withoccupation <- dplyr::group_by(withoccupation, name)
withoccupation <- dplyr::summarise(withoccupation, withoccupation = any(propLabel == "sysselsättning"))

There are 2192 unique guests in the dataset. We got Wikidata information for 1587 of them which means 72% of them. We got at least one occupation for 1421 guests which is 65% of them. At this point I have no way to know whether the sample is representative. Maybe the people with more Wikidata output are the most famous ones and some occupations probably help getting more famous than other so I tend to think this makes my sample a bit limited.

How often do guests get invited

As I said earlier some guests were invited several times. I want to look at the distribution of the number of invitations per guest.

invites <- dplyr::select(sommargaester, name, rep)
invites <- dplyr::group_by(invites, name)
invites <- dplyr::summarize(invites, no_invites = max(rep))
invites <- unique(invites)
library("ggplot2")
 theme_set(theme_gray(base_size = 18))
ggplot(invites) +
  geom_histogram(aes(no_invites)) +
  xlab("Number of invitations per P1 summer guest")

plot of chunk unnamed-chunk-5

Some people have been invited a lot! If you remember my initial tables, it had only a few columns for the episode dates but one guest could have several lines.

Who are the people that got invited that many times?

jobs <- dplyr::select(occupation, name, occupation) 
jobs <- unique(jobs)
invites <- dplyr::left_join(invites, occupation, by = "name")
invites <- dplyr::select(invites, name, no_invites, occupation)
invites <- unique(invites)
invites <- dplyr::group_by(invites, name, no_invites)
invites <- dplyr::summarise(invites, occupation = toString(occupation))
invites <- dplyr::ungroup(invites)
invites <- dplyr::arrange(invites, desc(no_invites))
knitr::kable(invites[1:20,])
nameno_invitesoccupation
Torsten Ehrenmark54journalist, translator, writer
Jörgen Cederberg53NA
Gösta Knutsson42cartoonist, journalist, translator, children’s writer, writer
Maud Reuterswärd35writer
Georg Eliasson29screenwriter
Pekka Langer29television presenter, actor
Britt Edwall24writer
Arne Ericsson23musician
Bengt Feldreich23journalist, television presenter
Berndt Friberg23journalist
Carl-Olof Lång23director
Carl-Uno Sjöblom23television presenter
Lars Ulvenstam23journalist
Lars Widding23journalist, writer
Åke Falck17film director, director, television presenter, screenwriter, actor
Barbro Alving17journalist, screenwriter, writer
Bo Setterlind17writer, poet
Bo Strömstedt17NA
Elisabeth Söderström17singer, opera singer
Fredrik Burgman17journalist

Most frequent occupations (by episode)

everything <- dplyr::group_by(occupation, occupation)
everything <- dplyr::summarize(everything, n = n())
everything <- dplyr::filter(everything, n > 50)
library("magrittr")
everything %>%
  dplyr::arrange(n) %>%
  dplyr::mutate(occupation = factor(occupation, ordered = TRUE, levels = unique(occupation))) %>%
ggplot() +
  geom_col(aes(occupation, n)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of episodes by occupation",
subtitle = "for occupations represented more than 50 times")

plot of chunk unnamed-chunk-7

So in terms of episodes I’d say the media and culture fields are quite well represented. That said it might also be due to the fact that a person doing sports for instance will be classified as either ice skater or sprinter, so there’s no chance for sports to appear here. I’ll leave the classification of occupations in categories as an exercise for the reader though.

Most frequent occupations (by unique guest)

everything <- dplyr::select(occupation, name, occupation)
everything <- unique(everything)
everything <- dplyr::group_by(everything, occupation)
everything <- dplyr::summarize(everything, n = n())
everything <- dplyr::filter(everything, n > 20)
library("magrittr")
everything %>%
  dplyr::arrange(n) %>%
  dplyr::mutate(occupation = factor(occupation, ordered = TRUE, levels = unique(occupation))) %>%
ggplot() +
  geom_col(aes(occupation, n)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of guests by occupation",
subtitle = "for occupations represented more than 20 times")

plot of chunk unnamed-chunk-8

With this figure by guest rather than by episode, a bit more diversity has appeared, e.g. “university professor” and “association football player”. Moreover although this was a nice exercise I won’t pretend that someone’s occupation characterizes them fully! The P1 summer guest programm really presents a variety of episodes, and I’d tend to think everyone can find an episode that they appreciate. For instance I remember liking the baker Sébastien Boudet’s episode (yep he’s French) and the one featuring the former high jumper Kajsa Berqvist. Now all this data munging has left me more than ready to look at the list of episodes again to choose the next ones I’ll listen to!

To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)