R Questions Tag Pairs on Stack Overflow

[This article was first published on Once Upon Data, and kindly contributed to R-bloggers].

A few months ago, I came across the R Questions from Stack Overflow dataset published on Kaggle. I was particularly interested in tag pairs, i.e. which tags appear together in R questions, so I worked on this simple kernel.

This week, I had some time, so I deployed a simple Shiny app to let more people explore the tag pairs. Here is the app, where you can see the most frequent tags that appear with a given tag. Below is the full code I used to process and aggregate the data.

R Tag Pairs Shiny App

Data Aggregation

I selected the questions with more than one tag (in addition to R) and did the following:

  • Step 1: get all the tags corresponding to each question ID
  • Step 2: find all pair combinations from these tags
  • Step 3: combine all pairs from all the questions in one dataframe

Step 1: get all the tags corresponding to each question ID

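library(dplyr)
library(tidyr)
library(purrr)

# group by question ID, keep questions with more than one tag, and nest the tags
# (tidyr 1.0 and later deprecate `.key`; there, nest(Tags = Tag) does the same)
datn <- dat %>% 
        group_by(Id) %>% 
        filter(n()>1) %>% 
        nest(.key="Tags")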

Now, if we look at a particular question ID, we find all of its tags in one list. For example, for question #79709:

[[1]]
# A tibble: 4 × 1
               Tag
             <chr>
1           memory
2         function
3 global-variables
4     side-effects
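
To make the grouping step concrete, here is a minimal base-R sketch of the same idea on a tiny made-up table. The data and the names `toy`, `toy_multi` and `tags_by_id` are assumptions for illustration only; the real `dat` comes from the Kaggle dataset and is processed with dplyr/tidyr as above.

```r
# Toy input: one row per (question Id, tag); the values are illustrative only
toy <- data.frame(
  Id  = c(1, 1, 1, 2, 3, 3),
  Tag = c("ggplot2", "plot", "dplyr", "regex", "dplyr", "ggplot2"),
  stringsAsFactors = FALSE
)

# Keep only questions with more than one tag (same idea as filter(n() > 1))
tag_counts <- table(toy$Id)
multi_ids  <- names(tag_counts)[tag_counts > 1]
toy_multi  <- toy[as.character(toy$Id) %in% multi_ids, ]

# One character vector of tags per question
# (a rough analogue of the nested Tags column)
tags_by_id <- split(toy_multi$Tag, toy_multi$Id)

print(tags_by_id)
```

Question 2 has a single tag, so it is dropped before the tags are collected per question.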

Step 2: find all pair combinations from these tags

Now, we will get all the possible pairs from the questions’ tags:

# map each Tag list to combn() to get all 2-tag combinations,
# then transpose so that each row holds one pair
datn <- datn %>% 
        mutate(pairs=map(Tags, ~combn(.x[["Tag"]], 2) %>% 
                                 t() %>% 
                                 as.data.frame(stringsAsFactors = FALSE)))

For the same question we checked in the previous step, we can see that the pairs are as follows:

[[1]]
                V1               V2
1           memory         function
2           memory global-variables
3           memory     side-effects
4         function global-variables
5         function     side-effects
6 global-variables     side-effects
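
The transpose in the code above is needed because `combn()` returns one pair per column rather than per row. The following standalone sketch reproduces the table for this question in base R; the variable names `tags`, `pairs_matrix` and `pairs_df` are only illustrative.

```r
# The four tags of the example question shown above
tags <- c("memory", "function", "global-variables", "side-effects")

# combn() returns a 2 x choose(n, 2) matrix: one column per pair
pairs_matrix <- combn(tags, 2)
dim(pairs_matrix)

# Transpose so each row is a pair, then convert to a data frame;
# an unnamed matrix gets the default column names V1 and V2
pairs_df <- as.data.frame(t(pairs_matrix), stringsAsFactors = FALSE)

nrow(pairs_df)  # choose(4, 2) = 6 pairs
```

A question with n tags therefore contributes n(n-1)/2 rows, which is why heavily tagged questions add many more pairs than lightly tagged ones.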

Step 3: combine all pairs from all the questions in one dataframe

Now we will combine all the pairs from all questions in one dataframe and count the frequency of each pair:

# combine all pairs in one dataframe
dat_pairs <- plyr::rbind.fill(datn$pairs)

# put pairs in the same order
dat_pairs <- dat_pairs %>% 
        mutate(firstV=map2_chr(V1,V2,function(x,y) sort(c(x,y))[1]),
               secondV=map2_chr(V1,V2,function(x,y) sort(c(x,y))[2])) %>% 
        select(-V1,-V2)

# count the frequency of each pair
pair_freq <- dat_pairs %>% 
        group_by(firstV,secondV) %>% 
        summarise(pair_count=n()) %>% 
        arrange(desc(pair_count)) %>% 
        ungroup()
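
The reordering matters because the same two tags can appear in either order across questions, and without a canonical order they would be counted as two different pairs. Here is a small base-R sketch on made-up pairs; it uses `pmin()`/`pmax()` and `aggregate()` as stand-ins for the `map2_chr()` and `summarise()` calls above, and the names `toy_pairs`, `toy_sorted` and `toy_freq` are assumptions for illustration.

```r
# Toy pairs; ("dplyr", "ggplot2") and ("ggplot2", "plot") appear in both orders
toy_pairs <- data.frame(
  V1 = c("ggplot2", "dplyr", "ggplot2", "plot", "dplyr"),
  V2 = c("dplyr", "ggplot2", "plot", "ggplot2", "ggplot2"),
  stringsAsFactors = FALSE
)

# Put each pair in a canonical order so that A-B and B-A are counted together;
# pmin()/pmax() compare the two character columns element by element
toy_sorted <- data.frame(
  firstV  = pmin(toy_pairs$V1, toy_pairs$V2),
  secondV = pmax(toy_pairs$V1, toy_pairs$V2),
  stringsAsFactors = FALSE
)

# Count each distinct pair, then sort by frequency, most frequent first
toy_freq <- aggregate(pair_count ~ firstV + secondV,
                      data = cbind(toy_sorted, pair_count = 1),
                      FUN = sum)
toy_freq <- toy_freq[order(-toy_freq$pair_count), ]

print(toy_freq)
```

Without the reordering step, these five rows would be counted as four distinct pairs instead of two.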

Here we can see the top 40 pairs:

library(DT)
datatable(head(pair_freq,40), options = list(pageLength = 5))

Tag-Pairs for a Certain Tag

Here we can pick one tag and see all the other tags that appear with it and the frequency of each.

# Get all pairs with a certain tag
GetTagPairs <- function(df, tag) {
        df %>% 
                filter(firstV==tag|secondV==tag) %>% 
                arrange(desc(pair_count)) %>% 
                mutate(T2 = ifelse(secondV==tag, firstV, secondV)) %>% 
                select(T2, pair_count)
}
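
To see what the function returns without loading the full dataset, here is a base-R analogue applied to a made-up frequency table. The helper name `get_tag_pairs_base` and the toy counts are hypothetical; the logic mirrors `GetTagPairs()` above, which uses dplyr.

```r
# Toy pair counts with the same columns as pair_freq (values are made up)
toy_freq <- data.frame(
  firstV     = c("dplyr", "ggplot2", "dplyr"),
  secondV    = c("ggplot2", "plot", "tidyr"),
  pair_count = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical base-R analogue of GetTagPairs(): keep the rows that contain
# `tag`, then report the *other* tag of each pair together with its count
get_tag_pairs_base <- function(df, tag) {
  hit <- df[df$firstV == tag | df$secondV == tag, ]
  hit <- hit[order(-hit$pair_count), ]
  data.frame(T2 = ifelse(hit$secondV == tag, hit$firstV, hit$secondV),
             pair_count = hit$pair_count,
             stringsAsFactors = FALSE)
}

res <- get_tag_pairs_base(toy_freq, "ggplot2")
print(res)
```

Because each pair is stored in sorted order, the requested tag can sit in either column, which is why both `firstV` and `secondV` are checked and the `ifelse()` picks whichever column holds the partner tag.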

Example: ggplot2 Pairs

If we take ggplot2 for example, we can see that the most frequent tags that appeared with it are the following:

ex <- GetTagPairs(pair_freq, "ggplot2")

datatable(head(ex,40), options = list(pageLength = 5))

You can see the whole list for any tag in the Shiny app.

In conclusion

Tag pairs give us an idea of the areas of interest, the relations between topics and packages, and the frequently used packages in the R community. We could also draw a full network to visualize more complex relations. However, these are the tags in questions posted up to 19 October 2016. Things certainly change, and more tags join the list over time. I personally expect the tidyverse and its packages to be mentioned more frequently in 2017. An updated dataset would help test this hypothesis!
