A Tidy Text Analysis of My Google Search History
While brainstorming about cool ways to practice text mining with R, I
came up with the idea of exploring my own Google search history. Then,
after googling (ironically) whether anyone had done something like this, I
stumbled upon Lisa Charlotte’s blog post.
Lisa’s post (actually, a series of posts) is from a while back, so her
instructions for how to download your personal Google history and the
format of the downloads (nowadays, it’s a .html file instead of a
series of .json files) are no longer applicable.
I googled a bit more and found a recent RPubs write-up by Stephanie
Lancz that not only included concise instructions on how/where to get
personal Google data, but also showed how to clean it with R! With the
hard work of figuring out how to set up the data already done for me, I
was excited to find out what I could do.
In this write-up (which can be downloaded from GitHub and re-used for
one’s own analysis), I explore different techniques for visualizing and
understanding my data. I do my best to implement methods that are
generic and could be applied to any kind of similar analysis,
regardless of the topic. Much of my code is guided by the work of
others, especially that of David Robinson and Julia Silge, who have
written an amazingly helpful book on text analysis, Tidy Text Mining
with R. I provide references for my inspiration where appropriate.
Setup
First, following “best practices”, I import all of the packages that
I’ll be using.
library("dplyr") library("stringr") library("xml2") library("rvest") library("lubridate") library("viridis") library("ggplot2") library("tidytext") library("tidyr") library("ggalt") library("widyr") library("drlib") library("igraph") library("ggraph") # library("topicmodels") # devtools::install_github("tonyelhabr/temisc") library("teplot") # Personal package.
Next, I create a config list in order to emulate what one might do
with a parameterized RMarkdown report (where the config would be part
of the yaml header).
config <-
  list(
    path = file.path("data-raw", "Tony-My Activity-Search-MyActivity.html"),
    name_main = "Tony",
    color_main = "firebrick"
  )
I’ll also go ahead and create a couple of functions for coloring some of
the plots that follow. These can be customized to one’s personal
preferences.
scale_color_func <- function() {
  viridis::scale_color_viridis(
    option = "D",
    discrete = TRUE,
    begin = 0,
    end = 0.75
  )
}

scale_fill_func <- function() {
  viridis::scale_fill_viridis(
    option = "D",
    discrete = TRUE,
    begin = 0,
    end = 0.75
  )
}
Import and Clean
Then, on to the “dirty” work of importing and cleaning the data. I don’t
deviate much from Stephanie Lancz’s methods for extracting data elements
from the .html file.
# Reference:
# + https://rstudio-pubs-static.s3.amazonaws.com/355045_90b7464be9b4437393670340ad67c310.html#
doc_html <- config$path
search_archive <- xml2::read_html(doc_html)

# Extract search time.
date_search <-
  search_archive %>%
  html_nodes(xpath = '//div[@class="mdl-grid"]/div/div') %>%
  str_extract(pattern = "(?<=<br>)(.*)(?<=PM|AM)") %>%
  mdy_hms()

# Extract search text.
text_search <-
  search_archive %>%
  html_nodes(xpath = '//div[@class="mdl-grid"]/div/div') %>%
  str_extract(pattern = '(?<=<a)(.*)(?=</a>)') %>%
  str_extract(pattern = '(?<=\">)(.*)')

# Extract search type.
type_search <-
  search_archive %>%
  html_nodes(xpath = '//div[@class="mdl-grid"]/div/div') %>%
  str_extract(pattern = "(?<=mdl-typography--body-1\">)(.*)(?=<)") %>%
  str_extract(pattern = "(\\w+)(?=\\s)")

# Differences from reference:
# + Using `lubridate::wday()` instead of calling `weekdays()` and coercing to factors.
# + Using `yyyy`, `mm`, `wd`, and `hh` instead of `year`, `month`, `wday`, and `hour`.
# + Convert `yyyy` to an integer (from a double).
# + Adding a `time` column to use for a later visualization.
# + Adding a `name` column to make this code more "parametric".
data <-
  tibble(
    name = config$name_main,
    timestamp = date_search,
    date = lubridate::as_date(date_search),
    yyyy = lubridate::year(date_search) %>% as.integer(),
    mm = lubridate::month(date_search, label = TRUE),
    wd = lubridate::wday(date_search, label = TRUE),
    hh = lubridate::hour(date_search),
    time = lubridate::hour(timestamp) + (lubridate::minute(timestamp) / 60),
    type = type_search,
    text = text_search
  )
Notably, there are some rows that did not get parsed correctly. I decide
to exclude them from the rest of the analysis. Also, my first searches
come at the end of 2010, and my most recent ones (i.e. the ones just
before I downloaded my data) come in the first month or so of 2018. To
make the aspects of my analysis that deal with years a bit “cleaner”,
I’ll truncate these ends so that my data spans the years 2011 through
2017.
data %>% count(yyyy, sort = TRUE)

data <-
  data %>%
  filter(!is.na(yyyy))

data <-
  data %>%
  filter(!(yyyy %in% c(2010, 2018)))
Analysis
Search Count Distributions
Next, it’s time to start doing some basic exploratory data analysis
(EDA). Given the temporal nature of the data, an easy EDA approach to
implement is visualization across different time periods. To save some
effort (or, as I like to see it, to make my code more efficient), we can
create a helper function. (Notably, the geom to use is a parameter to
this function. Through experimentation, I found that
ggplot2::geom_bar() seems to work best with most temporal periods,
with the exception of plotting Date variables, where
ggplot2::geom_histogram() seems more appropriate.)
# Reference:
# + https://juliasilge.com/blog/ten-thousand-data-ext/.
visualize_time <- function(data,
                           colname_x,
                           geom = c("bar", "hist"),
                           color = "grey50",
                           lab_subtitle = NULL) {
  geom <- match.arg(geom)

  viz_labs <-
    labs(
      x = NULL,
      y = NULL,
      title = "Count Of Searches",
      subtitle = lab_subtitle
    )
  viz_theme <-
    teplot::theme_te() +
    theme(panel.grid.major.x = element_blank()) +
    theme(legend.position = "none")

  viz <- ggplot(data, aes_string(x = colname_x))
  if (geom == "bar") {
    viz <-
      viz +
      geom_bar(aes(y = ..count.., alpha = ..count..), fill = color) +
      scale_alpha(range = c(0.5, 1))
  } else if (geom == "hist") {
    viz <-
      viz +
      geom_histogram(aes(y = ..count..), fill = color, bins = 30)
  }

  viz <- viz + viz_labs + viz_theme
  viz
}
Using this function is fairly straightforward. For example, to visualize
the count of searches by year, it can be invoked in the following
manner.
viz_time_yyyy <-
  visualize_time(
    data = data,
    colname_x = "yyyy",
    geom = "bar",
    color = config$color_main,
    lab_subtitle = "By Year"
  )
The same pattern can be repeated for timestamp, yyyy, mm, wd, and hh.
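For reference, a minimal sketch of what those repeated calls might look like is below. (The geom choices, object names, and subtitle labels here are my own picks; the original calls aren’t shown in this write-up.)

# A sketch of the analogous calls for the other time-related columns.
# Object names, geom choices, and subtitle labels are illustrative.
viz_time_timestamp <-
  visualize_time(
    data = data,
    colname_x = "timestamp",
    geom = "hist",
    color = config$color_main,
    lab_subtitle = "Over Time"
  )
viz_time_mm <-
  visualize_time(
    data = data,
    colname_x = "mm",
    geom = "bar",
    color = config$color_main,
    lab_subtitle = "By Month"
  )
viz_time_wd <-
  visualize_time(
    data = data,
    colname_x = "wd",
    geom = "bar",
    color = config$color_main,
    lab_subtitle = "By Day of Week"
  )
viz_time_hh <-
  visualize_time(
    data = data,
    colname_x = "hh",
    geom = "bar",
    color = config$color_main,
    lab_subtitle = "By Hour"
  )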
I can make a couple of interesting observations about my data.

- It’s evident that I’ve googled stuff more and more frequently over the years.
- It seems like my most active months correspond with typical American high school/college breaks: winter break occurs during December/January, spring break occurs in March, and the end of summer break occurs in August.
- My relatively high activity on Saturdays and Sundays (compared to the rest of the days of the week) indicates that I like to spend my “breaks” from weekly school/work on the Internet.
- Regarding my hour-to-hour activity, mine seems relatively even throughout the day. I think if you compared my by-hour activity to others’, mine would stand out as abnormally consistent.
Word Frequencies
Now we’ll “tokenize” the search text into n-grams. We’ll parse each
search query into unigrams and bigrams.
# Reference (for regular expression):
# + https://rstudio-pubs-static.s3.amazonaws.com/355045_90b7464be9b4437393670340ad67c310.html#
rgx_patt <- '(http|https)\\S+\\s*|(#|@)\\S+\\s*|\\n|\\"|(.*.)\\.com(.*.)\\S+\\s|[^[:alnum:]]'
rgx_repl <- " "
rgx_custom_ignore <- "google|search"

# References:
# + https://www.tidytextmining.com/
# + https://www.tidytextmining.com/ngrams.html
# + https://www.tidytextmining.com/twitter.html
stop_words <- tidytext::stop_words

unigrams <-
  data %>%
  mutate(text = str_replace_all(text, rgx_patt, rgx_repl)) %>%
  tidytext::unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, rgx_custom_ignore)) %>%
  filter(str_detect(word, "[a-z]"))

unigrams %>%
  select(word) %>%
  count(word, sort = TRUE)
## # A tibble: 15,299 x 2
##    word          n
##    <chr>     <int>
##  1 nutrition  1677
##  2 excel      1224
##  3 ut          979
##  4 austin      811
##  5 vba         683
##  6 python      551
##  7 chicken     486
##  8 sql         453
##  9 nba         404
## 10 oracle      389
## # ... with 1.529e+04 more rows
# References:
# + https://www.tidytextmining.com/
# + https://www.tidytextmining.com/ngrams.html
bigrams <-
  data %>%
  mutate(text = str_replace_all(text, rgx_patt, rgx_repl)) %>%
  tidytext::unnest_tokens(word, text, token = "ngrams", n = 2) %>%
  tidyr::separate(word, into = c("word1", "word2"), sep = " ", remove = FALSE) %>%
  anti_join(stop_words, by = c("word1" = "word")) %>%
  anti_join(stop_words, by = c("word2" = "word")) %>%
  filter(!str_detect(word1, rgx_custom_ignore)) %>%
  filter(!str_detect(word2, rgx_custom_ignore)) %>%
  filter(str_detect(word1, "[a-z]")) %>%
  filter(str_detect(word2, "[a-z]"))

bigrams %>%
  select(word) %>%
  count(word, sort = TRUE)
## # A tibble: 33,404 x 2
##    word               n
##    <chr>          <int>
##  1 excel vba        598
##  2 ut austin        474
##  3 pl sql           167
##  4 san antonio      126
##  5 baton rouge      113
##  6 peanut butter    102
##  7 round rock       100
##  8 sweet potato      95
##  9 oracle sql        94
## 10 chicken breast    88
## # ... with 3.339e+04 more rows
With the data parsed into tokens, we can visualize counts of individual
n-grams.
# Reference:
# + https://github.com/dgrtwo/dgrtwo.github.com/blob/master/_R/2016-08-09-trump-data.Rmd.
visualize_cnts <- function(data, color = "grey50", num_top = 20) {
  data %>%
    count(word, sort = TRUE) %>%
    filter(row_number(desc(n)) <= num_top) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(x = word, y = n)) +
    ggalt::geom_lollipop(size = 2, point.size = 4, color = color) +
    coord_flip() +
    teplot::theme_te() +
    labs(x = NULL, y = NULL) +
    labs(title = "Most Common Words") +
    theme(legend.position = "none") +
    theme(panel.grid.major.y = element_blank())
}

num_top_cnt <- 15
viz_unigram_cnts <-
  visualize_cnts(
    data = unigrams,
    color = config$color_main,
    num_top = num_top_cnt
  )
viz_unigram_cnts
viz_bigram_cnts <-
  visualize_cnts(
    data = bigrams,
    color = config$color_main,
    num_top = num_top_cnt
  )
viz_bigram_cnts
These count totals reflect my personal interests relatively well. In
particular, words like nutrition and chicken breast, excel vba and
python, and nba and nfl scores highlight my interest in
food/nutrition, software and data analysis, and sports. Additionally,
the places I’ve lived are apparent from my searches: ut austin reflects
my undergraduate studies at the University of Texas at Austin,
baton rouge alludes to my internship with ExxonMobil in Baton Rouge in
the summer of 2015, and round rock hints at my current residence in
Round Rock, Texas.
Word Clouds
Another method of visualizing counts is with a word cloud. Normally, I’m
staunchly opposed to word clouds; however, when used to initialize a
mental model of the data, they’re not so bad. I write a basic function
so I can use it twice.
visualize_cnts_wordcloud <- function(data, color, num_top = 25) {
  data_proc <- data %>% count(word, sort = TRUE)
  wordcloud::wordcloud(
    words = data_proc$word,
    freq = data_proc$n,
    random.order = FALSE,
    colors = color,
    max.words = num_top
  )
}

get_rpal_byname <- function(name) {
  paste0(name, c("", as.character(seq(1, 4, 1))))
}

colors_wordcloud <- get_rpal_byname(config$color_main)
num_top_cnt_wordcloud <- 25
viz_unigram_cnts_wordcloud <-
  visualize_cnts_wordcloud(
    data = unigrams,
    color = colors_wordcloud,
    num_top = num_top_cnt_wordcloud
  )
## NULL
viz_bigram_cnts_wordcloud <-
  visualize_cnts_wordcloud(
    data = bigrams,
    color = colors_wordcloud,
    num_top = num_top_cnt_wordcloud
  )
## NULL
These word clouds essentially show the same information as the other
frequency plots, so it’s not surprising to see the same set of words
shown. The word clouds arguably do a better job of emphasizing the words
themselves (as opposed to the raw count totals associated with each
word). Next, I compute overall frequencies for the unigrams and bigrams.
compute_freqs <- function(data, colname_cnt = "word") {
  colname_cnt_quo <- rlang::sym(colname_cnt)
  data %>%
    group_by(!!colname_cnt_quo) %>%
    mutate(n = n()) %>%
    # ungroup() %>%
    # group_by(!!colname_cnt_quo) %>%
    summarize(freq = sum(n) / n()) %>%
    ungroup() %>%
    arrange(desc(freq))
}

unigram_freqs <-
  compute_freqs(
    data = unigrams,
    colname_cnt = "word"
  )
unigram_freqs
## # A tibble: 15,299 x 2
##    word       freq
##    <chr>     <dbl>
##  1 nutrition  1677
##  2 excel      1224
##  3 ut          979
##  4 austin      811
##  5 vba         683
##  6 python      551
##  7 chicken     486
##  8 sql         453
##  9 nba         404
## 10 oracle      389
## # ... with 1.529e+04 more rows
bigram_freqs <-
  compute_freqs(
    data = bigrams,
    colname_cnt = "word"
  )
bigram_freqs
## # A tibble: 33,404 x 2
##    word            freq
##    <chr>          <dbl>
##  1 excel vba        598
##  2 ut austin        474
##  3 pl sql           167
##  4 san antonio      126
##  5 baton rouge      113
##  6 peanut butter    102
##  7 round rock       100
##  8 sweet potato      95
##  9 oracle sql        94
## 10 chicken breast    88
## # ... with 3.339e+04 more rows
Word Correlations
Let’s add a layer of complexity to our analysis. We can look at
correlations among individual words in each search. We’ll create a
fairly robust function here because we’ll need to perform the same
actions twice: once to view the computed values, and again to put
the data in the proper format for a network visualization. (We need both
the counts and the correlations of each word pair to create node-edge
pairs.)
# Reference:
# + https://www.tidytextmining.com/ngrams.html
# + http://varianceexplained.org/r/seven-fav-packages/.
compute_corrs <- function(data = NULL,
                          colname_word = NULL,
                          colname_feature = NULL,
                          num_top_ngrams = 50,
                          num_top_corrs = 50,
                          return_corrs = TRUE,
                          return_words = FALSE,
                          return_both = FALSE) {
  colname_word_quo <- rlang::sym(colname_word)
  colname_feature_quo <- rlang::sym(colname_feature)

  data_cnt <-
    data %>%
    count(!!colname_word_quo, sort = TRUE)

  data_cnt_top <-
    data_cnt %>%
    mutate(rank = row_number(desc(n))) %>%
    filter(rank <= num_top_ngrams)

  data_joined <-
    data %>%
    semi_join(data_cnt_top, by = colname_word) %>%
    rename(
      word = !!colname_word_quo,
      feature = !!colname_feature_quo
    )

  data_corrs <-
    widyr::pairwise_cor(
      data_joined,
      word,
      feature,
      sort = TRUE,
      upper = FALSE
    )

  data_corrs_top <-
    data_corrs %>%
    mutate(rank = row_number(desc(correlation))) %>%
    filter(rank <= num_top_corrs)

  if (return_both | (return_words & return_corrs)) {
    out <- list(words = data_cnt_top, corrs = data_corrs_top)
  } else if (return_corrs) {
    out <- data_corrs_top
  } else if (return_words) {
    out <- data_cnt_top
  }
  out
}

num_top_ngrams <- 50
num_top_corrs <- 50

unigram_corrs <-
  compute_corrs(
    unigrams,
    num_top_ngrams = num_top_ngrams,
    num_top_corrs = num_top_corrs,
    colname_word = "word",
    colname_feature = "timestamp"
  )
unigram_corrs
Not surprisingly, many of the same word pairs seen among the most
frequently used bigrams also appear here.
# Reference:
# + http://varianceexplained.org/r/seven-fav-packages/.
unigram_corrs_list <-
  compute_corrs(
    unigrams,
    num_top_ngrams = num_top_ngrams,
    num_top_corrs = num_top_corrs,
    colname_word = "word",
    colname_feature = "timestamp",
    return_both = TRUE
  )

seed <- 42
set.seed(seed)
viz_corrs_network <-
  igraph::graph_from_data_frame(
    d = unigram_corrs_list$corrs,
    vertices = unigram_corrs_list$words,
    directed = TRUE
  ) %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(edge_width = 1) +
  ggraph::geom_node_point(aes(size = n), fill = "grey50", shape = 21) +
  ggraph::geom_node_text(ggplot2::aes_string(label = "name"), repel = TRUE) +
  teplot::theme_te() +
  theme(
    line = element_blank(),
    rect = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  ) +
  theme(legend.position = "none") +
  labs(x = NULL, y = NULL) +
  labs(title = "Network of Pairwise Correlations", subtitle = "By Search")
viz_corrs_network
It looks like the network captures most of the similar terms fairly
well. The words for food/nutrition, software, and locations are grouped.
Word Changes Over Time
We might be interested to find out whether certain words have either
been used more or less as time has passed. Determining the highest
changes in word usage is not quite as straightforward as some of the
other components of the text analysis so far. There are various valid
approaches that could be implemented. Here, we’ll follow the approach
shown in the Twitter chapter in the Tidy Text Mining
book.
First, we’ll group my word usage by year and look only at the most used
words.
# Reference:
# + https://www.tidytextmining.com/twitter.html#changes-in-word-use.
timefloor <- "year"
top_pct_words <- 0.05

unigram_bytime <-
  unigrams %>%
  mutate(time_floor = floor_date(timestamp, unit = timefloor)) %>%
  group_by(time_floor, word) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  group_by(time_floor) %>%
  mutate(time_total = sum(n)) %>%
  ungroup() %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  filter(word_total >= quantile(word_total, 1 - top_pct_words)) %>%
  arrange(desc(word_total))
unigram_bytime
## # A tibble: 1,377 x 5
##    time_floor          word          n time_total word_total
##    <dttm>              <chr>     <int>      <int>      <int>
##  1 2012-01-01 00:00:00 nutrition    63       5683       1677
##  2 2013-01-01 00:00:00 nutrition   263      13380       1677
##  3 2014-01-01 00:00:00 nutrition   647      18824       1677
##  4 2015-01-01 00:00:00 nutrition   388      17202       1677
##  5 2016-01-01 00:00:00 nutrition   215      28586       1677
##  6 2017-01-01 00:00:00 nutrition   101      31681       1677
##  7 2013-01-01 00:00:00 excel         2      13380       1224
##  8 2014-01-01 00:00:00 excel       111      18824       1224
##  9 2015-01-01 00:00:00 excel        80      17202       1224
## 10 2016-01-01 00:00:00 excel       307      28586       1224
## # ... with 1,367 more rows
Next, we’ll create logistic models for each word-year pair. These models
attempt essentially answer the question “How likely is it that a given
word appears in a given year?”
unigram_bytime_models <-
  unigram_bytime %>%
  tidyr::nest(-word) %>%
  mutate(
    models =
      purrr::map(data, ~ glm(cbind(n, time_total) ~ time_floor, ., family = "binomial"))
  )

unigram_bytime_models_slopes <-
  unigram_bytime_models %>%
  tidyr::unnest(purrr::map(models, broom::tidy)) %>%
  filter(term == "time_floor") %>%
  mutate(adjusted_p_value = p.adjust(p.value))
unigram_bytime_models_slopes
## # A tibble: 255 x 7
##    word   term    estimate std.error statistic    p.value adjusted_p_value
##    <chr>  <chr>      <dbl>     <dbl>     <dbl>      <dbl>            <dbl>
##  1 nutri~ time_~ -1.033e-8 4.891e-10    -21.12 5.093e- 99       1.294e- 96
##  2 excel  time_~  2.060e-8 9.561e-10     21.54 6.326e-103       1.613e-100
##  3 ut     time_~ -9.729e-9 6.339e-10    -15.35 3.748e- 53       9.370e- 51
##  4 austin time_~ -4.307e-9 7.014e-10    -6.141 8.197e- 10       1.476e- 7
##  5 vba    time_~  3.471e-8 1.892e- 9     18.35 3.415e- 75       8.641e- 73
##  6 python time_~ -7.032e-8 4.364e- 9    -16.11 2.092e- 58       5.250e- 56
##  7 chick~ time_~ -1.010e-8 8.992e-10    -11.24 2.708e- 29       6.634e- 27
##  8 sql    time_~  3.082e-8 2.148e- 9     14.35 1.098e- 46       2.735e- 44
##  9 nba    time_~  1.522e-8 1.392e- 9     10.93 8.010e- 28       1.947e- 25
## 10 oracle time_~  2.175e-8 2.042e- 9     10.65 1.727e- 26       4.129e- 24
## # ... with 245 more rows
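To make the model specification a bit more concrete, here is a sketch of fitting the same kind of model for a single word. (The word nutrition and the object names are just arbitrary choices for illustration; the nested/mapped version above is what the analysis actually uses.)

# Illustrative only: the same binomial model for one word.
# `cbind(n, time_total)` pairs the word's yearly count with the year's total
# word count, and `time_floor` is the yearly time index.
unigram_bytime_nutrition <- unigram_bytime %>% filter(word == "nutrition")
model_nutrition <-
  glm(cbind(n, time_total) ~ time_floor, data = unigram_bytime_nutrition, family = "binomial")
broom::tidy(model_nutrition)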
The p-values of the logistic models indicate whether or not the change
in usage of a given word over time is non-trivial. We’ll look at the
words with the smallest p-values, indicating that they have the most
significant changes in usage.
num_top_change <- 5
unigram_bytime_models_slopes_top <-
  unigram_bytime_models_slopes %>%
  top_n(num_top_change, -adjusted_p_value)

viz_unigram_change_bytime <-
  unigram_bytime %>%
  inner_join(unigram_bytime_models_slopes_top, by = c("word")) %>%
  mutate(pct = n / time_total) %>%
  mutate(label = if_else(time_floor == max(time_floor), word, NA_character_)) %>%
  ggplot(aes(x = time_floor, y = pct, color = word)) +
  geom_line(size = 1.5) +
  ggrepel::geom_label_repel(aes(label = label), nudge_x = 1, na.rm = TRUE) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_func() +
  teplot::theme_te() +
  theme(legend.position = "none") +
  labs(x = NULL, y = NULL) +
  labs(title = "Largest Changes in Word Frequency")
viz_unigram_change_bytime
This visualization does a great job of capturing how my interests/skills
have changed over time.

- The recent rise of excel and vba reflects how much I have had to develop my Excel VBA skills at work (after starting my job in mid-2016). (I’ve actually had to enhance other software-related skills, such as those with SQL, R, and mapping software such as AutoCAD, but Excel’s VBA has a lot of little nuances that make it not so intuitive (in my opinion) and put me more in need of Internet help than anything else.)
- The steep decline in python from 2016 to 2017 illustrates how I learned python early in 2016 as part of a “side project”, but then stopped learning it in favor of other languages/technologies that I need/use for my job.
- My interest in nutrition has waxed and waned over time. I think this is probably because I learned a lot when reading about it for a couple of years, but have now found myself less in need of researching it because I already know a good deal about it.
- The appearance of ib might be confusing to the reader. “IB” stands for International Baccalaureate, a high school program that is similar to the Advanced Placement (AP) program that United States high school students are probably more familiar with. After participating in the IB program in high school, it is evident that my interest in it dropped off.
Unique Words
Term-Frequency Inverse Document Frequency (TF-IDF)
Another good way of evaluating my search behavior is to look at
term-frequency inverse-document-frequency (TF-IDF). I’ll leave the
details to the Tidy Text Mining book, but, in a nutshell, TF-IDF
provides a good measure of the most “unique” words in a given document
compared to other documents. For this analysis, we’ll treat the years of
search history as documents.
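As a quick aside, a minimal sketch of what the TF-IDF calculation boils down to is below. (This is my own illustrative reimplementation with a made-up function name, not the tidytext internals; tidytext::bind_tf_idf(), used next, handles this for us.)

# A rough, illustrative version of the TF-IDF calculation, treating each year
# as a "document".
compute_tfidf_manually <- function(data) {
  n_docs <- dplyr::n_distinct(data$yyyy)
  data %>%
    count(yyyy, word) %>%
    group_by(yyyy) %>%
    mutate(tf = n / sum(n)) %>% # term frequency within the year
    ungroup() %>%
    group_by(word) %>%
    mutate(idf = log(n_docs / n_distinct(yyyy))) %>% # inverse document frequency
    ungroup() %>%
    mutate(tf_idf = tf * idf)
}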
# References:
# + https://www.tidytextmining.com/tfidf.html
# + https://juliasilge.com/blog/sherlock-holmes-stm/
data_tfidf <-
  unigrams %>%
  count(yyyy, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, yyyy, n)

num_top_tfidf <- 10
viz_tfidf <-
  data_tfidf %>%
  group_by(yyyy) %>%
  # arrange(yyyy, desc(tf_idf)) %>%
  # slice(1:num_top_tfidf) %>%
  top_n(num_top_tfidf, tf_idf) %>%
  ungroup() %>%
  mutate(yyyy = factor(yyyy)) %>%
  mutate(word = drlib::reorder_within(word, tf_idf, yyyy)) %>%
  ggplot(aes(word, tf_idf, fill = yyyy)) +
  geom_col() +
  scale_fill_func() +
  facet_wrap(~ yyyy, scales = "free") +
  drlib::scale_x_reordered() +
  coord_flip() +
  teplot::theme_te_dx() +
  theme(legend.position = "none") +
  theme(axis.text.x = element_blank()) +
  labs(x = NULL, y = NULL) +
  labs(title = "Highest TF-IDF Words", subtitle = "By Year")
viz_tfidf
This TF-IDF plot is probably my favorite one out of all of them. It
really highlights how my use of the Internet and my personal interests
have changed over time (perhaps even better than the previous plot).

- Up through graduating from high school in the middle of 2012, my search terms don’t appear correlated with anything in particular. This makes sense to me: at this point in my life, I mostly used Google for doing research for school projects.
- From the middle of 2012 to the middle of 2016, I was in college studying to get a degree in Electrical Engineering. Words such as integral, ut, and neutron reflect my education. At the same time, my interest in health was strongest during these years, so the food-related words are not surprising.
- My more recent focus on software-related skills is evident in 2016 and beyond. In particular, my growing passion for R in 2017 is illustrated by words such as {ggplot}, {shiny}, {dplyr}, and {knitr}. This is one aspect of my personal skill development that was not as evident in the previous plot.
Conclusion
I’ll continue this analysis in a separate write-up, where I plan to
investigate how “topic modeling” can be applied to gain further insight.
Topic modeling is much more dependent on the nature of the data than the
analysis done here (which is fairly “generalizable”), so I think it
deserves distinct treatment.
Thanks again to David Robinson and Julia Silge for their great Tidy
Text Mining with R book! It
demonstrates simple, yet powerful, techniques that can be easily
leveraged to gain meaningful insight into nearly anything you can
imagine.