Preprocessing and analyzing web tracking data with webtrackR

This article was first published on schochastics - all things R, and kindly contributed to R-bloggers.

This post was semi-automatically converted from blogdown to Quarto and may contain errors. The original can be found in the archive.

Researchers have relied on free and easy access to social media APIs for a very long time. Recently, however, many prominent platforms revoked free access to their APIs and made the data almost unaffordable for regular researchers. There is therefore a great need for alternative data sources to study the online behaviour of individuals. One such alternative is webtracking studies, which collect the web browsing history of participants. This type of data is far richer than social media data but can also be far more heterogeneous and complex. Enter webtrackR, an R package to preprocess and analyze webtracking data.

Installation

You can install the development version of webtrackR from GitHub with:

# install.packages("remotes")
remotes::install_github("schochastics/webtrackR")

The CRAN version can be installed with:

install.packages("webtrackR")

The package is still under heavy development and new features are being added on a regular basis. If you are working with webtracking data, feel free to reach out with your feature requests.

An S3 class for webtracking data

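The package defines an S3 class called wt_dt which inherits most of its functionality from data.table. Each row in a web tracking data set represents a visit. Raw data read with the package needs to have at least the following variables: panelist_id (the individual from whom the data was collected), url (the URL of the visit) and timestamp (the time of the visit).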

The function as.wt_dt assigns the class wt_dt to a raw web tracking data set. It also allows you to specify the names of the raw variables corresponding to panelist_id, url and timestamp.

All preprocessing functions check whether these three variables are present and throw an error if one is not found.
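To make the idea of a class with required variables more concrete, here is a minimal sketch in base R. It is not the webtrackR implementation: the class name toy_wt and the helpers as_toy_wt() and check_vars() are made up for illustration, and the sketch builds on data.frame rather than data.table to stay dependency-free.

```r
# Illustrative sketch only; all names below are hypothetical, not webtrackR API.
required_vars <- c("panelist_id", "url", "timestamp")

# Rename the raw columns to the required names and assign a toy class.
# `varnames` maps required name -> column name in the raw data.
as_toy_wt <- function(x, varnames = c(panelist_id = "panelist_id",
                                      url = "url",
                                      timestamp = "timestamp")) {
  stopifnot(is.data.frame(x))
  for (target in names(varnames)) {
    raw_name <- varnames[[target]]
    if (!raw_name %in% names(x)) {
      stop("variable '", raw_name, "' not found in raw data", call. = FALSE)
    }
    names(x)[names(x) == raw_name] <- target
  }
  class(x) <- c("toy_wt", class(x))
  x
}

# Check that a (toy) preprocessing function could run before doing any work.
check_vars <- function(x) {
  missing_vars <- setdiff(required_vars, names(x))
  if (length(missing_vars) > 0) {
    stop("missing required variable(s): ",
         paste(missing_vars, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Made-up raw data with non-standard column names
raw <- data.frame(
  id   = c("p1", "p1", "p2"),
  link = c("https://example.com/a", "https://example.com/b",
           "https://example.org/"),
  time = as.POSIXct(c("2023-09-12 10:00:00", "2023-09-12 10:00:30",
                      "2023-09-12 11:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

wt_toy <- as_toy_wt(raw, varnames = c(panelist_id = "id",
                                      url = "link",
                                      timestamp = "time"))
check_vars(wt_toy)

# Dropping a required column makes the check fail; capture the message.
broken <- wt_toy[, c("panelist_id", "url")]
res <- tryCatch(check_vars(broken), error = function(e) conditionMessage(e))
```

Prepending the new class to the existing class vector is what lets the object keep behaving like a regular data frame while functions can still test for the toy class with inherits().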

Data Preprocessing

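Currently, the main functionality of the package is to preprocess a raw webtracking dataset and add helpful variables for later analysis. Among these functions are add_duration(), extract_domain(), deduplicate() and add_panelist_data(), all of which appear in the example code below.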
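One preprocessing step used in the example code later in this post is deduplication, described there as dropping consecutive visits to the same URL within one second. The base R sketch below shows one way such a rule could work. It is an assumption-laden illustration, not the package's code: the helper drop_consecutive_duplicates() is hypothetical, and whether the boundary case (a gap of exactly `within` seconds) counts as a duplicate is my choice here.

```r
# Illustrative sketch only; not the webtrackR implementation.
# Made-up visits for one panelist
visits <- data.frame(
  panelist_id = "p1",
  url = c("https://example.com/a", "https://example.com/a",
          "https://example.com/b", "https://example.com/a"),
  timestamp = as.POSIXct("2023-09-12 10:00:00", tz = "UTC") + c(0, 0.5, 10, 20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a visit is a duplicate if the previous visit of the
# same panelist has the same URL and started at most `within` seconds earlier.
drop_consecutive_duplicates <- function(x, within = 1) {
  x <- x[order(x$panelist_id, x$timestamp), , drop = FALSE]
  n <- nrow(x)
  if (n < 2) return(x)
  idx <- 2:n
  same_person <- x$panelist_id[idx] == x$panelist_id[idx - 1]
  same_url    <- x$url[idx] == x$url[idx - 1]
  gap <- as.numeric(difftime(x$timestamp[idx], x$timestamp[idx - 1],
                             units = "secs"))
  # The first row can never be a duplicate of a previous one.
  is_dup <- c(FALSE, same_person & same_url & gap <= within)
  x[!is_dup, , drop = FALSE]
}

deduped <- drop_consecutive_duplicates(visits, within = 1)
```

In this toy data only the second visit is removed: the last visit goes to the same URL as the first, but it is neither consecutive to it nor within one second.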

Classification

So far, the package implements one classification function, classify_visits(). It categorizes website visits either by extracting the URL’s domain or host and matching it against a list of domains or hosts, or by matching a list of regular expressions against the visit URL. Some precompiled lists are currently included in the package, but these will move to a dedicated package, domainator, at a later stage.
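The two matching strategies can be sketched in a few lines of base R. This is only meant to illustrate the idea and is not how classify_visits() is implemented: the lookup table, the helper toy_host() and the helper classify_by_regex() are all hypothetical, and the host extraction is deliberately simplified (it does not handle user info in URLs or derive registered domains).

```r
# Illustrative sketch only; not the webtrackR implementation.
# Hypothetical lookup table: one row per host with its class
host_classes <- data.frame(
  host = c("example.com", "news.example.org"),
  type = c("shopping", "news"),
  stringsAsFactors = FALSE
)

# Simplified host extraction: strip the scheme, then everything from the
# first "/", "?", "#" or ":" onward, then an optional "www." prefix.
toy_host <- function(url) {
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)
  host <- sub("[/?#:].*$", "", host)
  sub("^www\\.", "", tolower(host))
}

urls <- c("https://www.example.com/cart?id=1",
          "http://news.example.org/politics/story",
          "https://unknown.example.net/")

# (1) Classify by exact host match against the lookup table;
#     hosts without an entry stay NA.
by_host <- host_classes$type[match(toy_host(urls), host_classes$host)]

# (2) Classify by regular expressions matched against the full URL;
#     the first matching pattern wins.
patterns <- c(news = "/politics/", shopping = "/cart")
classify_by_regex <- function(url, patterns) {
  out <- rep(NA_character_, length(url))
  for (cls in names(patterns)) {
    hit <- is.na(out) & grepl(patterns[[cls]], url)
    out[hit] <- cls
  }
  out
}
by_regex <- classify_by_regex(urls, patterns)
```

Exact matching on a host or domain is fast and predictable, while regular expressions can pick up on the path of a URL, which matters when different sections of the same site belong to different classes.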

Summarizing and aggregating

Example code

A typical workflow including preprocessing, classifying and aggregating web tracking data looks like this (using the in-built example data):

library(webtrackR)

# load example data and turn it into wt_dt
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)

# add duration
wt <- add_duration(wt)

# extract domains
wt <- extract_domain(wt)

# drop duplicates (consecutive visits to the same URL within one second)
wt <- deduplicate(wt, within = 1, method = "drop")

# load example domain classification and classify domains
data("domain_list")
wt <- classify_visits(wt, classes = domain_list, match_by = "domain")

# load example survey data and join with web tracking data
data("testdt_survey_w")
wt <- add_panelist_data(wt, testdt_survey_w)

# aggregate number of visits by day and panelist, and by domain class
wt_summ <- sum_visits(wt, timeframe = "date", visit_class = "type")
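For readers who want to see what the final aggregation step does conceptually, the following base R sketch counts visits per panelist and day, and additionally per class. It uses made-up data and plain aggregate(); it does not reproduce the output format of sum_visits(), which I am not assuming here.

```r
# Illustrative sketch only; not the webtrackR implementation.
# Made-up classified visits
visits <- data.frame(
  panelist_id = c("p1", "p1", "p1", "p2"),
  timestamp = as.POSIXct(c("2023-09-12 10:00:00", "2023-09-12 12:00:00",
                           "2023-09-13 09:00:00", "2023-09-12 08:00:00"),
                         tz = "UTC"),
  type = c("news", "shopping", "news", "news"),
  stringsAsFactors = FALSE
)

# Derive the day of each visit and a counter column to sum over
visits$date <- as.Date(visits$timestamp, tz = "UTC")
visits$n <- 1L

# Number of visits per panelist and day
n_total <- aggregate(n ~ panelist_id + date, data = visits, FUN = sum)

# Number of visits per panelist, day and class (long format)
n_by_class <- aggregate(n ~ panelist_id + date + type, data = visits, FUN = sum)
```

The long format keeps one row per panelist, day and class; reshaping it to one column per class is a separate step if a wide table is needed.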

Reuse

CC BY 4.0
Citation

BibTeX citation:
@online{schoch2023,
  author = {Schoch, David},
  title = {Preprocessing and Analyzing Web Tracking Data with
    {webtrackR}},
  date = {2023-09-12},
  url = {http://blog.schochastics.net/posts/2023-09-12_preprocessing-and-analyzing-web-tracking-data-with-webtrackr},
  langid = {en}
}
For attribution, please cite this work as:
Schoch, David. 2023. “Preprocessing and Analyzing Web Tracking Data with webtrackR.” September 12, 2023. http://blog.schochastics.net/posts/2023-09-12_preprocessing-and-analyzing-web-tracking-data-with-webtrackr.