100 grams of Lego, please.

[This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Everyone who has children or deep down still is one himself might had the pleasure to find himself in a toy store’s Lego counter once. Since it is my son’s favourite father-son-activity to build Lego stuff I did so several times in the last year. Now as I stood in front of all those Lego boxes studying prices and themes, while my son carefully pondered on which box to buy, I became suspicious because the pricing of Lego Star Wars seemed to be somewhat higher than for other Lego themes (Star Wars is my sons favourite Lego theme so I’m kinda interested in its pricing). One thought further I wondered if there’s any usable data in the world wide web on Lego sets to do some crunching.

As you’re reading this article, the answer is yes!

The website https://brickset.com has detailed data on Lego sets from the past 60 years that can easily be accessed via its API. Additionally each sets website offers some more information that can be obtained by web scraping.

Important Note: If you’d like to execute the code you need to get yourself an API key. This code unfortunately won’t work with the dummy key.

<pre class="r"><code>library("httr") library("XML") library("rvest") library("cowplot") library("Hmisc") library("lme4") library("dplyr") library("tidyr")  library("ggplot2") </code>

Get Data From Brickset API

You need to get yourself an API key. There you go: https://brickset.com/tools/webservices/requestkey. Copy the key below where the example key is specified. Mind the quotes and the hyphen.

<pre class="r"><code>apiKey <- "XXXDUMMYXXX" years  <- 2010:2016 # dont do linebreaks within url character string url <- paste0("brickset.com/api/v2.asmx/getSets?&query=&apiKey=",               apiKey,               "&userHash=&theme=&subtheme=&setNumber=&year=",               paste(years, collapse = ","),               "&Owned=&Wanted=&orderBy=&pageSize=20000&pageNumber=&userName=") xmlDat <- url %>% GET() # check if it worked http_status(xmlDat) # write xml content to data frame lego <- xmlDat %>% xmlParse() %>% xmlToDataFrame() </code>

Pick Relevant Variables And Set Correct Class

The dataset has lots of variables that are redundant to this analysis. So let’s clean up a little bit.

<pre class="r"><code> lego <- lego %>%   select(., setID, name, year, theme, pieces, UKRetailPrice, USRetailPrice,                                      EURetailPrice, bricksetURL) num <- c("year", "pieces", "UKRetailPrice", "USRetailPrice", "EURetailPrice") lego[, num] <- sapply(lego[, num], as.numeric) </code>

Now take a look on which Lego themes we’ve got.

<pre class="r"><code>lego %>%    group_by(theme) %>%   tally(sort = TRUE)</code>

There are some themes that are no ‘real’ themes or hard to compare, i.e. “Books” or “Gear”. That’s why I exclude some of them. Then I’ll pick the 4 themes with the most sets (Star Wars is one of them).

<pre class = "r"><code>lego <- lego %>%    filter(!theme %in% c("Gear", "Collectable Minifigures", "Duplo", "Books", "Promotional", "Miscellaneous")) relevantThemes <- lego %>%    group_by(theme) %>%    tally(sort = TRUE) %>%    slice(1:4) %>%    .[["theme"]] lego <- lego %>% filter(., theme %in% relevantThemes) </code>

Web Scraping For Additional Information

If you take a look at the variables in the dataset and if you take a further look on the brickset website for each set you might notice that there are some information in the web that we don’t have in our dataset. Such things as the weight and the dimensions of a set could be quite helpful information here. So what we’ll do is scrape each set-website (luckily the brickset URL is one of the variables in the dataset) and get us the data we want.

<pre class = "r"><code>for (i in 1:length(lego$bricksetURL)) {   r <- GET(lego$bricksetURL[i])      # get features and infos   feat <- r %>% read_html %>% html_nodes("dt") %>% xml_text()   info <- r %>% read_html %>% html_nodes("dd") %>% xml_text()   names(info) <- feat      # write them into the lego table   lego$weight[i] <- info["Weight"]   lego$volume[i]    <- info["Dimensions"] } </code>

Make Information Gained From Web ‘Workable’

Okay, the information is there but if you take a look at ‘weight’ for example, you’ll see that these character strings are far from ready to work with.

<pre code = "r"><code>lego$weight <- lego$weight %>%    gsub("Kg.*", "", .) %>%  # extract the weight only from character strings   as.numeric()*1000        # get grams instead of kilo lego$volume <- lego$volume %>%    strsplit(., " x | cm") %>%  # cut character string where 'x' and 'cm' is   lapply(., `[`, 1:3) %>%     # take the three first arguments from each list (length, width, height)   lapply(., as.numeric) %>%   # convert to numeric   lapply(., prod ) %>%        # dimension in cubic centimeter    unlist() / 1000             # calculate the packaging size in liters (1 liter = 1000 cubic centimeter)    # pricePerPiece (€) can be calculated by hand lego <- mutate(lego, pricePerPiece = lego$EURetailPrice/lego$pieces) </code>

Descriptive Statistics

Before start doing analysis with our nice and clean Lego dataset, we should take a look at some descriptive stats. This task is facilitated tremendously by the describe()-function from the package ‘Hmisc’.

<pre class = "r"><code>describe(lego)</code>

Use Cases Without NAs Only

As the description of the dataset has shown, there are some variables that have missing values. In this analysis we only want to use sets without any missing value (NA).

<pre class = "r"><code># how much sets are without NAs? sum(complete.cases(lego)) # exclude sets with NAs lego <- lego[complete.cases(lego), ] </code>


<pre class ="r"><code># first get rid of the extreme outliers in weight and pieces lego <- lego %>% filter(!weight/pieces > quantile(weight/pieces, na.rm = T, probs = c(.99))) </code>

For the reason that I live in Germany I will work with the € price of the Lego sets. The dashed lines refer to the average value of the corresponding variable.

<pre class ="r"><code># Price ~ Packaging Size by Theme p1 <- ggplot(lego, aes(volume, EURetailPrice)) +    geom_point(aes(color = theme), size = 2, alpha = .5) +    geom_hline(yintercept = mean(lego$EURetailPrice), lty = "dashed") +   geom_vline(xintercept = mean(lego$volume), lty = "dashed") +   ylab("Price [€]") + xlab("Packaging Size [liter]")  +   theme(legend.position = "none")  # Price ~ Pieces p2 <- ggplot(lego, aes(pieces, EURetailPrice)) +    geom_point(aes(color = theme), size = 2, alpha = .5) +    geom_hline(yintercept = mean(lego$EURetailPrice), lty = "dashed") +    geom_vline(xintercept = mean(lego$pieces), lty = "dashed") +   ylab("") + xlab("Pieces") +   theme(legend.position = "none", axis.text.y = element_blank()) # Price ~ Weight p3 <- ggplot(lego, aes(weight, EURetailPrice)) +    geom_point(aes(color = theme), size = 2, alpha = .5) +    geom_hline(yintercept = mean(lego$EURetailPrice), lty = "dashed") +    geom_vline(xintercept = mean(lego$weight), lty = "dashed") +   ylab("") + xlab("Weight [g]") +   theme(axis.text.y = element_blank())</code>
<pre class ="r"><code># combine plots plot_grid(p1, p2, p3, nrow=1, rel_widths = c(2, 2, 2.5))</code>
<pre class = "r"><code>p4 <- ggplot(lego, aes(weight/pieces, EURetailPrice/pieces)) +    geom_point(aes(color = theme, size = lego$pieces), alpha = .5) +   xlab("Weight Per Piece [g]") +   ylab("Price Per Piece [€]") +   theme(legend.position = c(0.2,0.8)) +   scale_size(guide = 'none')  p5 <-   ggplot(lego, aes(x = theme, y = EURetailPrice/pieces)) +   geom_boxplot(aes(group = theme, color = theme)) +   ylab("") + xlab("") + theme_classic() +   theme(axis.text = element_blank(), axis.ticks = element_blank(),          legend.position = "none")  p6 <-   ggplot(lego, aes(x = theme, y = weight/pieces)) +   geom_boxplot(aes(group = theme, color = theme)) +   ylab("") + xlab("") + theme_classic() + coord_flip() +   theme(axis.text = element_blank(), axis.ticks = element_blank(), legend.position = "none") 	 </code>
<pre class ="r"><code>plot_grid(p6, NULL, p4, p5, ncol = 2, rel_widths = c(1,0.3), rel_heights = c(0.3,1))</code>

The plot shows the relation of the weight per piece and the price per piece, note that the size of the points is proportional to the number of pieces in a set. The boxplots show the distribution of 'EURetailPrice/pieces' and 'weight/pieces'.
It can be seen that on average the Star Wars Lego theme has the lightest but most expensive pieces.

Price of 1g Lego by theme and currency area

Now I wondered if Lego has a different price policy in the other currency areas. For each I regressed the price per gram on the theme. In order to get a regression coefficient for every factor of theme I excluded the intercept.

<pre class ="r"><code># USA fit <- lm(USRetailPrice/weight ~ 0 + theme, data = lego) PPG_USA <- coef(fit) %>%    setNames(., unlist(fit$x)) %>%    round(., 4)*100 </code>
<pre class ="r"><code># EU fit <- lm(EURetailPrice/weight ~ 0 + theme, data = lego) PPG_EU <- coef(fit) %>%    setNames(., unlist(fit$x)) %>%    round(., 4)*100  </code>
<pre class ="r"><code># UK fit <- lm(UKRetailPrice/weight ~ 0 + theme, data = lego) PPG_UK <- coef(fit) %>%    setNames(., unlist(fit$x)) %>%    round(., 4)*100  </code>


The prices per 1 gram Lego in the table below are given in Cents ($) for the US, Pence (£) for the UK and in Cents (€) for the EU.

<table>   <tr>   <th></th>     <th>City</th>     <th>Friends</th>     <th>Ninjago</th>     <th>Star Wars</th>   </tr>   <tr>   <th>USA</th>     <td>6.50</td>     <td>7.37</td>     <td>7.98</td>      <td>8.16</td>   </tr>   <tr>    <th>UK</th>      <td>4.82</td>     <td>6.20</td>     <td>6.35</td>      <td>7.40</td>   </tr>   <tr>    <th>EU</th>     <td>5.59</td>     <td>6.91</td>     <td>7.93</td>      <td>8.95</td>   </tr> </table>

Roughly there is the same price relation among the themes in each currency area. But still you pay appr. 37 % more for Lego Star Wars compared to City in the EU. In the UK it is 35 % and in the US 20 %.

Price Development (per 1g) By Theme And Year

One information we still didn't use is the year, let's change that and take a look at the yearly price development for each theme.

<pre class ="r"><code># regression grouped by year lego$year <- as.factor(lego$year) lego$theme <- as.factor(lego$theme) priceDev <- lego %>%                lmList(EURetailPrice/weight ~ 0 + theme | year, data = .)  %>%   # regression grouped by year               coef() %>%                                                # get regression coefficients               data.frame() %>%                                          # create a data frame                setNames(., gsub('theme', '', names(.))) %>%              # nice colnames               mutate(., year = as.numeric(rownames(.)))                 # add 'year' variable from rownames                # for ggplot we need to have a clean data frame (each variable in one column) priceDev <- gather(priceDev, theme, pricePerGram, -year)   </code>
<pre class ="r"><code>priceDev <- lego %>%    group_by(year) %>%    nest() %>%    mutate(model = purrr::map(data, ~ lm(EURetailPrice/weight ~ 0 + theme, data = .))) %>%     unnest(model %>% purrr::map(broom::tidy)) %>%    select(year, term, estimate, std.error)  priceDev$term <- gsub('theme', '', priceDev$term) </code>
<pre class ="r"><code>ggplot(priceDev, aes(x = year, y = estimate, group = term)) +     geom_point(aes(color = term), size = 3) + geom_line(aes(color = term), size = 1) +    geom_errorbar(aes(ymax = estimate + std.error, ymin = estimate - std.error, width=0.2, color = term), alpha = 0.5)</code>

So the good thing is that I was right with my assumption concerning Lego Star Wars. But unfortunately this fact wont invalidate my son's argument that it is Lego Star Wars he likes to play with most.

To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)