(This article was first published on **CYBAEA Data and Analysis**, and kindly contributed to R-bloggers)

We have a mild obsession with employee productivity and how it declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is mildly scary.

Let’s try the FTSE-100 index of leading UK companies to see if they are significantly different from the S&P 500 leading American companies that we analyzed four years ago.

We will of course use the R statistical computing and analysis platform for our analysis, and once again we are grateful to Yahoo Finance for providing the data.

The analysis script is available as ftse100.R and is really simple:

    ## ftse100.R - Display employee productivity for FTSE-100 constituents
    ## Copyright © 2010 Allan Engelhardt <http://www.cybaea.net/>
    ## All Rights Reserved.

    ## Get the index constituents.
    ftse.100 <- read.csv(file = "http://uk.old.finance.yahoo.com/d/quotes.csv?s=@%5EFTSE&f=s&e=.csv",
                         header = FALSE)
    names(ftse.100) <- c("symbol")
    data <- data.frame(symbol=NULL, employees=NULL, profit=NULL, sector=NULL)

    ## For each stock symbol, get employees, profit, and sector
    for (symbol in ftse.100$symbol) {
        profile.url <- paste("http://uk.finance.yahoo.com/q/pr?s=", symbol, sep="")
        con <- url(profile.url, open = "r")
        text <- readChar(con, 2^24)         # enough bytes
        close(con)

        ## Number of employees: strip thousands separators, NA if not numeric
        x <- sub('.*Number of employees:</td><td.*?>[[:space:]]*([[:digit:],]+).*',
                 "\\1", text, ignore.case = TRUE)
        x <- gsub(',', '', x)
        empl <- tryCatch(as.integer(x), warning = function(x) NA)

        ## Net profit is reported in millions
        x <- sub('.*Net Profit.*?</td><td.*?>[[:space:]]*([+-]?[[:digit:],]+).*',
                 '\\1', text)
        x <- gsub(',', '', x)
        profit <- tryCatch(as.integer(x)*1e6, warning = function(x) NA)

        sector <- sub('.*Sector:</td><td.*?>(.*?)</td>.*', '\\1', text)

        ## Skip companies with missing or non-positive values (log scale below)
        if (any(c(empl, profit) <= 0, is.na(c(empl, profit)))) {
            cat("Error parsing symbol", symbol, "see", profile.url, "\n")
        } else {
            data <- rbind(data,
                          data.frame(symbol=symbol, employees=empl,
                                     profit=profit, sector=sector))
        }
        Sys.sleep(1)
    }

    ## levels(data$sector) is used below, so make sure sector is a factor
    ## (data.frame() no longer converts strings to factors from R 4.0 onwards).
    data$sector <- factor(data$sector)

    ## Save the data so we don't have to hit Yahoo all the time.
    save(data, file = "data.RData")

    ## Save plot to file:
    #png(filename="ftse100.png", width=800, height=800, pointsize=14, bg="white", res=100)
    opar <- par(cex.sub = sqrt(sqrt(2)), font.sub = 3, font.lab = 2)

    ## x and y coordinates of plot and plot limits
    x <- with(data, employees)
    y <- with(data, profit/employees)
    xlim <- c(10^floor(log10(min(x))), 10^ceiling(log10(max(x))))
    ylim <- c(10^floor(log10(min(y))), 10^ceiling(log10(max(y))))

    ## Set up to display different color and symbols
    plot_col <- 1
    plot_pch <- 1
    markers <- 21:25
    pchs <- rep(markers, ceiling(length(levels(data$sector))/length(markers)))
    palette(rainbow(length(levels(data$sector)), start=3/6, end=6/6))

    ## Make empty plot:
    plot.new()
    plot(profit/employees ~ employees, data = data[FALSE, ],
         type = "p", pch = pchs[plot_pch], col = plot_col,
         log="xy", xaxp = c(xlim, 1), yaxp = c(ylim, 1),
         xlim = xlim, ylim = ylim,
         main = "Profit per employee (FTSE 100)",
         xlab = "Employees", ylab = "Profit per employee (GBP)")

    ## Plot each sector
    for (sector in levels(data$sector)) {
        plot.xy(xy.coords(with(data[data$sector == sector,], employees),
                          with(data[data$sector == sector,], profit/employees),
                          log = "xy", xlab = "", ylab = ""),
                type = "p", pch = pchs[plot_pch], col = plot_col, bg = plot_col)
        plot_pch <- plot_pch + 1
        plot_col <- plot_col + 1
    }
    legend(x = "bottomleft", legend = levels(data$sector),
           title = "Industry Sectors",
           col = palette(), pt.bg = palette(), pch = pchs,
           cex = 2/3, pt.cex = 1, ncol = 2)

    ## Fit a linear model to the log-log data:
    m <- lm(log10(y) ~ log10(x))
    xl <- c(xlim[1]*5, xlim[2]/5)
    yl <- 10^predict(m, data.frame(x = xl))
    lines(xl, yl, col = "darkred", lty = "dashed", lwd = 2)
    t <- sprintf("Power = %0.3g", m$coefficients[2])
    text(xl[2], yl[2], t, adj = c(0.25, -1.5), col = "darkred", font = 2)

    ## All done.
    par(opar)
    dev.off()   # finishes writing the file when the png() call above is enabled
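
To see how the extraction step works without hitting Yahoo, here is a minimal sketch applied to a made-up HTML fragment. The fragment and the helper name `extract_employees` are hypothetical, and the pattern is a simplified variant of the one in the script (it uses `regexec()` with `perl = TRUE` rather than `sub()`), so treat it as an illustration rather than the script's exact behaviour.

```r
## Hypothetical profile-page fragment; the real page layout may differ.
html <- paste0('<table><tr><td class="l">Number of employees:</td>',
               '<td class="r"> 97,600</td></tr></table>')

## Illustrative helper (not part of ftse100.R): returns NA if nothing matches.
extract_employees <- function(text) {
    pattern <- "Number of employees:</td><td[^>]*>\\s*([0-9,]+)"
    hit <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
    if (length(hit) < 2) return(NA_integer_)
    as.integer(gsub(",", "", hit[2]))   # drop thousands separators
}

print(extract_employees(html))
print(extract_employees("<p>no such field</p>"))
```

Returning `NA` on a failed match plays the same role as the `tryCatch()` calls in the script: a company whose page cannot be parsed is skipped rather than silently entering the fit.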

Leave it to run and this is what you get:

The power law still broadly holds. The productivity of the individual employee in a company is only about ¼ of the productivity in a company with one-tenth of the number of workers.
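
As a rough check on how these rules of thumb relate to a fitted slope, the sketch below converts between a power-law exponent and the productivity ratio it implies. It assumes productivity per employee scales as a pure power of the number of employees; the helper names `implied_exponent` and `productivity_ratio` are illustrative and not part of ftse100.R.

```r
## Assumption: productivity per employee ~ employees^k (a pure power law).
## Helper names are hypothetical, for illustration only.

## Exponent k implied by "multiply headcount by size_factor and
## productivity is multiplied by ratio"
implied_exponent <- function(size_factor, ratio) {
    log(ratio) / log(size_factor)
}

## Productivity ratio implied by exponent k when headcount grows by size_factor
productivity_ratio <- function(size_factor, k) {
    size_factor^k
}

k_treble  <- implied_exponent(3, 1/2)    # treble the workers, halve productivity
k_tenfold <- implied_exponent(10, 1/4)   # ten times the workers, a quarter of it
print(round(c(treble = k_treble, tenfold = k_tenfold), 3))

## What the "treble" exponent implies for a company ten times larger
print(round(productivity_ratio(10, k_treble), 3))
```

Both rules of thumb correspond to an exponent of roughly -0.6 (about -0.63 and -0.60 respectively), which is the sense in which the two statements describe broadly the same power law.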

Extending the analysis to the FTSE All-Share index is easy (ftse-all.R) and gives a slope of -0.7605541 for the 301 companies with the required information, which is a much steeper decline in productivity with size. A more convincing fit restricts the data to companies with more than 1,000 employees (to avoid some bias from smaller companies needing large profits per employee in order to be big enough to afford a stock market listing) and gives a slope of -0.2838.
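
Restricting the fit to larger companies can be done with the `subset` argument of `lm()`. The sketch below uses synthetic data, since the real data frame depends on the Yahoo pages being reachable; the generating exponent of -0.6, the noise level, and the name `sim` are assumptions for illustration and not results from this article.

```r
## Synthetic stand-in for the scraped data; all numbers are made up.
set.seed(42)
n <- 200
sim <- data.frame(employees = round(10^runif(n, min = 2, max = 5.5)))
## Profit per employee follows a power law with exponent -0.6 plus noise
sim$profit <- sim$employees * 1e6 * sim$employees^(-0.6) * 10^rnorm(n, sd = 0.3)

## Fit on all companies, then only on those with more than 1,000 employees
m_all <- lm(log10(profit / employees) ~ log10(employees), data = sim)
m_big <- lm(log10(profit / employees) ~ log10(employees), data = sim,
            subset = employees > 1000)

slopes <- c(all = unname(coef(m_all)[2]), big = unname(coef(m_big)[2]))
print(round(slopes, 3))
```

Because no listing bias is simulated here, both fits should recover a slope close to the generating exponent; on real data the two slopes can differ markedly, as the figures above show.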


