(This article was first published on **R programming**, and kindly contributed to R-bloggers)

This week’s R bulletin will cover topics on how to resolve some common errors in R.

We will also cover functions like do.call, rename, and lapply. Hope you like this R weekly bulletin. Enjoy reading!

1. Find and Replace – Ctrl+F

2. Find Next – F3

3. Find Previous – Shift+F3

There can be two reasons for the “cannot open the connection” error to show up when we run an R script:

1) A file/connection can’t be opened because R can’t find it (mostly due to an error in the path)

2) Failure in .onLoad() because a package can’t find a system dependency

**Example:**

```
symbol = "AXISBANK"
noDays = 1
dirPath = paste(getwd(), "/", noDays, " Year Historical Data", sep = "")
fileName = paste(dirPath, symbol, ".csv", sep = "")
data = as.data.frame(read.csv(fileName))
```

```
Warning in file(file, "rt"): cannot open file 'C:/Users/Madhukar/Documents/
1 Year Historical DataAXISBANK.csv': No such file or directory
Error in file(file, "rt"): cannot open the connection
```

We are getting this error because we have specified the wrong path to the “dirPath” object in the code. The right path is shown below. We missed adding a forward slash after “Year Historical Data” in the paste function. This led to the wrong path, and hence the error.

```
dirPath = paste(getwd(), "/", noDays, " Year Historical Data/", sep = "")
```

After adding the forward slash, we re-ran the code. Below we can see the right dirPath and fileName printed in the R console.

**Example:**

```
symbol = "AXISBANK"
noDays = 1
dirPath = paste(getwd(), "/", noDays, " Year Historical Data/", sep = "")
fileName = paste(dirPath, symbol, ".csv", sep = "")
data = as.data.frame(read.csv(fileName))
print(head(data, 3))
```

The “could not find function” error arises when an R package is not loaded properly or when a function name is misspelled.

When we run the code shown below, we get a “could not find function” error in the console. This is because we have misspelled the “ymd” function (from the lubridate package) as “ymed”. If we do not load the required package, R will also throw a “could not find function ymd” error.

**Example:**

```
# Read NIFTY price data from the csv file
df = read.csv("NIFTY.csv")
# Format date
dates = ymed(df$DATE)
```

```
Error in eval(expr, envir, enclos): could not find function "ymed"
```

This “replacement” error occurs when one tries to assign a vector of values to an existing object and the lengths do not match.

In the example below, the stock price data of Axis Bank has 245 rows. In the code, we created a sequence “s” of numbers from 1 to 150. When we try to add this sequence to the Axis Bank data set, it throws a “replacement” error, as the lengths of the two do not match. Thus, to resolve such errors, one should ensure that the lengths match.

**Example:**

```
symbol = "AXISBANK"
noDays = 1
dirPath = paste(getwd(), "/", noDays, " Year Historical Data/", sep = "")
fileName = paste(dirPath, symbol, ".csv", sep = "")
df = as.data.frame(read.csv(fileName))
# Number of rows in the dataframe "df"
n = nrow(df)
print(n)
# Create a sequence of numbers from 1 to 150
s = seq(1, 150, 1)
# Add a new column "X" to the existing data frame "df"
df$X = s
print(head(df, 3))
```

```
Error in `$<-.data.frame`(`*tmp*`, "X", value = c(1, 2, 3, 4, 5, 6, 7, :
  replacement has 150 rows, data has 245
```
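One generic way to resolve such a mismatch (a sketch of my own, not from the bulletin) is to pad the shorter vector with NA before assigning it as a new column:

```r
# Hypothetical sketch: pad a shorter vector with NA so its length matches
# the number of rows of the data frame before assigning it as a new column.
df <- data.frame(price = runif(245))  # stands in for the 245-row stock data
s <- seq(1, 150, 1)                   # only 150 values

df$X <- c(s, rep(NA, nrow(df) - length(s)))  # lengths now match: no error
print(tail(df$X, 3))  # the padded tail is NA NA NA
```

Alternatively, subset the data frame down to the length of the vector if only those rows are needed.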

The do.call function is used for calling other functions. The function to be called is provided as the first argument to do.call, while the second argument is a list of arguments to pass to it. The syntax of the function is:

`do.call(function_name, arguments)`

**Example:** Let us first define a simple function that we will call later in the do.call function.

```
numbers = function(x, y) {
  sqrt(x^3 + y^3)
}
# Now let us call this 'numbers' function using the do.call function. We provide
# the function name as the first argument, and a list of the arguments as the
# second argument.
do.call(numbers, list(x = 3, y = 2))
```

[1] 5.91608

The rename function is part of the dplyr package, and is used to rename the columns of a data frame. The syntax for the rename function is to have the new name on the left-hand side of the = sign, and the old name on the right-hand side. Consider the data frame “df” given in the example below.

**Example:**

```
library(dplyr)
Tic = c("IOC", "BPCL", "HINDPETRO", "ABAN")
OP = c(555, 570, 1242, 210)
CP = c(558, 579, 1248, 213)
df = data.frame(Tic, OP, CP)
print(df)
```

```
# Renaming the columns as 'Ticker', 'OpenPrice', and 'ClosePrice'. This can be
# done in the following manner:
renamed_df = rename(df, Ticker = Tic, OpenPrice = OP, ClosePrice = CP)
print(renamed_df)
```

The lapply function is part of the R base package. It takes a list “X” as input and returns a list of the same length as “X”, each element of which is the result of applying a function to the corresponding element of X. The syntax of the function is:

`lapply(X, FUN)`

where,

X is a vector (atomic or list)

FUN is the function to be applied

**Example 1:**

Let us create a list with 2 elements, OpenPrice and the ClosePrice. We will compute the mean of the values in each element using the lapply function.

```
x = list(OpenPrice = c(520, 521.35, 521.45), ClosePrice = c(521, 521.1, 522))
lapply(x, mean)
```

```
$OpenPrice
[1] 520.9333

$ClosePrice
[1] 521.3667
```

**Example 2:**

```
x = list(a = 1:10, b = 11:15, c = 1:50)
lapply(x, FUN = length)
```

```
$a
[1] 10

$b
[1] 5

$c
[1] 50
```

We hope you liked this bulletin. In the next weekly bulletin, we will cover more interesting methods and R functions for our readers.


(This article was first published on **R Blog**, and kindly contributed to R-bloggers)

*Stefan Feuerriegel*

This blog entry concerns our course on “Operations Research with R” that we teach as part of our study program. We hope that the materials are of value to lecturers and everyone else working in the field of numerical optimization.

**Course outline**

The course starts with a review of numerical and linear algebra basics for optimization. Here, students learn how to derive a problem statement that is compatible with solving algorithms. This is followed by an overview of problem classes, such as one- and multi-dimensional problems. Starting with linear and quadratic algorithms, we also cover convex optimization, followed by non-linear approaches: gradient-based (gradient descent methods), Hessian-based (Newton and quasi-Newton methods) and non-gradient-based (Nelder-Mead). We finally demonstrate the potent capabilities of R for Operations Research: we show how to solve optimization problems in industry and business, as well as illustrate the use in methods for statistics and data mining (e.g. support vector machines or quantile regression). All examples are supported by appropriate visualizations.

**Goals**

1. Motivate to use R for operations research tasks

2. Familiarize with classes of optimization problems

3. Perform numerical optimization tasks in R using suitable packages

**Motivation**

R is widely taught in business courses and, hence, known by most data scientists with a business background. However, when it comes to Operations Research, many other languages are used. Especially for optimization, solutions range from Microsoft Excel solvers to modeling environments such as Matlab and GAMS. Most of these are non-free and require students to learn yet another language. Because of this, we propose to use R for the optimization problems of Operations Research, since R is open source, free, and widely known. Furthermore, R provides a multitude of numerical optimization packages that are readily available. At the same time, R is widely used in industry, making it a suitable tool to leverage the potential of numerical optimization.
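As a small taste of what base R already ships with (my own illustrative sketch, not part of the course materials), `stats::optim()` can minimize the classic Rosenbrock test function with the derivative-free Nelder-Mead method:

```r
# Rosenbrock's banana function, a standard non-linear test problem
rosenbrock <- function(v) {
  x <- v[1]; y <- v[2]
  (1 - x)^2 + 100 * (y - x^2)^2
}

# Nelder-Mead needs no gradient; start from the conventional point (-1.2, 1)
fit <- optim(par = c(-1.2, 1), fn = rosenbrock, method = "Nelder-Mead")
print(fit$par)          # should land close to the true minimum at (1, 1)
print(fit$convergence)  # 0 signals successful convergence
```

Gradient-based methods such as `method = "BFGS"` typically need fewer function evaluations when a gradient is available.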

**Download link to the resources:**

https://www.is.uni-freiburg.de/resources/computational-economics?set_language=en


(This article was first published on **S+/R – Yet Another Blog in Statistical Computing**, and kindly contributed to R-bloggers)

In the development of operational loss models, it is important to identify which distribution should be used to model operational risk measures, e.g. frequency and severity. For instance, why should we use the Gamma distribution instead of the Inverse Gaussian distribution to model the severity?

In my previous post https://statcompute.wordpress.com/2016/11/20/modified-park-test-in-sas, it is shown how to use the Modified Park test to identify the mean-variance relationship and then decide the corresponding distribution of operational risk measures. Following the similar logic, we can also leverage the flexibility of the Tweedie distribution to accomplish the same goal. Based upon the parameterization of a Tweedie distribution, the variance = Phi * (Mu ** P), where Mu is the mean and P is the power parameter. Depending on the specific value of P, the Tweedie distribution can accommodate several important distributions commonly used in the operational risk modeling, including Poisson, Gamma, Inverse Gaussian. For instance,

- With P = 0, the variance would be independent of the mean, indicating a Normal distribution.
- With P = 1, the variance would be in a linear form of the mean, indicating a Poisson-like distribution.
- With P = 2, the variance would be in a quadratic form of the mean, indicating a Gamma distribution.
- With P = 3, the variance would be in a cubic form of the mean, indicating an Inverse Gaussian distribution.
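These mean-variance relationships are easy to check empirically in base R (an illustrative sketch of mine, not part of the original post):

```r
# Simulate from two of the distributions and compare sample mean and variance
set.seed(42)

x_pois <- rpois(1e5, lambda = 4)           # P = 1: variance is linear in the mean
c(mean = mean(x_pois), var = var(x_pois))  # both close to 4

x_gamma <- rgamma(1e5, shape = 2, rate = 0.5)  # P = 2: variance = mean^2 / shape
c(mean = mean(x_gamma), var = var(x_gamma))    # mean near 4, variance near 8
```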

In the example below, it is shown that the value of P is in the neighborhood of 1 for the frequency measure and is near 3 for the severity measure and that, given P closer to 3, the Inverse Gaussian regression would fit the severity better than the Gamma regression.

```
library(statmod)
library(tweedie)

profile1 <- tweedie.profile(Claim_Count ~ Age + Vehicle_Use, data = AutoCollision,
                            p.vec = seq(1.1, 3.0, 0.1), fit.glm = TRUE)
print(profile1$p.max)
# [1] 1.216327
# The P parameter close to 1 indicates that the claim_count might follow a
# Poisson-like distribution

profile2 <- tweedie.profile(Severity ~ Age + Vehicle_Use, data = AutoCollision,
                            p.vec = seq(1.1, 3.0, 0.1), fit.glm = TRUE)
print(profile2$p.max)
# [1] 2.844898
# The P parameter close to 3 indicates that the severity might follow an
# Inverse Gaussian distribution

BIC(glm(Severity ~ Age + Vehicle_Use, data = AutoCollision,
        family = Gamma(link = log)))
# [1] 360.8064
BIC(glm(Severity ~ Age + Vehicle_Use, data = AutoCollision,
        family = inverse.gaussian(link = log)))
# [1] 350.2504
```

Together with the Modified Park test, the estimation of P in a Tweedie distribution is able to help us identify the correct distribution employed in operational loss models in the context of GLM.


(This article was first published on **R – Mark van der Loo**, and kindly contributed to R-bloggers)

So you are using this pipeline to have data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R

```
> data(retailers, package="validate")
> head(retailers, 3)
size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1 sc0 0.02 75 NA NA 1130 NA 18915 20045 NA
2 sc3 0.14 9 1607 NA 1607 131 1544 63 NA
3 sc3 0.14 NA 6886 -33 6919 324 6493 426 NA
```

This data is dirty with missings and full of errors. Let us do some imputations with simputation.

```
> out <- retailers %>%
+ impute_lm(other.rev ~ turnover) %>%
+ impute_median(other.rev ~ size)
>
> head(out,3)
size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1 sc0 0.02 75 NA 6114.775 1130 NA 18915 20045 NA
2 sc3 0.14 9 1607 5427.113 1607 131 1544 63 NA
3 sc3 0.14 NA 6886 -33.000 6919 324 6493 426 NA
>
```

Ok, cool, we know all that. But what if you’d like to know what value was imputed with which method? That’s where the lumberjack comes in.

The lumberjack operator is a ‘pipe’ [1] operator that allows you to track changes in data.

```
> library(lumberjack)
> retailers$id <- seq_len(nrow(retailers))
> out <- retailers %>>%
+ start_log(log=cellwise$new(key="id")) %>>%
+ impute_lm(other.rev ~ turnover) %>>%
+ impute_median(other.rev ~ size) %>>%
+ dump_log(stop=TRUE)
Dumped a log at cellwise.csv
>
> read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3)
step time expression key variable old new
1 2 2017-06-23 21:11:05 CEST impute_median(other.rev ~ size) 1 other.rev NA 6114.775
2 1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover) 2 other.rev NA 5427.113
3 1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover) 6 other.rev NA 6341.683
>
```

So, to track changes we only need to switch from `%>%` to `%>>%` and add the `start_log()` and `dump_log()` function calls to the data pipeline. (To be sure: it works with any function, not only with simputation.) The package is on CRAN now, and please see the introductory vignette for more examples and ways to customize it.

There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.

If this post got you interested, please install the package using

```
install.packages('lumberjack')
```

You can get started with the introductory vignette or even just use the lumberjack operator `%>>%` as a (close) replacement of the `%>%` operator.

As always, I am open to suggestions and comments, for example through the package’s GitHub page.

Also, I will be talking at useR2017 about the simputation package, but I will sneak in a bit of lumberjack as well :p.

And finally, here’s a picture of a lumberjack smoking a pipe.

[1] It really should be called a function composition operator, but potetoes/potatoes.


(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

R is incredible software for statistics and data science. But while the bits and bytes of software are an essential component of its usefulness, software needs a **community** to be successful. And that's an area where R really shines, as Shannon Ellis explains in this lovely ROpenSci blog post. For software, a thriving community offers developers, expertise, collaborators, writers and documentation, testers, agitators (to keep the community *and* software on track!), and so much more. Shannon provides links where you can find all of this in the R community:

- **#rstats hashtag** — a responsive, welcoming, and inclusive community of R users to interact with on Twitter
- **R-Ladies** — a world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters
- **Local R meetup groups** — a google search may show that there's one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable
- **Rweekly** — an incredible weekly recap of all things R
- **R-bloggers** — an awesome resource to find posts from many different bloggers using R
- **DataCarpentry** and **Software Carpentry** — a resource of openly available lessons that promote and model reproducible research
- **Stack Overflow** — chances are your R question has already been answered here (with additional resources for people looking for jobs)

I'll add a couple of others as well:

- **R Conferences** — The annual useR! conference is the major community event of the year, but there are many smaller community-led events on various topics.
- **Github** — there's a fantastic community of R developers on Github. There's no directory, but the list of trending R developers is a good place to start.
- **The R Consortium** — proposing or getting involved with an R Consortium project is a great way to get involved with the community

As I've said before, the R community is one of the greatest assets of R, and is an essential component of what makes R useful, easy, and fun to use. And you couldn't find a nicer and more welcoming group of people to be a part of.

To learn more about the R community, be sure to check out Shannon's blog post linked below.

ROpenSci Blog: Hey! You there! You are welcome here



(This article was first published on **novyden**, and kindly contributed to R-bloggers)

Skewed data prevail in real life. Unless you observe trivial or near constant processes data is skewed one way or another due to outliers, long tails, errors or something else. Such effects create problems in visualizations when a few data elements are much larger than the rest.

Consider U.S. 2016 merchandise trade partner balances data set where each point is a country with 2 features: U.S. imports and exports against it:

Suppose we decided to visualize top 30 U.S trading partners using bubble chart, which simply is a 2D scatter plot with the third dimension expressed through point size. Then U.S. trade partners become disks with imports and exports for *xy* coordinates and trade balance (*abs(export – import)*) for size:

China, Canada, and Mexico run far larger balances compared to the other 27 countries which causes most data points to collapse into crowded lower left corner. One way to “solve” this problem is to eliminate 3 mentioned outliers from the picture:

While this plot does look better, it no longer serves its original purpose of displaying **all** top trading partners. And the undesirable effect of outliers, though reduced, presents itself again with new ones: Japan, Germany, and U.K. So let us bring all countries back into the mix by trying a logarithmic scale.

A quick refresher from algebra. The log function (log base 10 in this example, but the same applies to the natural log or log base 2) is commonly used to transform positive real numbers, because of its property of mapping multiplicative relationships into additive ones. Indeed, given numbers *A*, *B*, and *C* such that

`A*B=C and A,B,C > 0`

applying *log* results in an additive relationship:

`log(A) + log(B) = log(C)`

For example, let *A=100*, *B=1000*, and *C=100000*. Then `100 * 1000 = 100000`, and after the transformation it becomes `log(100) + log(1000) = log(100000)`, or `2 + 3 = 5`.

Observe this on the 1D plane:

A logarithmic scale is simply a log transformation applied to all of a feature's values before plotting them. In our example we used it on both of the trading partners' features, imports and exports, which gives the bubble chart a new look:

The same data displayed on a logarithmic scale appear almost uniform, but do not forget that the farther points are from 0, the more orders of magnitude apart they are on the actual scale (observe this by scrolling back to the original plot). The main advantage of using a log scale in this plot is the ability to observe relationships between all top 30 countries without losing the whole picture and without collapsing the smaller points together.

For a more detailed discussion of logarithmic scales, refer to When Should I Use Logarithmic Scales in My Charts and Graphs? Oh, and how about that trade deficit with China?

*This is a re-post from the original blog on LinkedIn.*
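In R (a hypothetical sketch with made-up numbers, not the actual trade data), putting both axes on a log scale is a one-argument change in base graphics:

```r
# Five hypothetical trading partners: two giants and three small ones
trade <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  imports = c(462000, 278000, 29000, 13000, 4600),
  exports = c(116000, 266000, 23000, 9000, 3900)
)
trade$balance <- abs(trade$exports - trade$imports)  # bubble size

# log = "xy" log-transforms both axes, so the small partners no longer
# collapse into the lower left corner
plot(trade$imports, trade$exports, log = "xy",
     cex = 1 + 2 * trade$balance / max(trade$balance),
     xlab = "Imports", ylab = "Exports")
text(trade$imports, trade$exports, trade$country, pos = 3)
```

With ggplot2 the equivalent would be adding `scale_x_log10()` and `scale_y_log10()` to a `geom_point()` layer.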



(This article was first published on **Pachá (Batteries Included)**, and kindly contributed to R-bloggers)

I was in need of importing SPSS© data for work. There are some options, but I’ve used both the `foreign` and `haven` R packages. I prefer `haven` because it integrates better with R’s tidyverse; I started using it instead of `foreign` when I verified that it behaves well with factors and solves the deprecated factor labels in newer R versions.

For this post I found the Diego Portales University National Survey. It consists of a publicly available survey, applied since 2005 at a nation-wide level, that asks people about their trust in institutions (e.g. government, police, firefighters) and their opinion on same-sex marriage, restricting spaces to smoke, and more.

```
#devtools::install_github("ropenscilabs/skimr")
# Exploratory Data Analysis tools
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(skimr)
library(readr)
# Import foreign statistical formats
library(haven)
# Data
url = "http://encuesta.udp.cl/descargas/banco%20de%20datos/2015/Encuesta%20Nacional%20UDP%202015.sav"
sav = "2017-06-24_working_with_spss_data_in_r/udp_national_survey_2015.sav"
if(!file.exists(sav)){download.file(url,sav)}
survey = read_sav(sav)
```

To explore the data, consider that the survey is in Spanish. So, “fecha” means date, “edad” means age, and “sexo” means sex.

```
# How many surveys do I have by day?
daily = survey %>%
mutate(Fecha = as.Date(Fecha, "%d-%m-%Y")) %>%
rename(date = Fecha) %>%
group_by(date) %>%
summarise(n = n())
ggplot(daily, aes(date, n)) +
geom_line()
```

```
# How is the age distributed?
summary(survey$Edad_Entrevistado)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 32.00 48.00 47.92 61.00 89.00
```

```
age = survey %>%
mutate(Edad_Entrevistado = as.integer(Edad_Entrevistado)) %>%
rename(age = Edad_Entrevistado) %>%
group_by(age) %>%
summarise(n = n())
ggplot(age, aes(age, n)) +
geom_line()
```

```
# How is the sex distributed?
survey %>%
rename(sex_id = Sexo_Entrevistado) %>%
group_by(sex_id) %>%
summarise(n = n())
```

```
# A tibble: 2 x 2
     sex_id     n
  <dbl+lbl> <int>
1         1   651
2         2   651
```

In the last tibble we have no idea what 1 and 2 mean.

```
survey %>%
select(Sexo_Entrevistado) %>%
rename(sex_id = Sexo_Entrevistado) %>%
distinct() %>%
mutate(sex = as_factor(sex_id))
```

```
# A tibble: 2 x 2
     sex_id    sex
  <dbl+lbl> <fctr>
1         2  Mujer
2         1 Hombre
```

The last column (in Spanish) shows us that in this survey “1 = Male” and “2 = Female”.

I could run

```
survey %>%
rename(sex = Sexo_Entrevistado) %>%
mutate(sex = as.integer(sex)) %>%
mutate(sex = recode(sex, `1` = "Male", `2` = "Female")) %>%
group_by(sex) %>%
summarise(n = n())
```

```
# A tibble: 2 x 2
     sex     n
   <chr> <int>
1 Female   651
2   Male   651
```

The column names are labelled as well. Here `sjlabelled` helps if I want to know, for example, what “P12” means. But instead of just translating labels I’ll describe the complete dataset.

```
valid_replies = survey %>%
mutate_if(is.labelled,as.numeric) %>%
skim() %>%
filter(stat=="complete") %>%
mutate(description = get_label(survey)) %>%
select(var,description,everything()) %>%
select(-c(stat,level,type)) %>%
rename(pcent_valid = value) %>%
mutate(pcent_valid = paste0(100*round(pcent_valid / nrow(survey),2),'%'))
histograms = survey %>%
mutate_if(is.labelled,as.numeric) %>%
skim() %>%
filter(stat=="hist") %>%
select(var,level) %>%
rename(histogram = level)
survey_description = valid_replies %>%
left_join(histograms) %>%
write_csv("2017-06-24_working_with_spss_data_in_r/survey_description.csv")
survey_description
```

```
# A tibble: 203 x 4
                 var          description pcent_valid  histogram
               <chr>                <chr>       <chr>      <chr>
 1        PONDERADOR           Ponderador        100% ▂▇▇▅▅▃▁▁▁▁
 2             Folio                Folio        100% ▇▇▇▇▇▇▇▇▇▇
 3            Región               Región        100% ▁▁▂▁▂▁▁▁▇▁
 4            Comuna               Comuna        100% ▁▁▂▁▁▂▁▁▇▁
 5             Fecha     Fecha entrevista        100%       <NA>
 6  Sexo_Encuestador   Sexo Entrevistador         91% ▂▁▁▁▁▁▁▁▁▇
 7               GSE           GSE Visual        100% ▁▁▂▁▇▁▁▆▁▁
 8 Sexo_Entrevistado    Sexo Entrevistado        100% ▇▁▁▁▁▁▁▁▁▇
 9 Edad_Entrevistado    Edad Entrevistado        100% ▇▆▅▆▇▇▅▃▃▂
10       Hora_Inicio Hora Inicio Medición        100%       <NA>
# ... with 193 more rows
```

Exploring the last tibble reveals interesting questions. For example, P12 refers to *“Apoyo a la democracia”*, that is, *Do you support democracy?*.


(This article was first published on **Peter's stats stuff - R**, and kindly contributed to R-bloggers)

In an important 2005 article in the Australian Journal of Political Science, Simon Jackman set out a statistically-based approach to pooling polls in an election campaign. He describes the sensible intuitive approach of modelling a latent, unobserved voting intention (unobserved except on the day of the actual election) and treats each poll as a random observation based on that latent state space. Uncertainty associated with each measurement comes from sample size and bias coming from the average effect of the firm conducting the poll, as well as of course uncertainty about the state of the unobserved voting intention. This approach allows house effects and the latent state space to be estimated simultaneously, quantifies the uncertainty associated with both, and in general gives a much more satisfying method of pooling polls than any kind of weighted average.
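The intuition can be sketched in a few lines of R (my own toy simulation, not Jackman's model or code): a latent voting intention that drifts as a random walk, observed only through occasional noisy polls.

```r
# Latent daily voting intention as a random walk
set.seed(1)
days <- 300
mu <- numeric(days)
mu[1] <- 0.45                                          # starting support
for (t in 2:days) {
  mu[t] <- mu[t - 1] + rnorm(1, mean = 0, sd = 0.002)  # small daily drift
}

# 30 polls on random days, each a binomial sample of ~1000 voters
poll_days <- sort(sample(days, 30))
n <- 1000
polls <- rbinom(length(poll_days), size = n, prob = mu[poll_days]) / n
```

The estimation problem is then to recover the whole `mu` path (plus per-pollster house effects) from the sparse, noisy `polls`, which is exactly what makes the parameter count so large.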

Jackman gives a worked example of the approach in his excellent book Bayesian Analysis for the Social Sciences, using voting intention for the Australian Labor Party (ALP) in the 2007 Australian federal election for data. He provides `JAGS` code for fitting the model, but notes that with over 1,000 parameters to estimate (most of them the estimated voting intention for each day between the 2004 and 2007 elections) it is painfully slow to fit in general-purpose MCMC-based Bayesian tools such as `WinBUGS` or `JAGS` – several days of CPU time on a fast computer in 2009. Jackman instead estimated his model with Gibbs sampling implemented directly in R.

Down the track, I want to implement Jackman’s method of polling aggregation myself, to estimate latent voting intention for New Zealand to provide an alternative method for my election forecasts. I set myself the familiarisation task of reproducing his results for the Australian 2007 election. New Zealand’s elections are a little complex to model because of the multiple parties in the proportional representation system, so I wanted to use a general Bayesian tool for the purpose to simplify my model specification when I came to it. I use Stan because its Hamiltonian Monte Carlo method of exploring the parameter space works well when there are many parameters – as in this case, with well over 1,000 parameters to estimate.

Stan describes itself as “a state-of-the-art platform for statistical modeling and high-performance statistical computation. Thousands of users rely on Stan for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.” It lets the programmer specify a complex statistical model, and given a set of data will return a range of parameter estimates that were most likely to produce the observed data. Stan isn’t something you use as an end-to-end workbench – it’s assumed that data manipulation and presentation is done with another tool such as R, Matlab or Python. Stan focuses on doing one thing well – using Hamiltonian Monte Carlo to estimate complex statistical models, potentially with many thousands of hierarchical parameters, with arbitrarily set prior distributions.

*Caveat! – I’m fairly new to Stan and I’m pretty sure my Stan programs that follow aren’t best practice, even though I am confident they work. Use at your own risk!*

I approached the problem in stages, gradually making my model more realistic. First, I set myself the task of modelling latent first-preference support for the ALP in the absence of polling data. If all we had were the 2004 and 2007 election results, where might we have thought ALP support went between those two points? Here are my results:

For this first analysis, I specified that support for the ALP had to be a random walk that changed by a normally distributed variable with standard deviation of 0.25 percentage points for each daily change. Why 0.25? Just because Jim Savage used it in his rough application of this approach to the US Presidential election in 2016. I’ll be relaxing this assumption later.

Here’s the R code that sets up the session, brings in the data from Jackman’s `pscl` R package, and defines a graphics function that I’ll be using for each model I create.

Here’s the Stan program that specifies this super simple model of changing ALP support from 2004 to 2007:
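The original program is embedded from a gist; in case it doesn’t render, here is a minimal sketch of such a model, reconstructed from the description above rather than taken from the author’s file (the tight 0.001 anchoring standard deviation is my assumption, used only to pin the walk to the two election results):

```stan
// Sketch of the no-polling-data model: ALP support is a daily random
// walk pinned to the 2004 and 2007 election results.
data {
  int<lower=1> n_days;   // days between the two elections
  real mu_start;         // ALP support at the 2004 election
  real mu_finish;        // ALP support at the 2007 election
}
parameters {
  vector<lower=0, upper=1>[n_days] mu;  // latent daily voting intention
}
model {
  // anchor the walk tightly at the two observed election results
  mu[1] ~ normal(mu_start, 0.001);
  mu[n_days] ~ normal(mu_finish, 0.001);
  // hard-coded innovation sd of 0.25 percentage points
  // (0.0025 on the proportion scale used here)
  mu[2:n_days] ~ normal(mu[1:(n_days - 1)], 0.0025);
}
```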

And here’s the R code that calls that Stan program and draws the resulting summary graphic. Stan works by compiling a C++ program based on the statistical model specified in the `*.stan` file. The C++ program then zooms around the high-dimensional parameter space, moving more slowly around the combinations of parameters that seem more likely given the data and the specified prior distributions. It can use multiple processors on your machine and works super fast given the complexity of what it’s doing.
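In case the embedded gist doesn’t render, a hedged sketch of such a call with `rstan` follows; the file name, list structure and the numbers are illustrative assumptions, not the author’s actual values:

```r
library(rstan)
rstan_options(auto_write = TRUE)             # cache the compiled C++ model
options(mc.cores = parallel::detectCores())  # run one chain per core

# illustrative values only; the real numbers come from the election data
d1 <- list(n_days    = 1100,
           mu_start  = 0.38,
           mu_finish = 0.43)

system.time({
  model_1 <- stan(file = "oz-polls-1.stan", data = d1,
                  control = list(max_treedepth = 15))
})
# the summary graphic is then drawn with the plotting function
# defined at the start of the session
```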

Next I wanted to add a single polling firm. I chose Nielsen’s 42 polls because Jackman found they had a fairly low bias, which removed one complication for me as I built up my familiarity with the approach. Here’s the result:

That model was specified in Stan as set out below. The Stan program is more complex now; I’ve had to specify how many polls I have (`y_n`), the values for each poll (`y_values`), and the days since the last election each poll was taken (`y_days`). This way I only have to specify 42 measurement errors as part of the probability model – other implementations I’ve seen of this approach ask for an estimate of measurement error for each poll on each day, treating the days with no polls as missing values to be estimated. That obviously adds a huge computational load I wanted to avoid.

In this program, I haven’t yet added the notion of a house effect for Nielsen; each measurement Nielsen made is assumed to be unbiased. Again, I’ll be relaxing this later. The state model is also the same as before, i.e. the standard deviation of the day-to-day innovations is still hard-coded as 0.25 percentage points.
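As a stand-in for the embedded gist, a reconstruction of the measurement-model additions might look like the sketch below. The 0.01 sampling standard deviation is a placeholder of my own; a faithful implementation would derive it from each poll’s sample size:

```stan
// Sketch: random walk as before, plus 42 unbiased Nielsen measurements
data {
  int<lower=1> n_days;
  real mu_start;
  real mu_finish;
  int<lower=1> y_n;        // number of Nielsen polls
  vector[y_n] y_values;    // poll estimates of ALP support
  int y_days[y_n];         // day (since the 2004 election) of each poll
}
parameters {
  vector<lower=0, upper=1>[n_days] mu;
}
model {
  mu[1] ~ normal(mu_start, 0.001);
  mu[n_days] ~ normal(mu_finish, 0.001);
  mu[2:n_days] ~ normal(mu[1:(n_days - 1)], 0.0025);
  // each poll is an unbiased, noisy reading of the latent state
  // on the day it was taken
  for (i in 1:y_n)
    y_values[i] ~ normal(mu[y_days[i]], 0.01);
}
```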

Here’s the R code to prepare the data and pass it to Stan. Interestingly, fitting this model is noticeably faster than the one with no polling data at all. My intuition for this is that now that the state space is constrained to be reasonably close to some actually observed measurements, it’s easier for Stan to know which regions are worth exploring.

Finally, the complete model replicating Jackman’s work:

As well as adding the other four sets of polls, I’ve introduced five house effects that need to be estimated (i.e. the bias for each polling firm/mode), and I’ve told Stan to estimate the standard deviation of the day-to-day innovations in the latent support for the ALP rather than hard-coding it as 0.25. Jackman specified a uniform prior on `[0, 1]` for that parameter, but I found this led to lots of estimation problems for Stan. The Stan developers give some great practical advice on this sort of issue, and I adapted some of it to specify the prior distribution for the standard deviation of the day-to-day innovations as `N(0.5, 0.5)`, constrained to be positive.

Here’s the Stan program:
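As a placeholder for the gist, here is an abridged reconstruction showing two of the five firm/mode blocks; the half-normal prior on `sigma` below is the `N(0.5, 0.5)` percentage-point prior described above, rescaled by my assumption to the proportion scale used in this sketch:

```stan
// Sketch of the full model: five firm/mode data blocks (two shown),
// house effects d[], and an estimated innovation sd sigma
data {
  int<lower=1> n_days;
  real mu_start;
  real mu_finish;
  int<lower=1> y1_n;       // e.g. Nielsen
  vector[y1_n] y1_values;
  int y1_days[y1_n];
  int<lower=1> y2_n;       // e.g. Morgan face-to-face
  vector[y2_n] y2_values;
  int y2_days[y2_n];
}
parameters {
  vector<lower=0, upper=1>[n_days] mu;
  real d[2];               // house effect for each firm/mode
  real<lower=0> sigma;     // sd of daily innovations, now estimated
}
model {
  // half-normal prior on the innovation sd, in place of the
  // problematic uniform(0, 1)
  sigma ~ normal(0.005, 0.005);
  mu[1] ~ normal(mu_start, 0.001);
  mu[n_days] ~ normal(mu_finish, 0.001);
  mu[2:n_days] ~ normal(mu[1:(n_days - 1)], sigma);
  // each firm measures latent support plus its own house effect
  for (i in 1:y1_n)
    y1_values[i] ~ normal(mu[y1_days[i]] + d[1], 0.01);
  for (i in 1:y2_n)
    y2_values[i] ~ normal(mu[y2_days[i]] + d[2], 0.01);
}
```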

Building the fact that there are five polling firms (or firm/mode combinations, as Morgan is in there twice) directly into the program must be bad practice, but since each firm took different numbers of polls on different days I couldn’t work out a better way to do it. Stan doesn’t support ragged arrays, or objects like R’s lists, or (I think) convenient subsetting of tables, which are the three ways I’d normally try to do this in another language. So I settled for the approach above, even though it has some ugly repetition.

Here’s the R code that sorts the data and passes it to Stan:
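In case the gist doesn’t render, a hedged sketch of that preparation follows, built on the `AustralianElectionPolling` data in `pscl` (the column names come from that dataset, but the election date and the day arithmetic are my assumptions):

```r
library(pscl)
library(dplyr)

data(AustralianElectionPolling)

election_2004 <- as.Date("2004-10-09")  # assumed start of the series

polls <- AustralianElectionPolling %>%
  mutate(MidDate = startDate + (endDate - startDate) / 2,
         day     = as.numeric(MidDate - election_2004) + 1,
         p       = ALP / 100)

# one (n, values, days) triple per firm/mode, mirroring the repeated
# y1_*, y2_*, ... blocks in the Stan data declaration
stan_blocks <- lapply(split(polls, polls$org), function(x)
  list(n = nrow(x), values = x$p, days = x$day))

str(stan_blocks[[1]])  # inspect one firm's block before flattening
                       # it into the named list passed to stan()
```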

Here are the house effects estimated by me with Stan, compared to those in Jackman’s 2009 book:

Basically we got the same results – certainly close enough anyway. Jackman writes:

“The largest effect is for the face-to-face polls conducted by Morgan; the point estimate of the house effect is 2.7 percentage points, which is very large relative to the classical sampling error accompanying these polls.”

Interestingly, Morgan’s phone polls did much better.

Here’s the code that did that comparison:

So there we go – state space modelling of voting intention, with variable house effects, in the Australian 2007 federal election.

To **leave a comment** for the author, please follow the link and comment on their blog: **Peter's stats stuff - R**.
