The post Choroplethr v3.5.3 is now on CRAN appeared first on AriLamstein.com.

```
install.packages("choroplethr")
packageVersion("choroplethr")
[1] '3.3.1'
```

This new version was motivated by a few warnings that had started appearing since the latest update to ggplot2:

```
library(choroplethr)
data(df_pop_state)
state_choropleth(df_pop_state)
Warning messages:
1: `panel.margin` is deprecated. Please use `panel.spacing` property instead
2: `panel.margin` is deprecated. Please use `panel.spacing` property instead
3: `panel.margin` is deprecated. Please use `panel.spacing` property instead
```

These warnings did not affect the actual functionality of choroplethr. But the issue that caused them has now been resolved.

It appears that the latest release of ggplot2 has also caused issues with the ggmap package. Choroplethr uses ggmap when superimposing choropleth maps over reference maps (i.e. setting `reference_map = TRUE`).

```
library(choroplethr)
data(df_pop_county)
county_choropleth(df_pop_county, state_zoom = "california", reference_map = TRUE)
Error: GeomRasterAnn was built with an incompatible version of ggproto.
Please reinstall the package that provides this extension.
```

One workaround for this is to use the development version of ggmap, which you can get from GitHub like this:

```
library(devtools)
install_github("dkahle/ggmap")
# restart R
library(choroplethr)
data(df_pop_county)
county_choropleth(df_pop_county, state_zoom = "california", reference_map = TRUE)
```

I do not know when the maintainer of ggmap is planning to push this fix to CRAN.

This update also changes how I handle documentation for choroplethr.

In the past choroplethr had several vignettes which I published online via CRAN. I’m now trying to consolidate all the documentation for my open source projects here on my own website. If you’d like to view the old choroplethr vignettes, you can still do so. They are on my new “Open Source” page here: http://www.arilamstein.com/open-source.

(This article was first published on **DataCamp Blog**, and kindly contributed to R-bloggers)

Our newest course, Object-Oriented Programming in R: S3 and R6 taught by Richie Cotton is now available! Object-oriented programming (OOP) lets you specify relationships between functions and the objects that they can act on, helping you manage complexity in your code. This is an intermediate level course, providing an introduction to OOP, using the S3 and R6 systems. S3 is a great day-to-day R programming tool that simplifies some of the functions that you write. R6 is especially useful for industry-specific analyses, working with web APIs, and building GUIs. The course concludes with an interview with Winston Chang, creator of the R6 package. What are you waiting for?

Object-Oriented Programming in R: S3 and R6 features 56 interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will get you on your way to becoming an OOP master!

In the first chapter, you’ll learn what object-oriented programming (OOP) consists of, when to use it, and what OOP systems are available in R. You’ll also learn how R identifies different types of variable, using classes, types, and modes. [Start First Chapter For Free] Next, Richie explains how to use S3, and how generics and methods work. S3 is a very simple object-oriented system that lets you define different behavior for functions, depending upon their input argument. In the third chapter, you’ll learn how to define R6 classes, and to create R6 objects. You’ll also learn about the structure of R6 classes, and how to separate the user interface from the implementation details. Next, you’ll learn how to inherit from an R6 class, and how the relationship between parent and child classes works. Finally, you’ll complete your mastery of R6 by learning about advanced topics such as copying by reference, shared fields, cloning objects, and finalizing objects.
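To make the S3 ideas above concrete, here is a minimal sketch of a generic and its methods; the `describe()` function is a toy example of mine, not taken from the course:

```r
# A generic function dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")

# Methods for specific classes
describe.numeric <- function(x, ...) paste("A numeric vector of length", length(x))
describe.factor  <- function(x, ...) paste("A factor with", nlevels(x), "levels")

# A fallback for everything else
describe.default <- function(x, ...) paste("An object of class", class(x)[1])

describe(c(1.5, 2.5))          # "A numeric vector of length 2"
describe(factor(c("a", "b")))  # "A factor with 2 levels"
```

Calling `describe()` picks the right method automatically, which is exactly the "different behavior depending on the input argument" the course describes.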

To **leave a comment** for the author, please follow the link and comment on their blog: **DataCamp Blog**.


(This article was first published on **R – Fronkonstin**, and kindly contributed to R-bloggers)

Hi Hillary, it’s Donald. Would you like to have a beer with me at La Cabra Brewing, in Berwyn, Pennsylvania? (Hypothetical use of The Meeting Point Locator)

Finding a place to have a drink with someone can be a difficult task. It is quite common that one of them does not want to travel to the other’s territory. I am sure you have faced this situation many times. With *The Meeting Point Locator* this will no longer be an issue, since it will give you a list of bars and cafés equidistant from any two given locations. Let’s see an example.

I do not know if Hillary Clinton and Donald Trump have met each other since the recent elections in the United States, but they probably will. Let’s suppose Hillary doesn’t want to go to The White House and that Donald prefers some place other than Hillary’s home. No problem at all. According to this, Hillary lives in **Chappaqua, New York** and Donald will live in **The White House, Washington** (although he supposedly won’t do so full time, as he announced recently). These two locations are the only input that *The Meeting Point Locator* needs to propose equidistant places to have a drink. This is how it works:

- It generates a number of coordinates on the great circle which passes through the midpoint of the original locations and is orthogonal to the rhumb line defined by them; the number of points depends on the distance between the original locations.
- It arranges these coordinates according to their distance to the original locations, from the nearest to the most distant.
- Depending also on the distance between the original locations, it defines a radius to search around each point generated on the great circle (once calculated, this radius is constant for all searches).
- Starting from the nearest point, it looks for a number of places (20 by default) to have a drink, using the radius calculated previously. To do this, it calls the Google Places API. Once the number of locations is reached, the process stops.

This map shows the places proposed for Hillary and Donald (blue points) as well as the original locations (red ones). You can zoom in for details:

These are the 20 closest places to both of them:

Some other examples of the *utility* of *The Meeting Point Locator*:

- **Pau Gasol** (who lives in San Antonio, Texas) and **Marc Gasol** (in Memphis, Tennessee) can meet at *The Draft Sports Bar*, in Leesville (Louisiana), to have a beer while watching an NBA match. It is 537 kilometers from both of them.
- **Bob Dylan** (who lives in Malibu, California) and **The Swedish Academy** (in Stockholm, Sweden) can *smooth things over* drinking a caipirinha at *Bar São João*, in Tremedal (Brazil), *only* 9,810 kilometers from both of them.
- **Spiderman** (in New York City) and **Doraemon** (in Tokyo, Japan) can meet at *Andreyevskaya*, in Stroitel (Russia), to have a hot drink. Since they are *superheroes*, they will cover the 9,810 kilometers of separation in no time at all.

I faced two challenges in this experiment: how to generate the orthogonal great circle from two given locations, and how to define the radius and the number of points over this circle for the searches. I will try to explain both in depth in a future post.

You will find the code below. To make it work, do not forget to get your own key for the Google Places API Web Service here. I hope this tool will be helpful for someone; if so, do not hesitate to let me know.

```
library(httr)
library(jsonlite)
library(dplyr)
library(ggmap)
library(geosphere)
library(DT)
library(leaflet)

# Write both addresses here (input)
place1 = "Chappaqua, New York, United States of America"
place2 = "The White House, Washington DC, United States of America"

# Call the Google Maps API to obtain coordinates of the previous addresses
p1 = geocode(place1, output = "latlon")
p2 = geocode(place2, output = "latlon")

# To do searches I need a radius
radius = ifelse(distGeo(p1, p2) > 1000000, 10000,
         ifelse(distGeo(p1, p2) > 100000, 2500, 1000))

# And a number of points
npoints = ifelse(distGeo(p1, p2) > 1000000, 2002,
          ifelse(distGeo(p1, p2) > 100000, 7991, 19744))

# Place here the Google Places API key
key = "PLACE_YOUR_OWN_KEY_HERE"

# Build the url to look for bars and cafes with the previous key
url1 = "https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=lat,lon&radius="
url2 = "&types=cafe|bar&key="
url  = paste0(url1, radius, url2, key)

# This is to obtain the great circle orthogonal to the rhumb defined by the input
# locations and which passes over the midpoint. I will explain this step in the future
mid  = midPoint(p1, p2)
dist = distGeo(p1, p2)
x = p1
y = p2
while (dist > 1000000) {
  x = midPoint(mid, x)
  y = midPoint(mid, y)
  dist = distGeo(x, y)
}
bea = bearingRhumb(x, y)
points = greatCircle(destPoint(p = mid, b = bea + 90, d = 1), mid, n = npoints)

# Arrange the points depending on the distance to the input locations
data.frame(dist2p1 = apply(points, 1, function(x) distGeo(p1, x)),
           dist2p2 = apply(points, 1, function(x) distGeo(p2, x))) %>%
  mutate(order = apply(., 1, function(x) max(x))) %>%
  cbind(points) %>%
  arrange(order) -> points

# Start searches
nlocs = 0  # locations counter (by default stops when 20 is reached)
niter = 1  # iterations counter (stops if greater than number of points on the great circle)
results = data.frame()
while (!(nlocs >= 20 | niter > npoints)) {
  print(niter)
  url %>%
    gsub("lat", points[niter, 'lat'], .) %>%
    gsub("lon", points[niter, 'lon'], .) %>%
    GET %>%
    content("text") %>%
    fromJSON -> retrieve
  df = data.frame(lat     = retrieve$results$geometry$location$lat,
                  lng     = retrieve$results$geometry$location$lng,
                  name    = retrieve$results$name,
                  address = retrieve$results$vicinity)
  results %>% rbind(df) -> results
  nlocs = nlocs + nrow(df)
  niter = niter + 1
}

# I prepare results to build a data table
data.frame(dist2p1 = apply(results, 1, function(x) round(distGeo(p1, c(as.numeric(x[2]), as.numeric(x[1]))) / 1000, digits = 1)),
           dist2p2 = apply(results, 1, function(x) round(distGeo(p2, c(as.numeric(x[2]), as.numeric(x[1]))) / 1000, digits = 1))) %>%
  mutate(mx = apply(., 1, function(x) max(x))) %>%
  cbind(results) %>%
  arrange(mx) %>%
  mutate(rank = row_number()) %>%
  select(-mx) -> resultsDT

# This is the data table
datatable(resultsDT, class = 'cell-border stripe', rownames = FALSE,
          options = list(pageLength = 5),
          colnames = c('Distance to A (Km)', 'Distance to B (Km)',
                       'Latitude', 'Longitude', 'Name', 'Address', 'Rank'))

# Map with the locations using leaflet
resultsDT %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = resultsDT$lng, lat = resultsDT$lat,
                   radius = 8, color = "blue", stroke = FALSE, fillOpacity = 0.5,
                   popup = paste(paste0("<b>", resultsDT$name, "</b>"), resultsDT$address, sep = " ")) %>%
  addCircleMarkers(lng = p1$lon, lat = p1$lat,
                   radius = 10, color = "red", stroke = FALSE, fillOpacity = 0.5,
                   popup = paste("<b>Place 1</b>", place1, sep = " ")) %>%
  addCircleMarkers(lng = p2$lon, lat = p2$lat,
                   radius = 10, color = "red", stroke = FALSE, fillOpacity = 0.5,
                   popup = paste("<b>Place 2</b>", place2, sep = " "))
```

To **leave a comment** for the author, please follow the link and comment on their blog: **R – Fronkonstin**.


(This article was first published on **Data Literacy - The blog of Andrés Gutiérrez**, and kindly contributed to R-bloggers)

Multilevel regression with poststratification (MrP) is a useful technique to predict a parameter of interest within small domains by modeling the mean of the variable of interest conditional on poststratification counts. This method (or family of methods) was first proposed by Gelman and Little (1997) and is widely used in political science, where voting intention is modeled conditional on the interaction of classification variables.

The aim of this methodology is to provide reliable estimates for strata based on census counts. For those with some background in survey sampling, this method should look very similar to raking, where sampling weights are adjusted to match known census cell counts. However, a significant difference from raking is that MrP is a model-based approach rather than a design-based method. This way, even in the presence of a (maybe complex) survey design, MrP does not take it into account for inference. In other words, the sampling design is considered ignorable. So, the probability measure that governs the whole inference is based on modeling the voting intention (the variable of interest) conditional on demographic categories (the auxiliary variables).

Is ignoring the complex survey design a major technical issue? Yes, because in any case a probability measure was used to draw the sample. However, when it comes to voting intention (a major area where this technique is used), we rarely find a sophisticated complex design. Moreover, these kinds of studies are rarely based on probabilistic polls. So, if the survey lacks a proper sampling plan, it is always better to model the response variable.

Therefore, the ultimate goal of this technique is to estimate a parameter of interest (totals, means, proportions, etc.) for all of the strata (domains, categories or subgroups) in a finite population. From now on, let’s assume that:

- a population is divided into $H$ strata of interest (for example states),
- the parameters of interest are the means (the same rules apply for proportions) in each stratum, $\theta_h$ ($h=1, \ldots, H$),
- every stratum is cross-classified by some demographic cells $j$ (from now on called post-strata), and every population count $N_j$ is known, and
- all of the population means $\mu_j$ can be estimated by using some statistical technique, such as multilevel regression.

This way, the mean in stratum $h$ ($\theta_h$) is defined as a function of the means in the post-strata $j$ ($\mu_j$) and the post-strata counts ($N_j$):

$$\theta_h = \frac{\sum_{j \in h} N_j \mu_j }{\sum_{j \in h} N_j}$$
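Once the post-strata counts and model-estimated means are known, this estimator is a one-liner in R. A minimal sketch, using the rounded zone-A figures from the Lucy example that follows:

```r
# Post-strata within one stratum h: known census counts N_j and
# model-estimated means mu_j (rounded zone-A values from the Lucy example below)
N_j  <- c(Big = 30, Medium = 180, Small = 97)
mu_j <- c(Big = 1265.6, Medium = 686.4, Small = 371.7)

# theta_h = sum_j(N_j * mu_j) / sum_j(N_j)
theta_h <- sum(N_j * mu_j) / sum(N_j)
theta_h  # about 643.6, matching the zone-A estimate obtained later
```

The estimator is simply a count-weighted average of the post-stratum means.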

The first part of MrP is defined by a multilevel regression (MR). This kind of model is a particular case of mixed-effects models. The core components of this phase are:

- the variable of interest (for example, voting intention),
- the auxiliary variables (classification on census demographic cells) and,
- the random effects (strata of interest, that usually are states or counties).

The second part of MrP is cell poststratification (P): the predicted response variable is aggregated into the strata of interest (e.g., states) and adjusted by the corresponding poststratification weights.

Let’s consider the Lucy database of the TeachingSampling package. This population contains some economic variables of 2396 industrial companies in a particular year. Assume that we want to estimate the mean income of industries by each of the five existing zones (strata of interest) on that database. This way, our parameters of interest are $\theta_1, \ldots, \theta_5$.

Now, we also know that the population is divided into three levels (small, medium and big industries) and we have access to the total number of industries within each cross group. That is, we know exactly how many small industries are on each of the five zones, and how many medium industries are on each of the five zones, and so on.

The following code shows how to load the database and obtain the cell counts.

```
> rm(list = ls())
> set.seed(123)
>
> library(TeachingSampling)
> library(dplyr)
> library(lme4)
>
> data("Lucy")
> # Number of industries per level
> table(Lucy$Level)

   Big Medium  Small
    83    737   1576

> # Number of industries per zone
> table(Lucy$Zone)

  A   B   C   D   E
307 727 974 223 165

> # Size of post-strata
> (Np <- table(Lucy$Level, Lucy$Zone))

           A   B   C   D   E
  Big     30  13   1  16  23
  Medium 180 121 111 187 138
  Small   97 593 862  20   4
```

Of course, this technique works on a selected sample. That’s why we are going to select a random sample of size $n = 1000$. We can also create some tables showing the counts in the sample.

```
> # A sample is selected
> SLucy <- sample_n(Lucy, size = 1000)
> table(SLucy$Level)

   Big Medium  Small
    33    280    687

> table(SLucy$Zone)

  A   B   C   D   E
130 295 426  86  63
```

The first step of MrP is multilevel regression, used to estimate the post-strata means. The following code shows how to estimate them using the **lmer** function. The object **Mupred** contains the corresponding $\mu_j$ ($j$ is defined for each level) for each stratum (zone).

```
> # Step 1: <<MR>> - Multilevel regression
> M1 <- lmer(Income ~ Level + (1 | Zone), data = SLucy)
> coef(M1)
$Zone
  (Intercept) LevelMedium LevelSmall
A    1265.596   -579.1851  -893.8958
B    1138.337   -579.1851  -893.8958
C    1189.285   -579.1851  -893.8958
D    1248.658   -579.1851  -893.8958
E    1284.322   -579.1851  -893.8958

attr(,"class")
[1] "coef.mer"

> SLucy$Pred <- predict(M1)
>
> # Summary
> grouped <- group_by(SLucy, Zone, Level)
> sum <- summarise(grouped, mean2 = mean(Pred))
> (Mupred <- matrix(sum$mean2, ncol = 5, nrow = 3))
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1265.5959 1138.3370 1189.2852 1248.6575 1284.3224
[2,]  686.4107  559.1518  610.1000  669.4724  705.1373
[3,]  371.7001  244.4412  295.3894  354.7618  390.4267
```

Now that we have estimated the post-strata means, it’s time to weight them by their corresponding counts in order to obtain an estimate of the mean income by zone. As we know each post-stratum size, we simply combine the count table **Np** with the predicted means to obtain the MrP estimator of the parameters of interest.

```
> # Step 2: <<P>> - Post-stratification
> # Mean income estimation per zone
> colSums(Np * Mupred) / table(Lucy$Zone)

       A        B        C        D        E
643.5724 312.8052 332.1726 682.8031 778.2428
```

How accurate was the estimation? Quite accurate, compared to the true parameters of the finite population.

```
> # True parameters
> aggregate(Lucy$Income, by = list(Lucy$Zone), FUN = mean)
  Group.1        x
1       A 652.2834
2       B 320.7469
3       C 331.0195
4       D 684.9821
5       E 767.3879
```

To **leave a comment** for the author, please follow the link and comment on their blog: **Data Literacy - The blog of Andrés Gutiérrez**.


(This article was first published on **R / Notes**, and kindly contributed to R-bloggers)

This note lists a few of the organizations that are pushing the R language forward, as of early 2017. R is a happy language right now.

Historically, the R Project for Statistical Computing has been supported by the R Foundation since its inception in 2002. It has laid down some of the most important building blocks of the R ecosystem, including, of course, CRAN, as well as the *R Journal* and the R mailing-lists.

Fifteen years later, many other organizations have been set up to help develop R and its user base, at various levels and through various means:

- The R Foundation has set up the R Foundation Taskforce on Women and Other Under-Represented Groups, which teams up with a new kind of R user group, R-Ladies, a worldwide initiative to encourage gender diversity in the R community.
- The R Consortium brings private-sector funds to R developers through grants for development and community projects. Its blog documents the progress made on each project, many of which are pretty awesome.
- The R-Bloggers blogs aggregator has been around for a while, and a new initiative launched last year, R Weekly, provides a more digestible list of R-related material, at a slower pace. The list is written collaboratively.
- Last but not least, the RStudio company develops the RStudio IDE and related things like Shiny, R server-side products, and many R packages. It also just recently held its very first conference, rstudio::conf.

This list does not cover smaller organizations, such as the recently created r-spatial group, which help develop R packages for a myriad of different applications, often with very different audiences.

I would say that R is a pretty happy community right now. Getting help with R is easier than ever, the quality of many new software releases is very high, and the user base is becoming more and more diverse, which is a huge (and indispensable) asset.

The next step might be to boost the job opportunities available to R users, and to better organise the ways that it is taught in universities, on online learning platforms like Coursera or DataCamp, or through private training firms.

Although there is no single way to keep track of everything going on in the R community, almost everything shows up on Twitter at some point, generally labelled with the #rstats hashtag.

Go and explore, and happy new year!

To **leave a comment** for the author, please follow the link and comment on their blog: **R / Notes**.



(This article was first published on **Thinking inside the box**, and kindly contributed to R-bloggers)

Yesterday afternoon, the ninth update in the 0.12.* series of Rcpp made it to the CRAN network for GNU R. Windows binaries have by now been generated, and the package was updated in Debian too. This 0.12.9 release follows the 0.12.0 release from late July, the 0.12.1 release in September, the 0.12.2 release in November, the 0.12.3 release in January, the 0.12.4 release in March, the 0.12.5 release in May, the 0.12.6 release in July, the 0.12.7 release in September, and the 0.12.8 release in November, making it the tenth release at the steady bi-monthly release frequency.

Rcpp has become *the* most popular way of enhancing GNU R with C or C++ code. As of today, 906 packages on CRAN depend on Rcpp for making analytical code go faster and further. That is up by sixty-three packages over the two months since the last release, or about a package a day!

Some of the changes in this release are smaller and detail-oriented. We did squash one annoying bug (stemming from the improved exception handling) in `Rcpp::stop()` that hit a few people. Nathan Russell added a `sample()` function (similar to the optional one in RcppArmadillo); this required a minor cleanup for a small number of other packages which used both namespaces ‘opened’. `Date` and `Datetime` objects now have `format()` methods and `<<` output support. We now have coverage reports via covr as well. Last but not least, James "coatless" Balamuta was once more tireless on documentation and API consistency; see below for more details.

## Changes in Rcpp version 0.12.9 (2017-01-14)

Changes in Rcpp API:

- The exception stack message is now correctly demangled on all compiler versions (Jim Hester in #598)
- Date and Datetime objects and vectors now have format methods and `operator<<` support (#599).
- The `size` operator in `Matrix` is explicitly referenced, avoiding a g++-6 issue (#607 fixing #605).
- The underlying date calculation code was updated (#621, #623).
- Addressed improper diagonal fill for non-symmetric matrices (James Balamuta in #622 addressing #619)

Changes in Rcpp Sugar:

Changes in Rcpp unit tests:

Changes in Rcpp Documentation:

Changes in Rcpp build system:

Thanks to CRANberries, you can also look at a diff to the previous release. As always, even fuller details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. A local directory has source and documentation too. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To **leave a comment** for the author, please follow the link and comment on their blog: **Thinking inside the box**.


`prepdat` is an R package that helps researchers optimize and speed up their analysis by providing various cross sections of the data, in order to better understand the results.

`prepdat` was created by Ayala S. Allon and Roy Luria. The full paper about `prepdat` was published in the Journal of Open Research Software on Nov 25, 2016, and can be downloaded here.

To better understand the abilities of the package let’s look at an example. Look at the two distributions in the figure below. The difference between the means of these two distributions is significant (*t*(20) = 8.65, *p* < 0.001). As can be seen, the “mdvc1” distribution is right skewed, and the “mdvc5” distribution is not skewed.

It could be that the “mdvc1” distribution, because of its skewness, is not best characterized by the mean but rather by another dependent measure, such as the 25th percentile. As such, one should consider examining and testing different cross sections and dependent measures of the data, because they can provide information about the source of the effect in question.

Yet, in many studies the comparison between experimental conditions at the statistical inference stage is done directly on the means, without examining other cross sections of the data.

`prepdat`, using the `prep()` function, outputs various dependent measures of the dependent variable (e.g., means after rejecting observations according to a flexible standard deviation criterion, and percentiles), enabling the user to better understand the results. In addition, `prep()` can aggregate raw data tables in long format according to any number of grouping variables (i.e., independent variables).

Aggregating a raw data table means reducing the data to the desired level of information, resulting in a finalized table, usually in wide format. Each row in this table refers to a specific subject (the variable that identifies the unit upon which the measurement took place; i.e., the id variable), and each cell usually reflects the averaged performance of that subject according to the desired grouping variables (i.e., the independent and dependent variables). This finalized table often contains only selected variables relative to the raw data table.
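In dplyr terms, the aggregation step described above boils down to something like the following sketch (the column names are hypothetical, and this is not `prep()`'s internal code):

```r
library(dplyr)

# Hypothetical long-format raw data: one row per trial
raw <- data.frame(
  subject   = rep(1:2, each = 4),
  condition = rep(c("congruent", "incongruent"), times = 4),
  rt        = c(420, 510, 455, 530, 430, 495, 441, 520)
)

# One cell per subject x condition, averaged across trials
raw %>%
  group_by(subject, condition) %>%
  summarise(mean_rt = mean(rt))
```

`prep()` automates this kind of aggregation while also computing the additional dependent measures listed below.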

For each aggregated cell in the finalized table, `prep()` will output:

- Means before and after rejecting observations according to a flexible standard deviation criterion.
- Number of rejected observations according to the flexible standard deviation criterion.
- Proportions of rejected observations according to the flexible standard deviation criterion.
- Number of observations before rejection.
- Standard deviations.
- Medians.
- Additional percentiles (e.g., the 0.05th, 0.25th, 0.75th, 0.95th percentiles).
- Means after rejecting observations according to procedures described in Van Selst & Jolicoeur (1994; suitable when measuring reaction-times).
- Harmonic means.
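To illustrate what a flexible standard deviation criterion means in practice, here is a hand-rolled sketch of the idea (my own toy code, not `prepdat`'s implementation): observations farther than `k` standard deviations from their cell mean are discarded before the mean is recomputed.

```r
# Toy reaction-time data for one aggregation cell (one clear outlier)
rt <- c(420, 455, 430, 441, 1200, 438, 460)

# Reject observations beyond k standard deviations from the cell mean
sd_trimmed_mean <- function(x, k) {
  keep <- abs(x - mean(x)) <= k * sd(x)
  c(mean_after = mean(x[keep]), n_rejected = sum(!keep))
}

sd_trimmed_mean(rt, k = 2)  # the 1200 ms observation is rejected
```

`prep()`'s `sd_criterion` argument lets you request several values of `k` at once (e.g., `c(1, 1.5, 2)`).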

`prep()` is suitable for aggregating various types of experimental designs, such as between-subjects designs, within-subjects (i.e., repeated measures) designs, and mixed designs (i.e., designs that combine between-subjects and within-subjects independent variables). `prep()` is very easy to use: it only involves filling in various arguments and, when needed, changing the default procedures for removing outliers.

`prep()` accepts the following arguments:

```
prep(
dataset = NULL
, file_name = NULL
, file_path = NULL
, id = NULL
, within_vars = c()
, between_vars = c()
, dvc = NULL
, dvd = NULL
, keep_trials = NULL
, drop_vars = c()
, keep_trials_dvc = NULL
, keep_trials_dvd = NULL
, id_properties = c()
, sd_criterion = c(1, 1.5, 2)
, percentiles = c(0.05, 0.25, 0.75, 0.95)
, outlier_removal = NULL
, keep_trials_outlier = NULL
, decimal_places = 4
, notification = TRUE
, dm = c()
, save_results = TRUE
, results_name = "results.txt"
, results_path = NULL
, save_summary = TRUE
)
```

In many research fields the outcome of running an experiment is a raw data file (e.g., a text file) for each subject, containing a table in which each row describes one trial conducted during the experiment. For example in Experimental Psychology, this file will contain numerical description of the subject’s performance in the various experimental conditions. The columns in this raw data table will describe the independent variables, dependent variables, and various characteristics of the subject and the experiment (e.g., age, gender, and a numerical description of the stimulus in the experiment). The rows in this raw data table will describe the observations (i.e., trials) conducted during the experiment, such that each row in the table corresponds to one observation. Usually, this raw data table has over a hundred lines, and the number of raw data files corresponds to the number of subjects in a given experiment.
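Conceptually, stacking those per-subject files into one table looks like the following base R sketch (a sketch of the idea only, not `file_merge()`'s actual implementation; the folder path and file pattern are hypothetical):

```r
# Read every per-subject text file in a folder and stack them into one table.
# Assumes all files share the same header and column order.
files  <- list.files("raw_data/", pattern = "\\.txt$", full.names = TRUE)
tables <- lapply(files, read.table, header = TRUE)
big_raw <- do.call(rbind, tables)
```

`file_merge()` wraps this step and additionally handles headers, file extensions, and saving the merged table.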

The next step (before aggregating the data into one finalized table) is to merge all the files into one big raw data table. `file_merge()` merges the individual raw data files into one big raw data table containing a ‘chain’ of raw data from all subjects, one after the other.

`file_merge()` accepts the following arguments:

```
file_merge(
folder_path = NULL
, has_header = TRUE
, new_header = c()
, raw_file_name = NULL
, raw_file_extension = NULL
, file_name = "dataset.txt"
, save_table = TRUE
, dir_save_table = folder_path
, notification = TRUE
)
```

To install `prepdat` from CRAN:

`install.packages("prepdat")`

To install the most current version of `prepdat`

, sometimes even before its official release on CRAN:

`devtools::install_github("ayalaallon/prepdat")`

To load `prepdat`

in a current R session:

`library(prepdat)`

To summarize, `prepdat` enables the user to easily and quickly merge (using `file_merge()`) and aggregate (using `prep()`) raw data tables, while keeping track of and summarizing every step of the preparation.

For questions, comments, and suggestions please email me at ayalaallon@gmail.com or open an issue in GitHub.

Allon, A. S., & Luria, R. (2016). prepdat- An R Package for Preparing Experimental Data for Statistical Analysis. *Journal of Open Research Software*, 4(1), e43. DOI: http://doi.org/10.5334/jors.134

Allon, A. S. & Luria, R. (2016). prepdat: Preparing Experimental Data for Statistical Analysis. R package version 1.0.8. http://CRAN.R-project.org/package=prepdat

Van Selst, M. & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. *The Quarterly Journal of Experimental Psychology*, Section A, 47:3, 631-650, DOI: http://dx.doi.org/10.1080/14640749408401131

(This article was first published on **R-exercises**, and kindly contributed to R-bloggers)

In the exercises below we cover some material on multiple regression in R.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

We will be using the dataset `state.x77`, which is part of the `state` datasets available in R. (Additional information about the dataset can be obtained by running `help(state.x77)`.)
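Before starting, a quick optional look at the data for orientation (this is not one of the exercises):

```r
# state.x77 is a matrix shipped with base R's 'datasets' package
df <- as.data.frame(state.x77)
head(df)  # first rows: Population, Income, Illiteracy, Life Exp, ...
str(df)   # 50 observations (states) of 8 numeric variables
```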

**Exercise 1**

a. Load the `state` datasets.

b. Convert the `state.x77` dataset to a data frame.

c. Rename the `Life Exp` variable to `Life.Exp`, and `HS Grad` to `HS.Grad`. (This avoids problems with referring to these variables when specifying a model.)

**Exercise 2**

Suppose we wanted to enter all the variables in a first-order linear regression model with Life Expectancy as the dependent variable. Fit this model.

**Exercise 3**

Suppose we wanted to remove the `Income`, `Illiteracy`, and `Area` variables from the model in Exercise 2. Use the `update` function to fit this model.


**Exercise 4**

Let’s assume that we have settled on a model that has `HS.Grad` and `Murder` as predictors. Fit this model.

**Exercise 5**

Add an interaction term to the model in Exercise 4 (3 different ways).

**Exercise 6**

For this and the remaining exercises in this set we will use the model from Exercise 4.

Obtain 95% confidence intervals for the coefficients of the two predictor variables.

**Exercise 7**

Predict the Life Expectancy for a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

**Exercise 8**

Obtain a 98% confidence interval for the mean Life Expectancy in a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

**Exercise 9**

Obtain a 98% confidence interval for the Life Expectancy of a person living in a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

**Exercise 10**

Since our model only has two predictor variables, we can generate a 3D plot of our data and the fitted regression plane. Create this plot.

To **leave a comment** for the author, please follow the link and comment on their blog: **R-exercises**.
