Very happy to announce a new package of mine is now up on the CRAN repository network: RApiDatetime.
It provides six entry points for C-level functions of the R API for Date and Datetime calculations: asPOSIXlt and asPOSIXct convert between the long and compact datetime representations, formatPOSIXlt and Rstrptime convert to and from character strings, and POSIXlt2D and D2POSIXlt convert between Date and POSIXlt datetimes. These six functions are all fairly essential and useful, but none of them was previously exported by R. Hence the need to bring them together in this package and make the accessible API somewhat more complete.
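For orientation, the six entry points correspond to the following familiar R-level operations. This is an illustrative mapping only, not the C-level API itself (the actual signatures are declared in the package header):
lt <- as.POSIXlt(Sys.time())             # asPOSIXlt: compact -> long representation
ct <- as.POSIXct(lt)                     # asPOSIXct: long -> compact representation
ch <- format(lt, "%Y-%m-%d %H:%M:%S")    # formatPOSIXlt: datetime -> character
pt <- strptime(ch, "%Y-%m-%d %H:%M:%S")  # Rstrptime: character -> datetime
dt <- as.Date(pt)                        # POSIXlt2D: POSIXlt -> Date
l2 <- as.POSIXlt(dt)                     # D2POSIXlt: Date -> POSIXlt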
These should be helpful for fellow package authors, as many of us have either kept our own partial copies of some of this code, or farmed the work back out to R to get it done.
As a simple (yet real!) illustration, here is an actual Rcpp function which we could now cover at the C level rather than having to go back up to R (via Rcpp::Function()):
inline Datetime::Datetime(const std::string &s, const std::string &fmt) {
    Rcpp::Function strptime("strptime");    // we cheat and call strptime() from R
    Rcpp::Function asPOSIXct("as.POSIXct"); // and we need to convert to POSIXct
    m_dt = Rcpp::as<double>(asPOSIXct(strptime(s, fmt)));
    update_tm();
}
I had taken a first brief stab at this about two years ago, but never finished it. With the recent emphasis on C-level function registration, coupled with a possible use case from anytime, I more or less put this together last weekend.
It currently builds and tests fine on POSIX-alike operating systems. If someone with some skill and patience in working on Windows would like to help complete the Windows side of things, I would certainly welcome help and pull requests.
For questions or comments please use the issue tracker off the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
QR decomposition is another technique for decomposing a matrix into a form that is easier to work with in further applications. The QR decomposition technique decomposes a square or rectangular matrix, which we will denote as A, into two components, Q and R.
A = QR
Where Q is an orthogonal matrix, and R is an upper triangular matrix. Recall an orthogonal matrix is a square matrix with orthonormal row and column vectors such that Q^T Q = I, where I is the identity matrix. The term orthonormal implies the vectors are of unit length and are perpendicular (orthogonal) to each other.
QR decomposition is often used in linear least squares estimation and is, in fact, the method used by R in its lm() function. Signal processing and MIMO systems also employ QR decomposition. There are several methods for performing QR decomposition, including the Gram-Schmidt process, Householder reflections, and Givens rotations. This post is concerned with the Gram-Schmidt process.
The Gram-Schmidt process is used to find an orthogonal basis from a non-orthogonal basis. An orthogonal basis has many properties that are desirable for further computations and expansions. As noted previously, an orthogonal matrix has row and column vectors of unit length:
||a_n|| = \sqrt{a_n \cdot a_n} = \sqrt{a_n^T a_n} = 1
Where a_n is a linearly independent column vector of a matrix. The vectors in an orthogonal basis are also perpendicular to each other. The Gram-Schmidt process works by finding an orthogonal vector q_n for each column vector a_n: it subtracts from a_n its projections onto the previously computed vectors q_j, and the resulting vector is then divided by its length to produce a unit vector.
Consider a matrix A with n column vectors such that:
A = \left[ a_1 | a_2 | \cdots | a_n \right]
The Gram-Schmidt process proceeds by finding the orthogonal projection of the first column vector a_1.
v_1 = a_1, \qquad e_1 = \frac{v_1}{||v_1||}
Because a_1 is the first column vector, there are no preceding projections to subtract. For the second column a_2, the projection onto the first direction is subtracted:
v_2 = a_2 - proj_{v_1} (a_2) = a_2 - (a_2 \cdot e_1) e_1, \qquad e_2 = \frac{v_2}{||v_2||}
This process continues through all n column vectors, where each incremental step k + 1 is computed as:
v_{k+1} = a_{k+1} - (a_{k+1} \cdot e_1) e_1 - \cdots - (a_{k+1} \cdot e_k) e_k, \qquad e_{k+1} = \frac{v_{k+1}}{||v_{k+1}||}
Here || \cdot || denotes the L_2 norm, defined for a vector v with m components as:
||v|| = \sqrt{\sum^m_{j=1} v_j^2}
The projection used above can also be written explicitly for a general (not necessarily unit) vector u as:
proj_u (a) = \frac{u \cdot a}{u \cdot u} u
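To make the projection concrete, here is a small illustrative R helper (not part of the implementation below) that follows the definition directly:
# Projection of a onto u, following the definition above
proj <- function(u, a) as.numeric(crossprod(u, a) / crossprod(u, u)) * u
proj(c(2, 2, 1), c(-2, 1, 2))  # returns c(0, 0, 0): a_2 is already orthogonal to a_1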
Thus, the matrix A can be factorized into the product QR as follows:
A = \left[a_1 | a_2 | \cdots | a_n \right] = \left[e_1 | e_2 | \cdots | e_n \right] \begin{bmatrix}a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\ 0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & a_n \cdot e_n\end{bmatrix} = QR
Consider the matrix A:
A = \begin{bmatrix} 2 & -2 & 18 \\ 2 & 1 & 0 \\ 1 & 2 & 0 \end{bmatrix}
We would like to orthogonalize this matrix using the Gram-Schmidt process. The resulting orthogonalized matrix is also equivalent to Q in the QR decomposition.
The Gram-Schmidt process on the matrix A proceeds as follows:
v_1 = a_1 = \begin{bmatrix}2 \\ 2 \\ 1\end{bmatrix} \qquad e_1 = \frac{v_1}{||v_1||} = \frac{\begin{bmatrix}2 \\ 2 \\ 1\end{bmatrix}}{\sqrt{\sum{\begin{bmatrix}2 \\ 2 \\ 1\end{bmatrix}^2}}} = \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix}
v_2 = a_2 - (a_2 \cdot e_1) e_1 = \begin{bmatrix}-2 \\ 1 \\ 2\end{bmatrix} - \left(\begin{bmatrix}-2 \\ 1 \\ 2\end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix}\right)\begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix}
v_2 = \begin{bmatrix}-2 \\ 1 \\ 2\end{bmatrix} \qquad e_2 = \frac{v_2}{||v_2||} = \frac{\begin{bmatrix}-2 \\ 1 \\ 2\end{bmatrix}}{\sqrt{\sum{\begin{bmatrix}-2 \\ 1 \\ 2\end{bmatrix}^2}}}
e_2 = \begin{bmatrix} -\frac{2}{3} \\ \frac{1}{3} \\ \frac{2}{3} \end{bmatrix}
v_3 = a_3 - (a_3 \cdot e_1) e_1 - (a_3 \cdot e_2) e_2
v_3 = \begin{bmatrix}18 \\ 0 \\ 0\end{bmatrix} - \left(\begin{bmatrix}18 \\ 0 \\ 0\end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix}\right)\begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix} - \left(\begin{bmatrix}18 \\ 0 \\ 0\end{bmatrix} \cdot \begin{bmatrix} -\frac{2}{3} \\ \frac{1}{3} \\ \frac{2}{3} \end{bmatrix} \right)\begin{bmatrix} -\frac{2}{3} \\ \frac{1}{3} \\ \frac{2}{3} \end{bmatrix}
v_3 = \begin{bmatrix}2 \\ -4 \\ 4 \end{bmatrix} \qquad e_3 = \frac{v_3}{||v_3||} = \frac{\begin{bmatrix}2 \\ -4 \\ 4\end{bmatrix}}{\sqrt{\sum{\begin{bmatrix}2 \\ -4 \\ 4\end{bmatrix}^2}}}
e_3 = \begin{bmatrix} \frac{1}{3} \\ -\frac{2}{3} \\ \frac{2}{3} \end{bmatrix}
Thus, the orthogonalized matrix resulting from the Gram-Schmidt process is:
Q = \begin{bmatrix} \frac{2}{3} & -\frac{2}{3} & \frac{1}{3} \\ \frac{2}{3} & \frac{1}{3} & -\frac{2}{3} \\ \frac{1}{3} & \frac{2}{3} & \frac{2}{3} \end{bmatrix}
The component R of the QR decomposition can also be found from the calculations made in the Gram-Schmidt process as defined above.
R = \begin{bmatrix}a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\ 0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & a_n \cdot e_n \end{bmatrix} = \begin{bmatrix} \begin{bmatrix} 2 \\ 2 \\ 1 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix} & \begin{bmatrix} -2 \\ 1 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix} & \begin{bmatrix} 18 \\ 0 \\ 0 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix} \\ 0 & \begin{bmatrix} -2 \\ 1 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} -\frac{2}{3} \\ \frac{1}{3} \\ \frac{2}{3} \end{bmatrix} & \begin{bmatrix} 18 \\ 0 \\ 0 \end{bmatrix} \cdot \begin{bmatrix} -\frac{2}{3} \\ \frac{1}{3} \\ \frac{2}{3} \end{bmatrix} \\ 0 & 0 & \begin{bmatrix} 18 \\ 0 \\ 0 \end{bmatrix} \cdot \begin{bmatrix} \frac{1}{3} \\ -\frac{2}{3} \\ \frac{2}{3} \end{bmatrix}\end{bmatrix}
R = \begin{bmatrix} 3 & 0 & 12 \\ 0 & 3 & -12 \\ 0 & 0 & 6 \end{bmatrix}
We use the same matrix A to verify our results above.
A <- rbind(c(2,-2,18), c(2,1,0), c(1,2,0))
A
##      [,1] [,2] [,3]
## [1,]    2   -2   18
## [2,]    2    1    0
## [3,]    1    2    0
The following function is an implementation of the modified version of the Gram-Schmidt algorithm. A good comparison of the classical and modified versions of the algorithm can be found in the references below. The modified Gram-Schmidt algorithm is used here due to its improved numerical stability, which results in more nearly orthogonal columns than the classical algorithm.
gramschmidt <- function(x) {
  x <- as.matrix(x)
  # Get the number of rows and columns of the matrix
  n <- ncol(x)
  m <- nrow(x)
  # Initialize the Q and R matrices
  q <- matrix(0, m, n)
  r <- matrix(0, n, n)
  for (j in 1:n) {
    v <- x[, j] # Step 1 of the Gram-Schmidt process: v_1 = a_1
    # Skip the first column
    if (j > 1) {
      for (i in 1:(j - 1)) {
        r[i, j] <- t(q[, i]) %*% x[, j] # Find the inner product (noted to be q^T a earlier)
        # Subtract the projection from v, which causes v to become perpendicular to all columns of Q
        v <- v - r[i, j] * q[, i]
      }
    }
    # The L2 norm of v becomes the jth diagonal entry of R
    r[j, j] <- sqrt(sum(v^2))
    # The orthogonalized result is found and stored in the jth column of Q
    q[, j] <- v / r[j, j]
  }
  # Collect the Q and R matrices into a list and return
  qrcomp <- list('Q' = q, 'R' = r)
  return(qrcomp)
}
Perform the Gram-Schmidt orthogonalization process on the matrix A using our function.
gramschmidt(A)
## $Q
##           [,1]       [,2]       [,3]
## [1,] 0.6666667 -0.6666667  0.3333333
## [2,] 0.6666667  0.3333333 -0.6666667
## [3,] 0.3333333  0.6666667  0.6666667
##
## $R
##      [,1] [,2] [,3]
## [1,]    3    0   12
## [2,]    0    3  -12
## [3,]    0    0    6
The results of our function match those of our manual calculations!
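As a further sanity check, we can verify that the computed Q is orthogonal and that the product QR recovers A, using the gramschmidt() function defined above:
A.gs <- gramschmidt(A)
round(t(A.gs$Q) %*% A.gs$Q, 10)  # Q^T Q = I: the columns are orthonormal
A.gs$Q %*% A.gs$R                # reproduces the original matrix A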
The qr() function in base R also computes the QR decomposition (it uses Householder reflections rather than the Gram-Schmidt process). The qr() function does not output the Q and R matrices directly; these must be extracted by calling qr.Q() and qr.R(), respectively, on the qr object.
A.qr <- qr(A)
A.qr.out <- list('Q'=qr.Q(A.qr), 'R'=qr.R(A.qr))
A.qr.out
## $Q
##            [,1]       [,2]       [,3]
## [1,] -0.6666667  0.6666667  0.3333333
## [2,] -0.6666667 -0.3333333 -0.6666667
## [3,] -0.3333333 -0.6666667  0.6666667
##
## $R
##      [,1] [,2] [,3]
## [1,]   -3    0  -12
## [2,]    0   -3   12
## [3,]    0    0    6
Thus the qr() function in R matches our function and manual calculations as well, up to sign: a QR decomposition is unique only up to the signs of the columns of Q and the corresponding rows of R, since negating a column of Q together with the matching row of R leaves the product QR unchanged.
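If an exact match is desired, the signs can be normalized so that R has a positive diagonal. Since flipping a column of Q together with the corresponding row of R leaves the product unchanged, this is a harmless post-processing step (shown here purely for illustration):
S <- diag(sign(diag(qr.R(A.qr))))                     # -1 where the diagonal of R is negative
list('Q' = qr.Q(A.qr) %*% S, 'R' = S %*% qr.R(A.qr))  # now matches our Q and R exactly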
http://www.calpoly.edu/~jborzell/Courses/Year%2005-06/Spring%202006/304Gram_Schmidt_Exercises.pdf
http://cavern.uark.edu/~arnold/4353/CGSMGS.pdf
https://www.math.ucdavis.edu/~linear/old/notes21.pdf
http://www.math.ucla.edu/~yanovsky/Teaching/Math151B/handouts/GramSchmidt.pdf
The post QR Decomposition with the Gram-Schmidt Algorithm appeared first on Aaron Schlegel.
by Shahrokh Mortazavi, Partner PM, Visual Studio Cloud Platform Tools at Microsoft
I’m delighted to announce the general availability of R Tools 1.0 for Visual Studio 2015 (RTVS). This release will shortly be followed by R Tools 1.0 for Visual Studio 2017 in early May.
RTVS is a free and open source plug-in that turns Visual Studio into a powerful and productive R development environment. Check out this video for a quick tour of its core features:
RTVS builds on Visual Studio, which means you get numerous features for free: from support for multiple languages, to world-class editing and debugging, to over 7,000 extensions for every need:
RTVS includes various features that address the needs of individual data scientists as well as data science teams, for example:
RTVS integrates with SQL Server 2016 R Services and SQL Server Tools for Visual Studio 2015. These separate downloads enhance RTVS with support for syntax coloring and Intellisense, interactive queries, and deployment of stored procedures directly from Visual Studio.
Use the stock CRAN R interpreter, or the enhanced Microsoft R Client and its ScaleR functions that support multi-core and cluster computing for practicing data science at scale.
Integrated support for git, continuous integration, agile tools, release management, testing, reporting, bug and work-item tracking through Visual Studio Team Services. Use our hosted service or host it yourself privately.
Whether it’s data governance, security, or running large jobs on a powerful server, RTVS workspaces enable setting up your own R server or connecting to one in the cloud.
We’re very excited to officially bring another language to the Visual Studio family! Along with Python Tools for Visual Studio, you have the two main languages for tackling almost any analytics and ML related challenge. In the near future (~May), we’ll release RTVS for Visual Studio 2017 as well. We’ll also resurrect the “Data Science workload” in VS2017, which gives you R, Python, F# and all their respective package distros in one convenient install. Beyond that, we’re looking forward to hearing from you on what features we should focus on next! R package development? Mixed R+C debugging? Model deployment? VS Code/R for cross-platform development? Let us know at the RTVS Github repository!
Thank you!
Bits: http://microsoft.github.io/RTVS-docs/installation
Code: https://github.com/Microsoft/RTVS
Docs: http://microsoft.github.io/RTVS-docs
We recently released survminer version 0.3, which includes many new features to help in visualizing and summarizing survival analysis results.
In this article, we present a cheatsheet for survminer, created by Przemysław Biecek, and provide an overview of the main functions.
The cheatsheet can be downloaded from STHDA and from RStudio. It contains a selection of the most important functions.
Additional functions that you might find helpful are briefly described in the next section.
The main functions in the package are organized into different categories as follows.
Survival Curves
ggsurvplot(): Draws survival curves with the ‘number at risk’ table, the cumulative number of events table and the cumulative number of censored subjects table.
arrange_ggsurvplots(): Arranges multiple ggsurvplots on the same page.
ggsurvevents(): Plots the distribution of event times.
surv_summary(): Summary of a survival curve. Compared to the default summary() function, surv_summary() creates a data frame containing a nice summary from survfit results.
surv_cutpoint(): Determines the optimal cutpoint for one or multiple continuous variables at once. Provides a value of a cutpoint that corresponds to the most significant relation with survival.
pairwise_survdiff(): Multiple comparisons of survival curves. Calculates pairwise comparisons between group levels with corrections for multiple testing.
Diagnostics of Cox Model
ggcoxzph(): Graphical test of proportional hazards. Displays a graph of the scaled Schoenfeld residuals, along with a smooth curve using ggplot2. Wrapper around plot.cox.zph().
ggcoxdiagnostics(): Displays diagnostic graphs for assessing the goodness of fit of a Cox proportional hazards model.
ggcoxfunctional(): Displays graphs of continuous explanatory variables against the martingale residuals of a null Cox proportional hazards model. This helps to properly choose the functional form of continuous variables in a Cox model.
Summary of Cox Model
ggforest(): Draws a forest plot for a Cox PH model.
ggcoxadjustedcurves(): Plots adjusted survival curves for a coxph model.
Competing Risks
ggcompetingrisks(): Plots the cumulative incidence curves for competing-risks data.
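To give a flavor of the package in action, here is a minimal illustrative example drawing survival curves for the lung data set from the survival package:
library(survival)
library(survminer)
fit <- survfit(Surv(time, status) ~ sex, data = lung)
# Survival curves with the number-at-risk table and the log-rank p-value
ggsurvplot(fit, risk.table = TRUE, pval = TRUE)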
Find out more at http://www.sthda.com/english/rpkgs/survminer/, and check out the documentation and usage examples of each of the functions in survminer package.
This analysis has been performed using R software (ver. 3.3.2).
At the [R] Kenntnis-Tage 2017 on November 8 and 9, 2017, you will have the chance to benefit not only from the exchange about the programming language R in a business context and from practical tutorials, but also from the audience: use the [R] Kenntnis-Tage 2017 as your platform and hand in your topic for the call for papers.
The topics can be as diverse as the event itself: share your data science use case with the participants, tell them your personal lessons learned with R in the business environment, or show how your company uses data science and R. As a speaker at the [R] Kenntnis-Tage 2017, you will get free admission to the event.
Companies present themselves as big data pioneers
At last year’s event, many speakers already took this chance: Julia Flad from Trumpf Laser talked about the application possibilities of data science in industry and shared her experiences with the participants. In his talk “Working efficiently with R – faster to the data product”, Julian Gimbel from Lufthansa Industry Solutions gave a vivid example of working with the open source programming language.
Take your chance to become part of the [R] Kenntnis-Tage 2017 and hand in your topic related to data science or R. For more information, e.g. on the length of the presentation or the choice of topic, visit our website.
You don’t want to give a talk but still want to participate? If you register by May 5, you can benefit from our early bird tickets. Data science goes professional – join in.
I’ve just finished a major overhaul to my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants to my workshops, and how those complaints can often be mitigated. Here’s the only new section:
The Tidyverse Curse
There’s a common theme in many of the sections above: a task that is hard to perform using a base R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that Rdocumentation.org shows that it is the single most popular R package (as of 3/23/2017). As much of a blessing as these commands are, they’re also a curse to beginners because they add more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (i.e. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.
A demonstration of the mental overhead required to use tidyverse functions involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let’s look at an example using the built-in mtcars data set and R’s built-in print function:
> print(mtcars)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...
We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I’m taking mtcars and sending it using the pipe operator “%>%” into dplyr’s as_data_frame function to convert it to a special type of tidyverse data frame called a “tibble” which prints better. From there I send it to the print function (that’s R’s default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R’s default print method. It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision (<dbl>).
> library("dplyr") > mtcars %>% + as_data_frame() %>% + print() # A tibble: 32 × 11 mpg cyl disp hp drat wt qsec vs am gear carb * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 # ... with 22 more rows
The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_column() function:
> library("dplyr") > mtcars %>% + as_data_frame() %>% + rownames_to_column() %>% + print() Error in function_list[[i]](value) : could not find function "rownames_to_column"
Oops, I got an error message: the function wasn’t found. I had remembered the right command, and using the dplyr package did cause the car names to vanish, but the solution is in the tibble package, which I “forgot” to load. So let’s load that too (dplyr is already loaded, but I’m listing it again here just to make each example stand alone).
> library("dplyr") > library("tibble") > mtcars %>% + as_data_frame() %>% + rownames_to_column() %>% + print() # A tibble: 32 × 12 rowname mpg cyl disp hp drat wt qsec vs am gear carb <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 # ... with 22 more rows
Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.
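For completeness, that alternative is a single call; the tidyverse package attaches dplyr, tibble, and several related packages in one step:
> library("tidyverse")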
In the above output, the row names are back! What if we now decided to save the data for use with a function that would automatically display row names? It would not find them, because they’re now stored in a variable called rowname, not in the row-names position! Therefore, we would need to use either the built-in row.names function or the tibble package’s column_to_rownames function to restore the names to their previous position.
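For instance, a round trip that moves the names into a variable for wrangling and then back into the row-name position might look like this (a sketch; any wrangling steps would go in the middle of the pipeline, and output is omitted):
> library("dplyr")
> library("tibble")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%          # names move into the rowname variable
+   as.data.frame() %>%
+   column_to_rownames("rowname") %>% # and back into the row-name position
+   head()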
Most other data science software requires row names to be stored in a standard variable, e.g. rowname. You then supply its name to procedures with something like SAS’ “ID rowname;” statement. That’s less to learn.
This isn’t a defect of the tidyverse, it’s the result of an architectural decision on the part of the original language designers; it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.
Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:
> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   print()
# A tibble: 1 × 1
                                                          lyrics
                                                           <chr>
1 a long long time ago i can still remember how that music used
The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the as.data.frame function:
> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   as.data.frame() %>%
+   print()
... <truncated>
1 a long long time ago i can still remember how that music used to make me smile and i knew if i had my chance that i could make those people dance and maybe theyd be happy for a while but february made me shiver with every paper id deliver bad news on the doorstep i couldnt take one more step i cant remember if i cried ...
These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort.
Dot density maps are an increasingly popular way of showing population datasets. The technique has its limitations (see here for more info), but it can create some really nice visual outputs, particularly when trying to show density. I have tried to push this to the limit by using a high resolution population grid for Europe (GEOSTAT 2011) to plot a dot for every 50 people in Europe (interactive). The detailed data aren’t published for all European countries – there are modelled values for those missing but I have decided to stick to only those countries with actual counts here.
I produced this map as part of an experiment to see how R could handle generating millions of dots from spatial data and then plotting them. Even though the final image is 3.8 metres by 3.8 metres – this is as big as I could make it without causing R to crash – many of the grid cells in urban areas are saturated with dots so a lot of detail is lost in those areas. The technique is effective in areas that are more sparsely populated. It would probably work better with larger spatial units that aren’t necessarily grid cells.
Generating the dots (using the spsample function) and then plotting them in one go would require way more RAM than I have access to. Instead I wrote a loop to select each grid cell of population data, generate the requisite number of points and then plot them. This created a much lighter load as my computer happily chugged away for a few hours to produce the plot. I could have plotted a dot for each person (500 million+) but this would have completely saturated the map, so instead I opted for 1 dot for every 50 people.
I have provided full code below. T&Cs on the data mean I can’t share it directly, but you can download it here.
Load rgdal and the spatial object required. In this case it’s the GEOSTAT 2011 population grid. In addition, I am using the rainbow function to generate a palette with one colour for each of the countries.
library(rgdal)
Input <- readOGR(dsn = ".", layer = "SHAPEFILE")
Cols <- data.frame(Country = unique(Input$Country),
                   Cols = rainbow(length(unique(Input$Country))))
Create the initial plot. This is empty, but it enables the dots to be added iteratively to the plot. I have specified a 380cm by 380cm PNG at 150 dpi – this is as big as I could get it without crashing R.
png("europe_density.png",width=380, height=380, units="cm", res=150)
par(bg='black')
plot(t(bbox(Input)), asp=T, axes=F, ann=F)
Loop through each of the individual spatial units – grid cells in this case – and plot the dots for each. The number of dots is specified in spsample as the grid cell’s population value divided by 50.
for (j in 1:nrow(Cols)) {
  # Select the grid cells belonging to country j
  Subset <- Input[Input$Country == Cols[j, 1], ]
  for (i in 1:nrow(Subset@data)) {
    Poly <- Subset[i, ]
    # One random dot per 50 people in the cell
    pts1 <- spsample(Poly, n = 1 + round(Poly@data$Pop / 50), "random", iter = 15)
    # The colours are taken from the Cols object.
    plot(pts1, pch = 16, cex = 0.03, add = T, col = as.vector(Cols[j, 2]))
  }
}
dev.off()
We at mlr are currently deciding on a new logo, and in the spirit of open-source, we would like to involve the community in the voting process!
You can vote for your favorite logo on GitHub by reacting to the logo with a +1.
Thanks to Hannah Atkin for designing the logos!