Importing Data into R, part II

January 31, 2017

(This article was first published on The Practical R, and kindly contributed to R-bloggers)

I recently downloaded the latest version of RStudio and noticed that its import dataset functionality had changed significantly. I had previously written about this in an earlier post and wanted to provide an update for the current version of RStudio.

When you go to import data using RStudio, you get a menu like this.

[Screenshot: rstudio-old-import]

If you’re using the latest version of RStudio, when you click “From CSV” you’ll get a popup asking to download and install a new package, ‘readr’.

[Screenshot: readr]

Once that has completed, you’ll see the new import data window (shown below).

[Screenshot: new-import-screen]

Okay, so first let’s make a simple comma delimited data file so we can test out the new import dataset process. I have made a simple file called “x-y-data.txt” as shown below. If you make this same file (no spaces, just a comma to separate the x column from the y column) then we can do this exercise together.

[Screenshot: x-y-data]
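The file itself only appears as a screenshot, but its contents can be inferred from the matrix printed later in this post (x runs from 1 to 5 and y is twice x). As a rough sketch under that assumption, the file can be created from R; writing it to a temporary directory is my choice here, not part of the original exercise.

```r
# Contents inferred from the matrix output later in the post (an assumption).
lines <- c("x,y", "1,2", "2,4", "3,6", "4,8", "5,10")

# Write to a temporary directory so the sketch does not touch the working directory.
path <- file.path(tempdir(), "x-y-data.txt")
writeLines(lines, path)

# Sanity check with base R: read the file back as a plain data frame.
check <- read.csv(path)
str(check)
```

To follow along with the import dialog, write the file somewhere easy to browse to instead of `tempdir()`.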

Now, let’s use the RStudio import to bring in the file “x-y-data.txt”. Here’s a screen grab of the import screen with my x-y dataset.

[Screenshot: import-data]

We can see that RStudio has used the first row as names, has recognized that it is a comma delimited file, and has read both x and y values as integers. Everything looks good, so I click “import”.
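The import dialog is a front end for the ‘readr’ package, so the same import can be scripted. The following is only a sketch: the temporary file path and the recreated file contents are assumptions, and the column types are stated explicitly because type guessing can differ between readr versions.

```r
# Hypothetical location; contents inferred from the output shown later in the post.
path <- file.path(tempdir(), "x-y-data.txt")
writeLines(c("x,y", "1,2", "2,4", "3,6", "4,8", "5,10"), path)

# readr is an add-on package, so only run the import if it is installed.
if (requireNamespace("readr", quietly = TRUE)) {
  x_y_data <- readr::read_csv(
    path,
    # State the column types explicitly instead of relying on guessing.
    col_types = readr::cols(
      x = readr::col_integer(),
      y = readr::col_integer()
    )
  )
  print(class(x_y_data))
}
```

Scripting the import this way keeps the analysis reproducible, since the dialog settings are otherwise easy to forget.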

After this import, I tried running some of my standard functions, such as making an empirical CDF (cumulative distribution function), and ran into problems. So let’s check the type of data we have imported.

# get the data structure
typeof(x_y_data)
#[1] "list"
class(x_y_data)
#[1] "tbl_df"     "tbl"        "data.frame"
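The "list" result can look surprising, so here is a small base R sketch of why it appears: any data frame, tibble or not, is stored as a list of equal-length columns, and `class()` reports the labels layered on top of that storage. The data frame `df` below is a stand-in built by hand, not the imported object.

```r
# A plain data frame with the same values as the example file.
df <- data.frame(x = 1:5, y = c(2L, 4L, 6L, 8L, 10L))

typeof(df)                  # the underlying storage type
class(df)                   # the class label(s) used for method dispatch
is.list(df)                 # a data frame is a list of columns
inherits(df, "data.frame")  # also TRUE for an object whose class vector includes "data.frame"
sapply(df, typeof)          # storage type of each column
```

Because the class vector of the imported object ends in "data.frame", code that only checks `inherits(obj, "data.frame")` will still accept it.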

The old RStudio import dialog went through base R’s read.csv, which returns a plain data frame; this latest version of RStudio uses the ‘readr’ package instead. RStudio has created their own version of a data frame called a “tbl_df”, or tibble. When you use the ‘readr’ package, your data is imported automatically as a “tbl_df”: the class output above still includes “data.frame”, but “tbl_df” comes first.

Now this isn’t necessarily a bad thing; in fact, it seems like there is some nice functionality gained by using the “tbl_df” format. This change just broke some of my previously written code, and it’s good to know what RStudio is doing by default.
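The post does not say which call failed, so the following is only one plausible illustration. A well-known difference is that selecting a single column with `[` returns a vector from a plain data frame but another tibble from a “tbl_df”, which can trip up functions that expect a numeric vector. Extracting the column with `[[` (or `$`) behaves the same for both. The sketch uses a hand-built data frame `df` as a stand-in for the imported data.

```r
# Stand-in for the imported data (same values as the example file).
df <- data.frame(x = 1:5, y = c(2L, 4L, 6L, 8L, 10L))

# `[[` returns the column as a vector for plain data frames and tibbles alike.
x <- df[["x"]]

# Build the empirical cumulative distribution function from the vector.
Fn <- ecdf(x)

# Proportion of observations less than or equal to 3.
Fn(3)
```

If older code relies heavily on plain data frame behavior, wrapping the imported object in `as.data.frame()` is another option.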

If we want the data in matrix format, we can do this with a simple call to the as.matrix function. From there we can verify that it was converted using the typeof and class functions.

# convert to a matrix
data <- as.matrix(x_y_data)
# print the result
data
#     x  y
#[1,] 1  2
#[2,] 2  4
#[3,] 3  6
#[4,] 4  8
#[5,] 5 10

typeof(data)
#[1] "integer"
class(data)
#[1] "matrix"
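Two caveats about this conversion are worth a small sketch. First, a matrix holds a single type, so `as.matrix` only gives numbers back when every column is numeric. Second, in R 4.0 and later `class()` on a matrix reports "matrix" "array" rather than just "matrix", so `is.matrix()` is the more robust check. The data frames `df` and `mixed` below are made up for illustration.

```r
# All-integer columns convert to an integer matrix.
df <- data.frame(x = 1:5, y = c(2L, 4L, 6L, 8L, 10L))
m <- as.matrix(df)
typeof(m)
is.matrix(m)   # robust across R versions, unlike comparing class(m) to "matrix"
dim(m)

# A text column forces the whole matrix to character.
mixed <- data.frame(id = 1:2, label = c("a", "b"), stringsAsFactors = FALSE)
typeof(as.matrix(mixed))
```

When a dataset mixes numbers and text, converting only the numeric columns keeps the matrix numeric.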

You can read more about the new tibble structure at these websites:

https://blog.rstudio.org/2016/03/24/tibble-1-0-0/

http://www.sthda.com/english/wiki/tibble-data-format-in-r-best-and-modern-way-to-work-with-your-data

Enjoy!
