Clean Your Data in Seconds with This R Function

July 17, 2018
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

All data needs to be clean before you can explore it and build models. Common sense, right? Cleaning data can be tedious, so I created a function that will help.

The function does the following:

  • Remove rows that contain NAs (blanks count only if they were read in as NA)
  • Split the clean data by column type: an integer dataframe, a double dataframe, a factor dataframe, a numeric dataframe, and a combined factor and numeric dataframe
  • View the new dataframes
  • Print the summary() and describe() output for the clean data
  • Create histograms of the dataframes
  • Save all the objects to cleanmework.RData

This will happen in seconds.

Package

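First, load the Hmisc package with library(Hmisc); the function relies on its describe() and on its hist() method for dataframes. I always save the original file before cleaning.
The line below is the engine that cleans the data: it keeps only the rows that have no missing values.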

cleandata <- dataname[complete.cases(dataname),] 
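As a minimal sketch of what that line does, the toy dataframe below (invented for this illustration, not taken from the original post) has an NA in two of its rows. complete.cases() returns a logical vector marking the rows with no missing values, and indexing with it keeps only those rows.

```r
# Toy dataframe invented for this illustration
dataname <- data.frame(
  id    = c(1L, 2L, 3L, 4L),
  score = c(9.5, NA, 7.25, 8.0),
  group = c("a", "b", NA, "a")
)

# TRUE for rows with no NA in any column
keep <- complete.cases(dataname)
print(keep)

# Keep only the complete rows
cleandata <- dataname[keep, ]
print(cleandata)
```

Note that an empty string "" is not NA, so a row holding one is still treated as complete unless the blank was converted to NA beforehand.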

The function

The function is below. Copy the code and save it in an R file. Run the file and the function cleanme will appear in your environment.

library(Hmisc) #PROVIDES describe() AND hist() FOR DATAFRAMES

cleanme <- function(dataname){
  
  #SAVE THE ORIGINAL FILE TO DISK (write.csv RETURNS NULL, SO oldfile ITSELF HOLDS NULL)
  oldfile <- write.csv(dataname, file = "oldfile.csv", row.names = FALSE, na = "")
  
  #CLEAN THE FILE. SAVE THE CLEAN FILE. IMPORT THE CLEAN FILE. CHANGE IT TO A DATAFRAME.
  cleandata <- dataname[complete.cases(dataname),]
  cleanfile <- write.csv(cleandata, file = "cleanfile.csv", row.names = FALSE, na = "")
  #stringsAsFactors = TRUE SO TEXT COLUMNS ARE READ AS FACTORS (THE DEFAULT IS FALSE SINCE R 4.0.0)
  cleanfileread <- read.csv(file = "cleanfile.csv", stringsAsFactors = TRUE)
  cleanfiledata <- as.data.frame(cleanfileread)
  
  #SUBSETTING THE DATA BY TYPE (drop = FALSE KEEPS A DATAFRAME EVEN WHEN ONLY ONE COLUMN MATCHES)
  logicmeint <- cleanfiledata[,sapply(cleanfiledata,is.integer), drop = FALSE]
  logicmedouble <- cleanfiledata[,sapply(cleanfiledata,is.double), drop = FALSE]
  logicmefactor <- cleanfiledata[,sapply(cleanfiledata,is.factor), drop = FALSE]
  logicmenum <- cleanfiledata[,sapply(cleanfiledata,is.numeric), drop = FALSE]
  mainlogicmefactors <- cleanfiledata[,sapply(cleanfiledata,is.factor) | sapply(cleanfiledata,is.numeric), drop = FALSE]

  #VIEW ALL FILES
  View(cleanfiledata)
  View(logicmeint)
  View(logicmedouble)
  View(logicmefactor)
  View(logicmenum)
  View(mainlogicmefactors)
  
  #describeFast(mainlogicmefactors)
  
  #ANALYTICS OF THE MAIN DATAFRAME
  cleansum <- summary(cleanfiledata)
  print(cleansum)
  cleandec <- describe(cleanfiledata)
  print(cleandec)
  
  #ANALYTICS OF THE FACTOR DATAFRAME
  factorsum <- summary(logicmefactor)
  print(factorsum)
  factordec <- describe(logicmefactor)
  print(factordec)
  
  #ANALYTICS OF THE NUMBER DATAFRAME
  numbersum <- summary(logicmenum)
  print(numbersum)
  
  numberdec <- describe(logicmenum)
  print(numberdec)
  
  mainlogicmefactorsdec <- describe(mainlogicmefactors)
  print(mainlogicmefactorsdec)
  
  mainlogicmefactorssum <- summary(mainlogicmefactors)
  print(mainlogicmefactorssum)
  
  #savemenow <- saveRDS("cleanmework.rds")
  #readnow <- readRDS(savemenow)
  
  #HISTOGRAM PLOTS OF ALL TYPES
  hist(cleanfiledata)
  hist(logicmeint)
  hist(logicmedouble)
  hist(logicmefactor)
  hist(logicmenum)
  #plot(mainlogicmefactors)

  save(cleanfiledata, logicmeint, mainlogicmefactors, logicmedouble, logicmefactor, logicmenum, numberdec, numbersum, factordec, factorsum, cleandec, oldfile, cleandata, cleanfile, cleanfileread,   file = "cleanmework.RData")
}
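The type-based subsetting inside cleanme can also be tried on its own. The sketch below uses a small made-up dataframe (the names df, ints, doubles, numbers and factors are illustrative, not part of the original function) and passes drop = FALSE, an assumption on my part, so that a single matching column still comes back as a dataframe instead of a bare vector.

```r
# Toy dataframe invented for this illustration
df <- data.frame(
  count  = c(1L, 2L, 3L),
  weight = c(1.5, 2.5, 3.5),
  label  = factor(c("x", "y", "x"))
)

# One logical value per column: does the column have this type?
is_int <- sapply(df, is.integer)
is_dbl <- sapply(df, is.double)
is_num <- sapply(df, is.numeric)
is_fac <- sapply(df, is.factor)

# drop = FALSE keeps a dataframe even when only one column matches
ints    <- df[, is_int, drop = FALSE]
doubles <- df[, is_dbl, drop = FALSE]
numbers <- df[, is_num, drop = FALSE]
factors <- df[, is_fac, drop = FALSE]

print(names(doubles))
print(names(numbers))
print(names(factors))
```

Because is.numeric() is TRUE for both integer and double columns, the numeric dataframe overlaps with the integer and double ones.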

Type in and run:

cleanme(dataname)
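Here dataname stands for your own dataframe. Before running the function it can help to check how many rows will be dropped. The sketch below does that for airquality, a dataset that ships with R and contains missing values; using it here is my choice for illustration, not something from the original post.

```r
# airquality ships with R (datasets package) and contains NAs
data(airquality)

# Missing values per column
na_per_column <- colSums(is.na(airquality))
print(na_per_column)

# Rows that complete.cases() would keep out of the total
n_total <- nrow(airquality)
n_kept  <- sum(complete.cases(airquality))
cat("rows kept:", n_kept, "of", n_total, "\n")

# In an interactive session with Hmisc loaded, the call would then be:
# cleanme(airquality)
```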

Once all the data frames have appeared, run the following to load the saved objects into your workspace.

load("cleanmework.RData")
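load() restores objects under their saved names and overwrites anything in the workspace with the same name. If that is a concern, a minimal sketch of loading into a separate environment looks like this; the temporary file and the objects a and b are stand-ins invented for the example, not the contents of cleanmework.RData.

```r
# Save two toy objects to a temporary file (stand-in for cleanmework.RData)
tmp <- tempfile(fileext = ".RData")
a <- 1:3
b <- data.frame(x = c(2.5, 3.5))
save(a, b, file = tmp)

# Load into a fresh environment so nothing in the workspace is overwritten
loaded_env <- new.env()
loaded_names <- load(tmp, envir = loaded_env)

# load() returns the names of the restored objects
print(loaded_names)
print(loaded_env$b)
```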

Enjoy
