Essential data cleaning for ad-hoc tasks in R

September 18, 2018
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    I must admit that data cleaning sometimes feels like the necessary chore before the fun and far more valuable part: analysis!

    But in nearly every project a data scientist takes on, the first step is to get a clear understanding of the dataset through descriptive statistics, and then to clean it.

    These steps typically take a lot of code, which is a problem for the data scientist because it leaves less time for the more interesting and valuable work: AI and data analysis. The problem is worse for ad-hoc analytical tasks, where deadlines are short. Most articles I have read hand-code these steps instead of using packages that combine descriptive statistics with data cleaning.

    Therefore I present here the most essential data cleaning code for ad-hoc tasks in R, done with R packages. I use two elegant and efficient R packages for descriptive statistics and data cleaning: skimr and Hmisc. This frees up the data scientist's schedule and leaves more time for the valuable part: AI and data analysis.

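    # Data management packages
    library(skimr)
    library(Hmisc)
    # Load dataset ("mydata" is assumed to be a placeholder; use your own dataset name)
    data("mydata")
    # Descriptive statistics before data cleaning
    skim(mydata)
    # Data cleaning: keep only rows with no missing values
    cleandata <- mydata[complete.cases(mydata), ]
    # Data cleaning: remove duplicate rows
    cleandata <- unique(cleandata)
    # Inspect the cleaned data (opens the data viewer in an interactive session)
    View(cleandata)
    # Descriptive statistics after data cleaning
    skim(cleandata)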
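
To make the two cleaning steps concrete, here is a minimal sketch on a small made-up data frame. The data frame `toy` and the names `complete_rows`, `clean_toy` and `removed` are illustrative assumptions, not part of the original post. The cleaning itself uses only base R (`complete.cases()` and `unique()`); the optional `Hmisc::describe()` summary at the end runs only if Hmisc is installed.

```r
# Hypothetical toy data: one duplicated row and two rows with missing values
toy <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  score = c(10.5, 8.0, 8.0, NA, 7.2),
  group = c("a", "b", "b", "a", NA),
  stringsAsFactors = FALSE
)

# Step 1: keep only rows that have no missing value in any column
complete_rows <- toy[complete.cases(toy), ]

# Step 2: drop exact duplicate rows
clean_toy <- unique(complete_rows)

# Count how many rows each step removed
removed <- c(
  incomplete = nrow(toy) - nrow(complete_rows),
  duplicated = nrow(complete_rows) - nrow(clean_toy)
)
print(removed)

# Optional summary of the cleaned data, only if Hmisc is available
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(clean_toy))
}
```

Note that `complete.cases()` drops a row as soon as any single column is missing, so it is worth checking the removed-row counts before relying on the cleaned data.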
    

    And there you have it: elegant, essential data cleaning plus descriptive statistics, including histograms, before and after cleaning the dataset. All done with very little code!
    Happy data cleaning!
