Efficient data management and SQL data selection in R

September 18, 2018

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Data Management

    Tags

    1. Data Manipulation
    2. R Programming
    3. Tips & Tricks

    Before running any data analysis, every data scientist needs to perform data management, data cleaning and data selection.

    This can be done in many different ways. For the data scientist this process is typically quite time consuming, and it does not create value to the same degree as the analysis itself. It is therefore in the interest of both the data scientist and the customer to make data management and data selection efficient, in order to free up more value-creating time for the analysis part of the project. I always look for R packages that bring the most value to the data scientist as well as the customer. The code below has been trimmed down to the most efficient form that still ensures good-quality data management, data cleaning and data selection. For the data selection I use SQL queries with the package sqldf (SQL for data frames), because this code is efficient, thorough and elegant in its structure.

    # Packages for data management, data cleaning and data selection
    library(skimr)
    library(Hmisc)
    library(plyr)
    library(dplyr)
    # SQL package
    library(sqldf)
    

    After loading the packages in R it is time to load the dataset. Because the working directory is different for each data scientist, I have left the path as a placeholder. Remember that you can set your working directory with setwd().

    # Load data
    data <- read.csv(file = "/Mydata", header = TRUE)
    dataname <- data
    
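    As a self-contained illustration of the load step, the sketch below writes a small made-up dataset (the columns id and score are invented for demonstration) to a temporary file and reads it back the same way:

    ```r
    # Write a small example dataset to a temporary file, then load it as above
    tmp <- tempfile(fileext = ".csv")
    write.csv(data.frame(id = 1:3, score = c(10, 20, 30)), tmp, row.names = FALSE)
    data <- read.csv(file = tmp, header = TRUE)
    dataname <- data
    str(dataname)  # a data frame with 3 observations of 2 variables
    ```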

    The next step is efficient data management and cleaning of the dataset. I remove rows with missing values using complete.cases() and remove duplicate observations using unique(). After that I subset the data by column type. The code ends with descriptive statistics and histogram graphs from the elegant and efficient package skimr.

    # Data management & data cleaning
    skim(dataname)
    cleandata <- dataname[complete.cases(dataname), ] # remove rows with missing values
    cleandata <- unique(cleandata) # remove duplicate observations
    write.csv(cleandata, file = "cleanfile.csv", row.names = FALSE, na = "")
    cleanfiledata <- read.csv(file = "cleanfile.csv")
    #Subsetting the data to types
    logicmeint <- cleanfiledata[, sapply(cleanfiledata, is.integer)]
    logicmedouble <- cleanfiledata[, sapply(cleanfiledata, is.double)]
    logicmefactor <- cleanfiledata[, sapply(cleanfiledata, is.factor)]
    logicmenum <- cleanfiledata[, sapply(cleanfiledata, is.numeric)]
    mainlogicmefactors <- cleanfiledata[, sapply(cleanfiledata, is.factor) | sapply(cleanfiledata, is.numeric)]
    #Descriptive Analytics
    skim(cleanfiledata)
    skim(logicmeint)
    skim(logicmedouble)
    skim(logicmefactor)
    skim(logicmenum)
    skim(mainlogicmefactors)
    
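    To make the cleaning step concrete, here is a minimal, self-contained sketch on a made-up toy data frame with one missing value and one duplicated row:

    ```r
    # Toy data: row 3 duplicates row 2, row 4 has a missing value
    df <- data.frame(x = c(1, 2, 2, NA), y = c("a", "b", "b", "d"))
    clean <- df[complete.cases(df), ]  # drops the row with NA -> 3 rows
    clean <- unique(clean)             # drops the duplicated row -> 2 rows
    nrow(clean)  # 2
    ```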

    The last step is data selection with SQL queries. I have written the R code below in general form: variable names are written as variable1, variable2, variable3..., conditions as condition, and values as num if numerical or something if not.

    # Load clean data
    mydata <- cleanfiledata
    #Filtering data
    sqldf("select variable1 from mydata where variable2 = 'something'")
    sqldf("select * from mydata order by variable1") #order by 1 column
    sqldf("select * from mydata order by variable1, variable2") #order by 2 columns
    sqldf("select * from mydata where variable like 'something%' ") #variable starts with something
    sqldf("select * from mydata where variable like '%something' ") #variable ends with something
    sqldf("select * from mydata where variable like '%something%' ") #variable must contain something
    sqldf("select * from mydata where variable in ('something1','something2','something3')") #using IN
    sqldf("select * from mydata where variable between something1 and something2") #using BETWEEN
    #Multiple filters - and,or,not
    sqldf("select * from mydata where variable1 >= 1980 and variable2 not between 1980 and 2000") #filter variable1 and exclude a range of variable2
    sqldf("select * from mydata where variable1 > num and variable2 not between year1 and year2")
    sqldf("select * from mydata where variable1 > num order by variable2, variable3")
    #Basic aggregations
    sqldf("select sum(n) as 'Total_Count' from mydata")
    sqldf("select min(n), max(n) from mydata")
    sqldf("select variable1, avg(n) as 'Average' from mydata group by variable1 order by Average, variable1") #average by variable1
    sqldf("select variable1,count(*) as count from mydata group by variable1 limit 100") #count by variable1
    sqldf("select variable1,n,count(*) as 'my_count' from mydata where n > num group by variable1 order by my_count desc limit num") #multiple filters
    #Use having
    sqldf("select variable1,variable2,sum(n) as 'my_sum' from mydata group by variable1 having my_sum > num order by my_sum desc limit num")
    #Unique count for a variable
    sqldf("select count(distinct variable1) as 'count_variable1' from mydata ")
    sqldf("select variable1, n, case when variable1 = num then 'condition1' else 'condition2' end as 'condition3' from mydata limit num")
    sqldf("select *, case when variable1 not like '%condition1%' then 'statement1' when variable1 = 'condition2' then 'statement2' else 'statement3' end as 'statement4' from mydata")
    
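    To see the query templates in action on real data, here is a runnable sketch using R's built-in mtcars dataset (mtcars and its columns cyl and mpg are stand-ins for the article's dataset, not part of it):

    ```r
    library(sqldf)

    # Count cars and compute average mpg per cylinder group,
    # mirroring the group-by / order-by templates above
    res <- sqldf("select cyl, count(*) as n, avg(mpg) as avg_mpg
                  from mtcars group by cyl order by avg_mpg desc")
    res  # three rows, with the 4-cylinder group having the highest average mpg
    ```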

    Happy coding!

