Spring Cleaning Data: 4 of 6- Combining the files & Changing the Dates/Credit Type

[This article was first published on OutLie..R, and kindly contributed to R-bloggers.]

So far the individual files have been left on their own. It is now time to combine them using the rbind() function, which is simple enough after all we have done so far, followed by a quick check with summary().
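
As a rough illustration of the combining step, here is a minimal sketch. The data frames dw.a and dw.b and their contents are made up for the example (the real per-file data frames come from the earlier posts); the only requirement shown is that the column names match.

```r
# Hypothetical per-file data frames; names and values are assumptions for illustration.
dw.a <- data.frame(loan.date = c("Feb 01 2011", "Feb 02 2011"),
                   type.credit = c("Primary Credit", "Seasonal Credit"),
                   stringsAsFactors = FALSE)
dw.b <- data.frame(loan.date = "Mar 15 2011",
                   type.credit = "Secondary Credit",
                   stringsAsFactors = FALSE)

# rbind() stacks the rows; the column names must match across the data frames.
dw.example <- rbind(dw.a, dw.b)

# Quick check of the combined result.
summary(dw.example)
nrow(dw.example)
```

If the column names differ between files, rbind() stops with an error, which is a useful early warning that one of the files was cleaned differently.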

Now that we have one data frame, it is time to make larger changes to the data. The first is to get the dates into a format that R can understand. The as.Date() function does this by taking the variable first, then a format string describing the pattern of the date. At this point, I had a hard time figuring out what each piece of the pattern meant; basically, the format string describes what the date looks like right now in the data frame, not what you want it to look like afterward.

For this data set the format is '%b %d %Y', or in other words Feb 01 2011. If the date looked like Feb-01-2011, then the format would be '%b-%d-%Y', and if the date was 02-01-2011, then it would be '%m-%d-%Y'. For a more comprehensive tutorial, see the post on Quick-R.
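
The three patterns above can be tried on single strings, as in this small sketch. One assumption to note: %b matches abbreviated month names in the current locale, so the first two lines expect an English locale, while the numeric %m version works anywhere.

```r
# Each format string describes how the text looks *before* conversion.
d1 <- as.Date("Feb 01 2011", format = "%b %d %Y")  # month abbreviation, spaces (locale-dependent)
d2 <- as.Date("Feb-01-2011", format = "%b-%d-%Y")  # month abbreviation, dashes (locale-dependent)
d3 <- as.Date("02-01-2011", format = "%m-%d-%Y")   # numeric month, dashes

# Whatever the input pattern, the result prints in R's default YYYY-MM-DD form.
format(d3)  # "2011-02-01"
```

A string that does not match the pattern comes back as NA rather than raising an error, so it is worth checking for NA values after converting.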

#Changing the date variables, then
#isolating the year variable for later use
dw$loan.date <- as.Date(dw$loan.date, format = '%b %d %Y')
dw$mat.date <- as.Date(dw$mat.date, format = '%b %d %Y')
dw$repay.date <- as.Date(dw$repay.date, format = '%b %d %Y')

At this point, I like to add two extra variables, the year and the month, so I can aggregate the data later for some nice results. The year lets me check whether there is a difference between years. I know there are only 2 years so far, but new data will be released every quarter, so I am setting up the code for it now. The month lets me check for seasonality. I could also isolate the day, but this gets messy because February has 28 or 29 days and the other months have 30 or 31. The data is scattered and blotchy as is, making the day too small a unit to be useful.

The code assumes the dates have already been converted to the R default of YYYY-MM-DD. For the year, I selected the first 4 characters using the str_sub() function (from the stringr package) and converted them to a numeric value with as.numeric(). The year-and-month variable follows a similar process, except that I keep the first 7 characters (year and month) and make it a factor for easier sorting and categorizing.

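library(stringr)
#Create a year variable (first 4 characters of YYYY-MM-DD)
dw$year <- as.numeric(str_sub(dw$loan.date, start=1, end=4))
#Create a year and month variable (first 7 characters, stored as a factor)
#The name year.month is assumed; the left-hand side is missing from this listing
dw$year.month <- as.factor(str_sub(dw$loan.date,
   start=1, end=7))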
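
For comparison, the same two variables can be built with base R's format(), which extracts pieces of a Date directly instead of slicing its text form. This is only a sketch of an alternative, not the approach used in this post, and the dates below are made up.

```r
# Example dates already converted to Date class (values are made up).
loan.date <- as.Date(c("2010-11-30", "2011-02-01", "2011-02-15"))

# format() pulls the year, or year and month, out of each Date.
year <- as.numeric(format(loan.date, "%Y"))
year.month <- factor(format(loan.date, "%Y-%m"))

# Count loans per year-month, the kind of aggregate these variables make easy.
table(year.month)
```

Either route gives the same values; format() simply avoids depending on how the date happens to print.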

The next step is to change the credit type to something simpler for tables and graphs. I used gsub(), one of the most interesting and fun functions I never knew existed until I did this. Basically, it finds a pattern in a string and replaces it with another string. For this data I wanted to replace "Primary Credit" with "primary" because it makes things so much easier for graphs and tables. Then I changed the variable to a factor.
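
#Changing the type of credit to one word
dw$type.credit <- gsub("Primary Credit", 'primary', dw$type.credit)
dw$type.credit <- gsub("Seasonal Credit", 'seasonal', dw$type.credit)
dw$type.credit <- gsub("Secondary Credit", 'secondary', dw$type.credit)
#change to factor
dw$type.credit <- as.factor(dw$type.credit)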
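
To see what the replacement does on its own, here is a small sketch on a made-up character vector. It also shows a one-step variant, which assumes every label ends in " Credit"; that shortcut is my own illustration, not part of the original code.

```r
# Made-up labels in the same style as the credit type column.
type.credit <- c("Primary Credit", "Seasonal Credit",
                 "Secondary Credit", "Primary Credit")

# Step-by-step: replace each long label with a single word.
step <- gsub("Primary Credit", "primary", type.credit, fixed = TRUE)
step <- gsub("Seasonal Credit", "seasonal", step, fixed = TRUE)
step <- gsub("Secondary Credit", "secondary", step, fixed = TRUE)

# One-step variant: drop the trailing " Credit" and lowercase what is left.
short <- tolower(sub(" Credit$", "", type.credit))

# Both routes give the same labels; convert to a factor for tables and graphs.
identical(step, short)
credit.factor <- factor(short)
table(credit.factor)
```

The step-by-step version is more explicit about which labels are expected, so an unexpected label passes through unchanged and is easy to spot in the table.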

Links to the previous posts (post 1, post 2, post 3)
