Inspired by the Community
One of the themes at useR 2017 in Brussels was “Get involved”. People were encouraged to contribute to the community, even when they did not consider themselves R specialists (yet). This could be by writing a package or a blog post, but also by simply correcting typos through pull requests, or sending a tweet about a successful analysis. Bottom line: get your stuff out in the open. Share your work!
I already felt this urge to get involved a year ago, at the useR conference in 2016. Hearing all these people speak about the great work they had done was really inspiring, and I wanted to be a part of it. I wanted to find out what it was like to develop a package. I knew that what I could do was only minor; I have a full-time job as a data scientist, and contributing to the community is not part of the job description. However, I could free up one day a week by working a few more hours on the other days, and I could do a little on the train to work every day. I was not sure if I was up to developing a full R package, but I could at least try.
The idea for the padr package came at work, where I was dealing with time-stamped data a lot. Aggregating data to a higher level and filling missing values took me quite some time each time I did it. I thought there must be a better way to do this. I did not have a clear end product in mind. I just started trying stuff and stitched small ideas together to do more complex operations. In the meantime, I worked my way through Hadley’s R Packages and learned about the fundamentals of software design in R, picking up skills as I went along.
This past January, I published the first version of padr on CRAN, and I must say that it was quite scary. Putting your product out in the open comes with a lot of impostor feelings. I had thoroughly tested the functions, but I was still afraid all kinds of bugs would pop up. Well, bugs did pop up, and this turned out to be just fine. I received very nice comments from people who suggested improvements or even fixed bugs themselves. If there is one lesson I learned from writing a package, it is that software does not have to be perfect to be useful. Of course, you don’t bring junk that is full of errors to CRAN, but once you have tried your best, you should just get it out in the open and wait for feedback. This will improve the software much faster than spending endless evenings trying to find bugs.
The central concept of the padr package is that time-stamped data are associated with an interval. This is the heartbeat of the data. Observed data points are separated by multiples of the interval, although some observations can be missing. The interval is the largest step for which an evenly spaced grid still fits all our observations. See the following example.
library(padr)
library(dplyr)
as.Date(c("2017-01-01", "2017-01-03", "2017-01-07")) %>%
  get_interval()
## [1] "2 day"
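To make the idea of the "largest evenly spaced grid" concrete, here is a minimal base R sketch for day-level data. It is an illustration only, not how padr implements get_interval(): it assumes the interval can be found as the greatest common divisor of the gaps between observations, and the helper name gcd is made up for this example.

```r
# Base R sketch (an assumption for illustration, not padr's implementation):
# treat the interval as the greatest common divisor of the gaps, in days.
dates <- as.Date(c("2017-01-01", "2017-01-03", "2017-01-07"))

# Gaps between consecutive observations, in whole days
gaps <- as.integer(diff(sort(dates)))

# Euclid's algorithm for two non-negative integers (hypothetical helper)
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

# Largest step that divides every gap
step <- Reduce(gcd, gaps)

# The evenly spaced grid implied by that step
grid <- seq(min(dates), max(dates), by = step)

step
all(dates %in% grid)
```

The gaps here are 2 and 4 days, so the step is 2 days, and every observation falls on the resulting grid even though 2017-01-05 was never observed.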
Now, the first main thing that padr does is thicken() the datetime variable in a data frame. It adds a variable of a higher interval to the data frame. We can subsequently aggregate to this higher interval with dplyr. We use the coffee data set, which comes with padr, as an example. It contains four hypothetical purchases in a coffee shop.
##            time_stamp amount
## 1 2016-07-07 00:11:21   3.14
## 2 2016-07-07 00:46:48   2.98
## 3 2016-07-09 04:25:17   4.11
## 4 2016-07-10 01:45:11   3.14
Now, to get the total amount spent per day, we run the following code.
(coffee_day <- coffee %>%
  thicken("day") %>%
  group_by(time_stamp_day) %>%
  summarise(day_amnt = sum(amount)))
## # A tibble: 3 x 2
##   time_stamp_day day_amnt
## 1     2016-07-07     6.12
## 2     2016-07-09     4.11
## 3     2016-07-10     3.14
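For comparison, the same thicken-then-aggregate step can be sketched in base R. This is only a rough equivalent under stated assumptions: the data frame coffee_sketch is a hand-typed stand-in for the coffee data printed above, the timestamps are assumed to be in UTC, and as.Date() stands in for what thicken("day") does.

```r
# Hypothetical stand-in for the coffee data, typed in from the printout above
coffee_sketch <- data.frame(
  time_stamp = as.POSIXct(
    c("2016-07-07 00:11:21", "2016-07-07 00:46:48",
      "2016-07-09 04:25:17", "2016-07-10 01:45:11"),
    tz = "UTC"  # assumption: the time zone is not stated in the post
  ),
  amount = c(3.14, 2.98, 4.11, 3.14)
)

# "Thicken": add a coarser, day-level version of the timestamp
coffee_sketch$time_stamp_day <- as.Date(coffee_sketch$time_stamp, tz = "UTC")

# Aggregate the amounts to the new day-level variable
day_totals <- aggregate(amount ~ time_stamp_day, data = coffee_sketch, FUN = sum)
names(day_totals)[2] <- "day_amnt"
day_totals
```

The sketch yields the same three days as the padr pipeline; what thicken() adds is that it handles any interval, not only days, with a single argument.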
You will notice that on 2016-07-08 there was no visit to the store. However, this is only implicit in our data. For visualisation or time-series analysis, it is useful to have a row that explicitly states that no money was spent on this day. This is where pad() comes into play. It detects the interval of the datetime variable and inserts a record for each missing time point.
(coffee_day_padded <- coffee_day %>% pad())
## # A tibble: 4 x 2
##   time_stamp_day day_amnt
## 1     2016-07-07     6.12
## 2     2016-07-08       NA
## 3     2016-07-09     4.11
## 4     2016-07-10     3.14
Finally, we need to fill the missing value in day_amnt with 0.
coffee_day_padded %>% fill_by_value()
## # A tibble: 4 x 2
##   time_stamp_day day_amnt
## 1     2016-07-07     6.12
## 2     2016-07-08     0.00
## 3     2016-07-09     4.11
## 4     2016-07-10     3.14
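To show what padding and filling amount to, here is a minimal base R sketch of the same two steps. It is not padr's implementation: the data frame coffee_day_sketch is a hand-typed copy of the daily totals above, and the interval is fixed to one day instead of being detected.

```r
# Hypothetical copy of the daily totals shown above
coffee_day_sketch <- data.frame(
  time_stamp_day = as.Date(c("2016-07-07", "2016-07-09", "2016-07-10")),
  day_amnt = c(6.12, 4.11, 3.14)
)

# "Pad": build the complete daily grid between the first and last observation
full_days <- data.frame(
  time_stamp_day = seq(min(coffee_day_sketch$time_stamp_day),
                       max(coffee_day_sketch$time_stamp_day),
                       by = "day")
)

# A left join onto the grid makes the missing day explicit as NA
padded <- merge(full_days, coffee_day_sketch,
                by = "time_stamp_day", all.x = TRUE)

# "Fill": replace the NA introduced by padding with 0
padded$day_amnt[is.na(padded$day_amnt)] <- 0
padded
```

The join against a complete grid is the essence of padding; padr saves you from constructing that grid by hand and from having to know the interval up front.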
In my day job, I benefit greatly from the skills I picked up in the process. My code in data science projects is cleaner now. I try to write as many functions as I can. Data scientists who have no background in computer science, like myself, can become better analysts by acquainting themselves with the principles of software design. R’s package structure is not only great for building packages. You can also benefit from it when building software tools for specific data analyses, making your work more reproducible and shareable across projects.
There is still a lot for me to learn about writing better software and gaining a deeper understanding of the language. However, I no longer find this intimidating. Rather, it is a great opportunity to improve myself and the software I am writing. You don’t need complete mastery to start developing. On the contrary, mastery seems to come through developing. You only truly learn R by trying things you have not tried before.