# R Helper Functions to Increase Efficiency

January 4, 2019

At Inspire, we have built an internal R package containing functions that implement various pieces of business logic. Within this package, we created helper functions to simplify frequent tasks. Some of them are based on solutions posted on public websites such as StackOverflow. In this post, I would like to share a few of the helper functions we use in our daily work. I hope you can benefit from them or find some inspiration!

1. Calculate weights from a non-negative vector using `lw()`:

We use this function a lot because much of our work involves redistribution. The weight of each entry is its share of the sum of all entries. If the sum is 0, every entry of a non-negative vector must itself be 0, so the weights are set to be equal across all entries. The source code is

```r
lw <- function(v) {
  total <- sum(v, na.rm = TRUE)
  if (total == 0) {
    # All entries are 0: return equal weights, one per entry
    rep(1 / length(v), length(v))
  } else {
    v / total
  }
}
```

An example is

```r
> lw(c(1,2,3))
[1] 0.1666667 0.3333333 0.5000000
```
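
To make the redistribution use case concrete, here is a minimal sketch. The `sales` and `budget` values are hypothetical, and this version of `lw()` assumes the zero-sum case should return one equal weight per entry.

```r
# Variant of lw(); returning a full vector in the zero-sum case is an assumption
lw <- function(v) {
  total <- sum(v, na.rm = TRUE)
  if (total == 0) {
    rep(1 / length(v), length(v))
  } else {
    v / total
  }
}

# Hypothetical inputs: split a budget in proportion to observed sales
sales  <- c(10, 30, 20)
budget <- 120
by_sales <- budget * lw(sales)       # approximately 20, 60, 40

# With no sales at all, the budget is split evenly
even <- budget * lw(c(0, 0, 0))      # approximately 40, 40, 40
```

Because the weights sum to 1, the redistributed amounts add back up to the original total.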

2. Quickly convert a character vector of dates to Date type using `quick_date()`:

When we read data from a CSV file or from a database through DBI, a date column might still be of character type. People normally use a function such as `as.Date()` in base R or `ymd()` in the lubridate package to convert the column to Date type. Having noticed that this conversion can take quite a long time for a large data set with more than 10 million rows, we built the following function based on a solution found online. It parses each distinct value only once and then maps the results back to the original positions:

```r
quick_date <- function(x, ...) {
  if (anyDuplicated(x)) {
    # Parse each distinct value once, then map results back by position
    ux <- unique(x)
    idx <- match(x, ux)
    y <- as.Date.character(ux, ...)
    return(y[idx])
  }
  # No duplicates: convert directly
  as.Date.character(x, ...)
}
```
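
As a rough way to see the effect, the sketch below compares `quick_date()` with plain `as.Date()` on a made-up column. The vector size and the 365 distinct dates are assumptions chosen for illustration, and actual timings will depend on your machine and data.

```r
quick_date <- function(x, ...) {
  if (anyDuplicated(x)) {
    ux <- unique(x)
    y <- as.Date.character(ux, ...)
    return(y[match(x, ux)])
  }
  as.Date.character(x, ...)
}

# Hypothetical column: 200,000 strings drawn from only 365 distinct dates
days  <- format(as.Date("2018-01-01") + 0:364)
dates <- sample(days, 2e5, replace = TRUE)

system.time(slow <- as.Date(dates))      # parses every element
system.time(fast <- quick_date(dates))   # parses 365 unique values only
identical(slow, fast)                    # both approaches should agree

# Extra arguments such as `format` are passed through to the conversion
dmy <- quick_date(c("04/01/2019", "04/01/2019", "05/01/2019"),
                  format = "%d/%m/%Y")
```

The gain comes entirely from repeated values, so a column in which every date is distinct would see no benefit.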

3. Quickly check missing values in a data frame using `cc()` and `cd()`:

Every R user should be familiar with the `summary()` function, which gives a quick statistical overview of all columns in a data frame. However, in most cases we only care about one specific statistic. Personally, I care a lot about the missing values in each column, so I developed two functions. The first is `cc()`, which stands for “check cleanness”. The source code is as follows:

```r
library(magrittr)  # provides the %>% pipe
cc <- function(df) {
  # Number of missing values per column
  print(df %>% is.na() %>% colSums())
}
```

It simply tells you how many missing values there are in each column. I like this function because I just need to press “c” twice on the keyboard, which is much faster than typing “summary”. The other function is `cd()`, which stands for “cleanness density”. The source code is as follows:

```r
cd <- function(df) {
  # Proportion of missing values per column (needs magrittr for %>%)
  print(df %>% is.na() %>% colMeans())
}
```

It simply tells you the proportion of missing values in each column, as a fraction between 0 and 1.
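
Here is a small sketch of both functions on a made-up data frame. To keep it free of package dependencies, it assumes base-R equivalents of `cc()` and `cd()` written without the pipe; the data frame `df` is hypothetical.

```r
# Base-R equivalents of cc() and cd(); dropping the pipe is an assumption
# made so the sketch runs without extra packages
cc <- function(df) print(colSums(is.na(df)))
cd <- function(df) print(colMeans(is.na(df)))

# Hypothetical data frame with a few missing values
df <- data.frame(
  id    = 1:4,
  score = c(10, NA, 30, NA),
  label = c("a", "b", NA, "d")
)

counts <- cc(df)   # id = 0, score = 2, label = 1
shares <- cd(df)   # id = 0, score = 0.5, label = 0.25
```

Since `print()` returns its argument invisibly, the results can still be assigned and reused, as `counts` and `shares` are here.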

4. Start an H2O instance using `h2o_start()`:

H2O is one of my favorite machine learning frameworks and R packages. I like to start an H2O instance with my own settings rather than the defaults, so I use `h2o_start()` to wrap the original init function with my preferences. The source code is

```r
h2o_start <- function() {
  h2o::h2o.init(nthreads = -1, enable_assertions = FALSE, min_mem_size = '8g',
                strict_version_check = FALSE, port = 54321)
}
```

5. Stop an H2O instance using `h2o_stop()`:

Just like the `h2o_start()` function above, I like to stop the H2O instance with my own preference, which is to skip the confirmation prompt:

```r
h2o_stop <- function() {
  h2o::h2o.shutdown(prompt = FALSE)
}
```
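
One possible way to pair the two helpers is sketched below. The `with_h2o()` wrapper is hypothetical and not part of our package; it assumes you want the instance shut down even if the wrapped code fails. Running the commented usage requires the h2o package and Java.

```r
h2o_start <- function() {
  h2o::h2o.init(nthreads = -1, enable_assertions = FALSE, min_mem_size = "8g",
                strict_version_check = FALSE, port = 54321)
}

h2o_stop <- function() {
  h2o::h2o.shutdown(prompt = FALSE)
}

# Hypothetical wrapper: start H2O, evaluate `expr`, and always shut down after
with_h2o <- function(expr) {
  h2o_start()
  on.exit(h2o_stop(), add = TRUE)  # runs even if `expr` raises an error
  force(expr)                      # `expr` is evaluated only after startup
}

# Usage, left commented out because it needs a working H2O installation:
# with_h2o({
#   hf <- h2o::as.h2o(iris)
#   print(dim(hf))
# })
```

The wrapper relies on R's lazy evaluation: the code passed as `expr` does not run until `force(expr)` is reached, by which point the instance is up.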
