[This article was first published on Random R Ramblings, and kindly contributed to R-bloggers].

Introduction

Welcome to my series of blog posts about my data manipulation package, `{poorman}`. For those of you that don’t know, `{poorman}` aims to replicate `{dplyr}` using only `{base}` R, and is therefore completely dependency free. What’s nice about this series is that if you would rather just use `{dplyr}`, then that’s absolutely OK! By highlighting `{poorman}` functionality, this series of blog posts simultaneously highlights `{dplyr}` functionality too! However, I sometimes also describe how I developed the internals of `{poorman}`, often highlighting useful `{base}` R tips and tricks.

Today marks the release of v0.2.1 of `{poorman}` and with it a whole host of new functions and features. In today’s blog post we will be taking a look at some of these new features. Given the sheer number of features this release brings, we won’t be focusing on the internals of any of these functions; the internals will be saved for another post. Instead, we will simply be taking a look at what some of them can do.

Selecting Distinct Rows

The first function we will take a look at is `distinct()`. If you want to select only the distinct, or unique, rows from your `data.frame`, `distinct()` will help you do that. Let’s create some fake data in which some rows are duplicated.

```r
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 1, 2, 7, 1, 4, 6),
  age = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)
df
#    id age score
# 1   1  26    85
# 2   2  24    63
# 3   3  26    55
# 4   4  22    74
# 5   5  23    31
# 6   6  24    77
# 7   1  26    85
# 8   2  24    63
# 9   7  22    42
# 10  1  26    85
# 11  4  22    74
# 12  6  25    78
```

Now we wish to see the distinct records from this data.

```r
library(poorman, warn.conflicts = FALSE)
df %>% distinct()
#    id age score
# 1   1  26    85
# 2   2  24    63
# 3   3  26    55
# 4   4  22    74
# 5   5  23    31
# 6   6  24    77
# 9   7  22    42
# 12  6  25    78
```

So we see that we now only have 8 records out of the original 12 because the duplicates have been removed. We can also obtain the distinct values of a particular column, returning just that column.

```r
df %>% distinct(age)
#    age
# 1   26
# 2   24
# 4   22
# 5   23
# 12  25
```

But if you need the other variables still, you can choose to keep those too.

```r
df %>% distinct(age, .keep_all = TRUE)
#    id age score
# 1   1  26    85
# 2   2  24    63
# 4   4  22    74
# 5   5  23    31
# 12  6  25    78
```
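For readers curious how this maps onto `{base}` R, the three calls above can be approximated with `duplicated()`. This is only a sketch and not necessarily how `{poorman}` implements `distinct()` internally; the object names are made up for illustration.

```r
# Same fake data as in the examples above
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 1, 2, 7, 1, 4, 6),
  age = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)

# Roughly distinct(): drop every row that repeats an earlier row
distinct_rows <- df[!duplicated(df), ]

# Roughly distinct(age, .keep_all = TRUE): first row seen for each age
distinct_age_all <- df[!duplicated(df$age), ]

# Roughly distinct(age): as above, but keep only the age column
distinct_age <- df[!duplicated(df$age), "age", drop = FALSE]

rownames(distinct_rows)
# [1] "1"  "2"  "3"  "4"  "5"  "6"  "9"  "12"
```

Note the `drop = FALSE`, which stops R from simplifying the single column to a plain vector.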

Slicing Data

`{dplyr}` provides a couple of ways to select a subset of rows. It has the functions `top_n()` and `top_frac()` as well as the `slice_*()` family of functions. The former functions have now been superseded by the latter and so `{poorman}` skipped the implementation of the former. So what exactly do they do? Let’s take a look at some examples using the `mtcars` dataset.

`slice_head()` returns the first `n` rows (defaults to 1). `slice_tail()` returns the last `n` rows (not shown here).

```r
slice_head(mtcars, n = 3)
#                mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
```

`slice_sample()` randomly selects rows with or without replacement.

```r
slice_sample(mtcars, n = 3, replace = TRUE)
#                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
# Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
# Merc 280C     17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
```

`slice_min()` and `slice_max()` select the rows with the lowest or highest values of a variable, respectively.

```r
mtcars %>% slice_min(mpg, n = 3)
#                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
# Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
# Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
```
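To make the behaviour of the slicing helpers concrete, here is a rough `{base}` R analogue of each one, including `slice_tail()`. It is a sketch under the assumption that simple truncation at `n` is acceptable (ties beyond `n` are cut off here), and it is not taken from the `{poorman}` source; the object names are illustrative.

```r
# First and last n rows, roughly slice_head() and slice_tail()
first_three <- head(mtcars, n = 3)
last_three <- tail(mtcars, n = 3)

# Random rows drawn with replacement, roughly slice_sample(replace = TRUE)
set.seed(42) # only here so the draw is reproducible
random_three <- mtcars[sample(nrow(mtcars), size = 3, replace = TRUE), ]

# Rows with the smallest mpg, roughly slice_min(mpg, n = 3)
lowest_mpg <- head(mtcars[order(mtcars$mpg), ], n = 3)

rownames(last_three)
# [1] "Ferrari Dino"  "Maserati Bora" "Volvo 142E"
rownames(lowest_mpg)
# [1] "Cadillac Fleetwood"  "Lincoln Continental" "Camaro Z28"
```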

Selecting With Predicates

It is now possible to select columns in your `data.frame` which match a predicate such as `is.numeric()`. `where()` takes a function and returns all variables for which the function returns `TRUE`.

```r
df <- data.frame(
  col1 = c(1, 2, 3),
  col2 = c("x", "y", "z"),
  col3 = c(TRUE, FALSE, TRUE)
)
df %>% select(where(is.numeric))
#   col1
# 1    1
# 2    2
# 3    3
```
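The idea behind predicate selection can be sketched in `{base}` R by applying the predicate to every column and keeping the ones that return `TRUE`. This is an illustration of the concept only, not a description of how `where()` is written in `{poorman}`.

```r
df <- data.frame(
  col1 = c(1, 2, 3),
  col2 = c("x", "y", "z"),
  col3 = c(TRUE, FALSE, TRUE)
)

# Apply the predicate to each column; vapply() guarantees one logical per column
is_num <- vapply(df, is.numeric, logical(1))
is_num
#  col1  col2  col3
#  TRUE FALSE FALSE

# Keep only the matching columns, staying a data.frame even for one column
numeric_cols <- df[, is_num, drop = FALSE]
```

Note that logical columns such as `col3` are not numeric as far as `is.numeric()` is concerned, which is why only `col1` survives.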

Working With NA Values

Finding the First Non-Missing Element

Given a set of vectors, the `coalesce()` function finds the first non-missing value at each position.

```r
# Use a single value to replace all missing values
x <- sample(c(1:5, NA, NA, NA))
coalesce(x, 0L)
# [1] 4 0 5 0 1 2 0 3

# Or match together a complete vector from missing pieces
y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
coalesce(y, z)
# [1] 1 2 3 4 5
```
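One way to picture what "first non-missing value at each position" means is to fold over the inputs, filling the remaining gaps from each vector in turn. The helper below, `coalesce_sketch()`, is a hypothetical name used purely for illustration and is not the `{poorman}` implementation.

```r
# Hypothetical helper: fill NAs in the first vector from the later ones, in order
coalesce_sketch <- function(...) {
  Reduce(function(x, y) {
    gaps <- is.na(x)
    # Recycle y so a single replacement value also works
    x[gaps] <- rep_len(y, length(x))[gaps]
    x
  }, list(...))
}

coalesce_sketch(c(NA, 2L, NA), 0L)
# [1] 0 2 0

coalesce_sketch(c(1, 2, NA, NA, 5), c(NA, NA, 3, 4, 5))
# [1] 1 2 3 4 5
```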

Convert Values To NA

We can convert values in a vector `x` to `NA` if they match the values at the same positions in a second vector `y`.

```r
na_if(1:5, 5:1)
# [1]  1  2 NA  4  5
```

This is particularly useful in a `data.frame` if you need to replace a particular value.

```r
df <- data.frame(a = c("a", "b", "c", "BAD_VALUE"))
df %>% mutate(a = na_if(a, "BAD_VALUE"))
#      a
# 1    a
# 2    b
# 3    c
# 4 <NA>
```
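Conceptually, this is an element-wise comparison followed by an assignment of `NA` wherever the comparison is `TRUE`. The function below, `na_if_sketch()`, is a hypothetical stand-in written for illustration; the real `na_if()` may perform additional checks.

```r
# Hypothetical helper: blank out the positions where x equals y
na_if_sketch <- function(x, y) {
  # which() drops any NA comparisons, so existing NAs in x are left alone
  x[which(x == y)] <- NA
  x
}

na_if_sketch(1:5, 5:1)
# [1]  1  2 NA  4  5

na_if_sketch(c("a", "b", "c", "BAD_VALUE"), "BAD_VALUE")
# [1] "a" "b" "c" NA
```

In the first call only position 3 matches (3 equals 3), which is why a single value becomes `NA`.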

Replacing NA Values

Within a `data.frame` we often have missing values in multiple columns. We sometimes wish to replace these values, which is where `replace_na()` comes in. `replace_na()` is actually a function from the `{tidyr}` package but I decided to add it to `{poorman}` as it is extremely useful. Let’s take a look.

```r
df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% replace_na(list(x = 0, y = "unknown"))
#   x       y
# 1 1       a
# 2 2 unknown
# 3 0       b
```
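The per-column replacement can be sketched in `{base}` R as a loop over the named list of replacement values. This is an assumption about the general approach, written for illustration, and not the actual `{poorman}` or `{tidyr}` code; `stringsAsFactors = FALSE` is set explicitly so the sketch behaves the same on older versions of R.

```r
df <- data.frame(
  x = c(1, 2, NA),
  y = c("a", NA, "b"),
  stringsAsFactors = FALSE
)
replacements <- list(x = 0, y = "unknown")

# For each named column, overwrite only the missing entries
for (col in names(replacements)) {
  is_missing <- is.na(df[[col]])
  df[[col]][is_missing] <- replacements[[col]]
}

df
#   x       y
# 1 1       a
# 2 2 unknown
# 3 0       b
```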

Recoding Values

If we wish to replace values within a vector or a column of a `data.frame`, we can use `recode()`. This is a vectorised version of `base::switch()`: you can replace numeric values based on their position or their name, and character or factor values only by their name.

```r
char_vec <- sample(c("a", "b", "c"), 10, replace = TRUE)
recode(char_vec, a = "Apple")
#  [1] "b"     "c"     "b"     "c"     "Apple" "Apple" "Apple" "c"     "Apple" "b"
recode(char_vec, a = "Apple", b = "Banana")
#  [1] "Banana" "c"      "Banana" "c"      "Apple"  "Apple"  "Apple"  "c"      "Apple"
# [10] "Banana"
```
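For character vectors, recoding by name behaves much like a lookup in a named vector, with unmatched values passed through unchanged. The snippet below is a minimal `{base}` R sketch of that idea using a fixed input rather than a random sample; it is not how `recode()` is implemented.

```r
char_vec <- c("b", "c", "a", "a")

# Named vector acting as the old-to-new lookup table
lookup <- c(a = "Apple", b = "Banana")

# Only touch the values that have an entry in the lookup
has_match <- char_vec %in% names(lookup)
recoded <- char_vec
recoded[has_match] <- unname(lookup[char_vec[has_match]])

recoded
# [1] "Banana" "c"      "Apple"  "Apple"
```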

Group Details

The final group (no pun intended) of features are focussed solely on grouped data. Given how many there are, I am not going to go into detail and instead I provide a brief overview here for the reader. The plan is to detail these functions in a separate blog post since a lot of work went on under the hood that may be interesting to discuss.

• Functions for splitting `data.frame`s: `group_split()`, `group_keys()`
• Extract grouping metadata: `group_data()`, `group_indices()`, `group_vars()`, `group_rows()`, `group_size()`, `n_groups()`, `groups()`
• Extract information about the current group: `cur_data()`, `cur_group()`, `cur_group_id()`, `cur_group_rows()`, `cur_column()`
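To give a feel for what the splitting and metadata functions in the list above report, here are loose `{base}` R counterparts for a few of them, grouping `mtcars` by `cyl`. These are rough analogues under the assumption of a single grouping variable; the return types differ from the `{poorman}` functions, and the object names are illustrative.

```r
# One data.frame per group, loosely what group_split() provides
pieces <- split(mtcars, mtcars$cyl)

# The distinct grouping values, loosely group_keys()
keys <- sort(unique(mtcars$cyl))

# Row positions belonging to each group, loosely group_rows()
rows <- split(seq_len(nrow(mtcars)), mtcars$cyl)

# Number of rows per group and number of groups, loosely group_size() and n_groups()
sizes <- vapply(pieces, nrow, integer(1))
n <- length(pieces)

sizes
#  4  6  8
# 11  7 14
```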

Conclusion

You made it this far, great! I won’t keep you much longer. This post has demonstrated some of the capabilities of the `{poorman}` (and therefore `{dplyr}`) package. The v0.2.1 release actually includes a slew of other features and functions so be sure to check out the release page for a full list.

As this blog post is quite long, I haven’t gone into any further details of the internals of `{poorman}`. However, if you are interested in taking a closer look at how I handle the different input types, you can see the code on the relevant `{poorman}` GitHub page. `{poorman}` is still a work in progress but, as you can see, it already has a lot of the functionality you know and love from `{dplyr}`. So if you are working on a new project and don’t want to have to deal with dependency management, especially if you are sharing work with colleagues, why not give `{poorman}` a try?

If you’d like to show your support for `{poorman}`, please consider giving the package a Star on GitHub as it gives me that boost of dopamine needed to continue development.
