poorman: Version 0.2.1 Release

Posted on July 1, 2020 by Random R Ramblings in R bloggers

[This article was first published on Random R Ramblings, and kindly contributed to R-bloggers].

Introduction

Welcome to my series of blog posts about my data manipulation package, {poorman}. For those of you who don’t know, {poorman} aims to replicate {dplyr} using only {base} R, and therefore to be completely dependency free. What’s nice about this series is that if you would rather just use {dplyr}, then that’s absolutely OK! By highlighting {poorman} functionality, this series of blog posts simultaneously highlights {dplyr} functionality too! However, I sometimes also describe how I developed the internals of {poorman}, often highlighting useful {base} R tips and tricks.

Today marks the release of v0.2.1 of {poorman} and with it a whole host of new functions and features. In today’s blog post we will be taking a look at some of these new features. Given the sheer number of features this release brings, we won’t be focusing on the internals of any of these functions; the internals will be saved for another post. Instead, we will simply be taking a look at what some of them can do.

Selecting Distinct Rows

The first function we will take a look at is distinct(). If you want to select only the distinct, or unique, rows from your data.frame, distinct() will help you do that. Let’s create some fake data in which some rows are duplicated.

df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 1, 2, 7, 1, 4, 6),
  age = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)
df
#    id age score
# 1   1  26    85
# 2   2  24    63
# 3   3  26    55
# 4   4  22    74
# 5   5  23    31
# 6   6  24    77
# 7   1  26    85
# 8   2  24    63
# 9   7  22    42
# 10  1  26    85
# 11  4  22    74
# 12  6  25    78

Now we wish to see the distinct records from this data.

library(poorman, warn.conflicts = FALSE)
df %>% distinct()
#    id age score
# 1   1  26    85
# 2   2  24    63
# 3   3  26    55
# 4   4  22    74
# 5   5  23    31
# 6   6  24    77
# 9   7  22    42
# 12  6  25    78

So we see that we now only have 8 records out of the original 12 because the duplicates have been removed. We can also obtain the distinct values of a particular column, returning just that column.

df %>% distinct(age)
#    age
# 1   26
# 2   24
# 4   22
# 5   23
# 12  25

But if you need the other variables still, you can choose to keep those too.

df %>% distinct(age, .keep_all = TRUE)
#    id age score
# 1   1  26    85
# 2   2  24    63
# 4   4  22    74
# 5   5  23    31
# 12  6  25    78
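
For readers curious how these three calls map onto {base} R, here is a minimal sketch built on duplicated(). It is only an illustration of the idea, not necessarily how {poorman} implements distinct() internally, and the variable names are made up for this example.

```r
# Base R sketch of the three distinct() calls above.
# Illustration only; not necessarily poorman's implementation.
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 1, 2, 7, 1, 4, 6),
  age = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)

# duplicated() flags every row that repeats an earlier row,
# so negating it keeps the first occurrence of each row
distinct_rows <- df[!duplicated(df), ]

# Distinct values of one column, returning only that column
distinct_age <- df[!duplicated(df$age), "age", drop = FALSE]

# Distinct values of one column, keeping the other columns
distinct_age_all <- df[!duplicated(df$age), ]
```

Note that drop = FALSE is what stops the single-column result from collapsing into a plain vector.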

Slicing Data

{dplyr} provides a couple of ways to select a subset of rows. It has the functions top_n() and top_frac() as well as the slice_*() family of functions. The former functions have now been superseded by the latter and so {poorman} skipped the implementation of the former. So what exactly do they do? Let’s take a look at some examples using the mtcars dataset.

slice_head() returns the first n rows (defaults to 1). slice_tail() returns the last n rows (not shown here).

slice_head(mtcars, n = 3)
#                mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

slice_sample() randomly selects rows with or without replacement.

slice_sample(mtcars, n = 3, replace = TRUE)
#                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
# Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
# Merc 280C     17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4

slice_min() and slice_max() select the rows with the lowest or highest values of a variable, respectively.

mtcars %>% slice_min(mpg, n = 3)
#                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
# Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
# Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
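
The slice_*() functions also have rough {base} R counterparts. The sketch below is an approximation for illustration: it ignores grouping and the handling of ties, the seed is an arbitrary choice to make the sample repeatable, and the variable names are hypothetical.

```r
# Rough base R counterparts of the slice_*() examples.
# Illustration only; ties and grouped data are not handled.

# slice_head(mtcars, n = 3) and slice_tail(mtcars, n = 3)
first3 <- head(mtcars, 3)
last3 <- tail(mtcars, 3)

# slice_sample(mtcars, n = 3, replace = TRUE)
set.seed(42)  # arbitrary seed, only so the sample is repeatable
sampled <- mtcars[sample(nrow(mtcars), 3, replace = TRUE), ]

# slice_min(mpg, n = 3): sort ascending, then take the first rows
lowest3 <- head(mtcars[order(mtcars$mpg), ], 3)

# slice_max(mpg, n = 3): sort descending, then take the first rows
highest3 <- head(mtcars[order(mtcars$mpg, decreasing = TRUE), ], 3)
```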

Selecting With Predicates

It is now possible to select columns in your data.frame which match a predicate such as is.numeric(). where() takes a function and returns all variables for which the function returns TRUE.

df <- data.frame(
  col1 = c(1, 2, 3),
  col2 = c("x", "y", "z"),
  col3 = c(TRUE, FALSE, TRUE)
)
df %>% select(where(is.numeric))
#   col1
# 1    1
# 2    2
# 3    3
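
The same selection can be sketched in {base} R by applying the predicate to each column and keeping the columns where it returns TRUE. This is an illustration of the idea rather than {poorman}'s internal code; is_num and numeric_cols are names invented for the example.

```r
# Base R sketch of select(where(is.numeric)).
# Illustration only; not necessarily poorman's implementation.
df <- data.frame(
  col1 = c(1, 2, 3),
  col2 = c("x", "y", "z"),
  col3 = c(TRUE, FALSE, TRUE)
)

# Apply the predicate to every column; vapply() guarantees
# exactly one logical value per column
is_num <- vapply(df, is.numeric, logical(1))

# List-style subsetting keeps the result a data.frame
numeric_cols <- df[is_num]

# Filter() wraps the same pattern in one call
numeric_cols_alt <- Filter(is.numeric, df)
```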

Working With NA Values

Finding the First Non-Missing Element

Given a set of vectors, the coalesce() function finds the first non-missing value at each position.

# Use a single value to replace all missing values
x <- sample(c(1:5, NA, NA, NA))
coalesce(x, 0L)
# [1] 4 0 5 0 1 2 0 3

# Or match together a complete vector from missing pieces
y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
coalesce(y, z)
# [1] 1 2 3 4 5
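
To make the "first non-missing value at each position" rule concrete, here is a minimal {base} R sketch. coalesce_base() is a hypothetical helper written for this example, not a {poorman} function, and because it relies on ifelse() it drops attributes such as factor levels.

```r
# Minimal base R sketch of the coalesce() idea.
# coalesce_base() is a hypothetical helper, not part of poorman.
coalesce_base <- function(...) {
  # Walk through the inputs left to right, filling the
  # positions that are still NA from the next vector
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
merged <- coalesce_base(y, z)
# merged is 1 2 3 4 5

# A length-one value is recycled into every missing position
filled <- coalesce_base(c(NA, 2, NA), 0)
# filled is 0 2 0
```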

Convert Values To NA

We can convert values in a vector x to NA if they equal the values at the same position in a second vector y.

na_if(1:5, 5:1)
# [1]  1  2 NA  4  5

This is particularly useful in a data.frame if you need to replace a particular value.

df <- data.frame(a = c("a", "b", "c", "BAD_VALUE"))
df %>% mutate(a = na_if(a, "BAD_VALUE"))
#      a
# 1    a
# 2    b
# 3    c
# 4 <NA>
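
In {base} R the same effect comes from a comparison followed by a subset assignment. The sketch below is illustrative only and uses its own variable names.

```r
# Base R sketch of the two na_if() examples above.
# Illustration only; not necessarily poorman's implementation.
x <- 1:5
y <- 5:1

# Compare position by position; only the third elements are equal.
# which() drops any NA comparisons before the assignment.
x[which(x == y)] <- NA
# x is now 1 2 NA 4 5

# The same pattern applied to a data.frame column
bad <- data.frame(a = c("a", "b", "c", "BAD_VALUE"))
bad$a[which(bad$a == "BAD_VALUE")] <- NA
```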

Replacing NA Values

Within a data.frame we often have missing values in multiple columns. We sometimes wish to replace these values, which is where replace_na() comes in. replace_na() is actually a function from the {tidyr} package but I decided to add it to {poorman} as it is extremely useful. Let’s take a look.

df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% replace_na(list(x = 0, y = "unknown"))
#   x       y
# 1 1       a
# 2 2 unknown
# 3 0       b
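
A {base} R version of the same replacement is a short loop over the named list. This is a sketch of the idea only, not {poorman}'s code; stringsAsFactors = FALSE is set explicitly so that the character column behaves the same on R versions before 4.0.

```r
# Base R sketch of replace_na() with a named list.
# Illustration only; not necessarily poorman's implementation.
df <- data.frame(
  x = c(1, 2, NA),
  y = c("a", NA, "b"),
  stringsAsFactors = FALSE
)
replacements <- list(x = 0, y = "unknown")

# For each named column, overwrite only the missing entries
for (col in names(replacements)) {
  is_missing <- is.na(df[[col]])
  df[[col]][is_missing] <- replacements[[col]]
}
```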

Recoding Values

If we wish to replace values within a vector or a column of a data.frame, we can use recode(). This is a vectorised version of base::switch(): you can replace numeric values based on their position or their name, and character or factor values only by their name.

char_vec <- sample(c("a", "b", "c"), 10, replace = TRUE)
recode(char_vec, a = "Apple")
#  [1] "b"     "c"     "b"     "c"     "Apple" "Apple" "Apple" "c"     "Apple" "b"
recode(char_vec, a = "Apple", b = "Banana")
#  [1] "Banana" "c"      "Banana" "c"      "Apple"  "Apple"  "Apple"  "c"      "Apple" 
# [10] "Banana"
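
Recoding by name can be sketched in {base} R with a named lookup vector. The example below is an illustration rather than {poorman}'s implementation; it fixes char_vec to the values shown in the output above instead of sampling, so the result is repeatable.

```r
# Base R sketch of recode() using a named lookup vector.
# Illustration only; not necessarily poorman's implementation.
# char_vec is fixed (not sampled) so the result is repeatable.
char_vec <- c("b", "c", "b", "c", "a", "a", "a", "c", "a", "b")
lookup <- c(a = "Apple", b = "Banana")

recoded <- char_vec
# Only values that appear as names in the lookup are replaced;
# everything else ("c" here) is left untouched
hit <- char_vec %in% names(lookup)
recoded[hit] <- unname(lookup[char_vec[hit]])
```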

Group Details

The final group (no pun intended) of features are focussed solely on grouped data. Given how many there are, I am not going to go into detail and instead I provide a brief overview here for the reader. The plan is to detail these functions in a separate blog post since a lot of work went on under the hood that may be interesting to discuss.

Functions for splitting data.frames: group_split(), group_keys()
Extract grouping metadata: group_data(), group_indices(), group_vars(), group_rows(), group_size(), n_groups(), groups()
Extract information about the current group: cur_data(), cur_group(), cur_group_id(), cur_group_rows(), cur_column()
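
To give a feel for what some of these functions return, here is a loose {base} R analogy using split() on mtcars grouped by cyl. It is only an approximation for illustration: the {poorman} functions operate on grouped data.frames and their exact return values are not reproduced here.

```r
# Loose base R analogy for a few of the grouping helpers.
# Illustration only; return types differ from poorman's functions.

# Roughly group_split(): one data.frame per value of cyl
pieces <- split(mtcars, mtcars$cyl)

# Roughly group_keys(): the distinct values that define the groups
keys <- sort(unique(mtcars$cyl))

# Roughly group_size(): number of rows in each group
sizes <- vapply(pieces, nrow, integer(1))

# Roughly n_groups(): how many groups there are
n <- length(pieces)
```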

Conclusion

You made it this far, great! I won’t keep you much longer. This post has demonstrated some of the capabilities of the {poorman} (and therefore {dplyr}) package. The v0.2.1 release actually includes a slew of other features and functions so be sure to check out the release page for a full list.

As this blog post is quite long, I haven’t gone into any further details of the internals of {poorman}. However, if you are interested in taking a closer look at how I handle the different input types, you can see the code on the relevant {poorman} GitHub page. {poorman} is still a work in progress but, as you can see, it already has a lot of the functionality you know and love from {dplyr}. So if you are working on a new project and don’t want to have to deal with dependency management, especially if you are sharing work with colleagues, why not give {poorman} a try?

If you’d like to show your support for {poorman}, please consider giving the package a Star on GitHub as it gives me that boost of dopamine needed to continue development.
