Are you developing an automated exploration tool? Here we propose some alternatives to drop columns with high percentage of NAs.
In this previous tip we talk about BaseR vs Tidy & Purrr counting NAs performance.
Not leaving the pipeflow. How much does it cost?;) It depends on the NA distribution between features and its number, but not much that a few nanoseconds in small and big datasets
# library(microbenchmark) You can benchmark them in small and big datasets library(tidyverse) airquality %>% select_if(~mean(is.na(.)) < 0.2) airquality %>% select(which(colMeans(is.na(.)) < 0.2)) airquality[lapply(airquality, function(x) mean(is.na(x))) < 0.2]
Soooo what’s your choice??