Random sampling of files

April 2, 2019
By

(This article was first published on R Code – Geekcologist , and kindly contributed to R-bloggers)

random_files <- function(path, percent_number, pattern = "wav$|WAV$"){
  ####################################################################
  # path = path to folder with files to select                                 
  #                                                                            
  # percent_number = percentage or number of recordings to select. If value is 
  #   between 0 and 1 percentage of files is assumed, if value greater than 1, 
  #   number of files is assumed                                               
  #                                                                            
  # pattern = file extension to select. By default it selects wav files. For   
  #   other type of files replace wav and WAV by the desired extension         
  ####################################################################
  
  # Get file list with full path and file names
  files <- list.files(path, full.names = TRUE, pattern = pattern)
  file_names <- list.files(path, pattern = pattern)
  
  # Select the desired % or number of file by simple random sampling 
  randomize <- sample(seq(files))
  files2analyse <- files[randomize]
  names2analyse <- file_names[randomize]
  if(percent_number <= 1){
    size <- floor(percent_number * length(files))
  }else{
    size <- percent_number
  }
  files2analyse <- files2analyse[(1:size)]
  names2analyse <- names2analyse[(1:size)]

  # Create folder to output
  results_folder <- paste0(path, '/selected')
  dir.create(results_folder, recursive=TRUE)
  
  # Write csv with file names
  write.table(names2analyse, file = paste0(results_folder, "/selected_files.csv"),
              col.names = "Files", row.names = FALSE)
  
  # Move files
  for(i in seq(files2analyse)){
    file.rename(from = files2analyse[i], to = paste0(results_folder, "/", names2analyse[i]) )
  }
}

 

I normally use this function inside a little script for some extra functionalities:

  1. first I set up the environment by sourcing the required functions and loading the packages,
  2. as I always do when using functions that use randomness, I set a seed to be able to reproduce my results in a later time,
  3. as the function uses a folder path, I’ve included a little search window with tcltk to select the folder instead of having to write the full path by hand.
# Load packages and functions
require(tcltk)
source("random_files.R")

# Set seed to reproduce results
set.seed(1001)

# Select folder with recordings
path <- tcltk::tk_choose.dir()

# Percentage or number of recordings
percent_number <- 0.2 # using percentage

# Random sampling of files
random_files(path, percent_number, pattern = "wav$|WAV$")

This function was written for a specific purpose but with some tweaks you can probably adapt it for other purposes other than the one I use it for.

You can find this and other R scripts at: https://github.com/bmsasilva/Rscripts

 

To leave a comment for the author, please follow the link and comment on their blog: R Code – Geekcologist .

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)