Finding duplicates in data frame across columns and replacing them with unique values using R

August 5, 2019
By

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Suppose you have a dataset with many variables, and you want to check:

  • if there are any duplicated for each of the observation
  • replace duplicates with random value from pool of existing values.

 

In this manner, let’s create a sample dataset:

df <- structure(list(
   v1 = c(10,20,30,40,50,60,70,80)
  ,v2 = c(5,7,6,8,6,8,9,4)
  ,v3 = c(2,4,6,6,7,8,8,4)
  ,v4 = c(8,7,3,1,8,7,8,4)
  ,v5 = c(2,4,6,7,8,9,9,3))
  ,.Names = c("ID","a", "b","d", "e")
  ,.typeOf = c("numeric", "numeric", "numeric","numeric","numeric")
  ,row.names = c(NA, -8L)
  ,class = "data.frame"
  ,comment = "Sample dataframe for duplication example")

which has the following interesting characteristics:

2019-08-04 23_49_16-RStudio

Upon closer inspection, one will see that there are many duplicated values across different variables (variable ID, variable a, variable b, variable d and variable e). So let’s focus on:

  • row 2 has two times duplicated values (2x value 4 and 2x value 7)
  • row 3 has three times duplicated values (3x value 6)

Our pool of possible replacement values are:

    possible_new_values <- c(1,2,3,4,5,6,7,8,9

 

Creating loop for slicing the data, loop through the duplicated positions, at the end looks like:

for (row in 1:nrow(df)) {
      vec = df %>% slice(row) %>% unlist() %>% unname()
      #check for duplicates
      if(length(unique(vec)) != length(df)) {
        positions <- which(duplicated(vec) %in% c("TRUE"))
        #iterate through positions
        for(i in 1:length(positions)) {
        possible_new_values <- c(1,2,3,4,5,6,7,8,9)
        df[row,positions[i]]  <- sample(possible_new_values
                           [ ! possible_new_values %in% unique(vec)],1)
        }
      }
    }

revealing the final replacement of values in

2019-08-05 00_08_14-RStudio

So the end result, when putting old data.frame and the new data.frame (with replaced values) side by side, it looks like:

2019-08-05 00_19_18-Presentation1 - PowerPoint

Showing how replacement works per each row across given columns | rows.

Niffy, yet useful data de-duplication or data replacements, when you need one.

As always, code is available at Github.

Happy coding with R 🙂

 

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)