Removing Records by Duplicate Values

Posted on December 20, 2012 by statcompute in R bloggers | 0 Comments

[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Removing records from a data table based on duplicate values in one or more columns is a commonly used but important data cleaning technique. Below shows an example about how to accomplish this task by SAS, R, and Python respectively.

SAS Example

data _data_;
  input label $ value;
datalines;
A     4
B     3
C     6
B     3
B     1
A     2
A     4
A     4
;
run;

proc sort data = _last_;
  by label value;
run;

data _data_;
  set _last_;
  by label;
  if first.label then output;
run;

proc print data = _last_ noobs;
run;

/* OUTPUT:
label    value
  A        2  
  B        1  
  C        6 
*/

R Example

> # INPUT DATA INT THE CONSOLE
> df <- read.table(header = T, text = '
+  label value
+      A     4
+      B     3
+      C     6
+      B     3
+      B     1
+      A     2
+      A     4
+      A     4
+ ')
> # SORT DATA FRAME BY COLUMNS
> df2 <- df[order(df$label, df$value), ]
> print(df2)
  label value
6     A     2
1     A     4
7     A     4
8     A     4
5     B     1
2     B     3
4     B     3
3     C     6
> # DEDUP RECORDS
> df3 <- df2[!duplicated(df2$label), ]
> print(df3)
  label value
6     A     2
5     B     1
3     C     6

Python Example

In [1]: import pandas as pd

In [2]: # INPUT DATA INTO DATAFRAME

In [3]: df = pd.DataFrame({'label': ['A', 'B', 'C'] + ['B'] * 2 + ['A'] * 3, 'value': [4, 3, 6, 3, 1, 2, 4, 4]})

In [4]: # SORT DATA BY COLUMNS

In [5]: df2 = df.sort(['label', 'value'])

In [6]: print(df2)
  label  value
5     A      2
0     A      4
6     A      4
7     A      4
4     B      1
1     B      3
3     B      3
2     C      6

In [7]: # DEDUP RECORDS

In [8]: df3 = df2.drop_duplicates(['label'])

In [9]: print(df3)
  label  value
5     A      2
4     B      1
2     C      6

To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Removing Records by Duplicate Values

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)