Removing Records by Duplicate Values

December 20, 2012
By

(This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers)

Removing records from a data table based on duplicate values in one or more columns is a commonly used but important data cleaning technique. Below shows an example about how to accomplish this task by SAS, R, and Python respectively.

SAS Example

data _data_;
  input label $ value;
datalines;
A     4
B     3
C     6
B     3
B     1
A     2
A     4
A     4
;
run;

proc sort data = _last_;
  by label value;
run;

data _data_;
  set _last_;
  by label;
  if first.label then output;
run;

proc print data = _last_ noobs;
run;

/* OUTPUT:
label    value
  A        2  
  B        1  
  C        6 
*/

R Example

> # INPUT DATA INT THE CONSOLE
> df <- read.table(header = T, text = '
+  label value
+      A     4
+      B     3
+      C     6
+      B     3
+      B     1
+      A     2
+      A     4
+      A     4
+ ')
> # SORT DATA FRAME BY COLUMNS
> df2 <- df[order(df$label, df$value), ]
> print(df2)
  label value
6     A     2
1     A     4
7     A     4
8     A     4
5     B     1
2     B     3
4     B     3
3     C     6
> # DEDUP RECORDS
> df3 <- df2[!duplicated(df2$label), ]
> print(df3)
  label value
6     A     2
5     B     1
3     C     6

Python Example

In [1]: import pandas as pd

In [2]: # INPUT DATA INTO DATAFRAME

In [3]: df = pd.DataFrame({'label': ['A', 'B', 'C'] + ['B'] * 2 + ['A'] * 3, 'value': [4, 3, 6, 3, 1, 2, 4, 4]})

In [4]: # SORT DATA BY COLUMNS

In [5]: df2 = df.sort(['label', 'value'])

In [6]: print(df2)
  label  value
5     A      2
0     A      4
6     A      4
7     A      4
4     B      1
1     B      3
3     B      3
2     C      6

In [7]: # DEDUP RECORDS

In [8]: df3 = df2.drop_duplicates(['label'])

In [9]: print(df3)
  label  value
5     A      2
4     B      1
2     C      6

To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)