Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Removing records from a data table based on duplicate values in one or more columns is a commonly used but important data cleaning technique. Below shows an example about how to accomplish this task by SAS, R, and Python respectively.

SAS Example

data _data_;
input label $value; datalines; A 4 B 3 C 6 B 3 B 1 A 2 A 4 A 4 ; run; proc sort data = _last_; by label value; run; data _data_; set _last_; by label; if first.label then output; run; proc print data = _last_ noobs; run; /* OUTPUT: label value A 2 B 1 C 6 */  R Example > # INPUT DATA INT THE CONSOLE > df <- read.table(header = T, text = ' + label value + A 4 + B 3 + C 6 + B 3 + B 1 + A 2 + A 4 + A 4 + ') > # SORT DATA FRAME BY COLUMNS > df2 <- df[order(df$label, df$value), ] > print(df2) label value 6 A 2 1 A 4 7 A 4 8 A 4 5 B 1 2 B 3 4 B 3 3 C 6 > # DEDUP RECORDS > df3 <- df2[!duplicated(df2$label), ]
> print(df3)
label value
6     A     2
5     B     1
3     C     6


Python Example

In : import pandas as pd

In : # INPUT DATA INTO DATAFRAME

In : df = pd.DataFrame({'label': ['A', 'B', 'C'] + ['B'] * 2 + ['A'] * 3, 'value': [4, 3, 6, 3, 1, 2, 4, 4]})

In : # SORT DATA BY COLUMNS

In : df2 = df.sort(['label', 'value'])

In : print(df2)
label  value
5     A      2
0     A      4
6     A      4
7     A      4
4     B      1
1     B      3
3     B      3
2     C      6

In : # DEDUP RECORDS

In : df3 = df2.drop_duplicates(['label'])

In : print(df3)
label  value
5     A      2
4     B      1
2     C      6  