(This article was first published on

**Yet Another Blog in Statistical Computing » S+/R**, and kindly contributed to R-bloggers)Removing records from a data table based on duplicate values in one or more columns is a commonly used but important data cleaning technique. Below shows an example about how to accomplish this task by SAS, R, and Python respectively.

**SAS Example**

data _data_; input label $ value; datalines; A 4 B 3 C 6 B 3 B 1 A 2 A 4 A 4 ; run; proc sort data = _last_; by label value; run; data _data_; set _last_; by label; if first.label then output; run; proc print data = _last_ noobs; run; /* OUTPUT: label value A 2 B 1 C 6 */

**R Example**

> # INPUT DATA INT THE CONSOLE > df <- read.table(header = T, text = ' + label value + A 4 + B 3 + C 6 + B 3 + B 1 + A 2 + A 4 + A 4 + ') > # SORT DATA FRAME BY COLUMNS > df2 <- df[order(df$label, df$value), ] > print(df2) label value 6 A 2 1 A 4 7 A 4 8 A 4 5 B 1 2 B 3 4 B 3 3 C 6 > # DEDUP RECORDS > df3 <- df2[!duplicated(df2$label), ] > print(df3) label value 6 A 2 5 B 1 3 C 6

**Python Example**

In [1]: import pandas as pd In [2]: # INPUT DATA INTO DATAFRAME In [3]: df = pd.DataFrame({'label': ['A', 'B', 'C'] + ['B'] * 2 + ['A'] * 3, 'value': [4, 3, 6, 3, 1, 2, 4, 4]}) In [4]: # SORT DATA BY COLUMNS In [5]: df2 = df.sort(['label', 'value']) In [6]: print(df2) label value 5 A 2 0 A 4 6 A 4 7 A 4 4 B 1 1 B 3 3 B 3 2 C 6 In [7]: # DEDUP RECORDS In [8]: df3 = df2.drop_duplicates(['label']) In [9]: print(df3) label value 5 A 2 4 B 1 2 C 6

To

**leave a comment**for the author, please follow the link and comment on their blog:**Yet Another Blog in Statistical Computing » S+/R**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...