In the world of data preparation a common task is to identify duplicate records in a file or data set. A few years ago, I did most development work in Java, and shudder to think of the amount of code required to accomplish this sort of task. Since that time I been involved in many projects that did not require programming in a specific language, but simply “getting the job done.” With that in mind, “removal of duplicate records in a file” can be construed as manipulating a data set rather than an exercise in file processing. The following shows how R compares with other technologies when performing this task.
A file named “file.txt” containing semicolon delimited records appears in the examples below:
Identifying rows in which every field is duplicated is relatively straight forward using Unix utilities. The file can be outputted using the cat (concatenate) command, the results sorted, and a unique list of results (prefaced by the count of occurrences in the file) can be filtered by a regular expression that indicates any row that has a number of occurrences not equal to one.
cat file.txt | sort | uniq -c | grep “^ [^1 ]”
Often, it is more concise to speak Unix than English. This starts to break down a bit when considering each line in the file as a record with distinguishable fields.
General purpose programming languages can do the same thing, but are a bit more verbose. In ruby, the file can be read into an array of arrays.
A list of unique rows can be outputted using the following one liner:
And with a bit more effort, you can write a program that will filter the results as needed. This type of processing can also be done declaratively in R where such a file is read in as a data frame.
The duplicated function can also be used to identify the single row that is duplicated.
That is to say, the all fields in all records in the following data frames are equal.
The situation gets a bit more complicated when you want to only use some of the delimited fields to identify duplicate records. In the data set above, rows 1 and 4 are identical. Consider the requirement to recognize lines 2 and 5 as identical (due to the first two fields matching). In Unix, you could use awk with the -F option to process the delimited fields. In ruby you could store key fields in variables and compare them with each row. If you come from the SQL world, you could use the R sqldf package to treat the data frame as a database table.
sqldf(‘select * from df group by V1, V2 having count(*) >1’)
The R way of getting this information is to identify the indices of the duplicated rows.
I enjoy looking at the overlapping aspects of programming languages and utilities (like this OTN Article from a few years ago). It is helpful for highlighting the right tool for the right job, and aids in communication with other technical professionals.