How to find dataset differences in R, when the pieces of information are changing between datasets it’s a difficult task to identify the same.
Here we are going to discuss the daff package in R, daff package helps us to identify the differences and visualize them in a beautiful way.
Its compare rows and columns and provides information on
- Value changes
- Which new row/column added
- Which row/ column removed
- How many rows /columns added/removed?
Let’s see how to execute the daff package in R.
First load the daff package in R.
If you are not installed the package, you can install the daff package from CRAN
Now we need to create two data frames and let’s name them mydata1 and mydata2.
library(daff) mydata1<-data.frame(Name=c("P1","P2","P3","P4","P5"),col1=c(1,2,3,4,5),col2=c(11,13,14,15,17)) mydata2<-data.frame(Name=c("P1","P2","P6","P4","P5"),col1=c(1,3,3,6,9))
Let’s see how to find the dataset differences
Daff Comparison: ‘mydata1’ vs. ‘mydata2’ --- @@ Name col1 col2 P1 1 11 -> P2 2->3 13 +++ P6 3 --- P3 3 14 -> P4 4->6 15 -> P5 5->9 17
-> Indicate the updated value.
—Indicate the row removed from the dataset
—If the symbol appears above the column indicate the column removed from the dataset.
+++indicate rows which are added
Basically, -> for updated data, – for removed data and + for added data.
You can visualize the same using render_diff
The HTML output will be displayed in different colors.
Suppose if you want to compare the datasets based on the primary columns, you can just pass the information like,
You can provide multiple column names as a primary column. Like c(“Name”,”col1”)
In HTML view, you can download the results into multiple formats like CSV, Excel, and pdf.
Sorting option and filter option also available in HTML view.
In daff package many other arguments are available you can make use of the same.
diff_data(data_ref, data, always_show_header = TRUE, always_show_order = FALSE, columns_to_ignore = c(), count_like_a_spreadsheet = TRUE, ids = c(), ignore_whitespace = FALSE, never_show_order = FALSE, ordered = TRUE, padding_strategy = c("auto", "smart", "dense", "sparse"), show_meta = TRUE, show_unchanged = FALSE, show_unchanged_columns = FALSE, show_unchanged_meta = FALSE, unchanged_column_context = 1L, unchanged_context = 1L)
Sometimes we need to ignore some particular columns.
columns_to_ignore argument will be very useful (character List of columns to ignore in all calculations.
Changes related to these columns should be discounted)
You can export the comparison results into a CSV file, for that you need to store the result into variables.
Suppose if you want to read the output, you can make use of read_diff
The post How to find dataset differences in R Quickly Compare Datasets appeared first on finnstats.