Merging Dataframes by Partly Matching String
[This article was first published on theBioBucket*, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The latest posting by Tony Hirst sparked my attention because I was thinking about a very similar issue recently.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I was also fiddling around with agrep and adist until I realised that for this very issue matching of substrings is not as important as matching multiple words.. With this different approach I quite easily matched all but 3 countries.
See what I did:
## look up matches of one dataframe in another dataframe. ## the strings to be matched are comprised of 1 or more words ## and seperated by white space. ## method: match strings that have the highest fraction of words that match up d1 <- read.csv("http://s.telegraph.co.uk/graphics/conrad/PercentageUsingTheNet.csv", header = T, sep = ",", encoding = "UTF-8") d2 <- read.csv("http://www.iso.org/iso/country_names_and_code_elements_txt", header = T, sep = ";", encoding = "UTF-8") ## strings to be compared d1$ECONOMY and d2$Country.Name mystr.1 <- as.character(d1$ECONOMY) mystr.2 <- as.character(d2$Country.Name) mystr.3 <- as.character(d2$ISO.3166.1.alpha.2.code) ## remove punctuation and multiple spaces mystr.1 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.1)) mystr.1 <- gsub("\\s+", " ", mystr.1) mystr.2 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.2)) mystr.2 <- gsub("\\s+", " ", mystr.2) ## function that finds matching words in string (words seperated by single space!) n.wordshared <- function(x, y) { sum(!is.na(match(unlist(strsplit(x, " ")), unlist(strsplit(y, " "))) ) ) } ## example n.wordshared(x = "hello world", y = "good bye world") ## [1] 1 ## function that calculates fraction of shared words fr.wordshared <- function(x, y) { n.wordshared(x, y) / (length(unique(unlist(strsplit(x, " ")))) + length(unique(unlist(strsplit(y, " "))))) } ## example fr.wordshared(x = "hello world", y = "good bye world") ## [1] 0.2 mydf <- data.frame(str1 = mystr.1, mymatch = "", match.iso = "", stringsAsFactors = F) ## now look up every element of string 1 in string 2 ## and if there are matching words assign match to dataframe for (i in 1:nrow(mydf)) { xx <- sapply(mystr.2, fr.wordshared, y = mystr.1[i]) if (sum(xx) == 0) { mydf$mymatch[i] <- NA mydf$match.iso[i] <- NA } else { mydf$mymatch[i] <- paste(names(which(xx == max(xx))), collapse = "; ") mydf$match.iso[i] <- paste(mystr.3[as.integer(which(xx == max(xx)))], collapse = "; ") } } ## see result print(mydf) ## these are the multiple matches (aa <- mydf[grep(";", mydf$mymatch), ]) ## ## str1 mymatch match.iso ## 28 slovak republic czech republic; dominican republic CZ; DO ## these were not matched (bb <- mydf[is.na(mydf$mymatch), ]) ## str1 mymatch match.iso ## ## 61 russia NA NA ## 108 syria NA NA
Now, expanding on this concept by introduction of partial matching would most propably result in a 100% match...
To leave a comment for the author, please follow the link and comment on their blog: theBioBucket*.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.