Merging Dataframes by Partly Matching String

September 29, 2012
By

(This article was first published on theBioBucket*, and kindly contributed to R-bloggers)

The latest posting by Tony Hirst sparked my attention because I was thinking about a very similar issue recently.

I was also fiddling around with agrep and adist until I realised that for this very issue matching of substrings is not as important as matching multiple words.. With this different approach I quite easily matched all but 3 countries.

See what I did:

## look up matches of one dataframe in another dataframe.
## the strings to be matched are comprised of 1 or more words
## and seperated by white space.
## method: match strings that have the highest fraction of words that match up

d1 <- read.csv("http://s.telegraph.co.uk/graphics/conrad/PercentageUsingTheNet.csv",
header = T, sep = ",", encoding = "UTF-8")
d2 <- read.csv("http://www.iso.org/iso/country_names_and_code_elements_txt",
header = T, sep = ";", encoding = "UTF-8")

## strings to be compared d1$ECONOMY and d2$Country.Name
mystr.1 <- as.character(d1$ECONOMY)
mystr.2 <- as.character(d2$Country.Name)
mystr.3 <- as.character(d2$ISO.3166.1.alpha.2.code)

## remove punctuation and multiple spaces
mystr.1 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.1))
mystr.1 <- gsub("\\s+", " ", mystr.1)
mystr.2 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.2))
mystr.2 <- gsub("\\s+", " ", mystr.2)

## function that finds matching words in string (words seperated by single space!)
n.wordshared <- function(x, y) {
sum(!is.na(match(unlist(strsplit(x, " ")),
unlist(strsplit(y, " ")))
)
)
}
## example
n.wordshared(x = "hello world", y = "good bye world")
## [1] 1

## function that calculates fraction of shared words
fr.wordshared <- function(x, y) {
n.wordshared(x, y) / (length(unique(unlist(strsplit(x, " "))))
+ length(unique(unlist(strsplit(y, " ")))))
}
## example
fr.wordshared(x = "hello world", y = "good bye world")
## [1] 0.2

mydf <- data.frame(str1 = mystr.1, mymatch = "", match.iso = "",
stringsAsFactors = F)

## now look up every element of string 1 in string 2
## and if there are matching words assign match to dataframe
for (i in 1:nrow(mydf)) {
xx <- sapply(mystr.2, fr.wordshared, y = mystr.1[i])
if (sum(xx) == 0) {
mydf$mymatch[i] <- NA
mydf$match.iso[i] <- NA
} else {
mydf$mymatch[i] <- paste(names(which(xx == max(xx))), collapse = "; ")
mydf$match.iso[i] <- paste(mystr.3[as.integer(which(xx == max(xx)))], collapse = "; ")
}
}

## see result
print(mydf)

## these are the multiple matches
(aa <- mydf[grep(";", mydf$mymatch), ])
##
## str1 mymatch match.iso
## 28 slovak republic czech republic; dominican republic CZ; DO


## these were not matched
(bb <- mydf[is.na(mydf$mymatch), ])
## str1 mymatch match.iso
##
## 61 russia NA NA
## 108 syria NA NA

Now, expanding on this concept by introduction of partial matching would most propably result in a 100% match...

To leave a comment for the author, please follow the link and comment on his blog: theBioBucket*.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.