Merging Dataframes by Partly Matching String

September 29, 2012
By

(This article was first published on theBioBucket*, and kindly contributed to R-bloggers)

The latest posting by Tony Hirst sparked my attention because I was thinking about a very similar issue recently.

I was also fiddling around with agrep and adist until I realised that for this very issue matching of substrings is not as important as matching multiple words.. With this different approach I quite easily matched all but 3 countries.

See what I did:

## look up matches of one dataframe in another dataframe.
## the strings to be matched are comprised of 1 or more words
## and seperated by white space.
## method: match strings that have the highest fraction of words that match up

d1 <- read.csv("http://s.telegraph.co.uk/graphics/conrad/PercentageUsingTheNet.csv",
header = T, sep = ",", encoding = "UTF-8")
d2 <- read.csv("http://www.iso.org/iso/country_names_and_code_elements_txt",
header = T, sep = ";", encoding = "UTF-8")

## strings to be compared d1$ECONOMY and d2$Country.Name
mystr.1 <- as.character(d1$ECONOMY)
mystr.2 <- as.character(d2$Country.Name)
mystr.3 <- as.character(d2$ISO.3166.1.alpha.2.code)

## remove punctuation and multiple spaces
mystr.1 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.1))
mystr.1 <- gsub("\\s+", " ", mystr.1)
mystr.2 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.2))
mystr.2 <- gsub("\\s+", " ", mystr.2)

## function that finds matching words in string (words seperated by single space!)
n.wordshared <- function(x, y) {
sum(!is.na(match(unlist(strsplit(x, " ")),
unlist(strsplit(y, " ")))
)
)
}
## example
n.wordshared(x = "hello world", y = "good bye world")
## [1] 1

## function that calculates fraction of shared words
fr.wordshared <- function(x, y) {
n.wordshared(x, y) / (length(unique(unlist(strsplit(x, " "))))
+ length(unique(unlist(strsplit(y, " ")))))
}
## example
fr.wordshared(x = "hello world", y = "good bye world")
## [1] 0.2

mydf <- data.frame(str1 = mystr.1, mymatch = "", match.iso = "",
stringsAsFactors = F)

## now look up every element of string 1 in string 2
## and if there are matching words assign match to dataframe
for (i in 1:nrow(mydf)) {
xx <- sapply(mystr.2, fr.wordshared, y = mystr.1[i])
if (sum(xx) == 0) {
mydf$mymatch[i] <- NA
mydf$match.iso[i] <- NA
} else {
mydf$mymatch[i] <- paste(names(which(xx == max(xx))), collapse = "; ")
mydf$match.iso[i] <- paste(mystr.3[as.integer(which(xx == max(xx)))], collapse = "; ")
}
}

## see result
print(mydf)

## these are the multiple matches
(aa <- mydf[grep(";", mydf$mymatch), ])
##
## str1 mymatch match.iso
## 28 slovak republic czech republic; dominican republic CZ; DO


## these were not matched
(bb <- mydf[is.na(mydf$mymatch), ])
## str1 mymatch match.iso
##
## 61 russia NA NA
## 108 syria NA NA

Now, expanding on this concept by introduction of partial matching would most propably result in a 100% match…

To leave a comment for the author, please follow the link and comment on their blog: theBioBucket*.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Mango solutions



RStudio homepage



Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training

datasociety

http://www.eoda.de





ODSC

ODSC

CRC R books series





Six Sigma Online Training









Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)