Generating random lists of names with errors to explore fuzzy word matching
Health data systems are not always perfect. This was made painfully obvious when a study I am involved with required matching a list of nursing home residents taken from one system with a set of results from PCR tests for COVID-19 drawn from another. Name spellings for the same person on the second list were not always consistent across different PCR tests, nor were they always consistent with the first list, which defined the cohort we were interested in studying. My research associate, Yifan Xu, and I were asked to see what we could do to help.
This was my first foray into fuzzy word-matching. We came up with a simple solution to match the names on the two lists based on the R function adist that should allow the research team to finalize a matched list with minimal manual effort.
In order to test our proposed approach, we developed a way to generate random lists of names with errors. This post presents both the code for random list generation with errors as well as the simple matching algorithm.
Distance between strings
Fuzzy word matching can be approached using the concept of string distance. Quite simply, this can be measured by counting the number of transformations required to move from the original to the target word. A transformation is one of three moves: (1) a substitution, (2) an insertion, or (3) a deletion. The figure below illustrates the 5 “moves” that are required to get from CAT to KITTEN: two substitutions and three insertions.
The adist function can calculate this string distance, and if you set counts = TRUE, the function will provide the number of substitutions, insertions, and deletions. Here are the results for our example:
adist("CAT", "KITTEN", counts = TRUE)
##      [,1]
## [1,]    5
## attr(,"counts")
## , , ins
## 
##      [,1]
## [1,]    3
## 
## , , del
## 
##      [,1]
## [1,]    0
## 
## , , sub
## 
##      [,1]
## [1,]    2
## 
## attr(,"trafos")
##      [,1]    
## [1,] "SSIMII"
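If only the individual counts are needed, they can be pulled out of the counts attribute rather than read off the printed output. This is a minimal sketch that assumes the attribute layout shown in the output above (a three-dimensional array whose third dimension is labeled ins, del, and sub); the names res, total, and ops are just for illustration.

```r
# Compute the distance and keep the breakdown of operations
res <- adist("CAT", "KITTEN", counts = TRUE)

# Total distance as a plain number (as.vector drops the attributes)
total <- as.vector(res)

# The "counts" attribute is a 1 x 1 x 3 array; for this single pair,
# pull out the three operation counts as a named vector
ops <- attr(res, "counts")[1, 1, ]

total
ops[c("sub", "ins", "del")]
```

The three counts add up to the total distance, which is a quick sanity check when working with many pairs.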
Assessing whether a distance is meaningful or no longer fuzzy certainly depends on the nature of the problem and the length of the strings. The distance from CAT to DOG is 3 (with 3 substitutions), and so is the distance from DERE, STEPHEN to DEERE, STEVEN (1 insertion, 1 deletion, and 1 substitution); we might be willing to declare the individual’s name a match while declining to pair the two different animals.
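One way to make the role of string length concrete is to scale the distance by the length of the longer string. This is only an illustrative sketch: rel_dist is a hypothetical helper, not something we used in the actual matching, and other normalizations are possible.

```r
# Hypothetical helper (not part of the original analysis): scale the
# edit distance by the length of the longer string, so that 0 means
# identical and 1 means every position had to change
rel_dist <- function(a, b) {
  as.vector(adist(a, b)) / max(nchar(a), nchar(b))
}

# Raw distances for the two examples in the text
d_animals <- as.vector(adist("CAT", "DOG"))
d_names   <- as.vector(adist("DERE, STEPHEN", "DEERE, STEVEN"))

# Relative distances for the same pairs
r_animals <- rel_dist("CAT", "DOG")
r_names   <- rel_dist("DERE, STEPHEN", "DEERE, STEVEN")

c(d_animals, d_names)   # both raw distances are 3
c(r_animals, r_names)   # 3/3 for the animals, 3/13 for the names
```

On the relative scale the two animals are as far apart as they can be, while the two spellings of the name differ in fewer than a quarter of their characters.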
Simulating lists of names with errors
To test out our fuzzy matching process, we need to be able to create a master list of names from which we can create two sub-lists: (1) the cohort list of nursing home residents with correct name spellings, and (2) the list of PCR records that includes multiple records (test results) per individual, with possibly inconsistent name spellings across the different tests for a specific person.
Generating names
The master list can easily be generated using the randomNames function in the R package randomNames. A call to this function provides samples of names from a large-scale database. (It provides gender and ethnic variation if you need it.)
library(data.table)
library(randomNames)

set.seed(6251)
randomNames(4)
## [1] "Hale, James"      "el-Qazi, Najeema" "Sourn, Raj"       "Jensen, Tia"
Generating errors
To facilitate the generation of spelling errors, I’ve created a function that takes a string, a specified number of substitutions, a number of insertions (if negative then these are deletions), and an indicator that flips the order of the names (typically “Last Name, First Name”):
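mis_string <- function(name, subs = 1, ins = 0, flip = FALSE) {
  
  # split "Last, First" into its two parts
  names <- trimws(unlist(strsplit(name, split = ",")))
  
  # substitutions: pick a character in one of the two parts and swap it
  # for a different lowercase letter (sub() changes the first occurrence
  # of that character; fixed = TRUE treats it literally, not as a regex)
  if (subs) {
    for (i in 1 : subs) {
      change <- sample(1:2, 1)
      ii <- sample(nchar(names[change]), 1)
      l <- substr(names[change], ii, ii)
      s <- sample(letters[letters != l], 1)
      names[change] <- sub(l, s, names[change], fixed = TRUE)
    }
  }
  
  # insertions: add a random lowercase letter just before position ii
  if (ins > 0) {
    for (i in 1 : ins) {
      change <- sample(c(1, 2), 1)
      ii <- sample(nchar(names[change]), 1)
      stringi::stri_sub(names[change], ii, ii-1) <- sample(letters, 1)
    }
  }
  
  # deletions (negative ins): remove the character at position ii
  if (ins < 0) {
    for (i in 1 : -ins) {
      change <- sample(c(1, 2), 1)
      ii <- sample(nchar(names[change]), 1)
      stringi::stri_sub(names[change], ii, ii) <- ""
    }
  }
  
  # optionally flip the order of the two parts
  paste(names[flip + 1], names[2 - flip], sep = ", ")
}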
Here are two applications of mis_string on the name “Vazquez Topete, Deyanira”:
mis_string("Vazquez Topete, Deyanira", subs = 2, ins = 2)
## [1] "Vazhquiez Topete, Dmyanika"

mis_string("Vazquez Topete, Deyanira", subs = 1, ins = -2, flip = TRUE)
## [1] "Deynira, uazquez Topet"
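It is worth checking how these two corrupted names look from the point of view of adist. The sketch below simply copies the two outputs shown above and measures their distance from the original; the variable names are only for illustration.

```r
original <- "Vazquez Topete, Deyanira"

# The two outputs shown above, copied verbatim
typos   <- "Vazhquiez Topete, Dmyanika"   # subs = 2, ins = 2
flipped <- "Deynira, uazquez Topet"       # subs = 1, ins = -2, flip = TRUE

d_typos   <- as.vector(adist(original, typos))
d_flipped <- as.vector(adist(original, flipped))

# Four injected edits can never cost more than four moves
d_typos <= 4

# Reordering the two name parts is not a single "move", so the
# flipped version ends up much further from the original
d_flipped > d_typos
```

The number of injected errors is an upper bound on the distance as long as the name order is preserved; a flip, by contrast, makes the raw distance large even when very few letters were changed.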
Master list definitions
To generate the master list we define (using simstudy) a set of key parameters: an indicator pcr identifying whether the person has at least one test (70% will have a test), an indicator resident identifying whether the person is part of our resident cohort (20% of those on the master list will be residents), and a variable for the number of tests an individual has (conditional on having at least 1 test). There will be people on the master list who have no tests and are not residents; these people are removed from the master list.
library(simstudy)

def_n <- defDataAdd(varname = "pcr", formula = 0.7, dist = "binary")
def_n <- defDataAdd(def_n, varname = "resident", formula = 0.2, dist = "binary")
def_n <- defDataAdd(def_n, varname = "n_tests", formula = 3, dist = "noZeroPoisson")
PCR list error definitions
Each person with a PCR test will have one or more records in the PCR list. The following set of definitions indicates the number of substitutions and insertions (both specified as categorical integer variables) as well as whether the first and last names should be flipped.
def_p <- defDataAdd(varname = "subs", formula = ".8;.15;.04;.01", dist = "categorical")
def_p <- defDataAdd(def_p, varname = "ins", formula = "0.02;0.05;0.86;0.05;0.02", dist = "categorical")
def_p <- defDataAdd(def_p, varname = "flip", formula = 0.10, dist = "binary")
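The categorical variables come back as category numbers starting at 1, which is why the generation code later subtracts 1 from subs and 3 from ins. The sketch below uses base R's sample() as a stand-in to show what those shifts produce; it is not how simstudy generates the data internally, and the seed and number of draws are arbitrary.

```r
set.seed(1)   # arbitrary seed, only so this sketch is reproducible

p_subs <- c(0.80, 0.15, 0.04, 0.01)         # categories 1 to 4
p_ins  <- c(0.02, 0.05, 0.86, 0.05, 0.02)   # categories 1 to 5

n_draws <- 10000

# Draw category numbers, then shift them onto the scale used by mis_string
subs <- sample(seq_along(p_subs), n_draws, replace = TRUE, prob = p_subs) - 1
ins  <- sample(seq_along(p_ins),  n_draws, replace = TRUE, prob = p_ins)  - 3

range(subs)   # substitutions run from 0 up to at most 3
range(ins)    # from -2 (two deletions) up to at most 2 (two insertions)

# Observed shares should sit close to the specified probabilities
round(prop.table(table(subs)), 2)
round(prop.table(table(ins)), 2)
```

Under these definitions roughly 80% of records get no substitutions and roughly 86% get no insertions or deletions, so most simulated PCR names are spelled correctly or are off by a single character.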
Generating the data
In this simulation I am generating 50 possible names:
set.seed(3695)

n <- 50

d_master <- data.table(id = 1:n, name = randomNames(n))
d_master <- addColumns(def_n, d_master)
d_master <- d_master[(pcr | resident)]

head(d_master)
##    id           name pcr resident n_tests
## 1:  1  Maas, Synneva   1        0       3
## 2:  2   Rock, Alyssa   1        0       3
## 3:  3    Lee, August   1        0       1
## 4:  4   Keefe, Dylan   1        0       1
## 5:  5       Yang, An   1        0       1
## 6:  6 Andrew, Crysta   1        0       3
In this case, there are 7 individuals in the resident cohort, and 40 individuals have at least one PCR test. Five residents were tested:
d_master[, .(
  num_res = sum(resident), 
  num_pcr = sum(pcr), 
  num_both = sum( (resident & pcr) )
  )
]
##    num_res num_pcr num_both
## 1:       7      40        5
The PCR list will have 110 total tests for the 40 people with tests.
d_pcr <- genCluster(d_master[pcr == 1], "id", "n_tests", "pcr_id")
d_pcr <- addColumns(def_p, d_pcr)

# shift the category numbers onto the scale expected by mis_string
d_pcr[, subs := subs - 1]
d_pcr[, ins := ins - 3]

d_pcr[, obs_name := mapply(mis_string, name, subs, ins, flip)]

d_pcr[, .(pcr_id, id, name, obs_name, subs, ins, flip)]
##      pcr_id id            name         obs_name subs ins flip
##   1:      1  1   Maas, Synneva    Sycnfvq, Maas    3   0    1
##   2:      2  1   Maas, Synneva    Synneva, Maas    0   0    1
##   3:      3  1   Maas, Synneva    Maas, Sznneva    1   0    0
##   4:      4  2    Rock, Alyssa     Rock, Alyssa    0   0    0
##   5:      5  2    Rock, Alyssa     Ropk, Alyssa    1   0    0
##  ---                                                         
## 106:    106 48 Wall, Sebastian Wall, Sebtastian    0   1    0
## 107:    107 48 Wall, Sebastian  Wall, kebastian    1   0    0
## 108:    108 48 Wall, Sebastian  wall, Sebastian    1   0    0
## 109:    109 49   Tafoya, April    Tafoya, April    0   0    0
## 110:    110 49   Tafoya, April    Tafoya, April    0   0    0
We end up with two lists: one with residents only and one with PCR tests. This mimics the actual data we might get from our flawed health data systems.
d_res <- d_master[resident == 1, .(id, name)]
d_pcr <- d_pcr[, .(pcr_id, id, name = obs_name, pcr)]
The truth
Before proceeding to the matching, here are the PCR test records for the residents. This is the correct match that we hope to recover.
d_pcr[ id %in% d_res$id]
##     pcr_id id               name pcr
##  1:     57 26       Diaper, nody   1
##  2:     58 26       Draper, Cody   1
##  3:     59 26      Cody, Drapeyr   1
##  4:     60 27  al-Naqvi, Qamraaa   1
##  5:     61 27  al-Naqvi, Qamraaa   1
##  6:     64 29 el-Hallal, Zahraaa   1
##  7:     65 29 Zahraaa, el-Hallal   1
##  8:     86 42       Allen, Jalyn   1
##  9:     87 42        llen, Jalyn   1
## 10:     88 42       Allen, Jalyn   1
## 11:    102 47 Sanandres, Bzandon   1
## 12:    103 47 Sananores, Brandon   1
## 13:    104 47 Sanandues, Brandon   1
## 14:    105 47 Sanandres, Brandon   1
Fuzzy matching of simulated data
The fuzzy matching is quite simple (and I’ve simplified even more by ignoring the possibility that the data have been flipped). The first step is to merge each PCR row with each resident name, which in this case will result in \(7 \times 110 = 770\) rows. The idea is that we will be comparing each of the names from the PCR tests with each of the resident names. In the merged data table dd, x is the resident name, and name is the PCR test list name.
dd <- data.table(merge(d_res$name, d_pcr))
dd
##                            x pcr_id id          name pcr
##   1:           Korenek, Tara      1  1 Sycnfvq, Maas   1
##   2:            Draper, Cody      1  1 Sycnfvq, Maas   1
##   3:       al-Naqvi, Qamraaa      1  1 Sycnfvq, Maas   1
##   4:      el-Hallal, Zahraaa      1  1 Sycnfvq, Maas   1
##   5: Slee Ackerson, Jeremiah      1  1 Sycnfvq, Maas   1
##  ---                                                    
## 766:       al-Naqvi, Qamraaa    110 49 Tafoya, April   1
## 767:      el-Hallal, Zahraaa    110 49 Tafoya, April   1
## 768: Slee Ackerson, Jeremiah    110 49 Tafoya, April   1
## 769:            Allen, Jalyn    110 49 Tafoya, April   1
## 770:      Sanandres, Brandon    110 49 Tafoya, April   1
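The merge works as a cross join because the two inputs have no column names in common, in which case merge() returns the Cartesian product. Here is a small base R sketch of the same idea using a handful of names from the output above; res_names, pcr, and pairs are illustrative stand-ins rather than the objects used in the post.

```r
# Small stand-in lists (names taken from the output above)
res_names <- c("Draper, Cody", "Allen, Jalyn")
pcr <- data.frame(
  pcr_id = 1:3,
  name   = c("Draper, Cody", "llen, Jalyn", "Tafoya, April"),
  stringsAsFactors = FALSE
)

# With no column names in common, merge() pairs every resident name
# (first column) with every PCR record
pairs <- merge(res_names, pcr)

nrow(pairs)    # 2 residents x 3 PCR records = 6 candidate pairs
names(pairs)
```

The number of candidate pairs grows as the product of the two list sizes, which is fine for 7 by 110 but worth keeping in mind for much larger lists.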
Next, we calculate the string distance for each pair of strings in dd:
dd[, pid := .I]
dd[, dist := adist(x, name), keyby = pid]
dd
##                            x pcr_id id          name pcr pid dist
##   1:           Korenek, Tara      1  1 Sycnfvq, Maas   1   1   10
##   2:            Draper, Cody      1  1 Sycnfvq, Maas   1   2   11
##   3:       al-Naqvi, Qamraaa      1  1 Sycnfvq, Maas   1   3   12
##   4:      el-Hallal, Zahraaa      1  1 Sycnfvq, Maas   1   4   14
##   5: Slee Ackerson, Jeremiah      1  1 Sycnfvq, Maas   1   5   18
##  ---                                                             
## 766:       al-Naqvi, Qamraaa    110 49 Tafoya, April   1 766   13
## 767:      el-Hallal, Zahraaa    110 49 Tafoya, April   1 767   14
## 768: Slee Ackerson, Jeremiah    110 49 Tafoya, April   1 768   19
## 769:            Allen, Jalyn    110 49 Tafoya, April   1 769   11
## 770:      Sanandres, Brandon    110 49 Tafoya, April   1 770   15
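An alternative that skips the row-by-row step is to pass both name vectors to adist at once, since it returns a matrix of distances for every combination. This is a sketch of that variation on a few names from the lists above, not the approach used in the post; res_names, pcr_names, d_mat, and hits are illustrative names.

```r
res_names <- c("Draper, Cody", "Allen, Jalyn")
pcr_names <- c("Draper, Cody", "llen, Jalyn", "Tafoya, April")

# One call returns a (residents x PCR records) matrix of distances
d_mat <- adist(res_names, pcr_names)
dimnames(d_mat) <- list(res_names, pcr_names)
d_mat

# Row and column positions of the pairs within a chosen distance
hits <- which(d_mat <= 3, arr.ind = TRUE)

data.frame(
  resident = res_names[hits[, "row"]],
  pcr_name = pcr_names[hits[, "col"]],
  dist     = d_mat[hits],
  row.names = NULL
)
```

The matrix form makes it easy to look at the closest PCR name for each resident, though the long format is more convenient when other PCR fields need to travel with each pair.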
We can refine the matched list of \(770\) pairs to include only those that differ by 3 or fewer transformations, and can even create a score based on the distances where a score of 100 represents a perfect match. This refined list can then be reviewed manually to make a final determination in case there are any false matches.
dd <- dd[dist <= 3,]
dd[, score := 100 - 5*dist]

dd[, .(staff_name = x, pcr_name = name, pcr_id, pcr, pid, score)]
##             staff_name           pcr_name pcr_id pcr pid score
##  1:       Draper, Cody       Diaper, nody     57   1 394    90
##  2:       Draper, Cody       Draper, Cody     58   1 401   100
##  3:  al-Naqvi, Qamraaa  al-Naqvi, Qamraaa     60   1 416   100
##  4:  al-Naqvi, Qamraaa  al-Naqvi, Qamraaa     61   1 423   100
##  5: el-Hallal, Zahraaa el-Hallal, Zahraaa     64   1 445   100
##  6:       Allen, Jalyn       Allen, Jalyn     86   1 601   100
##  7:       Allen, Jalyn        llen, Jalyn     87   1 608    95
##  8:       Allen, Jalyn       Allen, Jalyn     88   1 615   100
##  9: Sanandres, Brandon Sanandres, Bzandon    102   1 714    95
## 10: Sanandres, Brandon Sananores, Brandon    103   1 721    95
## 11: Sanandres, Brandon Sanandues, Brandon    104   1 728    95
## 12: Sanandres, Brandon Sanandres, Brandon    105   1 735   100
We did pretty well, identifying 12 of the 14 resident records in the PCR data. The two we missed were the result of flipped names.
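One possible way to recover the flipped records is to compare each resident name with both the PCR name as recorded and its flipped version, keeping the smaller of the two distances. This is a sketch of that idea rather than something we implemented; flip_name and flip_dist are hypothetical helpers, and they assume every name contains exactly one comma.

```r
# Hypothetical helper: turn "A, B" into "B, A"
# (assumes the name has exactly one comma)
flip_name <- function(name) {
  parts <- trimws(unlist(strsplit(name, split = ",")))
  paste(parts[2], parts[1], sep = ", ")
}

# Distance to the PCR name as recorded or to its flipped form,
# whichever is smaller
flip_dist <- function(resident, pcr_name) {
  min(adist(resident, pcr_name), adist(resident, flip_name(pcr_name)))
}

# The two records missed above
d1 <- flip_dist("Draper, Cody", "Cody, Drapeyr")
d2 <- flip_dist("el-Hallal, Zahraaa", "Zahraaa, el-Hallal")

c(d1, d2)
```

Both records would now fall well inside the cutoff, at the cost of a second distance calculation per pair and some added risk of false matches for people whose first and last names are similar.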
In practice, a relatively close distance is not necessarily a good match. For example SMITH, MARY and SMITH, JANE are only separated by three letter substitutions, but they are most likely not the same person. We could minimize this problem if we have additional fields to match on, such as date of birth. This would even allow us to increase the string distance we are willing to accept for a possible match without increasing the amount of manual inspection required.
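As a rough sketch of how an additional field could be used, the example below only compares names for pairs that share a date of birth and then applies a more generous cutoff. The records, the dob field, and the cutoff of 5 are all made up for illustration; none of this comes from the actual study data.

```r
# Made-up records for illustration only; dob is a hypothetical field
res <- data.frame(
  name = c("SMITH, MARY", "DEERE, STEVEN"),
  dob  = as.Date(c("1931-04-12", "1940-09-30")),
  stringsAsFactors = FALSE
)
pcr <- data.frame(
  name = c("SMITH, JANE", "DERE, STEPHEN"),
  dob  = as.Date(c("1952-01-05", "1940-09-30")),
  stringsAsFactors = FALSE
)

# Merging on dob keeps only pairs that share a date of birth
pairs <- merge(res, pcr, by = "dob", suffixes = c("_res", "_pcr"))

# String distance for each remaining pair
pairs$dist <- unname(mapply(adist, pairs$name_res, pairs$name_pcr))

# Within a date-of-birth block a more generous cutoff is less risky
pairs[pairs$dist <= 5, ]
```

Here the two SMITH records never get compared because their dates of birth differ, while the DEERE records are paired despite three spelling differences. A date of birth can of course be mis-recorded too, so in practice an exact match on it may be too strict.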