When Homoglyphs Attack! Generating Phishing Domain Names with R

[This article was first published on R – rud.is, and kindly contributed to R-bloggers].

It’s likely you’ve seen the news regarding yet another researcher showing off a phishing domain attack. The technique is pretty simple:

  • find a target domain you want to emulate
  • register a homoglyph version of it
  • use the hacker’s favorite tool, Let’s Encrypt, to serve it up with a nice, shiny green lock icon
  • deploy some content
  • phish someone
  • Profit!

The phishing works since International Domain Names have been “a thing” for a while (anything for the registrars to make more money) and Let’s Encrypt provides a domain-laundering service for these attackers. But why should attackers have all the fun? Let’s make some domain homoglyphs in R.

Have Glyph, Will Hack

Rob Dawson has a spiffy homoglyph generator and even has a huge glyph-alike file, but we don’t need the full list to don the hacker cap for this exercise. I’ve made a stripped-down version of it that has (mostly) glyphs that should display correctly in “western” locales. You can pull the full list and tweak the example to broaden the attack capabilities. Let’s take a look:

library(stringi)
library(urltools)
library(purrr)

URL <- "https://rud.is/dl/homoglyphs.txt" # trimmed down from https://github.com/codebox/homoglyph
fil <- basename(URL)
invisible(try(httr::GET(URL, httr::write_disk(fil)), silent = TRUE))

chars <- stri_read_lines(fil)
idx_char <- stri_sub(chars, 1,1)
stri_sub(chars, 1, 1) <-  ""
chars <- set_names(chars, idx_char)

tail(chars)
##                                         u 
##          "ʋυцս\u1d1cu??????????????????" 
##                                         v 
##        "νѵט\u1d20ⅴ∨⋁v??????????????????" 
##                                         w 
##                                      "w" 
##                                         x 
##                "×хᕁᕽ᙮ⅹ⤫⤬⨯x?????????????" 
##                                         y 
## "ɣʏγуүყ\u1d8c\u1effℽy??????????????????" 
##                                         z 
##                   "\u1d22z?????????????"

What we did there was read in the homoglyph lines and create a lookup table keyed by Latin character. Now we need a transformation function.
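To make the shape of that lookup table concrete without downloading the file, here is a minimal base R sketch. The three input lines and the name toy_chars are made up for illustration; they mimic the layout described above (a Latin character followed by its look-alikes) and are not copied from homoglyphs.txt.

```r
# Hypothetical stand-in for three lines of the glyph file: the first character
# is the Latin letter, the rest are look-alikes (the letter itself included).
toy_lines <- c(
  "a\u0430a",        # U+0430 CYRILLIC SMALL LETTER A
  "e\u0435e",        # U+0435 CYRILLIC SMALL LETTER IE
  "o\u043e\u03bfo"   # U+043E CYRILLIC SMALL LETTER O, U+03BF GREEK SMALL LETTER OMICRON
)

keys   <- substr(toy_lines, 1, 1)     # lookup key: the Latin character
glyphs <- substring(toy_lines, 2)     # everything after the key
toy_chars <- setNames(glyphs, keys)   # named character vector, like `chars`

names(toy_chars)
nchar(toy_chars[["o"]])               # number of candidate glyphs for "o"
```

Each value is a single string of candidates, which is why the transformation function has to pick one character position out of it.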

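to_homoglyph <- function(domain) {

  # split off the public suffix (e.g. "com") so the result keeps a valid TLD
  suf <- suffix_extract(domain)
  domain <- stri_replace_last_fixed(domain, sprintf(".%s", suf$suffix[1]), "")

  # one element per character of what is left of the host name
  domain_split <- stri_split_boundaries(domain, type="character")[[1]]

  map_chr(domain_split, ~{
    found <- chars[.x]
    if (is.na(found)) {
      .x # no entry in the lookup table (digits, "-", "."): keep the character
    } else {
      # pick one glyph position at random from the candidates
      pos <- sample(stri_count_boundaries(found, type="character"), 1)
      stri_sub(found, pos, pos)
    }
  }) %>%
    c(".", suf$suffix[1]) %>%
    stri_join(collapse="")

}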

The basic idea is to:

  • carve out the domain suffix (we need to ensure valid TLDs/suffixes are used in the final domain)
  • split the input domain into separate characters
  • select a homoglyph of the character at random
  • join the separate glyphs and the TLD/suffix back together.

We can try it out with a very familiar domain:

(converted <- to_homoglyph("google.com"))
## [1] "ƍ၀໐?|?.com"

Now, that’s using all possible homoglyphs and it might not look like google.com to you, but imagine whittling down the list to ones that are really close to Latin character set matches. Or, imagine you’re in a hurry and see that version of Google’s URL with a shiny, green lock icon from Let’s Encrypt. You might not really give it a second thought if the page looked fine (or if you were on a mobile browser without a location bar showing).
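One rough way to do that whittling is to keep only glyphs from scripts whose letters tend to resemble Latin ones. The sketch below assumes the Greek (U+0370 to U+03FF) and Cyrillic (U+0400 to U+04FF) blocks are a reasonable proxy for “really close”; that is a heuristic of convenience, not something the glyph file encodes. keep_close() and the toy table are hypothetical names used only for this example.

```r
# Toy lookup table in the same shape as `chars` (entries are illustrative).
toy_chars <- c(
  o = "\u043e\u03bf\u1040o",  # Cyrillic o, Greek omicron, U+1040 (Myanmar block), Latin o
  w = "w"
)

# Keep ASCII, Greek and Cyrillic code points; drop everything else.
keep_close <- function(glyphs) {
  cp <- utf8ToInt(glyphs)               # code points of every glyph in the string
  ok <- cp < 0x80 |                     # plain ASCII (the Latin letter itself)
    (cp >= 0x0370 & cp <= 0x03FF) |     # Greek and Coptic block
    (cp >= 0x0400 & cp <= 0x04FF)       # Cyrillic block
  intToUtf8(cp[ok])                     # back to a single string
}

close_chars <- vapply(toy_chars, keep_close, character(1))
nchar(close_chars)                      # candidates left per Latin character
```

A stricter filter would compare rendered glyph shapes rather than code point ranges, but the block test is enough to drop the obviously foreign-looking candidates.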

What’s the solution?

Firefox has a configuration setting to turn these IDNs into punycode in the location bar. What does that mean? We can use the urltools::puny_encode() function to find out:

puny_encode("ƍ၀໐?|?.com")
## [1] "xn--|-npa992hbmb6w79iesa.com"

Most folks will be much less likely to trust that domain name (if they bother looking in the location bar). Note that it will still have the “everything’s OK” green Let’s Encrypt lock icon, but you shouldn’t be trusting SSL/TLS anymore for integrity or authenticity anyway.
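If your browser does not do this for you, a similar check is easy to approximate in code. The sketch below flags host names that contain a punycode label (the xn-- prefix) or raw non-ASCII characters. looks_like_idn() is a hypothetical helper written in base R; it only spots IDNs, handles one host name at a time, and makes no judgement about whether a domain is malicious.

```r
# Returns TRUE when a host name contains a punycode label or non-ASCII text.
looks_like_idn <- function(host) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]  # dot-separated labels
  has_puny <- any(startsWith(labels, "xn--"))                # ASCII-encoded IDN label
  has_non_ascii <- any(utf8ToInt(enc2utf8(host)) > 127L)     # raw Unicode in the name
  has_puny || has_non_ascii
}

looks_like_idn("xn--|-npa992hbmb6w79iesa.com")  # TRUE: punycode label
looks_like_idn("g\u043e\u043egle.com")          # TRUE: Cyrillic "o" characters
looks_like_idn("google.com")                    # FALSE: plain ASCII
```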

Chrome Canary (super early bird alpha versions) expands IDNs to punycode by default today and a shorter-cycle release to stable channel is forthcoming. I’m told Edge does somewhat sane things with IDNs and if Safari doesn’t presently handle them Apple will likely release an interstitial security update to handle it.

FIN

See if you can generate some fun look-alikes, such as ???????.com, then drop some latte change to register an IDN and add a free hacking certificate to it to see just how easy this entire process is. Note that attackers are automating this process, so they may have beaten you to your favorite homoglyph IDN.

If you’re on Chrome, give the Punycode Alert extension a go if you’d like some extra notification/protection from these domains.

NOTE: to_homoglyph() is not vectorised (it’s an exercise left to the reader).
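One way to approach that exercise is to wrap the scalar function in vapply(). The sketch below uses a hypothetical stand-in, to_homoglyph_one(), built on a toy lookup table and a naive “everything after the last dot” suffix split, so that it runs without the glyph file, stringi or urltools. With the real function, only the wrapper at the bottom is needed.

```r
# Toy lookup table (illustrative entries, not the real glyph file).
toy_chars <- c(a = "\u0430a", e = "\u0435e", o = "\u043e\u03bfo")

# Scalar stand-in for to_homoglyph(): one domain in, one domain out.
to_homoglyph_one <- function(domain) {
  parts  <- strsplit(domain, ".", fixed = TRUE)[[1]]
  suffix <- parts[length(parts)]                      # naive: last label only
  name   <- paste(parts[-length(parts)], collapse = ".")
  letters_in <- strsplit(name, "")[[1]]               # one element per character
  swapped <- vapply(letters_in, function(ch) {
    pool <- toy_chars[ch]
    if (is.na(pool)) return(ch)                       # no entry: keep the character
    cp <- utf8ToInt(pool)
    intToUtf8(cp[sample.int(length(cp), 1)])          # pick one glyph at random
  }, character(1), USE.NAMES = FALSE)
  paste0(paste(swapped, collapse = ""), ".", suffix)
}

# Vectorised wrapper: apply the scalar function to each domain in turn.
to_homoglyphs <- function(domains) {
  vapply(domains, to_homoglyph_one, character(1), USE.NAMES = FALSE)
}

set.seed(1)
to_homoglyphs(c("google.com", "example.org"))
```

Using vapply() with character(1) makes the wrapper fail loudly if the scalar function ever returns something other than a single string.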
