Chester (sorry, Liverpool) is the Most Popular City in the World (relative to use as password per inhabitant)

[This article was first published on R on Sastibe's Data Science Blog, and kindly contributed to R-bloggers.]

Update 2018-02-17: The title of this article has changed to reflect new information I have received since publishing. For more information, see the last section of this post.

A treasure trove of leaked passwords

The API of pwnedpasswords.com is quite remarkable. Not only does it let you do from the comfort of your shell what you would otherwise do in the browser interface, namely type in your e-mail address and find out whether or not you’ve been pwned. It also lets you check very simply whether a certain password has ever appeared in any of the dumps they hold and, if so, how often. Since haveibeenpwned.com has collected over 550 million of these passwords from a multitude of data breaches, there is a fair chance that yours is among them.

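Using the API is straightforward, but the way it is secured, so that even pwnedpasswords.com itself doesn’t know precisely which password or even which password hash you are checking, is ingenious. In short, only the first five characters of the password's SHA-1 hash are sent, and the rest of the hash is matched locally against the returned list. I highly recommend reading their tutorial. The following R snippet returns the number of hits for a single password; applying it to each element of a vector (for instance with plyr's ldply) yields the counts for a whole list of passwords: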

library(httr)
library(digest)
library(plyr)
library(dplyr)
library(stringr)

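# Return how often a single password occurs in the Pwned Passwords data.
popularity <- function(password){
  # SHA-1 of the raw string (serialize = FALSE hashes the text itself)
  passw_hash = digest(password, "sha1", serialize = FALSE)
  # Only the first five hex characters are sent to the API ...
  passw_front = toupper(substr(passw_hash, 1, 5))
  # ... the remaining characters are compared locally
  passw_back = toupper(substr(passw_hash, 6, 200))
  hashes = 
    read.table(text = content(GET(paste0("https://api.pwnedpasswords.com/range/", 
                                         passw_front)),
                              as = "text", encoding = "UTF-8"),
               sep = ":", stringsAsFactors = FALSE) %>%
    rename(hashes = V1, count = V2) %>%
    filter(hashes == passw_back) %>%
    mutate(password = password) %>%
    select(password, count)
  # No matching suffix: the password does not occur in the data
  if(nrow(hashes) == 0){
    hashes <- tibble("password" = password, "count" = 0)
  }
  return(hashes)
}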
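To make the local matching step more tangible, here is a minimal offline sketch in base R. The helper names split_hash and count_in_response are hypothetical, and the response body is made up for illustration (its counts are invented); only the SHA-1 digest of the string "password" is a real value.

```r
# Hypothetical helper: split a 40-character SHA-1 hex digest into the
# 5-character prefix sent to the API and the 35-character suffix kept locally.
split_hash <- function(sha1_hex) {
  sha1_hex <- toupper(sha1_hex)
  list(prefix = substr(sha1_hex, 1, 5),
       suffix = substr(sha1_hex, 6, nchar(sha1_hex)))
}

# Hypothetical helper: look up one suffix in a "SUFFIX:COUNT" response body.
count_in_response <- function(response_text, suffix) {
  rows <- read.table(text = response_text, sep = ":",
                     col.names = c("suffix", "count"),
                     colClasses = c("character", "integer"))
  hit <- rows$count[rows$suffix == suffix]
  if (length(hit) == 0) 0 else hit[1]
}

# SHA-1 digest of the string "password"
parts <- split_hash("5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8")

# Made-up response body: the suffixes other than ours and all counts are invented
mock_response <- paste(
  "003D68EB55068C33ACE09247EE4C639306B:3",
  paste0(parts$suffix, ":42"),
  "01330C689E5D64F660D6947A93AD634EF8F:1",
  sep = "\n")

found   <- count_in_response(mock_response, parts$suffix)
missing <- count_in_response(mock_response, strrep("F", 35))
```

Reading the suffix column explicitly as character avoids any risk of a digit-heavy suffix being parsed as a number.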

City names as passwords

This function lets us search through various ranges of passwords. For instance, let's see how many people have chosen the names of cities in Baden-Württemberg as their passwords [1]:

City Name     Number of Usages as Password
Freiburg                              3077
Stuttgart                             9496
Karlsruhe                             1426
Heidelberg                            4081
Mannheim                              5040
Konstanz                               924
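The lookup keys follow the rule from footnote 1 (lowercase only, spaces removed). A minimal sketch of that normalisation in base R follows; the helper name normalise_city is hypothetical, and the final lookup is left as a comment because it needs network access and the popularity() function defined earlier.

```r
# Hypothetical helper: turn a city name into the password string to look up,
# i.e. lowercase letters only, with spaces removed.
normalise_city <- function(city) {
  tolower(gsub(" ", "", city, fixed = TRUE))
}

keys <- normalise_city(c("New York", "Green Bay", "Stuttgart"))

# With network access, the counts could then be collected along these lines:
# counts <- do.call(rbind, lapply(keys, popularity))
```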

The city name "stuttgart" appears most often in the password list, which is hardly surprising, as Stuttgart is also the largest of these cities. A plot of the number of password hits in relation to the number of inhabitants [2] looks like this:

The red circle describes the number of inhabitants, the black circle the number of usages as password. The plot was created with ggmap.

Quite obviously, the ratio of "number of uses of city name as password" to "number of inhabitants", i.e. "Use of City Name as Password per Inhabitant", differs from city to city. The ratio appears to be higher for Heidelberg and Freiburg, each of which is known for a high quality of living and a very picturesque old town. So, let's look at some international (and especially British) competition:

City Name     Inhabitants   Usages as Password   Usages per 1000 Inhabitants
Liverpool          473073               280723                         593.4
Manchester         520215                98831                         190.0
Oxford             161291                23069                         143.0
Cambridge          151832                12648                          83.3
Heidelberg         160601                 4081                          25.4
London            8787892               196220                          22.3
Mannheim           307997                 5040                          16.4
Paris             2190327                28699                          13.1
Berlin            3613495                40952                          11.3
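The last column is simply 1000 times the password count divided by the number of inhabitants, rounded to one decimal place. As a small sketch in base R, using the figures from the table above (the data frame and column names are chosen here for illustration):

```r
# Inhabitants and password counts, copied from the table above
cities <- data.frame(
  city = c("Liverpool", "Manchester", "Oxford", "Cambridge", "Heidelberg",
           "London", "Mannheim", "Paris", "Berlin"),
  inhabitants = c(473073, 520215, 161291, 151832, 160601,
                  8787892, 307997, 2190327, 3613495),
  count = c(280723, 98831, 23069, 12648, 4081,
            196220, 5040, 28699, 40952)
)

# Usages of the city name as password per 1000 inhabitants
cities$per_1000 <- round(1000 * cities$count / cities$inhabitants, 1)

# Sort from most to least "popular"
ranking <- cities[order(-cities$per_1000), ]
```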

In this longer list, two effects become obvious: having a famous football team (Liverpool, Manchester) and having a famous university in a small city (Oxford, Cambridge and Heidelberg). In other words, what matters is not so much how many people live in a city as how many people feel a positive connection to it, whether through loyalty to a sports team or through time spent at its university.

Let me conclude this article with the obvious question: "Can any city beat Liverpool in this contest?" All my manual samples so far have yielded respectable results, but nothing close to Liverpool's numbers, for instance:

City Name     Inhabitants   Usages as Password   Usages per 1000 Inhabitants
Liverpool          473073               280723                         593.4
Green Bay          105139                23069                         143.0
Barcelona         1620805               152196                         129.9
Thus, until further notice, I hereby proclaim that Liverpool is the most popular city in the world (relative to password use per inhabitant).

Update 2018-02-17: Chester has overtaken Liverpool

I published this post on February 8th and have since received feedback concerning its core result (i.e. which city is the most popular), in particular the following tweet:

@MurrayData is correct of course: "chester" has 117128 occurrences in the database. "Chester" is not only a city name, however, but also a Christian name, in particular that of the late lead singer of Linkin Park, Chester Bennington. It is plausible that many of these passwords have nothing to do with the city itself. Yet the same argument holds for soccer teams, as I had already pointed out in my original post. I have therefore opted to ignore these distortions and hereby present the new, updated table:

City Name     Inhabitants   Usages as Password   Usages per 1000 Inhabitants
Chester            118200               117128                         990.9
Liverpool          473073               280723                         593.4
Manchester         520215                98831                         190.0
Oxford             161291                23069                         143.0
Cambridge          151832                12648                          83.3
Heidelberg         160601                 4081                          25.4
London            8787892               196220                          22.3
Mannheim           307997                 5040                          16.4
Paris             2190327                28699                          13.1
Berlin            3613495                40952                          11.3

  1. Rules are "only lowercase letters, spaces are eliminated". So for "New York" I looked for usage of "newyork", for instance.
  2. Numbers of inhabitants as taken from wikipedia.org.
