More water, a bit more about saints

Posted on January 25, 2017 by Maëlle Salmon in R bloggers | 0 Comments

[This article was first published on Maëlle, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I was lucky enough to get some nice and interesting feedback on my last post. One comment was really useful and pretty embarrassing: I had written “see” instead of “sea” in the whole post… Thanks Steve Dempsey for the correction! I also got some questions which I decided to explore.

More water?

I had chosen to look specifically for a few rivers but one commenter, actually Steve Dempsey again, asked me how it would look like if I systematically retrieved all placenames with “sur-“, “upon”, because where he lived in France there were many “sur-Dronne” or “sur l’Isle”. He actually asked the question in French, “Je me demande ce que ça donne avec tous les places avec un “-sur”. Là, où j’étais en France il y avait plein de “-sur-Dronne” ou “-sur-l’Isle”.” Let’s see! (or sea? just kidding)

library("dplyr")
library("tidyr")
library("readr")
library("purrr")
ville <- read_tsv("data/2017-01-24-kervillebourg_FR.txt", col_names = FALSE)[, c(2, 5, 6)]
names(ville) <- c("placename", "latitude", "longitude")
ville <- unique(ville)
upon <- filter(ville, grepl("sur-", placename))

If truth be told, I thought I’d just need to filter the relevant placenames and then get nice rivers on my map… nope. So I soon realized I should first find unique names of rivers, then group them by name and only draw the ones that have enough places for the map to be at least a bit pretty.

upon <- by_row(upon, function(df){
  sub(".*sur-", "", df$placename)
}, .to = "waterthing", .collate = "cols")
knitr::kable(head(upon))

placename	latitude	longitude	waterthing
Domecy-sur-le-Vault	47.49084	3.80953	le-Vault
Yzeures-sur-Creuse	46.78609	0.87166	Creuse
Yville-sur-Seine	49.39938	0.87750	Seine
Wingen-sur-Moder	48.91900	7.37955	Moder
Willer-sur-Thur	47.84392	7.07223	Thur
Wavrans-sur-Ternoise	50.41322	2.30002	Ternoise

Let’s see what the more frequent watherthings are:

group_by(upon, waterthing) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(n = 20) %>%
  knitr::kable()

waterthing	n
Mer	214
Loire	152
Seine	148
Marne	117
Meuse	88
Saône	70
Sarthe	50
Aube	44
Cher	41
Orne	39
Seille	39
Moselle	36
Eure	34
Yonne	33
Oise	32
Vienne	31
Garonne	30
Dordogne	28
Aisne	26
Allier	24

So I find “Sea” again, and the rivers I had chosen in the last post, and others, whose names I all knew so they must be quite important. I want to keep only the rivers with “enough”” places to make the map pretty, because at the scale of the country I prefer seeing a longer river.

upon <- group_by(upon, waterthing)
upon <- filter(upon, n() >= 25)
upon <- ungroup(upon)

library("ggplot2")
library("ggmap")
library("viridis")
library("ggthemes")
map <- ggmap::get_map(location = "France", zoom = 6, maptype = "watercolor")

This time the map will be really artsier than useful since I’ll have a color by river but no legend because of the higher numbers of rivers.

p <- ggmap(map) +
  geom_point(data = upon,
             aes(x = longitude, y = latitude,
                 col = waterthing), size = 1.3) +
  theme_map()+
  scale_colour_grey(end = 0.7)+ 
  ggtitle("French placenames containing the word 'sur'") +
  theme(plot.title = element_text(lineheight=1, face="bold"))+
  theme(text = element_text(size=14),
        legend.position = "none")
p

I like this map because I could find the rivers again without having to enter their names, and it looks like drawing with a pencil on watercolor without making any effort. However, since I’m not very good in geography, I’d like to add labels to each river, let’s say somewhere in the middle of the river. I’ll use ggrepel to avoid having overlapping labels. My definition of “somewhere in the middle” is sorting the places of a river by latitude and longitude and then choosing a place close to the middle.

library("ggrepel")
named_upon <- arrange(upon, latitude, longitude)
named_upon <- group_by(named_upon, waterthing)
named_upon <- mutate(named_upon, index = 1:n())
named_upon <- mutate(named_upon, name = ifelse(index == floor(n()/2),
                                   waterthing, NA))
p + geom_label_repel(aes(x = longitude,
                   y = latitude, 
                   label = name,
                   max.iter = 20000),
               data = named_upon)

Okay, I’ll now look at the map and learn the names! I’ve lived near the Loire, near the Seine and near the Marne. It might be surprising not to see the Rhine on that map, but placenames in Alsace are often more Germanic, so that would be another story.

Which saints?

I also got questions about saints, one on Twitter from Andrew Boa and a few ones from Andrew MacDonald.

Andrew Boa’s question was the easiest one: how many unique names of saints are there? The questions of the other Andrew were “I wonder, would it be possible to also obtain data on how old these towns are? It would be interesting to see if the gender of popular saints changes over time. Or, more simply, which saints have the most towns? I imagine there are tons of “Saint Sernin”s in the South, and probably lots of “Jeanne-d’Arc”s all over”. Let me tell you I am thankful for the more simply part of the questions because I have no idea where one would find information about the age of the places. Nonetheless, this would be really interesting.

Let’s answer the questions about the number of unique names and their frequencies. I am now in particular curious about André since I got questions from two Andrew’s.

Note that for finding the name I remove the “Saint-“ or “Sainte-“ part but also everything that could come after another hyphen and a space, e.g. in “Alban-de-Varèze” which I want as “Alban”. Also note that in this analysis I ignore homonym saints, and that Jeanne d’Arc might be a saint, the places named after her don’t contain the word “sainte”.

saints <- ville %>%
  mutate(saint = grepl("Saint-", placename))  %>%
  mutate(sainte = grepl("Sainte-", placename))  %>%
  gather("gender", "yes", saint:sainte) %>%
  filter(yes == TRUE) %>%
  select(- yes)
saints <- by_row(saints, function(df){
  if(df$gender == "saint"){
    name <- sub(".*Saint-", "", df$placename)
  }else{
    name <- sub(".*Sainte-", "", df$placename)
  }
  name <- trimws(name)
  name <- strsplit(name, "-")[[1]][1]
  name <- strsplit(name, " ")[[1]][1]
  return(name)
}, .to = "saintname", .collate = "cols")
knitr::kable(head(saints))

placename	latitude	longitude	gender	saintname
Ygos-Saint-Saturnin	43.97651	-0.73780	saint	Saturnin
Canal de Vitry-le-François à Saint-Dizier	48.73333	4.60000	saint	Dizier
Vitrac-Saint-Vincent	45.79585	0.49356	saint	Vincent
Vineuil-Saint-Firmin	49.20024	2.49567	saint	Firmin
Villotte-Saint-Seine	47.42893	4.70571	saint	Seine
Villiers-Saint-Denis	49.00000	3.26667	saint	Denis

How many unique names do we have?

group_by(saints, gender) %>%
  summarize(n_names = length(unique(saintname)),
            n_places = n()) %>%
  knitr::kable()

gender	n_names	n_places
saint	1205	10310
sainte	157	971

Let’s look at the distributions of number of places by saint name, separately for saints then saintes.

saints_freq <- group_by(saints, gender, saintname) 
saints_freq <- summarize(saints_freq, n_places = n())

filter(saints_freq, gender == "saint") %>%
ggplot() +
  geom_histogram(aes(n_places))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

filter(saints_freq, gender == "sainte") %>%
ggplot() +
  geom_histogram(aes(n_places))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The information I get from looking at these ugly histograms is that there are some names that are very popular and a lot of them that are rarely used. Let’s look at the 11 more popular ones for saints and saintes. Don’t ask me why I chose 11!

arrange(saints_freq, desc(n_places)) %>%
  filter(gender == "saint") %>%
  head(n = 11) %>%
  knitr::kable()

gender	saintname	n_places
saint	Martin	618
saint	Jean	416
saint	Pierre	391
saint	Germain	304
saint	Julien	219
saint	Laurent	215
saint	Georges	214
saint	Hilaire	181
saint	Aubin	169
saint	Michel	168
saint	André	164

So for saints I see names that are still common in France. And André as 11th one, which might explain why I chose the 11 most popular ones…

arrange(saints_freq, desc(n_places)) %>%
  filter(gender == "sainte") %>%
  head(n = 11) %>%
  knitr::kable()

gender	saintname	n_places
sainte	Marie	122
sainte	Croix	69
sainte	Colombe	68
sainte	Marguerite	52
sainte	Anne	50
sainte	Foy	36
sainte	Eulalie	29
sainte	Gemme	25
sainte	Geneviève	24
sainte	Catherine	21
sainte	Radegonde	20

I’m fascinated by some names like Radegonde that are not common any longer.

And since looking at rare names is so fun (isn’t it?), let’s look at the 11th of the least popular names.

arrange(saints_freq, n_places) %>%
  filter(gender == "saint") %>%
  head(n = 11) %>%
  knitr::kable()

gender	saintname	n_places
saint	Aaron	1
saint	Alexis	1
saint	Alvard	1
saint	Amas	1
saint	Amond	1
saint	Anastase	1
saint	Andre´	1
saint	Andrieu	1
saint	Andrieux	1
saint	Anne	1
saint	Apoll	1

I think some rare names might be errors of accents, e.g. Andre might be André and seeing Anne in the list of saints I’m now wondering about the number of female names classified as saints in the dataset. There are 10 “Saint Mary” in the dataset and 122 “Sainte Marie” so I think the gender inbalance would still exist when accounting for this, but it’s certainly a good point to keep in mind when using this Geonames dataset! If I were to really tackle the issue, I think I’d try using the genderizer package on names, although I’m not so sure it’d perform well for French old names and the rOpensci gender package doesn’t have an historical dataset for France (yet?). I could also simply look for a very French dataset of place names, and hope no name would be translated in it.

Beside discovering this limitation of the dataset, I liked the names one can see by browsing the least popular saint names, like Eusoge or Exupère.

arrange(saints_freq, n_places) %>%
  filter(gender == "sainte") %>%
  head(n = 11) %>%
  knitr::kable()

gender	saintname	n_places
sainte	Ame	1
sainte	Assise	1
sainte	Avoye	1
sainte	Awawa	1
sainte	Blaizine	1
sainte	Clotilde	1
sainte	Élisabeth	1
sainte	Éturien	1
sainte	Eugienne	1
sainte	Eulard	1
sainte	Genevieve	1

This is similarly fascinating and also shows me it might be interesting to re-analyse the all dataset without diacritical accents (Genevieve should be Geneviève).

And Saint Maël, you might ask? Well Maël or Maëlle are Breton names so there’s no place called Saint Maël in the dataset like my holy patron, but there is Maël-Carhaix and Maël-Pestivien. I can’t speak Breton, Wikipedia tells me that Pestivien means “the end of the sources” and for Carhaix it seems to be a long story. After looking at this dataset, I have the impression that the more toponomy questions one tries to answer, the more new questions one has. While this is awesome because of all the learning it implies, I’ll conclude this post (and probably my hobby career as a toponomist :)) by simply plotting the Jeanne d’Arc places for Andrew MacDonald, and Domrémy-la-Pucelle where she was born. There was no Saint-Sermin!

ville <- mutate(ville, jeanne = grepl("Jeanne [dD].[aA]rc", placename))
ville <- mutate(ville, domremy = (placename == "Domrémy-la-Pucelle"))
jeanne <- filter(ville, jeanne | domremy)
jeanne <- tidyr::gather(jeanne, "word", "value", jeanne:domremy)
jeanne <- filter(jeanne, value)
ggmap(map) +
  geom_point(data = jeanne,
             aes(x = longitude, y = latitude, col = word), size = 3) +
  theme_map()+ 
  ggtitle("French placenames containing 'Jeanne d'Arc'") +
  theme(plot.title = element_text(lineheight=1, face="bold")) +
  scale_color_viridis(discrete = TRUE)

So there are a few Jeanne d’Arc places all over as predicted by Andrew, none very close to Domrémy-la-Pucelle but then this was called Domrémy the virgin after her already.

If you want to play with the dataset yourself, you’ll find it on Geonames, see this gist and in the data folder of this Github repo. If you do, don’t hesitate to share your findings!

To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

More water, a bit more about saints

More water?

Which saints?

Related

More water?

Which saints?

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)