Continental Language Diversity

[This article was first published on More or Less Numbers, and kindly contributed to R-bloggers].

Language data lends itself to demonstrating many visualization techniques, so I used another data set, the official languages spoken across continents, to try a new visualization from the “UpSetR” package.  The graph compares sets of data numerically.  It conveys the kind of information a Venn diagram does, but more quantitatively.  Venn diagrams are visually appealing, but once the number of sets grows past what the eye can easily interpret, they no longer show how large each set or overlap is.

Grouping languages by continent was one way I thought to illustrate the use of the “UpSetR” package.  The data limits each country to one primary language, and each country is assigned to a continent based on its location.  Below we can see the continents and the bar graphs associated with a continent or a set of continents.  The bar graph on the left measures the number of languages in each continent.  The bar graph at the top measures the number of languages that fall into each combination of continents marked by the filled circles beneath it.  So each filled dot, or connected set of dots, represents one grouping of languages.
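To make the two bar graphs concrete, here is a minimal sketch of how the numbers behind them can be computed.  It uses only a small excerpt of the data (eight languages, with the empty Antarctic column left out), and the names `langs`, `set_sizes` and `intersection_sizes` are my own, not anything from the package.  The `upset()` call at the end is a guess at a reasonable invocation, not necessarily the one used for the graphs in this post.

```r
# Excerpt of the data: one row per language, one 0/1 column per continent.
langs <- data.frame(
  language      = c("Arabic", "Dutch", "English", "French",
                    "German", "Hindi", "Portuguese", "Spanish"),
  Africa        = c(1, 0, 1, 1, 0, 0, 1, 1),
  Asia          = c(1, 0, 0, 0, 0, 1, 0, 0),
  Europe        = c(0, 1, 1, 1, 1, 0, 1, 1),
  North.America = c(0, 1, 1, 1, 0, 0, 0, 1),
  Oceania       = c(0, 0, 1, 1, 0, 0, 0, 0),
  South.America = c(0, 0, 1, 1, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)
continents <- setdiff(names(langs), "language")

# Left-hand bars: number of languages in each continent.
set_sizes <- colSums(langs[continents])

# Top bars: number of languages sharing exactly the same combination of continents.
combo <- apply(langs[continents], 1,
               function(r) paste(continents[r == 1], collapse = " & "))
intersection_sizes <- sort(table(combo), decreasing = TRUE)

print(set_sizes)
print(intersection_sizes)

# Optional: draw the plot itself if UpSetR is installed.
if (requireNamespace("UpSetR", quietly = TRUE)) {
  UpSetR::upset(langs, sets = continents, order.by = "freq")
}
```

In this excerpt the only combination that holds more than one language is the five-continent one, which is the kind of grouping a connected set of dots stands for.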



For instance, the red dot indicates 27 languages that occur only in Europe, making it seemingly the most diverse continent in terms of official languages spoken.  The yellow dot set represents 2 languages that are spoken in countries located on 5 different continents (any guesses?*).  This graph does not capture the many different languages spoken within each country, only the “lingua franca” associated with a continent.  For instance, Africa, though linguistically diverse, has colonial-era languages as its official languages, and so appears above to lack diversity.

Alternatively, we can see the transposition of this graph below.  Here the intersections are small, because we are now looking at intersections of continents, and there is one for each continent.  Guessing the continents in this graph is perhaps a bit easier than guessing the languages for each continent above.
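One way to read "transposition" is that the languages become the sets and the continents become the things being counted.  The sketch below assumes that reading and again uses only a handful of languages; the names `m` and `by_continent` are made up for the example.

```r
# Rows are languages, columns are continents (excerpt of the data, 0/1 membership).
m <- rbind(
  Arabic     = c(1, 1, 0, 0, 0, 0),
  English    = c(1, 0, 1, 1, 1, 1),
  French     = c(1, 0, 1, 1, 1, 1),
  German     = c(0, 0, 1, 0, 0, 0),
  Hindi      = c(0, 1, 0, 0, 0, 0),
  Portuguese = c(1, 0, 1, 0, 0, 1),
  Spanish    = c(1, 0, 1, 1, 0, 1)
)
colnames(m) <- c("Africa", "Asia", "Europe",
                 "North.America", "Oceania", "South.America")

# Transpose: continents become the rows, languages become the 0/1 set columns.
by_continent <- as.data.frame(t(m))

# The combination of languages found on each continent.
combo <- apply(by_continent, 1,
               function(r) paste(names(by_continent)[r == 1], collapse = " & "))
print(combo)

# Every continent has its own combination, so every intersection has size 1.
print(table(combo))
```

Passing `by_continent` to `upset()` with the language columns as sets should give a graph of this transposed kind, though I have not checked that it matches the one shown here.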



For data that is organized in sets, this kind of visualization lets several dimensions of the data be understood at once, in a way other visualizations do not capture.  For this particular language data set, it gives an interesting view of official language use across continents.  The data used for these graphs is shown below.

The very few lines of code it took to make these graphs are available here.  Many thanks to the developers of this package!

*English and French


rows Africa Antarctic Asia Europe North.America Oceania South.America Countries
1 Albanian 0 0 0 1 0 0 0 1
2 Arabic 1 0 1 0 0 0 0 17
3 Armenian 0 0 0 1 0 0 0 1
4 Azerbaijani 0 0 1 0 0 0 0 1
5 Belarusian 0 0 0 1 0 0 0 1
6 Bosnian 0 0 0 1 0 0 0 1
7 Bulgarian 0 0 0 1 0 0 0 1
8 Catalan 0 0 0 1 0 0 0 1
9 Croatian 0 0 0 1 0 0 0 1
10 Czech 0 0 0 1 0 0 0 1
11 Danish 0 0 0 1 0 0 0 1
12 Dutch 0 0 0 1 1 0 0 3
13 English 1 0 0 1 1 1 1 39
14 Estonian 0 0 0 1 0 0 0 1
15 Filipino 0 0 1 0 0 0 0 1
16 Finnish 0 0 0 1 0 0 0 1
17 French 1 0 0 1 1 1 1 22
18 Georgian 0 0 0 1 0 0 0 1
19 German 0 0 0 1 0 0 0 4
20 Greek 0 0 0 1 0 0 0 2
21 Heard 0 1 0 0 0 0 0 1
22 Hebrew 0 0 1 0 0 0 0 1
23 Hindi 0 0 1 0 0 0 0 1
24 Hungarian 0 0 0 1 0 0 0 1
25 Icelandic 0 0 0 1 0 0 0 1
26 Indonesian 0 0 1 0 0 0 0 1
27 Italian 0 0 0 1 0 0 0 2
28 Japanese 0 0 1 0 0 0 0 1
29 Khmer 1 0 0 0 0 0 0 1
30 Korean 0 0 1 0 0 0 0 2
31 Lao 0 0 1 0 0 0 0 1
32 Latvian 0 0 0 1 0 0 0 1
33 Lithuanian 0 0 0 1 0 0 0 1
34 Malay 0 0 1 0 0 0 0 1
35 Maltese 0 0 0 1 0 0 0 1
36 Mandarin 0 0 1 0 0 0 0 1
37 Norwegian 0 0 0 1 0 0 0 1
38 Persian 0 0 1 0 0 0 0 1
39 Polish 0 0 0 1 0 0 0 1
40 Portuguese 1 0 0 1 0 0 1 7
41 Romanian 0 0 0 1 0 0 0 1
42 Russian 0 0 1 0 0 0 0 2
43 Slovak 0 0 0 1 0 0 0 1
44 Slovenian 0 0 0 1 0 0 0 1
45 Spanish 1 0 0 1 1 0 1 22
46 Swahili 1 0 0 0 0 0 0 1
47 Swedish 0 0 0 1 0 0 0 1
48 Thai 0 0 1 0 0 0 0 1
49 Turkish 0 0 1 0 0 0 0 1
50 Ukranian 0 0 0 1 0 0 0 1
51 Vietnamese 0 0 1 0 0 0 0 1
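For anyone who wants to work with this table directly, here is a small sketch of reading it back into R.  It assumes the table is pasted as text exactly as printed, and it uses only seven of the rows to keep things short; `printed`, `continent_cols` and `n_continents` are names I made up.  Because the header line has one field fewer than the data rows, `read.table()` treats the leading row numbers as row names.

```r
# A few rows of the table, copied as printed above.
printed <- "rows Africa Antarctic Asia Europe North.America Oceania South.America Countries
2 Arabic 1 0 1 0 0 0 0 17
12 Dutch 0 0 0 1 1 0 0 3
13 English 1 0 0 1 1 1 1 39
17 French 1 0 0 1 1 1 1 22
19 German 0 0 0 1 0 0 0 4
40 Portuguese 1 0 0 1 0 0 1 7
45 Spanish 1 0 0 1 1 0 1 22"

langs <- read.table(text = printed, header = TRUE, stringsAsFactors = FALSE)

# "rows" holds the language name and "Countries" is a count, not a set,
# so only the remaining columns are continent memberships.
continent_cols <- setdiff(names(langs), c("rows", "Countries"))

# Number of continents on which each language is official.
n_continents <- rowSums(langs[continent_cols])

# Languages found on the largest number of continents.
widest <- langs$rows[n_continents == max(n_continents)]
print(widest)
```

Run on these rows, `widest` holds the two languages from the footnote.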
