Earthquake Analysis (2/4): Categorical Variables Exploratory Analysis

May 1, 2019
By

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Basic Statistics

    Tags

    1. Data Visualisation
    2. Exploratory Analysis
    3. R Programming

    This is the second part of our post series about the exploratory analysis of a publicly available dataset reporting earthquakes and similar events within a specific time window of 30 days. In the following, we are going to analyze the categorical variables of our dataset. The categorical variables can take on one of a limited, and usually fixed a number of possible values. Factor variables are categorical variables that can be either numeric or string variables. R stores categorical variables into a factor. Their analysis may require statistical tools different from the ones used for quantitative variables.

    Packages

    I am going to take advantage of the following packages.

    suppressPackageStartupMessages(library(ggplot2))
    suppressPackageStartupMessages(library(dplyr))
    suppressPackageStartupMessages(library(Hmisc))
    suppressPackageStartupMessages(library(lubridate))
    suppressPackageStartupMessages(library(vcd))
    suppressPackageStartupMessages(library(vcdExtra))
    suppressPackageStartupMessages(library(gmodels))
    

    Packages versions are herein listed.

    packages <- c("ggplot2", "dplyr", "Hmisc", "lubridate", "vcd", "vcdExtra", "gmodels")
    version <- lapply(packages, packageVersion)
    version_c <- do.call(c, version)
    data.frame(packages=packages, version = as.character(version_c))
    ##    packages version
    ## 1   ggplot2   3.1.0
    ## 2     dplyr 0.8.0.1
    ## 3     Hmisc   4.2.0
    ## 4 lubridate   1.7.4
    ## 5       vcd   1.4.4
    ## 6  vcdExtra   0.7.1
    ## 7   gmodels  2.18.1
    

    Running on Windows-10 the following R language version.

    R.version
    ##                _                           
    ## platform       x86_64-w64-mingw32          
    ## arch           x86_64                      
    ## os             mingw32                     
    ## system         x86_64, mingw32             
    ## status                                     
    ## major          3                           
    ## minor          5.2                         
    ## year           2018                        
    ## month          12                          
    ## day            20                          
    ## svn rev        75870                       
    ## language       R                           
    ## version.string R version 3.5.2 (2018-12-20)
    ## nickname       Eggshell Igloo
    

    Getting Data

    As shown in the first post, we start our analysis by downloading the earthquake dataset from earthquake.usgs.gov site, specifically the last 30 days dataset flavor. Please note that such eartquake dataset is day by day updated to cover the last 30 days of data collection. Furthermore, it is not the most recent dataset available, as I collected it some weeks ago. If such dataset is not already present into our workspace, we download and save it to be loaded into the quakes local variable.

    if ("all_week.csv" %in% dir(".") == FALSE) {
      url <- "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
      download.file(url = url, destfile = "all_week.csv")
    }
    quakes <- read.csv("all_month.csv", header=TRUE, sep=',', stringsAsFactors = FALSE)
    
    quakes$time <- ymd_hms(quakes$time)
    quakes$updated <- ymd_hms(quakes$updated)
    quakes$magType <- as.factor(quakes$magType)
    quakes$net <- as.factor(quakes$net)
    quakes$type <- as.factor(quakes$type)
    quakes$status <- as.factor(quakes$status)
    quakes$locationSource <- as.factor(quakes$locationSource)
    quakes$magSource <- as.factor(quakes$magSource)
    

    Exploratory Analysis – Categorical Variables

    The categorical variables can be detected by testing if their class is factor.

    (factor_vars <- names(which(sapply(quakes, class) == "factor")))
    ## [1] "magType"        "net"            "type"           "status"        
    ## [5] "locationSource" "magSource"
    
    length(factor_vars)
    ## [1] 6
    

    The describe() function within HMisc package can be useful for categorical variables as well.

    describe(quakes[,factor_vars])
    ## quakes[, factor_vars] 
    ## 
    ##  6  Variables      8407  Observations
    ## ---------------------------------------------------------------------------
    ## magType 
    ##        n  missing distinct 
    ##     8407        0       10 
    ##                                                                       
    ## Value               mb mb_lg    md    mh    ml   mun    mw   mwr   mww
    ## Frequency      2   604    47  2423    14  5203     2     4    19    89
    ## Proportion 0.000 0.072 0.006 0.288 0.002 0.619 0.000 0.000 0.002 0.011
    ## ---------------------------------------------------------------------------
    ## net 
    ##        n  missing distinct 
    ##     8407        0       14 
    ## 
    ## ak (2469, 0.294), ci (1344, 0.160), hv (253, 0.030), ismpkansas (8,
    ## 0.001), ld (4, 0.000), mb (157, 0.019), nc (1435, 0.171), nm (28, 0.003),
    ## nn (604, 0.072), pr (427, 0.051), se (15, 0.002), us (897, 0.107), uu
    ## (588, 0.070), uw (178, 0.021)
    ## ---------------------------------------------------------------------------
    ## type 
    ##        n  missing distinct 
    ##     8407        0        7 
    ## 
    ## chemical explosion (2, 0.000), earthquake (8232, 0.979), explosion (58,
    ## 0.007), ice quake (16, 0.002), other event (3, 0.000), quarry blast (95,
    ## 0.011), rock burst (1, 0.000)
    ## ---------------------------------------------------------------------------
    ## status 
    ##        n  missing distinct 
    ##     8407        0        2 
    ##                               
    ## Value      automatic  reviewed
    ## Frequency       1691      6716
    ## Proportion     0.201     0.799
    ## ---------------------------------------------------------------------------
    ## locationSource 
    ##        n  missing distinct 
    ##     8407        0       15 
    ##                                                                       
    ## Value         ak    ci    hv  ismp    ld    mb    nc    nm    nn    ok
    ## Frequency   2470  1344   253     8     4   157  1435    28   604     6
    ## Proportion 0.294 0.160 0.030 0.001 0.000 0.019 0.171 0.003 0.072 0.001
    ##                                         
    ## Value         pr    se    us    uu    uw
    ## Frequency    427    15   890   588   178
    ## Proportion 0.051 0.002 0.106 0.070 0.021
    ## ---------------------------------------------------------------------------
    ## magSource 
    ##        n  missing distinct 
    ##     8407        0       15 
    ##                                                                       
    ## Value         ak    ci    hv  ismp    ld    mb    nc    nm    nn    ok
    ## Frequency   2480  1344   253     8     4   157  1435    28   604     5
    ## Proportion 0.295 0.160 0.030 0.001 0.000 0.019 0.171 0.003 0.072 0.001
    ##                                         
    ## Value         pr    se    us    uu    uw
    ## Frequency    427    15   881   588   178
    ## Proportion 0.051 0.002 0.105 0.070 0.021
    ## ---------------------------------------------------------------------------
    

    We notice from the magType description that two records have a null string magType. We then replace them we the NA value.

    quakes$magType[quakes$magType == ""] <- NA
    

    To understand relationship or dependencies among categorical variables, we take advantage of various types of tables and graphical methods. Also stratifying variables can be encompassed in order to highlight if the relationship between two primary variables is the same or different for all levels of the stratifying variable under consideration.

    The contingency table are said to be of one-way flavor when involving just one categorical variable. They are said two-way when involving two categorical variables, and so on (N-way).

    For example, here is the one-way contingency table for the magType variable.

    (tbl <- table(quakes$magType))
    ## 
    ##          mb mb_lg    md    mh    ml   mun    mw   mwr   mww 
    ##     0   604    47  2423    14  5203     2     4    19    89
    

    A graphical representation of the same as bar plot is shown.

    ggplot(data = quakes, aes(x=magType, fill = magType)) + geom_bar(stat='count') + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + guides(fill=FALSE)
    

    One-way contingency table of events based on the network the event has been registered from.

    table(quakes$net)
    ## 
    ##         ak         ci         hv ismpkansas         ld         mb 
    ##       2469       1344        253          8          4        157 
    ##         nc         nm         nn         pr         se         us 
    ##       1435         28        604        427         15        897 
    ##         uu         uw 
    ##        588        178
    

    A graphical representation of the same as bar plot is shown.

    ggplot(data = quakes, aes(x = net, fill = net)) + geom_bar(stat = 'count') + guides(fill = FALSE)
    

    One-way contingency table of the events based on their type.

    table(quakes$type)
    ## 
    ## chemical explosion         earthquake          explosion 
    ##                  2               8232                 58 
    ##          ice quake        other event       quarry blast 
    ##                 16                  3                 95 
    ##         rock burst 
    ##                  1
    

    A graphical representation of the same as bar plot is shown.

    ggplot(data = quakes, aes(x = type, fill = type)) + geom_bar(stat = 'count') + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + guides(fill = FALSE)
    

    One-way contingency table of events based on their status.

    table(quakes$status)
    ## 
    ## automatic  reviewed 
    ##      1691      6716
    

    A graphical representation of the same as bar plot is shown.

    ggplot(data = quakes, aes(x = status, fill = status)) + geom_bar(stat='count') + guides(fill=FALSE)
    

    One-way contingency table of events based on the location source.

    table(quakes$locationSource)
    ## 
    ##   ak   ci   hv ismp   ld   mb   nc   nm   nn   ok   pr   se   us   uu   uw 
    ## 2470 1344  253    8    4  157 1435   28  604    6  427   15  890  588  178
    

    A graphical representation of the same as bar plot is shown.

    ggplot(data = quakes, aes(x = locationSource, fill = locationSource)) + geom_bar(stat='count') + guides(fill=FALSE)
    

    One-way contingency table of the events based on the network that originally authored the reported magnitude for this event.

    table(quakes$magSource)
    ## 
    ##   ak   ci   hv ismp   ld   mb   nc   nm   nn   ok   pr   se   us   uu   uw 
    ## 2480 1344  253    8    4  157 1435   28  604    5  427   15  881  588  178
    

    A graphical representation of the same as bar plot is shown.

    ggplot(data = quakes, aes(x = magSource, fill = magSource)) + geom_bar(stat='count') + guides(fill=FALSE)
    

    Two-way contingency table based upon the network as data contributor and the magType, i.e. the method or algorithm used to calculate the preferred magnitude for the event

    table(quakes$net, quakes$magType)
    ##             
    ##                     mb mb_lg   md   mh   ml  mun   mw  mwr  mww
    ##   ak            0    1     0    0    0 2467    0    0    0    1
    ##   ci            0    0     0    0    3 1339    2    0    0    0
    ##   hv            0    0     0  145    0  108    0    0    0    0
    ##   ismpkansas    0    0     0    0    0    8    0    0    0    0
    ##   ld            0    0     0    0    0    4    0    0    0    0
    ##   mb            0    0     0    6    0  151    0    0    0    0
    ##   nc            0    0     0 1423    0    7    0    3    0    0
    ##   nm            0    0     0   28    0    0    0    0    0    0
    ##   nn            0    0     0    0    0  604    0    0    0    0
    ##   pr            0    0     0  425    0    2    0    0    0    0
    ##   se            0    0     0   15    0    0    0    0    0    0
    ##   us            0  603    47    0    0  140    0    0   19   88
    ##   uu            0    0     0  365    0  222    0    1    0    0
    ##   uw            0    0     0   16   11  151    0    0    0    0
    

    A graphical representation of the same as bar plot is shown.

    ggplot(data = quakes, aes(x=net, fill = magType)) + geom_bar(stat='count')
    

    Starting from this two-way contingency table:

    (tbl <- table(quakes$net, quakes$status))
    ##             
    ##              automatic reviewed
    ##   ak              1105     1364
    ##   ci                29     1315
    ##   hv                90      163
    ##   ismpkansas         0        8
    ##   ld                 0        4
    ##   mb                 0      157
    ##   nc               460      975
    ##   nm                 0       28
    ##   nn                 6      598
    ##   pr                 0      427
    ##   se                 0       15
    ##   us                 0      897
    ##   uu                 0      588
    ##   uw                 1      177
    

    its corresponding row proportions table is shown. Each row shows the fraction of automatic/reviewed earthquakes given a certain network. Each row sum up to one.

    prop.table(tbl, 1)
    ##             
    ##                automatic    reviewed
    ##   ak         0.447549615 0.552450385
    ##   ci         0.021577381 0.978422619
    ##   hv         0.355731225 0.644268775
    ##   ismpkansas 0.000000000 1.000000000
    ##   ld         0.000000000 1.000000000
    ##   mb         0.000000000 1.000000000
    ##   nc         0.320557491 0.679442509
    ##   nm         0.000000000 1.000000000
    ##   nn         0.009933775 0.990066225
    ##   pr         0.000000000 1.000000000
    ##   se         0.000000000 1.000000000
    ##   us         0.000000000 1.000000000
    ##   uu         0.000000000 1.000000000
    ##   uw         0.005617978 0.994382022
    

    Column proportions table. Each row shows the fraction of earthquakes network given a specific status (automatic/reviewed). Each column sums up to one.

    prop.table(tbl, 2)
    ##             
    ##                 automatic     reviewed
    ##   ak         0.6534594914 0.2030970816
    ##   ci         0.0171496156 0.1958010721
    ##   hv         0.0532229450 0.0242703990
    ##   ismpkansas 0.0000000000 0.0011911852
    ##   ld         0.0000000000 0.0005955926
    ##   mb         0.0000000000 0.0233770101
    ##   nc         0.2720283856 0.1451756998
    ##   nm         0.0000000000 0.0041691483
    ##   nn         0.0035481963 0.0890410959
    ##   pr         0.0000000000 0.0635795116
    ##   se         0.0000000000 0.0022334723
    ##   us         0.0000000000 0.1335616438
    ##   uu         0.0000000000 0.0875521144
    ##   uw         0.0005913661 0.0263549732
    

    Let us condider the bar plot that can be the plot to represent the net based events counts and, at the same time, highlighting its status with different fill colors.

    ggplot(data = quakes, aes(x = net, fill = status)) + geom_bar(stat = 'count')
    

    We want to give a better graphical representation, where the different proportion in status can be better perceived. We then build a dataframe collecting our network + status earthquakes information in frequency form. Starting from it, we will show the resulting spineplot.

    tbl_df <- as.data.frame(tbl)
    colnames(tbl_df) <- c("net", "status", "Freq")
    tbl_df$net <- factor(tbl_df$net)
    tbl_df
    ##           net    status Freq
    ## 1          ak automatic 1105
    ## 2          ci automatic   29
    ## 3          hv automatic   90
    ## 4  ismpkansas automatic    0
    ## 5          ld automatic    0
    ## 6          mb automatic    0
    ## 7          nc automatic  460
    ## 8          nm automatic    0
    ## 9          nn automatic    6
    ## 10         pr automatic    0
    ## 11         se automatic    0
    ## 12         us automatic    0
    ## 13         uu automatic    0
    ## 14         uw automatic    1
    ## 15         ak  reviewed 1364
    ## 16         ci  reviewed 1315
    ## 17         hv  reviewed  163
    ## 18 ismpkansas  reviewed    8
    ## 19         ld  reviewed    4
    ## 20         mb  reviewed  157
    ## 21         nc  reviewed  975
    ## 22         nm  reviewed   28
    ## 23         nn  reviewed  598
    ## 24         pr  reviewed  427
    ## 25         se  reviewed   15
    ## 26         us  reviewed  897
    ## 27         uu  reviewed  588
    ## 28         uw  reviewed  177
    

    The spineplot makes more evident count differences of the status among net providing with a common scale in the range [0,1].

    (xtabs_res <- xtabs(Freq ~ net + status, data = tbl_df))
    ##             status
    ## net          automatic reviewed
    ##   ak              1105     1364
    ##   ci                29     1315
    ##   hv                90      163
    ##   ismpkansas         0        8
    ##   ld                 0        4
    ##   mb                 0      157
    ##   nc               460      975
    ##   nm                 0       28
    ##   nn                 6      598
    ##   pr                 0      427
    ##   se                 0       15
    ##   us                 0      897
    ##   uu                 0      588
    ##   uw                 1      177
    
    spineplot(xtabs_res)
    

    The xtabs() function creates cross-tabulations of data using the formula style input. Further, applying the summary() to the xtabs() result, we get a chi-squared test of independence of all factors, while indicating the number of cases and dimensions.

    type_net_count <- quakes %>% group_by(status, net) %>% dplyr::summarise(Freq = n())
    xtabs_res <- xtabs(Freq ~ net + status, data = type_net_count)
    summary(xtabs_res)
    ## Call: xtabs(formula = Freq ~ net + status, data = type_net_count)
    ## Number of cases in table: 8407 
    ## Number of factors: 2 
    ## Test for independence of all factors:
    ##  Chisq = 2082.2, df = 13, p-value = 0
    ##  Chi-squared approximation may be incorrect
    

    Same spineplot as resulting by a slightly different approach.

    spineplot(xtabs_res)
    

    Margin tables are another way to summarise categorical data.

    (mt <- margin.table(tbl, 2:1))
    ##            
    ##               ak   ci   hv ismpkansas   ld   mb   nc   nm   nn   pr   se
    ##   automatic 1105   29   90          0    0    0  460    0    6    0    0
    ##   reviewed  1364 1315  163          8    4  157  975   28  598  427   15
    ##            
    ##               us   uu   uw
    ##   automatic    0    0    1
    ##   reviewed   897  588  177
    

    Per row and per column sums can be added by means of the addmargins() function.

    addmargins(mt)
    ##            
    ##               ak   ci   hv ismpkansas   ld   mb   nc   nm   nn   pr   se
    ##   automatic 1105   29   90          0    0    0  460    0    6    0    0
    ##   reviewed  1364 1315  163          8    4  157  975   28  598  427   15
    ##   Sum       2469 1344  253          8    4  157 1435   28  604  427   15
    ##            
    ##               us   uu   uw  Sum
    ##   automatic    0    0    1 1691
    ##   reviewed   897  588  177 6716
    ##   Sum        897  588  178 8407
    

    The Cross Tabulation result is shown.

    mt <- margin.table(table(quakes$status, quakes$type), 2:1)
    CrossTable(mt, prop.chisq = FALSE, prop.c = TRUE, prop.r = TRUE, format = "SPSS")
    ## 
    ##    Cell Contents
    ## |-------------------------|
    ## |                   Count |
    ## |             Row Percent |
    ## |          Column Percent |
    ## |           Total Percent |
    ## |-------------------------|
    ## 
    ## Total Observations in Table:  8407 
    ## 
    ##                    |  
    ##                    | automatic  |  reviewed  | Row Total | 
    ## -------------------|-----------|-----------|-----------|
    ## chemical explosion |        0  |        2  |        2  | 
    ##                    |    0.000% |  100.000% |    0.024% | 
    ##                    |    0.000% |    0.030% |           | 
    ##                    |    0.000% |    0.024% |           | 
    ## -------------------|-----------|-----------|-----------|
    ##         earthquake |     1688  |     6544  |     8232  | 
    ##                    |   20.505% |   79.495% |   97.918% | 
    ##                    |   99.823% |   97.439% |           | 
    ##                    |   20.079% |   77.840% |           | 
    ## -------------------|-----------|-----------|-----------|
    ##          explosion |        3  |       55  |       58  | 
    ##                    |    5.172% |   94.828% |    0.690% | 
    ##                    |    0.177% |    0.819% |           | 
    ##                    |    0.036% |    0.654% |           | 
    ## -------------------|-----------|-----------|-----------|
    ##          ice quake |        0  |       16  |       16  | 
    ##                    |    0.000% |  100.000% |    0.190% | 
    ##                    |    0.000% |    0.238% |           | 
    ##                    |    0.000% |    0.190% |           | 
    ## -------------------|-----------|-----------|-----------|
    ##        other event |        0  |        3  |        3  | 
    ##                    |    0.000% |  100.000% |    0.036% | 
    ##                    |    0.000% |    0.045% |           | 
    ##                    |    0.000% |    0.036% |           | 
    ## -------------------|-----------|-----------|-----------|
    ##       quarry blast |        0  |       95  |       95  | 
    ##                    |    0.000% |  100.000% |    1.130% | 
    ##                    |    0.000% |    1.415% |           | 
    ##                    |    0.000% |    1.130% |           | 
    ## -------------------|-----------|-----------|-----------|
    ##         rock burst |        0  |        1  |        1  | 
    ##                    |    0.000% |  100.000% |    0.012% | 
    ##                    |    0.000% |    0.015% |           | 
    ##                    |    0.000% |    0.012% |           | 
    ## -------------------|-----------|-----------|-----------|
    ##       Column Total |     1691  |     6716  |     8407  | 
    ##                    |   20.114% |   79.886% |           | 
    ## -------------------|-----------|-----------|-----------|
    ## 
    ## 
    

    With the help of the structable within vcd package, we show a 3-WAY contingency table.

    structable(type ~ status + magType, data = quakes)
    ##                   type chemical explosion earthquake explosion ice quake other event quarry blast rock burst
    ## status    magType                                                                                           
    ## automatic                               0          0         0         0           0            0          0
    ##           mb                            0          1         0         0           0            0          0
    ##           mb_lg                         0          0         0         0           0            0          0
    ##           md                            0        515         0         0           0            0          0
    ##           mh                            0          0         0         0           0            0          0
    ##           ml                            0       1172         3         0           0            0          0
    ##           mun                           0          0         0         0           0            0          0
    ##           mw                            0          0         0         0           0            0          0
    ##           mwr                           0          0         0         0           0            0          0
    ##           mww                           0          0         0         0           0            0          0
    ## reviewed                                0          0         0         0           0            0          0
    ##           mb                            0        603         0         0           0            0          0
    ##           mb_lg                         0         47         0         0           0            0          0
    ##           md                            0       1893         2         0           0           13          0
    ##           mh                            0         14         0         0           0            0          0
    ##           ml                            0       3873        53        16           3           82          1
    ##           mun                           2          0         0         0           0            0          0
    ##           mw                            0          4         0         0           0            0          0
    ##           mwr                           0         19         0         0           0            0          0
    ##           mww                           0         89         0         0           0            0          0
    

    Geographical Regions vs. Depth Category Earthquakes Analysis

    Based on ref. [5] definitions we create new factor variables.

    earthquakes <- quakes %>% filter(type == "earthquake")
    earthquakes <- earthquakes[complete.cases(earthquakes[, c("latitude", "longitude", "depth")]),]
    
    emisphere_ns <- as.factor(ifelse(earthquakes$latitude >= 0, "north", "south"))
    emisphere_ew <- as.factor(ifelse(earthquakes$longitude >= 0, "east", "west"))
    
    earthquakes$region <- ifelse(earthquakes$latitude > 0 & earthquakes$longitude > 0, "NE", 
                                 ifelse(earthquakes$latitude > 0  & earthquakes$longitude < 0, "NW",
                                        ifelse(earthquakes$latitude < 0 & earthquakes$longitude > 0, "SE", "SW")))
    
    earthquakes$depth_type <- ifelse(earthquakes$depth <= 70, "shallow", ifelse(earthquakes$depth <= 300, "intermediate", "deep"))
    earthquakes$region <- factor(earthquakes$region)
    earthquakes$depth_type <- factor(earthquakes$depth_type)
    

    Resulting tables are shown.

    (tbl <- table(earthquakes$region, earthquakes$depth_type))
    ##     
    ##      deep intermediate shallow
    ##   NE    9           65     197
    ##   NW    0          526    7109
    ##   SE   10           40     109
    ##   SW   31           45      91
    
    prop.table(tbl, 1)
    ##     
    ##            deep intermediate    shallow
    ##   NE 0.03321033   0.23985240 0.72693727
    ##   NW 0.00000000   0.06889325 0.93110675
    ##   SE 0.06289308   0.25157233 0.68553459
    ##   SW 0.18562874   0.26946108 0.54491018
    
    prop.table(tbl, 2)
    ##     
    ##            deep intermediate    shallow
    ##   NE 0.18000000   0.09615385 0.02624567
    ##   NW 0.00000000   0.77810651 0.94710898
    ##   SE 0.20000000   0.05917160 0.01452172
    ##   SW 0.62000000   0.06656805 0.01212363
    

    We wonder if the region plays a role for the depth type classification. We run the following chi-square test to figure it out.

    chisq.test(earthquakes$region, earthquakes$depth_type, simulate.p.value = TRUE)
    ## 
    ##  Pearson's Chi-squared test with simulated p-value (based on 2000
    ##  replicates)
    ## 
    ## data:  earthquakes$region and earthquakes$depth_type
    ## X-squared = 1322.4, df = NA, p-value = 0.0004998
    

    And it is statistically significant as the reported p-value highlights.

    The Cross Tabulation is as follows.

    mt <- margin.table(tbl, 2:1)
    CrossTable(mt, prop.chisq = FALSE, prop.c = TRUE, prop.r = TRUE, format = "SPSS")
    ## 
    ##    Cell Contents
    ## |-------------------------|
    ## |                   Count |
    ## |             Row Percent |
    ## |          Column Percent |
    ## |           Total Percent |
    ## |-------------------------|
    ## 
    ## Total Observations in Table:  8232 
    ## 
    ##              |  
    ##              |       NE  |       NW  |       SE  |       SW  | Row Total | 
    ## -------------|-----------|-----------|-----------|-----------|-----------|
    ##         deep |        9  |        0  |       10  |       31  |       50  | 
    ##              |   18.000% |    0.000% |   20.000% |   62.000% |    0.607% | 
    ##              |    3.321% |    0.000% |    6.289% |   18.563% |           | 
    ##              |    0.109% |    0.000% |    0.121% |    0.377% |           | 
    ## -------------|-----------|-----------|-----------|-----------|-----------|
    ## intermediate |       65  |      526  |       40  |       45  |      676  | 
    ##              |    9.615% |   77.811% |    5.917% |    6.657% |    8.212% | 
    ##              |   23.985% |    6.889% |   25.157% |   26.946% |           | 
    ##              |    0.790% |    6.390% |    0.486% |    0.547% |           | 
    ## -------------|-----------|-----------|-----------|-----------|-----------|
    ##      shallow |      197  |     7109  |      109  |       91  |     7506  | 
    ##              |    2.625% |   94.711% |    1.452% |    1.212% |   91.181% | 
    ##              |   72.694% |   93.111% |   68.553% |   54.491% |           | 
    ##              |    2.393% |   86.358% |    1.324% |    1.105% |           | 
    ## -------------|-----------|-----------|-----------|-----------|-----------|
    ## Column Total |      271  |     7635  |      159  |      167  |     8232  | 
    ##              |    3.292% |   92.748% |    1.931% |    2.029% |           | 
    ## -------------|-----------|-----------|-----------|-----------|-----------|
    ## 
    ## 
    

    The spineplot is able to highlight the better difference in proportions than the barplot. We herein below show the resulting spineplot.

    type_region_count <- earthquakes %>% group_by(depth_type, region) %>% dplyr::summarise(Freq = n())
    xtabs_res <- xtabs(Freq ~ depth_type + region, data = type_region_count)
    spineplot(xtabs_res)
    

    Under independence hypothesis, we may compute expected frequencies as follows.

    (exp <- independence_table(mt))
    ##               
    ##                        NE         NW          SE         SW
    ##   deep           1.646016   46.37391   0.9657434   1.014334
    ##   intermediate  22.254130  626.97522  13.0568513  13.713800
    ##   shallow      247.099854 6961.65087 144.9774052 152.271866
    

    We now bring our contingency table in frequency form.

    tbl_df <- as.data.frame(tbl)
    colnames(tbl_df) <- c("region", "depth_type", "Freq")
    tbl_df$region <- factor(tbl_df$region)
    tbl_df$depth_type <- factor(tbl_df$depth_type)
    tbl_df
    ##    region   depth_type Freq
    ## 1      NE         deep    9
    ## 2      NW         deep    0
    ## 3      SE         deep   10
    ## 4      SW         deep   31
    ## 5      NE intermediate   65
    ## 6      NW intermediate  526
    ## 7      SE intermediate   40
    ## 8      SW intermediate   45
    ## 9      NE      shallow  197
    ## 10     NW      shallow 7109
    ## 11     SE      shallow  109
    ## 12     SW      shallow   91
    

    Pattern of association can be revealed by the CMHtest() within vcdExtra package or the sieve plot.

    CMHtest(Freq ~ region + depth_type, data = earthquakes)
    ## Cochran-Mantel-Haenszel Statistics for region by depth_type 
    ## 
    ##                  AltHypothesis   Chisq Df        Prob
    ## cor        Nonzero correlation  271.56  1  5.1951e-61
    ## rmeans  Row mean scores differ  817.17  3 8.1840e-177
    ## cmeans  Col mean scores differ  609.18  2 5.2283e-133
    ## general    General association 1322.22  6 1.6789e-282
    

    The sieve plot highlight frequencies different from the expected as determined by independency. The area of each rectangle is always proportional to expected frequency, however the observed frequency is shown by the number of squares in each rectangle.

    sieve(Freq ~ region + depth_type, data = earthquakes, shade = TRUE, legend = TRUE, labeling = labeling_values)
    

    Further, color and intensity of shadows together with the Pearson residuals legend highlight deviations from the expected.

    The association plot puts deviation from independence in the foreground, in the sense that the area of each box is made proportional to the (observed – expected) frequency. Cells with observed > expected frequency rise above the baseline representing independence, while cells that contain less than the expected frequency fall below it.

    assoc(~ region + depth_type, data = earthquakes, shade = TRUE, legend = TRUE, labeling = labeling_values)
    

    The mosaic plot can be applied to N-way contingency tables. For a 2-way table as herein shown, the mosaic display is like a grouped barchart, where the heights (or widths) of the bars show the relative frequencies of one variable and widths (or heights) of the sections in each bar show the conditional frequencies of the second variable given the first.

    mosaic(~ region + depth_type, data = earthquakes, shade = TRUE, legend = TRUE, labeling = labeling_residuals)
    

    Please note that in above mosaic plot, the frequency equal to zero for NW region deep earthquake does not find an explicit graphical representation. There are several labeling option for the mosaic plot, see Table 5.1 at ref. [4]. In our example, we choose to label the cells with the Pearson residuals value. The Pearson residuals can be computed as herein shown.

    resids <- (mt - exp)/sqrt(exp)
    round(resids, 1)
    ##                  NE   NW   SE   SW
    ##   deep          5.7 -6.8  9.2 29.8
    ##   intermediate  9.1 -4.0  7.5  8.4
    ##   shallow      -3.2  1.8 -3.0 -5.0
    

    Where mt is the margin table, exp is the expected value based on indipendence hypothesis as previously computed above.

    If you want to go more in detail with categorical data exploratory analysis, read ref. [4] as a valuable source of technical content and examples.

    If you have any questions, please feel free to comment below.

    References

    1. Earthquake dataset
    2. Eathquake dataset terms
    3. Introductory Statistics with R, 2nd Edition, P. Dalgaard, Springer
    4. Discrete Data Analysis with R, M. Friendly D. Meyer, CRC press
    5. Determining the Depth of an Earthquake

    Related Post

    1. Earthquake Analysis (1/4): Quantitative Variables Exploratory Analysis
    2. How to calculate the correlation coefficients for more than two variables
    3. Six Sigma DMAIC Series in R – Part 5
    4. Dow Jones Stock Market Index (2/4): Trade Volume Exploratory Analysis
    5. Dow Jones Stock Market Index (1/4): Log Returns Exploratory Analysis

    Interested in guest posting? We would love to share your codes and ideas with our community.

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



    If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

    Comments are closed.

    Search R-bloggers


    Sponsors

    Never miss an update!
    Subscribe to R-bloggers to receive
    e-mails with the latest R posts.
    (You will not see this message again.)

    Click here to close (This popup will not appear again)