Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

An essential but feared topic in spatial analysis is spatial autocorrelation (SAC). If this term is new to you, check out my primer on autocorrelation in ecology. If you work with spatial data in R (or plan to do so) and want to calculate the degree to which SAC could affect your investigation, read on! If you already know about SAC and have decided it is confounding your analysis, the good news is that there are a variety of sophisticated techniques you can apply, which I review in another post.

Here, I simulate spatial data and work through simple ways to:

• estimate the spatial scale of autocorrelation
• calculate the inflation of degrees of freedom
• aggregate data to decrease SAC

The first step is to generate autocorrelated data on a grid (raster). Santiago Begueria wrote a detailed post about doing this with the gstat package. Alternatively, it’s possible to accomplish the same thing using the fields package. In geospatial analysis on continental scales, I might use an equal-area projection (e.g. Eckert VI). For computational efficiency and simplicity here, however, let us assume that the study area is small and near the equator, so we will work with lat/long as Cartesian coordinates and disregard the Earth’s curvature.

gridDim <- 60
xy <- expand.grid(x=1:gridDim, y=1:gridDim)

Now let’s simulate data with known SAC. In this case, there are two independent rasters that each contain SAC, and a third raster that is a combination of the first two with normal error. I am making a simplification in using independent predictors: most environmental variables co-vary (e.g. surface temperature and precipitation on land, or water depth and light availability in the sea). This is why in the previous post I recommended thinking about the process that generated the SAC in the first place. Also note that if the error was not normally distributed, I would recommend analyzing the data through Generalized Linear Mixed Models (GLMM) or Generalized Estimating Equations (GEE).

library(gstat)
# Variogram model, with defined sill and range
varioMod <- vgm(psill=0.05, range=10, model='Exp')

# Set up an additional variable from simple kriging
zDummy <- gstat(formula=z~1, locations = ~x+y, dummy=TRUE,
beta=1, model=varioMod, nmax=20)

# Generate 2 randomly autocorrelated predictor data fields
set.seed(3)
xyz <- predict(zDummy, newdata=xy, nsim=2)

# Generate an autocorrelated response variable:
# Deterministic on predictors, with normal error
e <- rnorm(nrow(xyz), sd=.5)
keepY <- which(xy$y %in% keepPosition) bigGrain <- xyz[intersect(keepX, keepY),] Now we have a smaller raster, in which values of adjacent cells correlate weakly if at all. How does analysis on this raster compare to analysis on the original data? ## Regression In this section we will build a regression on data at both a coarse and fine resolution, and then compare the performance. First we predict the response variable from the coarse-grain data: fmla <- as.formula(resp ~ sim1 + sim2) lmBig <- lm(fmla, data = bigGrain) summary(lmBig) # Call: # lm(formula = fmla, data = bigGrain) # Residuals: # Min 1Q Median 3Q Max # -1.0081 -0.3124 0.1485 0.3659 0.6990 # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -0.5547 1.2599 -0.440 0.6670 # sim1 0.5998 0.7152 0.839 0.4168 # sim2 1.8767 0.7237 2.593 0.0223 * # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # Residual standard error: 0.5699 on 13 degrees of freedom # Multiple R-squared: 0.3427, Adjusted R-squared: 0.2416 # F-statistic: 3.389 on 2 and 13 DF, p-value: 0.06538 The estimated beta coefficients are approximately 1 standard error from the true values (0.7 and 1.4). Let’s repeat this regression on the fine-grain data: lmFine <- lm(fmla, data = xyz) summary(lmFine) # Call: # lm(formula = fmla, data = xyz) # Residuals: # Min 1Q Median 3Q Max # -1.73515 -0.33877 -0.00586 0.32890 1.76475 # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -0.10916 0.05731 -1.905 0.0569 . # sim1 0.72254 0.04107 17.591 <2e-16 *** # sim2 1.47054 0.04145 35.477 <2e-16 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # Residual standard error: 0.4979 on 3597 degrees of freedom # Multiple R-squared: 0.3183, Adjusted R-squared: 0.3179 # F-statistic: 839.8 on 2 and 3597 DF, p-value: < 2.2e-16 Note the huge inflation in degrees of freedom because of the extra cells, and hence smaller standard errors than in the coarser data. More is not better: additional raster cells add redundant information. In other words, closely spaced observations “pseudo-replicate” each other, which is a violation of the model’s assumption of independence. At this point it may be tempting to think that the finer-grain data result in a more informative model, on the basis of any common performance metric. However, this is an unfair comparison of models. The goal is not to minimize a metric like p-values to satisfy critics, but to accurately and precisely estimate 1) the underlying relationship and 2) our uncertainty about that estimate. If we subset the fine-grain data to the same number of cells as the coarse-grain data, for instance by selecting a corner of the original grid, we find that the original raster does not contain as much information as we may have thought. cornrPosition <- 1:length(keepPosition) keepX <- which(xy$x %in% cornrPosition)
keepY <- which(xy\$y %in% cornrPosition)
cornrFine <- xyz[intersect(keepX, keepY),]

lmCornr <- lm(fmla, data = cornrFine)
summary(lmCornr)

# Call:
# lm(formula = fmla, data = cornrFine)

# Residuals:
#      Min       1Q   Median       3Q      Max
# -0.71805 -0.29791  0.00389  0.28861  0.61303

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   0.2528     1.9329   0.131    0.898
# sim1          0.3068     1.1960   0.257    0.802
# sim2          1.2137     1.6056   0.756    0.463

# Residual standard error: 0.4306 on 13 degrees of freedom
# Multiple R-squared:  0.04222,	Adjusted R-squared:  -0.1051
# F-statistic: 0.2865 on 2 and 13 DF,  p-value: 0.7555

The coarse-grain data provide more precise estimates of the relationship than an equivalent sample of fine-grain data. In this case, more is not better: additional cells add redundant information. In other words, closely spaced observations pseudoreplicate each other, which is a violation of the model’s assumption that the observations are independent of each other.

## So what?

Collecting spatial data takes time, resources, and effort. Due diligence in deciding the appropriate resolution and extent of study up front could could simplify later analysis massively. Variograms are useful tools to calculate a resolution that will yield the most information without violating model assumptions of independence. After data collection, variograms can justify succinctly whether or not subsequent analysis should account for spatial autocorrelation. Aggregation is a quick means to lessen the degree of autocorrelation prior to analysis; post hoc methods can be effective as well, but are often more complex and computational.

The post Exploring spatial autocorrelation in R appeared first on Gwen Antell.