(This article was first published on

Last week I showed rain data from six stations in Netherlands years 1906 till now. The obvious next question is; did it change? A surprisingly difficult question. The data is not normal distributed, but it is time-correlated, location correlated.**Wiekvoet**, and kindly contributed to R-bloggers)### The data

As described before; the data are from KNMI; ROYAL NETHERLANDS METEOROLOGICAL INSTITUTE. The actual page for the data is http://www.knmi.nl/klimatologie/monv/reeksen/select_rr.html. The data selected were from 6 locations; De Bilt, Groningen, Heerde, Kerkwerve, Ter Apel and Winterswijk (Sibinkweg). The blog starts with data entered in a data.frame named all which was derived last week.

### The problems with the data

The data is certainly not normal. Most days have no rain. The data is also location correlated. It is improbable to have beautiful weather in one location and loads of rain 200 km, possibly with the exception of summer, when thunderstorms may be highly local. It is also not often to have nice weather one day and loads of rain the next. For those not living here, examine this plot of 1906 data.

library(ggplot2)

Y1906 <- all[all$year==1906,]

c1 <- ggplot(Y1906, aes(x=day, y=RD,col=location,linetype=location))

c1 + geom_line() +

facet_wrap( ~ Month) +

theme(legend.position = "bottom") +

labs(col = " ",linetype=' ')

### Analysis

There are many ways to analyze the data for differences. In this case I wanted to compare the first and last 10 years of data. A non-parametric method is preferred because the non-normality That leaves all the other problems, correlations between observations. To avoid these I tried to eliminate data. Just one location avoids correlation between locations. Data every sixth day avoids time correlation. De Bilt is chosen as location because it is centre of the country and home of the KNMI. Finally, January, because only one month seems most opportune. It seems quite crude when you look at it.

sel1 <- all[(all$year <1916 | all$year>2003)

& all$Month=='Jan'

& all$location=='DE-BILT'

& all$day %in% seq(1,31,by=5),]

sel1$decade <- factor(c('1906-1915','2004-2012')[1+(sel1$year>1950)])

Packige coin has a one_way function which seems suitable. Distribution approximate makes it do an actual Monte Carlo resampling, and the result is no difference.

oneway_test(RD ~ decade,data=sel1,distribution='approximate')

Approximative 2-Sample Permutation Test

data: RD by decade (1906-1915, 2004-2013)

Z = -0.6353, p-value = 0.535

alternative hypothesis: true mu is not equal to 0

Oneway_test tests for location,Kolmogorov_Smirnov tests for general differences. The assumption is that the data is continuous, not so sure that holds. Result; p-value close to 0.05.

ks.test(sel1$RD[sel1$decade=='1906-1915'],

sel1$RD[sel1$decade=='2004-2013'])

Two-sample Kolmogorov-Smirnov test data: sel1$RD[sel1$decade == "1906-1915"] and sel1$RD[sel1$decade == "2004-2013"] D = 0.2286, p-value = 0.05161 alternative hypothesis: two-sided

The question is why there are differences? After plotting, it seems the number of days without rain changes. The latter seems to show a significant difference.

c1 <- ggplot(sel1, aes(x=RD/10,col=decade))

c1 + geom_density() + xlab('mm rain/day')

xtabs(~ decade + factor(RD==0),data=sel1)

factor(RD == 0)

decade FALSE TRUE

1906-1915 37 33

2004-2013 53 17

decade FALSE TRUE

1906-1915 37 33

2004-2013 53 17

prop.test( xtabs(~ decade + factor(RD==0) ,data=sel1))

2-sample test for equality of proportions with continuity correction

data: xtabs(~decade + factor(RD == 0), data = sel1)

X-squared = 7, df = 1, p-value = 0.008151

alternative hypothesis: two.sided

95 percent confidence interval:

-0.3970179 -0.0601250

sample estimates:

prop 1 prop 2

0.5285714 0.7571429

#### Check with December data

Since January was used to select the analysis method, ideally it should not be used for the p-value. There is data enough; December turns out to show a p-value of 0.005.

sel2 <- all[(all$year <1916 | all$year>2002)

& all$Month=='Dec'

& all$location=='DE-BILT'

& all$day %in% seq(1,31,by=5),]

sel2$decade <- factor(c('1906-1915','2002-2012')[1+(sel2$year>1950)])

prop.test( xtabs(~ decade + factor(RD==0) ,data=sel2))

2-sample test for equality of proportions with continuity correction

data: xtabs(~decade + factor(RD == 0), data = sel2)

X-squared = 7.7712, df = 1, p-value = 0.005309

alternative hypothesis: two.sided

95 percent confidence interval:

-0.36463362 -0.06393781

sample estimates:

prop 1 prop 2

0.6571429 0.8714286

To

**leave a comment**for the author, please follow the link and comment on his blog:**Wiekvoet**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...