(This article was first published on

Stone flakes are waste products from the tool making process in the stone age. This is the second post, first post was clustering, second linking to hominid type. The data also contains a more or less continuous age variable, which gives possibility to use regression, which is the topic of this week.**Wiekvoet**, and kindly contributed to R-bloggers)### Data

Data source see first post. Regarding age especially: '*in millenia (not to be taken too seriously) ... mode of dating (geological=more accurate, typological)'*. On inspection, it seems most of the typological dates are 200k, hence it seemed an idea to ignore typological. In addition, age is negative. Since at some point I want to use a Box-Cox transformation, for which positive is required, -age is added.

r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')

r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')

r12 <- merge(r1,r2)

r12$Group <- factor(r12$group,labels=c('Lower Paleolithic',

'Levallois technique',

'Middle Paleolithic',

'Homo Sapiens'))

r12$mAge <- -r12$age

r12c <- r12[complete.cases(r12),]

table(r12$age ,r1$dating )

geo typo

-400 2 0

-300 8 1

-200 13 12

-130 1 0

-120 12 0

-80 23 2

-40 3 0

### Regression

The tool to start is linear regression. This shows there is a relation, mostly age with FSF and LBI. It is not a very good model, the standard error is at least 55 thousand years.

l1 <- lm(age ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,

data=r12c[r12c$dating=='geo',])

summary(l1)

Call:

lm(formula = age ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 +

PROZD, data = r12c[r12c$dating == "geo", ])

Residuals:

Min 1Q Median 3Q Max

-127.446 -30.646 -7.889 27.790 159.471

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -328.8260 310.6756 -1.058 0.2953

LBI 149.1334 65.8323 2.265 0.0281 *

RTI -0.8196 2.8498 -0.288 0.7749

WDI 16.2067 20.1351 0.805 0.4249

FLA -1.6769 1.8680 -0.898 0.3739

PSF -0.9222 1.0706 -0.861 0.3934

FSF 1.9496 0.8290 2.352 0.0229 *

ZDF1 0.9537 1.1915 0.800 0.4275

PROZD 1.2245 2.0068 0.610 0.5447

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.09 on 47 degrees of freedom

Multiple R-squared: 0.7062, Adjusted R-squared: 0.6562

F-statistic: 14.12 on 8 and 47 DF, p-value: 3.23e-10

Some model validation can be made through the car package. It gives the impression that the error increases with age. It is not a particular strong effect, but then, the age range is not that large either, 40 to 400, which is a factor 10.

library(car)

par(mfrow=c(2,2))

plot(l1,ask=FALSE)

Since I know of no theoretical basis to chose a transformation, Box-Cox is my method of choice to proceed. Lambda zero, or log transformation, is certainly a choice which seems reasonable, hence it has been selected.

r12cx <- r12c[r12c$dating=='geo',]

summary(p1 <- powerTransform(

mAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,

r12cx))

bcPower Transformation to Normality

Est.Power Std.Err. Wald Lower Bound Wald Upper Bound

Y1 0.1211 0.1839 -0.2394 0.4816

Likelihood ratio tests about transformation parameters

LRT df pval

LR test, lambda = (0) 0.4343012 1 5.098859e-01

LR test, lambda = (1) 21.3406407 1 3.844933e-06

#### Linear regression, step 2

Having chosen a transformation, it is time to rerun the model. It is now clear that FSF is most important and LBI a bit less.

r12cx$lAge <- log(-r12cx$age)

l1 <- lm(lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,

data=r12cx)

summary(l1)

Call:

lm(formula = lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 +

PROZD, data = r12cx)

Residuals:

Min 1Q Median 3Q Max

-1.04512 -0.18333 0.07013 0.21085 0.58648

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 5.071224 2.030941 2.497 0.01609 *

LBI -0.953462 0.430357 -2.216 0.03161 *

RTI 0.005561 0.018630 0.298 0.76664

WDI -0.041576 0.131627 -0.316 0.75350

FLA 0.018477 0.012211 1.513 0.13695

PSF 0.002753 0.006999 0.393 0.69580

FSF -0.015956 0.005419 -2.944 0.00502 **

ZDF1 -0.004485 0.007789 -0.576 0.56748

PROZD -0.009941 0.013119 -0.758 0.45236

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3601 on 47 degrees of freedom

Multiple R-squared: 0.6882, Adjusted R-squared: 0.6351

F-statistic: 12.97 on 8 and 47 DF, p-value: 1.217e-09

#### Plot of regression

With two independent variables, it is easy to make a nice plot:

l2 <- lm(lAge ~ LBI + FSF ,

data=r12cx)

par(mfrow=c(1,1))

incont <- list(x=seq(min(r12cx$LBI),max(r12cx$LBI),length.out=12),

y=seq(min(r12cx$FSF),max(r12cx$FSF),length.out=13))

topred <- expand.grid(LBI=incont$x,

FSF=incont$y)

topred$p1 <- predict(l2,topred)

incont$z <- matrix(-exp(topred$p1),nrow=length(incont$x))

contour(incont,xlab='LBI',ylab='FSF')

cols <- colorRampPalette(c('violet','gold','seagreen'))(4)

with(r12cx,text(x=LBI,y=FSF,ID,col=cols[group]))

### Predictions

I started in data analysis as a chemometrician, my method of choice for a predictive model with correlated independent variables is PLS. In this case one component seems enough (lowest cross validation RMSEP), but the model explains 60% of log(age) variability, which is not impressive.

library(pls)

p1 <- mvr(lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,

data=r12cx,

method='simpls',

validation='LOO',

scale=TRUE,

ncomp=5)

summary(p1)

Data: X dimension: 56 8

Y dimension: 56 1

Fit method: simpls

Number of components considered: 5

VALIDATION: RMSEP

Cross-validated using 56 leave-one-out segments.

(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps

CV 0.6016 0.3714 0.3812 0.3864 0.3915 0.3983

adjCV 0.6016 0.3713 0.3808 0.3859 0.3910 0.3976

TRAINING: % variance explained

1 comps 2 comps 3 comps 4 comps 5 comps

X 51.07 64.11 76.81 83.13 89.44

lAge 64.62 67.94 68.70 68.78 68.81

r12c$plspred <- -exp(predict(p1,r12c,ncomp=1))

plot(plspred ~age,ylab='PLS prediction',type='n',data=r12c)

text(x=r12c$age,y=r12c$plspred,r12c$ID,col=cols[r12c$group])

In an email from Thomas Weber, it was also indicated that there might be reasons to doubt the homonid group of a few inventories (rows); reasons include few flakes, changing insight and misfit from "impressionist technological" point of view. All inventories mentioned in that email are now removed. As can be seen, a two component PLS model is now preferred, and 80% of variance is explained.

p2 <- mvr(lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,

data=r12cx,

method='simpls',

validation='LOO',

scale=TRUE,

ncomp=5,

subset= !(ID %in% c('ms','c','roe','sz','va','arn')))

summary(p2)

Data: X dimension: 53 8

Y dimension: 53 1

Fit method: simpls

Number of components considered: 5

VALIDATION: RMSEP

Cross-validated using 53 leave-one-out segments.

(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps

CV 0.6088 0.3127 0.3057 0.3062 0.3070 0.3153

adjCV 0.6088 0.3126 0.3054 0.3059 0.3065 0.3140

TRAINING: % variance explained

1 comps 2 comps 3 comps 4 comps 5 comps

X 52.14 63.94 77.05 79.82 84.77

lAge 75.44 79.66 80.31 81.00 81.56

#### Plot

The plot below shows age in data versus the model predictions. It should be noted that after all steps, it is to be expected the data used to fit the model is predicted better than the other data. This is especially true for the suspected outliers which were removed, but also for inventories with dating method typological.

Having said that, the model really does not find inventories v1 and v2 to be as old as the data states. Perhaps there was no or less or different technological change, which is not picked by the model. In addition, the step between middle paleolithic and homo sapiens is not picked by the model. It is my personal suspicion that getting more homo sapiens data would improve all of the models which I have made in these three blog posts. As it is, the regression tree was the best tool to detect these inventories, which is a bit odd.

r12c$plspred2 <- -exp(predict(p2,r12c,ncomp=2))

plot(plspred2 ~age,ylab='PLS prediction',type='n',data=r12c)

text(x=r12c$age,y=r12c$plspred2,r12c$ID,col=cols[r12c$group])

To

**leave a comment**for the author, please follow the link and comment on his blog:**Wiekvoet**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...