After seeing the spectra of the calibration set, it is obvious that we have at least three clusters, and that they can be related to the concentration of the active ingredient in the tablets. If we look at the scores in the PC1-PC2 score map, we see the same three clusters.
I imported the test set into R and projected it onto the PC1-PC2 score map (developed with the calibration samples), and I found another cluster.
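The projection step can be sketched in R roughly as follows. This is only a minimal sketch: the matrices `calib` and `test` are simulated stand-ins for the real spectra, and mean-centring without scaling is my assumption, not necessarily the pretreatment used for the score map in this post.

```r
# Hypothetical stand-in data: random numbers, NOT the Shootout spectra.
set.seed(1)
n_wavelengths <- 50
calib <- matrix(rnorm(30 * n_wavelengths), nrow = 30)              # 30 calibration spectra
test  <- matrix(rnorm(10 * n_wavelengths, mean = 0.5), nrow = 10)  # 10 test spectra

# Fit the PCA on the calibration samples only
# (mean-centred, not scaled: an assumption for this sketch).
pca <- prcomp(calib, center = TRUE, scale. = FALSE)

# Scores of the calibration samples on PC1 and PC2
calib_scores <- pca$x[, 1:2]

# predict() applies the calibration centring and loadings to the new
# samples, so the test scores land in the same PC1-PC2 map.
test_scores <- predict(pca, newdata = test)[, 1:2]
```

The important point is that the test samples are never used to fit the PCA; they are only centred with the calibration mean and multiplied by the calibration loadings.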
If we read the Chemometrics Shootout rules, we see:
“This year’s challenge will consist in developing the best model for the active
ingredient using the calibration data. However, the most important task will be to build a
model that will be robust to production scale differences. In addition, the quality of the
presentation and the reasoning behind the approach taken will be used to determine the
So predicting this test set as accurately as possible is an important part of approaching the challenge.
And what about the validation set? We don't know the reference values, but we can again project the samples onto the PC1-PC2 score map (developed with the calibration samples) to see whether more clusters appear, or whether the samples are represented in the training set.
As we can see, some test and validation samples do not overlap with any samples of the calibration set, so we have to take this into account when developing the model.
R is really wonderful for making these plots:
Black circles: Calibration Samples
Red triangles: Test Samples
Green crosses: Validation Samples
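A score map with this legend can be drawn with base R graphics along the following lines. Again this is just a sketch: the score matrices are simulated placeholders, and `plot_score_map` is a hypothetical helper name I made up for the example, not a function from any package.

```r
# Hypothetical stand-in scores; replace with the real PC1/PC2 scores.
set.seed(2)
calib_scores <- cbind(PC1 = rnorm(30), PC2 = rnorm(30))
test_scores  <- cbind(PC1 = rnorm(10, mean = 2), PC2 = rnorm(10))
valid_scores <- cbind(PC1 = rnorm(10, mean = -2), PC2 = rnorm(10, mean = 1))

# Hypothetical helper: overlay the three sample sets on one PC1-PC2 map.
plot_score_map <- function(calib, test, valid) {
  # Axis limits cover all three sets so that no projected sample is clipped.
  all_scores <- rbind(calib, test, valid)
  xlim <- range(all_scores[, 1])
  ylim <- range(all_scores[, 2])

  plot(calib[, 1], calib[, 2], pch = 1, col = "black",
       xlim = xlim, ylim = ylim, xlab = "PC1", ylab = "PC2")
  points(test[, 1], test[, 2], pch = 2, col = "red")      # triangles
  points(valid[, 1], valid[, 2], pch = 4, col = "green")  # crosses
  legend("topright",
         legend = c("Calibration", "Test", "Validation"),
         pch = c(1, 2, 4), col = c("black", "red", "green"))

  invisible(list(xlim = xlim, ylim = ylim))
}

lims <- plot_score_map(calib_scores, test_scores, valid_scores)
```

Computing the axis limits from all three sets matters here, because the interesting test and validation samples are exactly the ones that fall outside the calibration cloud.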