
The Happy Planet Index (HPI) is an index of human well-being and environmental impact introduced by the New Economics Foundation (NEF), a UK-based economic think tank promoting social, economic and environmental justice. The index is weighted to give progressively higher scores to nations with lower ecological footprints. I downloaded the 2016 dataset from the HPI website. My goal is to find correlations between several variables, then use clustering techniques to separate these 140 countries into different clusters according to happiness, wealth, life expectancy and carbon emissions.

### Data Pre-processing

The structure of the data

The summary of the data
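As a rough illustration of these two inspection steps, here is a minimal sketch using base R's `str()` and `summary()`. The data frame `hpi` and its column names are made up for the example; it is simulated data, not the real HPI file.

```r
# Simulated stand-in for the HPI table. Column names and values are
# assumptions for illustration only, not the real dataset.
set.seed(42)
hpi <- data.frame(
  country         = paste0("Country_", 1:140),
  life_expectancy = rnorm(140, mean = 71, sd = 8),
  wellbeing       = rnorm(140, mean = 5.4, sd = 1.1),
  footprint       = rlnorm(140, meanlog = 1, sdlog = 0.6),
  gdp_per_capita  = rlnorm(140, meanlog = 8.7, sdlog = 1.4),
  stringsAsFactors = FALSE
)

str(hpi)      # column types and the first few values of each column
summary(hpi)  # min, quartiles, mean and max for each numeric column
```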

After log transformation, the relationship between GDP per capita and life expectancy is clearer and looks relatively strong. The two variables tend to move together: the Pearson correlation between them is reasonably high, at approximately 0.62.
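To show why the log transformation helps, here is a small sketch on simulated data (not the HPI data). The variables `gdp` and `life` are hypothetical, and `life` is built to be linear in `log(gdp)`, so the numbers it prints say nothing about the real correlation reported above.

```r
# Simulated data: life expectancy is linear in log(GDP) by construction.
set.seed(42)
gdp  <- rlnorm(140, meanlog = 8.7, sdlog = 1.4)     # right-skewed, like GDP
life <- 30 + 4.5 * log(gdp) + rnorm(140, sd = 5)

r_raw <- cor(gdp, life, method = "pearson")       # skew weakens the linear fit
r_log <- cor(log(gdp), life, method = "pearson")  # the log straightens it out

cat("Pearson r, raw GDP:", round(r_raw, 2), "\n")
cat("Pearson r, log GDP:", round(r_log, 2), "\n")
```

Pearson correlation only measures linear association, so a skewed variable such as GDP often needs a transformation before the coefficient reflects the strength of the relationship.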

Many countries in Europe and the Americas end up with middle-to-low HPI scores, probably because of their large carbon footprints, despite long life expectancy.

GDP can’t buy happiness. The correlation between GDP and Happy Planet Index score is indeed very low, at about 0.11.

### Always (almost) scale the data

An important step for meaningful clustering is to transform the variables so that each has mean zero and standard deviation one. Otherwise, variables measured on large scales dominate the distance calculations.
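A minimal sketch of this standardization with base R's `scale()`, on simulated columns whose names are assumptions:

```r
# Simulated numeric columns on very different scales (hypothetical names).
set.seed(42)
x <- data.frame(
  life_expectancy = rnorm(140, mean = 71, sd = 8),
  footprint       = rlnorm(140, meanlog = 1, sdlog = 0.6),
  gdp_per_capita  = rlnorm(140, meanlog = 8.7, sdlog = 1.4)
)

# scale() subtracts each column's mean and divides by its standard deviation.
x_scaled <- scale(x)

round(colMeans(x_scaled), 10)  # each column mean is (numerically) zero
apply(x_scaled, 2, sd)         # each column standard deviation is one
```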

A simple correlation heatmap
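One way to draw such a heatmap with base R only is sketched below. It uses simulated, deliberately correlated columns with hypothetical names, and base `heatmap()` in place of whichever plotting package the original figure used.

```r
# Simulated correlated variables driven by one hidden factor (assumption).
set.seed(42)
n <- 140
wealth <- rnorm(n)
x <- data.frame(
  gdp_log         = wealth + rnorm(n, sd = 0.4),
  life_expectancy = wealth + rnorm(n, sd = 0.5),
  footprint       = wealth + rnorm(n, sd = 0.6),
  wellbeing       = 0.7 * wealth + rnorm(n, sd = 0.7),
  inequality      = -0.8 * wealth + rnorm(n, sd = 0.6)
)

cor_mat <- cor(x)      # pairwise Pearson correlations
round(cor_mat, 2)

# Rowv = NA keeps the original variable order (no dendrogram reordering);
# scale = "none" plots the correlations as they are.
heatmap(cor_mat, Rowv = NA, symm = TRUE, scale = "none", margins = c(9, 9))
```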

### Principal Component Analysis (PCA)

PCA is a procedure for identifying a smaller number of uncorrelated variables, called “principal components”, from a large set of data. The goal of PCA is to explain the maximum amount of variance with the minimum number of principal components.

Interpretation:

1. The proportion of variation retained by the principal components was extracted above.

2. An eigenvalue is the amount of variation retained by each PC. The first PC corresponds to the maximum amount of variation in the data set. In this case, the first two principal components are worth considering under the eigenvalue-greater-than-one rule proposed by Kaiser (1960), a commonly used criterion for how many components to retain.

The scree plot shows how much of the variability in the data each component explains. In this case, almost 80% of the variance in the data is retained by the first two principal components.
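The eigenvalues, the proportion of variance and the Kaiser rule can be sketched with base R's `prcomp()`. This is a stand-in run on simulated data with hypothetical column names, so the output will not match the figures quoted in this post.

```r
# Simulated correlated variables driven by one hidden factor (assumption).
set.seed(42)
n <- 140
wealth <- rnorm(n)
x <- data.frame(
  gdp_log         = wealth + rnorm(n, sd = 0.4),
  life_expectancy = wealth + rnorm(n, sd = 0.5),
  footprint       = wealth + rnorm(n, sd = 0.6),
  wellbeing       = 0.7 * wealth + rnorm(n, sd = 0.7),
  inequality      = -0.8 * wealth + rnorm(n, sd = 0.6)
)

# PCA on standardized variables (equivalent to PCA on the correlation matrix).
pca <- prcomp(x, center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2                    # variance retained by each PC
prop_var    <- eigenvalues / sum(eigenvalues)

data.frame(
  eigenvalue = round(eigenvalues, 3),
  percent    = round(100 * prop_var, 1),
  cumulative = round(100 * cumsum(prop_var), 1)
)

kaiser_keep <- which(eigenvalues > 1)        # Kaiser rule: keep eigenvalue > 1
kaiser_keep

screeplot(pca, type = "lines", main = "Scree plot (simulated data)")
```

With standardized variables the eigenvalues sum to the number of variables, which is why an eigenvalue above one means a component carries more variance than a single original variable.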

1. Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the data set.

2. The contribution of variables was extracted above: the larger the value of the contribution, the more the variable contributes to the component.

This highlights the most important variables in explaining the variations retained by the principal components.
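For readers who want to compute these quantities by hand, here is a sketch based on `prcomp()` and simulated data with hypothetical column names. It assumes the usual definitions: the correlation of a standardized variable with a PC is its loading times the PC's standard deviation, and its contribution in percent is 100 times the squared loading.

```r
# Simulated correlated variables driven by one hidden factor (assumption).
set.seed(42)
n <- 140
wealth <- rnorm(n)
x <- data.frame(
  gdp_log         = wealth + rnorm(n, sd = 0.4),
  life_expectancy = wealth + rnorm(n, sd = 0.5),
  footprint       = wealth + rnorm(n, sd = 0.6),
  wellbeing       = 0.7 * wealth + rnorm(n, sd = 0.7),
  inequality      = -0.8 * wealth + rnorm(n, sd = 0.6)
)
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Correlation between each variable and each PC: loading * sd of that PC.
var_cor <- sweep(pca$rotation, 2, pca$sdev, "*")

# Contribution (%) of each variable to each PC. Each loading vector has
# unit length, so the squared loadings in a column sum to one.
contrib <- 100 * pca$rotation^2

round(var_cor[, 1:2], 2)
round(contrib[, 1:2], 1)
```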

### Using PAM clustering to group countries by wealth, development, carbon emissions, and happiness

When using partitioning clustering algorithms such as PAM, the number of clusters k must be specified by the analyst. I use the following method to help find the best k.
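One common heuristic for choosing k, not necessarily the method used for this post, is to pick the k with the largest average silhouette width. The sketch below applies it with `pam()` from the `cluster` package (a recommended package that ships with most R installations) to simulated data with three well-separated groups.

```r
library(cluster)

# Simulated 2-D data with three well-separated groups (assumption).
set.seed(42)
centers <- rbind(c(-3, -3), c(0, 3), c(3, -3))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(40, centers[i, 1], 0.7), rnorm(40, centers[i, 2], 0.7))
}))
x <- scale(x)

# Average silhouette width for each candidate k; higher is better.
ks <- 2:6
avg_sil <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)

data.frame(k = ks, avg_silhouette = round(avg_sil, 3))
best_k <- ks[which.max(avg_sil)]
best_k
```

Real data rarely separates this cleanly, so it is worth comparing several criteria rather than trusting a single index.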

I will apply k = 3 in the following steps.

Number of countries assigned to each cluster.

This prints the one typical country (the medoid) that represents each cluster.
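These steps can be sketched with `pam()` from the `cluster` package. The data below are simulated and the row names are hypothetical placeholders, so the cluster sizes and medoids are not those of the real analysis.

```r
library(cluster)

# Simulated data with placeholder "country" row names (assumption).
set.seed(42)
centers <- rbind(c(-3, -3), c(0, 3), c(3, -3))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(40, centers[i, 1], 0.7), rnorm(40, centers[i, 2], 0.7))
}))
rownames(x) <- paste0("Country_", seq_len(nrow(x)))

fit <- pam(scale(x), k = 3)

# Number of observations assigned to each cluster.
sizes <- table(fit$clustering)
sizes

# PAM represents each cluster by a medoid, which is an actual observation.
medoid_names <- rownames(x)[fit$id.med]
medoid_names
```

Because a medoid is a real row of the data, it can be read directly as the "typical" member of its cluster, which a k-means centroid cannot.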

It is always a good idea to look at the cluster results to see how these three clusters were assigned.

### A world map of the three clusters

Source code that created this post can be found here. I am happy to hear any feedback or questions.


Exploring and Clustering Happy Planet Index was originally published by Susan Li at Susan Li | Data Ninja on May 19, 2017.
