In this set of exercises, we are going to use the `lm` and `glm` functions to fit several generalized linear models to one dataset.

Since this is a basic set of exercises, we will take a closer look at the arguments of these functions and at how to use the output of each function to find a model that fits our data.

Before starting this set of exercises, I strongly suggest you look at the R Documentation of `lm` and `glm`.

Note: This set of exercises assumes that you have a basic understanding of generalized linear models.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

The dataset we will be using contains information about passengers of the Titanic, including whether or not they survived.

To obtain the data, run these lines of code:

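```
# Install the titanic package only if it is not already available
if (!'titanic' %in% installed.packages()) install.packages('titanic')
library(titanic)

# Drop columns 1, 4, 9 and 11 (PassengerId, Name, Ticket, Cabin)
DATA <- titanic_train[, -c(1, 4, 9, 11)]
```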

Exercise 1

Linear regression

1. Use `DATA` to create a linear model using the function `lm`, with the variables `Age` and `Fare` as independent variables and `Survived` as the dependent one. Save the regression in an object called `lm_reg`.

2. Use the function `glm` to perform the same task, and save the regression in an object called `glm_reg`.
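
As a small illustration of why both functions can do the same job, here is a sketch on the built-in `mtcars` data (not the Titanic data, so the exercise itself stays open). The object names `lm_fit` and `glm_fit` are made up for this example. When no `family` is given, `glm` falls back to `gaussian()`, which is ordinary linear regression.

```r
# Illustration on the built-in mtcars data (not the exercise data).
lm_fit  <- lm(mpg ~ hp + wt, data = mtcars)
glm_fit <- glm(mpg ~ hp + wt, data = mtcars)  # family defaults to gaussian()

# Both calls solve the same least-squares problem,
# so the estimated coefficients agree.
print(coef(lm_fit))
print(coef(glm_fit))

same_coefs <- isTRUE(all.equal(coef(lm_fit), coef(glm_fit)))
print(same_coefs)
```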

Exercise 2

If you print any of the previous objects, you will realize that there’s not much information about the performance of the models. Fortunately, `summary` is a great function for finding out more about any statistical model you fit to a dataset. Depending on the model, `summary` will produce different outputs.

• Apply `summary` to `lm_reg` and to `glm_reg`. You will find a slight difference between the two outputs. That is because `glm` is more flexible than `lm`.
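
To see where the two summaries differ, it can help to look at the pieces stored inside them. The sketch below again uses `mtcars` and made-up object names. It only picks out a few components that `summary.lm` and `summary.glm` are documented to return.

```r
lm_fit  <- lm(mpg ~ hp + wt, data = mtcars)
glm_fit <- glm(mpg ~ hp + wt, data = mtcars)

lm_sum  <- summary(lm_fit)
glm_sum <- summary(glm_fit)

# summary.lm reports R-squared and an F statistic.
print(lm_sum$r.squared)
print(lm_sum$fstatistic)

# summary.glm reports deviances and the AIC instead.
print(glm_sum$null.deviance)
print(glm_sum$deviance)
print(glm_sum$aic)

# The coefficient table is available from both summaries.
print(coef(lm_sum))
print(coef(glm_sum))
```

For a gaussian `glm`, the residual deviance is simply the residual sum of squares of the linear model.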

Exercise 3

So far we have been assuming (incorrectly) that the dependent variable (`Survived`) follows a normal distribution, which is why we have been performing a linear regression. In fact, `Survived` follows a binomial distribution: there are only two options, either the passenger survived (1) or the passenger died (0). Since the data has a binomial distribution, we should perform a logistic regression. Use the function `glm` to perform a logistic regression using `Age` and `Fare` as independent variables and save it in an object called `bin_model`. Hint: Define the value of the argument `family` properly.
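
The same idea can be sketched on another binary outcome. In `mtcars`, the column `am` is coded 0/1 (automatic or manual transmission), so it plays the role that `Survived` plays in our data. The name `logit_fit` is only used for this illustration.

```r
# Logistic regression on a 0/1 outcome from the built-in mtcars data.
logit_fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

print(coef(logit_fit))    # coefficients are on the log-odds scale
print(family(logit_fit))  # shows the family and link that were used

# Fitted values are probabilities, so they stay between 0 and 1.
fitted_probs <- fitted(logit_fit)
print(range(fitted_probs))
```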

Exercise 4

Inside the `family` argument you can always specify a particular link. If you don’t, a default link will be used, depending on the family you chose.

1. To find out the default link associated with a certain family, you can write the family name followed by parentheses (e.g. `gaussian()`). Find the default link associated with the binomial family.
2. Create a probit model with the same variables used in `bin_model` and save it in an object called `bin_probit_model`.
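
Family objects are ordinary R objects, so the link can also be read off programmatically. The sketch below inspects two other families and then fits a model with a non-default link. It uses the built-in `warpbreaks` data and a Poisson family, so it is an analogy to the exercise, not its solution. All object names are made up for the example.

```r
# A family object stores its name and its link function.
gauss_fam <- gaussian()
pois_fam  <- poisson()
print(gauss_fam)
print(pois_fam$link)

# A non-default link is requested through the family's link argument.
sqrt_fit <- glm(breaks ~ wool + tension,
                data   = warpbreaks,
                family = poisson(link = "sqrt"))
print(sqrt_fit$family$link)
```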

Exercise 5

Finding the right model requires comparing different models and choosing the best one. Although there are many performance measures, for now we will use the `AIC` as our measure (a smaller AIC is better). By this measure, `bin_model` is better than `bin_probit_model`, so let’s keep working with `bin_model`.

Until now, an intercept has been part of the models. Create a logistic regression with the same variables but with no intercept.
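
Both ideas, comparing models by AIC and dropping the intercept, can be tried out on `mtcars` first. In an R formula, adding `- 1` (or `+ 0`) removes the intercept. The model names below are hypothetical.

```r
with_int    <- glm(am ~ wt,     data = mtcars, family = binomial)
without_int <- glm(am ~ wt - 1, data = mtcars, family = binomial)

# The intercept disappears from the coefficient vector.
print(names(coef(with_int)))
print(names(coef(without_int)))

# AIC() accepts several fitted models and returns one row per model.
aic_table <- AIC(with_int, without_int)
print(aic_table)
```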

Exercise 6

Impute data. If you apply the `summary` function to any of the previous models, you will find that 177 observations have been deleted due to missingness. This happens because the `glm` function uses `na.action = "na.omit"` by default. This makes it easier to run a model on messier data, but that is not always desirable. You want to have full control and understanding of what the function is doing.

1. There are some missing values in `Age`; replace these values with the median.
2. Update the `glm_model` with the updated data and specify `na.action='na.fail'`. This ensures that the dataset has no missing values; otherwise, the function will show an error.
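
The pattern can be rehearsed on the built-in `airquality` data, which has missing values in `Solar.R`. This is only a sketch with made-up object names. Median imputation is the method the exercise asks for, not necessarily the best way to treat missing data.

```r
aq <- airquality  # built-in data with missing values in Ozone and Solar.R
n_missing <- sum(is.na(aq$Solar.R))
print(n_missing)

# With na.fail, fitting stops with an error while missing values remain.
attempt <- try(
  glm(Temp ~ Solar.R + Wind, data = aq, na.action = na.fail),
  silent = TRUE
)
failed <- inherits(attempt, "try-error")
print(failed)

# Replace the missing values with the median of the observed ones.
aq$Solar.R[is.na(aq$Solar.R)] <- median(aq$Solar.R, na.rm = TRUE)

# Now the same call runs and uses every row of the data.
fit <- glm(Temp ~ Solar.R + Wind, data = aq, na.action = na.fail)
print(nobs(fit) == nrow(aq))
```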

Exercise 7

Add polynomial independent variables. Some variables have a quadratic relationship with the dependent variable. This can be modelled by specifying a quadratic term in the formula of the model.

Add a quadratic term for the variable `Fare` to the current model, specified in `glm_model`.
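
One detail is easy to miss here: inside a formula, `^` means crossing terms, not arithmetic, so a squared variable has to be wrapped in `I()`. The sketch below shows this on `mtcars` with hypothetical model names, and also shows `update` as one way to extend an existing model.

```r
linear_fit <- glm(mpg ~ hp, data = mtcars)

# Without I(), hp^2 is read as formula syntax and collapses back to hp.
naive_fit <- glm(mpg ~ hp + hp^2, data = mtcars)
print(names(coef(naive_fit)))

# With I(), the square is computed arithmetically and gets its own coefficient.
quad_fit <- glm(mpg ~ hp + I(hp^2), data = mtcars)
print(names(coef(quad_fit)))

# update() adds the term to an existing model without retyping the call.
quad_fit2 <- update(linear_fit, . ~ . + I(hp^2))
print(isTRUE(all.equal(coef(quad_fit), coef(quad_fit2))))
```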

Exercise 8

Add categorical variables. Add `Sex` as an independent variable to the current model, specified in `glm_model`. Note that `Sex` is not a numeric variable.
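
R's modelling functions turn factors, and character columns, into dummy variables automatically, with the first level as the reference. The sketch below shows this on the built-in `ToothGrowth` data, where `supp` is a factor with two levels. The character copy `supp_chr` is created only for this illustration.

```r
tg <- ToothGrowth
print(levels(tg$supp))  # the first level becomes the reference category

# A factor enters the model as a dummy variable for the non-reference level.
factor_fit <- glm(len ~ dose + supp, data = tg)
print(names(coef(factor_fit)))

# A character column is converted to a factor behind the scenes.
tg$supp_chr <- as.character(tg$supp)
char_fit <- glm(len ~ dose + supp_chr, data = tg)
print(names(coef(char_fit)))
```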

Exercise 9

Now that we have found a good model that fits our data, it’s time to use the `predict` function to find out how well the model predicts on our own data. Use the function `predict` to obtain the predictions of the model on `DATA` and save them in `Pred.default`.
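
As a quick sketch of how `predict` is called, the example below uses `mtcars` with made-up object names. Without `newdata`, `predict` returns values for the rows used in fitting. With `newdata`, it returns one value per row of the data frame supplied, which must contain the model's predictor columns.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predictions for the data the model was fitted on.
pred_fitted <- predict(fit)
pred_same   <- predict(fit, newdata = mtcars)
print(isTRUE(all.equal(pred_fitted, pred_same)))

# Predictions for new, hypothetical cars (weights in 1000 lbs).
new_cars <- data.frame(wt = c(2, 3, 4))
pred_new <- predict(fit, newdata = new_cars)
print(pred_new)
```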

Exercise 10

`Pred.default` shows the predicted values on the scale of the link function, in this case the logit. This is not easily interpretable. To fix this problem, we can specify the `type` of prediction we want.

Obtain the predictions as probability values.
Extra: What’s the percentage accuracy of this model if we classify a passenger as died (0) when the predicted probability is less than 0.5 and as survived (1) otherwise?
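
To make the two prediction scales and the accuracy calculation concrete, here is a sketch on `mtcars` with `am` as the 0/1 outcome. The names are hypothetical. Accuracy measured on the same data used for fitting tends to be optimistic, so read it as a rough check only.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Default predictions are on the link (log-odds) scale.
link_pred <- predict(fit, newdata = mtcars)

# type = "response" returns probabilities instead.
prob_pred <- predict(fit, newdata = mtcars, type = "response")

# For a logit link, the inverse logit (plogis) maps one scale to the other.
print(isTRUE(all.equal(plogis(link_pred), prob_pred)))

# Classify with a 0.5 cut-off and compare with the observed outcome.
pred_class <- ifelse(prob_pred < 0.5, 0, 1)
print(table(predicted = pred_class, observed = mtcars$am))

accuracy <- mean(pred_class == mtcars$am)
print(accuracy)
```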