First, an update: a commenter has asked me to post my code so that it is easier to practice the examples I show here. It will take me a little time to get all of my code for past posts well documented and readable, but I have uploaded the code and data for the last 4 posts, including this one, here:
Unfortunately, I could not find a way to attach it to Blogger, so sorry for the extra step.
Ok, now on to Data types part 4: Logical
I started this series of posts on data types by saying that when you have a dataframe like this called mydata:
you can’t do this in R:
Because Age does not exist as an object in R, and you get the error below:
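As a minimal sketch, assuming a small made-up data frame (the `ID` and `Age` values below are hypothetical, not the data from this post), the failure can be reproduced like this. The `tryCatch()` wrapper is only there so the error message is captured rather than stopping the script:

```r
# Hypothetical data frame; the values are invented for illustration
mydata <- data.frame(
  ID  = 1:6,
  Age = c(18, 22, 45, 31, 24, 60)
)

# Age is a column inside mydata, not an object in the workspace,
# so referring to it by its bare name raises an error.
msg <- tryCatch(
  {
    Age < 25
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Referring to the column through the data frame works
print(mydata$Age)
```

The point is simply that R looks for an object called `Age` in the workspace and does not search inside `mydata` unless you tell it to with `mydata$Age`.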
But then what happens when I do,
This is perfectly legal to do in R, but it’s not going to drop observations. With this kind of statement, you are asking R to evaluate the logical question “Is it true that mydata$Age is less than 25?”. Well, that depends on which element of the Age vector, of course. Which is why this is what you get when you run that code:
At first glance, this looks like a character vector. There is a string of entries made up of letters, after all. But it’s not the character class, it’s the logical class. If you save this string of TRUE and FALSE entries into an object and print its class, this is what you get:
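A minimal sketch of that comparison, again assuming made-up `Age` values and a hypothetical object name `under25`:

```r
# Hypothetical ages, invented for illustration
mydata <- data.frame(Age = c(18, 22, 45, 31, 24, 60))

# The comparison is evaluated element by element
under25 <- mydata$Age < 25
print(under25)

# The result is a logical vector, not a character vector
print(class(under25))
```

Each element of `under25` answers the question "is this Age less than 25?" for the corresponding row.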
The logical class can only take on two values, TRUE or FALSE (plus NA when a value is missing). We’ve seen evaluations of logical operations already, first in subsetting, like this:
Check out my post on subsetting if this syntax is confusing. In a nutshell, R evaluates every row and keeps only those that meet the criterion, here the rows where Age is under 40, along with all columns.
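To make the mechanics concrete, here is a small sketch with hypothetical data. The intermediate object `keep` and the result name `under40` are my own names for illustration:

```r
# Hypothetical data frame; the values are invented for illustration
mydata <- data.frame(
  ID  = 1:6,
  Age = c(18, 22, 45, 31, 24, 60)
)

# Step 1: the condition produces one TRUE/FALSE per row
keep <- mydata$Age < 40
print(keep)

# Step 2: rows where keep is TRUE are retained; leaving the column
# position empty after the comma keeps every column
under40 <- mydata[keep, ]
print(under40)
```

Writing `mydata[mydata$Age < 40, ]` in one step does the same thing. The logical vector is just built inside the brackets.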
Or here, in ifelse() statements
More on ifelse() statements here. The ifelse() function is really useful, but is actually overkill when you’re just creating a binary variable. This can be done faster by taking advantage of the fact that logical values of TRUE always have a numeric value of 1, while logical values of FALSE always have a numeric value of 0.
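A quick sketch of both ideas, using made-up ages. The object names `binary_ifelse` and `n_under25` are hypothetical:

```r
# Hypothetical ages, invented for illustration
mydata <- data.frame(Age = c(18, 22, 45, 31, 24, 60))

# The ifelse() route: spell out 1 for TRUE and 0 for FALSE
binary_ifelse <- ifelse(mydata$Age < 25, 1, 0)
print(binary_ifelse)

# TRUE and FALSE already behave as 1 and 0 when treated as numbers
print(as.numeric(c(TRUE, FALSE)))

# So arithmetic on a logical vector works directly;
# for example, sum() counts the TRUE values
n_under25 <- sum(mydata$Age < 25)
print(n_under25)
```

Because the logical vector already carries the 1/0 information, the `ifelse()` call is doing extra work to arrive at the same numbers.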
That means all I need to do to create a binary variable of under age 25 is to convert my logical mydata$Ageunder25 vector into numeric. This is very easy with R’s as.numeric() function. I do it like this:
or directly without that intermediate step like this:
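Both routes can be sketched as follows, assuming made-up ages. The column names `Ageunder25` and `Ageunder25_num` come from the text, while `direct` is a hypothetical name used only to compare the two results:

```r
# Hypothetical ages, invented for illustration
mydata <- data.frame(Age = c(18, 22, 45, 31, 24, 60))

# Two-step version: store the logical vector, then convert it
mydata$Ageunder25     <- mydata$Age < 25
mydata$Ageunder25_num <- as.numeric(mydata$Ageunder25)

# One-step version: convert the comparison directly
direct <- as.numeric(mydata$Age < 25)

# Both give the same 0/1 indicator
print(identical(mydata$Ageunder25_num, direct))
print(mydata[, c("Age", "Ageunder25", "Ageunder25_num")])
```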
Let’s check out the relevant columns in our dataframe:
We can see that the Ageunder25_num variable is an indicator of whether the Age variable is under 25.
Now the really, really useful part of this is that you can use this feature to turn on and off a variable depending on its value. For example, say you got your data and realized that some of the height values were in inches and some were in centimeters, like this:
Those heights of 152 and 170 are in centimeters while everything else is in inches. There are various ways to fix this, but one way is to check which values are less than, say, 90 (probably a safe cutoff) and create a new column that keeps the values under 90 as they are but converts the larger values to inches. We can do it this way:
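Here is a minimal sketch of that idea, not necessarily the exact code from the post. The heights other than 152 and 170 are made up, the names `is_inches` and `Height_in` are hypothetical, and I am assuming the conversion 1 inch = 2.54 cm:

```r
# Hypothetical heights; 152 and 170 are the centimeter entries
mydata <- data.frame(Height = c(65, 152, 70, 62, 170, 68))

# TRUE where the value is plausibly already in inches
is_inches <- mydata$Height < 90

# Multiplying by a logical switches a term on (x 1) or off (x 0):
# the first term survives only for the inch values, the second
# only for the centimeter values, which are divided by 2.54
mydata$Height_in <- is_inches * mydata$Height +
  (!is_inches) * mydata$Height / 2.54

print(mydata)
```

For each row exactly one of the two terms is nonzero, so the sum is either the original value or the converted one. Note that a missing height would make the comparison `NA`, and the new column would then be `NA` for that row as well.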