Example 10.5: Convert a character-valued categorical variable to numeric

October 8, 2012

(This article was first published on SAS and R, and kindly contributed to R-bloggers)

In some settings it may be necessary to recode a categorical variable with character values into a variable with numeric values. For example, the matching macro we discussed in example 7.35 will only match on numeric variables. One way to convert character variables to numeric values is to determine which values exist, then write a possibly long series of conditional tests to assign numbers to the values. Surely there’s a better way?

In SAS, Rick Wicklin offers an IML solution and links to a macro with the same function. But if you’re not an IML coder, and you don’t want to investigate a macro solution, it’s simple enough to do with data steps.

We’ll begin by making some fake data.

data test;
  do i = 1 to 100;
  cat = "meow";
  if i gt 30 then cat = "Purr";
  if i gt 70 then cat = "Hiss";

To make the new variable, we’ll just sort (section 1.5.6) the data on the categorical variable we want to convert, then use the set ds; by x; syntax to keep track of when a new value is encountered in the data. It’s hard to believe that we’ve never demonstrated this useful syntax before– perhaps we just can’t find it today. The set ds; by x; syntax makes new temporary variables first.x and last.x that are equal to 1 for the first and last observations of each new level of x, respectively, and 0 otherwise. When we find a new value, we’ll increase a counter by 1; the counter is our new numeric-valued variable.

proc sort data = test; by cat; run;

data catize;
set test;
by cat;
retain catnum 0;
if first.cat then catnum = catnum + 1;

/* check the result */
proc freq data = catize;
tables cat * catnum;

The table also shows the recoding values.

                             Table of cat by catnum

                  cat       catnum

                  Percent  |
                  Row Pct  |
                  Col Pct  |       1|       2|       3|  Total
                  Hiss     |     30 |      0 |      0 |     30
                           |  30.00 |   0.00 |   0.00 |  30.00
                           | 100.00 |   0.00 |   0.00 |
                           | 100.00 |   0.00 |   0.00 |
                  Purr     |      0 |     40 |      0 |     40
                           |   0.00 |  40.00 |   0.00 |  40.00
                           |   0.00 | 100.00 |   0.00 |
                           |   0.00 | 100.00 |   0.00 |
                  meow     |      0 |      0 |     30 |     30
                           |   0.00 |   0.00 |  30.00 |  30.00
                           |   0.00 |   0.00 | 100.00 |
                           |   0.00 |   0.00 | 100.00 |
                  Total          30       40       30      100
                              30.00    40.00    30.00   100.00


We begin by making the data. To convert to numbers, we use the labels option to the factor() function, feeding it the sequences of numbers between 1 and however many different values there are. Note that we find this using the factor() function again. There’s probably a better way of doing this, but it’s a little bit amusing to code it this way. Then we have numbers, but they’re store as a factor. We can get them out with a call to as.numeric().

cat = c(rep("meow",30),rep("Hiss",30), rep("Purr", 40))
catn1 = factor(cat, labels=(1:length(levels(factor(cat)))))
catn = as.numeric(catn1)

catn Hiss meow Purr
   1   30    0    0
   2    0   30    0
   3    0    0   40

There’s a warning in the documentation for factor() that the values are assigned in location-specific fashion, so the table should be used to establish how the codes were assigned.  For the record, the use cases for this kind of recoding in R may be more strained than the SAS example given above.

An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

To leave a comment for the author, please follow the link and comment on their blog: SAS and R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Dommino data lab

Quantide: statistical consulting and training




CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)