Factor variable to numeric or integer variable

Gouri Shankar Mishra

Apr 28, 2016, 4:16:19 PM
to Davis R Users' Group
I have a variable (age of respondent) which is currently a factor, and I want to convert it to numeric.

> str(data)
'data.frame': 2001 obs. of  2 variables:
 $ seq_num: int  1 2 3 4 5 6 7 8 9 10 ...
 $ age    : Factor w/ 76 levels "","18","19","20",..: 50 29 3 39 5 35 21 35 56 53 ...

All the factor levels are integers except for a blank level and "99 or older". Given these two levels, I cannot convert to numeric by saying

> data$age_num <- as.numeric(data$age)

If I do the above, the numeric values are lower by 16.
> head(data$age_num)
[1] 50 29  3 39  5 35
> head(data$age)
[1] 66 45 19 55 21 51
76 Levels:  18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 ... 99 or older

What is going on here? Also, what is the best way to get a numeric variable?

Thanks for your time. 

data.csv

Michael Hannon

Apr 28, 2016, 4:27:57 PM
to davi...@googlegroups.com
Below is an excerpt from a note that I wrote recently to one of my
colleagues. It may (?) be useful.

-- Mike

Note also that you *can* extract numerical values from factors that were made
originally with numbers, but you have to be careful. Here's an example:

######
aVec <- c(1, 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5)
print(aVec)
str(aVec)
mAvec <- mean(aVec)

aVecF <- factor(aVec)
print(aVecF)
str(aVecF)
print(levels(aVecF))
mAvecF <- mean(aVecF) ## NA

print(levels(aVecF)[aVecF])
print(as.numeric(levels(aVecF)[aVecF]))
mAvecFnum <- mean(as.numeric(levels(aVecF)[aVecF]))

print(mAvec)
print(mAvecF) ## NA
print(mAvecFnum)

all.equal(mAvec, mAvecFnum)
######

If you run the code above, you'll see that we eventually *did* get back the
correct value for the mean, but it required a lot of work to do it.

Matt Espe

Apr 28, 2016, 4:59:57 PM
to Davis R Users' Group
Hi,

This catches lots of people. When R stores something as a factor, it takes the values (numbers, characters, whatever) and maps them to integers. If you convert with as.numeric, it converts the integer, not the value that was mapped to that integer. A workaround is to convert it from a factor to a character, and then from character to numeric.

data$age <- as.numeric(as.character(data$age))

However, the question you should be asking is why R imported an integer vector as a factor (typically reserved for categorical data such as character strings). From a quick glance, you have some non-integer values in your data. R will coerce those to NA values, which may or may not be what you want.

I would suggest digging into this a bit further before applying the above. The ideal case is that your age column only contains integer values, in which case R will do the correct thing and no conversion will be needed. Next best is that you want all the non-integer values to be NA, and R's automatic coercion does the right thing when you try the above. Worst case is that you do not want the non-integer values to be NAs, in which case the above will make a big headache for you.
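
To make the last two cases concrete, here is a minimal sketch. The `age` vector below is made-up sample data that only imitates the column in the question, and recoding "99 or older" to 99 is an assumption about what an analysis might want, not a recommendation.

```r
# Hypothetical sample imitating the age column: a blank and a "99 or older" label.
age <- factor(c("66", "45", "", "99 or older", "19"))

# Case 1: non-numeric levels simply become NA.
# as.numeric() warns about the coercion, so the warning is silenced here.
age_na <- suppressWarnings(as.numeric(as.character(age)))
print(age_na)

# Case 2: keep "99 or older" as 99 (an assumption), and make blanks explicit NAs.
age_chr <- as.character(age)
age_chr[age_chr == "99 or older"] <- "99"
age_chr[age_chr == ""] <- NA
age_99 <- as.numeric(age_chr)
print(age_99)
```

Comparing `age_na` and `age_99` shows which rows would silently turn into NA if the text labels are not handled first.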

Matt

Gouri Shankar Mishra

Apr 28, 2016, 7:22:04 PM
to Davis R Users' Group
Matt, this works great. Thanks. Would you be able to explain why the intermediate step of as.character is needed? I am trying to understand what these functions are doing.

Duncan Temple Lang

Apr 28, 2016, 9:31:16 PM
to davi...@googlegroups.com
Hi

as.integer(as.character(data$age))

is the recommended approach, rather than

as.integer(levels(data$age)[data$age])

i.e. turn the factors into characters which are the values from the levels
and then convert them to numbers.

Ideally, read the data from file by providing a value for the colClasses argument and specifying
the anticipated types for each column, if this is possible. But either approach works.
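
A minimal sketch of the colClasses idea follows. The inline `csv_text` is a made-up stand-in for data.csv, and reading age as "character" rather than "integer" is an assumption made here because the file contains text such as "99 or older".

```r
# Hypothetical inline CSV standing in for data.csv.
csv_text <- 'seq_num,age
1,66
2,
3,99 or older
4,19'

# Declare the column types up front so age is never turned into a factor.
dat <- read.csv(text = csv_text,
                colClasses = c("integer", "character"))
str(dat)
```

With the column held as plain text, the blank and "99 or older" entries stay visible and can be handled deliberately before converting to numbers.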


What is happening is that
typeof( data$age )
is integer. So as.integer(data$age) just gives the integer values.
But these values are, as Mike's code suggests, the indices/positions
of the set of unique values that make up the factor, i.e., the levels.
These integer values in a factor are always 1, 2, 3, up to the total number of unique elements.
So if you have ages 17, 18, ... then these will be mapped to 1, 2, 3, ...

This isn't crazy. We exploit this integer representation in R and so it is very convenient,
but it makes transforming actual values to numbers one step more involved.
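
As a small illustration of the offset seen in the question, here is a sketch with a hypothetical set of levels (a blank, then "18" through "70", then "99 or older"); the real data has 76 levels, so the exact level set here is an assumption.

```r
# Hypothetical levels: position 1 is the blank, position 2 is "18", and so on.
lv <- c("", as.character(18:70), "99 or older")
age <- factor(c("66", "45", "19"), levels = lv)

codes <- as.integer(age)                # positions within levels(age)
vals  <- as.integer(as.character(age))  # the ages themselves
via_levels <- as.integer(levels(age)[age])

print(codes)
print(vals)
print(vals - codes)
```

Because the blank occupies position 1 and "18" occupies position 2, every code is 16 less than the age it stands for.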

D.
--
Director, Data Sciences Initiative, UC Davis
Professor, Dept. of Statistics, UC Davis

http://datascience.ucdavis.edu
http://www.stat.ucdavis.edu/~duncan

Matt Espe

May 2, 2016, 1:25:07 PM
to Davis R Users' Group
Basically, R is storing a factor in two tables, one that contains a vector of integer values and another that contains the levels. So if you have this vector:

> vals<- c("A","B","C","A")

If you turn that into a factor, R stores it this way. 

> ff <- as.factor(vals)
> ff
[1] A B C A
Levels: A B C
> typeof(ff)
[1] "integer"

Despite the fact that the character vector and the factor display for you very similarly when printed to the console, R is seeing the factor as an integer vector (one reason why class(), str(), and typeof() commands are so helpful - you need to know how the computer is seeing your data, not how it gets displayed for you). 

This is why going straight to integer does not work.
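
One way to look at both pieces directly is to strip the factor class with unclass(). This is just a sketch using the same small vector as above.

```r
vals <- c("A", "B", "C", "A")
ff <- as.factor(vals)

# unclass() drops the "factor" class and shows the underlying integer
# vector together with its "levels" attribute.
print(unclass(ff))

codes  <- as.integer(ff)   # the stored integer codes
labels <- levels(ff)       # the lookup table of level labels
print(codes)
print(labels)

# Indexing the labels by the codes recovers the original values.
print(labels[codes])
```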

Hope that helps,

Matt 