Problems with R can be divided into several groups:
The stringsAsFactors option to read.table() and data.frame() is an example of the first, second, and fourth groups (and, some people would say, the fifth).
Backing up, why do we have strings and factors? There are (at least) three incompatible use cases for things that look superficially the same:
Statistical systems tend to have value labels, plus half-assed support for strings. Programming languages tend to have enumerations and strings (or, in older languages like Fortran, half-assed support for strings).
R had strings (a three-quarter-assed implementation that, like classic C, is just enough to build something useful on top of). R (and S) also had “factors”.
In S, the implementation of factors is as small integers with a ‘levels’ attribute. They could be used (well) as an enumerated type, or (badly) for labels on integer variables. They also had an important secondary use as a data-compression hack for strings with repeated values: each additional copy of “Massachusetts” as a string took a pointer plus 13 bytes plus a terminating NUL for each copy, but as a factor took just the 4 bytes for the integer. Even better, comparison of factor levels was simple integer equality, done in a single clock cycle, but comparison of strings required walking the string byte by byte. Back in the day, this mattered.
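To make the "small integers with a 'levels' attribute" description concrete, here is a minimal sketch in current R. The vector `states` is made up for illustration; the functions are all base R.

```r
# A character vector with repeated values (illustrative data, not from any real dataset)
states <- c("Massachusetts", "Ohio", "Massachusetts", "Texas", "Ohio")

# The factor version: integer codes plus a 'levels' attribute
f <- factor(states)

typeof(f)       # storage type of the underlying vector
levels(f)       # the distinct labels, in sorted order
as.integer(f)   # the codes that index into the levels
unclass(f)      # codes and levels together, with the class stripped off

# Getting the strings back is just indexing the levels by the codes
identical(levels(f)[as.integer(f)], states)
identical(as.character(f), states)
```

Each repeated value costs one integer code, and the text itself lives once in the levels.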
Originally in R, factors were a native type and were unambiguously for enumerations. For compatibility, this was changed in R 0.62. As the NEWS entry says:
o All internal mechanisms to support factors and data.frames have been removed. These are now entirely supported by interpreted code! `is.unordered' has been eliminated. Thanks to John Chambers for allowing the distribution of his StatLib code.
So, for lo these many years we struggled on with read.table() automagically coercing strings to factors, and users painfully coercing them back again, but saving memory. The Right and Proper use of factors was as enumerated types, but not everyone agreed.
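The round trip described above looks roughly like this. It is a sketch with made-up inline data; `stringsAsFactors` is passed explicitly in both calls so the result does not depend on the session's default.

```r
# Illustrative CSV text (hypothetical data)
txt <- "state,count
Massachusetts,3
Ohio,5
Massachusetts,7"

# Same input, read with and without the automatic coercion
with_factors <- read.table(text = txt, header = TRUE, sep = ",",
                           stringsAsFactors = TRUE)
with_strings <- read.table(text = txt, header = TRUE, sep = ",",
                           stringsAsFactors = FALSE)

class(with_factors$state)   # factor
class(with_strings$state)   # character

# The "painful coercion back": as.character() recovers the original strings
back <- as.character(with_factors$state)
identical(back, with_strings$state)
```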
The next change happened in R 2.6.0:
o There is now a global CHARSXP cache, R_StringHash. CHARSXPs are no longer duplicated and must not be modified in place. Developers should strive to only use mkChar (and mkString) for creating new CHARSXPs and avoid use of allocString. A new macro, CallocCharBuf, can be used to obtain a temporary char buffer for manipulating character data. This patch was written by Seth Falcon.
The Bioconductor project needed to store and manipulate really big strings. It’s a bit inefficient to store and compare multiple copies of “Massachusetts”, but it’s a really bad idea to store and compare multiple copies of an entire chromosome.
The new format stored each string once, and then used pointers for copies: memory use was about the same as factors, speed of comparison was about the same for duplicated strings but much faster for unique strings. Bioconductor also introduced a bunch of string tools in packages, in particular so that large numbers of short segments of a ridiculously long string could be handled efficiently.
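A rough way to see the effect of the string cache from the R side is to compare the reported size of a character vector of repeated values with the naive cost of storing every copy separately. This is only a sketch: `object.size()` gives an estimate, the vector length is arbitrary, and on a 64-bit build I would expect the character vector to come out at roughly twice the factor, since each element is an 8-byte pointer rather than a 4-byte integer.

```r
n <- 1e5
s <- rep("Massachusetts", n)   # character vector with one repeated value
f <- factor(s)                 # same data as integer codes plus levels

# Naive cost if every copy were stored separately:
# 13 bytes of text plus a terminating NUL per element (pointers not counted)
naive_bytes <- n * (nchar("Massachusetts") + 1)

size_s <- as.numeric(object.size(s))
size_f <- as.numeric(object.size(f))

c(naive = naive_bytes, character = size_s, factor = size_f)

# Equality comparison is written the same way for either representation
all(s == "Massachusetts")
all(f == "Massachusetts")
```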
As a result of all this development, R now has strings, and it has an enumerated type. It still doesn’t have value labels (there’s some support in packages, but nothing low-level). There’s now no reason to confuse strings and factors, and no reason to automatically assume that non-numeric variables are factors.
Or rather, that would be true if we wiped out all R users and code and started from scratch. Otherwise, as an early opponent of backwards compatibility noted:
“‘Tis the Last judgment’s fire must cure this place,
Calcine its clods and set its prisoners free.”