[R] Fast string comparison

0 views
Skip to first unread message

Ralf B

unread,
Jul 11, 2010, 8:08:24 AM7/11/10
to r-h...@r-project.org
What is the fastest way to compare two strings in R?

Ralf

______________________________________________
R-h...@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Hadley Wickham

unread,
Jul 11, 2010, 8:23:20 AM7/11/10
to Ralf B, r-h...@r-project.org
== ?

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Sharpie

unread,
Jul 11, 2010, 2:37:46 PM7/11/10
to r-h...@r-project.org

Ralf B wrote:
>
> What is the fastest way to compare two strings in R?
>
> Ralf
>

Which way is not fast enough?

In other words, are you asking this question because profiling showed one of
R's string comparison operations is causing a massive bottleneck in your
code? If so, which one and how are you using it?

-Charlie

-----
Charlie Sharpsteen
Undergraduate-- Environmental Resources Engineering
Humboldt State University
--
View this message in context: http://r.789695.n4.nabble.com/Fast-string-comparison-tp2285156p2285409.html
Sent from the R help mailing list archive at Nabble.com.

Ralf B

unread,
Jul 13, 2010, 12:52:57 AM7/13/10
to Sharpie, r-h...@r-project.org
I am asking this question because String comparison in R seems to be
awfully slow (based on profiling results) and I wonder if perhaps '=='
alone is not the best one can do. I did not ask for anything
particular and I don't think I need to provide a self-contained source
example for the question. So, to re-phrase my question, are there more
(runtime) effective ways to find out if two strings (about 100-150
characters long) are equal?

Ralf

Hadley Wickham

unread,
Jul 13, 2010, 1:42:33 AM7/13/10
to Ralf B, r-h...@r-project.org
strings <- replicate(1e5, paste(sample(letters, 100, rep = T), collapse = ""))
system.time(strings[-1] == strings[-1e5])
# user system elapsed
# 0.016 0.000 0.017

So it takes ~1/100 of a second to do ~100,000 string comparisons. You
need to provide a reproducible example that illustrates why you think
string comparisons are slow.

Hadley

--

Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

______________________________________________

Ralf B

unread,
Jul 13, 2010, 7:38:57 AM7/13/10
to Hadley Wickham, r-h...@r-project.org
I see. I did not get these performances since did not directly compare
arrays but run seemingly expensive for-loops to do it iteratively...
:(

R.

Matt Shotwell

unread,
Jul 13, 2010, 9:24:53 AM7/13/10
to Hadley Wickham, r-h...@r-project.org
On Tue, 2010-07-13 at 01:42 -0400, Hadley Wickham wrote:
> strings <- replicate(1e5, paste(sample(letters, 100, rep = T), collapse = ""))
> system.time(strings[-1] == strings[-1e5])
> # user system elapsed
> # 0.016 0.000 0.017
>
> So it takes ~1/100 of a second to do ~100,000 string comparisons. You
> need to provide a reproducible example that illustrates why you think
> string comparisons are slow.

Here's a vectorized alternative to '==' for strings, with minimal
argument checking or result conversion. I haven't looked at the
corresponding R source code, it may be similar:

library(inline)
code <- "
SEXP ans;
int i, len, *cans;
if(!isString(s1) || !isString(s2))
error(\"invalid arguments\");
len = length(s1)>length(s2)?length(s2):length(s1);
PROTECT(ans = allocVector(INTSXP, len));
cans = INTEGER(ans);
for(i = 0; i < len; i++)
cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
CHAR(STRING_ELT(s2,i)));
UNPROTECT(1);
return ans;
"
sig <- signature(s1="character", s2="character")
strcmp <- cfunction(sig, code)


> system.time(strings[-1] == strings[-1e5])

user system elapsed
0.036 0.000 0.035
> system.time(strcmp(strings[-1], strings[-1e5]))
user system elapsed
0.032 0.000 0.034

That's pretty fast, though I seem to be working with a slower system
than Hadley. It's hard to see how this could be improved, except maybe
by caching results of string comparisons.

-Matt

Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com

Romain Francois

unread,
Jul 13, 2010, 9:48:18 AM7/13/10
to Matt Shotwell, r-h...@r-project.org, Hadley Wickham

Hi Matt,

I think there are some confusing factors in your results.


system.time(strcmp(strings[-1], strings[-1e5]))

would also include the time required to perform both subscripting
(strings[-1] and strings[-1e5] ) which actually "takes some time".


Also, you do have a bit of overhead due to the use of STRING_ELT and the
write barrier.


I've include below a version that uses R internals so that you get the
fast (but you have to understand the risks, etc ...) version of
STRING_ELT using the plugin system of inline.

library(inline)
code <- "
SEXP ans;
int i, len, *cans;
if(!isString(s1) || !isString(s2))
error(\"invalid arguments\");
len = length(s1)>length(s2)?length(s2):length(s1);
PROTECT(ans = allocVector(INTSXP, len));
cans = INTEGER(ans);
for(i = 0; i < len; i++)
cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
CHAR(STRING_ELT(s2,i)));
UNPROTECT(1);
return ans;
"
sig <- signature(s1="character", s2="character")
strcmp <- cfunction(sig, code)

strings <- replicate(1e5, paste(sample(letters, 100, rep = T), collapse
= ""))


lhs <- strings[-1]
rhs <- strings[-1e5]
system.time( lhs == rhs )
system.time(strcmp( lhs, rhs) )

library(inline)
settings <- getPlugin( "default" )
settings$includes <- paste( "#define USE_RINTERNALS", settings$includes,
collapse = "\n" )
code2 <- "


SEXP ans;
int i, len, *cans;
if(!isString(s1) || !isString(s2))
error(\"invalid arguments\");
len = length(s1)>length(s2)?length(s2):length(s1);
PROTECT(ans = allocVector(INTSXP, len));
cans = INTEGER(ans);
for(i = 0; i < len; i++)
cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
CHAR(STRING_ELT(s2,i)));
UNPROTECT(1);
return ans;
"
sig <- signature(s1="character", s2="character" )

strcmp2 <- cxxfunction(sig, code2, settings = settings)
system.time(strcmp2( lhs, rhs) )

I get:

$ Rscript strings.R
Le chargement a nécessité le package : methods
utilisateur système écoulé
0.002 0.000 0.002
utilisateur système écoulé
0.004 0.000 0.005
utilisateur système écoulé
0.003 0.000 0.003

Romain


Le 13/07/10 15:24, Matt Shotwell a écrit :


--
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://bit.ly/bc8jNi : Rcpp 0.8.4
|- http://bit.ly/dz0RlX : bibtex 0.2-1
`- http://bit.ly/a5CK2h : Les estivales 2010

Matt Shotwell

unread,
Jul 13, 2010, 11:37:22 AM7/13/10
to Romain Francois, r-h...@r-project.org, Hadley Wickham
Good idea Romain, there is quite a bit of type testing in the function
versions of STRING_ELT and CHAR, not to mention the function call
overhead. Since the types are checked explicitly, I believe this
function is safe. All together now...

> system.time(strings[-1] == strings[-1e5])
user system elapsed

0.032 0.000 0.035

> system.time(strcmp(strings[-1], strings[-1e5]))
user system elapsed
0.032 0.000 0.034

> system.time(strcmp2(strings[-1], strings[-1e5]))
user system elapsed
0.024 0.000 0.026
> system.time(lhs==rhs)
user system elapsed
0.012 0.000 0.013
> system.time(strcmp(lhs, rhs))
user system elapsed
0.012 0.000 0.011
> system.time(strcmp2(lhs, rhs))
user system elapsed
0.004 0.000 0.004

I looks like you can squeeze out more speed using the macro versions of
STRING_ELT and CHAR.

Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com

______________________________________________

Reply all
Reply to author
Forward
0 new messages