Performance of string comparison vs. char list

Louis Simoneau

unread,

Jul 6, 2014, 2:55:45 AM7/6/14

to elixir-l...@googlegroups.com

Hi all,

Was just working through a problem in Elixir that required comparing characters at various points in a string (using ==). My initial implementation used String.split to break the string into a list, but this seemed slow. When I profiled the code, it seemed a lot of time was being spent in :re.run (and also some in :re.loopexec, :re.do_slit, :re.forward, etc.). Changing the initial list construction to use String.to_char_list eliminates the :re.run calls from the profile, and results in a roughly 3x performance boost for my code.

I'm not familiar with Erlang, or with language design in general, but this seems unusual to me. I'm curious as to why this would be? Why does Elixir need to run a regex to determine if two strings are equal?

Thanks, and sorry if it's a dumb question!

José Valim

unread,

Jul 6, 2014, 4:34:04 AM7/6/14

to elixir-l...@googlegroups.com

Louis, why are you using String.split?

== is going to be a fast operation, the slowing down is likely coming from String.split, that splits on *all* whitespace as specified by Unicode. If you post the specific part of the problem, we can try to identify how to make it better.

José Valim

www.plataformatec.com.br

Skype: jv.ptec

Founder and Lead Developer

--
You received this message because you are subscribed to the Google Groups "elixir-lang-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-ta...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Louis Simoneau

unread,

Jul 6, 2014, 8:26:49 AM7/6/14

to elixir-l...@googlegroups.com, jose....@plataformatec.com.br

Hi José,

Thanks for your reply. Sorry if my original question was unclear, I realize I didn't fully explain what I was trying to accomplish. For the purposes of the problem, I'm trying to compare individual characters of a very long string with other characters to find repeating sequences. The String.split is only performed once, to make a list of one-character strings to scan, but then == is run a lot.

Changing the split to a to_char_list makes it much faster, because now it's comparing chars instead of one-character strings. That's not so surprising, I guess, but I was surprised to see regex functions in the profile since I wasn't doing any regex matching. And there are other operations performed recursively as well, like appending to a tuple of matches, but those don't eat up nearly as much time as the == on the one-character strings.

FWIW, I'm writing a solution to this exercise reimplementing the preprocessing step of the Knuth-Morris-Pratt algorithm. My original solution used lists and took about 4 minutes to process a 100,000 character string, mostly because of Enum.at calls. After changing to use Tuples, it was still quite slow (around 30s), and finally changing the initial step to make a char list it takes about 10 seconds. Here's the current version of my code (I've just started learning Elixir so I'm sure there are heaps of things I could be doing better):

defmodule Kmp do
def failure_array(input) do
    seq = Fasta.parse(input) |> Enum.at(0) |> elem(1) |> String.to_char_list |> List.to_tuple
    compute(seq, 2, 0, {-1, 0}) |> Tuple.to_list |> tl
end

defp compute(seq, pos, _, array) when pos > size(seq), do: array
defp compute(seq, pos, cnd, array) do
    cond do
      elem(seq, pos - 1) == elem(seq, cnd) -> compute(seq, pos + 1, cnd + 1, Tuple.insert_at(array, size(array), cnd + 1))
      cnd > 0 -> compute(seq, pos, elem(array, cnd), array)
      true -> compute(seq, pos + 1, cnd, Tuple.insert_at(array, size(array), 0))
    end
end
end

Alexei Sholik

unread,

Jul 6, 2014, 8:35:45 AM7/6/14

to elixir-l...@googlegroups.com

José, I suppose this could be an issue https://github.com/elixir-lang/elixir/blob/master/lib/elixir/lib/string.ex#L234. Regexes have an overhead regardless of the complexity of the regex. Delegating to codepoints or graphemes would make more sense here.

--
Best regards
Alexei Sholik

Alexei Sholik

unread,

Jul 6, 2014, 9:03:10 AM7/6/14

to elixir-l...@googlegroups.com

This thing looks familiar, Louis. Could you also share your parser and a link to the data you're using? I might try playing with it and see if it can be made faster.

Louis Simoneau

unread,

Jul 6, 2014, 9:39:06 AM7/6/14

to elixir-l...@googlegroups.com

Sure Alexei,

Here's a sample input file and the correct output array: https://gist.github.com/lsimoneau/c0380e0bb0e99344a412

And here's the parser I've been using: https://bitbucket.org/lsimoneau/rosalind-elixir/src/ffdbe3fb1f72ee589a99aea6158bed77423368e9/lib/fasta.ex?at=master

You received this message because you are subscribed to a topic in the Google Groups "elixir-lang-talk" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/elixir-lang-talk/wpACHaYGHMs/unsubscribe.
To unsubscribe from this group and all its topics, send an email to elixir-lang-ta...@googlegroups.com.

José Valim

unread,

Jul 6, 2014, 9:49:28 AM7/6/14

to elixir-l...@googlegroups.com

Thanks for the info Louis!

I think String.split/1 is using a regex to split on graphemes, a native implementation would likely be faster. Btw, what if the algorithm works on bytes? Assuming both inputs are valid UTF8, the result will be valid. Maybe that will be faster than using lists but I am not sure.

José Valim

unread,

Jul 6, 2014, 3:23:46 PM7/6/14

to elixir-l...@googlegroups.com

I have pushed a commit that makes String.split/1 use a Elixir-pure implementation instead of relying on a regex. The end result is great, it is 20x faster for small inputs (tested with a 10 characters but 20 bytes long string) and 100x faster for larger inputs (tested with a 1000 characters but 2000 bytes long string).

Thanks Louis for starting this fruitful discussion!

José Valim

www.plataformatec.com.br

Skype: jv.ptec

Founder and Lead Developer

Reply all

Reply to author

Forward