Performance of string comparison vs. char list

492 views
Skip to first unread message

Louis Simoneau

unread,
Jul 6, 2014, 2:55:45 AM7/6/14
to elixir-l...@googlegroups.com
Hi all,

Was just working through a problem in Elixir that required comparing characters at various points in a string (using ==). My initial implementation used String.split to break the string into a list, but this seemed slow. When I profiled the code, it seemed a lot of time was being spent in :re.run (and also some in :re.loopexec, :re.do_slit, :re.forward, etc.). Changing the initial list construction to use String.to_char_list eliminates the :re.run calls from the profile, and results in a roughly 3x performance boost for my code.

I'm not familiar with Erlang, or with language design in general, but this seems unusual to me. I'm curious as to why this would be? Why does Elixir need to run a regex to determine if two strings are equal?

Thanks, and sorry if it's a dumb question!

José Valim

unread,
Jul 6, 2014, 4:34:04 AM7/6/14
to elixir-l...@googlegroups.com
Louis, why are you using String.split?

== is going to be a fast operation, the slowing down is likely coming from String.split, that splits on *all* whitespace as specified by Unicode. If you post the specific part of the problem, we can try to identify how to make it better.



José Valim
Skype: jv.ptec
Founder and Lead Developer


--
You received this message because you are subscribed to the Google Groups "elixir-lang-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-ta...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Louis Simoneau

unread,
Jul 6, 2014, 8:26:49 AM7/6/14
to elixir-l...@googlegroups.com, jose....@plataformatec.com.br
Hi José,

Thanks for your reply. Sorry if my original question was unclear, I realize I didn't fully explain what I was trying to accomplish. For the purposes of the problem, I'm trying to compare individual characters of a very long string with other characters to find repeating sequences. The String.split is only performed once, to make a list of one-character strings to scan, but then == is run a lot.

Changing the split to a to_char_list makes it much faster, because now it's comparing chars instead of one-character strings. That's not so surprising, I guess, but I was surprised to see regex functions in the profile since I wasn't doing any regex matching. And there are other operations performed recursively as well, like appending to a tuple of matches, but those don't eat up nearly as much time as the == on the one-character strings.

FWIW, I'm writing a solution to this exercise reimplementing the preprocessing step of the Knuth-Morris-Pratt algorithm. My original solution used lists and took about 4 minutes to process a 100,000 character string, mostly because of Enum.at calls. After changing to use Tuples, it was still quite slow (around 30s), and finally changing the initial step to make a char list it takes about 10 seconds. Here's the current version of my code (I've just started learning Elixir so I'm sure there are heaps of things I could be doing better):

defmodule Kmp do
  def failure_array(input) do
    seq = Fasta.parse(input) |> Enum.at(0) |> elem(1) |> String.to_char_list |> List.to_tuple
    compute(seq, 2, 0, {-1, 0}) |> Tuple.to_list |> tl
  end

  defp compute(seq, pos, _, array) when pos > size(seq), do: array
  defp compute(seq, pos, cnd, array) do
    cond do
      elem(seq, pos - 1) == elem(seq, cnd) -> compute(seq, pos + 1, cnd + 1, Tuple.insert_at(array, size(array), cnd + 1))
      cnd > 0 -> compute(seq, pos, elem(array, cnd), array)
      true -> compute(seq, pos + 1, cnd, Tuple.insert_at(array, size(array), 0))
    end
  end
end

Alexei Sholik

unread,
Jul 6, 2014, 8:35:45 AM7/6/14
to elixir-l...@googlegroups.com
José, I suppose this could be an issue https://github.com/elixir-lang/elixir/blob/master/lib/elixir/lib/string.ex#L234. Regexes have an overhead regardless of the complexity of the regex. Delegating to codepoints or graphemes would make more sense here.
--
Best regards
Alexei Sholik

Alexei Sholik

unread,
Jul 6, 2014, 9:03:10 AM7/6/14
to elixir-l...@googlegroups.com
This thing looks familiar, Louis. Could you also share your parser and a link to the data you're using? I might try playing with it and see if it can be made faster.

Louis Simoneau

unread,
Jul 6, 2014, 9:39:06 AM7/6/14
to elixir-l...@googlegroups.com


You received this message because you are subscribed to a topic in the Google Groups "elixir-lang-talk" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/elixir-lang-talk/wpACHaYGHMs/unsubscribe.
To unsubscribe from this group and all its topics, send an email to elixir-lang-ta...@googlegroups.com.

José Valim

unread,
Jul 6, 2014, 9:49:28 AM7/6/14
to elixir-l...@googlegroups.com
Thanks for the info Louis!

I think String.split/1 is using a regex to split on graphemes, a native implementation would likely be faster. Btw, what if the algorithm works on bytes? Assuming both inputs are valid UTF8, the result will be valid. Maybe that will be faster than using lists but I am not sure.

José Valim

unread,
Jul 6, 2014, 3:23:46 PM7/6/14
to elixir-l...@googlegroups.com
I have pushed a commit that makes String.split/1 use a Elixir-pure implementation instead of relying on a regex. The end result is great, it is 20x faster for small inputs (tested with a 10 characters but 20 bytes long string) and 100x faster for larger inputs (tested with a 1000 characters but 2000 bytes long string).

Thanks Louis for starting this fruitful discussion!



José Valim
Skype: jv.ptec
Founder and Lead Developer


Reply all
Reply to author
Forward
0 new messages