Hi José,
Thanks for your reply. Sorry if my original question was unclear, I realize I didn't fully explain what I was trying to accomplish. For the purposes of the problem, I'm trying to compare individual characters of a very long string with other characters to find repeating sequences. The String.split is only performed once, to make a list of one-character strings to scan, but then == is run a lot.
Changing the split to a to_char_list makes it much faster, because now it's comparing chars instead of one-character strings. That's not so surprising, I guess, but I was surprised to see regex functions in the profile since I wasn't doing any regex matching. And there are other operations performed recursively as well, like appending to a tuple of matches, but those don't eat up nearly as much time as the == on the one-character strings.
FWIW, I'm writing a solution to
this exercise reimplementing the preprocessing step of the Knuth-Morris-Pratt algorithm. My original solution used lists and took about 4 minutes to process a 100,000 character string, mostly because of Enum.at calls. After changing to use Tuples, it was still quite slow (around 30s), and finally changing the initial step to make a char list it takes about 10 seconds. Here's the current version of my code (I've just started learning Elixir so I'm sure there are heaps of things I could be doing better):
defmodule Kmp do
def failure_array(input) do
seq = Fasta.parse(input) |> Enum.at(0) |> elem(1) |> String.to_char_list |> List.to_tuple
compute(seq, 2, 0, {-1, 0}) |> Tuple.to_list |> tl
end
defp compute(seq, pos, _, array) when pos > size(seq), do: array
defp compute(seq, pos, cnd, array) do
cond do
elem(seq, pos - 1) == elem(seq, cnd) -> compute(seq, pos + 1, cnd + 1, Tuple.insert_at(array, size(array), cnd + 1))
cnd > 0 -> compute(seq, pos, elem(array, cnd), array)
true -> compute(seq, pos + 1, cnd, Tuple.insert_at(array, size(array), 0))
end
end
end