Troubleshooting performance of a simple anagram program

331 views
Skip to first unread message

Marcus Gartner

unread,
May 25, 2015, 3:07:56 PM5/25/15
to elixir-l...@googlegroups.com
I'm just starting to mess around with Elixir and threw together a simple program to find groups of words in /usr/share/dict/words that are anagrams of each other, with the requirement that they have at least 4 characters, and 4 anagrams.

The compiled version of this takes about ~8 seconds to complete. I wrote two other programs that do the same thing in Ruby and Go, and they both take around 2 seconds. I'm assuming I'm doing something stupid which is leading to the difference in runtime performance (along with some other non-idiomatic Elixir things). Can anyone spot my problem areas? Stylistic feedback also welcome. How would a more veteran Elixir-ist write a program to do this?

defmodule Anagrams do


 
def main(args) do
    path
= hd(args)
    file
= File.open!(path, [:read])
    words
= gather_words(file, 4)
   
print(words, 4)
 
end


 
def gather_words(file, min) do
   
Enum.reduce IO.stream(file, :line), HashDict.new, fn line, word_map ->
      word
= String.strip line
      key
= Enum.sort to_char_list(word)
     
if String.length(word) >= min do
       
case word_map[key] do
         
nil -> Dict.put(word_map, key, [word])
          list
-> Dict.put(word_map, key, [word | list])
       
end
     
else
        word_map
     
end
   
end
 
end


 
def print(words, min) do
   
Enum.each Dict.keys(words), fn key ->
     
if length(words[key]) >= min, do: IO.puts Enum.join words[key], " "
   
end
 
end


end



José Valim

unread,
May 25, 2015, 3:27:31 PM5/25/15
to elixir-l...@googlegroups.com
Here are some changes that made it twice faster on my machine:


1. Use File.stream! instead of File.open! + IO.read(..., :line). Typically, Elixir/Erlang favors distribution which means File.open/2 is going to open up a process (lightweight thread of execution) so you can send it across nodes. The downside is that every operation has the overhead of going through this process. This is not an issue in many cases but in tight loops like the example above they are. Using File.stream! solves this issue because the "file descriptor" is never exposed, so we can optimize and avoid creating processes besides using other options like :read_ahead which are useful on read only modes (plus the code is cleaner).

2. Use Dict.update/4 instead of traversing the Dict twice

There are other things to consider:

1. In contrast to Ruby, all operations in the String module are Unicode based. So when you call String.strip/1, it looks up all whitespace characters in the unicode codebase. This is usually the safest way to do such string operations but it may add a bit of overhead when comparing to languages that do not perform this operation

2. Do perform a protocol consolidation. In order to allow code reloading in development, protocol dispatch may introduce some overhead. You can compile protocols by compiling or running your escript with MIX_ENV=prod:

    $ MIX_ENV=prod mix escript.build

Or:

    $ MIX_ENV=prod mix run -e "Anagrams.main(System.argv)" -- path/to/file

After performing protocol consolidation and my fixes above, it took 3s on my old Macbook Pro from the original 11s when processing  /usr/share/dict/words.



José Valim
Skype: jv.ptec
Founder and Lead Developer

--
You received this message because you are subscribed to the Google Groups "elixir-lang-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-ta...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-talk/3fd9624e-3c48-4465-91f1-c37dfdf85ede%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Rich Morin

unread,
May 25, 2015, 3:44:41 PM5/25/15
to elixir-l...@googlegroups.com
On May 25, 2015, at 12:27, José Valim wrote:
> 1. In contrast to Ruby, all operations in the String module are
> Unicode based.

Perhaps this should be an option, rather than a feature. That is,
the module could provide fast paths for 7-bit ASCII. Discuss...

-r

--
http://www.cfcl.com/rdm Rich Morin r...@cfcl.com
http://www.cfcl.com/rdm/resume San Bruno, CA, USA +1 650-873-7841

Software system design, development, and documentation


Marcus Gartner

unread,
May 25, 2015, 3:57:06 PM5/25/15
to elixir-l...@googlegroups.com, jose....@plataformatec.com.br
Thanks for the helpful insight!

Saša Jurić

unread,
May 25, 2015, 3:59:17 PM5/25/15
to elixir-l...@googlegroups.com, jose....@plataformatec.com.br
In addition, this part can be improved. Transforming it to:

Enum.each words, fn {key, value} ->
  if length(value) >= min, do: IO.puts Enum.join value, " "
end

shaved off about a second (though I didn't bother to consolidate protocols, so maybe it's not so problematic after that). 

This can be further optimized if you count the number of words on the fly (eliminating the need to calculate length). Enum.join might be eliminated in the same fashion. Those interventions didn't yield significant improvements though.

Also (though not relevant with my dictionary), you could move character sorting under the if condition and test the length of the original word instead.

José Valim

unread,
May 25, 2015, 4:34:30 PM5/25/15
to elixir-l...@googlegroups.com
On May 25, 2015, at 12:27, José Valim wrote:
> 1. In contrast to Ruby, all operations in the String module are
>    Unicode based.

Perhaps this should be an option, rather than a feature.  That is,
the module could provide fast paths for 7-bit ASCII.  Discuss...

The binary module in Erlang already exists for this purpose. I would prefer the default behaviour (String) to be correct rather than fast. You can always fallback to :binary.

Reply all
Reply to author
Forward
0 new messages