There are a couple of things to consider:
1. File.iterator! reads the file line by line, using a function and an accumulator. This is very useful for reading long files without loading them entirely into memory. If the file is small, though, you may get better performance by loading it all into memory at once;
2. You are looping over the file twice because Enum.sort does not accept an ordering function. I am glad to say this was fixed on master, and that could speed things up a bit;
3. File.iterator! is also faster on master. The current code loops over each binary twice in order to remove newlines, which it does incorrectly;
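For comparison, here is how the trade-off in point 1 looks in present-day Elixir, where File.iterator! has since become File.stream! — the module name and file path below are just illustrative:

```elixir
defmodule LongestWord do
  # Lazy: reads the file line by line, so memory use stays constant
  # no matter how large the file is.
  def streamed(path) do
    path
    |> File.stream!()
    |> Stream.map(&String.trim_trailing(&1, "\n"))
    |> Enum.max_by(&byte_size/1)
  end

  # Eager: one read, then split in memory. Often faster for small
  # files, at the cost of holding the whole file in RAM.
  def in_memory(path) do
    path
    |> File.read!()
    |> String.split("\n", trim: true)
    |> Enum.max_by(&byte_size/1)
  end
end
```

Both return the same answer; the difference is only in how much of the file is resident in memory at any point.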
So there are a few things that could be improved, but I don't think it will get much faster by default. Why is that?
Every time you open up a file in Erlang/Elixir, a new Erlang/Elixir process is created that is responsible for managing that file. This allows you to pass files between nodes, monitor them, talk to them asynchronously, and so on. The flip side is that every time you read a line from that file, you are sending a process message. Sending messages is very fast, but it is still overhead you pay in exchange for those features.
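You can see this process-per-file design directly. A quick sketch in current Elixir (the file name is made up for the demo) checks what File.open hands back with and without :raw:

```elixir
File.write!("demo.txt", "hello\n")

# Default open: the file is owned by a dedicated process, and every
# read is a message round-trip to that process.
{:ok, device} = File.open("demo.txt", [:read])
IO.inspect(is_pid(device))  # true

# With :raw there is no middleman process: operations go straight to
# the file driver, skipping message passing (but losing the
# cross-node and monitoring features mentioned above).
{:ok, raw} = File.open("demo.txt", [:read, :raw])
IO.inspect(is_pid(raw))     # false

File.close(device)
File.close(raw)
File.rm!("demo.txt")
```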
If you want to bypass those features, you can probably squeeze out better performance. For example, you can open the file with the [:raw] option, which does not wrap the file in a process and treats it as a raw binary. Assuming you are running Elixir master, we could try this code out:
File.biniterator!(filepath, [:raw]) />
Enum.sort(fn(w1, w2) -> size(w1) > size(w2) end) />
Enum.take(10)
Note: the pipe operator left /> right(...args) translates to right(left, ...args).
With this, the runtime went from 10_275_166 to 4_186_974 microseconds on my machine. :)
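For readers on modern Elixir: /> later became |>, size/1 on binaries became byte_size/1, and while File.biniterator! has no direct successor, File.stream! covers the same ground. A rough present-day equivalent of the snippet above, timed with :timer.tc, might look like this (the sample file is created inline just for the demo):

```elixir
File.write!("words.txt", "hello\nhi\nwonderful\nelixir\n")  # sample data

ten_longest = fn path ->
  path
  |> File.stream!()                               # lazy, line by line
  |> Stream.map(&String.trim_trailing(&1, "\n"))  # drop the newline
  |> Enum.sort(fn w1, w2 -> byte_size(w1) > byte_size(w2) end)
  |> Enum.take(10)
end

# :timer.tc returns {elapsed_microseconds, result}
{micros, words} = :timer.tc(fn -> ten_longest.("words.txt") end)
IO.inspect(words)  # ["wonderful", "elixir", "hello", "hi"]
IO.puts("took #{micros}us")
```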
Not really. Erlang would force you to loop over the file and extract words manually, which might give you some tiny performance gains. You could also hand-roll that loop in Elixir, but I'd prefer the convenience of an iterator 99 times out of 100. Either way, the results would be roughly the same.