Idiomatic Elixir for transforming streams?


Alan Kucheck

Jan 5, 2015, 12:32:50 AM
to elixir-l...@googlegroups.com
Hello:

[this is going to be a little long-winded, but I'm hoping the format will assist other Elixir newbies]

I'm just getting started with Elixir and trying to understand the "Elixir way" of doing things.  The immediate task at hand is a text file transformer.  Ideally, I want the final code to look conceptually like this, the focus being on readability:

File.stream!(inFile)
|> Enum.map( transform1 )
|> Enum.map( transform2 )
|> Enum.map( transform3 )
|> IO.write

I started with this, the code failing as noted in the comment:

inFile  = "./sample.tsv"
File.stream!(inFile)                         # iterates by :line, by default
|> Enum.map( String.replace("\t", ",") )     # fails with - wrong arity
|> IO.write

After some experimentation, I found that using the capture syntax fixed the arity problem, though I'm not at all clear on why [in IEx the same String.replace/3, getting the 1st arg from the pipeline works fine...]:

File.stream!(inFile)
#   |> Enum.map( String.replace("\t", ",") )           # fails with - wrong arity
|> Enum.map( &(String.replace(&1, "\t", ",")) )    # works!
|> IO.write
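
A minimal sketch of the difference, using an in-memory line instead of the file (the `line` value is made up for illustration). In a pipeline, `|>` supplies the first argument, so `String.replace("\t", ",")` becomes a complete three-argument call. Inside `Enum.map(...)` nothing supplies that argument, so the same text is a call to a two-argument `String.replace` that does not exist, which is most likely where the arity error comes from. `Enum.map/2` wants a function value there, and the capture builds one.

```elixir
line = "a\tb\tc\n"

# |> inserts `line` as the first argument: this is String.replace/3.
piped = line |> String.replace("\t", ",")

# Enum.map/2 needs a function as its second argument. The capture is
# shorthand for: fn s -> String.replace(s, "\t", ",") end
mapped = Enum.map([line], &String.replace(&1, "\t", ","))

IO.inspect(piped)
IO.inspect(mapped)
```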

Although, one has to be careful about where that 1st paren goes:

File.stream!(inFile)
#   |> Enum.map( &(String.replace(&1, "\t", ",")) )    # works!
|> Enum.map (&(String.replace(&1, "\t", ",")) )    # fails with - protocol String.Chars not impl for #Function<0.33035400...
|> IO.write

Then, hoping to avoid the capture syntax, I created some helper functions [attached] that have an arity of /1 and tried calling those instead of the String.replace/3, but that failed, the same way my first attempt had:

import StringUtils

File.stream!(inFile)
#   |> Enum.map( &(String.replace(&1, "\t", ",")) )    # works!
|> Enum.map( replaceTabWithComma ) # fails with - wrong arity
|> IO.write

Once again, capture syntax to the rescue:

File.stream!(inFile)
#   |> Enum.map( replaceTabWithComma ) # fails with - wrong arity
|> Enum.map &(replaceTabWithComma(&1)) # works!
|> IO.write

But as soon as you add a 2nd transform, you get a different failure:

File.stream!(inFile)
|> Enum.map &(replaceTabWithComma(&1)) # works!
|> Enum.map &(trimInstr(&1))           # fails with - nested captures via & are not allowed: &(trimInstr(&1) |> IO.write())
|> IO.write

It doesn't look nested *to me*, and this failure does *not* occur if using multiple calls to the original String.replace/3 function.  Hmmm.  So, I tried capturing the helper functions, invoking them as above:

tabsToCommas = &replaceTabWithComma/1
shortenInstr = &trimInstr/1
updateYear   = &chgYear/1

File.stream!(inFile)
|> Enum.map( &(tabsToCommas.(&1)) )
|> Enum.map( &(shortenInstr.(&1)) )
|> Enum.map( &(updateYear.(&1)) )
|> IO.write

Ok...that all works...still unclear why the capture syntax is there...what if I remove it?

File.stream!(inFile)
|> Enum.map( tabsToCommas )
|> Enum.map( shortenInstr )
|> Enum.map( updateYear )
|> IO.write

Woo-hoo!  It works, and looks just how I wanted it to. Now, I clearly need to read the Elixir sources and see if I can figure out why!  The final code is attached: transform.ex
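
A hedged sketch of why the last version works. A variable bound with `&name/1` already holds a function value, so it can be handed to `Enum.map/2` as-is, and wrapping it as `&(f.(&1))` only builds a second function that forwards to the first. A bare name such as `replaceTabWithComma` is not a variable. As far as I understand the Elixir of that era, it was read as a call with zero arguments, hence the arity error. The module `StringUtilsSketch` below is a hypothetical stand-in for the attached StringUtils.ex, which is not shown in the thread.

```elixir
defmodule StringUtilsSketch do
  # Hypothetical helper; the real one lives in the attached StringUtils.ex.
  def replace_tab_with_comma(line), do: String.replace(line, "\t", ",")
end

# Capturing a named function yields an ordinary one-argument function value.
tabs_to_commas = &StringUtilsSketch.replace_tab_with_comma/1

lines = ["x\ty\n"]

# Passing the variable directly...
direct = Enum.map(lines, tabs_to_commas)

# ...gives the same result as wrapping it in another capture.
wrapped = Enum.map(lines, &tabs_to_commas.(&1))

IO.inspect(is_function(tabs_to_commas, 1))
IO.inspect(direct == wrapped)
```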

So, finally, the question:  Although presumably without the circuitous route, would the final version be considered "idiomatic Elixir"? Is there a better way to accomplish this?

thanks,

ak

sample.tsv
StringUtils.ex
transform.ex

José Valim

Jan 5, 2015, 3:52:08 AM
to elixir-l...@googlegroups.com
Hey Alan!

I would write it as:

File.stream!(in_file)
|> Enum.map(tabs_to_commas)
|> Enum.map(shorten_instr)
|> Enum.map(update_year)
|> IO.write

Variables in snake_case and no space around the parens.

Btw, the reason your examples failed is precedence:

Enum.map &tabs_to_comma(&1) |> Enum.map

is the same as:

Enum.map(&tabs_to_comma(&1) |> Enum.map)

Which is the same as:

Enum.map(Enum.map(&tabs_to_comma(&1)))

The same happens if you write this:

Enum.map (&tabs_to_comma(&1)) |> Enum.map

As the parens are wrapping "&tabs_to_comma(&1)" rather than applying to Enum.map/2.
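
A small sketch of the fix that follows from this: when the parentheses sit directly after `Enum.map`, with no space, each capture stays inside its own call and the next `|>` applies to the result of that call. The module `PrecedenceDemo` and its two helpers are hypothetical stand-ins for the functions in the thread.

```elixir
defmodule PrecedenceDemo do
  # Hypothetical helpers, for illustration only.
  def tabs_to_commas(line), do: String.replace(line, "\t", ",")
  def trim_line(line), do: String.trim_trailing(line)
end

lines = ["a\tb  \n", "c\td\n"]

# Each capture is an argument of exactly one Enum.map call, so no
# capture ends up containing another capture or the rest of the pipeline.
result =
  lines
  |> Enum.map(&PrecedenceDemo.tabs_to_commas/1)
  |> Enum.map(&PrecedenceDemo.trim_line/1)

IO.inspect(result)
```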




José Valim
Skype: jv.ptec
Founder and Lead Developer

--
You received this message because you are subscribed to the Google Groups "elixir-lang-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-ta...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Saša Jurić

Jan 5, 2015, 4:38:31 AM
to elixir-l...@googlegroups.com
In addition to José's remarks, I'd skip using temp variables, and pass lambdas directly to Enum functions:

File.stream!(inFile) 
|> Enum.map(&replaceTabWithComma/1)     
|> Enum.map(&trimInstr/1)     
|> Enum.map(&chgYear/1)    
|> IO.write

As José mentioned, snake (underscore) case would be more idiomatic.

Finally, I'd probably use streams to reduce the number of passes and allocated memory. I'd also print one line at a time as it is transformed. This should significantly reduce the amount of allocated memory for large files, as you don't need to keep the entire file contents in memory. The code would look something like:

File.stream!(inFile) 
|> Stream.map(&replaceTabWithComma/1)     
|> Stream.map(&trimInstr/1)     
|> Stream.map(&chgYear/1)    
|> Enum.each(&IO.puts/1)

Disclaimer: didn't try it out, so there might be some errors.
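
A minimal sketch of the lazy behaviour described above, using an in-memory list in place of `File.stream!` so it runs without a file. The transforms are placeholders, not the helpers from the thread. One caveat worth checking: lines read by `File.stream!` keep their trailing newline, so if the helpers leave it in place, `IO.write/1` avoids the extra blank line that `IO.puts/1` would add.

```elixir
lines = ["a\tb\n", "c\td\n"]

# Stream.map only describes the work; nothing is transformed yet.
lazy =
  lines
  |> Stream.map(&String.replace(&1, "\t", ","))
  |> Stream.map(&String.upcase/1)

# The stream is a description of the pipeline, not a list of results.
IO.inspect(is_list(lazy))

# Enum.each pulls one element at a time through both transforms,
# so only the current line needs to be held in memory.
Enum.each(lazy, &IO.write/1)

# Materialising the stream, only to show the transformed values.
streamed = Enum.to_list(lazy)
```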

Alan Kucheck

Jan 5, 2015, 12:51:12 PM
to elixir-l...@googlegroups.com, jose....@plataformatec.com.br
José:

Thanks for the explanation about precedence - very helpful!

ak

Alan Kucheck

Jan 5, 2015, 1:01:54 PM
to elixir-l...@googlegroups.com
Saša:

Regarding the choice of passing lambdas directly: IMO, anything that reduces noise is a good thing, and the capture syntax does just that.   Perhaps just a difference in taste...

Regarding the structural change you suggest, to a fully streamed solution: excellent, and thanks!  It does take just over 2x the runtime of my in-memory version, but I will likely have gigantic files, so this is a significant, beneficial change.

If anyone knows of a simple way to monitor the amount of memory that an Elixir script is using, that would be very helpful...

ak

Sasa Juric

Jan 5, 2015, 1:36:57 PM
to elixir-l...@googlegroups.com
The simplest way to monitor memory is to watch the OS process memory while it’s handling a large input.

If you want to see more detail, you could start the observer OTP application with :observer.start, and there you should be able to see all kinds of memory information at the VM level as well as per process.

You can also periodically call :erlang.memory/0,1 to observe memory usage, and also :erlang.process_info/1 to get info about individual processes.
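
A rough sketch of the periodic-call approach. It uses `:erlang.memory/1` with the `:total` key and, as an assumption beyond what is named above, the two-argument `:erlang.process_info/2` to ask only for the `:memory` item of the current process. Both values are in bytes.

```elixir
# Total memory currently allocated by the VM, in bytes.
total_bytes = :erlang.memory(:total)

# Memory used by the calling process, in bytes.
{:memory, process_bytes} = :erlang.process_info(self(), :memory)

IO.puts("VM total:     #{div(total_bytes, 1024)} KiB")
IO.puts("this process: #{div(process_bytes, 1024)} KiB")
```

Calling this before and after the transformation, or from a separate process on a timer, gives a coarse picture of how the in-memory and streamed versions compare.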

