Destination file name based on field content

Henry Molina

Dec 18, 2012, 12:54:20 PM
to activewareh...@googlegroups.com
Hi All,

Is there a way to set the destination "variable"?

Ex:
source :in, {
  :file => "/tmp/in.dat",
  :parser => {
    :name => :csv,      
    :options => {
      :col_sep => "\t"
    }
  }
}, [:date, :id, :val]

destination :out, {
  :file => "/tmp/" + :date.to_s, # here :date is coming from the input row
  :append => true
},
{
  :order => [:id, :val]
}

In other words, I want to split the input file into as many files as there are distinct dates in it.

Is it possible?

Regards,

Henry

Thibaut Barrère

Dec 21, 2012, 5:16:38 AM
to activewareh...@googlegroups.com
Hello Henry,

Is there a way to set the destination "variable"?

destination :out, {
  :file => "/tmp/" + :date.to_s, # here :date is coming from the input row
  :append => true
},
{
  :order => [:id, :val]
}
In other words, I want to split the input file into as many files as there are distinct dates in it.
Is it possible?

Since the control file is evaluated once at load time, what you proposed won't work, but I definitely understand what you mean!

There is no built-in way to do that, but I can suggest several ways to implement this.

A first possibility is to remove the destination and instead use a before_write block that does what you want in pure Ruby, as a quick work-around (I would probably try this first):

before_write do |row|
  # build a file name from the row's date value
  filename = File.join("/my-folder", "etl-output-file-#{row[:your_date_column]}")
  # append the row (puts adds the trailing newline)
  File.open(filename, "a") { |f| f.puts [row[:id], row[:val]].map(&:to_s).join(',') }
  nil # returning nil removes the row from the pipeline
end

Please note that if you need proper CSV escaping around :val and :id, you'll have to use a real CSV output instead, and also that this re-opens the file for each row, so it may not be fast enough for a large volume (but it could still be enough in your case!).
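
If you do need that escaping, here is a minimal variant of the same work-around using Ruby's standard CSV library - CSV.generate_line handles the quoting and appends the newline (the column names are just the placeholders from above):

require 'csv'

before_write do |row|
  filename = File.join("/my-folder", "etl-output-file-#{row[:your_date_column]}.csv")
  # CSV.generate_line quotes :id/:val as needed and terminates the line
  File.open(filename, "a") { |f| f << CSV.generate_line([row[:id], row[:val]]) }
  nil # remove row from pipeline
end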

Another possibility is to derive from the current file destination and code your own destination supporting that scenario, possibly keeping many files open at once, or requiring a sort beforehand so it can split on day changes; see the sketch below.
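
For reference, a rough plain-Ruby sketch of the "many files open at once" idea - the DateSplitWriter class and its method names are hypothetical, not the actual activewarehouse-etl destination API, so check the file destination source for the real hooks to override:

class DateSplitWriter
  def initialize(dir)
    @dir = dir
    @files = {} # one open handle per date value
  end

  def write(row)
    key = row[:date].to_s
    # open (and keep open) one file per distinct date
    @files[key] ||= File.open(File.join(@dir, "etl-output-#{key}.csv"), "a")
    @files[key].puts([row[:id], row[:val]].join(','))
  end

  def close
    @files.each_value(&:close)
  end
end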

A last possibility is to use a postprocess block to cut the output file into many pieces with a separate tool, using sed/awk or similar. The pro is that it will be much faster; the con is that you'll have to make sure it relies on correct assumptions about the file format (ie: update the sed/awk command if your first column becomes the second etc! Or use headers...). A sketch follows.
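
To illustrate, the core of that approach is a single awk pass. This sketch assumes the destination was changed to write the date as the first column and that the separator is a tab - both assumptions, so adjust to your real output format - and you would run it from a postprocess step or simply after the ETL job:

# route each output line to a file named after its first (date) column;
# /tmp/out.dat and the tab separator are assumptions about your setup
system(%q{awk -F'\t' '{ print > ("/tmp/out-" $1 ".dat") }' /tmp/out.dat})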

Hope this helps - and in any case, let me know what you ended up doing, out of curiosity :-)

Thibaut
--
