confused about sinkmode


Andrew Xue

Dec 19, 2012, 9:35:49 PM
to cascadi...@googlegroups.com
Hi -

So I have a Cascalog job which looks at a day's worth of data, basically filters it, and then outputs it to a location in Amazon S3. The idea is that I want a "folder" in S3 to keep accumulating the daily outputs from my job. The way I did this was to modify the OutputFormat so that instead of writing "part-00000" it writes the timestamp of when the job was run as the prefix, so something like "1355970777-00000".

The idea then is that, say, for day 1, the output would look like

s3://mybucket/my_path/timestamp1-00000
s3://mybucket/my_path/timestamp1-00001
s3://mybucket/my_path/timestamp1-00002

etc ..

The sink mode is set to REPLACE, but I figured that was fine since the next time the job ran the filenames would all be different, i.e., the output files would be

s3://mybucket/my_path/timestamp2-00000
s3://mybucket/my_path/timestamp2-00001
s3://mybucket/my_path/timestamp2-00002

etc ..

The idea is that it would not overwrite the previous files.

But instead it seems like the entire dir is deleted the next time the job runs. 

The documentation says:

SinkMode.KEEP

This is the default behavior. If the resource exists, attempting to write to it will fail.

SinkMode.REPLACE

This allows Cascading to delete the file immediately after the Flow is started


I am probably being dense, but I'm not sure what they mean by "resource" in this case. Would the resource be the whole file path, or the folder that the files are going into? What is the "file" in question that can be deleted immediately after the flow is started for REPLACE?

thanks


 

Koert Kuipers

Dec 20, 2012, 11:12:14 AM
to cascadi...@googlegroups.com
cascading writes to a directory and will wipe it before writing output if you use SinkMode.REPLACE.

using SinkMode.KEEP will not work for you either since cascading will detect that the dir is already there and stop, despite the fact that you will write to different files thanks to your modified output format.
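
The behavior described above can be illustrated with a toy model. This is a minimal local-filesystem sketch, not Cascading's actual code: `SinkMode`, `prepare_sink`, and `run_job` are hypothetical stand-ins, and the only assumption taken from the thread is that the "resource" is the output directory as a whole.

```python
import os
import shutil
import tempfile
from enum import Enum


class SinkMode(Enum):
    KEEP = "keep"
    REPLACE = "replace"


def prepare_sink(directory, mode):
    """Toy stand-in for what happens to the sink directory when a flow starts."""
    if os.path.exists(directory):
        if mode is SinkMode.KEEP:
            # KEEP: the existing directory counts as the resource, so the write fails.
            raise FileExistsError(f"sink already exists: {directory}")
        # REPLACE: the whole directory is wiped, not just files with matching names.
        shutil.rmtree(directory)
    os.makedirs(directory)


def run_job(directory, mode, prefix, parts=3):
    """Pretend job: prepares the sink, then writes <prefix>-0000N part files."""
    prepare_sink(directory, mode)
    for i in range(parts):
        with open(os.path.join(directory, f"{prefix}-{i:05d}"), "w") as f:
            f.write("data\n")
    return sorted(os.listdir(directory))


root = tempfile.mkdtemp()
sink = os.path.join(root, "my_path")

day1 = run_job(sink, SinkMode.REPLACE, "timestamp1")
day2 = run_job(sink, SinkMode.REPLACE, "timestamp2")
print(day1)  # only timestamp1-* files
print(day2)  # only timestamp2-* files: day 1 was wiped despite the new prefix
```

In this model a third run with `SinkMode.KEEP` raises `FileExistsError`, because the directory already exists even though the new file names would not collide.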

my guess is you will have to change your Tap that uses the new OutputFormat to change this behavior.
you could probably extend Hfs tap and disable some checks. from a quick scan overriding the resourceExists method might work...


 

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/cascading-user/-/Fb5YD4cowbcJ.
To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.

Ken Krugler

Dec 20, 2012, 6:03:16 PM
to cascadi...@googlegroups.com
Hi Andrew,

Normally you'd specify separate <timestamp> directories in S3, and use these as the destination path for your job.

That avoids the issues you've run into, and means you don't have to use a special OutputFormat.
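
As a rough illustration of that layout, the sketch below builds a per-run destination path. The helper name `run_output_path` is hypothetical, and the bucket and base path are simply the ones used earlier in the thread.

```python
import posixpath


def run_output_path(base, run_timestamp):
    """Return a per-run destination directory under a shared base path."""
    # Strip any trailing slash so the join never produces a double slash.
    return posixpath.join(base.rstrip("/"), str(run_timestamp))


base = "s3://mybucket/my_path"
print(run_output_path(base, 1355970777))
# s3://mybucket/my_path/1355970777
```

Because every run gets its own directory, the sink never exists beforehand, so the default part-file names can stay as they are. Downstream jobs would then need to read across the per-run directories, for example with a glob over the base path.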

-- Ken





Andrew Xue

Dec 21, 2012, 4:01:26 PM
to cascadi...@googlegroups.com
Hi Ken --

so the reason I am not putting the files into a <timestamp> directory structure is that I would like to consolidate the files later in place. (I use backtype's Consolidator: https://github.com/nathanmarz/dfs-datastores/blob/master/src/jvm/backtype/hadoop/Consolidator.java)

this job tends to create a lot of tiny files 

so, the way i originally envisioned how this would work: 

the daily run of the job spits out a bunch of tiny files; then i consolidate them in place (the output of the consolidate job is fewer, bigger files with a random UUID in the name).

next day the job spits out another bunch of tiny files which shouldn't overwrite any previous files (even if consolidate failed the previous day), then consolidate, rinse, repeat.
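
The consolidate-in-place cycle described above could look roughly like the following. This is a local-filesystem sketch only and is not how backtype's Consolidator is implemented: the `consolidate` function, the `.merged` suffix, the size threshold, and the hidden temporary name are all assumptions made for illustration.

```python
import os
import tempfile
import uuid


def consolidate(directory, max_bytes):
    """Merge small files in `directory` into fewer files of at most max_bytes each."""
    names = sorted(
        n for n in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, n)) and not n.startswith(".")
    )
    # Group files greedily into batches whose total size fits under max_bytes.
    batches, current, size = [], [], 0
    for name in names:
        n_bytes = os.path.getsize(os.path.join(directory, name))
        if current and size + n_bytes > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(name)
        size += n_bytes
    if current:
        batches.append(current)

    created = []
    for batch in batches:
        if len(batch) < 2:
            continue  # a single file has nothing to merge with
        target = f"{uuid.uuid4()}.merged"
        hidden = os.path.join(directory, "." + target)
        # Write under a hidden name first, so a crash never leaves a partial
        # merged file that looks like finished output.
        with open(hidden, "wb") as out:
            for name in batch:
                with open(os.path.join(directory, name), "rb") as src:
                    out.write(src.read())
        os.rename(hidden, os.path.join(directory, target))
        # Delete the originals only after the merged file is in place.
        for name in batch:
            os.remove(os.path.join(directory, name))
        created.append(target)
    return created


demo_dir = tempfile.mkdtemp()
for prefix in ("timestamp1", "timestamp2"):
    for i in range(3):
        with open(os.path.join(demo_dir, f"{prefix}-{i:05d}"), "wb") as f:
            f.write(b"data\n")  # 5 bytes per tiny file

created = consolidate(demo_dir, max_bytes=15)
print(len(created))  # 2 merged files, each built from three 5-byte inputs
```

Since the source files are removed only after the merged file has been renamed into place, a failed run leaves the tiny files untouched and the next run can pick them up again. Note that this only helps if the daily job itself does not wipe the directory first.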

i feel like this problem of accumulating data in one place while keeping the files reasonably sized must be pretty common -- how are people solving it?

also, it's too much to hope that simply using the UPDATE sink mode would fix everything, huh?