How to set up an S3 source to ignore historical files


Alfredo Gomez

Dec 4, 2017, 1:06:43 PM
to sdc-user
Hello there,

I am trying to put some processes into production, but I have run into something that complicates it. How can I set up the S3 sources so that they ignore the historical files in their locations and only pick up new files, or files modified after a certain date?

Thanks!

Adam Kunicki

Dec 4, 2017, 1:12:51 PM
to Alfredo Gomez, sdc-user
By default, files are read in ascending lexicographic order, but they can also be read in order of modified time. Lexicographic ordering is useful when your file names conform to a pattern such as 2017-12-01-xyz, 2017-12-02-xyz, and so on.

Please choose one of the two options in the following setting:
Screenshot 2017-12-04 10.11.09.png
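
As a rough illustration of the two read orders described above, the following Python sketch sorts a few made-up object listings first by key (lexicographic) and then by last-modified time. The keys and timestamps are invented for the example, and this is not a description of how SDC implements the origin internally.

```python
from datetime import datetime, timezone

# Hypothetical (key, last-modified) pairs; names and times are invented.
objects = [
    ("2017-12-02-xyz", datetime(2017, 12, 2, 8, 0, tzinfo=timezone.utc)),
    ("2017-12-01-xyz", datetime(2017, 12, 3, 9, 30, tzinfo=timezone.utc)),  # re-uploaded later
    ("2017-11-30-xyz", datetime(2017, 11, 30, 7, 15, tzinfo=timezone.utc)),
]

# Lexicographic order: plain ascending string comparison on the key.
by_key = [key for key, _ in sorted(objects, key=lambda o: o[0])]

# Timestamp order: ascending by last-modified time.
by_mtime = [key for key, _ in sorted(objects, key=lambda o: o[1])]

print(by_key)
print(by_mtime)
```

The two orders differ only when a file's name and its modification time disagree, as with the re-uploaded file in this example. Neither order skips anything by itself; both still start from the first file in the chosen order.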

--
You received this message because you are subscribed to the Google Groups "sdc-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sdc-user+u...@streamsets.com.
Visit this group at https://groups.google.com/a/streamsets.com/group/sdc-user/.

Alfredo Gomez

Dec 4, 2017, 2:28:58 PM
to sdc-user
Hello Adam, thanks for the answer.

But this option only changes the order of processing. When I launch the pipeline with this order, it still processes all the historical files. How can I avoid those?

This is the order I want indeed, but I want to avoid processing thousands of files already available in the bucket.

Thanks again!

Adam Kunicki

Dec 4, 2017, 2:44:29 PM
to Alfredo Gomez, sdc-user
Ah, understood. I don't think there's currently a config to avoid this. You could start the pipeline initially, stop it, and then manually edit the pipeline's corresponding offset.json in the $SDC_DATA directory to reflect the position you want. (I believe part of the offset includes the last filename processed.)
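
A minimal sketch of the manual edit suggested above, in Python. The thread does not show the layout of offset.json, so the sketch assumes nothing beyond the file being JSON: it backs the file up, then replaces one existing value at a caller-supplied key path. The helper `set_offset_value` and the key path in the usage note are hypothetical; inspect your own stopped pipeline's file to find the real keys and value format, and only edit while the pipeline is stopped.

```python
import json
import shutil
from pathlib import Path


def set_offset_value(offset_file, key_path, new_value):
    """Back up a JSON offset file, then overwrite the value at key_path.

    key_path is a list of keys leading to the value to replace. The real
    key names are not given in the thread; look them up in your own file.
    """
    offset_file = Path(offset_file)
    backup = offset_file.with_name(offset_file.name + ".bak")
    shutil.copy2(offset_file, backup)  # keep a copy so the edit can be undone

    data = json.loads(offset_file.read_text())
    node = data
    for key in key_path[:-1]:
        node = node[key]  # raises KeyError if the path does not exist
    if key_path[-1] not in node:
        raise KeyError(f"{key_path[-1]!r} not found; refusing to add a new key")
    node[key_path[-1]] = new_value

    offset_file.write_text(json.dumps(data, indent=2))
    return backup
```

A call would look like `set_offset_value(path_to_offset_json, ["outer", "inner"], "new-value")`, where the path, both keys, and the value are placeholders. Refusing to create new keys is a deliberate choice here, so that a mistyped key path fails loudly instead of silently leaving the real offset unchanged.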

Alfredo Gomez

Dec 4, 2017, 3:09:54 PM
to sdc-user
Oh, good idea, I am going to check that. I will let you know if that does the trick, thanks!



Alfredo Gomez

Dec 4, 2017, 3:45:10 PM
to sdc-user
Hello Adam, your solution worked perfectly. In fact, you do not really need to put a "real" filename in the offset.json file; just adding the epoch you want to start from is enough.
Thanks a lot!
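
For anyone trying the epoch approach described above, here is a small Python sketch for turning a cutoff date into an epoch value. The thread does not say whether seconds or milliseconds are expected, nor where in the offset the value goes, so the sketch returns both units and leaves the placement to you. The helper name `epoch_for` and the cutoff date are made up for illustration.

```python
from datetime import datetime, timezone


def epoch_for(year, month, day, hour=0, minute=0, second=0):
    """Return (seconds, milliseconds) since the Unix epoch for a UTC time."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    seconds = int(moment.timestamp())
    return seconds, seconds * 1000


# Example cutoff: midnight UTC on the day this thread was posted.
seconds, millis = epoch_for(2017, 12, 4)
print(seconds, millis)
```

The time is built explicitly in UTC so the result does not depend on the local time zone of the machine running the script. Compare the number of digits with the value already present in your offset.json to decide which unit to use.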