How to read unix directory for same filename without resetting Origin

KarthikJayRaj

Apr 20, 2021, 2:39:26 AM
to sdc-user
Hi,

Is it possible to configure the Directory origin so that the pipeline triggers when a file arrives whose name has already been processed in the past?

We do not want to reset the origin and restart the pipeline each time, because this will be an automated batch process.

Thanks,
J

Sathish Reddy

Apr 20, 2021, 3:23:27 AM
to KarthikJayRaj, sdc-user

KarthikJayRaj

Apr 20, 2021, 3:27:58 AM
to Sathish Reddy, sdc-user
Thanks, I'll check it out. 
Thanks,
J

KarthikJayRaj

Apr 20, 2021, 3:38:21 AM
to Sathish Reddy, sdc-user
We have a source directory, /incoming/, where we expect files to arrive.
Per the requirement, it is possible that a file with the same name arrives again in the future.
When we used the Directory origin, it did not detect the arrival of the same file name a second time.

Thanks,
J

Alejandro Abdelnur

Apr 20, 2021, 4:03:19 AM
to KarthikJayRaj, Sathish Reddy, sdc-user
Karthik,

Have you tried using Last Modified Timestamp ordering? See https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Origins/Directory.html, and make sure to read the Tip in the documentation.

thx



--
Alejandro

KarthikJayRaj

Apr 20, 2021, 4:18:11 AM
to Alejandro Abdelnur, Sathish Reddy, sdc-user
Thanks for the tip.
However, a change in the timestamp does not help, because the pipeline reads based on the file name.
If a file abc.txt has already been processed, it does not get picked up again, regardless of any later modifications.

Thanks,
Jay(Karthik)

Alejandro Abdelnur

Apr 20, 2021, 5:17:21 AM
to KarthikJayRaj, Sathish Reddy, sdc-user
Karthik,

The Directory origin assumes that files show up atomically and ready for ingestion (fully written and closed). This is typically achieved by a move, or by writing to a temp file and then renaming it when done (a rename is also a move).
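
For illustration only, a minimal Python sketch of the write-then-rename pattern on the producer side. The function name `publish_atomically`, the `.part` suffix, and the separate staging directory are assumptions made for this example and are not part of StreamSets; the staging directory is assumed to be on the same filesystem as the watched directory so that the rename is a single atomic step.

```python
import os


def publish_atomically(data: bytes, staging_dir: str, incoming_dir: str,
                       final_name: str) -> str:
    """Write data in a staging directory, then rename it into incoming_dir.

    Hypothetical helper: staging_dir must be on the same filesystem as
    incoming_dir, otherwise os.rename cannot move the file atomically.
    """
    tmp_path = os.path.join(staging_dir, final_name + ".part")
    final_path = os.path.join(incoming_dir, final_name)

    # Write and close the file completely before it becomes visible
    # in the directory that the pipeline watches.
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # The rename makes the fully written file appear in one step.
    os.rename(tmp_path, final_path)
    return final_path
```

With this approach a reader of the watched directory never sees a half-written file, because the file only appears there after it has been closed.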

If you are receiving the same file more than once, you could end up processing the file at the moment it is 'replaced' by the new one, and that would lead to corrupted processing.

Avoiding this scenario is not easy because your pipeline could be stopped for some reason.

If you append the upload timestamp to your file names, you will have unique file names and things will work as expected.
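
As a rough sketch of that suggestion, the upload step could rename each file on delivery. The helper names `timestamped_name` and `deliver`, the UTC timestamp format, and the placement of the timestamp before the extension are all assumptions for this example; the source file is assumed to be on the same filesystem as the incoming directory.

```python
import os
import time


def timestamped_name(filename: str, epoch_seconds: float) -> str:
    """Return filename with a UTC timestamp inserted before the extension."""
    stem, ext = os.path.splitext(filename)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(epoch_seconds))
    return f"{stem}_{stamp}{ext}"


def deliver(src_path: str, incoming_dir: str, now: float = None) -> str:
    """Move src_path into incoming_dir under a timestamped, unique name."""
    if now is None:
        now = time.time()
    target = os.path.join(
        incoming_dir, timestamped_name(os.path.basename(src_path), now)
    )
    # Rename within one filesystem, so the file appears fully written.
    os.rename(src_path, target)
    return target
```

For example, `timestamped_name("abc.txt", 0)` gives `abc_19700101T000000Z.txt`. The timestamp has one-second resolution, so two uploads of the same name within the same second would still collide; a finer-grained stamp or a counter would be needed in that case.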

thx
--
Alejandro

KarthikJayRaj

Apr 20, 2021, 5:21:05 AM
to Alejandro Abdelnur, Sathish Reddy, sdc-user
Understood, thanks
--
Thanks,
Jay(Karthik)