How can I check for updates on parameterized arbitrary command output?


Ant Super

Sep 1, 2017, 8:39:38 AM
to python-doit
I'd like to check for changes in HDFS file timestamps (which doit does not do yet).

So I thought I could create a dependency on the output of a command-line command that yields the timestamp of a file.
I'd like to use that on different files (hence the file name should not be hard-coded).

What is the recommended way to do that?

I've seen the `result_dep` example with `version` but how can I check for different HDFS files at multiple places in the pipeline?
Something like `check_update_hdfs_file('...')`

Jan Vlčinský (gmail)

Sep 3, 2017, 4:27:19 AM
to 'Ant Super' via python-doit

If you mean the plain file timestamp as seen by the OS, doit already provides tooling for it:

Dependencies & Targets - basic introduction. Covers tracking a file's MD5 signature as the sign that the file was modified.

More on dependencies - explains custom functions usable in the "uptodate" field to decide whether the file/task is considered up to date.

check_timestamp_unchanged() - an example for a file "foo" that checks whether its timestamp has changed.

But if you mean timestamps stored within the HDFS file, the options are:

  • hope that it modifies the file's MD5 signature, so that doit sees the file as modified
  • implement your own function that checks the value, and use it as one of the "uptodate" functions.
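
A minimal sketch of the second option, under the assumption that doit calls an "uptodate" callable with the task and the values saved on its last successful run, and that a saver registered through `task.value_savers` is stored in doit's own database. The helper name `command_output_unchanged`, the key format, and the example command are made up for illustration; any command that prints the timestamp could be passed in.

```python
import subprocess
import sys


def command_output_unchanged(cmd):
    """Build an uptodate check that compares a command's stdout between runs.

    `cmd` is an argument list, e.g. a command that prints a file timestamp.
    Hypothetical helper, not part of doit.
    """
    key = "cmd-output:" + " ".join(cmd)

    def check(task, values):
        current = subprocess.run(
            cmd, check=True, capture_output=True, text=True
        ).stdout.strip()
        # Ask doit to store the current output once the task succeeds.
        task.value_savers.append(lambda: {key: current})
        # With no stored value the task has never run, so it is not up to date.
        return values.get(key) == current

    return check


def task_report():
    """Example task; the command is a stand-in for a real timestamp query."""
    cmd = [sys.executable, "-c", "print('2017-09-01 08:39')"]
    return {
        "actions": ["echo rebuilding report"],
        "uptodate": [command_output_unchanged(cmd)],
    }
```

Because the command is a parameter, the same helper can be reused for different files at several places in a pipeline without storing previous values by hand.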

It should even be possible to treat the HDFS file as a couple of independent virtual files, each updated by doit one by one (including the evaluation of whether it needs updating). This would require a custom function to determine the HDFS file state, and removing the task dependency on the HDFS file itself (as it would change with an update of any of its sections and trigger updates that are sometimes unnecessary).

With best regards

Jan

--
You received this message because you are subscribed to the Google Groups "python-doit" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-doit...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ant Super

Sep 4, 2017, 4:00:14 AM
to python-doit
I indeed have HDFS files which I'd like to judge by timestamps (they are too large for a proper MD5).

How can I use "uptodate" with the internal database of doit (so that I do not have to store previous timestamps myself)?

As I understand, "result_dep" would do that for me, but that does not allow me to parametrize it for arbitrary filenames.

Jan Vlčinský (gmail)

Sep 4, 2017, 4:43:50 AM
to 'Ant Super' via python-doit

I am not sure whether you can stop doit from using MD5 checks on files listed as file dependencies (this is a question for Eduardo Schettino).

My guess is that currently all files mentioned in any "file_dep" are checked using MD5, so an additional check would only add one more check.

A workaround could be a small file that records the size and modification time (or an MD5) of the HDFS file, kept as the representative of the HDFS file for doit. If your action, or fixed set of actions, on the HDFS file works as one unit and always modifies the representative file, it could work well.
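
The representative-file idea could look roughly like this. The function name `write_stamp` and the fields stored in the stamp are assumptions for illustration; how the size and modification time are obtained from HDFS is left open.

```python
import json


def write_stamp(stamp_path, info):
    """Write `info` to `stamp_path` only when it differs from what is stored.

    Returns True if the stamp file was (re)written, False if it was left alone.
    """
    text = json.dumps(info, sort_keys=True)
    try:
        with open(stamp_path, encoding="utf-8") as fh:
            if fh.read() == text:
                # Unchanged: keep the file, and therefore its MD5, as it is.
                return False
    except FileNotFoundError:
        pass  # first run, the stamp does not exist yet
    with open(stamp_path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return True
```

The small stamp file would then be listed in "file_dep" in place of the HDFS file, with a task that refreshes the stamp scheduled before the tasks that depend on it. The MD5 check stays cheap because only the stamp is hashed.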

doit also allows for some calculated task results, which is an alternative. See calculated-dependencies; it might be exactly what you need.

Regarding MD5 and the huge HDFS file size: is this really a problem? How much time does it take to calculate the MD5? Is your code really time-constrained? I can imagine that if you accept your task running 10 seconds longer, you can keep your code simple and clean (and with a higher chance of producing proper results). But you know your constraints best.

Jan Vlčinský

Ant Super

Sep 4, 2017, 5:10:15 AM
to python-doit
Hi Jan,

thanks for all the explanation. I see the options, but ideally I would really be writing a custom solution to check if something is up-to-date (I have huge HDFS files, I want the flexibility to have control over the uptodate check, ...). I'm currently digging through the source code, but it's hard to understand which parts I should use.

What part should I hook in (class, method, ...) to have my personal `uptodate` function (with a filename parameter that I can set by `partial`), but also have access to the doit internal database, so that I can store my own file / updateinfo pair?

Regards
Anton

Jan Vlčinský (gmail)

Sep 4, 2017, 5:15:27 AM
to 'Ant Super' via python-doit

Hi Anton

It would be best to ignore doit internals (what it does is not trivial) and use what it already offers. Try the calculated-dependencies example. Even if it is not clear at first sight, working through the existing examples and tutorials is better than modifying doit's code. The design is nice and applicable to many situations. The other option is to mess with code whose design you do not understand well.

Jan

Simon Conseil

Sep 4, 2017, 10:16:46 AM
to pytho...@googlegroups.com
Hi,

The default is to use MD5, but it is also possible to use timestamps:
http://pydoit.org/cmd_run.html#check-file-uptodate
or you can write a custom check_file_uptodate; see the docs.

Simon
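
A small configuration sketch for the timestamp option, assuming the setting described on the linked page can also be given in `DOIT_CONFIG` under the same name as the command-line option. It would apply to ordinary files listed in "file_dep"; whether it helps with HDFS paths depends on how those files are visible to the local file system.

```python
# dodo.py
DOIT_CONFIG = {
    # Compare file_dep entries by modification time instead of MD5.
    "check_file_uptodate": "timestamp",
}
```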



Eduardo Schettino

Sep 4, 2017, 4:56:56 PM
to python-doit

On Mon, Sep 4, 2017 at 12:43 PM, Jan Vlčinský (gmail) <jan.vl...@gmail.com> wrote:

> I am not sure whether you can stop doit from using MD5 checks on files listed as file dependencies (this is a question for Eduardo Schettino).
>
> My guess is that currently all files mentioned in any "file_dep" are checked using MD5, so an additional check would only add one more check.


Yes, if you have a custom up-to-date check for a resource (file) you should NOT include it as a file_dep (because doit will check for MD5).

It would be nice to support something like `my_custom_checker("my_file_or_resource")` that would replace the MD5 check if anyone wants to implement that :)

I am not familiar with HDFS but I guess the implementation should be something like: http://pydoit.org/uptodate.html#check-timestamp-unchanged

regards
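
One way the suggestion above might be sketched for HDFS. Everything named here is an assumption for illustration: the class `HdfsTimestampUnchanged`, the helper `hdfs_mtime_ms`, the example paths, that the `hdfs` CLI is on PATH, and that `hdfs dfs -stat %Y` prints the modification time in milliseconds since the epoch. The check relies on doit passing `task` and `values` to the callable and persisting whatever is registered in `task.value_savers`.

```python
import subprocess


def hdfs_mtime_ms(path):
    """Return the HDFS modification time of `path` in milliseconds.

    Assumes the `hdfs` CLI is available and that `-stat %Y` prints the
    modification time as milliseconds since the epoch.
    """
    out = subprocess.run(
        ["hdfs", "dfs", "-stat", "%Y", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip())


class HdfsTimestampUnchanged:
    """Uptodate check: True while the HDFS mtime of `path` is unchanged."""

    def __init__(self, path, get_mtime=hdfs_mtime_ms):
        self.path = path
        self.get_mtime = get_mtime  # injectable, e.g. for tests
        self.key = "hdfs-mtime:" + path

    def __call__(self, task, values):
        current = self.get_mtime(self.path)
        # doit stores this value after the task succeeds.
        task.value_savers.append(lambda: {self.key: current})
        return values.get(self.key) == current


def task_aggregate():
    return {
        "actions": ["echo aggregating"],
        # The HDFS path is deliberately not listed in file_dep.
        "uptodate": [HdfsTimestampUnchanged("/data/raw/events.parquet")],
    }


def task_export():
    return {
        "actions": ["echo exporting"],
        "uptodate": [HdfsTimestampUnchanged("/data/raw/users.parquet")],
    }
```

Each task gets its own instance, so the file name is a parameter and the previous timestamp is kept per task by doit instead of in a hand-rolled store.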
