Code not able to execute subtasks in parallel.


Sandhya Tiwari

Oct 10, 2017, 3:44:51 AM
to python-doit
Hi,

I am fairly new to doit, so do forgive my ignorance. I was previously able to run multiple tasks of the same type iteratively, but now, when I try to run them on a new set of data in a completely different directory, doit exits with an error instead of running the tasks. I tested with the "hello world" example, which ran perfectly fine.
My code is very similar to the following:

import os

def task_1():
        def run_task1(a,b,targets):
                os.system("programmename -i %s -o %s  > %s"%(a,b,targets[0]))
                return

        LIST=open("list_of_input_filenames.txt",'r')
        LIST=LIST.readlines()
        for a in LIST:
                a=a.rstrip()
                for b in LIST:
                        b=b.rstrip()
                        if os.path.isfile("output/output_%s_%s.out"%(a.split("/")[-1].split("_")[1],b.split("/")[-1].split("_")[1])) == False:
                                yield {'name':"task1_%s_%s"%(a.split("/")[-1].split("_")[1],b.split("/")[-1].split("_")[1]),
                                       'actions':[(run_task1,[a,b])],
                                       'file_dep':[a,b],
                                       'targets':["output/output_%s_%s.out"%(a.split("/")[-1].split("_")[1],b.split("/")[-1].split("_")[1])]
                                       }

I have a few questions:

a) Does running it in a "screen" session matter?
b) Does anyone have experience running it again and again in a folder after killing a job and removing "__pycache__"?
c) I am unable to execute my tasks with or without "-n". Given the number of tasks I have to run, I would like to be able to use it. Any tips on that?

Cheers,
Sandhya

Jan Vlčinský (gmail)

Oct 11, 2017, 6:17:12 AM
to pytho...@googlegroups.com, Sandhya Tiwari

Hi Sandya

I took the liberty to refactor your dodo.py file, see: https://gist.github.com/vlcinsky/fb33a5d9b0fa67c39a54de7d20d4789f

Things changed:

  • formatted according to PEP 8, to make reading more pleasant
  • used some Python idioms:
    • opening the file: "r" is the default mode, so I skipped it, and a context manager takes care of closing the file
    • reading file names from the file: a list comprehension that filters out empty lines and strips whitespace inside the comprehension, saving you the a.rstrip() and b.rstrip() calls
    • repeatedly used calculated values extracted to the variables a_name_tail and b_name_tail
    • to get the file name without its directory: used the os.path.basename function. Another option would be pathlib, which is naturally supported by doit, but I skipped that here.
  • used doit idioms and constructs:
    • "name" does not use the "task1" prefix, as that already comes from the "basename". Try $ doit list --all to see it.
    • not testing whether the target file exists: doit will not touch the target if it already exists and the input files (in file_dep) have not changed.
    • since you ultimately run a shell command, changed the python-action to a cmd-action. run_task1 is renamed to create_cmd_string and used when building the dictionary to yield.
    • added "clean": True. If you want to delete output files, just run $ doit clean. You may even clean targets created by a particular subtask, e.g. $ doit clean one:onetail_twotail (assuming such a task is generated, which depends on your "list_of_input_filenames.txt")
  • changed "programmename" to "echo" to have something I could really test locally.
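Based on the points above, the refactored dodo.py might look roughly like this. This is only a sketch, not the gist's actual content: the task name "one" is inferred from the $ doit clean one:onetail_twotail example, and "echo" stands in for the real program, as in Jan's test.

```python
import os


def create_cmd_string(a, b, target):
    # build the shell command for one input pair (a cmd-action, not a python-action)
    return "echo -i %s -o %s > %s" % (a, b, target)


def task_one():
    # read input file names, stripping whitespace and skipping empty lines
    with open("list_of_input_filenames.txt") as fin:
        names = [line.strip() for line in fin if line.strip()]
    for a in names:
        a_name_tail = os.path.basename(a).split("_")[1]
        for b in names:
            b_name_tail = os.path.basename(b).split("_")[1]
            target = "output/output_%s_%s.out" % (a_name_tail, b_name_tail)
            yield {
                'name': "%s_%s" % (a_name_tail, b_name_tail),
                'actions': [create_cmd_string(a, b, target)],
                'file_dep': [a, b],
                'targets': [target],
                'clean': True,  # enables "doit clean" for these targets
            }
```

There is no need to test for an existing target here: the yield always happens, and doit itself skips any sub-task whose target exists and whose file_dep entries are unchanged.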

Some prerequisites which may cause the process to fail:

  • the target directory "output" must exist.
  • all files mentioned in "list_of_input_filenames.txt" must exist, as they are declared as dependencies in "file_dep".
  • if your "programmename" does not allow concurrent processing (either globally, or on the same input file), parallel processing may fail as the command crashes or complains.

Regarding your questions:

  • screen should not matter.
  • killing the job: my test with killing the job shows that whatever has already finished is not regenerated, as the result is already up to date (this is a great feature of doit), and the next run continues without requiring you to clean anything like __pycache__. I did not test this with parallel processing, but I expect the same result.
  • using "-n" works for me. Hint: to play with it, make the tasks run longer by adding "sleep 10" as a second command in each task.
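The "sleep 10" hint might be sketched like this (task and action names are invented for illustration): each sub-task gets a second cmd-action that just sleeps, so with -n you can watch several of them running at once.

```python
def task_slow_demo():
    # three sub-tasks that each take ~10 seconds, so "doit -n 3 slow_demo"
    # visibly runs them in parallel while "doit slow_demo" runs them one by one
    for n in ("one", "two", "three"):
        yield {
            'name': n,
            # the second command slows each sub-task down so parallelism is observable
            'actions': ['echo %s' % n, 'sleep 10'],
        }
```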

Btw, I think your use case is a really good one for doit. It should bring you advantages such as:

  • easy to generate many tasks automatically (as you do by combining filenames from list_of_input_filenames.txt)
  • easy to start
  • can speed things up by not regenerating outputs which are already up to date
  • allows killing the process and restarting - it will continue, skipping the tasks that are already up to date
  • can speed things up by parallel processing (e.g. -n 10)
  • simple to clean the outputs

Enjoy python and doit

With best regards

Jan Vlčinský

--
You received this message because you are subscribed to the Google Groups "python-doit" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-doit...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sandhya Tiwari

Oct 24, 2017, 5:16:23 AM
to python-doit
Thanks Jan! That worked :)

-Sandhya

Sandhya Tiwari

Dec 4, 2017, 5:32:02 AM
to python-doit
Hi again,

I have another issue related to this topic of running subtasks in parallel. I used the gen_many_tasks() example to create a series of dependent tasks within a loop, as described before. I can execute the script with "doit -f filename -n 100", but it runs like a regular script without printing the task names: I just get the standard output of the other programs being used, and existing files are overwritten. I have uploaded the code here:
https://gist.github.com/sandhyatiwari/80c95933c6caba4d890d6a71068c9b38

Please help!

Thanks,
Sandhya

Jan Vlčinský

Dec 5, 2017, 2:53:25 AM
to Sandhya Tiwari, 'Ant Super' via python-doit

Hi Sandhya (sorry for misspelling your name last time).

If you do not set the basename explicitly in the dictionary describing the task, it is derived from the name of the generating function; e.g. for a function "task_all" the basename will be "all".

The dictionary describing the task can set the attribute "basename" explicitly.
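A minimal illustration of that attribute (task and action names are made up): the function name alone would yield the basename "all", but the explicit "basename" key overrides it.

```python
def task_all():
    # without 'basename', these sub-tasks would appear as all:alfa and all:beta
    for n in ("alfa", "beta"):
        yield {
            'basename': 'process',  # explicit basename: tasks show as process:alfa, process:beta
            'name': n,
            'actions': ['echo %s' % n],
        }
```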

Regarding PEP 8: the code will run regardless of the formatting, but you as a programmer will have more joy from coding if the code is well formatted. I used to code in Lisp, and adopting a proper coding style increased my productivity at least 4 times :-).

There are weapons to kill old code-formatting habits, such as your editor's built-in code formatting. I use vim and am currently moving to emacs; both support a great formatting tool called yapf. After I integrated it into my editor, formatting my code became a no-brain activity (exactly as presented in the "End the Holy Wars of Formatting" talk by Paul Bailey).

With best regards

Jan


On 5.12.2017 04:39, Sandhya Tiwari wrote:
Hi Jan,

Thanks for the quick reply! I will attempt to pep8 format the code, old bad habits die hard.

I tried '$ doit list --all' and saw that no tasks were printed; instead, I only get the stdout of the programs I am using within my functions. Somehow the basename doesn't register?

Re. '-n 100', I usually put a much lower number, I have quite a few cores on my workstation so I am always curious to see how many tasks run in parallel.

Thanks,
Sandhya

On 4 December 2017 at 23:08, Jan Vlčinský <jan.vl...@gmail.com> wrote:

Hi Sandya

Nice that you have provided a code snippet. However, to run it I am missing the file "B_files.out", and I got a bit confused by the 8-character indentation (it would be great if you could PEP 8-format the code).


From what I can guess from reading your code, I would recommend running `$ doit list --all` to see what task names are planned (feel free to post that list of task names here).

Take into account that doit does parallel processing only for tasks sharing the same basename, so if your list of tasks looks like:

baseA:alfa
baseA:beta
baseA:delta

these tasks can run in parallel.


However, tasks such as:

baseA_alfa
baseA_beta
baseA_delta

do not share a common basename and thus cannot run in parallel. If you call `$ doit -n 10`, it will simply ignore those tasks.
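The parallelizable shape above comes from a single generator yielding sub-tasks (a sketch; the names baseA, alfa, beta, delta follow the listing above):

```python
def task_baseA():
    # yields sub-tasks baseA:alfa, baseA:beta, baseA:delta -- one shared
    # basename, so "doit -n 3" can process them in parallel
    for n in ("alfa", "beta", "delta"):
        yield {'name': n, 'actions': ['sleep 1']}
```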


Another note: `-n 100` seems a rather high number. Usually the number relates to the number of cores on your CPU, or is 2 or 3 times higher, but not much more. I have 4 cores and use `-n 10` for fetching many files from the internet. If you expect a high number of subtasks, do not worry about `-n` matching it: if you have 100 subtasks and run only 4 in parallel, all the subtasks will be processed anyway.

With best

Jan


-- 
Jan Vlčinský
mob: +420-60897040
Slunečnicová 338/3
CZ-73401 Karviná-Ráj



--
"There is always some madness in love. But there is also always some reason in madness." - Nietszche
