Selecting required files for Input Dataunit in Pilot Data

Anjani Ragothaman

Feb 5, 2014, 12:49:18 PM
to bigjob...@googlegroups.com
Hello,

Is there a way to select only the required files in the Input Data Unit?

When there is a pipeline of tasks, where an output data unit of CU-A becomes the input data unit for CU-B and multiple jobs are submitted in a loop by the CUs, is there a way to place only the required files in the input data unit (as we do for the output data unit, by specifying only the files to be transferred)?

I tried doing this the same way we declare the output data unit, by specifying the required files. Although the script doesn't give any error, the application reports that the file required for execution is not found. I've pasted the code snippet that I tried for input_data, modeled on output_data.

Any suggestions would be very helpful.

*******************************************************************
            # "input_data": [self.input_data_unit.get_url()],
            "input_data": [
                {
                    self.input_data_unit.get_url(): [dataset + ".fasta"]
                }
            ],

            "output_data": [
                {
                    self.output_data_unit.get_url(): [
                        "stdout_" + dataset + "_chain_blast.txt",
                        "stderr_" + dataset + "_chain_blast.txt",
                        dataset + "-chain.csblast"
                    ]
                }
            ],


Thanks
Anjani

Mark Santcroos

Feb 5, 2014, 4:51:47 PM
to Anjani Ragothaman, bigjob...@googlegroups.com
Hi Anjani,

On 05 Feb 2014, at 18:49 , Anjani Ragothaman <ar...@scarletmail.rutgers.edu> wrote:
> Is there a way to select only the required files in the Input Data Unit?

What exactly do you mean by “select”?


> When there is a pipeline of tasks, where an output data unit of CU-A becomes the input data unit for CU-B and multiple jobs are submitted in a loop by the CUs, is there a way to place only the required files in the input data unit (as we do for the output data unit, by specifying only the files to be transferred)?

I’m guessing a bit at what you want to do, so please correct me if I’m wrong.

You say that the output of CU-A is a “mix” of certain types of data, of which some are required by CU-B, but others are not?
If so (and in general), you want to group files that have a similar dataflow together in a DU.
From this it follows that files with a different dataflow (e.g. not all of them are required by a subsequent CU) should be split over multiple DUs.

In your case, this would mean that you would put all files that are needed by CU-B in one DU, and “the rest” in another DU. You can then deal with these individually.
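
A minimal sketch of such a split, assuming purely for illustration that only the .csblast file is consumed by CU-B and that the stdout/stderr files are not. The helper names and the DU URLs are hypothetical, and it is an assumption that the "output_data" list accepts more than one DU-to-files mapping; the code only builds the description entry and does not call BigJob.

```python
def split_output_by_dataflow(dataset):
    """Group CU-A's output files by dataflow (hypothetical helper)."""
    # Assumption: only this file is needed by the following CU-B.
    needed_by_cu_b = [dataset + "-chain.csblast"]
    # Assumption: the remaining files are not read by CU-B.
    the_rest = [
        "stdout_" + dataset + "_chain_blast.txt",
        "stderr_" + dataset + "_chain_blast.txt",
    ]
    return needed_by_cu_b, the_rest


def cu_a_output_data(dataset, result_du_url, rest_du_url):
    """Build an "output_data" entry that routes each group to its own DU."""
    needed, rest = split_output_by_dataflow(dataset)
    return [
        {result_du_url: needed},  # DU that CU-B will take as input
        {rest_du_url: rest},      # DU for everything else
    ]


# Placeholder URLs; in a real script these would come from get_url().
output_data = cu_a_output_data("seq1", "du://results", "du://rest")
```

CU-B would then list only the first DU in its input_data, so the other files are never staged for it.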

Does that capture what you want to achieve?

Gr,

Mark

Anjani Ragothaman

Feb 5, 2014, 5:34:59 PM
to Mark Santcroos, saga-devel@googlegroups.com <saga-devel@googlegroups.com>, saga-users@googlegroups.com <saga-users@googlegroups.com>, bigjob-devel@googlegroups.com <bigjob-devel@googlegroups.com>, bigjob-users@googlegroups.com
Hi Mark,

On Wed, Feb 5, 2014 at 4:51 PM, Mark Santcroos <mark.sa...@rutgers.edu> wrote:
> You say that the output of CU-A is a “mix” of certain types of data, of which some are required by CU-B, but others are not?
> If so (and in general), you want to group files that have a similar dataflow together in a DU.
> From this it follows that files with a different dataflow (e.g. not all of them are required by a subsequent CU) should be split over multiple DUs.

The data flow is similar. The CUs run in a loop, for example over a range of 1 to 100. In each iteration a different input file is processed by the CUs, and as the iterations progress, more output files are added to the data unit. To execute the i-th CU-B, I need only the output of the i-th CU-A, not the rest of the output. So, is there a way to pass only the i-th CU-A's output to the i-th CU-B's input?

--
Best
Anjani

Mark Santcroos

Feb 5, 2014, 5:38:51 PM
to bigjob...@googlegroups.com
Hi,

On 05 Feb 2014, at 23:34 , Anjani Ragothaman <ar...@scarletmail.rutgers.edu> wrote:
> The data flow is similar. The CUs run in a loop, for example over a range of 1 to 100. In each iteration a different input file is processed by the CUs, and as the iterations progress, more output files are added to the data unit.

Ah, I don't think you should do that (or that it is even possible!).
Each CU will create a new output DU.

> Now, to execute i-th CU-B, I need only the output of i-th CU-A, I do not need the rest of the output. So, is there a way to get only the i-th CU-A's output to i-th CU-B's input?

Yes, see above.
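
A minimal sketch of that per-iteration pattern, under stated assumptions: the stub classes below stand in for the real compute-data service and data units (only get_url(), submit_data_unit() and submit_compute_unit() are modelled, mirroring the calls in the snippet earlier in the thread), and the executable names and stub URLs are hypothetical. Waiting for CU-A to finish before CU-B starts is left out.

```python
class StubDataUnit:
    """Stand-in for a data unit; only get_url() is modelled."""

    def __init__(self, url):
        self.url = url

    def get_url(self):
        return self.url


class StubComputeDataService:
    """Stand-in for the service; it only records what is submitted."""

    def __init__(self):
        self.compute_units = []
        self.du_count = 0

    def submit_data_unit(self, description):
        self.du_count += 1
        return StubDataUnit("du://stub/%d" % self.du_count)

    def submit_compute_unit(self, description):
        self.compute_units.append(description)
        return description


def submit_pipeline(service, datasets):
    """Give every iteration its own DU linking the i-th CU-A to the i-th CU-B."""
    links = []
    for dataset in datasets:
        # A fresh, empty output DU per CU-A, instead of one DU that grows.
        a_out = service.submit_data_unit({"file_urls": []})
        service.submit_compute_unit({
            "executable": "cu_a_tool",  # hypothetical executable name
            "arguments": [dataset + ".fasta"],
            "output_data": [{a_out.get_url(): [dataset + "-chain.csblast"]}],
        })
        service.submit_compute_unit({
            "executable": "cu_b_tool",  # hypothetical executable name
            "arguments": [dataset + "-chain.csblast"],
            # Only the i-th CU-A's DU is staged in, not all earlier outputs.
            "input_data": [a_out.get_url()],
        })
        links.append((dataset, a_out.get_url()))
    return links


service = StubComputeDataService()
links = submit_pipeline(service, ["seq1", "seq2", "seq3"])
```

Because each DU holds the output of exactly one CU-A, no file filter on input_data is needed.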

Gr,

Mark