Some options that do not work for running in parallel

Arang Rhie

Jan 6, 2014, 11:54:42 PM
to gotc...@googlegroups.com
Hi,


I'm trying GotCloud 1.11 on my low-coverage, whole genome population data.
Here are some options that do not seem to work well in the align step:

--out_dir . : Gives a silly error message; that is not initialized in bin/align.pl line 453, <IN> in line 30 and something else.
(this is not really important: I've found a workaround by just giving another directory name)

--numjobs 7 : I see only 2 processes running bwa.
BWA_THREADS in config file: This does not work on the example test file.

As I have several fastq files to handle, I really need to run multiple jobs at once.
Please give me some advice or any workarounds, especially for the last 2 options.


Thanks,
Arang Rhie

Hyun Min Kang

Jan 7, 2014, 12:16:12 AM
to Arang Rhie, gotc...@googlegroups.com
Hi Arang,

The reason you observe 2 processes is probably that you have only two samples. In the current gotCloud setting, the number of processes running in parallel cannot exceed the number of samples.

Thanks,
Hyun.



Arang Rhie

Jan 7, 2014, 2:01:19 AM
to gotc...@googlegroups.com, Arang Rhie
Hmm, that's interesting.

My sequence index file consists of 28 different individuals (samples).

MERGE_NAME and SAMPLE are the same, since each sample was sequenced only once, in one lane.

MERGE_NAME      FASTQ1  FASTQ2  RGID    SAMPLE  LIBRARY CENTER  PLATFORM
14-0274 /path/to/seq/s_1_1_sequence.txt.gz /path/to/seq/s_1_2_sequence.txt.gz 14-0274 14-0274 14-0274 GMI     ILLUMINA
14-0347 /path/to/seq/s_2_1_sequence.txt.gz /path/to/seq/s_2_2_sequence.txt.gz 14-0347 14-0347 14-0347 GMI     ILLUMINA
14-0348 /path/to/seq/s_3_1_sequence.txt.gz /path/to/seq/s_3_2_sequence.txt.gz 14-0348 14-0348 14-0348 GMI     ILLUMINA

These are the first 4 lines of my sequence index file.


On Tuesday, January 7, 2014 at 2:16:12 PM UTC+9, Hyun Min Kang wrote:

Hyun Min Kang

Jan 7, 2014, 4:28:00 AM
to Arang Rhie, gotc...@googlegroups.com

Ah, I see..

Mary Kate - doesn't gotCloud allow multisample parallelization? Should one run gotCloud on a sample-by-sample basis?

Thanks,
Hyun

(from a mobile phone, perhaps with typos and brevity)

Mary Kate Wing

Jan 7, 2014, 9:54:05 AM
to Hyun Min Kang, Arang Rhie, gotcloud
Based on the 4 lines of your fastq index file, it appears that each sample has just 2 fastqs.

The --numjobs option sets how many programs to run in parallel for a given sample.  The current version can run at most 1 BWA job per fastq.  Since each sample has only 2 fastqs, it will run at most two jobs per sample at a time.

To run multiple samples at a time you can use the --numcs option followed by the number of samples to run in parallel.

In the gotcloud aligner, the number of threads per sample and number of samples are done as 2 separate settings.  This was done because when we run on a cluster we send each sample out as a separate cluster make command with the --numjobs option passed to that make call to indicate how much to run in parallel for that sample.  

If you are running locally and only have 7 threads, then you could do: --numjobs 1 --numcs 7
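
As a rough illustration of how the two settings combine, here is a minimal sketch. The helper name max_bwa_jobs is hypothetical and not part of GotCloud; it only encodes the rule described above (at most 1 BWA job per fastq, multiplied by the number of samples run in parallel), and it assumes that one sample runs at a time when --numcs is not given.

```shell
#!/bin/sh
# Hypothetical helper (not part of GotCloud): estimate how many BWA jobs
# can run at once under the rules described in this thread.
#   $1 = value given to --numcs   (samples processed in parallel)
#   $2 = value given to --numjobs (jobs in parallel within one sample)
#   $3 = number of fastq files per sample
max_bwa_jobs() {
    numcs=$1
    numjobs=$2
    fastqs=$3
    # At most one BWA job per fastq, so --numjobs is capped by the fastq count.
    if [ "$numjobs" -lt "$fastqs" ]; then
        per_sample=$numjobs
    else
        per_sample=$fastqs
    fi
    echo $((numcs * per_sample))
}

max_bwa_jobs 1 7 2   # --numjobs 7 only, 2 fastqs per sample: prints 2
max_bwa_jobs 7 1 2   # --numjobs 1 --numcs 7: prints 7
```

The first call matches the 2 bwa processes reported at the start of the thread; the second matches the suggested local setting.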

I'm not sure I understand what you mean by:
  "BWA_THREADS in config file: This does not work on the example test file."

When I put the following in my configuration file: 
BWA_THREADS = -t 2

I see the "-t 2" option show up in the bwa aln call in my Makefile.log files.
Are you seeing something different?
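
One way to run this check yourself is sketched below. The helper name and the file-name pattern '*Makefile.log*' are assumptions, not something GotCloud defines; adjust the pattern to wherever your Makefile.log files actually are.

```shell
#!/bin/sh
# Hypothetical helper: list "bwa aln" calls that carry a -t option in any
# file under directory $1 whose name contains "Makefile.log".
# The name pattern is an assumption about the output layout.
bwa_aln_thread_calls() {
    find "$1" -type f -name '*Makefile.log*' -exec grep -H 'bwa aln' {} + \
        | grep -e ' -t [0-9]'
}

# Usage (the directory is a placeholder):
#   bwa_aln_thread_calls /path/to/out_dir
```

If nothing is printed, either no bwa aln call was logged in those files or none of the calls had a -t option.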

I know you found a workaround for the "--out_dir ." error, but I can't seem to recreate it.
I would like to try to fix this so it would work.  
Can you provide more information on this?  My line 453 of bin/align.pl has "push @{$mergeToFq1{$mergeName}}, $fastq1;"
Is that what you have too? 
  "--out_dir . : Gives a silly error message; that is not initialized in bin/align.pl line 453, <IN> in line 30 and something else."

Let me know if you still have questions or if I can better clarify parts of this.  
Also, let me know about the errors you are still seeing.  I would like to get them fixed.

Thanks,
Mary Kate

Hyun Min Kang

Jan 8, 2014, 6:31:00 AM
to Mary Kate Wing, Arang Rhie, gotcloud
Thanks Mary Kate!

Arang Rhie

Jan 10, 2014, 2:11:16 AM
to gotc...@googlegroups.com, Mary Kate Wing, Arang Rhie
Thanks Mary Kate!
--numcs option works perfectly fine. This is what I needed.

For the BWA_THREADS option, I saw the message below in the log from gotcloud align --test, but I haven't dug into it deeply.

It said:
Total space for BAM files will be about 0.00 GB
Be sure you have enough space to hold all this data
Created /path/to/testalign_t8/aligntest/Makefiles/align_Sample2.Makefile
Created /path/to/testalign_t8/aligntest/Makefiles/align_Sample3.Makefile
Created /path/to/testalign_t8/aligntest/Makefiles/align_Sample1.Makefile
---------------------------------------------------------------------
Waiting while samples are processed...
Processing finished in 78 secs with no errors reported
Results from DIFF will be in /path/to/testalign_t8/diff_logfiles_results.txt
/path/to/testalign_t8/aligntest/bams/Sample1.recal.bam.qemp does not match /path/to/tools/GotCloud/gotcloud-gotcloud.1.11/test/align/expected/aligntest/bams/Sample1.recal.bam.qemp. See mismatches in /path/to/testalign_t8/diff_logfiles_results.txt
Comparison failed, test case FAILED.
CMD=/path/to/tools/GotCloud/gotcloud-gotcloud.1.11/scripts/diff_results_align.sh /path/to/testalign_t8 /path/to/tools/GotCloud/gotcloud-gotcloud.1.11/test/align/expected


I think it's diff_logfiles_results.txt that matters; it looks as follows:
2a3
> RGID1,10,13,AC,1,1,0
5a7
> RGID1,10,13,CC,1,0,3
8a11
> RGID1,10,13,GC,1,0,3
12a16
> RGID1,10,13,TC,1,0,3
58a63
> RGID1,10,29,AC,3,1,3
60a66
> RGID1,10,29,AT,3,0,6
63a70
> RGID1,10,29,CC,4,0,7
65a73
> RGID1,10,29,GA,1,0,3
67a76
> RGID1,10,29,GT,1,1,0
68a78
> RGID1,10,29,TA,2,1,2
71,97d80
< RGID1,10,-29,TG,3,0,6
< RGID1,10,-29,TT,7,2,4
< RGID1,10,-36,AA,2,0,5
< RGID1,10,-36,CA,2,0,5
< RGID1,10,-36,CC,1,0,
...

But I see .OK files in the output directory. It seems like the order in the bam file (or something else?) is a bit different due to how the threaded results are collected.


The <IN> error that I saw on my samples was maybe due to the empty line at the end of my configuration file.
I saw an "align_.Makefile" with no sample name in it. I don't see any other problems now that the empty line is removed from the config file.
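
The workaround of removing the empty line can be sketched as below. The helper name strip_blank_lines is hypothetical; it is a plain text filter and makes no assumption about GotCloud beyond what is described here.

```shell
#!/bin/sh
# Hypothetical helper: copy file $1 to $2, dropping every line that is
# empty or contains only whitespace.
strip_blank_lines() {
    grep -v '^[[:space:]]*$' "$1" > "$2"
}

# Usage (file names are placeholders):
#   strip_blank_lines my.index my.clean.index
```

Writing to a new file rather than editing in place keeps the original available for comparison.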


BTW,
Why is RedHat or CentOS not supported?
Our server cluster is based on RedHat. Are there any problems that I might expect?

One more question: are the samples handled independently during the align step?
Or is recalibration applied to all samples together after alignment, sorting, and dedup?
I'm asking because I'm not sure whether I can run gotcloud align on partitioned subgroups of my samples and then run snpcall once alignment is done for every subgroup.
This time, instead of running all 28 samples at once, I tried 10 samples with --numcs 5 and the BWA_THREADS = -t 2 option.
If this run finishes with no error message, I will proceed with the remaining 18 samples.

Thanks again for your helpful advice.



On Wednesday, January 8, 2014 at 8:31:00 PM UTC+9, Hyun Min Kang wrote:

Mary Kate Wing

Jan 10, 2014, 10:48:29 AM
to Arang Rhie, gotcloud
Great, I'm glad that --numcs worked for you as you needed.

BWA_THREADS 
Did you run the test case without the BWA_THREADS option in the test configuration file?  I'm wondering if that also fails.  

It does sound like the test ran, but the results did not successfully validate.  It could be due to a different "sort" order between your computer and our Ubuntu computers.  This isn't a big deal.  But if you would like to get it to validate, I suggest running the following steps:

cd /path/to/tools/GotCloud/gotcloud-gotcloud.1.11/test/align/
mkdir origqemp
mv expected/aligntest/bams/Sample*.recal.bam.qemp origqemp/.
sort origqemp/Sample1.recal.bam.qemp > expected/aligntest/bams/Sample1.recal.bam.qemp
sort origqemp/Sample2.recal.bam.qemp > expected/aligntest/bams/Sample2.recal.bam.qemp
sort origqemp/Sample3.recal.bam.qemp > expected/aligntest/bams/Sample3.recal.bam.qemp

Then rerun the test validation:
/path/to/tools/GotCloud/gotcloud-gotcloud.1.11/scripts/diff_results_align.sh /path/to/testalign_t8 /path/to/tools/GotCloud/gotcloud-gotcloud.1.11/test/align/expected

Note: The origqemp directory I have created here is not in the expected results directory since the validation script checks for unexpected/missing files between the expected results directory and the output directory.
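
To check whether a mismatch is only a matter of line order, before changing the expected files, something like the following sketch could be used. The helper name same_lines_any_order is hypothetical. LC_ALL=C is set because the order produced by sort depends on the locale, which is one way two machines can sort the same lines differently.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if files $1 and $2 contain the
# same lines, ignoring line order. LC_ALL=C pins a byte-wise collation
# so both files are sorted the same way regardless of the system locale.
same_lines_any_order() {
    a=$(mktemp) && b=$(mktemp) || return 2
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    cmp -s "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}

# Usage (paths are placeholders):
#   same_lines_any_order out/Sample1.recal.bam.qemp expected/Sample1.recal.bam.qemp \
#       && echo "same content, different order at most"
```

If this check fails, the files differ in content and not just in ordering, so sorting the expected files would not make the test validate.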


<IN> error - ok.  Hmm, I should ignore leading whitespace in the index file and ignore blank/empty lines.  I will put that on my list of things to update.  Thanks.  Glad you found the workaround.


RedHat/CentOS may work, but we just haven't tested our software on them.  
We run Ubuntu here, so that is what it has been developed and tested on.
In principle, Ubuntu vs RedHat/CentOS shouldn't make any difference, but in my experience they do have minor differences.  The only issues I would anticipate would be compile/scripting problems.  Occasionally the different versions don't all have the same options.  It would probably be pretty obvious as you would see a failure.
I would like to get GotCloud up and running on RedHat/CentOS, but I don't necessarily have the physical machines to test it on.

So, if you are willing to give it a try, I am happy to help resolve any issues you encounter to get it running on those platforms.
I will incorporate any of these updates back into GotCloud so the next release would have these issues resolved.


SAMPLE HANDLING
The entire alignment pipeline handles samples independently.  The recalibration is done separately for each sample.
You will get the same results if you run all samples at once through the alignment pipeline, if you run them in subgroups, or if you run 1 instance of gotcloud for each sample.  

For most steps of snpcall, samples are processed together, so the grouping does matter and you will get different results if you run with different groups/subgroups.
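
For running the alignment in subgroups, the index file can be split into batches that each keep the header line. The sketch below is one way to do that; the helper name index_subset and the file names are hypothetical, and it assumes one data row per sample, as in the index file shown earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: write data rows FIRST..LAST of an index file
# (counted from 1, header excluded) to a new file, keeping the header.
#   $1 = source index, $2 = first row, $3 = last row, $4 = output file
index_subset() {
    src=$1
    first=$2
    last=$3
    dst=$4
    head -n 1 "$src" > "$dst"
    # Data row N is line N+1 of the file because of the header line.
    sed -n "$((first + 1)),$((last + 1))p" "$src" >> "$dst"
}

# Usage for a 10 + 18 split of 28 samples (file names are placeholders):
#   index_subset all.index 1 10 batch1.index
#   index_subset all.index 11 28 batch2.index
```

Each batch file can then be given to its own alignment run; snpcall would still be run once, on all samples together.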

Does that answer your question on sample handling?

Let me know if you have more questions.

Mary Kate