Cortex_var temporary files


lumull...@gmail.com

Apr 30, 2015, 5:01:17 PM
to corte...@googlegroups.com
Hi,

I am currently running Cortex_var on WGS data from 60 individuals (paired-end reads), using the run_calls wrapper. It has been running for an hour now, and this is the content of the log file:

Start time: Thu Apr 30 14:22:56 CEST 2015
*********** Command-line used was : *********************
perl /home/ludom/CORTEX_release_v1.0.5.21/scripts/calling/run_calls.pl --first_kmer 95 --last_kmer 223 --kmer_step 32 --auto_cleaning yes --bc yes --pd no --outdir /scratch2/ludom/cortexVariants --outvcf cortexVariants1 --ref Absent --ploidy 1 --stampy_bin /cm/shared/apps/stampy/1.0.23/stampy.py --fastaq_index cortexIndexNoSE.txt --qthresh 5 --dups --mem_height 26 --mem_width 100 --gt_assemblies no --genome_size 180000000 --vcftools_dir /home/usr/ludom/vcftools_0.1.12b/ --do_union yes --logfile cortexVariants1.log --workflow joint --apply_pop_classifier
*********************************************************

********************************************
Build uncleaned binaries:
********************************************

The "binaries" output folder is still empty, and I was wondering whether there is a way to tell that the program is running properly. Are there any temporary files that should be generated and visible?
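Or is checking with generic shell tools the best I can do? For example (standard process and filesystem checks, nothing cortex-specific; the path is the outdir from my command above):

pgrep -fl run_calls.pl                  # is the wrapper still alive?
top -b -n 1 | grep cortex_var           # is a cortex process using CPU?
ls -lR /scratch2/ludom/cortexVariants   # has any output appeared yet?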

Regards,

Ludo.

Zamin Iqbal

Apr 30, 2015, 5:07:11 PM
to corte...@googlegroups.com
Hi Ludo

You are looking in the right place, and I am surprised it is so slow. However, your choice of kmer is almost surely bad - I don't know what read length you have, but your lowest kmer is k=95, and that is very high. Although Cortex will indeed run with k=223, it will be slow - performance drops with kmer size as the amount of bit-shifting increases.
If you have

 - 454 data, then the error rate is too high to support a kmer that high, even though the reads will be significantly longer than 95bp

 - Illumina ~100-ish bp data - then by choosing k=95 you only detect overlaps between reads when the reads overlap almost entirely,
   so you drop your effective coverage hugely (see the sketch after this list)

 - Illumina 250bp data - I guess this might work if you have carefully error-corrected your reads
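To make the coverage point concrete (a back-of-the-envelope sketch; the read lengths are illustrative): a read of length L contributes L - k + 1 kmers, so

for L in 100 250; do
  for k in 31 95; do
    echo "read length $L, k=$k -> $((L - k + 1)) kmers per read"
  done
done

prints 6 kmers per read for 100bp reads at k=95, against 70 at k=31 - roughly a tenfold drop in effective kmer coverage.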

Can I suggest you try with k=31 only, and check what happens?

Z

lumull...@gmail.com

May 2, 2015, 3:03:24 PM
to corte...@googlegroups.com
Dear Zamin,

Thank you for your reply. I do indeed have Illumina 250-bp data and thought that a kmer length of 95 might be appropriate. I have restarted the analysis with a kmer length of 31, using the following command:

perl /home/ludom/CORTEX_release_v1.0.5.21/scripts/calling/run_calls.pl --first_kmer 31 --last_kmer 95 --kmer_step 32 --auto_cleaning yes --bc yes --pd no --outdir /scratch2/ludom/cortexVariants --outvcf cortexVariants1 --ref Absent --ploidy 1 --stampy_bin /cm/shared/apps/stampy/1.0.23/stampy.py --fastaq_index cortexIndexNoSE.txt --qthresh 5 --dups --mem_height 26 --mem_width 100 --gt_assemblies no --genome_size 180000000 --vcftools_dir /home/usr/ludom/vcftools_0.1.12b/ --do_union yes --logfile cortexVariants1.log --workflow joint --apply_pop_classifier

It did start by creating uncleaned binaries, but then quickly exited with the following error (taken from the build log of the first sample processed, "B35.1.unclean.kmer31.q5.ctx.build_log"):
-----
Fri May  1 08:55:42 2015
Starting Cortex, version 1.0.5.21
Command: /home/ludom/CORTEX_release_v1.0.5.21//bin/cortex_var_31_c1 --sample_id B35.1 --kmer_size 31 --mem_height 26 --mem_width 100 --dump_binary /scratch2/ludom/cortexVariants/binaries/uncleaned/31/B35.1.unclean.kmer31.q5.ctx --dump_covg_distribution /scratch2/ludom/cortexVariants/binaries/uncleaned/31/B35.1.unclean.kmer31.q5.ctx.covg --pe_list B35.1_R1_trimPaired.fastq,B35.1_R2_trimPaired.fastq --quality_score_threshold 5 --remove_pcr_duplicates 
Maximum k-mer size (compile-time setting): 31
Actual K-mer size: 31
Hash table created, number of buckets: 67108864
No SE data
Input file of paired end data: B35.1_R1_trimPaired.fastq, and B35.1_R2_trimPaired.fastq, 
quality cut-off: 5
Removing duplicates from the paired end files when both mates start with the same kmer
-----
Fri May  1 08:55:42 2015
Error: Cannot find sequence file: /scratch2/ludom/trimmedData/@M01271:62:000000000-A798C:1:1101:15420:1031 1:N:0:1

I don't really understand this error, since these are the sequence files given in the index file (its first five lines):
A14.1 . A14.1_R1_trimPaired.fastq A14.1_R2_trimPaired.fastq
A60.1 . A60.1_R1_trimPaired.fastq A60.1_R2_trimPaired.fastq
A7.1 . A7.1_R1_trimPaired.fastq A7.1_R2_trimPaired.fastq
A70.1 . A70.1_R1_trimPaired.fastq A70.1_R2_trimPaired.fastq
A71.1 . A71.1_R1_trimPaired.fastq A71.1_R2_trimPaired.fastq

The "sequence file" it cannot find is actually the header of the first read in the FASTQ file of sample B35.1:

@M01271:62:000000000-A798C:1:1101:15420:1031 1:N:0:1
NCATCTATAGAAATAATAAGTCTAAGGTATAGGACTTCCGTTTTATAAAACTTGTATTTATCTATATCTAATTATAGGCTAGCTCTATATAGGGCTTATAGGACTAGTTTAACGTGTTTTTAGTATTTCGCTAGATTATTGCTATATATTAAGATATCGTTAATATANGTAGTATAGAAGATGTCTAGGTANGGGCGTAGGGTATTATTTATAAAGTATTAGNATAAACTTAGGGCGTTTATATCACGGAGTAAGTNTAAGCG
+
#8ACCGGGGGGGGGGGGGGGFDFGFFFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGDFGGGGGGFGGGGGGGGGGGGGGGGDG9FCCGGGGGGFGDGFGGGGCGGGGGGGGGEFGGG?DGGGGGGGGGDGGGGGFEFG9CEFGGFFFGGGGGFGGGGBFGGFFFE#9AFFFGGGGGGGGGGGAFGFGGD#6@CD?CEGGGD8DFDGG9FGF,@A7:8EG,#6@E6EG8EGF?+?7=9;F9F+<<@D>DCDCFGA#3;94@4


Its paired-end mate is:

@M01271:62:000000000-A798C:1:1101:15420:1031 2:N:0:1
NGCTTATACTTACTCCGTGATATAAACGCCCTAAGTTTATTCTAATACTTTANAAATAATACCCTACGCCCGTACCTAGACATCTTCTANACTACTTATATTANNNNNNNNNTAATATATAGCAATAATCTAGCGAAATACTAAAAACACGTTAANNNNNNCCTATAAGCCCTATATAGAGCNAGCCTATAATTAGATATAGATNANNACNNNNTTTNTANAACGGAAGTCCTATACCTTAGACTTATTATTTCTATAGATGG
+
#8ACCGGGGGGGGGGGGGGGGFGGGFGGGGGGGGGGDGGGGGGGGGGGGGGG#:CFFGGDFGGGGGGGGGGGGGGEGGC9FGGGGGGGF#:BDGGGGGGGGGG#########::DFGGGGGGGGGGGEFGGGGGGGGEFGGGGGGGGGGGGFFGG######46@EEEGGGGGGGGFGGAFGD#6@@DG8EFACFA7FGGGGGG+#5##5*####303#33#3:@FFB>9FCFFAC=E<B<E=4:*7*/79@C<5@EFFF3EC8


The perl script also returns the following error:

Unable to build /scratch2/ludom/cortexVariants/binaries/uncleaned/31/B35.1.unclean.kmer31.q5.ctx at /home/ludom/CORTEX_release_v1.0.5.21/scripts/calling/run_calls.pl line 2097.


Can you tell me what is going wrong?


In addition to this particular problem, is there a reason the analysis starts with sample B35.1 rather than with the first sample listed in the index file (A14.1)?


Best regards,


Ludo.

Zamin Iqbal

May 2, 2015, 3:07:27 PM
to corte...@googlegroups.com
Ludo

The index file should contain NAMES OF FILELISTS, not names of fastq files.
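For example (illustrative file names; each filelist is a plain text file with one fastq path per line):

B35.1_pe1.list contains the single line:
B35.1_R1_trimPaired.fastq

B35.1_pe2.list contains:
B35.1_R2_trimPaired.fastq

and the corresponding line of your index then becomes:
B35.1 . B35.1_pe1.list B35.1_pe2.list

That is also where your error comes from: cortex opened the fastq itself as if it were a filelist, and tried to use its first line (the read header) as a filename.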

regards

Zam

Ludo Muller

Aug 15, 2015, 6:55:18 PM
to cortex_var
Dear Zamin,

I was able to generate the individual uncleaned binaries with a k-mer length of 31, and have started a script to generate the uncleaned pooled library. It has now been running for over three weeks, and I was wondering whether this duration is normal. The log file has not been updated since the script was started:

-----
Wed Jul 22 17:03:22 2015
Starting Cortex, version 1.0.5.21
Command: /home/ludom/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 --kmer_size 31 --mem_height 27 --mem_width 110 --colour_list colour31.list --dump_binary uncleaned_pool.q10.k31.ctx --dump_covg_distribution uncleaned_pool.q10.k31.covg.txt 
colour list is set to colour31.list
Maximum k-mer size (compile-time setting): 31
Actual K-mer size: 31
Hash table created, number of buckets: 134217728

-----
Wed Jul 22 17:03:22 2015

-----
Wed Jul 22 17:03:22 2015

Is there a way to assess whether the program is running correctly and/or how long it is expected to run?

Best regards,

Ludo.

Zamin Iqbal

Aug 15, 2015, 7:02:14 PM
to cortex_var
Something is wrong. Even when I merged human populations in the 1000 Genomes Project, a merge never took more than 3 days.
Can you try merging sequentially, or something like the following:
if you have 2N samples, run N jobs, each merging 2; then run N/2 jobs, each merging 2 of the outputs of the previous step, and so on.
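As a concrete sketch (file names illustrative; this is one-colour pooling exactly as in your pool command, with the memory parameters from your per-sample builds):

printf '%s\n' A14.1.unclean.kmer31.q5.ctx A60.1.unclean.kmer31.q5.ctx > pair1.binaries
echo pair1.binaries > pair1.colours
cortex_var_31_c1 --kmer_size 31 --mem_height 26 --mem_width 100 \
  --colour_list pair1.colours --dump_binary pair1.k31.ctx

Each line of the colour list names a file which itself lists the binaries to be unioned into that colour.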

My guess is that the 250bp reads still have adapters in them, and some are inside the reads, so maybe you have a larger number of kmers than expected biologically.
But this is pure speculation.
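For scale (arithmetic only, from the parameters in your merge log):

echo $(( 1 << 27 ))           # 134217728 buckets, matching the log line above
echo $(( (1 << 27) * 110 ))   # 14763950080 kmer slots (2^mem_height * mem_width)

A clean 180Mb haploid genome has on the order of 1.8e8 distinct kmers, so if the merged pool is coming anywhere near that table size, the excess is almost certainly error or adapter kmers.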
Z

Ludo Muller

Aug 19, 2015, 11:13:27 AM
to cortex_var
Dear Zamin,

Thanks for your answer; I was already suspecting something was wrong. Regarding remaining adapter sequences: adapters were removed using Trimmomatic, but I will check whether reducing the stringency of detection improves the runtime. Would the presence of adapters also influence the generation of binaries for individual samples? Runtimes for individual samples (10-15 million reads of 250 bp) were about 30 minutes. Does this seem normal?
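For reference, the clipping step in question is roughly the following (Trimmomatic's ILLUMINACLIP step; the adapter file and thresholds shown are illustrative, not my exact settings):

java -jar trimmomatic-0.32.jar PE \
  B35.1_R1.fastq B35.1_R2.fastq \
  B35.1_R1_trimPaired.fastq B35.1_R1_trimUnpaired.fastq \
  B35.1_R2_trimPaired.fastq B35.1_R2_trimUnpaired.fastq \
  ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10

Lowering the last two numbers (the palindrome and simple clip thresholds) makes detection less stringent and clipping more aggressive.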

I will try to merge sequentially as you suggested and report back.

Regards,

Ludo.

Ludo Muller

Sep 21, 2015, 7:35:01 AM
to cortex_var
Dear Zamin,

I have run Trimmomatic again to remove the adapters and merged the binaries sequentially as you suggested, starting with 30 jobs merging 2 binaries, followed by 15 jobs merging the outputs of the previous jobs (so merging 4 binaries), etc. The run times were as follows: about 20 minutes for 2 binaries, 30 minutes for 4 binaries, 45 minutes for 8 binaries and 55 minutes for 16 binaries. However, when I try to merge 32 binaries, cortex_var seems to stall; the job has now been running for more than 24 hours.

Do you have any further suggestions as to where to look for the problem?

Kind regards,

Ludo.

Zamin Iqbal

Sep 21, 2015, 7:41:17 AM
to cortex_var

How many samples in total, Ludo?

> I have run Trimmomatic again to remove the adapters and merged the binaries sequentially as you suggested, starting with 30 jobs merging 2 binaries, followed by 15 jobs merging the outputs of the previous jobs (so merging 4 binaries), etc. The run times were as follows: about 20 minutes for 2 binaries, 30 minutes for 4 binaries, 45 minutes for 8 binaries and 55 minutes for 16 binaries.

Wait, what does this mean? You first have a bunch of processes, each merging 2 binaries, and the average runtime for those was 20 mins?
And then you wanted to merge 2 of these output files, which took 30 mins on avg?
Or you merged 4 output files?


 
> However, when I try to merge 32 binaries, cortex_var seems to stall; the job has now been running for more than 24 hours.

Can you be more explicit about the total number of samples and how you are collapsing the tree? Is this the final job, or do you have a long way to go?
My big merges for the 1000 Genomes Project took 3 days, but I only had to do this once.

Ludo Muller

Oct 2, 2015, 8:22:56 PM
to cortex_var
> How many samples in total, Ludo?
 
In total, I have 60 samples.
 
>> I have run Trimmomatic again to remove the adapters and merged the binaries sequentially as you suggested, starting with 30 jobs merging 2 binaries, followed by 15 jobs merging the outputs of the previous jobs (so merging 4 binaries), etc. The run times were as follows: about 20 minutes for 2 binaries, 30 minutes for 4 binaries, 45 minutes for 8 binaries and 55 minutes for 16 binaries.

> Wait, what does this mean? You first have a bunch of processes, each merging 2 binaries, and the average runtime for those was 20 mins?
> And then you wanted to merge 2 of these output files, which took 30 mins on avg?
> Or you merged 4 output files?

Yes, first I had 30 processes, each merging 2 binaries, and those finished in 20 minutes. Then I had 15 processes, each merging 2 outputs (not 4) of the previous step, and those took about 30 minutes to finish. Continuing this way, I was able to get to merging 16 of the original binaries per job (step 4), and those processes took about 55 minutes.

>> However, when I try to merge 32 binaries, cortex_var seems to stall; the job has now been running for more than 24 hours.

> Can you be more explicit about the total number of samples and how you are collapsing the tree? Is this the final job, or do you have a long way to go?
> My big merges for the 1000 Genomes Project took 3 days, but I only had to do this once.

As mentioned above, I have a total of 60 samples. So collapsing the tree would require 6 steps of merging in total.

In the meantime, I also ran a process that directly (not sequentially) merges 30 binaries. It finished in 7 days. Maybe I can just use this pooled library to clean up the individual libraries? I will likely lose some variants in the cleaning process, but the 30 samples that these binaries correspond to come from four different populations, so they should cover quite a bit of the variation anyway.