IOError (file too large)


Sandip Kale

May 19, 2017, 8:35:02 AM
to Platypus Users
I would like to run Platypus to analyse data from 10K GBS samples. For an initial run, I tried 541 samples. We have 250 GB RAM and 32 cores.

I used ulimit -n 2046 to raise the open-file limit so that I could use more CPUs. With 3 CPUs it works fine; however, with more than 3 CPUs it throws a BAM indexing error.

From another post I learned that the BAM index error can be ignored by using --fileCaching=2. So when I tried running the program with 10 CPUs and the --fileCaching=2 flag, I get "IOError: [Errno 27] File too large".
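[Editor's note: as a back-of-envelope check, not something confirmed in this thread, the open-file failure pattern is consistent with each worker process opening every BAM, so the descriptor count would scale as n_bams × n_cpu:]

```python
# Rough estimate of open file descriptors needed if every worker
# process opens every BAM (BAM index files would add to this).
def required_descriptors(n_bams, n_cpu):
    return n_bams * n_cpu

print(required_descriptors(541, 3))   # 1623 -- fits under ulimit -n 2046
print(required_descriptors(541, 4))   # 2164 -- already over the limit
print(required_descriptors(541, 10))  # 5410
```

Under this assumption, 3 CPUs is exactly the largest count that stays below the 2046 limit for 541 samples, matching the behaviour described above.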

Could you please help me sort out this issue?

Thanking you in advance.

With best regards

Sandip

Andy Rimmer

May 19, 2017, 8:37:22 AM
to Sandip Kale, Platypus Users
Hi Sandip,

Could you send me the log file and the exact command you used to run Platypus?

Kind regards,
Andy

--
You received this message because you are subscribed to the Google Groups "Platypus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platypus-users+unsubscribe@googlegroups.com.
To post to this group, send email to platypus-users@googlegroups.com.
Visit this group at https://groups.google.com/group/platypus-users.
To view this discussion on the web, visit https://groups.google.com/d/msgid/platypus-users/835e9c3f-3da5-401f-961b-82c8e19e65ff%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Dr Andrew (Andy) Rimmer

Sandip Kale

May 19, 2017, 8:55:44 AM
to Platypus Users, vska...@gmail.com
Hello Dr. Rimmer,

Thanks for the quick reply. Here are the command and the error:

Command: 

Platypus.py callVariants --bamFiles=sorted_bam.txt --fileCaching=2 --nCPU=10 --genIndels=0 --minMapQual=30 --minBaseQual=30 --minGoodQualBases=5 --badReadsThreshold=10 --rmsmqThreshold=20 --abThreshold=0.01 --maxReadLength=250  --hapScoreThreshold=20 --trimAdapter=0 --maxGOF=20 --minReads=2 --minFlank=5 --sbThreshold=0.01 --scThreshold=0.95 --hapScoreThreshold=15 --filterDuplicates=0 --filterVarsByCoverage=0 --filteredReadsFrac=0.7 --minVarFreq=0.0001 --mergeClusteredVariants=0 --filterReadsWithUnmappedMates=0 --refFile=wheat_psuedomolecules/161010_CS_v1.0_pseudomolecules_parts.fasta --logFileName=test_log_1 --output=test.vcf


Error: 

Exception IOError: (27, 'File too large') in 'vcfutils.outputCallToVCF' ignored

Also, I would like to know the best strategy for running the 10K samples.


Thanking you with best regards

Sandip



Andy Rimmer

May 19, 2017, 9:11:44 AM
to Sandip Kale, platypu...@googlegroups.com
Can you check the size of the output file that is produced? Are you writing the output to a partition with a limit on file size? I haven't seen this error before, so my guess is there's something unusual with your file-system setup.
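[Editor's note: a generic diagnostic, not a Platypus feature — errno 27 (EFBIG, "File too large") is raised when a write would exceed the process's file-size limit or the filesystem's maximum file size, so inspecting RLIMIT_FSIZE from Python is a quick first check:]

```python
import resource

# "File too large" is errno 27 (EFBIG): a write would exceed either the
# per-process file-size limit (RLIMIT_FSIZE) or the filesystem maximum.
soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else v
print("RLIMIT_FSIZE soft:", fmt(soft))
print("RLIMIT_FSIZE hard:", fmt(hard))
```

If both limits report "unlimited", the cap is coming from the filesystem itself (e.g. an older filesystem with a maximum file size) rather than from the shell or process limits.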

Platypus works best when run on small numbers of samples, though it depends a bit on the per-sample coverage. If you have very high coverage for each sample, then the variant calls will be fine if you process each sample individually. If you have low coverage, then processing in small batches (<= 10) will improve the calls.

The downside is that you end up with a lot of VCFs which need merging. One option is to process the files individually or in batches, then concatenate the list of unique variants into a single VCF file, and use that VCF as input to a second run of Platypus, where you genotype all samples at the same list of sites simultaneously:

python Platypus.py callVariants --bamFiles=list_of_all_bams.txt --refFile=ref.fa --output=out.vcf --source=list_of_all_variants.vcf.gz --minPosterior=0 --getVariantsFromBAMs=0
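[Editor's note: the batching step above can be sketched as follows; the file names and the batch size of 10 are illustrative assumptions, not Platypus defaults. Each written list file would be passed to a separate run via --bamFiles:]

```python
# Sketch: split a master list of BAM paths into batches of <= 10 and
# write one --bamFiles list per batch for separate Platypus runs.
def make_batches(paths, batch_size=10):
    return [paths[i:i + batch_size]
            for i in range(0, len(paths), batch_size)]

bams = [f"sample_{i:05d}.bam" for i in range(541)]  # illustrative names
batches = make_batches(bams)
for n, batch in enumerate(batches):
    with open(f"bam_batch_{n:03d}.txt", "w") as fh:
        fh.write("\n".join(batch) + "\n")

print(len(batches))  # 55 lists: 54 full batches of 10 plus one of 1
```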

Kind regards,
Andy



Sandip Kale

May 19, 2017, 9:51:03 AM
to Platypus Users, vska...@gmail.com
Thank you very much for the information, Dr. Rimmer.

I think we have ample free space on the server, so I don't believe that is really the issue. Still, I will check it again.

Also, thank you very much for the suggestion on analysing the large dataset.

With best regards

Sandip