Sarah,
Glad this helps, and that the memory changes sorted out the practical issues
so you can get it analyzed.
> For the sorting my test runs finished and indeed giving it more memory
> solves the problem. I'm also considering changing the setting for the tmp
> directory, because it turned out that our GPFS file system can handle it.
>
> Regarding the resource/memory settings, I still don't understand why only
> 1G per command. With 4G per core and 4 cores, so 16G total, 4 (sort)
> processes (that get memory limits) in that piped command and only one of
> these piped commands running, that would still leave 4G for each sort
> process.
> Unfortunately most of our cluster nodes have 5G per core ;-).
The memory specification on the samtools command line is per core, so:
-@ 4 -m 1G
is 4Gb for that program. That should give you 16Gb total for the 4 samtools
processes running in the pipe. Sorry, there are a lot of memory calculations.
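To make the arithmetic concrete, here is a small sketch (the sort command in
the comment is illustrative, with hypothetical file paths):

```shell
# samtools sort memory is per thread: -@ 4 -m 1G means each of the 4
# threads may use up to 1G, i.e. 4G total for that one process.
# Illustrative command (paths are hypothetical):
#   samtools sort -@ 4 -m 1G -T /tmp/example -o sorted.bam input.bam
threads=4
mem_per_thread_gb=1
echo "total for this samtools process: $((threads * mem_per_thread_gb))G"
```

With four such processes sharing one piped command, the totals add up to the
16Gb discussed above.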
If you'd like more fine-grained memory usage to maximize your 5G/core, you
could specify the value in megabytes -- 5000M -- bcbio is just not smart
enough right now to do that conversion for you.
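For reference, the resource settings live in bcbio_system.yaml; a minimal
sketch, assuming the standard `resources`/`default` layout (check against the
sections already present in your installed file):

```yaml
resources:
  default:
    cores: 4
    memory: 5000M
```

Since `memory` is interpreted per core, this requests 4 x 5000M, matching
nodes that provide 5G per core.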
> I just ran into another small unrelated issue. The tool and database
> versions for snpEff don't match on my recent bcbio-nextgen installation. I
> pasted the error at the end. I ran the upgrade with "-u stable --tools
> --data --genome GRCh37", but the error remains. I manually downloaded the
> appropriate database from snpEff, but it contains far fewer files.
Sorry, that's strange -- it should identify that the snpEff database is out of
date and re-download it. If you remove the directory and re-run, bcbio should
download it again, saving you the manual steps.
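A sketch of those steps (the data directory path is taken from your error
output below, and the upgrade flags mirror the ones you already ran; adjust
both for your installation):

```shell
# Remove the stale snpEff database so bcbio fetches a matching version.
# Path taken from the error output; adjust for your installation.
rm -rf /mnt/gaiagpfs/projects/lcsbsoft/workflows/pipelines/NGS/bcbio-nextgen/genomes/Hsapiens/GRCh37/snpeff/GRCh37.75

# Re-run the data upgrade you used before to re-download the database.
bcbio_nextgen.py upgrade -u stable --tools --data --genome GRCh37
```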
Hope this helps,
Brad
>
> ------------- snpEff error ---------------
>
> CalledProcessError: Command 'set -o pipefail;
> /work/projects/lcsbsoft/workflows/pipelines/NGS/bcbio-nextgen-v0.9.4/anaconda/bin/snpEff
> -Xms750m -Xmx4g
> -Djava.io.tmpdir=/tmp/bcbiotx/7db65472-2585-49f4-b817-ed2997390178/tmpWOLr2D/tmp
> eff -dataDir
> /mnt/gaiagpfs/projects/lcsbsoft/workflows/pipelines/NGS/bcbio-nextgen/genomes/Hsapiens/GRCh37/snpeff
> -cancer -noHgvs -noLog -i vcf -o vcf -s
> /mnt/lustre/projects/melanomics/sdiehl/work/structural/patient_2_PM/manta/tumorSV-patient_2_PM-effects-stats.html
> GRCh37.75
> /mnt/lustre/projects/melanomics/sdiehl/work/structural/patient_2_PM/manta/tumorSV-patient_2_PM.vcf.gz
> |
> /work/projects/lcsbsoft/workflows/pipelines/NGS/bcbio-nextgen-v0.9.4/anaconda/bin/pbgzip
> -n 3 -c >
> /tmp/bcbiotx/7db65472-2585-49f4-b817-ed2997390178/tmpWOLr2D/tumorSV-patient_2_PM-effects.vcf.gz
> java.lang.RuntimeException: Database file
> '/mnt/gaiagpfs/projects/lcsbsoft/workflows/pipelines/NGS/bcbio-nextgen/genomes/Hsapiens/GRCh37/snpeff/GRCh37.75/snpEffectPredictor.bin'
> is not compatible with this program version:
> Database version : '4.1'
> Program version : '4.2'
> Try installing the appropriate database.
>
>
> ---------------------------------------
>> Thanks for all the details, this helps a lot for trying to sort out issues.
>>
>> > In the first case the problem is that around 10 of those merges are
>> running
>> > at the same time. If only 2 are running (by specifying -n 2) it's fine.
>> > Since there are no settings for cpu or memory in the command, I guess
>> > changes in bcbio_system.yaml will not have any influence on it?
>>
>> Sorry, I don't have a good way to fix this without upgrading versions of
>> bcbio. We've moved away from sambamba merge because of scaling issues like
>> the ones you're seeing, so we no longer take this approach. I think your
>> diagnosis is correct -- it's probably due to running too many at once, but
>> we don't have a good workaround for it by tweaking bcbio_system.yaml.
>>
>> > In the second case there is only one sample and the memory limits are
>> below
>> > what's available on the machine. I played around a little bit with the
>> > settings and also isolated the failing command, but each test run takes
>> > about half a day. Since all the sorting creates A LOT of temporary files
>> > maybe this "kills" the file system? So I thought about actually giving it
>> > more memory. However, even though I specified "cores: 4" and "memory: 4G"
>> > the pipeline only puts 1G in the command.
>>
>> Your diagnosis seems spot on again. I'm a little surprised that Lustre is
>> falling over like this and not letting you write to disk, but depending on
>> the setup and other ongoing work on the cluster your thoughts sound
>> reasonable. If you only see this under load and you're not nearing the
>> memory requirements of the machine, this could be the cause.
>>
>> The other option is that you're having memory issues. What kind of memory
>> and cores do the machines you're running on have? The reason it uses 1G of
>> memory per core instead of the 4G per core you specified is that there are
>> multiple processes running simultaneously in that piped command, so it's
>> dividing up the available memory. If you set it to 6G per core it should
>> use 2G for each step, and you can see whether that improves the runs or
>> makes them worse.
>>
>> Sorry for not having definite answers -- tuning at scale depends a lot on
>> the system -- but hope this discussion helps,
>> Brad