I am facing an interesting case where BBTools callvariants.sh fails to call SNPs in high-density regions. I have some Nanopore reads, and Clair3 tells me those regions do exist, as shown in the picture below. I would like to tweak callvariants.sh to detect those SNPs, since I have a lot of Illumina data and not so much in the realm of long reads. I know Brian Bushnell sometimes answers here, but anyone with a suggestion is welcome. As you can see, the upper track from Clair3 identifies the SNPs, while they are rejected by the short-read caller. Any idea?
Hmmm... usually reads with lots of variants in a region are mismapped as a result of structural variation, so CallVariants penalizes them... in this case they also have low AF (unless this is a diploid). You can add the flag "countnearbyvars=f" to disable this. Posting the exact VCF lines generated for these variants, which you can produce with the "clearfilters" flag, would show exactly which filters they are failing. If this is a haploid you can also play with the "rarity" flag.
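A minimal sketch of the two reruns suggested above (the input, reference, and output file names are placeholders, not from the thread):

```shell
# Disable the nearby-variant penalty that down-weights dense SNP clusters
callvariants.sh in=mapped.bam ref=reference.fa vcf=vars_nonearby.vcf countnearbyvars=f

# Emit every candidate with filtering disabled, so the FILTER column
# reveals which criteria each rejected SNP is failing
callvariants.sh in=mapped.bam ref=reference.fa vcf=vars_unfiltered.vcf clearfilters
```

Comparing the FILTER column of the second VCF against the Clair3 calls should pinpoint the filters to relax.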
Thanks for your answer. This is a haploid assembly of a diploid organism; I'm not sure whether you meant the organism's ploidy or the assembly's ploidy. I'm going to try your suggestion and hunt down the faulty filters. Thanks a lot. By the way, the speed at which your tool calls variants is a-m-a-z-i-n-g.
Brian, I have another question. I have a phased diploid assembly and want to map it against a diploid reference, then call SNPs. Do you think callvariants.sh would work here? The thing is, the coverage will be 1x (assuming no repeat issues, etc.; I am assuming perfect mapping with a cardinality of 1). Thanks again.
If you are going to align an assembly to a reference, that will work fine, but you need to use the clearfilters flag since you want everything to show up. Personally I have never worked with a diploid assembly, just haploidified assemblies... I'm not sure how well aligning a diploid assembly to a diploid reference would work, since you'll probably get 2x coverage in some places and 0x in others, but it's worth trying.
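One way to sketch the assembly-to-reference approach with BBTools (file names are placeholders, and the maxindel setting is my assumption for mapping long contigs, not something stated in the thread):

```shell
# Map the assembly contigs to the reference; a large maxindel may help
# long contigs align across indels
bbmap.sh in=assembly.fa ref=reference.fa out=asm_mapped.sam maxindel=100k

# Call variants with filtering disabled, so 1x-coverage calls survive
callvariants.sh in=asm_mapped.sam ref=reference.fa vcf=asm_vars.vcf clearfilters
```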
Based on various posts at Biostars, I gather that error correction (EC) is important for accurate de Bruijn graph generation and de novo assembly. I am trying to stick with BBTools for all my pre-processing steps.
I've completed the following steps: force-trim modulo, adapter trimming, quality trimming, PhiX-174 check & removal, and human-contamination check & removal. For this EC step, I seem to be running out of memory with bbnorm.sh, or even khist.sh.
My bbnorm.sh syntax and STDOUT are shown below. I thought that by using the recommended flags, explained at -and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/, I would never run into memory issues. Perhaps I am misunderstanding those instructions/syntax? Because I never ran into memory issues with Jellyfish (for example), I suspect my syntax and/or understanding of EC is incomplete or incorrect.
As for memory kills: that's different from BBNorm running out of memory. BBNorm does not run out of memory, because the amount of memory it uses is defined up front and is unrelated to the dataset (as long as you are using Illumina data; theoretically, if you had a bunch of 100 Mbp reads, you could make it run out of memory, but I only recommend it for short reads). Memory kills, however, are determined by the scheduler, and sometimes schedulers will simply kill jobs that are doing nothing wrong because the scheduler was misconfigured by the administrators.
In this case, you need to find out what the virtual memory limit is for the job and set the -Xmx flag lower than that limit. The exact value depends, but generally, setting it 5% plus 1.5 GB below the limit should be sufficient; so if the virtual memory limit is 100 GB, try adding the flag "-Xmx93G". Obviously, in this case you need a lower value, since you were already getting killed at 48 GB, so you need to find out from the admins why the job is getting killed and what the limits are.
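The arithmetic behind that rule of thumb can be scripted; here is a small sketch (the 100 GB limit is just the example value from above):

```shell
# Compute a safe JVM heap from a known virtual-memory limit:
# leave 5% plus 1.5 GB of headroom, then round down
limit_gb=100
xmx_gb=$(awk -v l="$limit_gb" 'BEGIN { printf "%d", l * 0.95 - 1.5 }')
echo "-Xmx${xmx_gb}g"
```

The result can then be passed to any BBTools script, e.g. `bbnorm.sh in=reads.fq.gz out=normalized.fq.gz -Xmx93g`.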
Per your advice, using tadpole.sh, I bumped up the RAM requested incrementally, and it took > 50 GB for the command to execute. If you could please look at the info I've copy/pasted and linked to, and comment on anything out of the ordinary or abnormal, that would be helpful indeed. Thanks, Brian!
"prefilter=t" works, and becomes "prefilter=2". You don't need to use prefilter unless you run out of memory (basically, for large genomes and large datasets). A Count-min sketch is a type of Bloom filter. It's unrelated to the "mincount" or "minprob" flag. "minprob" ignores individual kmers, as encountered in a read, with a probability of correctness below a certain value. "prefilter" first counts everything, then ignores kmers with a probable count below a certain value. "mincount" is different; it's mainly for assembly (don't assemble kmers with count less than X).
Just ignore the advanced flags for now; the defaults are fine, and you only need to adjust K based on read length (typically 1/3 of the read length is good for error correction). You can adjust minprob and prefilter if you run out of memory.
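Putting that advice together for ~150 bp reads might look like this (file names and the -Xmx value are placeholders):

```shell
# Error-correct with tadpole.sh: k is roughly 1/3 of the 150 bp read
# length, and prefilter reduces memory use on a large dataset
tadpole.sh in=reads.fq.gz out=ecc.fq.gz mode=correct k=50 prefilter=2 -Xmx50g
```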
I am trying to use bbtools:bbduk to remove host (mouse) contaminants from my metagenomics data but am running into an issue regarding memory needed for Galaxy to run this, and the job will not complete. Is there a way to adjust the memory? Or any other solutions? Thanks!
While the updated version of the tool is now available, it also fails with such a large query (3+ GB for each end of the pair) and target (mm39) combination. Indexing k-mers from the entire mouse genome is where the tool seems to have the immediate trouble.
I have ticketed the request for an increase in the memory allocation at UseGalaxy.org: "Enhancement: allocate more memory for bbtools_bbduk 39.06+galaxy2 at ORG" (galaxyproject/usegalaxy-tools issue #750 on GitHub).
That will not be immediate. Meanwhile, I would suggest trying UseGalaxy.eu or UseGalaxy.org.au. You can easily move data between servers without a download step; see the FAQ: Transfer entire histories from one Galaxy server to another.
On Biowulf, all BBTools are used through a wrapper script that automatically passes correct memory information to each of the tools. This is important, as the automatic memory detection by the individual tools is not Slurm-aware and reports incorrect amounts of available memory.
Batch job

Most jobs should be run as batch jobs. As an example of using a few of the tools, we will download some bacterial genomes, simulate reads from these genomes in different ratios, apply kmer-based normalization, align the raw and normalized reads to the genomes, and finally plot the coverage of the genomes before and after normalization.
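The steps described above might be sketched as follows (genome files, read counts, and output names are placeholders, not taken from the original page):

```shell
# 1. Simulate reads from two genomes in different ratios
randomreads.sh ref=genomeA.fa out=readsA.fq.gz reads=1000000 length=150
randomreads.sh ref=genomeB.fa out=readsB.fq.gz reads=250000 length=150
cat readsA.fq.gz readsB.fq.gz > raw.fq.gz

# 2. Apply kmer-based normalization to a target depth
bbnorm.sh in=raw.fq.gz out=normalized.fq.gz target=40

# 3. Align raw and normalized reads to the combined genomes and
#    collect per-reference coverage statistics for plotting
cat genomeA.fa genomeB.fa > genomes.fa
bbmap.sh in=raw.fq.gz ref=genomes.fa covstats=raw_cov.txt
bbmap.sh in=normalized.fq.gz ref=genomes.fa covstats=norm_cov.txt
```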
Below are the statistics for RNA-seq paired-end reads mapped and unmapped to the rice genome, generated with reformat.sh from BBTools on BAM files. It reports 77% mapped and 5% unmapped; what about the remaining 18% of reads? How can I get information for all reads?
This page was created before conda became the wonderful powerhouse it is today. Conda is a package and environment manager that is by far the easiest way to handle installing most of the tools we would want to use.
And other just plain convenient things to have handy, like removing those annoying soft line wraps that some fasta files have (bit-remove-wraps) and printing out the column names of a TSV with numbers (bit-colnames) to quickly see which columns need to be provided to things like cut or awk. Some require biopython and pybedtools, but all of that is taken care of if you use the conda installation.
Each command has a help menu accessible by entering the command alone or by providing -h as the only argument. Once installed, you can see all available commands by entering bit- and pressing tab twice.
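For context on the column-numbering example above, here is a plain-shell emulation of that idea (this emulation is mine, not part of bit, and the TSV is made up):

```shell
# Make a small demo TSV, then number its header columns so you can see
# which column number to hand to cut or awk
printf 'name\tlength\tGC\nseq1\t100\t0.51\n' > demo.tsv
head -1 demo.tsv | tr '\t' '\n' | cat -n

# Pull the "length" column by its number
cut -f 2 demo.tsv
```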
FastQC is a handy and user-friendly tool that will scan your fastq files to generate a broad overview summarizing some useful information, and possibly identify any commonly occurring problems with your data. But as the developers note, its modules are expecting random sequence data, and any warning or failure notices should be interpreted within the context of your experiment.
You can list multiple files like this delimited by a space, and the program is capable of dealing with gzipped files. This will produce a .html file for each fastq file you run it on that you can then open and analyze.
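For example, running it on both reads of a gzipped pair (the file names are placeholders):

```shell
# Produces one .html report per input fastq file
fastqc sample_R1.fastq.gz sample_R2.fastq.gz
```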
Since bbtools comes with many programs, stored within the directory we just untarred, it may be preferable to add that directory to our PATH, rather than copying all of the programs to our location here. So here we are adding another directory to our PATH:
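That could look like the following (assuming the untarred directory is named bbmap and sits in the current directory):

```shell
# Prepend the untarred bbmap directory to PATH for this shell session
export PATH="$PWD/bbmap:$PATH"
```

To make this persist across sessions, the same line can be added to your shell profile.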
QUAST is a really nice tool for comparing multiple assemblies, and for metagenome assemblies there is the comparable MetaQUAST. Some discussion and example usage can be found here. To install it on my personal computer, I followed the instructions laid out here, and, because QUAST compiles things as needed when first used, I added its location to my PATH:
SPAdes is an assembly program. You can read some of my thoughts on assemblies here. SPAdes is packaged as a binary, and the developers provide excellent installation instructions here. This is how I installed the latest release at the time on my local computer:
Since SPAdes comes with many programs, stored in a subdirectory of the SPAdes directory we just untarred, it may be preferable to add that directory to our PATH, rather than copying all of the programs to our location here. So here we are adding another directory to our PATH:
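A sketch of such an install on Linux (the version number here is an assumption; check the SPAdes releases page for the current one):

```shell
# Download and unpack a SPAdes binary release, then put its bin/ on PATH
curl -LO https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5-Linux.tar.gz
tar -xzf SPAdes-3.15.5-Linux.tar.gz
export PATH="$PWD/SPAdes-3.15.5-Linux/bin:$PATH"

# Confirm it is picked up
spades.py --version
```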