Best protocol for mix of SE and PE libraries


Pearce Cooper

May 22, 2021, 6:54:41 PM
to dDocent User Help Forum
Hi Jon,

I want to double check on something.

I have a mix of SE and PE 150 bp HiSeq libraries (ddRAD prepped at MGL). These are mapped to a MiSeq reference. All have the same cut sites and bp window.

I have successfully demultiplexed the SE libraries using "demultiplex.pl" with some modifications (first by Pavel, later by me) that allow it to create an SE BASH wrapper for process_radtags. For the PE libraries I am using the original script, but demultiplexing with only the forward enzyme and then deleting all the reverse reads and sample* files, so that the libraries are compatible with Trimmomatic, mapping, and downstream steps.

I am thinking this should work: there is code on a previous issue that, I believe, indicates Trimmomatic identifies whether samples are SE or PE simply by the presence of a reverse read.
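For what it's worth, the delete-the-reverse-reads step can be sketched as a small shell loop. The file names below are hypothetical (not what process_radtags necessarily writes out); the demo setup just stands in for a demultiplexed PE library:

```shell
# Demo setup: fake demultiplexed PE output. The *.1.fq.gz / *.2.fq.gz /
# sample* naming here is an assumption; adjust the patterns to match
# what process_radtags actually produces for your run.
mkdir -p demux pe_as_se
touch demux/ind01.1.fq.gz demux/ind01.2.fq.gz demux/sample_01.fq.gz

# Keep only the forward reads, renamed so the library looks SE downstream.
for f in demux/*.1.fq.gz; do
    base=$(basename "$f" .1.fq.gz)
    cp "$f" "pe_as_se/${base}.F.fq.gz"   # forward read only
done

# Discard the reverse reads and the sample* files.
rm -f demux/*.2.fq.gz demux/sample*
```

After this, `pe_as_se/` holds only forward reads named like an SE library.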

Is this correct?

Sorry for the long winded email,

Thanks,

-Pearce



Pearce Cooper

May 22, 2021, 10:01:23 PM
to dDocent User Help Forum
It appears that I was incorrect about Trimmomatic, as fastp is now used (I am currently reworking the data on EARTH).
I have noticed that many adapters, though not all, go undetected by fastp across all libraries; some in low-quality individuals, others in individuals that seem fine.

In at least one PE library, the TruSeq adapter Index 2 is dominant in the adapter-not-found logs.
I need to investigate further which indices and libraries are most affected.

A typical not-found log reads as follows:

```
WARNING: cut_by_quality5 is deprecated, please use cut_front instead.
Detecting adapter sequence for read1...
No adapter detected for read1

Read1 before filtering:
total reads: 559526
total bases: 81131270
Q20 bases: 78550986(96.8196%)
Q30 bases: 74541824(91.878%)

Read1 after filtering:
total reads: 558997
total bases: 81054512
Q20 bases: 78511340(96.8624%)
Q30 bases: 74516448(91.9337%)

Filtering result:
reads passed filter: 558997
reads failed due to low quality: 428
reads failed due to too many N: 101
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate (may be overestimated since this is SE data): 67.1151%

JSON report: LM_LM092-L5-P2-A2.json
HTML report: LM_LM092-L5-P2-A2.html

fastp -i LM_LM092-L5-P2-A2.F.fq.gz -o LM_LM092-L5-P2-A2.R1.fq.gz --cut_by_quality5 20 --cut_by_quality3 20 --cut_window_size 5 --cut_mean_quality 15 -q 15 -u 50 -j LM_LM092-L5-P2-A2.json -h LM_LM092-L5-P2-A2.html 
fastp v0.19.7, time used: 20 seconds
```

To be honest, I was not familiar with this program until today, as my preliminary analyses were with Trimmomatic.
The reason I am investigating this is that both Dave Portnoy (my committee member) and I had noticed quite a high number of SNPs per locus, which did not make sense for SE data. The original perception was that this was due to an overabundance of singletons, but I am not so sure.

If anything jumps out at you, or if you have an idea about what might be a good thing to look at, could you let me know?

I realize there could be multiple things going on here, but I am not a bioinformatics guru by any means.
For example, I am not sure whether "cut_by_quality5 is deprecated, please use cut_front" could be a major culprit.

I will continue trying to find patterns across libraries and indices.

Thanks again,

-Pearce

Pearce Cooper

May 23, 2021, 12:25:30 AM
to dDocent User Help Forum

Sorry, vital info: EARTH is running dDocent version 2.8.7 at the moment, with fastp version 0.19.7, which has deprecated the options --cut_by_quality5 and --cut_by_quality3.
The version of dDocent on GitHub uses:

```
#single

fastp -i $1.F.fq.gz -o $1.R1.fq.gz --cut_by_quality5 20 --cut_by_quality3 20 --cut_window_size 5 --cut_mean_quality 15 -q 15 -u 50 $TW -j $1.json -h $1.html &> $1.trim.log
```
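If it helps, here is my guess at the same call translated to the non-deprecated option names. I am only assuming --cut_front/--cut_tail are the right replacements (and that the old "20" arguments map onto the newer --cut_front_mean_quality/--cut_tail_mean_quality options); please check `fastp --help` on your installed version before trusting this:

```shell
# Assumed translation of dDocent's SE fastp call to the renamed options.
# --cut_front / --cut_tail are boolean switches in newer fastp; the old
# "--cut_by_quality5 20" values presumably become the per-end
# *_mean_quality options. Verify against your installed fastp's help.
sample="LM_LM092-L5-P2-A2"
cmd="fastp -i ${sample}.F.fq.gz -o ${sample}.R1.fq.gz \
--cut_front --cut_front_mean_quality 20 \
--cut_tail --cut_tail_mean_quality 20 \
--cut_window_size 5 --cut_mean_quality 15 \
-q 15 -u 50 -j ${sample}.json -h ${sample}.html"
echo "$cmd"
```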

There are also warnings in the fastp documentation (0.19.6 and above) about SE data:

1.) https://github.com/OpenGene/fastp#per-read-cutting-by-quality-score

"WARNING: all these three operations will interfere deduplication for SE data, and --cut_front or --cut_right may also interfere deduplication for PE data. The deduplication algorithms rely on the exact matchment of coordination regions of the grouped reads/pairs."

2.) And about adapters with SE data:


"For SE data, the adapters are evaluated by analyzing the tails of first ~1M reads. This evaluation may be inaccurate, and you can specify the adapter sequence by -a or --adapter_sequence option. If adapter sequence is specified, the auto detection for SE data will be disabled"
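So presumably one workaround is to pass the adapter explicitly with -a/--adapter_sequence, which disables the SE auto-detection. The sequence below is the common TruSeq read-1 adapter (as shipped in Trimmomatic's TruSeq3-SE.fa); I am assuming, not asserting, that it is the right one for these libraries:

```shell
# Sketch: give fastp the adapter explicitly for SE data instead of
# relying on auto-detection. The adapter sequence is an assumption;
# confirm it matches your library prep before using it.
adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
sample="LM_LM092-L5-P2-A2"
cmd="fastp -i ${sample}.F.fq.gz -o ${sample}.R1.fq.gz \
--adapter_sequence ${adapter} \
-q 15 -u 50 -j ${sample}.json -h ${sample}.html"
echo "$cmd"
```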

So it looks like, for fastp, adapters should be specified for SE data, and the current version is not compatible with dDocent's SE code, or potentially with SE data in general. SE data may be outdated, but this is an old project on a budget.

I did not expect this, though, and I am in a time crunch. Is there a way to just use Trimmomatic?

What would be your recommendation? 

Thank you, and again sorry for the essay.

-Pearce

Jon Puritz

May 28, 2021, 11:22:23 AM
to ddo...@googlegroups.com
Hi Pearce,

There's a lot to unpack here. First, that's an old version of dDocent; please update it. Second, yes, fastp's adaptive adapter trimming leaves a little to be desired (https://github.com/jpuritz/dDocent/issues/52), and I am working on fixing this, basically by including a set of TruSeq adapters to search for.

The easiest fix for you is to get your own copy of dDocent (clone from repo), and edit the trimming code directly to include the adapter you are seeing. 

Alternatively, you can trim your reads with any program you want, and as long as they are named properly it won't matter for dDocent.
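For example, an SE Trimmomatic call along these lines. The trimming parameters and the adapter file path are placeholders to adapt, not dDocent's exact settings; the output name just mirrors the <sample>.R1.fq.gz pattern seen in the fastp log above:

```shell
# Sketch of trimming outside dDocent with Trimmomatic in SE mode, keeping
# the <sample>.R1.fq.gz naming from the thread's fastp log. TruSeq3-SE.fa
# and the thresholds below are placeholders; tune them for your data.
sample="LM_LM092-L5-P2-A2"
cmd="trimmomatic SE -phred33 ${sample}.F.fq.gz ${sample}.R1.fq.gz \
ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:5:15 MINLEN:36"
echo "$cmd"
```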

Jon Puritz, PhD

Assistant Professor
Department of Biological Sciences
University of Rhode Island
120 Flagg Road, Kingston, RI 02881



Webpage: MarineEvoEco.com

Cell:    401-338-8739
Work:  401-874-9020

"The most valuable of all talents is that of never using two words when one will do.” -Thomas Jefferson



