Excessively soft-clipped primary alignments & low coverage

Cari Lewis

unread,

Apr 18, 2022, 1:32:11 PM4/18/22

to Stacks

I'm looking for advice/expertise on how to move forward with my data.

Background:

Last year, I did a ddRAD test library of 48 individuals using Pst1 and Msp1 with a target fragment size of 150 bp, single-end fragments on a HiSeq. They were run on a single lane with another test library (N=48) for another species. I used BWA and a reference genome for my species. The test library was great, with extremely high coverage (ranged from ~30x to 200x) so my next sequencing run was on a full NovaSeq 6000 lane with 3 libraries of 48 individuals (144 total), same enzymes and prep.

Unfortunately, I've run into problems at every step with the new libraries. My fastqc report (attached) indicates low quality bases at--I believe--the cutsites since the low quality scores correspond to the 6th-10th bp locations. I think that was causing me to lose the majority of my sequences in process_radtags due to "RAD cutsite not found drops", but was able to troubleshoot this with suggestions from this thread and reran process_radtags using the flag --disable_rad_check which recovered the majority of the sequences. My before --disable_rad_check and after runs are below for a single library since the percentages were very similar for all 3 libraries. I'm satisfied with this solution since I have a very small portion of low quality reads.

BEFORE:

process_radtags code:

process_radtags -p ../raw/$INPUT -o ./ -i gzfastq -r -c -q --e pstI -b ../info/$barcodes_file

output:

55876540 total sequences

103844 barcode not found drops (0.2%)

2039 low quality read drops (0.0%)

48318851 RAD cutsite not found drops (86.5%)

7451806 retained reads (13.3%)

AFTER:

process_radtags code:

process_radtags -p ../raw/$INPUT -o ./ -i gzfastq -r -c -q --e pstI -b ../info/$barcodes_file --disable_rad_check

output:

55876540 total sequences

103844 barcode not found drops (0.2%)

15808 low quality read drops (0.0%)

0 RAD cutsite not found drops (0.0%)

55756888 retained reads (99.8%)

My Current Problem:

I then ran ref_map.pl for the new individuals AND the initial test set of 48 for a total of 192 samples assuming panmixia and all assigned to the same population. The test set coverage was very high as previously found (highlighted individual), however the majority of the individuals from the new libraries have extremely low coverage (gstacks.log and gstacks.log.distribs[for the first few individuals] below).

gstacks.log:

gstacks v2.41, executed 2022-04-16 13:34:46 (zlib-1.2.11)

/opt/oscer/software/Stacks/2.41-foss-2018b/bin/gstacks -I ./alignments.bwa/ -M ../info/popmap.tsv -O ./stacks.bwa

Locus/sample distributions will be written to './stacks.bwa/gstacks.log.distribs'.

Configuration for this run:

Input mode: reference-based

Population map: '../info/popmap.tsv'

Input files: 192, e.g. './alignments.bwa/AIH07-03-N.bam'

Output to: './stacks.bwa/'

Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)

Reading BAM headers...

Processing all loci...

1K...

2K...

5K...

10K...

20K...

50K...

100K...

done.

Read 385059896 BAM records:

kept 176796250 primary alignments (46.4%), of which 0 reverse reads

skipped 21064011 primary alignments with insufficient mapping qualities (5.5%)

skipped 148251283 excessively soft-clipped primary alignments (38.9%)

skipped 35242263 unmapped reads (9.2%)

skipped some suboptimal (secondary/supplementary) alignment records

Per-sample stats (details in 'gstacks.log.distribs'):

read 2005520.3 records/sample (26854-7119373)

kept 1.8%-88.1% of these

Built 199467 loci comprising 176796250 forward reads and 0 matching paired-end reads; mean insert length was 0.0 (sd: nan).

Genotyped 199467 loci:

effective per-sample coverage: mean=56.6x, stdev=81.3x, min=1.2x, max=351.2x

mean number of sites per locus: 140.2

a consistent phasing was found for 270933 of out 641039 (42.3%) diploid loci needing phasing

gstacks is done.

gstacks.log.distribs:

BEGIN bam_stats_per_sample

sample records primary_kept kept_frac primary_kept_read2 primary_disc_mapq primary_disc_sclip unmapped secondary supplementary

AIH07-03-N 1404345 194005 0.138 0 97793 1037151 65382 10014 0

AIH10-02-F2 870196 83807 0.096 0 45354 599317 133679 8039 0

ANC01-01-F 1524233 176818 0.116 0 97617 1022852 215613 11333 0

ATL01-02-F2 1535230 172530 0.112 0 69608 950252 334044 8796 0

BOI01-01-M 1415181 217650 0.154 0 80240 1056764 46509 14018 0

BUR06-01-F 1148804 96466 0.084 0 66009 702906 277024 6399 0

CAN73078-001-F 3957206 3217259 0.813 0 190522 113731 401480 34214 0

CFV02-01-F2 2596338 234172 0.090 0 131037 1864447 349623 17059 0

CLD02-01-F 1411284 106428 0.075 0 79451 1032822 179295 13288 0

CLE73051-01-F 1297821 123715 0.095 0 82087 983875 99007 9137 0

COM73501-001-M 1289949 138149 0.107 0 90532 997592 54074 9602 0

COM73505-001-F 1133092 153854 0.136 0 77321 818193 75110 8614 0

COM73507-001-F 867293 93777 0.108 0 60014 634912 71997 6593 0

COM73507-002-M 1691799 214017 0.127 0 119854 1288029 54443 15456 0

COM73507-003-M 1078429 164674 0.153 0 71816 748525 84399 9015 0

COM73507-004-F 842436 105986 0.126 0 53923 548593 127014 6920 0

COM73507-005-M 1121324 164609 0.147 0 74247 811628 61582 9258 0

DAY01-03-F 1267749 167561 0.132 0 81468 920158 87628 10934 0

...

BEGIN effective_coverages_per_sample

# For mean_cov_ns, the coverage at each locus is weighted by the number of

# samples present at that locus (i.e. coverage at shared loci counts more).

sample n_loci n_used_fw_reads mean_cov mean_cov_ns

AIH07-03-N 16584 194005 11.698 16.378

AIH10-02-F2 15469 83807 5.418 7.440

ANC01-01-F 16287 176818 10.856 14.747

ATL01-02-F2 15393 172530 11.208 14.861

BOI01-01-M 17540 217650 12.409 17.801

BUR06-01-F 14173 96466 6.806 8.741

CAN73078-001-F 29295 3217259 109.823 193.264

CFV02-01-F2 19499 234172 12.009 17.822

CLD02-01-F 13800 106428 7.712 9.989

CLE73051-01-F 15020 123715 8.237 10.996

COM73501-001-M 15233 138149 9.069 12.297

COM73505-001-F 16345 153854 9.413 12.791

COM73507-001-F 14104 93777 6.649 8.521

COM73507-002-M 17108 214017 12.510 17.578

COM73507-003-M 15941 164674 10.330 13.936

COM73507-004-F 13380 105986 7.921 10.053

COM73507-005-M 16558 164609 9.941 13.698

DAY01-03-F 17025 167561 9.842 13.712

...

My Question:

I saw the recommendation of running stacks de novo and then aligning the consensus sequences to the reference genome, which I'm currently working on. To approach this from a different perspective, can I trim the low quality bases that occur at the beginning of the sequences (and trim the test library to match, if needed) before running ref_map.pl to fix the current problem? My thought is that hard clipping is acceptable in this instance since they're poor quality bases. Are there issues in the data and/or analysis that I'm overlooking?

This is a new problem for me and I'm struggling to understand the cause of the low-quality bases (so I can avoid it in the future!) and if this data is salvageable. I appreciate any advice you can give!

Thank you for your time,

Cari Lewis

unread,

Apr 18, 2022, 1:35:06 PM4/18/22

to Stacks

Actually attaching the fastqc report would help :)

Apologies,

Cari

L2203-06_S2_L001_R1_001_fastqc.html

L2203-05_S1_L001_R1_001_fastqc.html

L2203-07_S3_L001_R1_001_fastqc.html

Cari Lewis

unread,

Apr 18, 2022, 1:40:49 PM4/18/22

to Stacks

Third time is the charm

L2203-07_S3_L001_R1_001.fastq.gz FastQC Report.pdf

L2203-06_S2_L001_R1_001.fastq.gz FastQC Report.pdf

L2203-05_S1_L001_R1_001.fastq.gz FastQC Report.pdf

Catchen, Julian

unread,

Apr 19, 2022, 4:46:48 AM4/19/22

to stacks...@googlegroups.com

Hi Cari,

You can clip bases from the 5’ end of reads, however, it needs to be done uniformly so all the appropriate reads still stack up. (process_radtags cannot do this, you could do it with some simple UNIX and awk, or look for another timming program.)

However, I would spend some time understanding why you are seeing so much soft clipping of your alignments. Specifically, is the majority of your soft clipping coming on the 5’ or the 3’ end of the reads? Obviously, if it is due to a poor RE cutsite sequence, it will be on the 5’ end of the read. But it is also possible you have a lot of adapter in your sequenced libraries, which would result in soft clipping on the 3’ end of the reads. If that is the case, you want to go back to process_radtags and have it filter adaptor. Adaptor/dimer pairs, that is, reads that have no DNA, just the two adpaters stuck together might also ‘appear’ to just be missing the RAD cutsite, but are indeed missing the cutsite and everything else.

Best,

julian

Cari Lewis

unread,

Apr 20, 2022, 11:32:20 AM4/20/22

to Stacks

Julian,

Thank you for the advice! I didn't specify this earlier, but our Blue Pippin window is 326-426 bp including the adapters so that we end up with ~150 bp fragments after demultiplexing so I don't think it's adapter/dimers. But I reran process_radtags filtering the single end adapter sequence, and there was no change in the file sizes before/after which suggests to me there was not a large portion of adapters present.

Just to see if it would help, I used Trimmomatic to trim the first 5bp from the raw data files (and the test library for consistency) since those were the bases with the extremely low quality scores. I then ran BWA and ref_map.pl which retained 82.5% of the reads and increased coverage to above 10x for all except 5 individuals, but their raw files are very small to begin with so that was expected. It also identified over 300000 loci compared to the ~16000 loci using the test library, which I'm uncertain of at the moment and am going to look through the f-statistics to see how many are out of HWE. I've included the gstacks.log and first few individuals from the gstacks.log.distribs below (as in my previous email).

While this helped recover the majority of the data, it still doesn't help me understand why there's so much soft-clipping and what caused it. I've emailed the staff at the sequencing lab to get their perspective and more information on their process and data handling. I will follow up here with what I find out.

Thank you kindly,

Cari Lewis

gstacks.log:

gstacks v2.41, executed 2022-04-19 21:20:50 (zlib-1.2.11)

/opt/oscer/software/Stacks/2.41-foss-2018b/bin/gstacks -I ./trimmed.bwa/ -M ../info/popmap.tsv -O ./stacks.trim
Locus/sample distributions will be written to './stacks.trim/gstacks.log.distribs'.

Configuration for this run:
Input mode: reference-based
Population map: '../info/popmap.tsv'

Input files: 192, e.g. './trimmed.bwa/AIH07-03-N.bam'
Output to: './stacks.trim/'

Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)

Reading BAM headers...
Processing all loci...
1K...
2K...
5K...
10K...
20K...
50K...
100K...

200K...
done.

Read 384938361 BAM records:
kept 314798066 primary alignments (82.5%), of which 0 reverse reads
skipped 21306250 primary alignments with insufficient mapping qualities (5.6%)
skipped 9970835 excessively soft-clipped primary alignments (2.6%)
skipped 35278656 unmapped reads (9.3%)

skipped some suboptimal (secondary/supplementary) alignment records

Per-sample stats (details in 'gstacks.log.distribs'):

read 2004887.3 records/sample (26850-7114820)
kept 25.7%-88.0% of these

Built 302259 loci comprising 314798066 forward reads and 0 matching paired-end reads; mean insert length was 0.0 (sd: nan).

Genotyped 302259 loci:
effective per-sample coverage: mean=74.2x, stdev=48.8x, min=2.1x, max=260.2x
mean number of sites per locus: 133.2
a consistent phasing was found for 194822 of out 219258 (88.9%) diploid loci needing phasing

gstacks is done.

gstacks.log.distribs:

BEGIN bam_stats_per_sample
sample records primary_kept kept_frac primary_kept_read2 primary_disc_mapq primary_disc_sclip unmapped secondary supplementary

AIH07-03-N 1404119 1196110 0.852 0 98095 34713 65413 9788 0
AIH10-02-F2 869998 659413 0.758 0 45472 23549 133723 7841 0
ANC01-01-F 1523992 1164807 0.764 0 97720 34715 215658 11092 0
ATL01-02-F2 1534805 1090234 0.710 0 69863 32254 334083 8371 0
BOI01-01-M 1414674 1240178 0.877 0 80543 33893 46549 13511 0
BUR06-01-F 1148744 778334 0.678 0 66252 20752 277067 6339 0
CAN73078-001-F 3956020 3230985 0.817 0 194115 95946 401946 33028 0
CFV02-01-F2 2595240 2040640 0.786 0 131901 57055 349683 15961 0
CLD02-01-F 1411067 1107278 0.785 0 79997 31387 179334 13071 0
CLE73051-01-F 1297561 1082785 0.834 0 82223 24643 99033 8877 0
COM73501-001-M 1289818 1104678 0.856 0 90984 30571 54114 9471 0
COM73505-001-F 1132866 944327 0.834 0 77695 27309 75147 8388 0
COM73507-001-F 867157 708510 0.817 0 60383 19784 72023 6457 0
COM73507-002-M 1691315 1460997 0.864 0 120034 40837 54475 14972 0
COM73507-003-M 1078238 887975 0.824 0 72116 24795 84528 8824 0
COM73507-004-F 842373 636659 0.756 0 53950 17863 127044 6857 0
COM73507-005-M 1120619 951376 0.849 0 74411 24658 61621 8553 0
DAY01-03-F 1267647 1058103 0.835 0 81772 29266 87674 10832 0

...

BEGIN effective_coverages_per_sample
# For mean_cov_ns, the coverage at each locus is weighted by the number of
# samples present at that locus (i.e. coverage at shared loci counts more).
sample n_loci n_used_fw_reads mean_cov mean_cov_ns

AIH07-03-N 27446 1196110 43.580 57.488
AIH10-02-F2 33980 659413 19.406 29.870
ANC01-01-F 26581 1164807 43.821 54.371
ATL01-02-F2 24866 1090234 43.844 53.283
BOI01-01-M 28267 1240178 43.874 58.579
BUR06-01-F 25546 778334 30.468 36.996
CAN73078-001-F 30437 3230985 106.153 147.786
CFV02-01-F2 31866 2040640 64.038 87.675
CLD02-01-F 26085 1107278 42.449 52.121
CLE73051-01-F 25163 1082785 43.031 51.657
COM73501-001-M 25572 1104678 43.199 52.680
COM73505-001-F 25834 944327 36.554 44.561
COM73507-001-F 24684 708510 28.703 34.376
COM73507-002-M 26581 1460997 54.964 67.730
COM73507-003-M 24989 887975 35.535 43.098
COM73507-004-F 22046 636659 28.879 33.920
COM73507-005-M 26124 951376 36.418 45.239
DAY01-03-F 28238 1058103 37.471 48.762

...

Reply all

Reply to author

Forward