Excessively soft-clipped primary alignments & low coverage

463 views
Skip to first unread message

Cari Lewis

unread,
Apr 18, 2022, 1:32:11 PM4/18/22
to Stacks
I'm looking for advice/expertise on how to move forward with my data.

Background:
Last year, I did a ddRAD test library of 48 individuals using Pst1 and Msp1 with a target fragment size of 150 bp, single-end fragments on a HiSeq. They were run on a single lane with another test library (N=48) for another species. I used BWA and a reference genome for my species. The test library was great, with extremely high coverage (ranged from ~30x to 200x) so my next sequencing run was on a full NovaSeq 6000 lane with 3 libraries of 48 individuals (144 total), same enzymes and prep.

Unfortunately, I've run into problems at every step with the new libraries. My fastqc report (attached) indicates low quality bases at--I believe--the cutsites since the low quality scores correspond to the 6th-10th bp locations. I think that was causing me to lose the majority of my sequences in process_radtags due to "RAD cutsite not found drops", but was able to troubleshoot this with suggestions from this thread and reran process_radtags using the flag --disable_rad_check which recovered the majority of the sequences. My before --disable_rad_check and after runs are below for a single library since the percentages were very similar for all 3 libraries. I'm satisfied with this solution since I have a very small portion of low quality reads.

BEFORE:
process_radtags code:

process_radtags -p ../raw/$INPUT -o ./ -i gzfastq -r -c -q --e pstI -b ../info/$barcodes_file

output:

55876540 total sequences

  103844 barcode not found drops (0.2%)

    2039 low quality read drops (0.0%)

48318851 RAD cutsite not found drops (86.5%)

 7451806 retained reads (13.3%)

AFTER:

process_radtags code:

process_radtags -p ../raw/$INPUT -o ./ -i gzfastq -r -c -q --e pstI -b ../info/$barcodes_file --disable_rad_check

output:

55876540 total sequences

  103844 barcode not found drops (0.2%)

   15808 low quality read drops (0.0%)

       0 RAD cutsite not found drops (0.0%)

55756888 retained reads (99.8%)


My Current Problem:

I then ran ref_map.pl for the new individuals AND the initial test set of 48 for a total of 192 samples assuming panmixia and all assigned to the same population. The test set coverage was very high as previously found (highlighted individual), however the majority of the individuals from the new libraries have extremely low coverage (gstacks.log and gstacks.log.distribs[for the first few individuals] below).

gstacks.log:

gstacks v2.41, executed 2022-04-16 13:34:46 (zlib-1.2.11)

/opt/oscer/software/Stacks/2.41-foss-2018b/bin/gstacks -I ./alignments.bwa/ -M ../info/popmap.tsv -O ./stacks.bwa

Locus/sample distributions will be written to './stacks.bwa/gstacks.log.distribs'.


Configuration for this run:

  Input mode: reference-based

  Population map: '../info/popmap.tsv'

  Input files: 192, e.g. './alignments.bwa/AIH07-03-N.bam'

  Output to: './stacks.bwa/'

  Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)


Reading BAM headers...

Processing all loci...

1K...

2K...

5K...

10K...

20K...

50K...

100K...

done.


Read 385059896 BAM records:

  kept 176796250 primary alignments (46.4%), of which 0 reverse reads

  skipped 21064011 primary alignments with insufficient mapping qualities (5.5%)

  skipped 148251283 excessively soft-clipped primary alignments (38.9%)

  skipped 35242263 unmapped reads (9.2%)

  skipped some suboptimal (secondary/supplementary) alignment records


  Per-sample stats (details in 'gstacks.log.distribs'):

    read 2005520.3 records/sample (26854-7119373)

    kept 1.8%-88.1% of these


Built 199467 loci comprising 176796250 forward reads and 0 matching paired-end reads; mean insert length was 0.0 (sd: nan).


Genotyped 199467 loci:

  effective per-sample coverage: mean=56.6x, stdev=81.3x, min=1.2x, max=351.2x

  mean number of sites per locus: 140.2

  a consistent phasing was found for 270933 of out 641039 (42.3%) diploid loci needing phasing


gstacks is done.

gstacks.log.distribs:

BEGIN bam_stats_per_sample

sample    records    primary_kept    kept_frac    primary_kept_read2    primary_disc_mapq    primary_disc_sclip    unmapped    secondary    supplementary

AIH07-03-N    1404345    194005    0.138    0    97793    1037151    65382    10014    0

AIH10-02-F2    870196    83807    0.096    0    45354    599317    133679    8039    0

ANC01-01-F    1524233    176818    0.116    0    97617    1022852    215613    11333    0

ATL01-02-F2    1535230    172530    0.112    0    69608    950252    334044    8796    0

BOI01-01-M    1415181    217650    0.154    0    80240    1056764    46509    14018    0

BUR06-01-F    1148804    96466    0.084    0    66009    702906    277024    6399    0

CAN73078-001-F    3957206    3217259    0.813    0    190522    113731    401480    34214    0

CFV02-01-F2    2596338    234172    0.090    0    131037    1864447    349623    17059    0

CLD02-01-F    1411284    106428    0.075    0    79451    1032822    179295    13288    0

CLE73051-01-F    1297821    123715    0.095    0    82087    983875    99007    9137    0

COM73501-001-M    1289949    138149    0.107    0    90532    997592    54074    9602    0

COM73505-001-F    1133092    153854    0.136    0    77321    818193    75110    8614    0

COM73507-001-F    867293    93777    0.108    0    60014    634912    71997    6593    0

COM73507-002-M    1691799    214017    0.127    0    119854    1288029    54443    15456    0

COM73507-003-M    1078429    164674    0.153    0    71816    748525    84399    9015    0

COM73507-004-F    842436    105986    0.126    0    53923    548593    127014    6920    0

COM73507-005-M    1121324    164609    0.147    0    74247    811628    61582    9258    0

DAY01-03-F    1267749    167561    0.132    0    81468    920158    87628    10934    0

...


BEGIN effective_coverages_per_sample

# For mean_cov_ns, the coverage at each locus is weighted by the number of

# samples present at that locus (i.e. coverage at shared loci counts more).

sample    n_loci    n_used_fw_reads    mean_cov    mean_cov_ns

AIH07-03-N    16584    194005    11.698    16.378

AIH10-02-F2    15469    83807    5.418    7.440

ANC01-01-F    16287    176818    10.856    14.747

ATL01-02-F2    15393    172530    11.208    14.861

BOI01-01-M    17540    217650    12.409    17.801

BUR06-01-F    14173    96466    6.806    8.741

CAN73078-001-F    29295    3217259    109.823    193.264

CFV02-01-F2    19499    234172    12.009    17.822

CLD02-01-F    13800    106428    7.712    9.989

CLE73051-01-F    15020    123715    8.237    10.996

COM73501-001-M    15233    138149    9.069    12.297

COM73505-001-F    16345    153854    9.413    12.791

COM73507-001-F    14104    93777    6.649    8.521

COM73507-002-M    17108    214017    12.510    17.578

COM73507-003-M    15941    164674    10.330    13.936

COM73507-004-F    13380    105986    7.921    10.053

COM73507-005-M    16558    164609    9.941    13.698

DAY01-03-F    17025    167561    9.842    13.712

...


My Question:

I saw the recommendation of running stacks de novo and then aligning the consensus sequences to the reference genome, which I'm currently working on. To approach this from a different perspective, can I trim the low quality bases that occur at the beginning of the sequences (and trim the test library to match, if needed) before running ref_map.pl to fix the current problem? My thought is that hard clipping is acceptable in this instance since they're poor quality bases. Are there issues in the data and/or analysis that I'm overlooking?

This is a new problem for me and I'm struggling to understand the cause of the low-quality bases (so I can avoid it in the future!) and if this data is salvageable. I appreciate any advice you can give! 

Thank you for your time,

Cari Lewis

Cari Lewis

unread,
Apr 18, 2022, 1:35:06 PM4/18/22
to Stacks
Actually attaching the fastqc report would help :)

Apologies,
Cari
L2203-06_S2_L001_R1_001_fastqc.html
L2203-05_S1_L001_R1_001_fastqc.html
L2203-07_S3_L001_R1_001_fastqc.html

Cari Lewis

unread,
Apr 18, 2022, 1:40:49 PM4/18/22
to Stacks
Third time is the charm 
L2203-07_S3_L001_R1_001.fastq.gz FastQC Report.pdf
L2203-06_S2_L001_R1_001.fastq.gz FastQC Report.pdf
L2203-05_S1_L001_R1_001.fastq.gz FastQC Report.pdf

Catchen, Julian

unread,
Apr 19, 2022, 4:46:48 AM4/19/22
to stacks...@googlegroups.com

Hi Cari,

 

You can clip bases from the 5’ end of reads, however, it needs to be done uniformly so all the appropriate reads still stack up. (process_radtags cannot do this, you could do it with some simple UNIX and awk, or look for another timming program.)

 

However, I would spend some time understanding why you are seeing so much soft clipping of your alignments. Specifically, is the majority of your soft clipping coming on the 5’ or the 3’ end of the reads? Obviously, if it is due to a poor RE cutsite sequence, it will be on the 5’ end of the read. But it is also possible you have a lot of adapter in your sequenced libraries, which would result in soft clipping on the 3’ end of the reads. If that is the case, you want to go back to process_radtags and have it filter adaptor. Adaptor/dimer pairs, that is, reads that have no DNA, just the two adpaters stuck together might also ‘appear’ to just be missing the RAD cutsite, but are indeed missing the cutsite and everything else.

 

Best,

 

julian

Cari Lewis

unread,
Apr 20, 2022, 11:32:20 AM4/20/22
to Stacks
Julian,
Thank you for the advice! I didn't specify this earlier, but our Blue Pippin window is 326-426 bp including the adapters so that we end up with ~150 bp fragments after demultiplexing so I don't think it's adapter/dimers. But I reran process_radtags filtering the single end adapter sequence, and there was no change in the file sizes before/after which suggests to me there was not a large portion of adapters present.

Just to see if it would help, I used Trimmomatic to trim the first 5bp from the raw data files (and the test library for consistency) since those were the bases with the extremely low quality scores. I then ran BWA and ref_map.pl which retained 82.5% of the reads and increased coverage to above 10x for all except 5 individuals, but their raw files are very small to begin with so that was expected. It also identified over 300000 loci compared to the ~16000 loci using the test library, which I'm uncertain of at the moment and am going to look through the f-statistics to see how many are out of HWE. I've included the gstacks.log and first few individuals from the gstacks.log.distribs below (as in my previous email). 

While this helped recover the majority of the data, it still doesn't help me understand why there's so much soft-clipping and what caused it. I've emailed the staff at the sequencing lab to get their perspective and more information on their process and data handling. I will follow up here with what I find out.

Thank you kindly,
Cari Lewis

gstacks.log:
gstacks v2.41, executed 2022-04-19 21:20:50 (zlib-1.2.11)

/opt/oscer/software/Stacks/2.41-foss-2018b/bin/gstacks -I ./trimmed.bwa/ -M ../info/popmap.tsv -O ./stacks.trim
Locus/sample distributions will be written to './stacks.trim/gstacks.log.distribs'.



Configuration for this run:
  Input mode: reference-based
  Population map: '../info/popmap.tsv'

  Input files: 192, e.g. './trimmed.bwa/AIH07-03-N.bam'
  Output to: './stacks.trim/'


  Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)

Reading BAM headers...
Processing all loci...
1K...
2K...
5K...
10K...
20K...
50K...
100K...

200K...
done.

Read 384938361 BAM records:
  kept 314798066 primary alignments (82.5%), of which 0 reverse reads
  skipped 21306250 primary alignments with insufficient mapping qualities (5.6%)
  skipped 9970835 excessively soft-clipped primary alignments (2.6%)
  skipped 35278656 unmapped reads (9.3%)


  skipped some suboptimal (secondary/supplementary) alignment records

  Per-sample stats (details in 'gstacks.log.distribs'):

    read 2004887.3 records/sample (26850-7114820)
    kept 25.7%-88.0% of these

Built 302259 loci comprising 314798066 forward reads and 0 matching paired-end reads; mean insert length was 0.0 (sd: nan).

Genotyped 302259 loci:
  effective per-sample coverage: mean=74.2x, stdev=48.8x, min=2.1x, max=260.2x
  mean number of sites per locus: 133.2
  a consistent phasing was found for 194822 of out 219258 (88.9%) diploid loci needing phasing

gstacks is done.

gstacks.log.distribs:
BEGIN bam_stats_per_sample
sample    records    primary_kept    kept_frac    primary_kept_read2    primary_disc_mapq    primary_disc_sclip    unmapped    secondary    supplementary
AIH07-03-N    1404119    1196110    0.852    0    98095    34713    65413    9788    0
AIH10-02-F2    869998    659413    0.758    0    45472    23549    133723    7841    0
ANC01-01-F    1523992    1164807    0.764    0    97720    34715    215658    11092    0
ATL01-02-F2    1534805    1090234    0.710    0    69863    32254    334083    8371    0
BOI01-01-M    1414674    1240178    0.877    0    80543    33893    46549    13511    0
BUR06-01-F    1148744    778334    0.678    0    66252    20752    277067    6339    0
CAN73078-001-F    3956020    3230985    0.817    0    194115    95946    401946    33028    0
CFV02-01-F2    2595240    2040640    0.786    0    131901    57055    349683    15961    0
CLD02-01-F    1411067    1107278    0.785    0    79997    31387    179334    13071    0
CLE73051-01-F    1297561    1082785    0.834    0    82223    24643    99033    8877    0
COM73501-001-M    1289818    1104678    0.856    0    90984    30571    54114    9471    0
COM73505-001-F    1132866    944327    0.834    0    77695    27309    75147    8388    0
COM73507-001-F    867157    708510    0.817    0    60383    19784    72023    6457    0
COM73507-002-M    1691315    1460997    0.864    0    120034    40837    54475    14972    0
COM73507-003-M    1078238    887975    0.824    0    72116    24795    84528    8824    0
COM73507-004-F    842373    636659    0.756    0    53950    17863    127044    6857    0
COM73507-005-M    1120619    951376    0.849    0    74411    24658    61621    8553    0
DAY01-03-F    1267647    1058103    0.835    0    81772    29266    87674    10832    0
...

BEGIN effective_coverages_per_sample
# For mean_cov_ns, the coverage at each locus is weighted by the number of
# samples present at that locus (i.e. coverage at shared loci counts more).
sample    n_loci    n_used_fw_reads    mean_cov    mean_cov_ns

AIH07-03-N    27446    1196110    43.580    57.488
AIH10-02-F2    33980    659413    19.406    29.870
ANC01-01-F    26581    1164807    43.821    54.371
ATL01-02-F2    24866    1090234    43.844    53.283
BOI01-01-M    28267    1240178    43.874    58.579
BUR06-01-F    25546    778334    30.468    36.996
CAN73078-001-F    30437    3230985    106.153    147.786
CFV02-01-F2    31866    2040640    64.038    87.675
CLD02-01-F    26085    1107278    42.449    52.121
CLE73051-01-F    25163    1082785    43.031    51.657
COM73501-001-M    25572    1104678    43.199    52.680
COM73505-001-F    25834    944327    36.554    44.561
COM73507-001-F    24684    708510    28.703    34.376
COM73507-002-M    26581    1460997    54.964    67.730
COM73507-003-M    24989    887975    35.535    43.098
COM73507-004-F    22046    636659    28.879    33.920
COM73507-005-M    26124    951376    36.418    45.239
DAY01-03-F    28238    1058103    37.471    48.762

...


Reply all
Reply to author
Forward
0 new messages