process_radtags -p ../raw/$INPUT -o ./ -i gzfastq -r -c -q --e pstI -b ../info/$barcodes_file
55876540 total sequences
103844 barcode not found drops (0.2%)
2039 low quality read drops (0.0%)
48318851 RAD cutsite not found drops (86.5%)
7451806 retained reads (13.3%)
AFTER:
process_radtags code:
process_radtags -p ../raw/$INPUT -o ./ -i gzfastq -r -c -q --e pstI -b ../info/$barcodes_file --disable_rad_check
output:
55876540 total sequences
103844 barcode not found drops (0.2%)
15808 low quality read drops (0.0%)
0 RAD cutsite not found drops (0.0%)
55756888 retained reads (99.8%)
My Current Problem:
I then ran ref_map.pl for the new individuals AND the initial test set of 48 for a total of 192 samples assuming panmixia and all assigned to the same population. The test set coverage was very high as previously found (highlighted individual), however the majority of the individuals from the new libraries have extremely low coverage (gstacks.log and gstacks.log.distribs[for the first few individuals] below).
gstacks.log:
gstacks v2.41, executed 2022-04-16 13:34:46 (zlib-1.2.11)
/opt/oscer/software/Stacks/2.41-foss-2018b/bin/gstacks -I ./alignments.bwa/ -M ../info/popmap.tsv -O ./stacks.bwa
Locus/sample distributions will be written to './stacks.bwa/gstacks.log.distribs'.
Configuration for this run:
Input mode: reference-based
Population map: '../info/popmap.tsv'
Input files: 192, e.g. './alignments.bwa/AIH07-03-N.bam'
Output to: './stacks.bwa/'
Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)
Reading BAM headers...
Processing all loci...
1K...
2K...
5K...
10K...
20K...
50K...
100K...
done.
Read 385059896 BAM records:
kept 176796250 primary alignments (46.4%), of which 0 reverse reads
skipped 21064011 primary alignments with insufficient mapping qualities (5.5%)
skipped 148251283 excessively soft-clipped primary alignments (38.9%)
skipped 35242263 unmapped reads (9.2%)
skipped some suboptimal (secondary/supplementary) alignment records
Per-sample stats (details in 'gstacks.log.distribs'):
read 2005520.3 records/sample (26854-7119373)
kept 1.8%-88.1% of these
Built 199467 loci comprising 176796250 forward reads and 0 matching paired-end reads; mean insert length was 0.0 (sd: nan).
Genotyped 199467 loci:
effective per-sample coverage: mean=56.6x, stdev=81.3x, min=1.2x, max=351.2x
mean number of sites per locus: 140.2
a consistent phasing was found for 270933 of out 641039 (42.3%) diploid loci needing phasing
gstacks is done.
gstacks.log.distribs:
BEGIN bam_stats_per_sample
sample records primary_kept kept_frac primary_kept_read2 primary_disc_mapq primary_disc_sclip unmapped secondary supplementary
AIH07-03-N 1404345 194005 0.138 0 97793 1037151 65382 10014 0
AIH10-02-F2 870196 83807 0.096 0 45354 599317 133679 8039 0
ANC01-01-F 1524233 176818 0.116 0 97617 1022852 215613 11333 0
ATL01-02-F2 1535230 172530 0.112 0 69608 950252 334044 8796 0
BOI01-01-M 1415181 217650 0.154 0 80240 1056764 46509 14018 0
BUR06-01-F 1148804 96466 0.084 0 66009 702906 277024 6399 0
CAN73078-001-F 3957206 3217259 0.813 0 190522 113731 401480 34214 0
CFV02-01-F2 2596338 234172 0.090 0 131037 1864447 349623 17059 0
CLD02-01-F 1411284 106428 0.075 0 79451 1032822 179295 13288 0
CLE73051-01-F 1297821 123715 0.095 0 82087 983875 99007 9137 0
COM73501-001-M 1289949 138149 0.107 0 90532 997592 54074 9602 0
COM73505-001-F 1133092 153854 0.136 0 77321 818193 75110 8614 0
COM73507-001-F 867293 93777 0.108 0 60014 634912 71997 6593 0
COM73507-002-M 1691799 214017 0.127 0 119854 1288029 54443 15456 0
COM73507-003-M 1078429 164674 0.153 0 71816 748525 84399 9015 0
COM73507-004-F 842436 105986 0.126 0 53923 548593 127014 6920 0
COM73507-005-M 1121324 164609 0.147 0 74247 811628 61582 9258 0
DAY01-03-F 1267749 167561 0.132 0 81468 920158 87628 10934 0
...
BEGIN effective_coverages_per_sample
# For mean_cov_ns, the coverage at each locus is weighted by the number of
# samples present at that locus (i.e. coverage at shared loci counts more).
sample n_loci n_used_fw_reads mean_cov mean_cov_ns
AIH07-03-N 16584 194005 11.698 16.378
AIH10-02-F2 15469 83807 5.418 7.440
ANC01-01-F 16287 176818 10.856 14.747
ATL01-02-F2 15393 172530 11.208 14.861
BOI01-01-M 17540 217650 12.409 17.801
BUR06-01-F 14173 96466 6.806 8.741
CAN73078-001-F 29295 3217259 109.823 193.264
CFV02-01-F2 19499 234172 12.009 17.822
CLD02-01-F 13800 106428 7.712 9.989
CLE73051-01-F 15020 123715 8.237 10.996
COM73501-001-M 15233 138149 9.069 12.297
COM73505-001-F 16345 153854 9.413 12.791
COM73507-001-F 14104 93777 6.649 8.521
COM73507-002-M 17108 214017 12.510 17.578
COM73507-003-M 15941 164674 10.330 13.936
COM73507-004-F 13380 105986 7.921 10.053
COM73507-005-M 16558 164609 9.941 13.698
DAY01-03-F 17025 167561 9.842 13.712
...
My Question:
I saw the recommendation of running stacks de novo and then aligning the consensus sequences to the reference genome, which I'm currently working on. To approach this from a different perspective, can I trim the low quality bases that occur at the beginning of the sequences (and trim the test library to match, if needed) before running ref_map.pl to fix the current problem? My thought is that hard clipping is acceptable in this instance since they're poor quality bases. Are there issues in the data and/or analysis that I'm overlooking?
This is a new problem for me and I'm struggling to understand the cause of the low-quality bases (so I can avoid it in the future!) and if this data is salvageable. I appreciate any advice you can give!
Thank you for your time,
Cari Lewis
Hi Cari,
You can clip bases from the 5’ end of reads, however, it needs to be done uniformly so all the appropriate reads still stack up. (process_radtags cannot do this, you could do it with some simple UNIX and awk, or look for another timming program.)
However, I would spend some time understanding why you are seeing so much soft clipping of your alignments. Specifically, is the majority of your soft clipping coming on the 5’ or the 3’ end of the reads? Obviously, if it is due to a poor RE cutsite sequence, it will be on the 5’ end of the read. But it is also possible you have a lot of adapter in your sequenced libraries, which would result in soft clipping on the 3’ end of the reads. If that is the case, you want to go back to process_radtags and have it filter adaptor. Adaptor/dimer pairs, that is, reads that have no DNA, just the two adpaters stuck together might also ‘appear’ to just be missing the RAD cutsite, but are indeed missing the cutsite and everything else.
Best,
julian
/opt/oscer/software/Stacks/2.41-foss-2018b/bin/gstacks -I ./trimmed.bwa/ -M ../info/popmap.tsv -O ./stacks.trim
Locus/sample distributions will be written to './stacks.trim/gstacks.log.distribs'.
Configuration for this run:
Input mode: reference-based
Population map: '../info/popmap.tsv'
Input files: 192, e.g. './trimmed.bwa/AIH07-03-N.bam'
Output to: './stacks.trim/'
Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)
Reading BAM headers...
Processing all loci...
1K...
2K...
5K...
10K...
20K...
50K...
100K...
200K...
done.
Read 384938361 BAM records:
kept 314798066 primary alignments (82.5%), of which 0 reverse reads
skipped 21306250 primary alignments with insufficient mapping qualities (5.6%)
skipped 9970835 excessively soft-clipped primary alignments (2.6%)
skipped 35278656 unmapped reads (9.3%)
skipped some suboptimal (secondary/supplementary) alignment records
Per-sample stats (details in 'gstacks.log.distribs'):
read 2004887.3 records/sample (26850-7114820)
kept 25.7%-88.0% of these
Built 302259 loci comprising 314798066 forward reads and 0 matching paired-end reads; mean insert length was 0.0 (sd: nan).
Genotyped 302259 loci:
effective per-sample coverage: mean=74.2x, stdev=48.8x, min=2.1x, max=260.2x
mean number of sites per locus: 133.2
a consistent phasing was found for 194822 of out 219258 (88.9%) diploid loci needing phasing
gstacks is done.
BEGIN effective_coverages_per_sample
# For mean_cov_ns, the coverage at each locus is weighted by the number of
# samples present at that locus (i.e. coverage at shared loci counts more).
sample n_loci n_used_fw_reads mean_cov mean_cov_ns
AIH07-03-N 27446 1196110 43.580 57.488
AIH10-02-F2 33980 659413 19.406 29.870
ANC01-01-F 26581 1164807 43.821 54.371
ATL01-02-F2 24866 1090234 43.844 53.283
BOI01-01-M 28267 1240178 43.874 58.579
BUR06-01-F 25546 778334 30.468 36.996
CAN73078-001-F 30437 3230985 106.153 147.786
CFV02-01-F2 31866 2040640 64.038 87.675
CLD02-01-F 26085 1107278 42.449 52.121
CLE73051-01-F 25163 1082785 43.031 51.657
COM73501-001-M 25572 1104678 43.199 52.680
COM73505-001-F 25834 944327 36.554 44.561
COM73507-001-F 24684 708510 28.703 34.376
COM73507-002-M 26581 1460997 54.964 67.730
COM73507-003-M 24989 887975 35.535 43.098
COM73507-004-F 22046 636659 28.879 33.920
COM73507-005-M 26124 951376 36.418 45.239
DAY01-03-F 28238 1058103 37.471 48.762
...