different positions

234 views
Skip to first unread message

Miranda Gaupp

unread,
Oct 18, 2021, 4:21:53 PM10/18/21
to Stacks
Hello,

I'm using a RADcap/rapture technique to selectively capture genomic regions of interest using a large number of individuals. I created a reference of about 6,000 snps using initial 3RAD data and I'm aligning my RADcap data to this reference. I'm using bwa to align my data (after running process_radtags and clone_filter) followed by the stacks ref_pipe. The problem I'm running into is this: the positions (POS) in the resulting VCF file are different than they are in my original 3RAD data. Is there a way for me to fix this? Is this causing the low coverage and high missingness I am seeing in my data?

Thanks in advance,
Miranda Gaupp

Catchen, Julian

unread,
Oct 19, 2021, 3:22:53 PM10/19/21
to stacks...@googlegroups.com

Hi Miranda,

 

We will need more information to provide any advice. Can you give an example of the positions being different between the stacks output and the original 3RAD data? What are you referring to when you say the “original 3RAD data”?

 

julian

Miranda Gaupp

unread,
Oct 19, 2021, 4:07:39 PM10/19/21
to Stacks
Hi Julian, thanks for your response. To create my reference fasta file, I performed 3RAD on 65 individuals,  applied the STACKS denovo pipeline (after demultiplexing and decloning), and stringently filtered the resulting vcf (to the best of my abilities) to eliminate non-neutral SNPs. From this filtered dataset, I chose 6,000 loci to create capture baits for radcap/rapture procedure thus these 6,000 loci have now become my reference. By "original 3RAD data" I'm referring to these 6,000 loci containing neutral SNPs --more specifically I'm referring to the vcf file containing this information. 
The vcf information for the first three loci in the original 3RAD data are:
CHROM         POS
85                    21
167                 220
192                 249

However, in my RADcap data (1045 individuals, aligned using bwa and SNPs called using stacks ref pipe), the first three loci in the vcf read:
CHROM         POS
 85                   57
167                 141
192                 141

The CHROM numbers are the names of each reference sequence in the fasta reference file. 
Will the difference in POS (snp position correct?) influence analyses? Is there a method of aligning and/or SNP calling to remedy this?

I hope this clears up my problem a little more

Thanks,
Miranda

Catchen, Julian

unread,
Oct 20, 2021, 5:01:28 PM10/20/21
to stacks...@googlegroups.com

Hi Miranda,

 

I see, that makes sense.

 

In general, I think the chrom/pos should match between the two analyses, depending on how you made your ‘reference genome’ with BWA (which FASTA file did you feed into it?). Also, the VCF columns changed in Stacks v2.52, so I am assuming you are running on the latest version. You might also consider looking at the populations.sumstats.tsv files from the two runs to see how/if they match.

 

To say more than that, I would still need more information. What commands did you run to create your BWA reference and do your alignments, and, can you send the first 20 or 30 lines from your denovo/reference VCF files (as well as the populations command you used to create them)? If it is easier, you could just drop the files in a Dropbox folder.

 

I am curious why your de novo loci have base pair positions in the 200s – I assume you used paired-end reads? Did you again use the paired-end reads in your reference run?

 

If everything looks correct, then the path would be to look at how well the alignments went against your ‘reference genome’, what was the coverage (reported by gstacks), etc.

Miranda Gaupp

unread,
Oct 21, 2021, 10:31:46 AM10/21/21
to Stacks
I have been using Stacks v2.3e but I have ran populations separately using 2.55 so this might be causing some of my current troubles with vcf files. I'm attaching the commands I used in 1) populations to make the final vcf from my 3RAD denovo data, 2) bwa commands to index the reference fasta file and align new data, 3) samtools commands to convert and sort files, and 4) commands for the stacks ref_pipe. I am very new to bwa and samtools. If there are any suggestions/tips your can share that might help my situation, I would appreciate it! I'm also attaching the fasta file I used as my reference in bwa as well as the first some odd lines from the 3RAD denovo vcf used to create the fasta file (excel file).

I am not sure why some of the base pairs are in the 200s, especially since I used the --write-single-snp command in populations to write the first snp in each locus...I did use paired-end reads in 3RAD and in my reference run.

Is there a specific command to run (in bwa, samtools, stacks, etc.) in order to output a score that reflects how well the RADcap data aligned to my reference? 

command_line.txt
denovo-reference-vcf.xlsx
target_loci2.fasta

Catchen, Julian

unread,
Oct 21, 2021, 4:38:55 PM10/21/21
to stacks...@googlegroups.com

The loci you are using are, for the most part, 291bp in length (which explains how you can have SNP coordinates >200bp). The all have 10 Ns in the middle, which means that de novo stacks could not merge the two halves of your loci together (which is fine), so you used a size selection window in your molecular protocol greater than the single-end+paired-end read length. Some of your loci are significantly longer, which in principle should not be possible, but can happen when you have two paired-end cutsites nearby each other in the genome and some times you got one cutsite and other times you got another. Or, they can appear because of significant indels and repeats in that particular sequence which caused a lot of gaps to appear during reassembly of your raw, de novo reads. This is all fine, but you should be aware and be more skeptical of these longer loci.

 

For your refmap data, you should look at the gstacks output file, gstacks.log.distribs, which will tell you how many reads aligned for each sample and some characteristics of the alignments. It will also tell you coverage. You should have good numbers here for both.

 

On your refmap data, I would recommend re-running populations (latest version) without the --max-obs-het and --write-single-snp filters. Examine the sumstats output to see if your expected SNPs are present (even if you find additional SNPs that were not in your original de novo data).

--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stacks-users/398bca98-59ec-4279-8d70-62408648eac3n%40googlegroups.com.

Catchen, Julian

unread,
Oct 21, 2021, 4:42:30 PM10/21/21
to stacks...@googlegroups.com

Just as a tip, and to save a lot of disk space, you can combine all your BWA and samtools loops into a single command:

 

for file in $samples

do

    bwa mem -t 2 $bwa_db $src/${file}.1.fa.gz $src/${file}.2.fa.gz | \

        samtools view -h -b | \

        samtools sort --threads 2 > $src/aligned/${file}.bam

done

.

Miranda Gaupp

unread,
Oct 26, 2021, 1:19:10 PM10/26/21
to Stacks
Thanks for the tips!

I found that by running populations on my refmap data using -R (I tried values .5-.9), the positions mostly matched those of my original 3RAD data. Applying this did cause me to loose about 1,000 loci or more depending on the value of -R. There is still a fair number of individuals with high missingness (~200 individuals out of 1045 with missingness >30%) but this number has decreased by ~130 individuals when looking at the refmap data without the -R filter. I don't entirely understand why applying an R filter helps this issue though.

I am finding some odd estimates of observed heterozygosity for the individuals I replicated between libraries. I also don't quite understand the number of private alleles in each population that populations is finding.

I have four libraries in this dataset.  In the second library, I sequenced 10 of the same individuals from the first library and in the fourth library I sequenced 10 of the same individuals from the third library. I'm comparing estimates of observed heterozygosity for these replicated individuals and I am seeing some estimates that don't quite make sense. For example, these are the observed heterozygosity estimates for one individual in library 3 (Dit-040419-5) that was replicated in library 4 ( Dit-040419-5-b)
DIT-040319-5                       DIT-040319-5-b
0.20                                       0.52
With the --max-obs-het filter in populations set to .05, I don't understand how the Ho estimate for the replication is so high.
I'm also not sure what could be causing such a large difference in Ho for the same individual in different libraries. Any suggestions on what could be causing this or how I can discover the underlying cause? I have removed loci with >/= 5% missingness, the coverage and number of reads across libraries is roughly the same, and there is about the same number of total sites (all/variant/polymorphic) across libraries.

Also, I ran populations using r70 p4 using the library (1,2,3 or 4) and the population thinking this might help with the variation I am seeing in the Ho estimates for replicate individuals. In the populations output for this,  there is about the same number of total sites (all/variant/polymorphic) across libraries; however, library 1 is reported to have 45 private alleles while the other three libraries have 0 private alleles reported. I'm not sure how this would be true since all four libraries are entirely composed of samples from the same populations. If you have any insight, please let me know. Your help is so greatly appreciated.

forward

unread,
Oct 26, 2021, 6:31:55 PM10/26/21
to Stacks
We Are Your Number One Supplier Of High-Quality All 420 strins and all Body Pain killer pills
Text or call +19516389432 whatsapp
@Wickr Me:::bowe420
Hi Group,We Got research chemicals,MMJ,ALL 420 STRAINS,ALL BODY PAIN KILLER PILLS. CONTACT US bellow
Text or call +19516389432 whatsapp
@Wickr Me:::bowe420
//listings.////
#XANAX,#DIAZEPAM,vitamin k, exotic marijuana, heroine,
oxycodone, blues ,bars ,peruvian, mdma ,
crystal meth, lod, taps, shrooms,mophine,
roxicodone,codin, marijuana card, aderral,
oxycotine,cocaine ,ephedrine, hydrocodone....
and more ETC
ready for delivery. ,DM,your orders,CONTACT US bellow
Text or call +19516389432 whatsapp
@Wickr Me:::bowe420
* NO PRIOR PRESCRIPTION NEEDED!
* 100% Anonymity & Discreet shipping
* FDA approved
* Friendly customer support
Reply all
Reply to author
Forward
0 new messages