Paired-end GBS data with ref genome - how to keep unpaired sequence data after process_radtags?

596 views
Skip to first unread message

Harriet Hunt

unread,
Jun 25, 2015, 11:24:04 AM6/25/15
to stacks...@googlegroups.com
Hi all,

I have run process_radtags on my data set - paired-end Illumina HiSeq data, standard GBS protocol with single-digest & size selection, so restriction enzyme cut site present at both ends. Specified enz1 and enz2 as the same enzyme in process_radtags.

Now trying to align reads to the reference genome, and as far as I can tell neither gsnap nor bowtie2 allow you to include both files 1.fq and 2.fq *AND* the .rem.1.fq and .rem.2.fq files. So I am wondering how to keep the unpaired sequences so they can be used later in pstacks. Can I treat the data like ddRAD, and concatenate .1.fq, .2.fq, .rem.1.fq and .rem.2.fq to a single file before aligning? All the discussions I've seen on this topic seem to assume you'll be running denovo_map, so I don't know if it's OK to concatenate then align.

thanks...


Carlos Arbizu

unread,
Aug 23, 2015, 9:23:19 PM8/23/15
to Stacks

Hi Harriet,

I have the same data (GBS, paired-end, single-digest). I read many posts and the manual again and decided to concatenate 2.fq and .rem.2.fq. 

Now I am planning to use GSNAP. 

Did you solve your issue?

Best,

Carlos

Harriet Hunt

unread,
Aug 24, 2015, 5:37:42 AM8/24/15
to stacks...@googlegroups.com
Hi Carlos,

I think it is best to keep the .2.fq and .rem.2.fq files separate, because bowtie2 and gsnap both have more sophisticated options for handling paired reads when doing alignments and these will not work if you include unpaired reads (.rem files) in the samples to align. I first aligned the paired reads to the genome, then did the unpaired reads but appended the output to the same sam file (set option not to include sam headers, so you don't get these repeated in the middle of your sam file). So then you get a single file for each sample you can use as input for refmap.pl in Stacks.

By the way, I don't know if you saw my other posts to the list, but there is a problem with the latest version of gsnap - the terminal-alignment suppression option (by setting --terminal-threshold to a high value) does not work correctly. I don't know if this will be a problem with your data or not, but I got long strings of softmasked nucleotides at the ends of a small proportion of reads and this caused problems in Stacks. I got much better results with one of the old (2014) versions of gsnap.

Hope that helps!

best wishes, Harriet


--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
Visit this group at http://groups.google.com/group/stacks-users.
For more options, visit https://groups.google.com/d/optout.

Goutham atla

unread,
Aug 24, 2015, 5:53:55 AM8/24/15
to stacks...@googlegroups.com
Bowtie2 handles combination of Paired-End and unpaired reads at the same time. 

From Manual:

-U <r>    Comma-separated list of files containing unpaired reads to be aligned, e.g. lane1.fq,lane2.fq,lane3.fq,lane4.fq. Reads may be a mix of different lengths. If - is specified, bowtie2 gets the reads from the "standard in" or "stdin" file handle.

So the command would be:

bowtie2 -x genome_index -1 sample1_R1.fq -2 sample1_R2.fq -U sample1_rem1.fq,sample1_rem2.fq -S sample1.sam
- - -
Goutham Atla
PhD Student. Marie Curie ESR.
Zencode-ITN.
IDIBAPS, Spain and Imperial College, London.

Meghana Natesh

unread,
Aug 24, 2015, 6:09:39 AM8/24/15
to stacks...@googlegroups.com
I know, I remember that code, But remember another person (some lady had this doubt, and the response had been that you can't combine paired reads with single end reads) had said the same thing. Sorry, I get worried.

Harriet Hunt

unread,
Aug 24, 2015, 6:12:13 AM8/24/15
to stacks...@googlegroups.com
I think we've had this discussion before - no, you lose the mate-pair information if you combine paired with unpaired reads in the same command. As I said above, best to do the alignments of paired and unpaired reads correctly and then concatenate into a single sam file (there are various ways to do this).

Goutham atla

unread,
Aug 24, 2015, 7:17:11 AM8/24/15
to stacks...@googlegroups.com
We do not lose the "Paired-End" information. Bowtie2 assigns an "=" sign if both the mates are mapped. You can also verify this with flag information in sam file. The flag 3 indicates that the read is paired and read is mapped in proper pair. 

I have loaded the BAM files in to IGV and I could always see the pairing information in IGV. All the Paired-End reads will be shown as pairs.

First few columns of my bam file generated by command

samtools view  -f3 input.bam | less

2_1101_16120_2170_1     83      genome       1398915 42      80M     =       1398734 -261 . . .  
2_1101_16120_2170_2     163     genome       1398734 42      80M     =       1398915 261   . . .
2_1101_15135_2869_1     99      genome        5319397 42      80M     =       5319661 344   . . .
2_1101_15135_2869_2     147     genome        5319661 42      80M     =       5319397 -344 . . .

This clearly indicates that the reads with _1 and _2 are next to each other. They are mapped with an insert size of 261bp and 344bp. They are paired ( indicated by = ).  You can even check what does each flag means ( 83,163,99, 147)

83 = Read paired, read mapped in proper pair, First in pair and read reverse strand.
163 = Read paired, read mapped in proper pair, Second in pair and mate reverse strand.
99 = Read paired, read mapped in proper pair, First in pair and Mate reverse strand.
147 = Read paired, read mapped in proper pair, Second  in pair and read reverse strand.

This clearly explains that the reads still have the paired information.


- - -
Goutham Atla
Prof. Jorge Ferrer Lab. Zencode-ITN.

Goutham atla

unread,
Aug 24, 2015, 7:18:46 AM8/24/15
to stacks...@googlegroups.com
We modified /1 and /2 with _1 an d_2 to keep track of R1 and R2.

Harriet Hunt

unread,
Aug 24, 2015, 9:04:25 AM8/24/15
to stacks...@googlegroups.com
So did you just use the -U flag in bowtie2 and list both the paired and unpaired read files under that? I did not know that this would work. Can you still set -I and -X with this option?: (i.e. constrain the fragment size)?

Goutham atla

unread,
Aug 24, 2015, 9:19:53 AM8/24/15
to stacks...@googlegroups.com
Yes. -1 and -2 tells that the reads are paired and the insert size constraints should be considered while mapping them and also helps in assigning the mapping quality. All the reads given with -U will be treated as orphan reads ( reads with no corresponding mate) hence they are aligned as if they are single end reads. All other options will work how they are supposed to work.

The idea of aligning Paired-end and unpaired reads also should yield the same results, but its just few extra steps and consumes time.  But I just wanted to convey that it is possible with bowtie2.

In BWA, I don't think so there is a way to map both paired-end and unpaired reads at one goal, so everybody maps them independently and later mares the files. 

Zoo Keeper

unread,
Aug 27, 2015, 9:16:22 AM8/27/15
to Stacks
Hi Harriet,

What version of GSnap from 2014 did you find to be OK?  I used the 2014-12-16 version.

What problems did you encounter with STACKS?  I did not notice anything, but I may not have looked in the right places to find 'problems'.

Many thanks.

Crispin

Margarita Mauro

unread,
Oct 20, 2016, 3:56:07 AM10/20/16
to Stacks
Hello,
   I realize it has been over a year since these posts.  I wonder if I can get some help as to how to do a loop in Bowtie2 using both unpaired and paired files to make the alignments but for multiple samples.  So geek_y suggested this for a single sample:


bowtie2 -x genome_index -1 sample1_R1.fq -2 sample1_R2.fq -U sample1_rem1.fq,sample1_rem2.fq -S sample1.sam


 I have around 34 samples per folder, and somewhere I read that you don't really want to list every file.  So I can imagine it would be a loop like 
for f in *.rem.1.fq; do base=`basename $f .rem.1.fq`; bowtie2 

 After bowtie2 I am not sure as some of the files are *.rem.1.fq, others are *.rem.2.fq, and I also have the *.1.fq and *.2.fq (per sample for paired end reads).

Hoping to hear from you,
Margarita

geek_y

unread,
Oct 20, 2016, 11:36:40 AM10/20/16
to Stacks
If you have sample1, sample2, sample3 etc .. You could do something like:

for i in {1..34}
do
bowtie2 -x genome_index -1 sample${i}_R1.fq -2 sample${i}_R2.fq -U sample${i}_rem1.fq,sample${i}_rem2.fq -S sample${i}.sam
done

Hope it helps.
Thanks.
Reply all
Reply to author
Forward
0 new messages