Paired end data


gupta5...@gmail.com

Apr 1, 2013, 3:29:34 PM
to rna-...@googlegroups.com
Hello everyone,
I have started using STAR and have a few questions.

1. If the raw FASTQ files have /1 and /2 suffixes on the read names, the Aligned.out.sam file does not have them. I saw in an earlier thread here that STAR does not produce the /1 and /2 suffixes in the alignment file when it runs on paired-end data. However, I downloaded the ENCODE CSHL aligned BAM file for the HepG2 cell line (the same applies to other cell lines such as GM12878, NHEK or HeLa-S3; this is the long poly-A RNA-seq data). The program used was STAR, yet in the aligned BAM files I can see reads with /1 and /2 suffixes as well as reads without them. Can anyone explain why the BAM file has /1 and /2 suffixes when STAR does not report them?

2. Is there a way to get the reads that are unmapped?

3. I also need to understand what the parameter outSJfilterDistToOtherSJmin actually means.

4. Soft clipping: I don't understand this feature. Suppose I have a 76 bp read and the alignment has a CIGAR string like 21S55M. I don't understand why 21 bases are clipped. Aligners like TopHat would not accept such an alignment, so the read would be unaligned, but STAR reports it as aligned. Is there a way to require that at most 4 bases be soft-clipped from the ends? On what basis is soft clipping usually done?

5. Does STAR report secondary alignments?

6. With paired-end data no singletons are produced. Is there a way or a parameter to get them?

7. Is there a way to get a sorted BAM file directly from STAR, as in the TopHat output?

Hope to hear from you

Thanks

Regards
VARUN

Alexander Dobin

Apr 1, 2013, 6:54:59 PM
to rna-...@googlegroups.com
Hello Varun,

please find my answers below.

Cheers
Alex


On Monday, April 1, 2013 3:29:34 PM UTC-4, gupta5...@gmail.com wrote:
Hello everyone,
I have started using STAR and have a few questions.

1. If the raw FASTQ files have /1 and /2 suffixes on the read names, the Aligned.out.sam file does not have them. I saw in an earlier thread here that STAR does not produce the /1 and /2 suffixes in the alignment file when it runs on paired-end data. However, I downloaded the ENCODE CSHL aligned BAM file for the HepG2 cell line (the same applies to other cell lines such as GM12878, NHEK or HeLa-S3; this is the long poly-A RNA-seq data). The program used was STAR, yet in the aligned BAM files I can see reads with /1 and /2 suffixes as well as reads without them. Can anyone explain why the BAM file has /1 and /2 suffixes when STAR does not report them?


The BAM files in the ENCODE UCSC repository were mapped with a 2-3 year old version of STAR, which had some unconventional BAM formatting issues. I highly recommend re-mapping all the data you need with the latest version of STAR, including annotations. We are planning to re-map and replace all the files when the new genome assembly version comes out this summer.

 
2. Is there a way to get the reads that are unmapped?
--outReadsUnmapped Fastx will output unmapped reads into separate fasta/fastq files: Unmapped.out.mate1 and Unmapped.out.mate2
--outSAMunmapped Within will output unmapped reads into Aligned.out.sam
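
With --outSAMunmapped Within, the unmapped reads can be picked out of Aligned.out.sam afterwards by checking the 0x4 bit of the SAM FLAG field. Below is a minimal Python sketch of that check, assuming an uncompressed, tab-separated SAM file. The function name and the file path in the usage comment are only illustrative.

```python
def unmapped_read_names(sam_lines):
    """Yield the names of SAM records whose FLAG has the 0x4 (unmapped) bit set."""
    for line in sam_lines:
        if line.startswith("@"):          # header line, not an alignment record
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete SAM record, skip it
            continue
        flag = int(fields[1])
        if flag & 0x4:                    # 0x4 = segment unmapped (SAM specification)
            yield fields[0]

# Possible usage (the path is an assumption about where STAR wrote its output):
# with open("Aligned.out.sam") as sam:
#     for name in unmapped_read_names(sam):
#         print(name)
```

For paired-end data each mate has its own record, so a read name can show up twice if both mates are unmapped.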
 

3. I also need to understand what the parameter outSJfilterDistToOtherSJmin actually means.

--outSJfilterDistToOtherSJmin values affect filtering of the junctions output to SJ.out.tab. The 4 values correspond to 4 intron motifs: (1) non-canonical motifs, (2) GT/AG motif, (3) GC/AG motif, (4) AT/AC motif
Each value is the minimum distance from a junction's donor/acceptor to any other junction's donor/acceptor. This prevents output of junctions that are shifted by just a few bases from other, typically highly expressed, junctions. These shifts are likely to be sequencing errors, but could be true biological events as well, such as spliceosome errors.
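
To make the idea concrete, here is a rough Python sketch of this kind of distance filter. It is not STAR's actual code. It assumes all junctions are on one chromosome, compares start to start and end to end, and counts every other junction, including one that shares the same coordinate. The motif class index (0 = non-canonical, 1 = GT/AG, 2 = GC/AG, 3 = AT/AC) and the values in MIN_DIST are assumptions for illustration; check the manual of your STAR version for the real defaults.

```python
# Assumed minimum distances for the four motif classes, in the order
# non-canonical, GT/AG, GC/AG, AT/AC. Illustrative values only.
MIN_DIST = (10, 0, 5, 10)

def filter_junctions(junctions, min_dist=MIN_DIST):
    """Keep junctions whose start and end are far enough from other junctions.

    junctions: list of (start, end, motif_class) tuples on one chromosome.
    """
    kept = []
    for i, (start, end, motif) in enumerate(junctions):
        others = [j for k, j in enumerate(junctions) if k != i]
        # Distance to the nearest other junction start and end (inf if none).
        d_start = min((abs(start - o[0]) for o in others), default=float("inf"))
        d_end = min((abs(end - o[1]) for o in others), default=float("inf"))
        if min(d_start, d_end) >= min_dist[motif]:
            kept.append((start, end, motif))
    return kept
```

With these assumed values, a non-canonical junction starting 3 bases away from a GT/AG junction would be dropped, while the GT/AG junction itself (minimum distance 0) is always kept.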

 
4. Soft clipping: I don't understand this feature. Suppose I have a 76 bp read and the alignment has a CIGAR string like 21S55M. I don't understand why 21 bases are clipped. Aligners like TopHat would not accept such an alignment, so the read would be unaligned, but STAR reports it as aligned. Is there a way to require that at most 4 bases be soft-clipped from the ends? On what basis is soft clipping usually done?

STAR follows the “local alignment” logic, i.e. it tries to maximize the alignment score rather than aligning reads end-to-end (as TopHat does, at least in its older versions).
The alignment score is calculated as the sum of +1 for each matched base, -1 for each mismatched base, junction/indel penalties, and a genomic length penalty.
Soft clipping happens when end-to-end alignments have lower scores than the clipped alignment. The clipping could be caused by poor-quality, adapter or poly-A tails, or by a short (or not-so-short) splice junction overhang that STAR could not connect.
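
As a worked example of the scoring argument, the Python sketch below compares a clipped alignment with the corresponding end-to-end one. It is a simplification: only the +1/-1 terms are modelled, junction, indel and genomic length penalties are lumped into one optional number, and the function names and the match/mismatch counts are made up for illustration.

```python
def simple_score(matches, mismatches, penalties=0):
    """Simplified alignment score: +1 per match, -1 per mismatch, minus other penalties."""
    return matches - mismatches - penalties

def prefers_clipping(aligned_matches, aligned_mismatches, tail_matches, tail_mismatches):
    """True if leaving the tail soft-clipped scores higher than aligning it."""
    clipped = simple_score(aligned_matches, aligned_mismatches)
    end_to_end = simple_score(aligned_matches + tail_matches,
                              aligned_mismatches + tail_mismatches)
    return clipped > end_to_end

# 76 bp read reported as 21S55M. If the 21-base tail would align with
# 9 matches and 12 mismatches, end-to-end scores 64 - 12 = 52 versus 55 clipped.
print(prefers_clipping(55, 0, 9, 12))
```

If the same tail aligned with only 2 mismatches, the end-to-end score (74 - 2 = 72) would win and there would be no clipping.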

Soft clipping always happens at the ends of the reads. I think it would be relatively easy to filter alignments by their CIGAR strings if you want to get rid of the soft-clipped alignments.
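
A CIGAR-based filter of that kind could look roughly like the Python sketch below. It assumes tab-separated SAM records and applies the limit to each end separately. The 4-base limit comes from the question, and the function names are only illustrative.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM specification).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clip_lengths(cigar):
    """Return (left, right) soft-clipped lengths of a CIGAR string."""
    # Hard clips (H) may sit outside the soft clips, so drop them first.
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar) if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right

def keep_alignment(sam_line, max_clip=4):
    """True if neither end of the alignment has more than max_clip soft-clipped bases."""
    cigar = sam_line.rstrip("\n").split("\t")[5]   # column 6 of a SAM record
    left, right = soft_clip_lengths(cigar)
    return left <= max_clip and right <= max_clip
```

Header lines (starting with "@") should be passed through unchanged before calling keep_alignment. For paired-end data you would probably want to drop both mates when either one fails, so that the output stays properly paired.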

 
5. Does STAR report secondary alignments?

STAR's reporting of multi-mappers is controlled by --outFilterMultimapNmax <Nmult>, which is =10 by default.
All reads mapping to >=2 and <= Nmult loci are considered "mapped to multiple loci", and all their alignments are output to Aligned.out.sam.
The SAM attribute NH:i:<Nmap> contains the number of loci, and HI:i:<multInd> is the index of the alignment. For all alignments except one, the 0x100 bit is set.
Reads that "mapped to too many loci" (> Nmult) are considered "unmapped", and none of their alignments are output to Aligned.out.sam.
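
To inspect these fields on your own output, something like the following Python sketch could be used. It assumes tab-separated SAM records with the optional tags in column 12 onwards. The function name and the keys of the returned dictionary are my own choice.

```python
def describe_alignment(sam_line):
    """Summarize the multi-mapping information of one SAM alignment record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    tags = {}
    for item in fields[11:]:                  # optional fields look like TAG:TYPE:VALUE
        name, typ, value = item.split(":", 2)
        tags[name] = int(value) if typ == "i" else value
    n_loci = tags.get("NH")
    return {
        "name": fields[0],
        "n_loci": n_loci,                     # NH: number of loci the read maps to
        "hit_index": tags.get("HI"),          # HI: index of this alignment
        "is_multimapper": n_loci is not None and n_loci > 1,
        "is_secondary": bool(flag & 0x100),   # 0x100 = secondary alignment bit
    }
```

Keeping only the records where is_secondary is False leaves one alignment per read (or per mate for paired-end data).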


6. With paired-end data no singletons are produced. Is there a way or a parameter to get them?
With the default parameters STAR will only output correctly paired alignments. STAR considers two mates to be parts of the same sequence (possibly separated by an un-sequenced portion of the insert).
The minimum alignment score and number of matches are controlled by --outFilterScoreMinOverLread and --outFilterMatchNminOverLread.
By default these parameters are equal to 0.66, i.e. if either the number of matched bases OR the alignment score (which is the number of mapped bases minus penalties) is < 66% of the read length (which is the sum of the lengths of both mates), the alignment will not be output and will be reported as "too short".
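
The arithmetic behind this can be sketched in a few lines of Python. This only follows the description above: the exact comparison and rounding inside STAR may differ, and the function and argument names are illustrative.

```python
def passes_length_filter(score, matched_bases, mate_lengths,
                         min_score_frac=0.66, min_match_frac=0.66):
    """Check an alignment against the two 'OverLread' style thresholds.

    mate_lengths: lengths of the mates, e.g. (76, 76); the read length used
    for the thresholds is their sum.
    """
    read_length = sum(mate_lengths)
    return (score >= min_score_frac * read_length and
            matched_bases >= min_match_frac * read_length)

# 2 x 76 bp: the read length is 152, so the default threshold is 0.66 * 152 = 100.32.
# A perfectly aligned single mate has at most 76 matched bases and fails.
print(passes_length_filter(76, 76, (76, 76)))
# With both thresholds lowered to 0.49 the same alignment passes (0.49 * 152 = 74.48).
print(passes_length_filter(76, 76, (76, 76), 0.49, 0.49))
```

This is why, for mates of equal length, a lone mate can never reach the default 66% and the thresholds have to go below 0.5 before unpaired alignments appear.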

I generally do not recommend using un-paired alignments - in my experience, they contain a larger % of false positives. 
If you want to see them, there are several options:
1. Reduce both --outFilterScoreMinOverLread and --outFilterMatchNminOverLread to <0.5. Note that STAR will treat the unpaired alignments of the mates as multi-mappers.
2. Switch on "chimeric" output, which will output the unpaired alignments of two mates into Chimeric.out.sam file.
3. First map allowing for paired alignments only, output the unmapped reads, and map those as single-end reads separately for each mate.
 

7. Is there a way to get a sorted BAM file directly from STAR, as in the TopHat output?

I know this is currently the major bottleneck - I am working on it. Hope to have it implemented in 1-2 weeks.

