Read counts per contig .. ? Any chance for a new feature in STAR ..?

94 views
Skip to first unread message

Kirill Tsyganov

unread,
Sep 1, 2015, 9:58:05 AM9/1/15
to rna-star
Hi Alex, 

Thanks for the STAR its truly a STAR and you are STAR for making the STAR I guess..

I'm working with data set where there isn't a reference genome, however we have quite a good transcriptome data (in fasta file) so we are using it as a "reference genome" to map against our other RNAseq data from the same organism and it works fine.. I actually tried both STAR and BWA MEM both gave me pretty much exact the same result.. I obviously don't need to worry about splicing. We don't have any annotation, not that I know of and even if we do it isn't in any particular format such as GTF/GFF. We are interested in read counts per transcript, which is really per individual contig name as it will be reflected in the BAM file..

I will write some python script to count reads per "contig" so that we can estimate some differential expression and stuff. However I thought it would be great if feature like that was introduced into STAR. Currently there isn't any tools that I know of count reads per contig. 

The way I am going to approach this is to make big hash with keys set to contig/transcript names and values are counters for every time
there is a read that associate with particular contig name and then just write keys and values into table. There might be a better ways..

p.s exploring STAR's splicing features and they are seems pretty cool

Regards, 

Kirill



 

Alexander Dobin

unread,
Sep 3, 2015, 3:05:42 PM9/3/15
to rna-star
Hi Kirill,

this would be a nice summary statistics feature to have.
If I understood correctly, you are treating contigs/transcripts as separate chromosomes. Then you can use 
$ samtools idxstats Aligned.sortedByCoord.out.bam
It requires sorted and indexed BAM file, it works very fast (directly on BAM index), and will give you the count of alignments on each "chromosome".
Note, that it will count all the SAM lines, i.e. double count for PE reads, and multi-count for multimappers, so you may want to filter the alignments before counting.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages