Subread subjunc and featureCounts for small RNA sequencing data

snt...@gmail.com

unread,

Mar 3, 2014, 1:49:23 AM3/3/14

to sub...@googlegroups.com

I am attempting to use Subread subjunc and featureCounts with small RNA sequencing data and will appreciate comments on appropriate usage of the programs.

--

My data is single-end 50 b-long reads obtained on the Illumina HiSeq 2500 platform with small RNA sequencing libraries (Illumina TruSeq Small RNA Sample Prep kit). It is trimmed to remove adapter and poor-quality sequences (the size of reads thus ranges from 15-50 b).

My main goal is to quantify microRNA (miR) expression. Mature miRs are ~18-22 b in length. They are generated from ~70-90 b-long precursor miRs (pre-miRs) which in turn are produced from primary miR transcripts (pri-miRs) which are like coding RNA transcripts in terms of length. Mature miRs can differ from each other at just one nucleotide position. Different mature miRs (i.e., arising from different pri-miRs) can have the exact same sequence and a pri-miR can produce multiple different mature miRs. Finally, a pri-miR can be the product of multiple copies of a gene.

The reads in my data represent not only mature miRs but also other small RNAs like tRNAs and snoRNAs as well as degradation fragments of coding RNAs (mRNAs), ribosomal RNAs, pre-miRs, pri-miRs, etc., because the biomaterial that were used were formalin-fixed tissues (formalin causes RNA degradation).

So, besides quantifying mature miR expression, I also want to check how many of the reads map to tRNAs, snoRNAs, rRNAs, mRNAs, etc.

--

Considering the above aspects, I am using Subread subjunc to map the reads against the hg19 human genome with:

subjunc -T 16 -i hg19 -r sample.fastq -d 15 -D 50 -B 16 -o sample.bam --BAMoutput

I find that ~70% of the 7-17 million reads of my samples get mapped. I then use featureCounts to obtain counts for miRs using a GFF file from the miR sequence repository (miRBase):

featureCounts -T 16 -a hsa_mir.gff3 -t miRNA -f -O -M -d 15 -D 50 -o sample_mir_count.txt sample.bam

--

I will appreciate if someone can comment on the use of subjunc and featureCounts for small RNA mapping and counting in the manner noted above fashion.

Thanks.

Wei Shi

unread,

Mar 5, 2014, 5:13:35 PM3/5/14

to sub...@googlegroups.com

Hi,

You should not use '-d' and '-D' parameters in your subjunc command, since these parameters are for paired end reads. Also, subjunc does not support '-B' option, which is an option supported by subread-align for reporting multiple best mapping locations. But anyway, removing these three parameters from your command wouldn't change your mapping results.

You may also try to run subread-align on your data. Subread-align is more sensitive than subjunc but it still has good specificity. You should get more mapped reads from using subread-align.

Similarly, please do not include '-d' and '-D' options in your featureCounts command.

Hope this helps.

Wei

snt...@gmail.com

unread,

Mar 6, 2014, 3:45:12 AM3/6/14

to sub...@googlegroups.com

Thank you for your suggestions.

I removed the d and D arguments from alignment and feature-counting functions, and used subread-align instead of subjunc for alignment. As you indicated, more of the reads mapped to the hg19 reference human genome. As per the display that Subread shows after alignment, as against ~70% of reads mapping with subjunc, with subread-align, the figure is ~190%.

Why is it more than 100%?

ps. I retained the B argument (-B 16) because many microRNAs and other small RNAs with identical sequences are generated from more than one genomic locus.

Wei Shi

unread,

Mar 6, 2014, 4:58:52 PM3/6/14

to sub...@googlegroups.com

This is because you used -B option, which instructed subread-align to report multiple locations for the multi-mapping reads. Therefore, the total number of reported mapping locations are more than the number of reads in your dataset.

Reply all

Reply to author

Forward