problems with unique alignment percentage

Saumya Kumar

May 24, 2017, 11:43:15 AM

to rna-star

I have 75 bp, single-end, good-quality mouse reads generated with the Smart-seq2 protocol. I aligned them with STAR using the default parameters, which gave the following statistics: unique alignment: 63.35%, reads mapped to multiple loci: 8.89%, and reads unmapped: too short: 25.62%.

Following previous posts, I then changed the following parameters: --seedSearchStartLmax 30 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 30

but the unique alignment increased only to 68.16%, with reads mapped to multiple loci at 12.85% and reads unmapped: too short at 16.52%.
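To compare two runs side by side, a small script along these lines could read the percentages out of the two Log.final.out files. This is only a sketch: it assumes the usual `name | value` layout of Log.final.out, and the two text blocks below are made up from the numbers quoted in this post, not copied from the real log files. The function names are hypothetical.

```python
def parse_star_log(text):
    """Return {metric name: value string} from Log.final.out-style text."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, _, value = line.partition("|")
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '63.35%' to the float 63.35."""
    return float(stats[key].rstrip("%"))


KEYS = [
    "Uniquely mapped reads %",
    "% of reads mapped to multiple loci",
    "% of reads unmapped: too short",
]


def compare(before_text, after_text):
    """Return {metric: change in percentage points} between two runs."""
    before = parse_star_log(before_text)
    after = parse_star_log(after_text)
    return {k: round(percent(after, k) - percent(before, k), 2) for k in KEYS}


# Toy inputs built from the percentages quoted in this post (assumed layout).
default_run = """
        Uniquely mapped reads % |\t63.35%
        % of reads mapped to multiple loci |\t8.89%
        % of reads unmapped: too short |\t25.62%
"""
relaxed_run = """
        Uniquely mapped reads % |\t68.16%
        % of reads mapped to multiple loci |\t12.85%
        % of reads unmapped: too short |\t16.52%
"""

for metric, delta in compare(default_run, relaxed_run).items():
    print(f"{metric}: {delta:+.2f} points")
```

For the numbers above, the "too short" category drops by about 9.1 points, of which about 4.8 go to unique alignments and about 4.0 to multi-mapped reads.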

In a previous study, where I had a similar percentage of reads unmapped as "too short", the same parameters increased unique alignment from 62% to 85%, with only about a 2-5% increase in multi-loci reads. The only difference is that those data were paired-end reads, whereas this dataset is single-end.

I have checked the quality of the reads and they are all good. In both datasets there is 3' adapter contamination in about 10-30% of the dataset, which I did not remove in either study; I supposed that STAR's soft clipping takes care of those reads. (I checked, and about 50-70% of the contaminated reads aligned uniquely in the paired-end dataset with the above parameters.)

Can you please suggest what I should consider to improve the alignment of the single-end reads?

Thank you,

Saumya

Alexander Dobin

May 24, 2017, 4:10:40 PM

to rna-star

Hi Saumya,

First I would check the severity of adapter contamination. How many reads are shorter than 30 bases after the adapter is trimmed? Those reads will not be mapped because of the --outFilterMatchNmin 30 filter.
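One rough way to answer that question without running a full trimmer is sketched below. It is an assumption-laden illustration: the adapter is matched exactly (no mismatches or indels, unlike a real trimming tool), the adapter sequence must be supplied by the user, and the function names and the `reads.fastq` file name are hypothetical.

```python
def trim_3prime_adapter(seq, adapter, min_overlap=5):
    """Remove a 3' adapter by exact matching (no mismatches allowed).

    Looks first for the full adapter anywhere in the read, then for a
    partial adapter (at least min_overlap bases) at the very end of the read.
    """
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    for k in range(min(len(adapter), len(seq)) - 1, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq


def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def count_short_after_trimming(seqs, adapter, min_len=30):
    """Return (reads shorter than min_len after trimming, total reads)."""
    total = short = 0
    for seq in seqs:
        total += 1
        if len(trim_3prime_adapter(seq, adapter)) < min_len:
            short += 1
    return short, total


# Example use (file name and adapter are placeholders, not from this thread):
# with open("reads.fastq") as fh:
#     short, total = count_short_after_trimming(
#         read_fastq_sequences(fh), adapter="YOUR_ADAPTER_SEQUENCE")
#     print(f"{short} of {total} reads are shorter than 30 bases after trimming")
```

If the count is a small fraction of the reads, the --outFilterMatchNmin 30 setting alone cannot explain a large "too short" category.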

I would generally recommend trimming the adapters for datasets with high adapter contamination.

If adapter trimming does not help, please send me the full Log.final.out output. The next step would probably be to BLAST a few of the unmapped reads to check for other types of contamination.

Cheers

Alex

Saumya Kumar

May 25, 2017, 8:59:01 AM

to rna-star

Hi Alex,

Thanks for your reply.

I am attaching the FastQC adapter content report for one of our samples. It looks like less than 10% of the sample has this contamination, with similar levels in a few other samples, all showing 61-68% unique alignment, so I had not removed the adapter from these. I am also attaching the log files of the same sample with and without the changed parameter values.

Would you still recommend adapter trimming as essential?

Best,

Saumya

When I look at the size of the reads in FastQC, they are all 75 bp, but the log file reports 21% as too short. Are they really short?

adapter_content.png

C1A-defaultLog.final.out

C1A-DefinedParams-Log.final.out

Alexander Dobin

May 31, 2017, 5:27:16 PM

to rna-star

Hi Saumya,

it looks like adapter contamination is not the problem here. I think that trimming never hurts, but I do not think it will help a lot either.

The next thing I would recommend is to map the read1 / read2 separately - to see if the problem is in the pairing. Please send me the Log.out files for these runs as well.

The 'too short' category is for reads for which STAR can only find alignments that are shorter than the required filters - the reads themselves might be long.
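The point can be illustrated with a simplified model of the filters. This is a sketch under stated assumptions, not STAR's actual code: the default values (0.66 for both `OverLread` parameters, 0 for the absolute minimums) are those documented in the STAR manual, but the exact rounding STAR applies is not reproduced, and the function and argument names are hypothetical. For paired-end data the manual normalizes to the sum of the mates' lengths.

```python
def passes_star_length_filters(matched_bases, score, read_length,
                               match_nmin=0, match_nmin_over_lread=0.66,
                               score_min=0, score_min_over_lread=0.66):
    """Simplified model of STAR's matched-length and score filters.

    match_nmin            ~ --outFilterMatchNmin
    match_nmin_over_lread ~ --outFilterMatchNminOverLread
    score_min             ~ --outFilterScoreMin
    score_min_over_lread  ~ --outFilterScoreMinOverLread
    """
    enough_matches = (matched_bases >= match_nmin
                      and matched_bases >= match_nmin_over_lread * read_length)
    enough_score = (score >= score_min
                    and score >= score_min_over_lread * read_length)
    return enough_matches and enough_score


# A full-length 75-base read whose best alignment covers only 40 bases.
best = dict(matched_bases=40, score=38, read_length=75)

print("default filters:", passes_star_length_filters(**best))
print("relaxed filters:", passes_star_length_filters(
    **best, match_nmin=30, match_nmin_over_lread=0, score_min_over_lread=0))
```

Under the default settings this 75-base read would be counted as "unmapped: too short" even though the read itself is full length; with the relaxed settings used earlier in the thread, only alignments with fewer than 30 matched bases are rejected.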

Cheers

Alex

Saumya Kumar

Jun 1, 2017, 5:38:44 AM

to rna-star

Hi Alex,

The log files and plot that I attached are for single-end reads. My previous study used paired-end reads, and there I was able to achieve 85% unique alignment with the parameters I had mentioned. In that case, after the parameter change, most of the reads moved from "too short" to unique alignment and very few to multiple alignment.

I repeated the alignment after trimming the single-end reads, but nothing changed. What else can I try?

Best,

Saumya

Alexander Dobin

Jun 2, 2017, 3:38:05 PM

to rna-star

Hi Saumya,

I think the next step should be to BLAST the unmapped reads and check where they are going.

Probably it's best to take them from the "too short" category for the very unrestrictive run with --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 30.

You can use the --outSAMunmapped option which will write the unmapped reads into the SAM/BAM files.

The alignments with the uT:A:1 tag are the "too short" ones.
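A minimal sketch of pulling those reads out for BLAST is given below. It assumes STAR was run with `--outSAMunmapped Within` and that the output is available as SAM text (for BAM output, something like `samtools view` would be needed first). The function names and the file names in the usage comment are hypothetical.

```python
def too_short_reads(sam_lines):
    """Yield (read name, sequence) for unmapped records tagged uT:A:1."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        is_unmapped = bool(flag & 0x4)  # SAM flag bit 0x4: segment unmapped
        if is_unmapped and "uT:A:1" in fields[11:]:
            yield fields[0], fields[9]


def to_fasta(records, limit=None):
    """Format (name, sequence) pairs as FASTA text, optionally capped."""
    out = []
    for n, (name, seq) in enumerate(records):
        if limit is not None and n >= limit:
            break
        out.append(f">{name}\n{seq}")
    return "\n".join(out) + ("\n" if out else "")


# Example use (file names are placeholders):
# with open("Aligned.out.sam") as sam, open("too_short.fa", "w") as fa:
#     fa.write(to_fasta(too_short_reads(sam), limit=50))
```

The resulting FASTA file can be submitted to BLAST directly; the `limit` argument keeps the query small.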

Cheers

Alex

Saumya Kumar

Jun 6, 2017, 7:10:35 AM

to rna-star

Thanks Alex,

I did as suggested. I ran BLAST on the first 50 such sequences, and the results are attached. The alignment score is always less than 80-200 category hits that it comes across. I also did an MSA of these 50 sequences; an image of it is attached. There does not appear to be a consensus. Would you consider this contamination?

Best,

Saumya

KATNFKG2014-Alignment.txt

short-BLASTingFile.png

Alexander Dobin

Jun 7, 2017, 1:31:11 PM

to rna-star

Hi Saumya,

Yes, I think this is a clear case of contamination, with Salmonella sequences popping up for most of the unmapped reads.

In principle, you could add the assembled Salmonella genomes to the Mouse genome to effectively filter them out at the mapping stage.

After that, you can again BLAST the unmapped reads to see if there is another source of contamination.
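Adding the Salmonella genomes to the mouse genome, as suggested above, amounts to regenerating the STAR index from several FASTA files at once. The sketch below only assembles the command line. All file and directory names are placeholders, the helper name is hypothetical, and it assumes the sequence names do not clash between the FASTA files. The `--sjdbOverhang` value follows the usual read length minus 1 convention (74 for 75-base reads).

```python
import subprocess


def genome_generate_command(genome_dir, fasta_files, gtf_file=None,
                            read_length=75, threads=4):
    """Build a STAR genomeGenerate command for one or more FASTA files."""
    cmd = ["STAR", "--runMode", "genomeGenerate",
           "--runThreadN", str(threads),
           "--genomeDir", genome_dir,
           "--genomeFastaFiles", *fasta_files]
    if gtf_file is not None:
        # Annotated junctions; overhang is read length minus 1.
        cmd += ["--sjdbGTFfile", gtf_file,
                "--sjdbOverhang", str(read_length - 1)]
    return cmd


# Placeholder paths, not taken from this thread.
cmd = genome_generate_command(
    "mouse_plus_salmonella_index",
    ["mouse_genome.fa", "salmonella_genomes.fa"],
    gtf_file="mouse_annotation.gtf",
)
print(" ".join(cmd))

# To actually build the index (requires STAR on the PATH):
# subprocess.run(cmd, check=True)
```

Reads that then align to the Salmonella sequences can be separated from the mouse alignments by reference name after mapping.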