Using STAR for genome reads alignment

Capricy Gao

unread,

Feb 23, 2017, 1:13:30 PM2/23/17

to rna-star

Hello,

This might sound weird. But I have been told that STAR aligner is really fast. So I wonder if I have whole genome shotgun short reads, may I use STAR to align them? If so, what should I modify in terms of command line!

Thank you very much!

C.

Alexander Dobin

unread,

Feb 23, 2017, 1:20:57 PM2/23/17

to rna-star

Hi Capricy,

you are talking about DNA reads, right?

In this case, for starters, you would want to prohibit splicing and limit the max insert size.

--alignIntronMax 1 --alignMatesGapMax <MaxInsertSize-2*ReadLength>

Later on, you may want to tweak other parameters such as the number of mismatches, min mapped length, etc.

Cheers

Alex

Capricy Gao

unread,

Feb 23, 2017, 11:13:34 PM2/23/17

to rna-star

Hi, Alex,

Yes, DNA reads is what I meant. Thank you very much for the information! I plan to do those post-alignment processing.

I actually have a question about indexing the genomes:

when I index the mouse genome, which is 2.6 G in total, and I have an "SAindex" file with a size about 1.5 G; however, when I index a bacterial genome, which is only 4.6M, the SAindex file size is also 1.5 G. Is it normal? Since the two genome fasta file sizes are so different, i expect the resultant files from the indexing be different too. Could you please correct me if my initial understanding is wrong?

Thanks a lot!

C.

Alexander Dobin

unread,

Feb 24, 2017, 4:08:31 PM2/24/17

to rna-star

Hi Capricy,

indeed, for small genomes you need to use at the genome generation step --genomeSAindexNbases = min(14, log2(GenomeLength)/2 - 1) = 10 for 4.6M genome.

This will reduce the size of the SAindex file, and prevent seg-faults at the mapping stage.

Cheers

Alex

Capricy Gao

unread,

Feb 26, 2017, 7:21:21 AM2/26/17

to rna-star

Hi, Alex,

I actually was wondering what genome size would be considered "small genome"?

Also is there a way that I would be able to estimate the index file size for a given genome? For thousands of genomes, I do need to have such rough calculation so that I would plan for the disc space.

I am wondering if it is more efficient to combine all the genomes and index them all together?

Thanks a lot for your help!

C.

Alexander Dobin

unread,

Feb 27, 2017, 1:13:03 PM2/27/17

to rna-star

Hi Capricy,

according to the formula, for genomes ~500M or smaller, this parameter will have to be reduced. However, typically you only need to do it for genomes < 100M.

If you want to map to multiple small genomes, it is definitely better to combine them all into a super-genome, i.e. map to all the genomes at the same time, provided you have enough RAM to fit the index of the super-genome (~9*GenomeLength bytes).

Cheers

Alex

Capricy Gao

unread,

Mar 8, 2017, 9:32:50 AM3/8/17

to rna-star

Hi, Alex,

I adjusted parameters according to your suggestions for draft genome mapping. Mapping has been completely. Then I happened to find a mapped read in sam file slightly different from the input read:

Here is what I see:

Input (from R2 reads file):

@M04650:15:000000000-ARB7E:1:1103:6869:18593 2:N:0:14
TGGCCCGGGGCTGGGCGCTTAACTAGTGAGTATAGTGTCCTGGTCAGCATGTTTTCGTGAGGGAGTGGAGGGGCACTGAGCTGACTGGCTTTGAGGCTGGACGGATGGATAGACATGGCGG
+M04650:15:000000000-ARB7E:1:1103:6869:18593 2:N:0:14
CCCCCFGDGFGGEGGGG77CE<CFAF<E6CF,EFFGGG9CFEFGF9@FF,EFGGGAF<FEGGGGE,@D@EFEEDEGGGGGGGGGGADFC,CEA9B@<FACEGGGEGGGFFFGAFFGGECC@

Out.sam file:

M04650:15:000000000-ARB7E:1:1103:6869:18593 16 GL193112.1 75683 255 123M * 0 0 TGGCCCGGGGCTGGGCGCTTAACTAGTGAGTATAGTGTCCTGGTCAGCATGTTTTCGTGGGGGAGTGGAGGGGCACTGAGCTGACTGGCTTTGAGGCTGGACGGATGGATAGACATGGCGGAG :+CCFC:FF@C?4:@FACGFGFGFEDGDADGGGFDGDGBE<DGFAEGFGGGGGFBEFF8,FDFGGEEGFFC,EGGGGGEECECCDCGDFEFAFF@FFC:FFBEFFECFFE<FAFC:FF@<<CC NH:i:1 HI:i:1 AS:i:119 nM:i:1

The output sequence has extra "AG" at the end. I've globally checked how many there are those kind of reads. Wonder if you've every had such issue and how to explain that?

Thanks.

C.

On Monday, February 27, 2017 at 1:13:03 PM UTC-5, Alexander Dobin wrote:

Hi Capricy,

Alexander Dobin

unread,

Mar 8, 2017, 11:29:46 AM3/8/17

to rna-star

Hi Capricy,

the SAM line has FLAG=16, which means the read was single-end, not paired-end.

However, you are saying it's R2, and it has the Illumina tag 2:N:... Are you mapping PE or SE reds?

Cheers

Alex

Capricy Gao

unread,

Mar 9, 2017, 10:06:27 AM3/9/17

to rna-star

The mate of this read was removed due to the low quality. So it is a singleton from PE sequencing, and I mapped it as single-end reads using --readFilesIn Singleton-R1,Singleton-R2

I just figured that the sam file sequence should be the same as the input sequence.

Alexander Dobin

unread,

Mar 9, 2017, 12:34:07 PM3/9/17

to rna-star

Hi Capricy,

I have mapped just this read, and it comes out OK:

M04650:15:000000000-ARB7E:1:1103:6869:18593 0 GL193112.1 75683 255 121M * 0 0 TGGCCCGGGGCTGGGCGCTTAACTAGTGAGTATAGGCATGTTTTCGTGAGGGAGTGGAGGGGCACTGAGCTGACTGGCTTTGAGGCTGGACGGATGGATAGACATGGCGG CCCCCFGDGFGGEGGGG77CE<CFAF<E6CF,EFFGGG9CFEFGF9@FF,EFGGGAF<FEGGGGE,@D@EFEEDEDFC,CEA9B@<FACEGGGEGGGFFFGAFFGGECC@ NH:i:1 HI:i:1 AS:i:119 nM:i:0

You can try to do the same - map just this one read to see if you can reproduce it.

Note that there are several inconsistencies between this SAM line and the one that you got: sequence, quality scores, strand. It seems to me that there was some kind of mix-up when the reads were converted to Singleton-R2 .

Please send me the Singleton-R2 file.

Cheers

Alex

Capricy Gao

unread,

Mar 14, 2017, 7:37:17 AM3/14/17

to rna-star

Hi, Alex,

Thank you very much for the reply. I went back to the input files and found that they actually are mixed with paired reads. The sequencing library had rather short insert size and therefore the read pair had >90% overlap. The read that I saw had extra bases were exactly their corresponding mate.

C.

Reply all

Reply to author

Forward