Mapping reads to sequence with high homology

Benjy Jek Yang Tan

Nov 13, 2017, 5:27:17 AM

to rna-star

Hi Alex,

I have a question here.

I am using STAR to align DNA-Seq reads to a reference consisting of the human genome plus a retrovirus sequence.

The attachment shows the reference span of the reads mapped to the retrovirus.

However, the 5' end of the virus has no reads, while the coverage at the 3' end is higher than in the other regions.

As you might know, the 5' LTR and 3' LTR of a retrovirus are sequences with very high homology to each other.

So I would like to ask: how does STAR decide which copy a read is assigned to when two sequences are highly homologous?

I believe that some of the reads mapped to the 3' end should have gone to the 5' end instead.

Do you have any idea how I can solve this?

Thank you very much.

[Attachment: Screen Shot 2017-11-13 at 19.20.17.png]

Alexander Dobin

Nov 14, 2017, 11:57:16 AM

to rna-star

Hi Benjy,

is this signal track for unique only, or for unique+multiple alignments?

STAR, by default, reports all of the loci a multimapper maps to, so it should not create a bias between the homologous sequences if you are looking at unique+multiple alignments. If one of the sequences has more similarity to other places in the genome, the unique alignments might be biased against it.
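To make the unique vs. unique+multiple distinction concrete, here is a minimal sketch on toy data. The 30-base "virus", the read positions, and the function name `coverage` are all hypothetical; the only real convention assumed is that the SAM `NH` tag gives the number of loci a read maps to.

```python
def coverage(alignments, length, include_multi):
    """Per-base depth over a contig of `length` bases.

    alignments: iterable of (start, read_len, nh), with a 0-based start and
    nh = number of loci the read maps to (the value of the SAM NH tag).
    """
    depth = [0] * length
    for start, read_len, nh in alignments:
        if nh > 1 and not include_multi:
            continue  # drop every copy of a multimapping read
        for pos in range(start, min(start + read_len, length)):
            depth[pos] += 1
    return depth


# Toy contig: identical LTR copies at 0-9 and 20-29, unique body at 10-19.
toy = [
    (0, 10, 2),   # LTR read, reported at the 5' copy
    (20, 10, 2),  # the same read, reported at the 3' copy
    (10, 10, 1),  # read from the unique internal region
]

both = coverage(toy, 30, include_multi=True)
unique_only = coverage(toy, 30, include_multi=False)
```

With all alignments included, the two LTR copies get equal depth; with unique alignments only, both LTRs drop to zero. Neither setting alone favours one LTR over the other.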

Note that some depletion of the reads next to 5'/3' ends is to be expected, since some of the reads that cross from retrovirus into the genome will not be mapped to the virus sequence alone.

Cheers

Alex

Benjy Jek Yang Tan

Nov 16, 2017, 4:49:11 AM

to rna-...@googlegroups.com

Hi Alex,

For this mapping, I removed reads that are not primary alignments, so they should be unique-only reads?

Sorry, but I did not quite follow your last comment. Would you mind explaining?

Even if some of the retrovirus reads cross into the human genome, shouldn't part of each read still map to the virus sequence, with the part that matches the human genome soft-clipped?

Thank you.

Alexander Dobin

Nov 17, 2017, 5:27:39 PM

to rna-star

Hi Benjy,

how did you remove the non-primary alignments?

Note that in the SAM output, one alignment is always marked as primary, even if there are other non-primary alignments for this read.
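The difference between "primary" and "unique" can be sketched on a few toy SAM lines. The read names, positions, and helper names below are hypothetical; the sketch assumes the `NH` tag is present, which is the case with STAR's default SAM attributes.

```python
SECONDARY = 0x100  # SAM flag bit: "not primary alignment"


def parse_record(line):
    """Return (read name, flag, NH value or None) for one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    nh = None
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            nh = int(tag[5:])
    return fields[0], int(fields[1]), nh


def drop_secondary(lines):
    """Keep records whose 0x100 bit is unset (one record per read survives)."""
    return [l for l in lines if not (parse_record(l)[1] & SECONDARY)]


def keep_unique(lines):
    """Keep only records of reads that map to exactly one locus (NH == 1)."""
    return [l for l in lines if parse_record(l)[2] == 1]


def sam(name, flag, pos, mapq, nh):
    """Build a toy SAM line on a contig called 'virus' (made-up data)."""
    return "\t".join([name, str(flag), "virus", str(pos), str(mapq),
                      "50M", "*", "0", "0", "*", "*", f"NH:i:{nh}"])


toy = [
    sam("ltr_read", 0, 9001, 3, 2),     # primary copy, placed in the 3' LTR
    sam("ltr_read", 256, 1, 3, 2),      # secondary copy, placed in the 5' LTR
    sam("body_read", 0, 4000, 255, 1),  # uniquely mapped read
]

primary_names = [parse_record(l)[0] for l in drop_secondary(toy)]
unique_names = [parse_record(l)[0] for l in keep_unique(toy)]
```

Dropping the 0x100 records still leaves `ltr_read` at whichever LTR copy happened to be marked primary, so a primary-only track is not a unique-only track and can pile multimappers onto one of the two LTRs.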

I understood your case as follows. You have a retrovirus sequence that is inserted in the genome of your organism.

However, you are mapping to the genome without the insertion plus the standalone virus sequence.

The reads that cross the insertion points in the real genome will be mapped partially to the genome without the insertion, and partially to the standalone sequence.

These reads are considered chimeric and are not output by default, thus effectively depleting the 5'/3' ends of the standalone sequence. Actually, I remember some people were trying to use the chimeric alignments to detect the insertion points of retroviruses.

Cheers

Alex

Benjy Jek Yang Tan

Nov 28, 2017, 7:32:25 PM

to rna-star

Hi Alex,

Thank you very much for your reply.

To remove the non-primary alignments, I used samtools to filter out reads with the flag 0x100 (not primary alignment).

I am only keeping the first alignment of each read.

Yes, your understanding of my case is correct!

So would it be better to map the reads to a genome with the inserted retrovirus sequence instead?

Or should I output the chimeric reads into the aligned BAM file using --chimOutType WithinBAM?

Thank you very much.

Alexander Dobin

Nov 29, 2017, 3:33:17 PM

to rna-star

Hi Benjy,

if you know for sure where the retrovirus sequence is inserted, then mapping to the genome with the inserted sequence is definitely the best approach.
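As a rough sketch of what "the genome with the inserted sequence" means at the sequence level, assuming a single known insertion point and plain in-memory strings (the function names and the 0-based `insertion_pos` convention are illustrative, not part of STAR):

```python
def insert_provirus(chrom_seq, insertion_pos, virus_seq):
    """Return the chromosome with virus_seq spliced in before insertion_pos (0-based)."""
    return chrom_seq[:insertion_pos] + virus_seq + chrom_seq[insertion_pos:]


def lift_coordinate(pos, insertion_pos, virus_len):
    """Map a 0-based position on the original chromosome to the modified one."""
    return pos if pos < insertion_pos else pos + virus_len


modified = insert_provirus("AAAACCCC", 4, "GG")
```

Positions downstream of the insertion shift by the virus length, so any annotation used with the modified reference needs the same shift.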

Alternatively, you could add the flanking regions (say ~1,000 bases) to the start and end of the standalone retrovirus "chromosome".
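The flanking idea can be sketched in the same spirit. This assumes one known insertion point and in-memory strings; `flanked_virus` and its parameters are hypothetical names, and the default of 1,000 bases simply mirrors the figure suggested above.

```python
def flanked_virus(chrom_seq, insertion_pos, virus_seq, flank=1000):
    """Build a standalone contig: upstream flank + virus + downstream flank.

    insertion_pos is the 0-based position on the host chromosome before
    which the virus is integrated. Flanks are truncated at chromosome ends.
    """
    left = chrom_seq[max(0, insertion_pos - flank):insertion_pos]
    right = chrom_seq[insertion_pos:insertion_pos + flank]
    return left + virus_seq + right


contig = flanked_virus("AAAAACCCCC", 5, "GGG", flank=2)
```

Because the flanks are also present in the host chromosome, reads falling entirely within them become multimappers between the host chromosome and this contig, which is worth keeping in mind when filtering for unique alignments.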

The chimeric approach is good if you do not know the insertion points. It would allow you to find those insertion points, and then I would still recommend inserting the sequence into the genome and re-mapping all reads.

Cheers

Alex

Benjy Jek Yang Tan

Nov 29, 2017, 9:35:45 PM

to rna-...@googlegroups.com

Hi Alex,

Thank you for your reply.

I see. Thank you for your suggestion.

I think that would be possible with cell lines, but for the human genome it would be a bit difficult, as the retrovirus integrates randomly and each copy of the virus is integrated at a different site, on different chromosomes.