ASE (allele specific expression) using STAR

Bishwa Kiran

unread,

May 9, 2017, 12:10:28 PM5/9/17

to rna-star

HI Alex,

I like the STAR very much.

I am more interested in doing count analyses of RNAseq data for ASE (allele specific expression purpose). The general method for ASE is to align the RNAreads to one reference genome and then count the number of allele/haplotype A vs B at a locus. But in our case one of the population of our F1 hybrid is more close to the reference genome, so there will biased estimates of allele/haplotype counts of the population closer to reference.

I want to use STAR in the following manner and have partially done some of these:

- Prepare the haploid genome for each parent by patching the SNPs/InDels and then merge them to create a diploid genome. For each parental chromosome there will be a suffix string to represent the parental haplotype.

- Similarly update the gtf/gff file for each parental genome (with suffix strings indicating parental genomes) and move the coordinates to reflect changes in gtf/gff file.

- then run a competitive alignments on a diploid genome using STAR (with alignment to transcriptome mode). Reason being the gene annotation for our organism may not be perfect so genome guided alignment to transcriptome is a better approach is my guess).

- then I want to count the number of reads mapping uniquely to each haplotype (which is obtainable using ReadsPerGene.out.tab file). - I have finished upto this stage.

- But, using the '2ms04hR2Aligned.toTranscriptome.out.bam' file, I also want to
> count the number of reads that map to both alleles/haplotypes for one gene
> as well as the reads that align to multiple genes (in each sample vs. pooled sample).
Can you give some idea if there is way to mine the reads that map to both haplotype of the same gene vs. the ones that map to multiple genes (same genome or another)?

Thanks,
- Bishwa K.

Alexander Dobin

unread,

May 10, 2017, 12:11:44 PM5/10/17

to rna-star

Hi Bishwa,

your strategy looks great. Actually, we are working on a similar pipeline for ENCODE data, where we have phased diploid genome for a few individuals.

Briefly, it works as follows.

1. We generate the diploid genome from phased VCF files using AlleleSeq, which creates the diploid sequence and the chain files.

2. A new option in STAR transforming the GTF files from reference coordinates to the two haplotypes.

3. Genome index is generated for the diploid sequence with diploid annotations.

4. Reads are mapped and transformed to diploid transcriptomic coordinates.

5. RSEM computes haplotype specific expression for transcripts and genes,

To differentiated the reads that map equally well to both haplotypes of the same gene from the reads that map to different genes (loci), we need to transform the diploid alignments into reference coordinates.

I am currently working on this and hope to release it within 1-2 months.

Cheers

Alex

Bishwa Kiran

unread,

May 10, 2017, 3:08:07 PM5/10/17

to rna-...@googlegroups.com

HI Alex,

Its good to know that such a pipeline in on progress.
- One problem with preparation of diploid genome using Alleleseq is that it needs parent-child trio. I think it works without that information but it may not prepare the appropriate diploid file (haven't tested though).

We are making diploid for F1 hybrids and don't have parental/maternal genotype. Instead I have to segregate the haplotype using my own methods (will discuss later if need be).

I think I can deal with other problems, but I am mostly interested in point #4 and #5 and the last point of counting the number of reads - This is exactly what I want to do and working on this right now. I am not a programmer but a beginner pythonist and trying to work this out within a month (for my presentation at a conference).

I wish your tool had come earlier, lol. I am tyring to do a more clean ASE for my data since last year but being an empirical biologist I had to take a pause and learn python first.

I will have to follow up with you on the progress and will ask some questions just to make sure I am doing things rite. Would that be ok?

Thanks,

Erik Omar Díaz Valenzuela

unread,

Jun 21, 2017, 11:14:17 AM6/21/17

to rna-star

Hi Bishwa,

If you have a good reference genome you could try my pipeline:

1) Build Pseudo-references
- Call SNPs from both parental (STAR, picard, sam tools, GATK)

- Take those SNPs and replace them in the reference genome (homozygous and biallelic SNPs > 4 DP >30 QUAL) (GATK)

2) Call high quality F1 SNPs

- Call SNPs of your F1 against each of the two Pseudo-references and filter out the heterozygous > 10 DP SNPs
- Merge your 2 VCFs and extract common positions to perform binomial/chi-square test

If you want more detail we can discuss this pipeline and share scripts

Bishwa Kiran

unread,

Jul 31, 2017, 10:58:15 AM7/31/17

to rna-star

HI Erik,
I just saw your post. Sorry about the delay.

Thanks for reaching out. I have actually create my own pipeline with overlapped feature with what Alex and I had discussed. And this pipeline has some shared feature with what your proposed.
ASE in my F1 hybrid was little problematic since one parent was more close to the reference genome.

So, I took the approach of:
1) creating a haplotype based diploid reference for each parent/population using SNP/index patching.
Btw, is you pseudo genome, haploid or diploid? Is it a masked genome or SNP/Indel patched genome?
2) Now, I align the F1 reads competitively against the diploid genome/transcriptome.
3) calculate the number of uniq, bi-allelic, multi-alignment reads. I am not sure about your step #2 - can you add more details.

The problem for me now is the covert this bam (aligned to diploid) back to haploid construct, but it may/not be necessary.

If you are open to ideas we can discuss further.

Niran Hadad

unread,

Apr 5, 2019, 2:29:26 PM4/5/19

to rna-star

Hi Alex,

Sorry to bring this post back from the dead.

Has this pipeline been developed? More specifically, has the second step regarding transforming a GTF been developed?

Is there a best practice for using STAR for allele specific alignment?

Niran

On Wednesday, May 10, 2017 at 12:11:44 PM UTC-4, Alexander Dobin wrote:

Alexander Dobin

unread,

Apr 9, 2019, 10:36:43 AM4/9/19

to rna-star

Hi Niran,

yes, the liftOver for GTF had been implemented in STAR, you would need to run it separately for each haplotype, e.g.

$ STAR --runMode liftOver --genomeChainFiles Hap1.chain --sjdbGTFfile Annot.gtf

Cheers

Alex

Reply all

Reply to author

Forward