Adding custom genes


Yu Chen

Nov 12, 2013, 10:35:45 AM
to rna-...@googlegroups.com
Sorry for what may be a simple question. I'm doing RNA-seq on a mouse cell line infected with both a human transgene and EGFP. Is it easy to add these sequences for STAR to map against? I don't want to do de novo discovery.

Thanks.

Alexander Dobin

Nov 13, 2013, 11:58:14 PM
to rna-...@googlegroups.com
Hi Yu Chen,

this is a good question. You can add these sequences at the "genome generation step" as extra "chromosomes".
You would need to make a FASTA file for each gene, and then supply them together with the genome sequence, for instance:

$ STAR --genomeFastaFiles mm9.fa gene1.fa gene2.fa --runMode genomeGenerate --genomeDir ./ --sjdbGTFfile mm9.gtf --sjdbOverhang 100 --runThreadN 10

Cheers
Alex
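
Before generating the genome, it may be worth checking that the names in the added FASTA files do not clash with the existing chromosome names. The following is a minimal sketch, not part of STAR: `fasta_duplicate_names` is a hypothetical helper, and it assumes the reference name is the first word after `>` in each header line.

```shell
# Hypothetical helper (not a STAR command): print every sequence name that
# occurs more than once across the given FASTA files.
# Assumption: the name is the first whitespace-delimited word after '>'.
fasta_duplicate_names() {
    awk '/^>/ {
        name = substr($1, 2)                   # drop the leading ">"
        if (seen[name]++ == 1) print name      # report each duplicate once
    }' "$@"
}
```

For example, `fasta_duplicate_names mm9.fa gene1.fa gene2.fa` should print nothing if all names are unique.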

Rob Wirka

May 16, 2016, 4:33:28 AM
to rna-star
Hi Alex,

Wouldn't you also need to modify the GTF file in order to correctly annotate these extra chromosomes? I've tried modifying the GTF file in many ways to get it to work, to no avail. STAR lists the extra chromosome in the chrName.txt file produced by genome generation, but after alignment the gene is nowhere to be found in ReadsPerGene.out.tab (not even listed with 0 reads).

Thanks,

Rob

Alexander Dobin

May 16, 2016, 5:45:01 PM
to rna-star
Hi Rob,

indeed, if you need to count reads per gene for these added "gene references", you would need to add to the GTF file one line per gene like this:
gene1 \tab\ AddedGenes \tab\ exon \tab\ 1 \tab\ <gene1_length> \tab\ . \tab\ + \tab\ 0 \tab\ gene_id "gene1"; transcript_id "gene1";

Here gene1 should be the name used in the FASTA file for the extra genic reference, <gene1_length> is its length, and \tab\ stands for a single tab character.
If this does not work, please send me the Log.out file.

Cheers
Alex
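
The GTF line described above can be generated from the FASTA file rather than typed by hand, which avoids miscounting the length or losing the tabs. This is only a sketch: `gtf_line_for_fasta` is a hypothetical helper, it takes the name to be the first word after `>`, and it assumes the sequence lines contain nothing but bases (no trailing spaces or Windows line endings).

```shell
# Hypothetical helper: print one tab-separated GTF "exon" line for every
# sequence in a FASTA file, spanning the whole sequence (1..length).
# The source "AddedGenes", strand "+", and frame "0" follow the line above.
gtf_line_for_fasta() {
    awk '
        function emit() {
            printf "%s\tAddedGenes\texon\t1\t%d\t.\t+\t0\tgene_id \"%s\"; transcript_id \"%s\";\n", name, len, name, name
        }
        /^>/ { if (name != "") emit(); name = substr($1, 2); len = 0; next }
             { len += length($0) }             # add up the bases on each sequence line
        END  { if (name != "") emit() }
    ' "$1"
}
```

A possible use is `gtf_line_for_fasta gene1.fa >> mm9.gtf`, run on a copy of the annotation so that the original file stays untouched.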

Rob Wirka

May 16, 2016, 8:26:32 PM
to rna-star
Alex,

Thanks so much, that worked!

Best,

Rob

Matthew Jones

Aug 7, 2016, 9:28:15 PM
to rna-star
Hi guys,

I tried to duplicate this genome generation step to include tdTomato in the mouse genome, but I'm having issues with the downstream analysis. The genome is indexed successfully, but when I use the newly indexed directory for alignment, STAR complains that genomeParameters.txt has an extra line stating "sjdbInsertSave  Basic". I would greatly appreciate any insights you may have on how to address this issue.

Thanks,
Matt
Log.out

Alexander Dobin

Aug 10, 2016, 6:20:54 PM
to rna-star
Hi Matt,

are you using the same STAR version for genome generation and mapping?
I would recommend that you switch to the latest version. If this does not help, please send me the Log.out file from the failed mapping run.

Cheers
Alex
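
One way to check whether genome generation and mapping used the same STAR version is to compare the two Log.out files. The sketch below is not from the thread: the helper names are hypothetical, and it assumes each Log.out contains a line beginning with `STAR version=`, which should be confirmed against your own logs.

```shell
# Hypothetical helpers; assumption: Log.out has a line "STAR version=<version>".
star_version_from_log() {
    # Print whatever follows "STAR version=" on the first matching line.
    sed -n 's/^STAR version=//p' "$1" | head -n 1
}

# Succeeds only if both logs report a version and the two versions match.
same_star_version() {
    gen_version=$(star_version_from_log "$1")
    map_version=$(star_version_from_log "$2")
    [ -n "$gen_version" ] && [ "$gen_version" = "$map_version" ]
}
```

For example: `same_star_version genomeDir/Log.out mapping/Log.out || echo "STAR versions differ"`, where both paths are placeholders for your own files.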

Matt K

Sep 14, 2016, 11:02:37 AM
to rna-star
Hi,

I am trying to do exactly the same thing as the person above: add eGFP to a genome. I've followed the directions above and then created a FASTQ file with 4 reads, all of which should align to the eGFP gene. When I run the alignment and look at eGFPAlignedtoTranscriptome.out.bam, only 1 of the 4 reads is present.

When I look at the eGFPAligned.out.bam, all four aligned reads are present and are aligned to the eGFP chromosome.

DGR4KXP1:301:H7LRPADXX:2:1207:10120:44392 0 eGFP 508 255 100M * 0 0 CACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGA CCCFFFFFHHHHHIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJIIJJJJJJJJJJHHHHFFFDDDDDDDDDDDDDDDDDDDDDDD@DD NH:i:1 HI:i:1 AS:i:98 nM:i:0
DGR4KXP1:301:H7LRPADXX:2:1207:10151:44403 0 eGFP 339 255 100M * 0 0 GAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTAC CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJIJJJJJJJGIIJJJJJJJJJJIJJJJJHHHHHHHFFFFFFEEEFFEEDDDEDDDD NH:i:1 HI:i:1 AS:i:98 nM:i:0
DGR4KXP1:301:H7LRPADXX:2:1207:10488:44274 0 eGFP 170 255 100M * 0 0 CCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCC CCCFFFFFHHHHHJJJJIJJJJJJJJJJJJIJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJJHHHHHFFFFFEEEEEDDBDDDDDDDDDDEE NH:i:1 HI:i:1 AS:i:98 nM:i:0
DGR4KXP1:301:H7LRPADXX:2:1207:10340:44430 0 eGFP 1 255 100M * 0 0 ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGG CCCFFFFFHHHHHIJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJHHHHHFFFFFFFEEEEEEEEEEEDDDDDDDDDDDD NH:i:1 HI:i:1 AS:i:98 nM:i:0

I don't know why it won't recognize all reads as aligning to eGFP.

The exact steps I took to generate the genome indexes are as follows.

1.  Created eGFP fasta file called eGFP.fa
First five lines of file
>eGFP
ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGAC
GGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC
GGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACC
CTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAG

2.  Added eGFP to my gtf file
Last 2 lines of gtf, with the last line being the line added for eGFP
chrsM ENSEMBL exon 15356 15422 . - . gene_id "ENSMUSG00000064372.1"; transcript_id "ENSMUST00000082423.1"; gene_type "Mt_tRNA"; gene_status "KNOWN"; gene_name "mt-Tp"; transcript_type "Mt_tRNA"; transcript_status "KNOWN"; transcript_name "mt-Tp-201"; exon_number 1; exon_id "ENSMUSE00000521550.1"; level 3; transcript_support_level "NA"; tag "basic";
eGFP AddedGenes exon 1 720 . + 0 gene_id "eGFP"; transcript_id "eGFP";

3.  Load star/2.4.2a

4.  Run generate genome command

bsub -n 8 -R "span[hosts=1]" STAR --runMode genomeGenerate \
    --runThreadN 8 \
    --genomeFastaFiles GRCm38.p4.genome.fa eGFP.fa \
    --genomeDir gencode.vM10eGFP2 \
    --sjdbGTFfile gencode.vM10.annotation.gtf \
    --sjdbOverhang 100

Any help would be greatly appreciated, and let me know if I can provide any more information.

Thanks,
Matt

Alexander Dobin

Sep 14, 2016, 4:04:14 PM
to rna-star
Hi Matt,

I cannot reproduce this problem. I made a fake gene sequence out of these reads and used your GTF line. All 4 reads were reported in the AlignedtoTranscriptome.out.bam file.
Please try the latest version 2.5.2b: https://github.com/alexdobin/STAR/releases/tag/2.5.2b
If it does not help, please send me the Log.out files from both the genome generation and mapping steps.

Cheers
Alex

Matt K

Sep 14, 2016, 5:41:51 PM
to rna-star
Hi Alex,

Thanks for the response. I re-did the genome generation and alignment with a more current STAR version, and everything worked.

Thanks,
Matt

Алима Галиева

May 2, 2023, 5:48:59 PM
to rna-star
Hello, Alex,
I'm very new to STAR, so I'm sorry if this is a stupid question.
I added my artificial gene as a separate optAIPL1.fa file at the "genome generation step" and, following your messages above, appended a line describing it to the end of the GTF annotation file (the line was 'optAIPL1  \tab\  AddedGenes  \tab\ exon   \tab\  1   1158 \tab\  .  \tab\  +  \tab\ 0 \tab\ gene_id "optAIPL1"; transcript_id "optAIPL1 ";'). However, my added gene did not appear in the output counts table after mapping. I tried changing the line in the annotation file to 'optAIPL1    ENSEMBL gene    exon    1   1158   .      +       0       gene_id "optAIPL1"; transcript_id "optAIPL1 ";', but that did not change anything. I looked through the output files, and it seems that something went wrong at the splice junction detection step, because the SJ.out.tab file does not contain all chromosomes. Now I'm wondering whether I did something wrong.
Could you please help me out?
The attached files are the mapping output files for one of my samples and the index generation log file.

Thanks 
Alima 

On Tuesday, May 17, 2016 at 00:45:01 UTC+3, Alexander Dobin wrote:
AIPL1_6_2.fastqReadsPerGene.out.tab
AIPL1_6_2.fastqLog.progress.out
AIPL1_6_2.fastqLog.out
AIPL1_6_2.fastqSJ.out.tab
indexGeneratingLog.out

Alexander Dobin

May 2, 2023, 6:11:45 PM
to rna-star
Hi Alima,

the line you need to add to the GTF must have columns separated by tabs.
The best way is to copy an existing line and edit the fields according to your needs.
I am attaching the line that will hopefully work.
A.gtf
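
A quick way to check whether a hand-edited GTF really uses tabs is to count the tab-separated fields on each line; a GTF record has nine. This is a minimal sketch, and `gtf_bad_lines` is a hypothetical helper name.

```shell
# Hypothetical helper: report every non-comment, non-empty GTF line that does
# not split into exactly 9 fields on tab characters. A line typed with spaces
# instead of tabs shows up as a single field.
gtf_bad_lines() {
    awk -F '\t' '!/^#/ && NF > 0 && NF != 9 {
        printf "line %d: %d tab-separated field(s)\n", NR, NF
    }' "$1"
}
```

Running `gtf_bad_lines ann.gtf` should print nothing for a correctly formatted file.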

Алима Галиева

May 3, 2023, 8:00:19 AM
to rna-star
Hi, Alex
I changed the annotation line to match the attached file, and it worked with the Gencode annotation. I also should not have looked for my custom gene in the SJ.out.tab file, because it may not contain all chromosomes: chromosomes whose genes have no splice junctions can be absent from it, which was not obvious to me before :)
Thank you very much for your answer!

Best regards, 
Alima
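
Since SJ.out.tab lists only splice junctions, the counts table is the better place to confirm that an added gene was picked up. The sketch below assumes ReadsPerGene.out.tab is tab-separated with the gene ID in the first column; `reads_per_gene_row` is a hypothetical helper name.

```shell
# Hypothetical helper: print the row for one gene ID from a ReadsPerGene.out.tab
# file and return a non-zero exit status if the gene is not listed at all.
# Assumption: tab-separated columns with the gene ID in column 1.
reads_per_gene_row() {
    awk -F '\t' -v gene="$1" '
        $1 == gene { print; found = 1 }
        END        { exit !found }
    ' "$2"
}
```

For example: `reads_per_gene_row optAIPL1 AIPL1_6_2.fastqReadsPerGene.out.tab || echo "gene not listed"`.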

On Wednesday, May 3, 2023 at 01:11:45 UTC+3, Alexander Dobin wrote:
ann.gtf