Adding custom genes

Yu Chen

Nov 12, 2013, 10:35:45 AM

to rna-...@googlegroups.com

Sorry for what may be a simple question. I'm doing RNA-seq on a mouse cell line infected with both a human transgene and EGFP. Is it easy to add these sequences for STAR to map? I don't want to do de novo discovery.

Thanks.

Alexander Dobin

Nov 13, 2013, 11:58:14 PM

to rna-...@googlegroups.com

Hi Yu Chen,

This is a good question. You can add these sequences at the genome generation step as extra "chromosomes".

You would need to make FASTA files corresponding to each gene, and then use them together with the genome sequence, for instance:

$ STAR --genomeFastaFiles mm9.fa gene1.fa gene2.fa --runMode genomeGenerate --genomeDir ./ --sjdbGTFfile mm9.gtf --sjdbOverhang 100 --runThreadN 10
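Before running a command like the one above, it may help to confirm that the added sequence names do not clash with existing chromosome names. The following is only a sketch: the function names are hypothetical, and it assumes a sequence name is the header text after ">" up to the first whitespace.

```shell
# List sequence names from FASTA headers.
# Assumption: the name is the text after '>' up to the first whitespace.
fasta_names() {
    awk '/^>/ { print substr($1, 2) }' "$@"
}

# Print every name that occurs more than once across the given FASTA files.
# No output means all names are unique.
duplicate_fasta_names() {
    fasta_names "$@" | sort | uniq -d
}
```

For example, `duplicate_fasta_names mm9.fa gene1.fa gene2.fa` should print nothing if the three files can be combined safely.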

Cheers

Alex

Rob Wirka

May 16, 2016, 4:33:28 AM

to rna-star

Hi Alex,

Wouldn't you also need to modify the GTF file in order to correctly annotate these extra chromosomes? I've been trying to modify the GTF file in many ways to get it to work, to no avail. It acknowledges the extra chromosome in the chrName.txt file generated after genome generation, but upon alignment the gene is nowhere to be found in the ReadsPerGene.out.tab (not even listed with 0 reads).

Thanks,

Rob

Alexander Dobin

May 16, 2016, 5:45:01 PM

to rna-star

Hi Rob,

Indeed, if you need to count reads per gene for these added "gene references", you need to add one line per gene to the GTF file, like this (\tab\ stands for a single tab character):

gene1 \tab\ AddedGenes \tab\ exon \tab\ 1 \tab\ <gene1_length> \tab\ . \tab\ + \tab\ 0 \tab\ gene_id "gene1"; transcript_id "gene1";

Here gene1 must be the sequence name used in the FASTA file for the extra genic reference, and <gene1_length> is its length in bases.
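One way to avoid typing the line by hand is to derive it from the FASTA file. This is a minimal sketch, not an official STAR tool: the function name is hypothetical, the "AddedGenes" source label and the 0 in column 8 simply follow the line above, and the sequence name is assumed to be the header text after ">" up to the first whitespace.

```shell
# Print one tab-separated GTF exon line per sequence in a FASTA file.
# Each sequence is treated as a single exon spanning 1..length on the + strand.
gtf_lines_for_fasta() {
    awk '
        function emit() {
            if (name != "")
                printf "%s\tAddedGenes\texon\t1\t%d\t.\t+\t0\tgene_id \"%s\"; transcript_id \"%s\";\n", name, len, name, name
        }
        /^>/ { emit(); name = substr($1, 2); len = 0; next }   # new record: flush the previous one
        { gsub(/[ \t\r]/, ""); len += length($0) }             # add up sequence characters
        END { emit() }
    ' "$1"
}
```

A possible use is `gtf_lines_for_fasta gene1.fa >> mm9.gtf`, run on a copy of the annotation file.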

If this does not work, please send me the Log.out file.

Cheers

Alex

Rob Wirka

May 16, 2016, 8:26:32 PM

to rna-star

Alex,

Thanks so much, that worked!

Best,

Rob

Matthew Jones

Aug 7, 2016, 9:28:15 PM

to rna-star

Hi guys,

I tried to duplicate this genome generation step to include tdTomato in the mouse genome, but I'm having issues with the downstream analysis. The genome is indexed successfully, but when I use the newly indexed directory for alignment, STAR complains that genomeParameters.txt has an extra line stating "sjdbInsertSave Basic". I would greatly appreciate any insights you may have on how to address this issue.

Thanks,

Matt

Log.out

Alexander Dobin

Aug 10, 2016, 6:20:54 PM

to rna-star

Hi Matt,

Are you using the same STAR version for genome generation and mapping?

I would recommend that you switch to the latest version. If this does not help, please send me the Log.out file from the failed mapping run.
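A quick way to compare the versions is to read them from the two Log.out files. This sketch assumes each log contains a line of the form "STAR version=<version>"; that layout is an assumption and may differ between releases, and the function names are hypothetical.

```shell
# Extract the STAR version recorded in a Log.out file.
# Assumption: the log has a line starting with "STAR version=".
star_version_from_log() {
    sed -n 's/^STAR version=//p' "$1" | head -n 1
}

# Compare the versions in a genome-generation log and a mapping log.
# Returns 0 only if both versions were found and are identical.
same_star_version() {
    ssv_gen=$(star_version_from_log "$1")
    ssv_map=$(star_version_from_log "$2")
    echo "genome generation: ${ssv_gen:-unknown}; mapping: ${ssv_map:-unknown}"
    [ -n "$ssv_gen" ] && [ "$ssv_gen" = "$ssv_map" ]
}
```

Usage would look like `same_star_version genome/Log.out sampleLog.out`, where both paths are placeholders.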

Cheers

Alex

Matt K

Sep 14, 2016, 11:02:37 AM

to rna-star

Hi,

I am trying to do exactly the same as the person above: add eGFP to a genome. I followed the directions above and then created a FASTQ file with 4 reads that should all align to the eGFP gene. When I run the alignment and look at eGFPAligned.toTranscriptome.out.bam, only 1 of the 4 reads is present.

When I look at the eGFPAligned.out.bam, all four aligned reads are present and are aligned to the eGFP chromosome.

DGR4KXP1:301:H7LRPADXX:2:1207:10120:44392 0 eGFP 508 255 100M * 0 0 CACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGA CCCFFFFFHHHHHIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJIIJJJJJJJJJJHHHHFFFDDDDDDDDDDDDDDDDDDDDDDD@DD NH:i:1 HI:i:1 AS:i:98 nM:i:0

DGR4KXP1:301:H7LRPADXX:2:1207:10151:44403 0 eGFP 339 255 100M * 0 0 GAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTAC CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJIJJJJJJJGIIJJJJJJJJJJIJJJJJHHHHHHHFFFFFFEEEFFEEDDDEDDDD NH:i:1 HI:i:1 AS:i:98 nM:i:0

DGR4KXP1:301:H7LRPADXX:2:1207:10488:44274 0 eGFP 170 255 100M * 0 0 CCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCC CCCFFFFFHHHHHJJJJIJJJJJJJJJJJJIJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJJHHHHHFFFFFEEEEEDDBDDDDDDDDDDEE NH:i:1 HI:i:1 AS:i:98 nM:i:0

DGR4KXP1:301:H7LRPADXX:2:1207:10340:44430 0 eGFP 1 255 100M * 0 0 ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGG CCCFFFFFHHHHHIJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJHHHHHFFFFFFFEEEEEEEEEEEDDDDDDDDDDDD NH:i:1 HI:i:1 AS:i:98 nM:i:0

I don't know why it won't recognize all reads as aligning to eGFP.
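To compare the two BAM files more systematically, the records can be tallied by reference name. The sketch below reads SAM text on stdin, so it can be fed from `samtools view`; the function name is hypothetical, and note that in the transcriptome BAM column 3 holds a transcript ID instead of a chromosome name.

```shell
# Count alignment records per reference name (SAM column 3).
# Header lines starting with '@' are skipped; output is "name<TAB>count", sorted by name.
count_by_reference() {
    awk -F '\t' '!/^@/ { n[$3]++ } END { for (r in n) printf "%s\t%d\n", r, n[r] }' | sort
}
```

For example: `samtools view eGFPAligned.out.bam | count_by_reference`, then the same for the transcriptome BAM.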

The exact steps I took to generate the genome indexes are as follows.

1. Created eGFP fasta file called eGFP.fa

First five lines of file

>eGFP
ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGAC
GGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC
GGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACC
CTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAG

2. Added eGFP to my gtf file

Last 2 lines of gtf, with the last line being the line added for eGFP

chrM ENSEMBL exon 15356 15422 . - . gene_id "ENSMUSG00000064372.1"; transcript_id "ENSMUST00000082423.1"; gene_type "Mt_tRNA"; gene_status "KNOWN"; gene_name "mt-Tp"; transcript_type "Mt_tRNA"; transcript_status "KNOWN"; transcript_name "mt-Tp-201"; exon_number 1; exon_id "ENSMUSE00000521550.1"; level 3; transcript_support_level "NA"; tag "basic";

eGFP AddedGenes exon 1 720 . + 0 gene_id "eGFP"; transcript_id "eGFP";

3. Load star/2.4.2a

4. Run generate genome command

bsub -n 8 -R "span[hosts=1]" STAR --runMode genomeGenerate \
--runThreadN 8 \
--genomeFastaFiles GRCm38.p4.genome.fa eGFP.fa \
--genomeDir gencode.vM10eGFP2 \
--sjdbGTFfile gencode.vM10.annotation.gtf \
--sjdbOverhang 100

Any help would be greatly appreciated, and let me know if I can provide any more information.

Thanks,

Matt

Alexander Dobin

Sep 14, 2016, 4:04:14 PM

to rna-star

Hi Matt,

I cannot reproduce this problem. I made a fake gene sequence out of these reads and used your GTF line. All 4 reads were reported in the Aligned.toTranscriptome.out.bam file.

Please try the latest version 2.5.2b: https://github.com/alexdobin/STAR/releases/tag/2.5.2b

If it does not help, please send me the Log.out files from both the genome generation and mapping steps.

Cheers

Alex

Matt K

Sep 14, 2016, 5:41:51 PM

to rna-star

Hi Alex,

Thanks for the response. I redid the genome generation and alignment with a more recent STAR version and everything worked.

Thanks,

Matt

Алима Галиева

May 2, 2023, 5:48:59 PM

to rna-star

Hello, Alex,

I'm very new to STAR, so I'm sorry if this is a stupid question.

I added my artificial gene as a separate optAIPL1.fa file at the genome generation step and, following your messages above, appended a line describing it to the GTF annotation file: 'optAIPL1 \tab\ AddedGenes \tab\ exon \tab\ 1 1158 \tab\ . \tab\ + \tab\ 0 \tab\ gene_id "optAIPL1"; transcript_id "optAIPL1 ";'. But my added gene did not appear in the output counts table after mapping. I tried changing the line in the annotation file to 'optAIPL1 ENSEMBL gene exon 1 1158 . + 0 gene_id "optAIPL1"; transcript_id "optAIPL1 ";', but that did not change anything. I looked through the output files and it seems that something happened at the splice junction detection step, because the SJ.out.tab file does not contain all chromosomes. Now I'm wondering whether I did something wrong.

Could you please help me out?

Attached files are: mapping output files for one of my samples and index generation log file

Thanks

Alima

On Tuesday, May 17, 2016 at 00:45:01 UTC+3, Alexander Dobin wrote:

AIPL1_6_2.fastqReadsPerGene.out.tab

AIPL1_6_2.fastqLog.progress.out

AIPL1_6_2.fastqLog.out

AIPL1_6_2.fastqSJ.out.tab

indexGeneratingLog.out

Alexander Dobin

May 2, 2023, 6:11:45 PM

to rna-star

Hi Alima,

The line you need to add to the GTF must have its columns separated by tabs.

The best way is to copy an existing line and edit the fields according to your needs.

I am attaching the line that will hopefully work.

A.gtf
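Since spaces and tabs look alike in most editors, a small check can flag lines whose columns are not tab-separated. This is a sketch with a hypothetical function name; it only counts fields and does not validate their contents.

```shell
# Report GTF lines that do not have exactly 9 tab-separated fields.
# Comment lines ('#') and blank lines are ignored.
# Returns non-zero if any offending line is found.
check_gtf_tabs() {
    awk -F '\t' '
        /^#/ || /^[ \t]*$/ { next }
        NF != 9 { printf "line %d: %d tab-separated field(s)\n", NR, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

Running `check_gtf_tabs annotation.gtf` (a placeholder file name) prints nothing when every line has nine tab-separated columns.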

Алима Галиева

May 3, 2023, 8:00:19 AM

to rna-star

Hi, Alex

I changed the annotation line to match the attached file and it worked with the Gencode annotation. I also shouldn't have looked for my custom gene in the SJ.out.tab file, because it may not contain all chromosomes if some of them have no genes with splice junctions, which was not obvious to me before :)
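The place to look for an added gene is ReadsPerGene.out.tab, whose first column is the gene ID. A minimal sketch of that lookup, with a hypothetical function name:

```shell
# Print the ReadsPerGene.out.tab row for one gene ID.
# Returns non-zero if the gene is not listed at all.
gene_counts() {
    awk -F '\t' -v id="$1" '$1 == id { print; found = 1 } END { exit !found }' "$2"
}
```

For example, `gene_counts optAIPL1 AIPL1_6_2.fastqReadsPerGene.out.tab` prints the count columns for the added gene, or fails if the GTF line was not picked up.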

Thank you very much for your answer!

Best regards,

Alima

On Wednesday, May 3, 2023 at 01:11:45 UTC+3, Alexander Dobin wrote:

ann.gtf
