Cannot generate working indices

Drwhit

Jul 20, 2016, 12:23:40 PM
to rna-star
Hi Alex,

I am trying to generate a STAR index based on GRCh38.83 that includes the ERCC spike-in sequences.  I included the ERCC spike-in FASTA files, and also concatenated their GTF file to the GRCh38.83 GTF.  If you need to download these, this is the link: ERCC92.fa & ERCC92.gtf sequence and annotation files (.zip)

When I then run this command:
STAR --runThreadN 15 --runMode genomeGenerate --genomeDir ./ --sjdbGTFfile ~/Precyte1/genomes/gtf/Homo_sapiens.GRCh38.83wERCC92.gtf --genomeFastaFiles ~/Precyte1/genomes/hg38wERCC92/*.fa

I get the following files:
-rw-r--r--  1 awhitneyPre  staff   3115581440 Jul 18 18:34 Genome
-rw-r--r--  1 awhitneyPre  staff    861972074 Jul 18 18:35 Log.out
-rw-r--r--  1 awhitneyPre  staff  24236342100 Jul 18 18:35 SA
-rw-r--r--  1 awhitneyPre  staff   1565873619 Jul 18 18:35 SAindex
-rw-r--r--  1 awhitneyPre  staff          650 Jul 18 17:52 chrLength.txt
-rw-r--r--  1 awhitneyPre  staff         1150 Jul 18 17:52 chrName.txt
-rw-r--r--  1 awhitneyPre  staff         1800 Jul 18 17:52 chrNameLength.txt
-rw-r--r--  1 awhitneyPre  staff         1057 Jul 18 17:52 chrStart.txt
-rw-r--r--  1 awhitneyPre  staff         2282 Jul 18 18:34 exonGeTrInfo.tab
-rw-r--r--  1 awhitneyPre  staff          783 Jul 18 18:34 exonInfo.tab
-rw-r--r--  1 awhitneyPre  staff         1015 Jul 18 18:34 geneInfo.tab
-rw-r--r--  1 awhitneyPre  staff         3545 Jul 18 17:50 genomeParameters.txt
-rw-r--r--  1 awhitneyPre  staff            6 Jul 18 18:34 sjdbInfo.txt
-rw-r--r--  1 awhitneyPre  staff            0 Jul 18 18:34 sjdbList.fromGTF.out.tab
-rw-r--r--  1 awhitneyPre  staff            0 Jul 18 18:34 sjdbList.out.tab
-rw-r--r--  1 awhitneyPre  staff         3807 Jul 18 18:34 transcriptInfo.tab

As you can see, a couple of them are 0 bytes.

When I then run the command to map, STAR loads the genome and starts to map, but the Aligned.out.sam file stops growing after about 30 seconds and reaches a max size of 3.6 KB.

I have been able to align my reads to an index that does not contain the ERCC sequences, so I am pretty confident that it is my new index that is faulty.

Any help you can give would be much appreciated.

Thanks for this great software!
Adam

Alexander Dobin

Jul 21, 2016, 6:56:44 PM
to rna-star
Hi Adam,

please send me the Log.out file for the genome generation step.

Cheers
Alex

Drwhit

Jul 25, 2016, 10:08:18 PM
to rna-...@googlegroups.com
Hi Alex,

Thank you for your response.  The Log.out file is 862 MB, and I get an upload error when I try to attach it here.  I am attaching a truncated version of the Log.out file in case it helps.

After looking at a portion of the Log.out file, I found that it consists mostly of these types of warnings:

WARNING: while processing sjdbGTFfile=<this is the path to my directory>/genomes/gtf/Homo_sapiens.GRCh38.83wERCC92.gtf: chromosome '1' not found in Genome fasta files for line:

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";

WARNING: while processing sjdbGTFfile=<this is the path to my directory>/genomes/gtf/Homo_sapiens.GRCh38.83wERCC92.gtf: chromosome '1' not found in Genome fasta files for line:

1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";


I don't know if this is specific to chromosome 1 or not, but I doubt it. 
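One way to answer that without reading an 862 MB log by eye is to tally the chromosome names that appear in the warnings. The following is a minimal sketch, not part of STAR: it assumes every warning follows the exact wording quoted above, and the function name `count_missing_chromosomes` is made up for illustration.

```python
import re
import sys
from collections import Counter

# Pattern copied from the warning text quoted in this thread (assumed stable).
WARNING_RE = re.compile(r"chromosome '([^']+)' not found in Genome fasta files")


def count_missing_chromosomes(lines):
    """Count how often each chromosome name appears in 'not found' warnings."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_missing.py Log.out
    # The log is streamed line by line, so its size is not a problem.
    with open(sys.argv[1], errors="replace") as log:
        for name, n in count_missing_chromosomes(log).most_common():
            print(f"{name}\t{n}")
```

If every chromosome named in the GTF shows up in this tally, the problem is not specific to chromosome 1.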


My GTF file contains entries such as these:

head <this is the path to my directory>/genomes/gtf/Homo_sapiens.GRCh38.83wERCC92.gtf

#!genome-build GRCh38.p5

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.20

#!genebuild-last-updated 2015-10

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";


So I am not sure why chromosome 1 is not being found.

Thanks again for your help,
Adam
Log.out.partial

Alexander Dobin

Jul 27, 2016, 6:14:47 PM
to rna-star
Hi Adam,

the main problem is that the chromosomes in the FASTA files are named chr1, chr2, ... while in the GTF file they are named 1, 2, ... - this is why you get so many warnings.
You have to use consistent names in both files, e.g. get both files from ENSEMBL or from GENCODE.
If you fix this, the warnings should go away, but if the mapping problem persists, please send me the Log.out files.
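
A quick consistency check before running genomeGenerate could look like the sketch below. It is not a STAR feature. It assumes that the sequence name is the FASTA header text after '>' up to the first whitespace and that the GTF is tab-delimited with the sequence name in the first column. The function names are hypothetical.

```python
import sys


def fasta_names(lines):
    """Collect sequence names from FASTA header lines ('>name description')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])  # assumption: name ends at first whitespace
    return names


def gtf_names(lines):
    """Collect sequence names from column 1 of a tab-delimited GTF."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip '#!' header lines and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def name_mismatches(fa, gtf):
    """Return (names only in the GTF, names only in the FASTA), both sorted."""
    return sorted(gtf - fa), sorted(fa - gtf)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python check_names.py annotation.gtf genome1.fa [genome2.fa ...]
    fa = set()
    for path in sys.argv[2:]:
        with open(path) as handle:
            fa |= fasta_names(handle)
    with open(sys.argv[1]) as handle:
        gtf = gtf_names(handle)
    only_gtf, only_fa = name_mismatches(fa, gtf)
    print("in GTF but not in FASTA:", only_gtf)
    print("in FASTA but not in GTF:", only_fa)
```

With a chr1-style FASTA and a 1-style GTF, nearly every annotated chromosome would be listed as present in the GTF but missing from the FASTA.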

Cheers
Alex

Drwhit

Jul 28, 2016, 2:36:12 PM
to rna-star
Alex,

Thank you for your reply.  I see what you mean.  It is strange because I thought all our files did come from the same source, but I will look into it and try again.

Thanks again,
Adam

Drwhit

Aug 2, 2016, 9:54:37 PM
to rna-star
Alex,

I am now using the GENCODE FASTA and GTF files, and both index creation and mapping work great.  Thanks for your help!