GTF recommendations: include or exclude exons on scaffolds, assembly patches & alt loci?

Peter Waltman

unread,

Feb 26, 2016, 1:04:54 PM2/26/16

to rna-star

I'm working with human data, and am curious if users should include any of the exons that are at alternate loci such as the scaffolds, assembly patches & alternate loci? Or would you recommend that users just stick with the primary assembly?

Alexander Dobin

unread,

Feb 26, 2016, 4:45:27 PM2/26/16

to rna-star

Hi Peter,

my recommendation is to include the unplaced/unlocalized scaffolds, and not include the patches/alt loci.

The unplaced/unlocalized scaffolds may include important sequences, for instance in current human assembly one of the scaffolds contains a highly expressed rRNA.

On the other hand, patches/alt loci are modifications of the sequences already present in the primary assembly.

If you include them, you would need to resolve multi-mappers that map equally well to both the primary and patch/alt.

STAR cannot do it at the moment, but I am planning to implement this feature in the near future.

Cheers

Alex

Peter Waltman

unread,

Feb 29, 2016, 11:29:15 AM2/29/16

to rna-...@googlegroups.com

OK, that was my suspicion (that the alt loci & patches should be excluded).

For those who might be interested, my solution was to do the following:

use the b37 reference available from the Broad - as this is guaranteed to work with the VCFs that they release (e.g. 1000G_phase1.indels.b37.vcf, etc.). Note, if you choose to use a different reference, you will likely need to sort the VCF's that the Broad makes available (I found this out the hard way when everything started failing at the BaseRecalibrator tool/walker step)
use the GTF of your choice.

my recommendation is to use the release 19 GTF from Gencode that only includes the annotation on the autosomal & sex chromosomes: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
In order to use this GTF with the b37 (GRCh37) version of the reference from the Broad, you will need to strip out all the 'chr' annotations and replace the 'chrM' with 'MT' annotations. The easiest way to do that is with sed, using the following command:

zcat gencode.v19.annotation.gtf.gz | sed 's/chrM/MT/1' | sed 's/chr//1' > gencode.v19.annotation.nochar.gtf
(NOTE that the '1' at the end of each sed call specifies that it only apply to the 1st field of the file/input)

Build the 1st pass index as follows:
STAR --runThreadN <Nthreads> \
--runMode genomeGenerate \
--genomeDir $(pwd)/ \
--genomeFastaFiles /path/to/Homo_sapiens_assembly19.fasta \
--sjdbGTFfile /path/to/gencode.v19.annotation.nochr.gtf \
--sjdbOverhang <reads_length - 1>
return the your regularly scheduled programming (as recommended by the GATK, article: https://www.broadinstitute.org/gatk/guide/article?id=3891, i.e. align your reads, etc.)
collect and filter the results from the SJ.out.tab, i.e.

cat */SJ.out.tab > combined.SJ.out.tab
awk 'BEGIN {OFS="\t"; strChar[0]="."; strChar[1]="+"; strChar[2]="-";} {if(($5>0)||($7>=50)){print $1,$2,$3,strChar[$4]}}' combined.SJ.out.tab | awk '!a[$0]++' | grep -v MT > filtered.combined.SJ.out.tab
the command above will create a new file called 'filtered.combined.SJ.out.tab' that filters out all duplicate and Mitochondrial junctions, as well as all non-canonical junctions, unless they (the non-canonical junctions) have 50+ reads supporting them. The threshold of 50 reads is purely arbitrary, and if you want to exclude ALL non-canonical junctions, you can use the following command instead:
awk 'BEGIN {OFS="\t"; strChar[0]="."; strChar[1]="+"; strChar[2]="-";} {if($5>0){print $1,$2,$3,strChar[$4]}}' combined.SJ.out.tab | awk '!a[$0]++' | grep -v MT > filtered.combined.SJ.out.tab

Start of 2nd pass: re-build the index using the filtered.combined.SJ.out.tab that you just created - something like this:
STAR --runThreadN <Nthreads> \
--runMode genomeGenerate \
--genomeDir $(pwd)/ \
--genomeFastaFiles /path/to/Homo_sapiens_assembly19.fasta \
--sjdbFileChrStartEnd /path/to/filtered.combined.SJ.out.tab \
--sjdbGTFfile /path/to/gencode.v19.annotation.nochr.gtf \
--sjdbOverhang <reads_length - 1>
Re-align your fastq's
split-n-trim...
and away you go!

Alexander Dobin

unread,

Feb 29, 2016, 4:29:37 PM2/29/16

to rna-star

Hi Peter,

thanks a lot for working out and sharing this strategy! It will be very useful for people running STAR/GATK.

I would like to point to a few options that would make the pipeline more efficient.

At the 1st mapping pass mapping (step 4), --outSAMtype None will prevent output of large SAM files, which you do not really need at the 1st pass.

At the 2nd mapping pass, you can used the following parameters to eliminate several steps from the GATK best practices:

--outSAMattrRGline

string: SAM/BAM read group line. The first word contains the read group identifier and must start with "ID:", e.g. --outSAMattrRGline ID:xxx CN:yy "DS:z z z".

xxx will be added as RG tag to each output alignment. Any spaces in the tag values have to be double quoted.

--outSAMmapqUnique

int: 0 to 255: the MAPQ value for unique mappers - it's still not really meaningful, but can be set to any value for compatibility

--outSAMattributes

string: a string of desired SAM attributes, in the order desired for the output SAM

NH HI AS nM NM MD jM jI XS

NM and MD have the same meaning as in the samtools (nM is the number of mismatches per read pair).

Also, you cam make STAR generate the sorted BAM file directly while mapping:

--outSAMtype BAM SortedByCoordinate

Cheers

Alex

On Monday, February 29, 2016 at 11:29:15 AM UTC-5, Peter Waltman wrote:

OK, that was my suspicion (that the alt loci & patches should be excluded).

For those who might be interested, my solution was to do the following:

use the b37 reference available from the Broad - as this is guaranteed to work with the VCFs that they release (e.g. 1000G_phase1.indels.b37.vcf, etc.). Note, if you choose to use a different reference, you will likely need to sort the VCF's that the Broad makes available (I found this out the hard way when everything started failing at the BaseRecalibrator tool/walker step)
use the GTF of your choice.
my recommendation is to use the release 19 GTF from Gencode that only includes the annotation on the autosomal & sex chromosomes: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
In order to use this GTF with the b37 (GRCh37) version of the reference from the Broad, you will need to strip out all the 'chr' annotations and replace the 'chrM' with 'MT' annotations. The easiest way to do that is with sed, using the following command:
zcat gencode.v19.annotation.gtf.gz | sed 's/chrM/MT/1' | sed 's/chr//1' > gencode.v19.annotation.nochar.gtf
(NOTE that the '1' at the end of each sed call specifies that it only apply to the 1st field of the file/input)
Build the 1st pass index as follows:
STAR --runThreadN <Nthreads> \
--runMode genomeGenerate \
--genomeDir $(pwd)/ \
--genomeFastaFiles /path/to/Homo_sapiens_assembly19.fasta \
--sjdbGTFfile /path/to/gencode.v19.annotation.nochr.gtf \
--sjdbOverhang <reads_length - 1>

return the your regularly scheduled programming (as recommended by the GATK, article: https://www.broadinstitute.org/gatk/guide/article?id=3891 ,i.e. align you reads, etc.)

Peter Waltman

unread,

Feb 29, 2016, 5:51:13 PM2/29/16

to rna-star

Ahhh, that's awesome! I knew about the --outSAMattrRGline param, but didn't realize that I didn't realize that a user could get STAR to generate the bam directly. While STAR can generate a sorted bam, can/will it also mark PCR duplicates? Or will users need to do this step as well?

In an ideal world, it'd be really sweet, if STAR could be pipelined, i.e. STAR <params> --outSAMattrRGline SortedByCoordinate /dev/stdout | sambamba markdup <params>

Alexander Dobin

unread,

Mar 1, 2016, 12:18:27 PM3/1/16

to rna-star

Hi Peter,

there is an option in STAR to remove duplicates, but it's in its beta version. It requires a sorted BAM file as input, i.e. duplicate removal is not not simultaneously with mapping at the moment.

Also, it has a slightly different rule than the samtools rmdup, it requires matching CIGAR strings (i.e. matching junctions and indels) in duplicate reads. It's also not parallelized so it's probably not much faster than samtools rmdup.

You can pipe the sorted BAM output into the stdout:

STAR <params> --outSAMtype BAM SortedByCoordinate --outStd BAM_SortedByCoordinate | <yourProcess>

Note that the sorted file is generated only after all the mapping is done, so your downstream process will be idle for the most of the run.

Reply all

Reply to author

Forward