OK, that was my suspicion (that the alt loci & patches should be excluded). For those who might be interested, my solution was to do the following:
- use the b37 reference available from the Broad, as this is guaranteed to work with the VCFs that they release (e.g. 1000G_phase1.indels.b37.vcf, etc.). Note: if you choose to use a different reference, you will likely need to re-sort the VCFs that the Broad makes available so that their contig order matches your reference (I found this out the hard way when everything started failing at the BaseRecalibrator tool/walker step).
- use the GTF of your choice.
- my recommendation is to use the release 19 GTF from Gencode that only includes the annotation on the autosomal & sex chromosomes: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
- In order to use this GTF with the b37 (GRCh37) version of the reference from the Broad, you will need to strip the 'chr' prefix from the chromosome names and rename 'chrM' to 'MT'. The easiest way to do that is with sed, using the following command:
- zcat gencode.v19.annotation.gtf.gz | sed 's/chrM/MT/1' | sed 's/chr//1' > gencode.v19.annotation.nochr.gtf
(NOTE that the '1' at the end of each sed expression restricts the substitution to the 1st match on each line, which for GTF records is the chromosome name in the 1st field. The chrM substitution has to come first; otherwise 'chrM' would end up as 'M' rather than 'MT'.)
- Build the 1st pass index as follows:
STAR --runThreadN <Nthreads> \
--runMode genomeGenerate \
--genomeDir $(pwd)/ \
--genomeFastaFiles /path/to/Homo_sapiens_assembly19.fasta \
--sjdbGTFfile /path/to/gencode.v19.annotation.nochr.gtf \
--sjdbOverhang <reads_length - 1>
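
If you are not sure what value to plug into --sjdbOverhang, here is a rough sketch for getting (longest read length - 1) straight from a FASTQ. The function name sjdb_overhang and the tiny FASTQ below are made up for illustration; it assumes a standard 4-line-per-record FASTQ on stdin.

```shell
# Hypothetical helper: print (longest read length - 1) for a FASTQ stream on stdin.
# Assumes 4 lines per record, with the sequence on the 2nd line of each record.
sjdb_overhang() {
  awk 'NR % 4 == 2 { n = length($0); if (n > max) max = n }
       END { print max - 1 }'
}

# Made-up two-read FASTQ (reads of length 8 and 6), for illustration only.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n' | sjdb_overhang
# prints 7
```

On real data you would pipe your own file through it, e.g. zcat /path/to/reads_1.fastq.gz | sjdb_overhang (path is a placeholder).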
- return to your regularly scheduled programming (as recommended by the GATK article: https://www.broadinstitute.org/gatk/guide/article?id=3891, i.e. align your reads, etc.)
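
For anyone nervous about the sed approach to renaming the chromosomes touching something other than the chromosome column, a stricter alternative is to edit only field 1 with awk. This is just a sketch, not what I actually ran: the function name rename_contigs is made up, the sample lines are invented, and it assumes a tab-delimited GTF whose header lines start with '#'.

```shell
# Hypothetical helper: rewrite only the 1st (tab-delimited) field of each GTF record.
# chrM becomes MT, any other leading 'chr' is dropped, '#' header lines pass through untouched.
rename_contigs() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }
       { if ($1 == "chrM") $1 = "MT"; else sub(/^chr/, "", $1); print }'
}

# Made-up GTF lines, for illustration only.
printf '##description: test\nchr1\tHAVANA\tgene\t1\t10\t.\t+\t.\tgene_id "x";\nchrM\tENSEMBL\tgene\t1\t5\t.\t+\t.\tgene_id "chrM_gene";\n' \
  | rename_contigs
```

With the real file this would be: zcat gencode.v19.annotation.gtf.gz | rename_contigs > gencode.v19.annotation.nochr.gtf. Note that the 'chrM_gene' in the attribute column of the sample is left alone, since only field 1 is edited.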