mapping to hg38/GRCh38

Alexander Predeus

Oct 10, 2015, 4:53:43 PM
to rna-star
Hello All,

I'm slowly transitioning to the GRCh38 genome assembly and the corresponding Gencode annotation (currently using v23).

I've read that GRCh38 includes more fragments describing alternative loci. However, I was not sure about the extent of the variation.

As far as I understand, there's a tradeoff in such cases: it's "good" to include new sequences, but it's not ideal to include the same gene or element twice (even with minor variation).

What patches and fragments would you recommend using with Gencode v23 and GRCh38.p3?

On a separate but related note, the new annotation generates a number of "short" splice junctions. Here are some stats: all junctions from sjdbList.out.tab that are smaller than 20 bp (size, frequency):

0 683
1 448
2 160
3 155
4 103
5 1
6 2
7 2
8 1
10 4
11 4
13 6
15 3
16 5
17 2
18 3

Is it still recommended to get rid of them? There are no "negative length" junctions, which were sometimes a problem in the past, just these.
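For anyone wanting to reproduce a tally like the one above, here is a minimal sketch. It assumes sjdbList.out.tab is tab-separated with chromosome, intron start, intron end, and strand in the first four columns, and it counts size as end - start + 1. The post does not say how size was computed, so the numbers may be off by one relative to the table. The function names are made up for illustration.

```python
from collections import Counter


def parse_junctions(lines):
    """Yield (chrom, start, end, strand) tuples from tab-separated lines.

    Assumed column order: chromosome, intron start, intron end, strand.
    Lines with fewer than three columns are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        strand = fields[3] if len(fields) > 3 else "."
        yield fields[0], int(fields[1]), int(fields[2]), strand


def junction_size(start, end):
    # Assumption: inclusive coordinates, so size = end - start + 1.
    return end - start + 1


def size_histogram(junctions, max_size=20):
    """Count junctions per size, keeping only sizes below max_size."""
    sizes = (junction_size(start, end) for _, start, end, _ in junctions)
    return Counter(size for size in sizes if size < max_size)


def drop_short(junctions, min_size=20):
    """Return only the junctions that are at least min_size long."""
    return [j for j in junctions if junction_size(j[1], j[2]) >= min_size]


# Toy input, not real annotation data.
example = [
    "chr1\t100\t104\t+\n",
    "chr1\t1000\t1999\t-\n",
    "chr2\t50\t49\t.\n",
]
junctions = list(parse_junctions(example))
for size, count in sorted(size_histogram(junctions).items()):
    print(size, count)
kept = drop_short(junctions)
```

The 20 bp cutoff is just the threshold used in the table above; whether short junctions should actually be dropped is the open question here.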

cheers

-- Alex Predeus

Kirill Tsyganov

Oct 10, 2015, 5:48:36 PM
to Alexander Predeus, rna-star
Hi Alex, 

I think it's great that you brought this question up here. I would also like to get some tips and advice on this matter.

So far I have been using Ensembl files, and I prefer them. I'm not entirely sure how the releases work, but here is a link to the latest release of the human DNA data: ftp://ftp.ensembl.org/pub/release-82/fasta/homo_sapiens/dna/. From what I understand that is still GRCh38; that is, Ensembl releases 76 through 82 all cover the GRCh38 assembly. I don't know what the differences are or which one to pick, so I just go for the latest release. Then there are two main options to choose from, TOPLEVEL and PRIMARY ASSEMBLY, and each also comes in unmasked, soft-masked, or hard-masked versions. It was very confusing at the start. I go with the soft-masked PRIMARY ASSEMBLY sequence, meaning the repeats are all in lower-case letters.

I don't feel that I can trust the GENCODE references 100%, although they are very good. I think I had a somewhat bad experience with one of their GTF files.

Yesterday a senior person actually suggested I use the NCBI reference instead; apparently it's very good for "mapping" to the reference genome. It has all of those alternative contigs removed, and some sequences around centromeres, which makes it very good as a reference. Here is the link: ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

I haven't used it, though.

As for ALT (alternative) contigs, from what I understand, unless you are doing some specialized analysis that looks at alternative loci usage or something like that, you don't need them.

Mapping to GRCh38 (human) from Ensembl, I am getting a very high mapping rate of over 95% (great), but a very low "uniquely mapped" rate of 34%. We are still investigating this, but it mainly has to do with reads mapping to BOTH chromosome and contig loci.
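One way to keep an eye on these two rates across runs is to read them from STAR's Log.final.out. This is a rough sketch, assuming the usual "field name | value" layout of that file. The numbers in the example text are made up for illustration and only loosely echo the rates above, and the function names are hypothetical.

```python
def parse_star_log(text):
    """Collect 'name | value' pairs from Log.final.out-style text."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" have no bar
        key, _, value = line.partition("|")
        key, value = key.strip(), value.strip()
        if value:
            stats[key] = value
    return stats


def percent(stats, key):
    """Convert a value such as '34.00%' to the float 34.0."""
    return float(stats[key].rstrip("%"))


# Illustrative text only; these are not real results.
example_log = """\
                          Number of input reads |\t1000000
                                   UNIQUE READS:
                        Uniquely mapped reads % |\t34.00%
                             MULTI-MAPPING READS:
             % of reads mapped to multiple loci |\t60.50%
             % of reads mapped to too many loci |\t1.00%
"""

stats = parse_star_log(example_log)
unique = percent(stats, "Uniquely mapped reads %")
multi = percent(stats, "% of reads mapped to multiple loci")
too_many = percent(stats, "% of reads mapped to too many loci")
print("unique:", unique, "multi:", multi, "total mapped:", unique + multi + too_many)
```

A large multi-mapping share next to a small unique share is the pattern one would expect if reads align equally well to a chromosome and to a duplicated contig.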

From what I understand, bwa mem is the only aligner that does something behind the scenes to account for ALT contigs.

I see this as a great opportunity to ask Alex Dobin: "How does STAR handle those new ALT contigs?"

I would love to hear more from others about ALT contigs and their approaches. I think it is a rather important issue.

Cheers, 

Kirill

--
You received this message because you are subscribed to the Google Groups "rna-star" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rna-star+u...@googlegroups.com.
Visit this group at http://groups.google.com/group/rna-star.

Alexander Dobin

Oct 14, 2015, 6:44:46 PM
to rna-star, pre...@gmail.com
Hi Alex, Kirill,


At present I recommend including unplaced/unlocalized scaffolds (which add several Mbases of new sequence; importantly, one of the scaffolds contains highly expressed rRNA),
but not including alternative loci and patches (which add some variations of existing sequences).

For recent releases of ENSEMBL, the "primary_assembly" files satisfy this recommendation, while the "toplevel" files include the alternative loci and patches, and may significantly increase the number of multimapping reads. The NCBI "no_alt_analysis_set" also satisfies this requirement, and I believe it should contain the same scaffolds as the ENSEMBL "primary_assembly" (and also EBV, which you may or may not want).
Unfortunately, the naming of the additional scaffolds differs between the NCBI and ENSEMBL files, which is inconvenient.
The GENCODE release contains yet another FASTA file, but that one includes the alternatives/patches, so it should not be used as is; it has to be filtered.
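The filtering step could look roughly like the sketch below. It streams a FASTA file and keeps only the records whose name passes a user-supplied test. The example predicate assumes UCSC-style names, where alternative loci end in "_alt" and fix patches in "_fix". That is an assumption about naming, not something every source follows: as noted above, the names differ between sources, so the predicate has to be adapted to the file at hand (for instance, by checking against an explicit list of names to keep). The function names are made up for illustration.

```python
def filter_fasta(lines, keep):
    """Yield the lines of FASTA records whose name satisfies keep(name).

    The record name is the first whitespace-delimited token after '>'.
    """
    writing = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            writing = keep(name)
        if writing:
            yield line


def is_primary_ucsc_style(name):
    # Assumption: UCSC-style naming ("..._alt" for alternative loci,
    # "..._fix" for fix patches). Other sources name these differently.
    return not name.endswith(("_alt", "_fix"))


# Toy records, not real sequence.
example = [
    ">chr1 some description\n", "ACGT\n",
    ">chr1_KI270762v1_alt\n", "GGGG\n",
    ">chrUn_GL000195v1\n", "TTTT\n",
]
filtered = list(filter_fasta(example, is_primary_ucsc_style))
print("".join(filtered), end="")
```

For a real genome one would pass an open file handle as `lines` and write the yielded lines to a new file, so the whole sequence never has to sit in memory.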

At the moment STAR does not handle alternative contigs in any specific way.
There is some discussion about what can be done here:
I am working on mapping to personal genomes, I think it's a better approach than trying to include alternative loci.

Cheers
Alex

Alexander Predeus

Oct 15, 2015, 5:25:39 AM
to rna-star, pre...@gmail.com
Thank you, this clarifies it a lot! Gencode actually has a no-alt assembly for mm10, hg19 and hg38.

So that makes it easier - no need to rename the patches in no-alt files provided by Ensembl. 

Yet another interesting discussion on transitioning to hg38: https://www.reddit.com/r/genome/comments/3b3s3t/switch_from_hg19build37_to_hg20build38/

-- Alex

 