Interpretation of in silico mutagenesis

atla goutham

Aug 9, 2019, 9:16:25 AM

to Selene (sequence-based deep learning package)

Hi All,

This question is not directly about Selene, but about Seqweaver as implemented on the Selene framework: https://hb.flatironinstitute.org/asdbrowser/help (see the "Updated 2019.08.06" section).

I ran the pipeline on eQTLs (plus variants in high LD) and got DeepSEA and Seqweaver results, deepsea_results.tsv and seqweaver_results.tsv. Part of the output file is:

disease_impact_score    ref_match       chrom   pos     id      ref     alt     strand  8988T|DNase|None        AoSMC|DNase|None        Chorion|DNase|None      CLL|DNase|None  Fibrobl|DNase|
4.24310 True    chr19   38183232        19:38183232     A       C       .       1.1306e-01      1.6175e-01      1.0755e-01      1.1984e-01      1.3083e-01      1.4206e-01      1.5335e-01
4.24310 True    chr5    56205357        5:56205357      A       G       .       4.3788e-01      4.7426e-01      4.8993e-01      4.0483e-01      4.1506e-01      4.1625e-01      3.6664e-01
4.24310 True    chr17   73201477        17:73201477     T       C       .       3.0071e-01      3.4921e-01      2.2446e-01      2.8547e-01      2.9204e-01      2.8831e-01      3.1007e-01
4.24310 True    chr8    145008443       8:145008443     A       T       .       3.4862e-01      6.0394e-01      4.1469e-01      2.5603e-01      5.1025e-01      5.7487e-01      5.1747e-01

I understand that the first column is based on a model trained on HGMD. Are the rest of the columns absolute difference scores/PME scores for each feature? Could you please let me know how I can interpret them and how to threshold those scores? What is the best approach to find out which features are most affected by the variants?

Thanks,

Goutham A

Jian Zhou

Aug 9, 2019, 9:50:43 AM

to Selene (sequence-based deep learning package)

Disease impact scores (DIS) are on a z-score scale (as described in this manuscript: https://www.nature.com/articles/s41588-019-0420-0). Generally, the higher the DIS, the more similar the variant is to disease-causing (pathogenic) regulatory variants.

For absolute difference scores, the DNA model score is |ref_probability - alt_probability| for that specific feature, and the RNA model score is |ref_probability - alt_probability| / scaling_factor. The scaling factors for the RNA model are meant to make the scores comparable across SeqWeaver features, which differ greatly in their variance. RNA model absolute difference scores are therefore normalized scores: a difference of 1 corresponds to one standard deviation of the scores computed from SSC cohort mutations.
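As a minimal sketch of the two formulas above (not the authors' code), the calculation looks like this. The function names, the example probabilities, and the scaling factor value are all hypothetical and chosen only for illustration.

```python
def dna_abs_diff(ref_prob, alt_prob):
    """DNA model score: |ref_probability - alt_probability|."""
    return abs(ref_prob - alt_prob)


def rna_abs_diff(ref_prob, alt_prob, scaling_factor):
    """RNA model score: |ref_probability - alt_probability| / scaling_factor."""
    if scaling_factor <= 0:
        raise ValueError("scaling_factor must be positive")
    return abs(ref_prob - alt_prob) / scaling_factor


# Hypothetical values for one feature and one variant (not real model output).
ref_p, alt_p = 0.62, 0.50
dna_score = dna_abs_diff(ref_p, alt_p)
rna_score = rna_abs_diff(ref_p, alt_p, scaling_factor=0.04)

print(f"DNA abs diff: {dna_score:.2f}")
print(f"RNA normalized abs diff: {rna_score:.2f}")
```

With these made-up numbers the normalized RNA score comes out near 3, which under the description above would correspond to roughly three standard deviations of the SSC-derived score distribution for that feature.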

We don't currently recommend thresholds for these scores and encourage using them as is. If you are looking for pathogenic variants, a high DIS is likely the best indicator for selecting them. Absolute difference scores are useful if you are interested in a specific feature, or in variants that are functional in general (not necessarily pathogenic).
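One way to see which features a variant affects most, without imposing a threshold, is to sort the per-feature scores within each row. The sketch below uses only the Python standard library and the first sample row from the question, restricted to the four feature columns whose names are fully visible in the posted header. The helper name `top_features` and the set of metadata column names are assumptions for illustration, as is the pairing of values to feature names by column order.

```python
import csv
import io

# First sample row from the question, limited to the four feature columns
# whose names are not truncated in the posted header.
SAMPLE_TSV = (
    "chrom\tpos\tid\tref\talt\t"
    "8988T|DNase|None\tAoSMC|DNase|None\tChorion|DNase|None\tCLL|DNase|None\n"
    "chr19\t38183232\t19:38183232\tA\tC\t"
    "1.1306e-01\t1.6175e-01\t1.0755e-01\t1.1984e-01\n"
)

# Assumed names of the non-feature (metadata) columns in the results file.
META_COLUMNS = {
    "disease_impact_score", "ref_match", "chrom", "pos",
    "id", "ref", "alt", "strand",
}


def top_features(row, n=3):
    """Return the n (feature, score) pairs with the largest scores in a row."""
    scores = [
        (name, float(value))
        for name, value in row.items()
        if name not in META_COLUMNS
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:n]


reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
for row in reader:
    print(row["id"], top_features(row))
```

For a real results file, the same function could be applied to `csv.DictReader(open("deepsea_results.tsv"), delimiter="\t")`, assuming the file is tab-separated with a single header row.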

atla goutham

Aug 9, 2019, 9:54:29 AM

to Selene (sequence-based deep learning package)

Thanks Jian.

Zofie Lin

Jun 29, 2020, 4:22:14 AM

to Selene (sequence-based deep learning package)

Hi Jian,

I am running variant effect prediction with Selene using "deepsea.beluga.pth" (DeeperDeepSEA class), but the results only contain abs_diffs and logits for each chromatin feature. How can I get the disease impact scores (DIS)?

Thanks,

Zofie

Jian Zhou

Jun 29, 2020, 9:11:47 PM

to Selene (sequence-based deep learning package)

Hi Zofie,

To compute disease impact scores, you will need to use the code here: http://deepsea.princeton.edu/media/code/code_asd_dnarna_v3.tar.gz (instructions are included in a Google Doc linked from the readme file).

Also, the model "deepsea.beluga.pth" used in ExPecto is different from DeeperDeepSEA (even though the two models are similar), and it was not used for the disease impact score. In case you still want to use it, the model specification is available at https://github.com/FunctionLab/ExPecto/blob/master/chromatin.py .

Best,

Jian

Zofie Lin

Jul 6, 2020, 6:40:47 AM

to Selene (sequence-based deep learning package)

Hi Jian,

Many thanks for your reply! I have computed the DIS following your instructions and it runs well.

I still have a question about the two models, "beluga" and "DeepSEAasd": their architectures seem to differ only slightly. "beluga" is the one used in ExPecto and on the HumanBase website, right? Moreover, both are used to predict 2002 chromatin features.

So I am wondering what the main difference between these two models is, and how different their predictions will be. Is "DeepSEAasd" more specific to ASD?

Thanks,

Zofie

Jian Zhou

Jul 7, 2020, 8:11:14 PM

to Selene (sequence-based deep learning package)

Hi Zofie,

Thanks for the question! Indeed, the two models are similar in architecture and predict the same 2002 features. The main difference is that the "beluga" model takes 2000 bp of input sequence, while the "ASD" model takes 1000 bp. Both models are general and not specific to ASD. The "beluga" model is slightly newer and has some minor improvements in performance, but both models should produce similar results. HumanBase uses the "beluga" model and provides DIS that we retrained for it, but if you want to match the published version of DIS, you should use the code and model we shared for the ASD publication.
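To make the input-size difference concrete, here is a small sketch that computes the genomic window each model would need around a variant. The helper `input_window` is hypothetical, and the centering convention (0-based, half-open window with the variant just right of the midpoint) is an assumption; the actual pipelines may place the variant differently, so check the code you are running.

```python
def input_window(pos, sequence_length):
    """Return a 0-based, half-open (start, end) window around a 1-based position.

    Assumed convention: the variant base sits at index sequence_length // 2
    inside the window. This is an illustration, not the pipeline's own code.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    start = (pos - 1) - sequence_length // 2
    end = start + sequence_length
    return start, end


variant_pos = 38183232  # chr19 position from the sample output in the question

for model_name, length in (("ASD model", 1000), ("beluga", 2000)):
    start, end = input_window(variant_pos, length)
    print(f"{model_name}: chr19:{start}-{end} ({end - start} bp)")
```

Because the two windows differ in length, sequences prepared for one model cannot be fed to the other without re-extracting them at the right size.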

Best,

Jian

Zofie Lin

Jul 8, 2020, 6:10:06 AM

to Selene (sequence-based deep learning package)

Hi Jian,

Thank you so much and sorry to bother you again.

For my purposes, HumanBase is not as convenient as running predictions with Selene, so I am trying to use Selene's variant effect prediction. I am using the "DeeperDeepSEA" class (from Selene) with the trained weights "deepsea.beluga.pth" (from ExPecto), setting the input size to 2000 bp, and using "DIS_computation.py" from DeepSEAasd (with some modifications). However, the results are totally different from HumanBase, both for the chromatin effect of each feature and for the DIS. Is there any code for the version used in HumanBase that can be run with Selene?

Thanks,

Zofie

Kathy Chen

Jul 12, 2020, 10:48:30 AM

to Selene (sequence-based deep learning package)

Hi Zofie,

The DIS are not comparable between HumanBase and the ASD DeepSEA paper, because HumanBase computes DIS for the Beluga model, not the model from the ASD paper. I'll see if I can put some code together for that, though.

Thanks,

Kathy

Jian Zhou

Jul 12, 2020, 12:47:15 PM

to Selene (sequence-based deep learning package)

Hi Zofie,

If you want to use the DIS as published, you need to use the DeepSEA ASD model together with the DIS model trained on it. The "Beluga DIS" is provided in HumanBase as a summary score for the convenience of the user, but it is experimental and unpublished.

If you want to use the Beluga model with Selene, the ASD code we shared includes an example of using the DeepSEA ASD model with Selene, and you can modify that to use Beluga if you take the model specification from https://github.com/FunctionLab/ExPecto/blob/master/chromatin.py. However, as Kathy mentioned, the Beluga model is not compatible with the DIS model trained for DeepSEA ASD, and you should not use the DIS from the ASD code if you switch to Beluga.