atla goutham

Aug 6, 2019, 4:57:45 AM

to Selene (sequence-based deep learning package)

Hi All,

I am running selene_cli.py on enhancer sequences from my tissue of interest. There are ~36,000 enhancers, and I am taking 600 bp from the peak center. I am running Selene with the DeeperDeepSEA architecture under three different sampling setups:

1. Custom intervals: all open chromatin regions from the same tissue (n=133,931). The enhancers are a subset of these open chromatin regions.

2. Random sampler

3. Deepsea TF intervals.

Here are the three config files: https://www.dropbox.com/sh/texuhtuvyqs22zy/AACKKZxJFdG9QJgnq2FWaNnfa?dl=0

When I ran selene_cli.py, the validation output (selene_sdk.train_model.validation.txt) was as follows:

Custom_intervals.yml:

loss                      average_precision    roc_auc
2.1934749383945018e-05    1.0                  NA
1.025205165205989e-05     1.0                  NA
6.6757424974639434e-06    1.0                  NA
5.006802894058637e-06     1.0                  NA

RandomSampler.yml:

loss                    average_precision       roc_auc
0.0259966566034127      0.047024421248050424    0.8781358803654997
0.02558650181721896     0.05514831338163669     0.8890238953359162
0.024295823980821297    0.07944684755956975     0.9072231938391968
0.024937484529567882    0.06972641161781534     0.9126959874902972
0.023475201040739194    0.08703433436639303     0.9105172974661733

DeepSeaIntervals.yml:

loss                      average_precision    roc_auc
2.1934749383945018e-05    1.0                  NA
1.025205165205989e-05     1.0                  NA
6.6757424974639434e-06    1.0                  NA

It's still running, but I am wondering why the DeepSEA TF intervals and the custom regions give NAs for the AUC.

Also, the precision is low, although the AUC improves with DeeperDeepSEA. My ultimate aim is to do in silico mutagenesis on eQTLs from the same tissue, so I am wondering what precision would be good enough to consider the results reliable.

Thanks,

Goutham A

Kathy Chen

Aug 6, 2019, 11:40:10 AM

to selen...@googlegroups.com

You should check how much your input BED data intersects with the intervals that you passed in (and whether those peaks overlap by more than `center_bin_to_predict` * `feature_thresholds`). NAs mean that, in the validation set, every feature had fewer than `report_gt_feature_n_positives` (default: 10) positive samples.
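As a rough way to check this yourself, here is a minimal sketch that counts positives per feature in a validation target matrix and flags which features would clear the threshold. The function name and the toy matrix are hypothetical and not part of Selene; I am also assuming a strict greater-than comparison based on the parameter name, so treat the exact boundary as an assumption.

```python
import numpy as np

def features_with_enough_positives(targets, min_positives=10):
    """Count positives per feature and flag features above the threshold.

    targets: binary array of shape (n_samples, n_features).
    min_positives: stands in for `report_gt_feature_n_positives`
    (strict ">" is an assumption taken from the parameter name).
    """
    targets = np.asarray(targets)
    n_pos = targets.sum(axis=0).astype(int)
    return n_pos, n_pos > min_positives

# Toy validation targets: 100 samples, 3 features (made-up data).
toy_targets = np.zeros((100, 3), dtype=int)
toy_targets[:50, 0] = 1   # feature 0: 50 positives
toy_targets[:10, 1] = 1   # feature 1: exactly 10 positives
# feature 2: no positives at all

n_pos, scored = features_with_enough_positives(toy_targets)
for i, (n, ok) in enumerate(zip(n_pos, scored)):
    print(f"feature {i}: {n} positives, metric reported: {bool(ok)}")
```

If no feature clears the threshold in your validation set, there is nothing to average over, which would be consistent with the NA in the roc_auc column.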

Average precision depends on the number of positive vs negative samples per class 'seen' by the model during training.

If the average precision is very low, it is probably worth considering the degree of class imbalance as well as plotting the precision-recall curves (the curves behind AUPRC) to get more information about the precision distribution.
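To make the class imbalance point concrete, here is a small NumPy sketch that compares average precision against the fraction of positives, which is roughly what a random classifier would score. The function is a simplified stand-in written for illustration (it ignores tied scores, so it can differ slightly from library implementations), and the labels and scores are made up.

```python
import numpy as np

def average_precision(y_true, y_score):
    """Mean of precision@k taken at the rank of each positive sample.

    Simplified illustration: ties in y_score are not handled specially.
    """
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    hits = y_true[order]
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")  # undefined without any positive sample
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / n_pos)

# Made-up labels and scores for a single feature.
y_true = np.array([1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.2, 0.1])

ap = average_precision(y_true, y_score)
baseline = float(y_true.mean())  # fraction of positives
print(f"average precision: {ap:.4f}, positive fraction: {baseline:.4f}")
```

When positives are rare, an average precision that looks low in absolute terms can still be far above the positive fraction, so it helps to compare the two per feature rather than judging the raw value alone.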

You can get the test targets and predictions needed to plot these curves from EvaluateModel: http://selene.flatironinstitute.org/overview/cli.html#expected-outputs-for-evaluation

atla goutham

Aug 6, 2019, 11:49:36 AM

to Selene (sequence-based deep learning package)

Thanks for the answer. I will check those plots once it finishes running.


Selene with different samplers
