Selene with different samplers


atla goutham

Aug 6, 2019, 4:57:45 AM
to Selene (sequence-based deep learning package)
Hi All,

I am running selene_cli.py on enhancer sequences from my tissue of interest. There are ~36,000 enhancers, and I am using 600 bp around each peak center. I am running Selene with three different sets of sampling intervals, using the DeeperDeepSEA architecture:

1. Custom intervals (all open chromatin regions from the same tissue; the enhancers are a subset of these regions), n = 133,931.
2. Random sampler.
3. DeepSEA TF intervals.


When I ran selene_cli.py, the output (selene_sdk.train_model.validation.txt) was as follows:

Custom_intervals.yml :

loss                      average_precision    roc_auc
2.1934749383945018e-05    1.0                  NA
1.025205165205989e-05     1.0                  NA
6.6757424974639434e-06    1.0                  NA
5.006802894058637e-06     1.0                  NA

RandomSampler.yml:

loss                    average_precision       roc_auc
0.0259966566034127      0.047024421248050424    0.8781358803654997
0.02558650181721896     0.05514831338163669     0.8890238953359162
0.024295823980821297    0.07944684755956975     0.9072231938391968
0.024937484529567882    0.06972641161781534     0.9126959874902972
0.023475201040739194    0.08703433436639303     0.9105172974661733


DeepSeaIntervals.yml:

loss                      average_precision    roc_auc
2.1934749383945018e-05    1.0                  NA
1.025205165205989e-05     1.0                  NA
6.6757424974639434e-06    1.0                  NA


It's still running, but I am wondering why the DeepSEA TF intervals and the custom regions give NAs for the AUC.

Also, the average precision is low, although the AUC improves with DeeperDeepSEA. My ultimate aim is to do in silico mutagenesis on eQTLs from the same tissue, so I am wondering what average precision would be good enough to consider the results reliable.

Thanks,
Goutham A

Kathy Chen

Aug 6, 2019, 11:40:10 AM
to selen...@googlegroups.com
You should check the extent to which your input BED data intersects with the intervals that you passed in (and whether those peaks overlap by more than `center_bin_to_predict` * `feature_thresholds`). NAs mean that the validation set contained fewer than `report_gt_feature_n_positives` (default: 10) positive samples for all of the features.
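
The two checks above can be sketched in plain Python. This is only an illustration of the idea, not Selene's internal code: the function names, the 200 bp bin, the 0.5 threshold, and the exact boundary conditions (strict vs. inclusive comparisons) are all assumptions made for the example.

```python
def overlap_fraction(bin_start, bin_end, peak_start, peak_end):
    """Fraction of the half-open bin [bin_start, bin_end) covered by one peak."""
    overlap = min(bin_end, peak_end) - max(bin_start, peak_start)
    return max(0, overlap) / (bin_end - bin_start)


def is_positive(bin_start, bin_end, peaks, threshold):
    """Label the bin positive if any single peak covers more than `threshold` of it.

    Assumption: the threshold is applied per peak and as a strict inequality.
    """
    return any(
        overlap_fraction(bin_start, bin_end, start, end) > threshold
        for start, end in peaks
    )


def features_with_too_few_positives(targets, min_positives=10):
    """Return the column indices whose positive count is below `min_positives`.

    `targets` is a list of rows, one row per validation sample and one 0/1
    entry per feature. Assumption: a feature is skipped when its count is
    strictly below the cutoff.
    """
    n_features = len(targets[0])
    counts = [sum(row[i] for row in targets) for i in range(n_features)]
    return [i for i, count in enumerate(counts) if count < min_positives]


# Illustrative values only: a 200 bp center bin and a 0.5 overlap threshold.
peaks = [(150, 400)]
print(is_positive(100, 300, peaks, threshold=0.5))  # peak covers 150 of 200 bp

validation_targets = [[1, 0], [1, 0], [1, 1]]
print(features_with_too_few_positives(validation_targets, min_positives=2))
```

If every feature ends up in the second list, none of them has enough positives for the metric to be computed, which would be consistent with a column of NAs.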

Average precision depends on the number of positive vs. negative samples per class 'seen' by the model during training.
If the average precision is very low, it is probably worth considering the degree of class imbalance as well as plotting the precision-recall curves to get more information about how precision varies with recall.
You can get the test targets and predictions needed to plot these curves from EvaluateModel: http://selene.flatironinstitute.org/overview/cli.html#expected-outputs-for-evaluation
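
To put a low average precision in context, it can help to compare it with the fraction of positives for that feature, since a classifier that scores samples at random has an expected average precision of roughly that fraction. Below is a minimal pure-Python sketch for one feature. It assumes you have already loaded that feature's test targets (0/1) and predicted scores into two lists; the function names are made up for this example, and tied scores are not treated specially.

```python
def average_precision(targets, scores):
    """Mean of the precision values at each rank where a positive is found.

    Samples are ranked by descending score. Returns NaN when there are no
    positives, because the metric is undefined in that case.
    """
    n_positives = sum(targets)
    if n_positives == 0:
        return float("nan")
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_positives


def positive_rate(targets):
    """Fraction of positive samples: the approximate chance-level baseline."""
    return sum(targets) / len(targets)


# Toy data for illustration only, not taken from any real run.
targets = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(targets, scores), positive_rate(targets))
```

An average precision several times higher than the positive rate means the model ranks positives well above chance even when the absolute number looks small.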

atla goutham

Aug 6, 2019, 11:49:36 AM
to Selene (sequence-based deep learning package)
Thanks for the answer. I will check those plots once it finishes running.