deepsea_TF_intervals

Dewan Shrestha

Oct 28, 2019, 10:25:12 AM

to Selene (sequence-based deep learning package)

Hi,

I am a bit confused about the intervals file used in the tutorial. If `sorted_GM12878_CTCF.bed.gz` is used for training, validation, and testing, why are we using `deepsea_TF_intervals.txt` for sampling?

Kathy Chen

Oct 28, 2019, 3:47:49 PM

The intervals file restricts the regions from which you can sample the data. This means that even if `sorted_GM12878_CTCF.bed.gz` covers the entire genome, the `deepsea_TF_intervals.txt` file restricts the sampler so that it may only draw samples from intervals with (in this case) at least one TF binding event in the 200-bp bin. You can use the `RandomPositionsSampler` if you just want to sample across the whole genome.
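To make the idea of "restricting the sampler" concrete, here is a minimal sketch in plain Python. It is not Selene's implementation; the function name `sample_position` and the toy coordinates are made up for illustration. It only shows that every sampled position has to fall inside one of the listed intervals.

```python
import random


def sample_position(intervals, rng):
    """Pick one (chrom, position) uniformly from the bases covered by `intervals`.

    `intervals` is a list of (chrom, start, end) tuples with half-open,
    0-based coordinates, as in BED. Illustrative only, not Selene code.
    """
    # Weight each interval by its length so every covered base is equally likely.
    lengths = [end - start for _, start, end in intervals]
    chrom, start, end = rng.choices(intervals, weights=lengths, k=1)[0]
    return chrom, rng.randrange(start, end)


# Toy intervals (hypothetical coordinates).
intervals = [("chr1", 1000, 1200), ("chr1", 5000, 5400), ("chr2", 300, 500)]
rng = random.Random(0)
samples = [sample_position(intervals, rng) for _ in range(1000)]
```

Positions outside the listed intervals can never be drawn, which is the effect the intervals file has on sampling.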

(Realized the email reply doesn't forward to the Google Group... please post follow-up questions here! Thanks)

Dewan Shrestha

Oct 28, 2019, 3:54:32 PM

If I have a BED file with positive and negative cases and want to sample only from those regions (creating train, validation, and test data from the given BED file) without using a separate intervals file, is that possible?

Kathy Chen

Oct 29, 2019, 9:37:09 AM

What do you mean by positive and negative cases? Are you only training on the absence/presence of a single class? If so, you should probably just create a BED file of only the positives (i.e. regions containing that class) and then either use the `RandomPositionsSampler` for whole-genome negatives or use your original BED file as the intervals for the intervals sampler.
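If the BED file marks each region as positive or negative, pulling out the positives could look roughly like the sketch below. The thread does not say how the labels are stored, so the fourth-column `1`/`0` label and the function name `split_by_label` are assumptions made purely for illustration.

```python
def split_by_label(bed_lines, positive_label="1"):
    """Split BED-like lines into positive and negative regions.

    Assumes (hypothetically) tab-separated lines of the form
    chrom, start, end, label. Adjust the column index to your own file.
    """
    positives, negatives = [], []
    for line in bed_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        region = (fields[0], int(fields[1]), int(fields[2]))
        if fields[3] == positive_label:
            positives.append(region)
        else:
            negatives.append(region)
    return positives, negatives


# Toy input (hypothetical coordinates and labels).
bed_lines = [
    "chr1\t100\t300\t1\n",
    "chr1\t900\t1100\t0\n",
    "chr2\t50\t250\t1\n",
]
positives, negatives = split_by_label(bed_lines)
```

The positives would then be written out as the targets BED file, while the original file (positives plus negatives) could serve as the intervals to sample from.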

Dewan Shrestha

Oct 29, 2019, 4:20:47 PM

Thank you.

Zofie Lin

Jun 3, 2020, 6:41:38 AM

Hi,

I am also confused about the intervals file used for modeling.

Can `deepsea_TF_intervals.txt` only be used for training on transcription factor features, or can it also be used for other features such as DNase or histone marks?

If I want to train on a series of chromatin features, how can I generate the intervals file for them? Can I just use `deepsea_TF_intervals.txt`? Will it affect the result?

Many thanks.

Kathy Chen

Jun 3, 2020, 9:48:15 AM

Hi! Thanks for your question.

This kind of "interval restriction" is described in the DeepSEA 2015 manuscript as follows:
"We focused on the set of 200-bp bins with at least one TF binding event, resulting in 521,636,200 bp of sequences (17% of whole genome), which was used for training and evaluating chromatin feature prediction performance."

In short, the whole point of the intervals file is that you are changing your training from whole-genome sampling to sampling only from a specified set of regions. In the `deepsea_TF_intervals.txt` case, we are restricting to regions that have at least one TF measured in that region. This means that the model can still be trained for ANY chromatin profile (DNase, histone, TF, etc.), but it will not "see" any of the data that lies outside of those intervals.
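For anyone who wants to build a similar intervals file from their own peaks, here is a rough sketch of the "200-bp bins with at least one event" idea in plain Python. It is not the script used to produce `deepsea_TF_intervals.txt`. The function names are made up, and it treats any overlap between a peak and a bin as an event, which is a simplifying assumption; the original pipeline's overlap criterion may differ.

```python
from collections import defaultdict

BIN_SIZE = 200  # bin width in bp, matching the 200-bp bins in the quote above


def bins_with_events(peaks, bin_size=BIN_SIZE):
    """Return {chrom: set of bin indices} for bins touched by at least one peak.

    `peaks` is an iterable of (chrom, start, end), half-open and 0-based.
    Assumption: any overlap between a peak and a bin counts as an event.
    """
    covered = defaultdict(set)
    for chrom, start, end in peaks:
        if end <= start:
            continue  # ignore empty or malformed records
        first = start // bin_size
        last = (end - 1) // bin_size
        covered[chrom].update(range(first, last + 1))
    return covered


def merge_bins(covered, bin_size=BIN_SIZE):
    """Merge runs of consecutive bins into (chrom, start, end) intervals."""
    intervals = []
    for chrom in sorted(covered):
        run_start = prev = None
        for b in sorted(covered[chrom]):
            if prev is not None and b == prev + 1:
                prev = b  # extend the current run
                continue
            if prev is not None:
                # Close the finished run before starting a new one.
                intervals.append((chrom, run_start * bin_size, (prev + 1) * bin_size))
            run_start = prev = b
        if prev is not None:
            intervals.append((chrom, run_start * bin_size, (prev + 1) * bin_size))
    return intervals


# Toy peaks (hypothetical coordinates).
peaks = [("chr1", 150, 450), ("chr1", 1000, 1100), ("chr2", 390, 410)]
intervals = merge_bins(bins_with_events(peaks))
```

The resulting tuples could be written out one per line, tab-separated, to serve as an intervals file. Feeding in peaks from all of your chromatin features instead of only TFs would give intervals with at least one event of any kind.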

Depending on your application, you might find that a model that learns chromatin features from only these regions will detect more biologically relevant signal than one trained on the entire genome. So it's totally up to you which intervals file you choose. If you want to use `deepsea_TF_intervals.txt`, I'd just check how much of your training set intersects with that BED file to make sure there's sufficient training data in those regions. :)
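As a quick way to run that check, the sketch below computes the fraction of your regions' base pairs that fall inside a set of intervals. It is plain Python with made-up names (`overlap_bp`, `covered_fraction`) and toy coordinates. It assumes neither list contains overlapping entries on the same chromosome, otherwise bases get counted twice. For real genome-scale files, a dedicated tool such as `bedtools intersect` is a better fit than this quadratic loop.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals on the same chromosome."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def covered_fraction(regions, intervals):
    """Fraction of bases in `regions` that lie inside `intervals`.

    Both arguments are lists of (chrom, start, end) tuples. Assumes the
    entries within each list do not overlap one another.
    """
    total = 0
    covered = 0
    for chrom, start, end in regions:
        total += end - start
        covered += sum(
            overlap_bp(start, end, i_start, i_end)
            for i_chrom, i_start, i_end in intervals
            if i_chrom == chrom
        )
    return covered / total if total else 0.0


# Toy data (hypothetical coordinates).
regions = [("chr1", 100, 300), ("chr1", 900, 1000), ("chr2", 0, 100)]
intervals = [("chr1", 0, 200), ("chr1", 950, 1200)]
fraction = covered_fraction(regions, intervals)
```

A low fraction would suggest that most of your labeled regions sit outside the intervals and would never be sampled during training.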