Where in the code, and how, should I change the Selene pipeline to use k-mers instead of raw sequences?


Dat Duong

Feb 3, 2020, 11:04:42 PM
to Selene (sequence-based deep learning package)
Hi, I would like to try some variations on how to encode the nucleotides. Where in the Selene code should I make changes, and what would be the best strategy for converting the raw input sequences into k-mers (for example, 3-mers)? Thanks.

Kathy Chen

Feb 7, 2020, 12:17:42 PM
to Selene (sequence-based deep learning package)
Hi Dat,

I'd probably suggest using the MatFileSampler in Selene and generating your own custom dataset. If you still want to use Selene's samplers to generate random sequences, I could imagine doing this a few ways:

- You can modify RandomPositionsSampler/IntervalsSampler, or the Genome class (in `selene_sdk/sequences/genome.py`), adding code at the point where the sequence is randomly sampled and post-processing it into k-mers from there.

- You can use Selene's API directly, rather than the configuration file, to generate samples and then post-process them with your own code (i.e. `sequences, targets = sampler.sample(...)` followed by `custom_process_function(sequences)`). You would then write the results to a file and use the MatFileSampler to train your architecture on that dataset. I may release an example script for pre-sampling with Selene soon and will update here if we get it pushed to master.

- You could create a new class in `selene_sdk/sequences` that is similar to Genome or uses Genome within it to fetch sequences and then encodes them in a different way.
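
Whichever option you pick, the k-mer step itself could look roughly like the sketch below. This is plain NumPy, not part of Selene: `kmer_vocabulary` and `kmer_encode` are hypothetical helper names, the A, C, G, T base order is an assumption, and it works on sequence strings, so one-hot arrays coming out of a sampler would need to be decoded first (or the function adapted).

```python
import itertools

import numpy as np

BASES = "ACGT"  # assumed base order; adjust to match your own encoding


def kmer_vocabulary(k):
    """Map every possible k-mer over BASES to a column index."""
    kmers = itertools.product(BASES, repeat=k)
    return {"".join(kmer): index for index, kmer in enumerate(kmers)}


def kmer_encode(sequence, k=3, stride=1):
    """One-hot encode a sequence string as overlapping k-mers.

    Returns an array of shape (n_windows, 4 ** k). Windows containing
    a character outside BASES (e.g. 'N') are left as all-zero rows.
    """
    vocab = kmer_vocabulary(k)
    n_windows = max(0, (len(sequence) - k) // stride + 1)
    encoding = np.zeros((n_windows, len(vocab)), dtype=np.float32)
    for row in range(n_windows):
        start = row * stride
        index = vocab.get(sequence[start:start + k].upper())
        if index is not None:
            encoding[row, index] = 1.0
    return encoding


# Hypothetical usage: 3-mers of a short sequence with an unknown base.
encoded = kmer_encode("ACGTN", k=3)
print(encoded.shape)
```

Note that with `stride=1` the encoded length shrinks from L to L - k + 1 and the channel count grows from 4 to 4^k, so the first layer of the architecture would need to be adjusted to match.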

