Hi Dat,
I'd probably suggest using the MatFileSampler in Selene and generating your own custom dataset. If you still want to use Selene's samplers to generate random sequences, I could imagine doing this a few ways:
- You can modify RandomPositionsSampler/IntervalsSampler or the Genome class (in `selene_sdk/sequences/genome.py`), adding code at the point where the sequence is randomly sampled, and post-process into k-mers from there.
- You can use Selene's API directly, rather than the configuration file, to generate samples and then modify them with your own code afterwards (i.e. `sequences, targets = sampler.sample(...)` and then `custom_process_function(sequences)`). You would then write the results to a file and use the MatFileSampler to train your architecture on that dataset. I may release an example script for pre-sampling with Selene soon, and will update here if we get it pushed to master.
- You could create a new class in `selene_sdk/sequences` that is similar to Genome or uses Genome within it to fetch sequences and then encodes them in a different way.
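
For the second option, here is a rough sketch of what `custom_process_function` could look like. This is not code from Selene: the helper names `decode_one_hot` and `to_kmers`, the `k` and `stride` parameters, and the assumed A/C/G/T column order of the one-hot encoding are all my own illustration, so please check them against what your sampler actually returns. The toy `batch` stands in for the `sequences` you would get back from the sampler.

```python
BASES = "ACGT"  # assumed column order of the one-hot encoding


def decode_one_hot(encoded, bases=BASES, unknown="N"):
    """Turn one (length x 4) one-hot matrix into a base string."""
    chars = []
    for row in encoded:
        values = list(row)
        best = max(range(len(values)), key=lambda i: values[i])
        # Rows without a clear 1 (e.g. all 0.25) are treated as unknown.
        chars.append(bases[best] if values[best] == 1 else unknown)
    return "".join(chars)


def to_kmers(sequence, k=3, stride=1):
    """Split a base string into k-mers taken every `stride` positions."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def custom_process_function(sequences, k=3, stride=1):
    """Convert a batch of one-hot encoded sequences into lists of k-mers."""
    return [to_kmers(decode_one_hot(seq), k, stride) for seq in sequences]


# Toy batch with a single sequence, ACGTA, in place of real sampler output.
batch = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]],
]
print(custom_process_function(batch, k=3))  # [['ACG', 'CGT', 'GTA']]
```

The functions only iterate over rows, so they should work on nested lists or NumPy arrays alike. You would still need to encode the k-mers numerically and write them out in a format the MatFileSampler can read.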