Running Case 2 on Personal Data

Katha Korgaonkar

Sep 4, 2019, 7:14:40 PM
to Selene (sequence-based deep learning package)
I'm currently running case 2 on my personal data. I've already trained a new architecture using the online sampler, but I am having trouble with step 2. Since I used a .bed file to train the architecture, I used the evaluate_test_bed.yml file provided in the config examples. Should the .bed file required by that config be the train_data.bed or the validate_data.bed file from the step 1 output? I'm currently trying to run it with train_data.bed, and when I run the eval.sh script I keep getting the following error:

Outputs and logs saved to /home/ubuntu/selene/manuscript/case2/evaluation_outputs/2019-09-04-22-27-47

Traceback (most recent call last):
  File "../../../selene_cli.py", line 32, in <module>
    parse_configs_and_run(configs, lr=arguments["--lr"])
  File "/home/ubuntu/selene/selene_sdk/utils/config_utils.py", line 341, in parse_configs_and_run
    execute(operations, configs, current_run_output_dir)
  File "/home/ubuntu/selene/selene_sdk/utils/config_utils.py", line 208, in execute
    evaluate_model = instantiate(evaluate_model_info)
  File "/home/ubuntu/selene/selene_sdk/utils/config.py", line 239, in instantiate
    return _instantiate_proxy_tuple(proxy, bindings)
  File "/home/ubuntu/selene/selene_sdk/utils/config.py", line 144, in _instantiate_proxy_tuple
    obj = proxy.callable(**kwargs)
  File "/home/ubuntu/selene/selene_sdk/evaluate_model.py", line 132, in __init__
    self.sampler.get_data_and_targets(self.batch_size, n_test_samples)
  File "/home/ubuntu/selene/selene_sdk/samplers/file_samplers/bed_file_sampler.py", line 244, in get_data_and_targets
    seqs, tgts = self.sample(batch_size=batch_size)
  File "/home/ubuntu/selene/selene_sdk/samplers/file_samplers/bed_file_sampler.py", line 160, in sample
    tgts[features] = 1
IndexError: index 9 is out of bounds for axis 1 with size 2

I don't know what is wrong. I've attached a picture of what my config file for step 2 looks like. Thank you!
Attachment: Screen Shot 2019-09-04 at 4.01.48 PM.png

Kathy Chen

Sep 5, 2019, 9:26:56 AM
to Selene (sequence-based deep learning package)
Thanks for your question!

I will look into possible bugs for the bed file sampler, but I first wanted to clarify a few things:

1. Evaluation should not be done on the training or validation data. (You can technically do that, but it's quite circular to do so since you trained your model on the training data and validated the performance against the validation data, so the performance from that kind of evaluation will not be a reliable way to gauge how well your model will do on completely unseen data.) You can generate a test_data.bed file with one of the samplers and evaluate your model with that input instead.

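2. This is a question about your configuration file: I noticed that your model predicts more than 20 features in total, but your config for the BedFileSampler specifies 2 features. These need to be consistent with each other, so I'm wondering why they are different. A mismatch like this would also be consistent with the error you posted, since feature index 9 cannot fit into a target array that only has 2 columns.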
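
The mismatch in point 2 can be illustrated with a minimal sketch. It assumes the sampler allocates one target slot per declared feature and then marks the feature indices listed for each sample. `mark_targets` is a hypothetical helper, not part of Selene, and plain Python lists stand in for the NumPy arrays Selene uses, so the exact error message differs from the traceback.

```python
def mark_targets(n_features, feature_indices):
    """Return a 0/1 target row with the given feature indices set to 1.

    Hypothetical helper for illustration; not Selene's implementation.
    """
    row = [0] * n_features          # one slot per declared feature
    for idx in feature_indices:
        row[idx] = 1                # IndexError if idx >= n_features
    return row


# Sampler declares only 2 features, but the data refers to feature index 9.
try:
    mark_targets(n_features=2, feature_indices=[9])
    mismatch_detected = False
except IndexError:
    mismatch_detected = True

# With enough declared features (28 is an illustrative count), it succeeds.
row = mark_targets(n_features=28, feature_indices=[9])
```

Under these assumptions, raising the declared feature count to match the model is what makes the assignment valid.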

Katha Korgaonkar

Sep 5, 2019, 2:19:02 PM
to Selene (sequence-based deep learning package)
Thanks Kathy!
How would I go about generating the test_data.bed file with the sampler? (I'm assuming I have to use the bed_file_sampler.py provided, but I am unsure what to do with it.) I am also unfamiliar with config files. Should n_features be the same as n_targets (the number of distinct features)? Thank you so much!

Kathy Chen

Sep 6, 2019, 8:50:11 AM
to Selene (sequence-based deep learning package)
One of the online samplers can do that. For example, the IntervalsSampler has the `save_datasets` parameter, which must have been set to `[train, validate]` at some point. If you added `test` to it, it would work. (The same goes for `RandomPositionsSampler`.)

`n_features` is the same as `n_targets` in this case. Note that `n_targets` is a parameter to the DeeperDeepSEA class and could have been named something else.
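
A quick pre-flight check along these lines can catch a mismatch before evaluation starts. This is only a sketch: `check_feature_counts` and its arguments are hypothetical names, not Selene functions, and the values would have to be read from your own model and sampler configuration.

```python
def check_feature_counts(n_targets, n_features, feature_names=None):
    """Return a list of human-readable problems; an empty list means consistent.

    Hypothetical helper for illustration; not part of Selene.
    """
    problems = []
    if n_targets != n_features:
        problems.append(
            f"model n_targets={n_targets} but sampler n_features={n_features}"
        )
    if feature_names is not None and len(feature_names) != n_features:
        problems.append(
            f"{len(feature_names)} feature names listed but n_features={n_features}"
        )
    return problems


# Illustrative values: a model with 28 outputs and a sampler declaring 2 features.
names = [f"feature_{i}" for i in range(28)]
for problem in check_feature_counts(28, 2, names):
    print(problem)
```

Returning a list of messages rather than raising on the first problem makes it easier to fix every inconsistency in one pass.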

Katha Korgaonkar

Sep 6, 2019, 5:55:25 PM
to Selene (sequence-based deep learning package)
Is there any way to create the test.bed file without running the training step again? I looked back at my train_online_sampler.yml file and I already had "test" in my save_datasets parameter, but for some reason the test.bed file was never written. I've attached a photo of my train_online_sampler.yml. Thanks!
Attachment: Screen Shot 2019-09-06 at 2.51.43 PM.png

Katha Korgaonkar

Sep 6, 2019, 6:32:37 PM
to Selene (sequence-based deep learning package)
Sorry! I also have another, unrelated question about the selene_sdk.train_model.validation.txt file that was output. I thought there would be 28 lines, one for every distinct feature. Instead, my file only has 14 lines. Do you know the reason for this? Thanks!

Kathy Chen

Sep 9, 2019, 8:44:26 AM
to Selene (sequence-based deep learning package)
Ah, so this is because you do not have the parameter `load_test_set` set to True. You can search for that parameter on http://selene.flatironinstitute.org/overview/cli.html to see what it does, but basically, by default the test dataset is not created until evaluation takes place. You can also just create an evaluate config now with the IntervalsSampler and run that on your trained model to get that test dataset. Here's an example config (note that you should use the same IntervalsSampler config you have in the training config): http://selene.flatironinstitute.org/overview/cli.html#evaluate-using-matrix-file-sampler

Kathy Chen

Sep 9, 2019, 8:45:47 AM
to Selene (sequence-based deep learning package)
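The validation file holds the average loss, AUC, etc. across all the features and is logged every N steps (`report_stats_every_n_steps`), so the number of lines depends on how many times validation was reported, not on the number of features.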
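
To see why the line count tracks the number of reports rather than the number of features, here is a small sketch. The function names and the step counts are hypothetical, chosen only so that the arithmetic gives 14, and any header line the real file may contain is ignored here.

```python
from statistics import mean


def validation_row(per_feature_scores):
    """Collapse one report's per-feature scores into a single averaged value."""
    return mean(per_feature_scores)


def count_reports(total_steps, report_stats_every_n_steps):
    """Number of validation reports written over a training run."""
    return total_steps // report_stats_every_n_steps


# 28 features still produce one averaged number per report.
scores = [0.5] * 14 + [1.0] * 14
averaged = validation_row(scores)

# Hypothetical run: 14,000 steps with a report every 1,000 steps.
n_rows = count_reports(14000, 1000)
```

Under these assumed settings the file would grow by one line per report, regardless of how many features the model predicts.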

Katha Korgaonkar

Sep 9, 2019, 7:04:07 PM
to Selene (sequence-based deep learning package)
Hey Kathy! Thanks so much. I'm still having trouble with the evaluate config file. I made a new evaluate config file going by the example you linked me. I also used my interval sampler from my training config. I've attached what my new config looks like. It keeps giving me this error: 

Outputs and logs saved to /home/ubuntu/selene/manuscript/case2/evaluation_outputs/2019-09-09-22-20-15

Traceback (most recent call last):
  File "../../../selene_cli.py", line 32, in <module>
    parse_configs_and_run(configs, lr=arguments["--lr"])
  File "/home/ubuntu/selene/selene_sdk/utils/config_utils.py", line 341, in parse_configs_and_run
    execute(operations, configs, current_run_output_dir)
  File "/home/ubuntu/selene/selene_sdk/utils/config_utils.py", line 208, in execute
    evaluate_model = instantiate(evaluate_model_info)
  File "/home/ubuntu/selene/selene_sdk/utils/config.py", line 239, in instantiate
    return _instantiate_proxy_tuple(proxy, bindings)
  File "/home/ubuntu/selene/selene_sdk/utils/config.py", line 144, in _instantiate_proxy_tuple
    obj = proxy.callable(**kwargs)
  File "/home/ubuntu/selene/selene_sdk/evaluate_model.py", line 134, in __init__
    if type(self.reference_sequence) == Genome and \
AttributeError: 'EvaluateModel' object has no attribute 'reference_sequence'

I don't understand why it's trying to get reference_sequence when that isn't a parameter for EvaluateModel objects. Would you mind helping me figure this out and also checking whether the rest of my config file looks correct? Thank you!
Attachment: Screen Shot 2019-09-09 at 3.44.41 PM.png

Kathy Chen

Sep 10, 2019, 5:38:45 PM
to Selene (sequence-based deep learning package)
Thanks for your patience! So this is actually a known bug that we caught earlier: https://github.com/FunctionLab/selene/issues/108

Are you using Selene locally (from a clone of the repository) or through a conda/pip installation? If you are using it locally, just run `git pull` and it should be fixed.

If you are using it through conda/pip, you can either set up Selene to run locally to get this fix immediately, or wait about a week for me to push the next release. I'm working on getting that release finalized right now. 

Katha Korgaonkar

Sep 11, 2019, 7:00:54 PM
to Selene (sequence-based deep learning package)
I think I was finally able to get it. Thank you so much for all your help Kathy!!