Can seqweaver take sequence (.fa) instead of variant (.vcf) as input?

Weichen Song

Mar 19, 2022, 3:29:09 AM
to Selene (sequence-based deep learning package)
Hi Selene developer,
    Thanks for the great tool! I hope to predict how multiple variants together affect the RBP profile, so I suppose I should run Seqweaver on the sequence carrying all the variants and compare it against the reference sequence. Is it possible to give a FASTA file to vep_cli.py and output the 232 RBP profiles, from which I could manually calculate the difference between the ref and alt FASTA?

A related question: from reading the NG paper, it seems that the input sequence is 1000 bp long. If I want to take all variants in a gene into account, should I tile the full length with arbitrary 1 kb windows, or should I put the midpoint at a specific position? If the gene is shorter than 1 kb, is it okay to include some non-transcribed base pairs in the flanking regions?

Thanks in advance for your help!

chen.ka...@gmail.com

Mar 22, 2022, 10:29:49 AM
to Selene (sequence-based deep learning package)
Hi Weichen, 

Thank you for your message! 

You cannot input a FASTA file to `vep_cli.py` directly, but you can modify the script and the corresponding configuration YAML file(s) so that it outputs predictions for FASTA input. Here are some changes I would suggest:
```
# Inside vep_cli.py; `arguments` is the parsed command-line dict the script already defines.
def run_config(config_yml):
    configs = load_path(config_yml, instantiate=False)
    configs["prediction"]["input_path"] = arguments["<fasta>"]  # now takes a FASTA file as input
    configs["prediction"]["output_dir"] = arguments["<output-dir>"]
    parse_configs_and_run(configs)

run_config("configs/mouse_fasta.yml")
run_config("configs/human_fasta.yml")
```
In the `configs` directory, make a copy of each of the Seqweaver .yml files and make one change: remove the `variant_effect_prediction: { ... }` section entirely and instead use
```
prediction: {
    output_format: hdf5
}
```

Then you can take the difference between ref and alt FASTAs to get the SeqWeaver variant effect prediction difference scores (note DIS would be an additional step). 
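As a rough illustration of that subtraction step, here is a minimal sketch. It assumes the two runs wrote HDF5 files whose predictions sit in a dataset named `data` with one row per FASTA record and one column per RBP feature; the dataset name, the file names in the usage comment, and both helper functions are assumptions for illustration, so check them against your actual output files. It also assumes the ref and alt FASTAs list their records in the same order.

```python
import numpy as np


def load_predictions(path, dataset="data"):
    """Read a prediction matrix from an HDF5 file.

    The dataset name "data" is an assumption; inspect your file to confirm it.
    """
    import h5py  # third-party; imported here so diff_scores works without it

    with h5py.File(path, "r") as handle:
        return handle[dataset][:]


def diff_scores(ref_preds, alt_preds):
    """Return alt minus ref, element-wise, for matching prediction matrices."""
    ref_preds = np.asarray(ref_preds, dtype=float)
    alt_preds = np.asarray(alt_preds, dtype=float)
    if ref_preds.shape != alt_preds.shape:
        raise ValueError(
            f"shape mismatch: ref {ref_preds.shape} vs alt {alt_preds.shape}"
        )
    return alt_preds - ref_preds


# Hypothetical usage (file names are placeholders):
# ref = load_predictions("ref_predictions.h5")
# alt = load_predictions("alt_predictions.h5")
# scores = diff_scores(ref, alt)  # shape: (n_sequences, n_features)
```

The sign convention here is alt minus ref, so a positive score means the alternate sequence has the higher predicted probability for that feature.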

For your second question:
Seqweaver makes predictions for the center bin of a 1 kb input sequence, so I would recommend centering each variant. If the sequence extends outside of gene regions, you can replace the non-transcribed base pairs with 'N's in the FASTA files you generate.
If you use Selene with an input VCF file, Selene cannot replace those bases with 'N's, so some noise will probably be added. In practice, distant sequence has little effect on the predictions for high-impact variants, so it shouldn't make much of a difference.
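The centering and N-masking described above could be sketched roughly as follows when generating the FASTA sequences yourself. This is only an illustration for a single-nucleotide substitution: the function name, the 0-based half-open coordinates, and the choice to put the variant at index `window // 2` are all assumptions, so adjust the center offset to whatever convention your model setup expects.

```python
def centered_window(seq, pos, alt, tx_start, tx_end, window=1000):
    """Build ref and alt windows centered on a single-nucleotide variant.

    seq      : full reference sequence (e.g. one chromosome) as a string
    pos      : 0-based position of the variant in `seq`
    alt      : alternate base to place at `pos`
    tx_start : 0-based start of the transcribed region (inclusive)
    tx_end   : 0-based end of the transcribed region (exclusive)

    Positions outside the transcribed region, or beyond the ends of `seq`,
    are written as 'N'.
    """
    half = window // 2
    start = pos - half  # may be negative near the start of `seq`
    bases = []
    for i in range(start, start + window):
        inside = tx_start <= i < tx_end and 0 <= i < len(seq)
        bases.append(seq[i] if inside else "N")
    ref_window = "".join(bases)
    # The variant sits at index `half` of the window by construction.
    alt_window = ref_window[:half] + alt + ref_window[half + 1:]
    return ref_window, alt_window


# Small toy example with an 8 bp window instead of 1000 bp.
ref_window, alt_window = centered_window("ACGT" * 5, 10, "T", 8, 14, window=8)
print(ref_window)
print(alt_window)
```

Writing each pair of windows to matching ref and alt FASTA files, in the same record order, keeps the later subtraction straightforward.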

Please let me know if there's anything I can clarify further.

Thanks!
Kathy 