Hey Everyone!
I actually developed a Django application (freegenes) at the end of last year that serves genomic data, and while it would need some tuning up, I think it could be a solid starting base for the web interface you have in mind. I strongly advocate for development in Python (Django is a Python framework) because it's hugely important that others in the scientific community can contribute. There isn't a good way to comment on GitHub wikis, so I'd like to discuss some points here:
Galaxy:
I don't use it, but doesn't Galaxy provide an interface that might be a good start? Is there a reason to roll a new tool (what doesn't Galaxy do?)
Sequencer: "One recurring idea is to create an uploader where raw data from a sequencer (long reads and short reads) is loaded onto a backend and mapped using traditional tools as well as the variation graph/pangenome tools."
Is this part of the job of the interface? I'd also want to ask whether the sequencers in question have API endpoints that would allow for this.
Visualizer: "Next a visualization is generated of the viral strain in comparison with data we already have in the database."
Is this a database deployed by this web interface, or some other one?
The way I see this working is a bit different; here is how I'd think about the design:
1. The sequencer needs an API to push / notify of new data to parse
2. Some kind of trigger from the sequencer needs to ping an endpoint to start running a job (e.g., could these workflows be handled with snakemake?). For example, for Singularity Hub, a commit to GitHub (the trigger) pings the server to launch a container build (a separate instance) that then sends the container image to storage and notifies Singularity Hub. We'd want something similar here, but with the trigger coming from an authenticated sequencer, and instead of a container build, the launch of some cloud pipeline (snakemake, nextflow, take your pick!)
3. The job, on success, puts some result files in object storage, and pings this interface to add an entry for new data
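Step 2 is the piece I'd prototype first. Here's a minimal sketch of the authentication side, assuming a shared secret provisioned to the sequencer and an HMAC signature over the ping payload (the names `SEQUENCER_SECRET`, `sign_payload`, and `verify_sequencer_ping` are all my own invention, not any real sequencer API):

```python
import hashlib
import hmac

# Shared secret provisioned to the sequencer out of band (hypothetical)
SEQUENCER_SECRET = b"replace-with-a-real-secret"

def sign_payload(payload: bytes, secret: bytes = SEQUENCER_SECRET) -> str:
    """What the sequencer computes and sends along (e.g., in a header)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify_sequencer_ping(payload: bytes, signature: str,
                          secret: bytes = SEQUENCER_SECRET) -> bool:
    """What our endpoint checks before launching the pipeline."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # constant-time comparison so timing doesn't leak the signature
    return hmac.compare_digest(expected, signature)

# A verified ping would then kick off the workflow, e.g. something like:
#   subprocess.Popen(["snakemake", "--profile", "cloud"])
```

This is the same pattern GitHub webhooks use to authenticate to Singularity Hub, so I'd expect it to transfer well here.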
The interface serving the data should do little except receive the final metadata and serve an API that redirects to storage URLs for download (preferably signed URLs, unless you want some malicious user to be able to charge you up the wazoo). The workflows should be modular (this really comes down to packaging them in containers) and deployed with an orchestration tool that matches the sequencer API (e.g., if we use Python, we want snakemake). The metadata served by the interface should be structured (ontology: this is where you come in).
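On signed URLs: in practice the cloud provider's SDK generates these for you (Google Cloud Storage and S3 both support them natively), but the mechanics are easy enough to sketch with the standard library. The host, key, and query-parameter names below are made up for illustration:

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SIGNING_KEY = b"server-side-secret"  # stays on the server, never sent out

def make_signed_url(path, lifetime_s=300, now=None):
    """Return a download URL that expires lifetime_s seconds from now."""
    expires = int((now if now is not None else time.time()) + lifetime_s)
    msg = f"{path}:{expires}".encode()
    sig = hmac.new(SIGNING_KEY, msg, hashlib.sha256).hexdigest()
    return f"https://storage.example.org{path}?" + urlencode(
        {"expires": expires, "sig": sig})

def check_signed_url(url, now=None):
    """The redirect view verifies both the signature and the expiry."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    expires = int(params["expires"][0])
    msg = f"{parsed.path}:{expires}".encode()
    expected = hmac.new(SIGNING_KEY, msg, hashlib.sha256).hexdigest()
    current = now if now is not None else time.time()
    return hmac.compare_digest(expected, params["sig"][0]) and current < expires
```

The interface would hand out URLs like these on authenticated download requests, so a leaked link stops working after a few minutes instead of letting someone run up your egress bill.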
Let me know your thoughts! I'm really hopeful that I can help with this. I can whip up these Django interfaces very quickly, I've done quite a bit of development work for snakemake, and I've used Google Cloud a ton too.
Best,
Vanessa