Question: in-house sequence database

Rutger Vos

May 23, 2023, 1:23:15 PM

to BioHackathon

Hi biohackers!

In the interest of not reinventing the wheel: does anyone know of any open source systems for managing marker sequences within an institution?

At my place of work, we generate or reuse DNA barcode sequences on a number of different platforms (e.g. ONT, PacBio, Sanger) and we want to aggregate them in a system so that we can:

deposit sequences from different internal workflows
add reference sequences from BOLD / GenBank / ENA
curate and extract reference subsets

In all cases, the sequences are a few hundred bp long and come from a variety of species, so there needs to be a good design for using multiple taxonomic backbones. Any suggestions?

Thanks!

Rutger

Peter Cock

May 23, 2023, 3:25:37 PM

to biohac...@googlegroups.com

I'd be curious too, I've been doing this on a project by project basis (i.e. marker specific databases), but they have all been very tied into the classification pipeline - and you need something more generic.

Peter

Repository: https://github.com/peterjc/thapbi-pict

Preprint: https://doi.org/10.1101/2023.03.24.534090

--
You received this message because you are subscribed to the Google Groups "BioHackathon" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biohackathon...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/biohackathon/339724db-2a06-4c5f-8a6d-1a3a247f7ec8n%40googlegroups.com.

Takeshi Kawashima

May 23, 2023, 5:34:17 PM

to biohac...@googlegroups.com

Dear Rutger and cc all

I'm not sure whether this answers your question.

You are discussing how to automatically curate eDNA / barcode results, right?

I'm afraid I can't speak to the institution level, only to a previous personal side project of mine. In other words, I have reinvented this wheel once before myself. I doubt the following is news to an expert such as yourself, but just for your information:

The Materials and Methods section of my paper below goes into somewhat more detail.

https://bioone.org/journals/zoological-science/volume-39/issue-1

"Observing Phylum-level metazoan diversity by environmental DNA analysis at the Ushimado area in the Seto Inland Sea". Kawashima T. et al.

First, in terms of reference sequences, I used two resources. One is Organelle Genome Resources from GenBank and the other is the barcode resources provided by JBIF. (JBIF is a GBIF-related organization in Japan.)

To link sequences to scientific names I took a very classical approach: I first obtained hits with BlastN and then mapped each hit to its NCBI Taxonomy ID. I wrote my own Ruby script for the mapping.

The script takes as arguments two files from the taxdump.tar.gz provided by NCBI Taxonomy: nodes.dmp and names.dmp. Given a taxonomy ID or scientific name as an additional argument, it walks back up the tree of nodes and reports the taxonomy information.
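
As a rough illustration of that kind of lookup, here is a minimal Python sketch (not the original Ruby script). It assumes the usual taxdump layout, where fields are separated by tab-pipe-tab, nodes.dmp starts with tax ID, parent tax ID and rank, and names.dmp marks the preferred name with the class "scientific name". The function names and the three-node toy data are made up for the example; real use would open the files unpacked from taxdump.tar.gz instead.

```python
import io


def split_fields(line):
    """Split one taxdump line into fields (separator is tab-pipe-tab)."""
    return line.rstrip("\t|\n").split("\t|\t")


def load_nodes(handle):
    """Read nodes.dmp: map each tax ID to its parent tax ID and its rank."""
    parent, rank = {}, {}
    for line in handle:
        fields = split_fields(line)
        tax_id = int(fields[0])
        parent[tax_id] = int(fields[1])
        rank[tax_id] = fields[2]
    return parent, rank


def load_names(handle):
    """Read names.dmp, keeping only the 'scientific name' entries."""
    names = {}
    for line in handle:
        fields = split_fields(line)
        if len(fields) > 3 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def lineage(tax_id, parent, rank, names):
    """Walk from tax_id up to the root; return (rank, name) pairs, root first."""
    path = []
    while True:
        path.append((rank[tax_id], names.get(tax_id, "unnamed")))
        if parent[tax_id] == tax_id:  # the root node is its own parent
            break
        tax_id = parent[tax_id]
    return path[::-1]


# Toy data in taxdump layout (hypothetical IDs and names, not real NCBI entries).
NODES = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tgenus\t|\n"
    "11\t|\t10\t|\tspecies\t|\n"
)
NAMES = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "10\t|\tExamplea\t|\t\t|\tscientific name\t|\n"
    "11\t|\tExamplea demo\t|\t\t|\tscientific name\t|\n"
    "11\t|\tdemo example\t|\t\t|\tcommon name\t|\n"
)

parent, rank = load_nodes(io.StringIO(NODES))
names = load_names(io.StringIO(NAMES))

# Lookup by scientific name goes through a reverse index to the tax ID.
by_name = {name: tax_id for tax_id, name in names.items()}
for node_rank, node_name in lineage(by_name["Examplea demo"], parent, rank, names):
    print(node_rank, node_name, sep="\t")
```

Holding both files in dictionaries keeps every lookup to a short walk up the tree, at the cost of loading the whole taxonomy into memory first.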

I think most people would find it hard to download the data and keep it up to date this way. However, until recently I worked at the National Institute of Genetics (NIG) in Japan, where the reference data on the hard disk was constantly being updated by NIG staff. So once my scripts were running I could get results semi-automatically, without too much stress.

That's all.

Rutger Vos

May 23, 2023, 7:42:04 PM

to biohac...@googlegroups.com

Hi both,

thanks very much for your thoughts! Keep them coming!

Peter, you can probably tell that I'm coming from the perspective that it's a shame there doesn't appear to be a good successor to BioSQL. Something a bit more current, NoSQL-ish, perhaps even denormalized and easily serialized into some machine-readable format (say, JSON), would be ideal.
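
To make that concrete, a denormalized record could look something like the sketch below: one self-contained JSON document per sequence, carrying its own taxon identifiers for several backbones. All class names, field names and identifiers here are hypothetical, not an existing schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TaxonRef:
    """One taxonomic assignment, tied to a named backbone."""
    backbone: str  # e.g. "NCBI", "GBIF", "BOLD"
    taxon_id: str
    name: str


@dataclass
class MarkerRecord:
    """A single marker sequence with everything needed to interpret it."""
    record_id: str
    marker: str
    platform: str
    source: str
    sequence: str
    taxa: List[TaxonRef] = field(default_factory=list)

    def name_in(self, backbone: str) -> Optional[str]:
        """Return the taxon name under one backbone, or None if not assigned."""
        for ref in self.taxa:
            if ref.backbone == backbone:
                return ref.name
        return None

    def to_json(self) -> str:
        # asdict() also converts the nested TaxonRef objects to plain dicts.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "MarkerRecord":
        data = json.loads(text)
        data["taxa"] = [TaxonRef(**ref) for ref in data["taxa"]]
        return cls(**data)


record = MarkerRecord(
    record_id="demo-0001",    # hypothetical internal identifier
    marker="COI-5P",
    platform="ONT",
    source="internal",
    sequence="ACGTACGTACGT",  # toy sequence; real barcodes are a few hundred bp
    taxa=[
        TaxonRef(backbone="NCBI", taxon_id="ncbi-placeholder", name="Examplea demo"),
        TaxonRef(backbone="GBIF", taxon_id="gbif-placeholder", name="Examplea demo"),
    ],
)

text = record.to_json()
restored = MarkerRecord.from_json(text)
print(text)
```

Keeping the backbone assignments as a list inside the record means a new backbone is just another list entry rather than a schema change, which is roughly the trade-off against a normalized BioSQL-style design.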

Takeshi, it sounds like we're basically in the same space. Indeed, we need good support for species names and taxonomic backbones, and one of the applications is metabarcoding. But what we need at my institution should be usable by many people (including lab analysts) and should hold many sequences. I'm hoping for that production-ready open source solution that I somehow haven't heard of yet ;-)

Rutger

Peter Cock

May 23, 2023, 7:57:10 PM

to biohac...@googlegroups.com

Actually no, I'd not thought of BioSQL in this context - although there
may be something sufficiently general here.

My example was also metabarcoding, primarily the ITS1 marker in
Phytophthora. Currently the database is in SQLite3 and only at
genus/species level (using the NCBI taxonomy), and built up using a
script from a mixture of NCBI search results (with automated primer
trimming), curated sequences, and a few single species samples we
sequenced as well. This is all under version control. I find conflicts
requiring manual review crop up often enough that I do the updates
periodically and semi-manually (rather than scheduling them to run
automatically once a month or similar). More notes here:

https://github.com/peterjc/thapbi-pict/blob/master/database/README.rst
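
For readers who want a feel for this kind of setup, here is a minimal sketch using Python's sqlite3 module. It is not the THAPBI PICT schema: the table, column and function names are invented, and it assumes that a "conflict" means the same marker sequence being recorded under more than one species, which is only one of the cases that might need manual review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE reference (
        id INTEGER PRIMARY KEY,
        marker TEXT NOT NULL,
        genus TEXT NOT NULL,
        species TEXT NOT NULL,
        source TEXT NOT NULL,   -- e.g. NCBI search, curated, in-house
        sequence TEXT NOT NULL
    )
    """
)


def add_reference(conn, marker, genus, species, source, sequence):
    """Store one reference sequence, upper-cased so comparisons are exact."""
    conn.execute(
        "INSERT INTO reference (marker, genus, species, source, sequence) "
        "VALUES (?, ?, ?, ?, ?)",
        (marker, genus, species, source, sequence.upper()),
    )


def find_conflicts(conn):
    """List sequences that are attributed to more than one species."""
    rows = conn.execute(
        """
        SELECT marker, sequence, GROUP_CONCAT(DISTINCT genus || ' ' || species)
        FROM reference
        GROUP BY marker, sequence
        HAVING COUNT(DISTINCT genus || ' ' || species) > 1
        ORDER BY marker, sequence
        """
    ).fetchall()
    # GROUP_CONCAT joins with commas in no guaranteed order, so sort the names.
    return [(marker, seq, sorted(names.split(","))) for marker, seq, names in rows]


# Toy entries with made-up species names.
add_reference(conn, "ITS1", "Examplea", "alpha", "NCBI search", "ACGTACGT")
add_reference(conn, "ITS1", "Examplea", "alpha", "in-house", "ACGTACGT")
add_reference(conn, "ITS1", "Examplea", "beta", "curated", "acgtacgt")
add_reference(conn, "ITS1", "Examplea", "gamma", "in-house", "TTTTCCCC")

conflicts = find_conflicts(conn)
for marker, seq, species_names in conflicts:
    print(marker, seq, " / ".join(species_names))
```

A report like this could be run before each periodic update, leaving the decision about each flagged sequence to a person.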

None of the other metabarcoding amplicon reference datasets I've put
together have been anywhere near as long running; most were one-offs.

Peter

Rutger Vos

May 27, 2023, 8:52:59 AM

to biohac...@googlegroups.com

Hi Peter,

Yeah, we’ve been doing that too. The plan is for something a bit more robust than SQLite because it’s meant for the institution as a whole, so the lab pushes data into it, the bioinformaticians do ETL on it to make blast dbs, and the researchers wanna see what we have and make target lists. Maybe we can modify BOLD5…
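
The ETL step mentioned here could be as small as the following sketch, which filters records by marker and writes FASTA text that a tool such as makeblastdb can index. The record layout (plain dicts with id, marker, species and sequence keys) and the function names are assumptions for illustration.

```python
def select_subset(records, marker):
    """Keep only the records for one marker."""
    return [rec for rec in records if rec["marker"] == marker]


def to_fasta(records, width=60):
    """Format records as FASTA text, wrapping sequences at `width` characters."""
    lines = []
    for rec in records:
        # The first word of the header is the sequence ID; the rest is description.
        lines.append(">{} {}".format(rec["id"], rec["species"]))
        seq = rec["sequence"].upper()
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy records with made-up identifiers and species names.
RECORDS = [
    {"id": "demo-1", "marker": "ITS1", "species": "Examplea alpha", "sequence": "acgtacgt"},
    {"id": "demo-2", "marker": "ITS1", "species": "Examplea beta", "sequence": "TTTTCCCC"},
    {"id": "demo-3", "marker": "COI-5P", "species": "Examplea gamma", "sequence": "GGGGAAAA"},
]

fasta = to_fasta(select_subset(RECORDS, "ITS1"))
with open("its1_refs.fasta", "w") as handle:
    handle.write(fasta)
```

The resulting file could then be indexed with something like: makeblastdb -in its1_refs.fasta -dbtype nucl -out its1_refs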

Rutger

On Wed, 24 May 2023 at 01:57, 'Peter Cock' via BioHackathon <biohac...@googlegroups.com> wrote:
