Niches and aims

Matúš Kalaš

Jul 13, 2015, 1:10:04 PM
The niches and aims of BioXSD are not absolutely fixed, but have been and will continue to be outlined by the following:

The technical niche is tree-structured formats that can be parsed, serialised, and validated using standards-based software libraries. Tree-structured formats are advantageous in certain scenarios, e.g. in object-oriented programming (examples of data format paradigms are in Fig. 1).
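
To make the "parsed and serialised with standard libraries" point concrete, here is a minimal sketch in Python using only the standard library. The element names (sequenceRecord, name, species, sequence) are made up for illustration and are NOT the real BioXSD schema; the point is only that a tree-structured record maps naturally onto an object and back. Validation against an XML Schema is not shown, since it would need a third-party library.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical record layout, for illustration only; NOT the real BioXSD schema.
RECORD_XML = """
<sequenceRecord>
  <name>example1</name>
  <species>Homo sapiens</species>
  <sequence>ACGTACGT</sequence>
</sequenceRecord>
"""


@dataclass
class SequenceRecord:
    name: str
    species: str
    sequence: str


def parse_record(xml_text: str) -> SequenceRecord:
    """Parse the hypothetical XML record into a plain object."""
    root = ET.fromstring(xml_text.strip())
    return SequenceRecord(
        name=root.findtext("name"),
        species=root.findtext("species"),
        sequence=root.findtext("sequence"),
    )


def serialise_record(record: SequenceRecord) -> str:
    """Serialise the object back into the same tree structure."""
    root = ET.Element("sequenceRecord")
    for tag in ("name", "species", "sequence"):
        ET.SubElement(root, tag).text = getattr(record, tag)
    return ET.tostring(root, encoding="unicode")


record = parse_record(RECORD_XML)
# Round trip: object -> XML text -> object
round_tripped = parse_record(serialise_record(record))
```

After running this, `record` and `round_tripped` hold the same field values, which is the kind of lossless object mapping that makes tree-structured formats convenient in object-oriented code.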

"Contentual" niche is sequence, alignment, and sequence|.*genome feature annotation data, covering the whole spectrum between:
  • simple sequence and alignment data (example in Fig. 2),
  • and complex integrated records of diverse features, phenomena, measured, or inferred data (e.g. structures, interactions, functional and evolutionary relations), with sequence|.*genome as the basis, and with metadata including provenance, reliability scores, and references to databases, taxonomies, tools, and ontologies (example outlined in Fig. 3).

The conceptual aim of BioXSD is NOT to replace all those tree-structured data formats used by a single tool or site, but instead to unify them.

Practically, this aim can be achieved by working together on integrating the requirements and great ideas reflected in the various single-site formats, and updating the involved tools and data resources with added support for the common format shared between all contributing sites (illustrated in Fig. 4).
Fig. 1. Data formatting paradigms.png
Fig. 2. Example sequence record.png
Fig. 3. Integrated feature record.png
Fig. 4. Common data format.png

Jens Preußner

Jul 21, 2015, 8:06:00 AM
Hi Matúš,

thanks a lot for your work here and the poster you presented at ISMB in Dublin this year. I really like the route BioXSD is taking!

Lately, I was thinking in the same direction, and with great input from Mike Amundsen and the ALPS community[1] we created bioALPS. Obviously, my literature search wasn't done very well, otherwise I would have stumbled on BioXSD much earlier. I'm not sure if you're aware of ALPS, but in a nutshell it exploits the profile link relation from RFC 6906 by linking a document to a profile, which in turn can constrain content or advertise conventions, just like an XML namespace document. Besides the semantics of the data, ALPS can also define "transitions in an intuitive structure with human-readable descriptions"[2]. With this, it's possible to additionally describe actions or transitions that make sense in the scope of the data (e.g. how to issue a BLAST search with a sequence). The section "A Short Hike into the ALPS" of [2] should give a sufficient overview to grasp the essentials.
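
As a rough illustration of the profile link relation, a server can point a client at a profile through an HTTP Link header with rel="profile". The sketch below is a simplified Python example of how a client might pick out such profile URIs. The function name profile_uris and the example.org URLs are made up for illustration, and the parsing is deliberately naive (it does not handle commas inside quoted parameter values), so treat it as a sketch rather than a complete Link header parser.

```python
import re


def profile_uris(link_header: str) -> list[str]:
    """Return the target URIs of all links whose rel includes "profile".

    Simplified: assumes each link-value looks like <URI>; param="value"
    and that parameter values contain no commas.
    """
    uris = []
    for match in re.finditer(r"<([^>]*)>([^,]*)", link_header):
        uri, params = match.group(1), match.group(2)
        rel = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', params)
        # rel may hold several space-separated relation types
        if rel and "profile" in rel.group(1).split():
            uris.append(uri)
    return uris


# Hypothetical header value; the URLs are placeholders, not real profiles.
header = (
    '<http://example.org/profiles/sequence>; rel="profile", '
    '<http://example.org/next>; rel="next"'
)
found = profile_uris(header)
```

Here `found` contains only the profile URI; a client could then fetch that profile document to learn which semantics and transitions apply to the response.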

For me, ALPS was the way to go to semantically describe the interplay between sequences, high-throughput experiments, samples, genome assemblies, sequence features, and experimental contrasts. What I like very much about ALPS is that it extends the constraints/conventions not only to the data format but also to the interaction with the underlying web service that serves the data. This can, in my view, lead to benefits in database integration and "smoother" workflows, as you put it in Figure 4 ;)

Please feel free to comment and let me know what you think!
