BioALPS for biological/genome science


Jens von Preußen

Jul 31, 2014, 4:00:11 AM
to alp...@googlegroups.com
Dear all, hi Mike,

I'm currently working on profiles for biological entities such as sequence-based objects, experiments, samples, features and the like. The motivation is that the biological sciences have only a few major APIs (e.g. for the Protein Data Bank [1], the UniProt database [2] and a large REST service for KEGG, NCBI, EMBL and PDB [3]) and, as with many APIs, descriptors with the same meaning have different names, even for the same entity. So my question is about your experience in designing profiles. Is it advisable to take a more general approach that fits different entities (in bioscience we have, for example, genes, proteins, transcripts, exons...) into one profile, like a bioobject or bioentity? The drawback is that the common descriptors narrow down to a few, like accession, sequence, length, maybe reference, and metadata like last_modified, created_on and so on. Additionally, my approach would include general properties and features, using ontology terms (we have many ontologies in science) and descriptions for extensive annotation. Or is it better to have a separate profile for each entity, with the drawback that some descriptors are repeated?
Does anyone have experience in complex, interconnected fields and know of related work or previous efforts? Perhaps there is even a volunteer to review the current approach?

Best,
Jens

[1] http://www.rcsb.org/pdb/software/rest.do
[2] http://www.uniprot.org/faq/28
[3] http://togows.dbcls.jp/site/en/rest.html

mca

Jul 31, 2014, 9:17:14 AM
to Jens von Preußen, alp...@googlegroups.com
Jens:

first, welcome! great to see you here and to hear about the project you're working on.

now, on to some "advice" (in quotes because we are all still rather new at using ALPS since it is so young)....

when creating ALPS, i had the notion of something that would be the "shared understanding" for a problem space (like bio sciences, accounting, space satellite tracking, geological services, etc.). and ALPS is *not* meant to be only a data definition support. that's why transitions are an important part of the design. real life problem spaces involve not just *data* but also *actions*. so the ALPS documents outline the data and possible actions you can take in the problem space.

i also hope that ALPS documents will be useful to more than the original author, and that others will find an existing ALPS document and be able to build services or client apps using those documents w/o having to fork them. for that reason, I tend to design small-ish ALPS documents. each document ends up describing a real-life set of interactions (data + actions) that "make sense together." most of my ALPS documents have been small (<25 data elements and ~10-15 transitions) but i think that might be due to my personal preference for defining very small service interfaces when i build systems.

but, as i said at the top, this is a very limited experience base.  i know Mark Foster has been working w/ ALPS for a while in a production setting and he may have other advice to offer.

that may not address all of your Qs, but i hope it is a start. 

i'd love to learn more about what you are doing, how you are considering using ALPS and any (and *all*) challenges you run into as you explore ALPS. every experience (positive or negative) is welcome because that experience will help inform us all and offers a chance to improve the design of ALPS.

thanks and hope to hear more from you soon.

Jens Preußner

Aug 5, 2014, 8:04:10 AM
to alp...@googlegroups.com
Dear Mike,

thank you for the warm welcome! Of course, the "problem space" also has transitions, interconnections and hierarchical flavors, so the profiles I have in mind will outline possible actions (most of them are covered by IANA link relations, though).
Your advice to go for small-ish documents gave me confidence. I will put a first draft online (maybe with the state diagram, as suggested in your book, because you can grasp the problem space from it).

I have already faced some challenges, namely putting hypermedia controls into profiles. I looked through some of your examples on Git and noticed that you started to divide a document into "base elements" and "containers". The container descriptors (semantic) return "collection", which makes sense to me. In the example in your recent ALPS draft on Git, you had a descriptor (state transition) with the id "collection" but returning a "contact". Does that mean the whole profile describes some kind of collection? A more specific question: What's the best practice for including collections (as a standalone resource, with hypermedia controls like search) and items (also a standalone resource) in the very same profile document? In terms of the contact example, the approach in the spec is like the collection+json approach: a single contact is a collection of contacts with one entry. Would it make sense to have two profiles, one for the actual contact and one for the collection?

Best and thank you for your patience!
Jens

mca

Aug 6, 2014, 9:56:22 PM
to Jens Preußner, alp...@googlegroups.com
Jens:

<snip>
I already faced some challenges, namely putting hypermedia control into profiles. I looked through some of your examples on Git and noticed that you started to divide a document into "base elements" and "containers". The container descriptors (semantic) return "collection", that makes sense to me.
</snip>
that container pattern[1] is nothing special right now. it is a pattern i'm testing and it seems valuable. but there are definitely other ways to use ALPS documents. for example, i also have been using a pattern where the semantic elements are "free-floating" and all transitions have no "child" descriptors (meaning that any transition might have one or more of the semantic elements)[2].

It's my opinion that the container pattern makes sense when everyone (yes, every single server and client on the planet that will ever use this profile ;) can agree on the aggregate "objects." this is not likely for very large systems or systems that have a highly varied workflow and/or composition space. but simpler, less varying spaces should be well-served by the container pattern. this is all speculation since i've only implemented a small set of examples to this point. i am looking for more feedback (like yours!)
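
The two patterns described above can be contrasted in a minimal sketch. It writes two tiny profiles as Python dictionaries in the JSON form of ALPS. The ids "contact" and "collection" come from this thread; "givenName", "email" and "search" are assumed purely for illustration, and neither profile is taken from the actual examples on Git.

```python
import json

# Container pattern: a semantic "contact" descriptor aggregates the base
# elements, and the transition names that container as its return type ("rt").
container_profile = {"alps": {"version": "1.0", "descriptor": [
    {"id": "givenName", "type": "semantic"},   # hypothetical base element
    {"id": "email", "type": "semantic"},       # hypothetical base element
    {"id": "contact", "type": "semantic", "descriptor": [
        {"href": "#givenName"}, {"href": "#email"}]},
    {"id": "collection", "type": "safe", "rt": "#contact"},
]}}

# Free-floating pattern: semantic elements stand alone and transitions carry
# no child descriptors, so any transition may involve any of the elements.
free_floating_profile = {"alps": {"version": "1.0", "descriptor": [
    {"id": "givenName", "type": "semantic"},
    {"id": "email", "type": "semantic"},
    {"id": "collection", "type": "safe"},
    {"id": "search", "type": "safe"},          # hypothetical transition
]}}

def containers(profile):
    """Return ids of top-level semantic descriptors that nest other descriptors."""
    return [d["id"] for d in profile["alps"]["descriptor"]
            if d.get("type", "semantic") == "semantic" and "descriptor" in d]

print(containers(container_profile))      # the aggregate "objects" everyone must agree on
print(containers(free_floating_profile))  # no aggregates to agree on
print(json.dumps(free_floating_profile, indent=2))
```

With the container pattern, every client and server has to accept "contact" as an aggregate; the free-floating profile leaves that grouping to each implementation.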

<snip>
In the example in your recent ALPS draft on Git, you had a descriptor (state transition) with the id "collection" but returning a "contact", which makes the whole profile describing some kind of collection? 
A quite specific question would be: Whats the best practice to include collections (as a standalone resource, with hypermedia control, like search) and items (also a standalone resource) into the very same profile document? In terms of the contact-example, the approach in the specs is like the collection+json approach: A single contact is a collection of contacts with one entry. Would it make sense to have two profiles, one for the actual contact and one for the collection?
</snip> 
right now ALPS does not offer a way to express "cardinality" and ALPS profiles should be assumed to ALWAYS represent collections. IOW, a container pattern in ALPS expresses a "one-or-more" model. I should say that cardinality *was* in the earliest draft of the spec but has dropped out since so many samples i built ALWAYS had cardinality="multiple" ;)

hope this helps. looking forward to your continued feedback.


Jens Preußner

Aug 7, 2014, 11:26:16 AM
to alp...@googlegroups.com
Hi Mike,

very clear and helpful answer! Thank you. I created a repo at https://github.com/jenzopr/bioalps and uploaded a draft of the assembly profile. A description of the particular problem space is included in the readme. As you can see, some questions remained open during drafting:
  • Base descriptors are created "free floating", as you suggested for highly varying domains. Nevertheless, I added a container for an assembly (it can be seen as an object or an aggregation of descriptors) that is returned by some state transitions. I'm not sure if this is necessary or if returning something else (maybe "item") is more appropriate.
  • I wish to include "cross-references" and hierarchy, so I gave every object a safe state transition to itself (called item), which means that there might be a clash between the assembly's item descriptor and some other item descriptor. I'm still looking for a good way to include a safe transition to another representation.
  • I also included semantic descriptors inside the transitions container. Maybe I don't need them at all?

Best,

Jens


mca

Aug 9, 2014, 6:16:54 PM
to Jens Preußner, alp...@googlegroups.com
Jens:

seems solid to me

however this is one thing that confuses me a bit:
<descriptor href="bioproperty#property" /> <descriptor href="bioobject#item"> <descriptor href="bioobject#link"/>

descriptor hrefs SHOULD be resolvable URLs (maybe this is not clear in the docs) and these are not (unless i am missing something).

do you mean these to point to some other ALPS doc? 

is the "bioproperty" actually a placeholder, like a template value? like this:
<descriptor href="{bioobject}#link"/>
or is there something else you mean that i've missed?
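
To illustrate the resolvability point, here is a minimal sketch of how a relative descriptor href would resolve against the URL of the document that contains it. The base URL is an assumption made up for this example; the thread does not say where the profile is hosted.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical location of the profile document (an assumption, not from the thread).
PROFILE_URL = "http://example.org/profiles/assembly.xml"

def resolve_descriptor_href(href: str, base: str = PROFILE_URL) -> str:
    """Resolve a descriptor href against the URL of the containing document."""
    return urljoin(base, href)

def is_absolute(href: str) -> bool:
    """True if the href carries its own scheme and host."""
    parts = urlparse(href)
    return bool(parts.scheme and parts.netloc)

for href in ["bioproperty#property", "bioobject#item", "bioobject#link"]:
    print(href, "->", resolve_descriptor_href(href), "| absolute:", is_absolute(href))
```

An href like "bioproperty#property" only becomes a resolvable URL if a document actually lives at the resolved address, which is what the question above is probing.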

maybe a gist that shows a sample resolution of this ALPS to another runtime format (HTML? Cj?, etc.) would clear up my questions.

cheers.

Jens Preußner

Aug 11, 2014, 3:21:20 AM
to alp...@googlegroups.com
Thanks Mike, I completely forgot to clean up the references. I will put up a revised version in a few minutes, together with the other "missing" documents that were referenced. Additionally, I removed the included hypermedia controls and replaced them with the IANA relations search, edit-form and create-form. I currently have the feeling that this adds a bit more flexibility. But maybe I'm wrong :)
Have a look at the gist for the assembly: https://gist.github.com/6bab8db26fb2fb767c08.git

Best,
Jens

mca

Aug 11, 2014, 9:36:11 AM
to Jens Preußner, alp...@googlegroups.com
as valid Cj, this certainly works.  

curious on why you created a _properties section for each item instead of expressing those values as "data" elements.

Jens Preußner

Aug 11, 2014, 9:47:33 AM
to alp...@googlegroups.com
This is because every property is described by a "term" and a "value" (see [1]). I would need a nested structure inside the data elements to include it, right?

[1] https://github.com/jenzopr/bioalps/blob/master/bioproperty.xml

mca

Aug 11, 2014, 10:44:59 AM
to Jens Preußner, alp...@googlegroups.com
the challenge you have is that you are creating a Cj representation that puts vital info where no generic Cj client will see it.  you now need to create a custom Cj client for this _property extension.

i think a better approach is to put these properties as name/value items in the data collection:

the above example will work for every generic Cj client that exists today.
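
A minimal sketch of the name/value approach, written in Python: each term/value property becomes one ordinary element of the item's Cj data array. The URLs, the title and the property terms ("blood group", "height") are assumptions chosen for illustration and are not taken from the gist.

```python
import json

def properties_to_cj_data(properties: dict) -> list:
    """Turn term/value pairs into plain Cj data elements."""
    return [{"name": term, "value": value, "prompt": term}
            for term, value in properties.items()]

# Hypothetical item: regular fields and properties share one flat data array.
item = {
    "href": "http://example.org/assemblies/1",
    "data": [{"name": "title", "value": "Example assembly", "prompt": "Title"}]
            + properties_to_cj_data({"blood group": "B", "height": "178cm"}),
}

document = {"collection": {
    "version": "1.0",
    "href": "http://example.org/assemblies/",
    "items": [item],
}}
print(json.dumps(document, indent=2))
```

Because only the standard name, value and prompt members are used, a client that knows nothing about the bioproperty profile can still display every property.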

what's missing from ALPS right now is a formal "mapping" document from ALPS -> Cj. that would solve much of the problem of how ALPS semantic and transition elements are encoded into Cj. i guess that's a project that needs some attention!  

cheers.

Jens Preußner

Aug 11, 2014, 11:26:13 AM
to alp...@googlegroups.com
Hi Mike,
I agree that it's not visible to a generic client. Is putting the properties into the data element not widening the semantic gap again? I mean, the property terms used as names of the semantic descriptors could be anything and are not documented in the profile itself.
A workaround would be to see each property as its own resource and include them as a link relation - but actually I don't like this idea very much, because (1) the overhead introduced by fetching those resources via GETs could be great for objects with many properties (given the case that a client would like to see the properties within the representation of the object) and (2) many existing resources already have the properties directly included and they will very likely not change that.

Best!


mca

Aug 11, 2014, 11:49:09 AM
to Jens Preußner, alp...@googlegroups.com
<snip>
Is putting the properties into the data element not widening the semantic gap again?
</snip>

great comment.

mapping ALPS semantic details to Cj message details is the "magic." we don't need to make Cj "look like" the internal model (e.g. the same nesting, object shapes, etc.). we only need to make Cj _carry_ the semantic details in a way that apps on each end (client and server) can consistently convert into their own internal models.

with this in mind, it is not important that the wire-level message have an actual "_property" section. but it IS important that both client and server can recognize the parts of the message that are the properties of an item. does that make sense?

one way to make this happen is to decorate the data elements with domain-specific information. for example, i can add a Cj extension which indicates the "type" of a single data element as "property":

now, generic clients can deal with this message just fine. at the same time, custom clients that recognize the internal concept of "property" can find the semantic details needed to construct a helpful internal model.
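
A minimal sketch of the decoration idea, assuming the extension is an extra "type" member on each data element. The member name follows the description above; the element names and values are made up for illustration.

```python
# Hypothetical Cj data array: two elements are decorated as properties.
data = [
    {"name": "title", "value": "Example assembly"},
    {"name": "blood group", "value": "B", "type": "property"},
    {"name": "height", "value": "178cm", "type": "property"},
]

def generic_view(elements):
    """A generic client reads name/value pairs and ignores unknown members."""
    return [(e["name"], e["value"]) for e in elements]

def property_view(elements):
    """A custom client rebuilds its internal 'properties' concept from the decoration."""
    return {e["name"]: e["value"] for e in elements if e.get("type") == "property"}

print(generic_view(data))   # every element is usable without domain knowledge
print(property_view(data))  # only the decorated elements
```

The wire format stays flat; the nesting exists only in the internal model of a client that chooses to build it.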

does this make sense?

WARNING: RANT AHEAD
Now, pardon me, while i rant on a bit. hopefully this will help you get the sense of what i think ALPS can do for us as we build widely dist apps.

IMO, a successful approach to creating systems that both offer wide support for generic clients AND provide a way to share domain-specific semantic details is to separate the internal model (objects, properties, relationships, etc.) from the external message (HTML, Cj, Siren, Atom, etc.).

the gap occurs when there is no common *bridge* between internal and external. 

the most common way to attack this problem to date is to make the external *match* the internal model. IOW, object serialization. we do this every day. SOAP was based on this pattern (serialize object trees within a message envelope) and most all JSON-based services do this by offering custom objects on the wire via JSON nested hash-tables and arrays.  

the challenge w/ exposing object models on the wire is that we all need to agree on the object model *before* we can write the apps. 

ALPS is an alternate attempt at this problem of bridging internal and external models to close the gap.

to do that, ALPS offers a way to consistently talk about the essence of the internal model (the data and actions) w/o actually forcing us to expose our internal object details. and ALPS is a _standardized_ way to do that. so that even generic clients that don't understand your internal model can understand the essence of that model.

formats (like HTML, Cj, HAL, Atom, etc.) offer ways to consistently talk over the wire using a shared _external_ model. the format IS the model. this is proven to work well since HTML has been around for close to 25 years and we have a powerful (if not always perfect) generic client for it. 

the challenge of message formats is that, the wider their applicability, the less domain-specific semantics are designed-in. and that's (again) where ALPS comes in. 

i hope my ranting offers some useful info. thanks for the chance to bring it up ;)

cheers.



Jens Preußner

Aug 11, 2014, 12:16:10 PM
to alp...@googlegroups.com
Totally makes sense to me. I'm very pleased to see how the domain-specific information works here - this can also solve problems with features in the bioobject profile.

By the way, the rant was really useful for me! Thank you very much for your constant effort!

mca

Aug 11, 2014, 12:18:10 PM
to Jens Preußner, alp...@googlegroups.com
jens:

thanks for starting this convo. it's a great help to expose weak parts of the design and missing explanations, etc.

please keep me posted on how things progress and keep pushing the envelope.

FWIW, your Qs helped me sort out some things i think need to be improved in Cj, too.

thanks.

Jens Preußner

Dec 12, 2014, 9:54:15 AM
to alp...@googlegroups.com, jens.von...@googlemail.com
Hi Mike & all,
I recently worked on the bioalps and some "real world" examples, i.e. how you would represent a resource in Cj with associated ALPS profiles. One question that I stumbled on is the following:
We sometimes have to send large amounts of measured values. Until now, I added a semantic descriptor called measurement and gathered the values there, which means they end up in the server's response.
Since the client may only be interested in the associated metadata, I thought about defining a safe state transition that can be used to fetch the data in raw (let's say CSV) format at a later point in time.
Would you recommend that? Or should "raw" data be included in the primary server response?

Pros and cons I came up with having data behind another transition:
Pros: The client can decide on its own if it wants to see the raw data or if metadata is enough.
Cons: A second GET has to be done and raw data would be a "dead end" in terms of hypermedia control. There would be no profile/hypermedia control in the raw file served.
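
The option of keeping raw data behind a separate transition could look roughly like this sketch. The link relation "raw-data", the URLs and the data element names are all hypothetical; nothing here is defined by ALPS, Cj or IANA.

```python
# Hypothetical Cj item: metadata travels in "data", the bulky measurements
# stay behind a link that the client may or may not follow.
item = {
    "href": "http://example.org/experiments/42",
    "data": [
        {"name": "title", "value": "Example experiment"},
        {"name": "measurement-count", "value": 120000},
    ],
    "links": [
        {"rel": "raw-data",
         "href": "http://example.org/experiments/42/measurements.csv",
         "prompt": "Raw measurements (CSV)"},
    ],
}

def find_link(cj_item: dict, rel: str):
    """Return the href of the first link with the given rel, or None."""
    for link in cj_item.get("links", []):
        if link["rel"] == rel:
            return link["href"]
    return None

# The client decides whether the second GET is worth it.
print(find_link(item, "raw-data"))
```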
What do you think?
Thank you and have a nice weekend!

mca

Dec 14, 2014, 10:32:16 AM
to Jens Preußner, alp...@googlegroups.com
Jens:

can you post some examples (maybe gists) of what you are doing right now (and/or thinking of doing)?

IIRC, we talked before about an extension that allows "mass data" passing -- not sure if it was with you, tho.

also interested in the details of your challenge both on the WRITE operations and the READ operations.

looking forward to hearing more about what you're working on.

cheers.

Jens Preußner

Dec 17, 2014, 11:38:07 AM
to alp...@googlegroups.com, jens.von...@googlemail.com
Hi Mike,

I started working out examples and got stuck at several points. We can solve them one by one and here is the first one :)

First, consider this gist: https://gist.github.com/jenzopr/3fb7b358be060711c7da
It contains a profile.xml with three (four) semantic descriptors: A title and a property, which consists of a term and a value. It also has a state transition "search".

The example Cj document shows an empty collection, but implements queries and templates. Now, here are my thoughts:
  1. If I got it right, the queries section implements a template for constructing the URL of a "GET". In this case it would be something like
    http://example.com/collection/search?title=abcde&term=blood+group&value=B
  2. Fine for now. Since ALPS indirectly implies multiplicity, a resource can easily have two properties, i.e. two term-value pairs. To make them searchable at the same time I could flatten them, e.g. have semantic descriptors called "blood group" and "height". The drawback: I would lose the flexibility that the implicit multiplicity of ALPS gave me in the first place. For the Cj part, I have to find a way to encode a template for a URL like
    http://example.com/collection/search?title=abcde&term[]=blood+group&value[]=B&term[]=height&value[]=178cm
    or
    http://example.com/collection/search?title=abcde&1-term=blood+group&1-value=B&2-term=height&2-value=178cm
  3. Now for the templating, i.e. creating or editing a resource: I added two possible options for how one could construct a template for the property descriptor. The second option is nevertheless a bit nasty when more than one property should be included.
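
The repeated-parameter variant from point 2 can be sketched with Python's standard library. The base URL and parameter names follow the example URLs above; pairing terms with values by position is an assumption about how a server might read such a query, not something Cj defines.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://example.com/collection/search"

def search_url(title, properties):
    """Build a search URL with one term[]/value[] pair per property."""
    params = [("title", title)]
    for term, value in properties:
        params.append(("term[]", term))
        params.append(("value[]", value))
    # urlencode percent-encodes the brackets and turns spaces into "+".
    return BASE + "?" + urlencode(params)

def read_properties(url):
    """Server side: pair up the repeated parameters by position."""
    query = parse_qs(urlsplit(url).query)
    return list(zip(query.get("term[]", []), query.get("value[]", [])))

url = search_url("abcde", [("blood group", "B"), ("height", "178cm")])
print(url)
print(read_properties(url))
```

The open problem raised in the list remains: the Cj queries template has to tell the client that the term/value pair may be repeated.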

Thanks again for your help. Maybe this goes more in the direction of the Cj spec and we should switch forums?

Best, Jens

mca

Dec 22, 2014, 8:31:03 AM
to Jens Preußner, alp...@googlegroups.com, collect...@googlegroups.com
IIUYC, this is about the design of Cj, not ALPS.

Cj's templating language currently does not make it easy to send instructions on creating _arrays_ of inputs or an unlimited *number* of inputs. 

is that the stumbler right now?

if yes, let me ask you -- how would you solve this in HTML?

cheers.

Jens Preußner

Jan 9, 2015, 3:20:22 AM
to alp...@googlegroups.com, collect...@googlegroups.com
Hi Mike,

I guess it's not possible with plain HTML, but I previously used HTML+JS to add input fields to an HTML <form>. Of course, this solution is more for a real user, not machine-readable.
But what could a possible solution look like? Would you need another transition to first specify the number of required input fields?

Best,
Jens

mca

Jan 9, 2015, 11:15:20 AM
to Jens Preußner, alp...@googlegroups.com, collect...@googlegroups.com
right -- figured the HTML "solution" would involve code-on-demand.

so, a couple observations:
- one solution i've used is to support multiple data elements in a *single input*. i've done this quite often actually. using HTML as a guide...
<input name="data-points" value="size:13, weight:3g, color:lavender" />
the client "knows" to insert multiple values separated by commas.
the server "knows" to parse the value of data-points w/ commas as the separator.
you can:
 - share this in human-readable docs and humans can bake it into the source code
 - create a custom attribute on the input that communicates this information <input name="data-points" value="" separator="," />
 - create a specialized hypermedia control for this <multiple-input-with-commas name="data-points" value="" />


- for purely M2M solutions, repeated requests for the same data points are not at all "tedious" and keep the interface simple. you can establish a "repeating pattern" in the *server* that keeps asking for more inputs until the client says "I have no more data". using HTML again...
<form class="data-points" action="..." method="post">
  <input name="property-name" value="" />
  <input name="property-value" value="" />
  <input type="submit" />
</form>
the client "knows" to fill in values for each property pair ("property-name=size&property-value=13")
the server "knows" to accept the two values and return the same form again to wait for another pair
the client "knows" to post an empty set to signal the end of the multiple pairs ("property-name=&property-value=")
the server "knows" that an empty pair signals the end of the list and stops sending the same inputs to the client and continues processing the collection of inputs.
you can:
 - share this in human-readable docs and humans can bake it into the source code
 - create a custom attribute on the FORM that communicates this information <form class="data-points" action="..." method="post" stop-repeats="empty" />
 - create a specialized hypermedia control for this <repeating-form name="data-points" action="..." method="post" stop="empty"/>
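
Both conventions above can be sketched on the server side. This is a minimal sketch that assumes the field names from the HTML snippets (data-points, property-name, property-value); the function and class names are hypothetical, and no web framework is involved.

```python
from urllib.parse import parse_qs

def parse_data_points(raw: str, separator: str = ",") -> dict:
    """Single-input convention: split 'name:value' chunks on the separator."""
    points = {}
    for chunk in raw.split(separator):
        chunk = chunk.strip()
        if not chunk:
            continue
        name, _, value = chunk.partition(":")
        points[name.strip()] = value.strip()
    return points

class RepeatingForm:
    """Repeating-form convention: collect pairs until an empty pair arrives."""

    def __init__(self):
        self.pairs = []
        self.done = False

    def post(self, body: str) -> str:
        fields = parse_qs(body, keep_blank_values=True)
        name = fields.get("property-name", [""])[0]
        value = fields.get("property-value", [""])[0]
        if not name and not value:
            self.done = True   # an empty pair signals the end of the list
            return "done"      # stop repeating and continue processing
        self.pairs.append((name, value))
        return "form"          # send the same form again

print(parse_data_points("size:13, weight:3g, color:lavender"))
# {'size': '13', 'weight': '3g', 'color': 'lavender'}

form = RepeatingForm()
form.post("property-name=size&property-value=13")
form.post("property-name=weight&property-value=3g")
form.post("property-name=&property-value=")
print(form.pairs, form.done)
```

In both cases the convention itself (the separator, or the meaning of the empty pair) still has to reach the client through one of the three channels listed above.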

there are other possible variations, too.

hope this gives you some ideas.

cheers

Jens Preußner

Jan 15, 2015, 4:53:43 AM
to alp...@googlegroups.com
Hi Mike!

Thanks for the detailed answer and the time you invested in this issue. Once I have figured out what fits my needs, I will come back and let you know.

Best,
Jens