about DataSource and Miriam

Igor Rodchenkov

Mar 25, 2010, 10:51:23 AM

to bridgedb...@googlegroups.com, pathway-c...@googlegroups.com

Hi folks!

First of all (a fresh impression): the more I get into the BridgeDb code, the more I like it. Excellent job; I am especially impressed by DataSource! Thank you.

Here, I would like to suggest something about Miriam in BridgeDb (I'm not a member of the Miriam team, by the way, but I have built on it quite a lot). As you know, org.bridgedb.bio.BioDataSource has an init() method that reads data source attributes from a text file (org/bridgedb/bio/datasources.txt). In the same class, there are hard-coded regexp patterns, e.g.:

DataSourcePatterns.registerPattern(
    BioDataSource.UNIPROT,
    Pattern.compile("([A-N,R-][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9])|([O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9])")
);
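As a quick check of how Java reads the character classes in the pattern quoted above, here is a minimal sketch. The class name UniprotPatternCheck is hypothetical and is not part of BridgeDb; the pattern string is copied verbatim from the snippet. Inside square brackets a comma is an ordinary character, and a hyphen just before the closing bracket is a literal hyphen rather than a range.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of BridgeDb: it only wraps the quoted pattern.
final class UniprotPatternCheck {
    // The pattern exactly as it appears in the snippet above.
    static final Pattern QUOTED = Pattern.compile(
        "([A-N,R-][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9])"
        + "|([O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9])");

    // True when the whole string matches the quoted pattern.
    static boolean matches(String id) {
        return QUOTED.matcher(id).matches();
    }
}
```

With this helper, matches("P12345") is true, but so are matches(",1A,,1") and matches("-1ABC1"), because the commas and the trailing hyphen are literal members of the classes. A string that starts with S, such as "S1ABC1", is rejected, since S falls in neither [A-N,R-] nor [O,P,Q].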

These are potential problems (including keeping them up to date). An alternative? Well, as we know, Miriam is good at being a standard: it has an XML schema and a data export to XML, it is a very small "database", and it has standard DB names and synonyms, ID patterns, etc. There is a Java library (MiriamLink 1.1.1, though it relies heavily on the web service) that can return the list of data sources and resources, convert db:id pairs to data URNs and URLs, etc. So I encourage BridgeDb to use that independently supported and quite easy stuff instead of (or in addition to) the internal text file and Pattern.compile("([A-N,R-][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9])|([O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9])") (although there are data sources in datasources.txt that are absent from Miriam; this is a good occasion to ask them to add those!).

So, the init() method could get all the required data sources with their attributes from the latest Miriam XML automatically! The entire (small!) Miriam can be loaded and unmarshalled only once and used for all DataSource and Xref-related things.
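To make the "load once, then reuse" idea concrete, here is a rough sketch, not the MiriamLink implementation. It uses the JDK's DOM parser instead of JAXB unmarshalling to stay dependency-free. The class name MiriamPatternLoader is hypothetical, and the XML layout it expects (a "datatype" element with a "pattern" attribute and a "name" child) is an assumption about the export that should be checked against the real Miriam schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical loader; element and attribute names are assumptions.
final class MiriamPatternLoader {
    // Parses the XML once and returns data source name -> compiled ID pattern.
    static Map<String, Pattern> load(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse data source XML", e);
        }
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        NodeList datatypes = doc.getElementsByTagName("datatype"); // assumed element name
        for (int i = 0; i < datatypes.getLength(); i++) {
            Element datatype = (Element) datatypes.item(i);
            NodeList names = datatype.getElementsByTagName("name"); // assumed child element
            String regex = datatype.getAttribute("pattern");        // assumed attribute
            if (names.getLength() == 0 || regex.isEmpty()) {
                continue; // skip entries without a name or an ID pattern
            }
            patterns.put(names.item(0).getTextContent().trim(), Pattern.compile(regex));
        }
        return patterns;
    }
}
```

The returned map could then be walked once in init() to register each pattern, instead of hard-coding the Pattern.compile(...) calls one by one.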

So far, I've made a new MiriamLink version that I'd like to share. It's actually one class, a schema, and a pom.xml (which helps xjc generate the sources), with almost no dependencies (when using Java 6). The Miriam team is willing to help develop and release the official version of this idea:

- browse src here: http://biopax.hg.sourceforge.net/hgweb/biopax/validator/file/ (go to "miriam-lib" there), e.g.:

-- http://biopax.hg.sourceforge.net/hgweb/biopax/validator/file/fcf9ddba2d86/miriam-lib/src/main/java/org/biopax/miriam/MiriamLink.java (caution: the revision number may change)
-- http://biopax.hg.sourceforge.net/hgweb/biopax/validator/file/5a3a113216d8/miriam-lib/pom.xml
-- http://biopax.hg.sourceforge.net/hgweb/biopax/validator/file/fcf9ddba2d86/miriam-lib/src/test/java/org/biopax/miriam/MiriamLinkTest.java

- at a maven2 repository: http://biopax.sourceforge.net/m2repo/snapshots/org/biopax/miriam-lib/ is the re-engineered one we're talking about (compiled from the above source).
(Do not be confused: there is actually another version of miriam-lib in the m2repo, http://biopax.sourceforge.net/m2repo/snapshots/uk/ac/ebi/miriam-lib/, which is simply the unmodified miriam-lib v1.1.1 release, manually "mavenized" and deployed there.)

PS:

To check out the sources, one should have a Mercurial client ("hg") installed, and then do:

hg clone http://biopax.hg.sourceforge.net:8000/hgroot/biopax/validator biopax-validator

- then take only the biopax-validator/miriam-lib directory (unfortunately, you get all the "validator" modules, but feel free to ignore/remove the unrelated ones).

Cheers,
--
Igor Rodchenkov

baderlab.org

Martijn van Iersel

Mar 25, 2010, 12:28:14 PM

to bridgedb...@googlegroups.com

Hi Igor

See my answers in between below...

>
> Here, I would like to suggest something about Miriam in Bridge DB (I'm
> not a member of Miriam team, by the way, but built on it quite a
> lot). As you know, in org.bridgedb.bio.BioDataSource, we
> have init() method that reads dadasource attributes from the text file
> (org/bridgedb/bio/datasources.txt). In the same class, there are
> regexp patterns hard-coded, e.g.:
>
> DataSourcePatterns.registerPattern(
>     BioDataSource.UNIPROT,
>     Pattern.compile("([A-N,R-][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9])|([O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9])")
> );
>
>
> These are potential problems... (including to update) Alternative?
> Well, as we know, Miriam is good at being a standard, having XML
> schema and data export to XML, being very small "database", having
> standard DB names and synonyms, and ID patterns, etc.

Yes, we're aware of Miriam (in fact, the Xref.getURN() method will
return a Miriam-based URN if available).
The intention has been to rely on Miriam for this type of data as much
as we can, but in actuality this hasn't happened everywhere yet, due to
time constraints. But I'm hoping to improve this in the future (and
patches are always welcome of course)

> There are java library (MiriamLink 1.1.1, though heavily uses web
> service) capable to return the list of data sources, resources,
> convert db:id pairs to data URN, URLs, etc. So I encourage Bridge dB
> to use that independently supported and quite easy staff instead of
> (or in addition to) the internal text file
> and Pattern.compile("([A-N,R-][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9])|([O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9])"

Yes, this is a good idea, but I still want to keep datasources.txt
around as a fallback solution (i.e. as a cache), for two reasons.

Reason 1: Web services can be unreliable (as we're seeing now with the
CRONOS WSDL, for example). Since we're dealing with very little data,
a few hundred records at most, it's a piece of cake to maintain such a cache.

Reason 2: There is another problem with Miriam: they don't want to add
some of the DataSources that we consider very important. Examples of
these are Affymetrix probeset IDs and Agilent. Affymetrix and Agilent
don't provide online pages for each possible identifier, so Miriam
doesn't want to consider those. Yet mapping to and from microarray
reporters is one of the primary use-cases for BridgeDb. I've had a long
discussion with the Miriam folks over this, and I think the end result
is that we agreed to disagree. Miriam is for linking to identifiers,
BridgeDb is for mapping between identifiers, so even though there is
overlap there is also some tension there.
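The two reasons above suggest a simple merge strategy, sketched here under stated assumptions. DataSourceRegistry and resolve are hypothetical names, not BridgeDb API, and the maps stand in for whatever datasources.txt and the Miriam export would actually provide. Local entries form the baseline, so sources that Miriam does not list survive, and entries from the remote source override them only when the call succeeds.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper illustrating "remote when available, local as fallback".
final class DataSourceRegistry {
    // Returns data source name -> ID pattern string.
    static Map<String, String> resolve(Callable<Map<String, String>> remote,
                                       Map<String, String> local) {
        // Start from the local cache so local-only sources are always kept.
        Map<String, String> merged = new LinkedHashMap<>(local);
        try {
            // Remote entries win for names present in both.
            merged.putAll(remote.call());
        } catch (Exception e) {
            // Remote source unavailable: keep the local cache unchanged.
        }
        return merged;
    }
}
```

A caller would pass something like a lambda that fetches and parses the Miriam data as the first argument; if that lambda throws, the result is just a copy of the local map.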

> (although there are data sources in datasources.txt that are absent
> from Miriam; this is a good point to ask them add!)

Indeed I've already submitted a few datasources to Miriam myself, but
like I said some are considered out of their scope. That's fine, I would
still like to use Miriam data wherever possible, but it's not an end-all
solution unfortunately.

> So, init() method could get all the required datasources with
> attributes from the latest miriam xml automatically! Entire (small!)
> Miriam can be loaded and unmarshalled only once and used for all
> DataSource and Xref-related things.
>
> So far, I've made a new MiriamLink version that I'd live to share.
> It's actually one class, schema, and pom.xml (helps xjc generate
> sources), and almost no dependencies (when using Java6)). Miriam team
> is willing to help develop and release the official version of this idea:
>
> - browse src
> here: http://biopax.hg.sourceforge.net/hgweb/biopax/validator/file/ (go
> to "miriam-lib" there), e.g.:
>
> --
> http://biopax.hg.sourceforge.net/hgweb/biopax/validator/file/fcf9ddba2d86/miriam-lib/src/main/java/org/biopax/miriam/MiriamLink.java (caution:
> the revision number may change)
> --
> http://biopax.hg.sourceforge.net/hgweb/biopax/validator/file/5a3a113216d8/miriam-lib/pom.xml
> --
> http://biopax.hg.sourceforge.net/hgweb/biopax/validator/file/fcf9ddba2d86/miriam-lib/src/test/java/org/biopax/miriam/MiriamLinkTest.java
>
> - at a maven2 repository:
> http://biopax.sourceforge.net/m2repo/snapshots/org/biopax/miriam-lib/ is
> the re-engineered one we're talking about (compiled from the above
> source
> (do not be confused, as there are actually another versions on
> the miriam-lib in the m2repo:
> http://biopax.sourceforge.net/m2repo/snapshots/uk/ac/ebi/miriam-lib/ -
> which is simply the unmodified miriam-lib v1.1.1 release, manually
> "mavenized" and deployed there)

>
>
> PS:
> To checkout the sources, one should have Mercurial client ("hg")
> installed; and then do -
> hg clone http://biopax.hg.sourceforge.net:8000/hgroot/biopax/validator biopax-validator
> - then take only the biopax-validator/miriam-lib directory
> (unfortunately, you get all the "validator" modules, but feel free to
> ignore/remove unrelated).
>

Ok, thanks for the info, this is very useful. I'll take a look at this
and I'll try to do what I can to improve the cohesion between Miriam and
BridgeDb.