-Alan
---------- Forwarded message ----------
From: David Wood <da...@zepheira.com>
Date: Mon, Mar 22, 2010 at 10:59 AM
Subject: [Purl-interest] Announcing PURL Federation Development Effort
To: purl-dev <purl...@purlz.org>
Cc: public-lod <publi...@w3.org>, semw...@meetup.com,
purl-interest <purl-i...@purlz.org>, DCMI Architecture Forum
<dc-arch...@jiscmail.ac.uk>, semantic-web <semant...@w3.org>
Hi all,
The National Center for Biomedical Ontology and Zepheira are pleased
to announce work on a PURL Federation. A PURL Federation will allow
multiple PURL service operators to cooperate in PURL resolutions,
covering for each other in the case of service outages and allowing
the persistent resolution of PURLs as funding levels and
organizational details change with time.
PURL Federations are intended to enhance the ability of Semantic Web
and Linked Data communities to ensure the persistence of their
identifiers.
Prototypes of the PURL Federation code will be released occasionally
in the coming months, with an Alpha version to be released in the
Summer of 2010. Further announcements of releases will be made on the
purl-dev mailing list and on http://purlz.org.
The architecture document guiding the development of a PURL Federation
is available at:
http://purlz.org/project/purl/development/wiki/PURLFederationArchitecture
Review and feedback on the proposed architecture are encouraged. We
want to hear your use cases and good ideas. Thanks in advance.
Regards,
Dave
--
David Wood
Partner
Zepheira - The Art of Data
http://zepheira.com/team/dave/
Cell: +1 540 538 9137
There are a few things that might require some more discussion:
* "A simpler architecture without the use of caching proxies was
considered and rejected." The caching proxy will still require some
time to decide whether a response is going to come back, although the
timeout value can be configured on the host. Don't try to encase the
problem in too many layers, as it just creates more complexity, and in
reality each layer is just as vulnerable as the next if you are going
for 99.99999% uptime. If the EC2 cloud goes down, as happens from time
to time, what part of the infrastructure would correct the balance?
Are we still relying on a (single?) human to monitor the whole
package, because in that case the uptime ratings will only come at a
high cost for a permanent building and monitoring infrastructure.
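For illustration, a minimal failover sketch in Python; the mirror
hostnames and the two-second timeout are assumptions, not part of the
proposal:

    import urllib.request

    # Hypothetical federation members; not real hosts.
    MIRRORS = [
        "http://purl-a.example.org",
        "http://purl-b.example.org",
    ]

    def resolve(path, timeout=2.0):
        """Try each mirror until one answers within the timeout."""
        for base in MIRRORS:
            try:
                # Each dead mirror costs up to `timeout` seconds, so
                # worst-case latency grows with the number of layers.
                return urllib.request.urlopen(base + path,
                                              timeout=timeout)
            except OSError:
                continue  # mirror down or unreachable; try the next
        raise RuntimeError("no mirror responded")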
* "DNS service is thus motivated to set low or null times-to-live on
DNS responses. However, the use of proxies eliminates this risk"
Proxies don't eliminate the risk, they just move it to another level.
Null TTL values wouldn't be recommended in any system, no matter what
uptime it is looking for, as they will just be a place where the
system can be accidentally DOS'd if the traffic is too high. You
should really specify a particular uptime, minimum latency, and
timeout values before going into describing minimal TTL caching values
on DNS requests, so people understand what sort of availability is
going to be necessary. What is the expected traffic for the proxy
resolvers? Millions (billions?) of queries per day?
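To make the TTL point concrete, a back-of-envelope estimate; every
figure here is an assumption chosen only for illustration:

    # With a null TTL every resolution hits the authoritative DNS
    # servers; with a modest TTL most are absorbed by recursive
    # caches along the way.
    queries_per_day = 100_000_000   # assumed client resolutions/day
    seconds_per_day = 86_400

    no_cache_qps = queries_per_day / seconds_per_day  # TTL = 0
    print(f"TTL=0:    ~{no_cache_qps:.0f} authoritative queries/sec")

    # With TTL=300s each recursive cache asks at most once per 300s.
    recursive_caches = 50_000       # assumed distinct caches worldwide
    cached_qps = recursive_caches / 300
    print(f"TTL=300s: ~{cached_qps:.0f} queries/sec (upper bound)")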
* General proxies can't act as HTTPS intermediaries without either
intercepting traffic in a man-in-the-middle fashion or having the
certificates distributed to them. This isn't just a disadvantage; it
is a design issue if the federated PURL system is ever going to be
used with HTTPS, which I haven't seen proposed yet and don't really
see a case for. PURLs could redirect to an HTTPS host, but if the
information about where particular requests redirect is public
anyway, then there is no advantage to resolving the redirection URL
over HTTPS (see the sketch below). If people want privacy to that
degree, they could run the PURL proxy inside their own network.
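A sketch of the "redirect to an HTTPS host" case: the resolution step
stays plain HTTP and public, and only the final fetch uses TLS. The
PURL below is hypothetical:

    import urllib.request
    import urllib.error

    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            return None  # surface the 3xx instead of following it

    opener = urllib.request.build_opener(NoRedirect)
    try:
        opener.open("http://purl.example.org/net/some/name")
    except urllib.error.HTTPError as e:
        if e.code in (301, 302, 303, 307):
            # Anyone on the path saw this mapping, so TLS on the
            # target adds nothing to the secrecy of PURL -> target.
            print("redirects to:", e.headers["Location"])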
* "Therefore, a new data format is required. We suggest using Turtle,"
The system won't be future proof if it is standardising on a yet
unstandardised serialisation of the RDF model, when it could just as
effectively standardise on "RDF", and suggest some serialisations. The
focus on not using XML seems very partisan, especially considering
Turtle hasn't managed to pass through the standardisation process yet.
People aren't necessarily going to write a custom Turtle parser, and
the libraries that contain Turtle parsers currently, also contain
RDF/XML, NTriples, N3, etc., parsers, so there is no harm in going for
a neutral position for this.
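For instance, with rdflib the same graph loads from Turtle, RDF/XML
or N-Triples, so the federation could fix the model and let the
serialisation vary; the predicate URI is made up for the example:

    from rdflib import Graph

    def load_purl_records(data: str, fmt: str) -> Graph:
        """fmt can be any serialisation rdflib knows:
        'turtle', 'xml', 'nt', ..."""
        g = Graph()
        g.parse(data=data, format=fmt)
        return g

    record = ('<http://purl.example/x> '
              '<http://purl.example/redirectsTo> '
              '<https://example.org/> .')
    g = load_purl_records(record, "turtle")
    print(len(g))  # 1 triple, regardless of the input serialisation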
* Should people want to know which physical server a response was
actually served from, would they be able to find out? (One
conventional approach is sketched below.)
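One conventional answer is to have each resolver stamp its responses
with an identifying header. The Via header is standard for proxies
(RFC 7230); the X-Served-By header and the URL are assumptions:

    import urllib.request

    resp = urllib.request.urlopen("http://purl.example.org/net/name")
    print(resp.headers.get("Via"))          # proxies add Via hops
    print(resp.headers.get("X-Served-By"))  # hypothetical server stamp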
* Can resolvers choose which PURLs they are going to serve, or must
every member serve details for every other member? What are the
typical sizes of the data sets that would be required to sustain this
into eternity? (A rough sizing estimate follows.)
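A rough sizing estimate for the full-replication case; the counts and
record size are assumptions for illustration only:

    purls_per_member = 1_000_000
    members = 20
    bytes_per_record = 200  # PURL, target URL, curator, timestamps

    total = purls_per_member * members * bytes_per_record
    print(f"~{total / 2**30:.1f} GiB of mappings per full replica")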
* There would have to be a simple way for each actual resolver to
trust the definitions it receives. Distributed trust doesn't come
without multiple humans being in the loop at some point, so people
will have to negotiate that process. The method for doing this should
be spelled out in the document somewhere, I think. (A minimal sketch
of one possible mechanism follows.)
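A minimal sketch of one possible mechanism, assuming each curator's
key is exchanged by humans out of band; a shared-key HMAC is the
simplest stand-in here, though a real federation would likely want
public-key signatures so keys need not be shared:

    import hashlib
    import hmac

    # Hypothetical curator keys, negotiated between humans.
    CURATOR_KEYS = {"curator-a": b"negotiated-out-of-band"}

    def verify(record: bytes, curator: str, signature: str) -> bool:
        key = CURATOR_KEYS.get(curator)
        if key is None:
            return False  # unknown curator: a human adds the key first
        expected = hmac.new(key, record, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)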
* How easy is it for people to override the system to use a local
resolver? And what changes if they are not using the PURL software
package? (Say the internet connection goes down, but they need access
to information they hold locally; see the sketch below.)
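A sketch of a client-side override: consult a local mapping file
first and only then go to the network, so resolution still works
offline. The file path and JSON format are assumptions:

    import json
    import os
    import urllib.request

    # Hypothetical local override file: {"<purl>": "<target URL>"}
    LOCAL_MAP = os.path.expanduser("~/.purl-local.json")

    def resolve(purl: str) -> str:
        if os.path.exists(LOCAL_MAP):
            with open(LOCAL_MAP) as f:
                local = json.load(f)
            if purl in local:
                return local[purl]  # answered without the network
        resp = urllib.request.urlopen(purl)  # public resolver
        return resp.geturl()  # final URL after redirects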
Cheers,
Peter
Local resolvers are a big part of the shared names story. How to get
non-PURLZ resolvers to sync with PURLZ-based resolvers is an open
question for us. You could either keep the 'master' information in a
PURLZ database and export it to other platforms/APIs, or keep it
external to PURLZ and import it into PURLZ. I favor the latter, due
to a personal distrust of all database technology. The database is
very small (so far), so this is feasible, and maintaining it in a
version control system such as Git makes auditing changes easy,
thereby making it easier to trust other curators, thereby making it
easier to bring curators aboard, thereby spreading the curation
workload better.
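That audit trail is easy to query; a sketch, assuming the mappings
live in a file called purls.ttl inside the checkout:

    import subprocess

    def audit_log(repo=".", mapping_file="purls.ttl"):
        """List who changed which PURL definitions, and when."""
        out = subprocess.run(
            ["git", "-C", repo, "log",
             "--format=%h %an %ad %s", "--", mapping_file],
            capture_output=True, text=True, check=True)
        return out.stdout

    print(audit_log())  # run inside the version-controlled checkout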
One might want to use PURLZ software for a local mirror. Technically
this ought to be straightforward: you just set up a PURLZ server and
either have it pull information from the 'federation' or import it
from the master database. For shared names it's important that this
be possible without having to ask permission, i.e. the database must
be accessible to anyone. I doubt the PURLZ software would prohibit
database access by design; it should be just a matter of
configuration.
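The pull step might look something like this; the master URL is
hypothetical, and the final import into PURLZ would use whatever
interface that software exposes:

    import urllib.request
    from rdflib import Graph

    # Hypothetical public location of the master mapping file.
    MASTER = "http://purlz.example.org/federation/purls.ttl"

    def sync_mirror() -> Graph:
        data = urllib.request.urlopen(MASTER).read().decode("utf-8")
        g = Graph()
        g.parse(data=data, format="turtle")
        return g  # hand this to the local PURLZ import step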
Jonathan