Usage analytics on hits to PURLs?

Thomas Baker

Aug 12, 2010, 6:53:18 PM
to persist...@googlegroups.com
Dear all,

DCMI uses PURLs to identify its metadata properties and classes
(i.e., [1]). These PURLs redirect to RDF schemas (e.g., [2]).

We would like to better understand how heavily these resources
are being dereferenced. In principle, we could monitor usage
of either [1] or [2], but I understand there is or was at one
time an issue with PURL servers occasionally being overloaded,
so we'd like to keep an eye on how frequently the PURLs are
getting accessed.

Can anyone advise us where to look for guidance on setting up
such analytics?

Many thanks,
Tom Baker

[1] http://purl.org/dc/terms/title
[2] http://dublincore.org/2008/01/14/dcterms.rdf

--
Tom Baker <tba...@tbaker.de>
Dublin Core Metadata Initiative

Dehn,Tom

Aug 13, 2010, 10:21:28 AM
to persist...@googlegroups.com
We get between 50 and 120 PURL requests every second. At times the
PURL server is so overwhelmed with requests that some go unresolved and
return a 503 HTTP error. I've increased PURL parameters and resources to
reduce these 503 errors, but they still occur about once a day, between
about 7:00 and 9:00 AM Monday through Friday (lasting from 15 to 150
seconds). They rarely occur during a weekend.

About 98% of the requests are to resolve dc elements.

I've attached a compressed partial http log file from
22/Jul/2010:08:31:00 - 34:59 (just 4 minutes).

Access to our production log files is restricted.

If you need more information I can access and copy different log files.

Let me know if you need more info.

20100722!8!31-8!35.log.gz
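
A minimal sketch of how a log excerpt like this might be summarized into requests per second and 503 responses per second. The layout of the attached log is not shown in the thread, so the Common Log Format pattern, the `summarize` helper, and the sample lines below are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed Common Log Format; the real layout of the attached log is unknown.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def summarize(lines):
    """Return two Counters keyed by second: all requests, and 503 responses."""
    per_second = Counter()
    errors_503 = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        second = match.group("ts").split()[0]  # drop the timezone offset
        per_second[second] += 1
        if match.group("status") == "503":
            errors_503[second] += 1
    return per_second, errors_503


# Made-up sample lines, not taken from the real log.
sample = [
    '192.0.2.1 - - [22/Jul/2010:08:31:00 -0400] "GET /dc/elements/1.1/ HTTP/1.1" 302 258',
    '192.0.2.2 - - [22/Jul/2010:08:31:00 -0400] "GET /dc/terms/title HTTP/1.1" 302 258',
    '192.0.2.3 - - [22/Jul/2010:08:31:01 -0400] "GET /dc/elements/1.1/ HTTP/1.1" 503 0',
]
per_second, errors_503 = summarize(sample)
print(max(per_second.values()), sum(errors_503.values()))
```

For a compressed file, the same function could be fed from `gzip.open(path, "rt")`, since it only needs an iterable of text lines.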

Thomas Baker

Aug 13, 2010, 10:43:58 AM
to persist...@googlegroups.com
Thank you, Tom!

WOW - 98% of the requests to your PURL server, overall,
are for dc elements!! Some of these are for dc PURLs that
are badly formed (e.g., /dc/elements/1.1\) or over ten years
obsolete (1700 hits for variants of /dc/elements/1.0/).
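
One way to put numbers on these categories is to bucket each requested path. This is only a sketch: the `classify_dc_path` helper and its category names are hypothetical, and treating a backslash as the sole sign of a badly formed PURL is an assumption based on the single example above.

```python
from collections import Counter


def classify_dc_path(path):
    """Bucket a request path; the categories are illustrative only."""
    if not path.startswith("/dc/"):
        return "other"
    if "\\" in path:
        return "malformed"  # e.g. /dc/elements/1.1\ as seen in the log
    if path.startswith("/dc/elements/1.0"):
        return "obsolete"  # the long-superseded 1.0 element set
    return "current"


# Made-up request paths, not taken from the real log.
paths = [
    "/dc/elements/1.1/",
    "/dc/elements/1.1\\",
    "/dc/elements/1.0/",
    "/dc/terms/title",
    "/net/example",
]
buckets = Counter(classify_dc_path(p) for p in paths)
print(buckets)
```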

Correct me if you think otherwise, but the "burst-like"
nature of requests suggests scheduled batch processes.

Is this a stable situation or a growing trend?

Tom

--
Tom Baker <tba...@tbaker.de>

Young,Jeff (OR)

Aug 13, 2010, 10:54:21 AM
to persist...@googlegroups.com
Tom,

It seems that the purpose of HTTP identity is being mistaken for HTTP
resolvability much too often. The situation with W3C's XHTML URIs seems
analogous:

http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic

In theory, this situation could be remedied by having the DC element
URIs intentionally return 503 with an HTML entity body that does an
*HTML* redirect instead. I assume this will break automated systems, just
like the W3C's DTD identifier change did, forcing people to fix their
systems without interfering with browser access.

Just an idea.
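
To make the proposal concrete, here is a rough sketch of a handler that answers with a 503 status and an HTML body that redirects browsers. It is not how purl.org or any PURL server actually behaves. Reading "HTML redirect" as a meta refresh tag is an assumption, as are the `redirect_page` and `serve` helpers and the choice of target URL (the RDF schema mentioned earlier in the thread).

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer

# Redirect target borrowed from the first message; any URL would do here.
TARGET = "http://dublincore.org/2008/01/14/dcterms.rdf"


def redirect_page(target):
    """Build an HTML body that sends browsers on via a meta refresh."""
    safe = html.escape(target, quote=True)
    page = (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        f'<meta http-equiv="refresh" content="0; url={safe}">\n'
        "<title>503 Service Unavailable</title>\n"
        "</head><body>\n"
        f'<p>Continue to <a href="{safe}">{safe}</a>.</p>\n'
        "</body></html>\n"
    )
    return page.encode("utf-8")


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Status 503 stops HTTP clients that only follow 3xx redirects;
        # a browser still renders the body and follows the meta refresh.
        body = redirect_page(TARGET)
        self.send_response(503)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the sketch locally; the port number is arbitrary."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` and then requesting any path with `curl -i http://127.0.0.1:8080/` should show the 503 status line followed by the HTML body.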

Jeff

Dehn,Tom

Aug 13, 2010, 11:12:41 AM
to persist...@googlegroups.com
I think the situation is somewhat stable.
I have been monitoring the purls server over the past few weeks.

I'm about to change the PURL server parameters and resources again to
improve the response.
(When I do I need to watch the server closely, so I don't do too much).

I don't know if the "burst-like" requests are due to batch processes.
I did some analysis of IP addresses during the periods of high request
volume, but could not trace the traffic to a single address.
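
A small sketch of the kind of check described here: given the client addresses seen during a burst, list the busiest ones and their share of the traffic. The `top_clients` helper and the sample addresses are hypothetical. If no single address holds a large share, the burst is spread across many clients.

```python
from collections import Counter


def top_clients(ips, n=3):
    """Return (address, count, share) for the n busiest client addresses."""
    counts = Counter(ips)
    total = sum(counts.values())
    # most_common() yields nothing for empty input, so no division by zero.
    return [(ip, count, count / total) for ip, count in counts.most_common(n)]


# Made-up addresses from the documentation range, not from the real log.
burst = ["192.0.2.1"] * 3 + ["192.0.2.2"]
for ip, count, share in top_clients(burst):
    print(f"{ip} {count} {share:.0%}")
```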

It seems that there are a lot of badly formed purls in pages out there
on the internet. It seems that they have been there for maybe a decade.
Returning an HTTP 500 code that is simply ignored does not solve the problem.

Tom Dehn
Office of Research
OCLC Inc.
614-761-5150


Thomas Baker

Aug 13, 2010, 11:23:07 AM
to persist...@googlegroups.com
Hi Jeff,

On Fri, Aug 13, 2010 at 10:54:21AM -0400, Young,Jeff (OR) wrote:
> It seems that the purpose of HTTP identity is being mistaken with HTTP
> resolvability much too often. The situation with W3C's XHTML URIs seems
> analogous:
>
> http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic

Thank you for the link. That confirms what I suspected - that this
is a general issue.

> In theory, this situation could be remedied by having the DC element
> URIs intentionally return 503 with an HTML entity body that does an
> *HTML* redirect instead. I assume this will break automated systems just
> like the W3C's DTD identifier change did forcing people to fix their
> systems, without interfering with browser access.
>
> Just an idea.

One would need to re-think the HTTP response practice for "Cool URIs"... [1]

Tom

[1] http://www.w3.org/TR/cooluris/

--
Tom Baker <tba...@tbaker.de>

Young,Jeff (OR)

Aug 13, 2010, 11:42:03 AM
to persist...@googlegroups.com
Tom,

I believe that 503 (Service Unavailable) is perfectly reasonable and
understandable behavior for a Linked Data RWO URI. In addition to the
HTML redirect, the entity body can include text (including generalizable
OWL/RDFa) explaining the situation in terms the SW might well understand
and appreciate.

Why doesn't the W3C use this 503/HTML redirect trick to point to the DTD
rather than their dead-end 503 page?

Given problems with the status quo, I'm just putting it on the table for
consideration.

Jeff


Thomas Baker

Aug 13, 2010, 12:03:42 PM
to persist...@googlegroups.com
Hi Jeff,

On Fri, Aug 13, 2010 at 11:42:03AM -0400, Young,Jeff (OR) wrote:
> I believe that 503 (Service Unavailable) is perfectly reasonable and
> understandable behavior for a Linked Data RWO URI. In addition to the
> HTML redirect, the entity body can include text (including generalizable
> OWL/RDFa) explaining the situation in terms the SW might well understand
> and appreciate.
>
> Why doesn't the W3C use this 503/HTML redirect trick to the DTD rather
> than their dead end 503 page?
>
> Given problems with the status quo, I'm just putting it on the table for
> consideration.

I think you are right that this is an issue best addressed
at the level of W3C since it impacts Web architecture and the
best-practice use and interpretation of HTTP response codes.

In this regard, DCMI in particular is in the position
of following others' leads - for DCMI to test a new and
undocumented approach with something as crucial as its
namespace URIs could only create confusion.

It is not clear to me how best to raise this issue to W3C.
One possible channel would be to put the problem on the table
in the Library Linked Data Incubator Group in the context of
the more general issue "Management of data and distribution",
sub-point "Issues of Web architecture... HTTP... caching
strategies" [1]. Ah! I see your name is already listed
there... We could perhaps talk about this in Pittsburgh...

Tom

[1] http://www.w3.org/2005/Incubator/lld/wiki/Topics

--
Tom Baker <tba...@tbaker.de>

Thomas Baker

Aug 13, 2010, 12:12:50 PM
to persist...@googlegroups.com
Tom,

On Fri, Aug 13, 2010 at 11:12:41AM -0400, Dehn,Tom wrote:
> I think the situation is somewhat stable.
> I have been monitoring the purls server over the past few weeks.
>
> I'm about to change the purl server parameter and resources again to
> improve the response.
> (When I do I need to watch the server closely, so I don't do too much).

Okay.

> I don't know if the "burst-like" requests are due to batch processes.
> I did some analysis of ip addresses during the period of high request,
> but could not resolve it to a single address.

I guess I presumed that because when we noticed the
/dc/elements/1.0/ requests a few years ago, we were able
to resolve them back to a certain large company that
was broadcasting its requests at predictable intervals.

Is it reasonable at least to assume that most such requests
are being made automatically by software as opposed to by
individuals clicking on links -- or is little enough known
to say even that?

> It seems that there are a lot of badly formed purls in pages out there
> on the internet. It seems that they have been there for maybe a decade.
> Return a http code 500, that is ignored does not solve the problem.

As I suggested, perhaps Jeff could raise this as an issue in
the W3C LLD XG.

Tom

--
Tom Baker <tba...@tbaker.de>

Young,Jeff (OR)

Aug 13, 2010, 12:27:01 PM
to persist...@googlegroups.com
A conversation in Pittsburgh would be good. I suspect the problem isn't
going away.

Jeff


Dehn,Tom

Aug 13, 2010, 12:31:14 PM
to persist...@googlegroups.com, Young,Jeff (OR)
You are probably right. The issue will affect everyone.

Tom Dehn
Office of Research
OCLC Inc.
614-761-5150


Ed Summers

Aug 13, 2010, 4:45:10 PM
to persist...@googlegroups.com, Young,Jeff (OR)
Another thing worth putting on the agenda is caching. Thoughtful use
of the Expires header could allow intermediary HTTP caches to help
reduce the number of requests making it to purl.org. Right now
purl.org is telling everyone they can't cache the response. See
"Expires: Thu, 01 Jan 1970 00:00:00 GMT" below:

ed@curry:~$ curl -i http://purl.org/dc/terms
HTTP/1.1 302 Moved Temporarily
Date: Fri, 13 Aug 2010 20:40:33 GMT
Server: 1060 NetKernel v3.3 - Powered by Jetty
Location: http://purl.org/dc/terms/
Content-Type: text/html; charset=iso-8859-1
X-Purl: 2.0; http://localhost:8080
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Length: 258

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>302 Found</TITLE>
</HEAD>
<BODY>
<H1>Found</H1>
The resource requested is available <A
HREF="http://purl.org/dc/terms/">here</A>.<P>
</BODY>
</HTML>
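
As a rough illustration of the point about the Expires header, the sketch below checks whether an Expires value is already in the past and builds one that lies in the future. The helper names and the one-day lifetime are assumptions chosen for the example, not a recommendation for purl.org's configuration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def is_already_expired(expires_value, now=None):
    """True if the Expires date is not later than 'now' (nothing to cache)."""
    now = now or datetime.now(timezone.utc)
    return parsedate_to_datetime(expires_value) <= now


def expires_header(lifetime, now=None):
    """Format an Expires value 'lifetime' after 'now' as an HTTP date."""
    now = now or datetime.now(timezone.utc)
    return format_datetime(now + lifetime, usegmt=True)


# The value purl.org sent in the response above.
print(is_already_expired("Thu, 01 Jan 1970 00:00:00 GMT"))

# A hypothetical one-day lifetime, counted from the Date header above.
sent = datetime(2010, 8, 13, 20, 40, 33, tzinfo=timezone.utc)
print(expires_header(timedelta(days=1), now=sent))
```

A 302 response is only cacheable when it carries explicit freshness information, so an Expires value like the second one is what would let intermediary caches answer repeat requests.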

//Ed

liang

Mar 3, 2016, 11:47:43 AM
to persistenturls
When you talk about usage stats, where in PURLZ can I find the logs?
Thank you.

Liang