Chunking KaufKauf data into smaller pieces

Hogan, Aidan

Jan 26, 2010, 10:09:14 AM
to andreas....@unibw.de, tobias....@unibw.de, mh...@computer.org, pedant...@googlegroups.com
Hi folks,

I was running a crawler the other day, which came across your KaufKauf
dataset. It seems that your linked-data export provides an entire dump
of information for all related entities, which leads to a lot of
redundancy. For example, the URI at [1] returns a description of about
46k entities, where each entity has three triples and a dereferenceable
URI. However, all of the 46k dereferenceable URIs return the entire
dump, with three triples for each entity, resulting in massive
redundancy (3 * 46k^2 = >6b triples crawled for 138k unique). The same
goes for entities such as [2].

I would suggest, perhaps, one page which dereferences to a description
of all the entities (acting like a links page, e.g., just use [3] as
is), and for each individual entity (e.g., [1]), just give the
information for that entity, e.g.:

<gr:BusinessEntity rdf:about="http://openean.kaufkauf.net/id/businessentities/GLN_0437020000084">
  <gr:hasGlobalLocationNumber rdf:datatype="http://www.w3.org/2001/XMLSchema#string">7905860000050</gr:hasGlobalLocationNumber>
  <rdfs:isDefinedBy rdf:resource="http://openean.kaufkauf.net/id/businessentities/"/>
</gr:BusinessEntity>

Thus, an agent looking for the description of an entity such as [1]
through dereferencing will not be swamped with the descriptions of
entities it is not interested in. Also, a crawler will not end up with
massive amounts of duplicate data from your site, and will not overload
your servers by streaming gigabytes of repetitive data.
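
(To see the scale of the duplication, compare the byte counts returned
for two different business-entity URIs; since each such URI currently
returns the same full dump, the counts come out identical. Any two IDs
from the dump will do; the two below are just examples:

$ curl -s http://openean.kaufkauf.net/id/businessentities/GLN_0437020000084 | wc -c
$ curl -s http://openean.kaufkauf.net/id/businessentities/GLN_7905860000050 | wc -c
)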

The same would be great for the entities described in, for example, [3].

Thoughts?
Aidan

[1] http://openean.kaufkauf.net/id/businessentities/GLN_0437020000084
[2] http://openean.kaufkauf.net/id/EanUpc_0001000000748
[3] http://openean.kaufkauf.net/id/businessentities/

(As a side note, the Java string hashcodes of all 46k URLs on your site
with format [1] are even... This is because the URLs differ only by ID
number, which always ends in an even digit and always contains an even
number of odd digits... this can lead to strange load-balancing in
hash-based distributed applications... which is a little strange, but I
guess nothing to worry about from your perspective.)

Andreas Harth

Mar 29, 2010, 1:25:16 PM
to pedant...@googlegroups.com, andreas....@unibw.de, tobias....@unibw.de, mh...@computer.org
Hi guys,

Just to reiterate Aidan's point.

I'm currently working on the 2010 Billion Triple Challenge dataset,
and also crawling a lot of data from kaufkauf.net.

Given that the Billion Triple Challenge dataset is widely used, there's
a good chance that quite a few people will notice that your data is
slightly goofy.

Best regards,
Andreas.

Andreas Harth

Mar 30, 2010, 5:25:33 AM
to marti...@ebusiness-unibw.org, pedant...@googlegroups.com, andreas....@unibw.de, tobias....@unibw.de, mh...@computer.org
Hi Martin,

Martin Hepp (UniBW) wrote:
> We wanted to find a good compromise between the network traffic and the
> number of files. Loading one huge file for looking up a single element
> is obviously not feasible. Hosting one million files in a single
> directory was not an option either, since handling directories with
> such a large number of files is cumbersome.

Given the server setup, maybe a better solution is to redirect to
the data files rather than just (virtually) copying their content.

So a lookup on a thing-URI, e.g.

http://openean.kaufkauf.net/id/EanUpc_0004100380772

redirects to whatever data-URI holds the triples.

That way, each data file has one URI and thus is only crawled once.
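
With Apache's mod_rewrite, that is only a couple of lines in the
.htaccess for /id/ -- a sketch only, with invented ID ranges (the real
grouping is whatever your existing patterns encode, and whether to use
R=303 or a plain 302 is a separate question):

RewriteEngine On
# redirect each thing-URI to the data file that holds its triples
RewriteRule ^EanUpc_0001.* http://openean.kaufkauf.net/id/file1.owl [R=303,L]
RewriteRule ^EanUpc_0004.* http://openean.kaufkauf.net/id/file3.owl [R=303,L]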

Best regards,
Andreas.

Martin Hepp (UniBW)

Mar 30, 2010, 3:51:25 AM
to Andreas Harth, pedant...@googlegroups.com, andreas....@unibw.de, tobias....@unibw.de, mh...@computer.org
Hi Andreas,
I think I already explained my point: we have only limited control
over the server on which this dataset is hosted. Since there are a
lot of entities with just a few statements each, we decided to produce

a) one data dump holding "all in one",
b) a lot of smaller files that collate the data for a limited set of
entities, and
c) an .htaccess configuration that serves the matching smaller file
containing the definition for each element.

However, since there are usually just a few triples per entity and more
than a million entities, we decided not to create one file per entity,
but instead to group the triples for a few hundred entities in one file
and use range patterns in the .htaccess file.
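
Roughly, the relevant .htaccess entries look like this (the ranges here
are invented for illustration -- our real patterns differ -- and this
assumes Apache's mod_rewrite):

RewriteEngine On
# serve the grouped file under each entity URI (internal rewrite, no redirect)
RewriteRule ^EanUpc_0001.* file1.owl [L]
RewriteRule ^EanUpc_0002.* file2.owl [L]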

We wanted to find a good compromise between the network traffic and the
number of files. Loading one huge file for looking up a single element
is obviously not feasible. Hosting one million files in a single
directory was not an option either, since handling directories with
such a large number of files is cumbersome.

I fail to see how this solution violates any Web architecture
requirements, and I would strongly argue that our data is not "goofy"
in any way.

When you dereference an entity URI, you get the proper triples for
that element plus a few triples that you don't need. The same happens
all the time on the Web - if you crawl an XHTML+RDFa document, you will
also get triples that you don't need.

Also, the total size per file is in my opinion pretty moderate - 46 kB
per file is smaller than most HTML pages.

If you want to get all triples from the data set, simply retrieve the
OWL file from

http://openean.kaufkauf.net/id/

and import all owl:imports.

That will give you 697 individual files, following the pattern

http://openean.kaufkauf.net/id/file<nnn>.owl

with <nnn> being a number from 1 to 697.

It also imports

http://openean.kaufkauf.net/id/businessentities/
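
In shell terms, fetching everything is just the following (a sketch
that hard-codes the file pattern above instead of actually parsing the
owl:imports):

$ for n in $(seq 1 697); do curl -s -O "http://openean.kaufkauf.net/id/file$n.owl"; done
$ curl -s -o businessentities.owl http://openean.kaufkauf.net/id/businessentities/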

By the way, I think that trying to do a "naive" crawl by dereferencing
every single URI is an archaic harvesting pattern.

PS: Yes, I know, using an RDF repository + SPARQL endpoint + Pubby may
be a more elegant solution, but it was unfeasible in the given setting.

--
--------------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail: he...@ebusiness-unibw.org
phone: +49-(0)89-6004-4217
fax: +49-(0)89-6004-4620
www: http://www.unibw.de/ebusiness/ (group)
http://www.heppnetz.de/ (personal)
skype: mfhepp
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================

Project page:
http://purl.org/goodrelations/

Resources for developers:
http://www.ebusiness-unibw.org/wiki/GoodRelations

Webcasts:
Overview - http://www.heppnetz.de/projects/goodrelations/webcast/
How-to - http://vimeo.com/7583816

Recipe for Yahoo SearchMonkey:
http://www.ebusiness-unibw.org/wiki/GoodRelations_and_Yahoo_SearchMonkey

Talk at the Semantic Technology Conference 2009:
"Semantic Web-based E-Commerce: The GoodRelations Ontology"
http://www.slideshare.net/mhepp/semantic-webbased-ecommerce-the-goodrelations-ontology-1535287

Overview article on Semantic Universe:
http://www.semanticuniverse.com/articles-semantic-web-based-e-commerce-webmasters-get-ready.html

Tutorial materials:
ISWC 2009 Tutorial: The Web of Data for E-Commerce in Brief: A Hands-on Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
http://www.ebusiness-unibw.org/wiki/Web_of_Data_for_E-Commerce_Tutorial_ISWC2009

Andreas Radinger

Mar 30, 2010, 7:55:15 AM
to Andreas Harth, aidan...@deri.org, tobias....@unibw.de, mh...@computer.org, pedant...@googlegroups.com
Hi Andreas and Aidan,

Did you know that there is a Sitemap with the Semantic Sitemap extension at
http://openean.kaufkauf.net/sitemap.xml
=> you just have to crawl every sc:dataDumpLocation entry and you get the
complete dataset!
Or can you explain which technique your crawlers are using?
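
The relevant part of the sitemap looks roughly like this (schematic and
shortened here, with the namespace quoted from the Semantic Sitemaps
proposal -- see the file itself for the actual entries):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd#">
  <sc:dataset>
    <sc:datasetLabel>OpenEAN data set</sc:datasetLabel>
    <sc:dataDumpLocation>http://openean.kaufkauf.net/id/file1.owl</sc:dataDumpLocation>
    <sc:dataDumpLocation>http://openean.kaufkauf.net/id/file2.owl</sc:dataDumpLocation>
    <!-- ... one entry per dump file ... -->
  </sc:dataset>
</urlset>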

Best regards,
Andreas

------------------------------------------
Dipl.-Ing. Andreas Radinger
Professur für Allgemeine BWL, insbesondere E-Business
e-business & web science research group
Universität der Bundeswehr München

e-mail: andreas....@unibw.de
phone: +49-(0)89-6004-4218

skype: andreas.radinger

Andreas Harth

Mar 30, 2010, 8:11:20 AM
to Andreas Radinger, aidan...@deri.org, tobias....@unibw.de, mh...@computer.org, pedant...@googlegroups.com
Hi,

Andreas Radinger wrote:
> Did you know that there is a Sitemap with the Semantic Sitemap extension at
> http://openean.kaufkauf.net/sitemap.xml
> => you just have to crawl every sc:dataDumpLocation entry and you get
> the complete dataset!

There are several problems with Semantic Sitemaps, the most crucial
being that there is no code available to parse and process the files.
The second problem is that some of the chunking algorithms that can be
defined do not scale. The third is that we do not want your entire
dataset, but only a fraction of the data, to give each data provider an
opportunity to get some data into the BTC dataset.

> Or can you explain which technique your crawlers are using?

We do lookups on URIs (cf. [1]).

By using redirects from thing-URIs to data-source-URIs instead of
server-side URI mappings, you would also solve the issue of serving
redundant data for other Linked Data clients.

Best regards,
Andreas.

[1] http://www.w3.org/DesignIssues/LinkedData.html

Andreas Harth

Mar 30, 2010, 8:25:27 AM
to marti...@ebusiness-unibw.org, pedant...@googlegroups.com, andreas....@unibw.de, tobias....@unibw.de, mh...@computer.org
Hi Martin,

Martin Hepp (UniBW) wrote:
> So you mean we should give a 302 status code pointing to the RDF/XML
> file instead of simply serving the file?

Yes, a 302 or 303 would do the trick.

Dublin Core uses 302s, while the Linked Data crowd seems to prefer 303s.

There's an issue with caching of redirects for those pedants who care.

The HTTP spec [1] says:

302 Found:
"This response is only cacheable if indicated by a Cache-Control or Expires
header field."

303 See Other
"The 303 response MUST NOT be cached, but the response to the second
(redirected) request might be cacheable."

I prefer 302s because those are cacheable.
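
A cacheable redirect would then look like this (a sketch; the max-age
value is arbitrary):

HTTP/1.1 302 Found
Location: http://openean.kaufkauf.net/id/file1.owl
Cache-Control: max-age=86400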

It's probably a minor issue, but hey it's the pedantic web list after all.

Best regards,
Andreas.

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.3

Martin Hepp (UniBW)

Mar 30, 2010, 8:14:58 AM
to Andreas Harth, Andreas Radinger, aidan...@deri.org, tobias....@unibw.de, mh...@computer.org, pedant...@googlegroups.com
So, as said: a 302 to the URI of the matching 46 kB RDF/XML file would
help, am I right?


--
--------------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail: he...@ebusiness-unibw.org
phone: +49-(0)89-6004-4217
fax: +49-(0)89-6004-4620

Martin Hepp (UniBW)

Mar 30, 2010, 8:06:20 AM
to Andreas Harth, marti...@ebusiness-unibw.org, pedant...@googlegroups.com, andreas....@unibw.de, tobias....@unibw.de, mh...@computer.org
Hi Andreas,

So you mean we should give a 302 status code pointing to the RDF/XML
file instead of simply serving the file?

I think that is a very good idea - it should solve your issues and be
feasible on our end.


I will check it internally and get back to you quickly.

Martin


--
--------------------------------------------------------------
martin hepp
e-business & web science research group

Ed Summers

Mar 30, 2010, 8:57:29 AM
to pedant...@googlegroups.com
On Tue, Mar 30, 2010 at 8:25 AM, Andreas Harth <ha...@kit.edu> wrote:
> 303 See Other
> "The 303 response MUST NOT be cached, but the response to the second
> (redirected) request might be cacheable."
>
> I prefer 302s because those are cacheable.

Fortunately it looks like HTTP (RFC 2616) is going to be tightened up
a bit with respect to the cacheability of 303s [1].

"""
A 303 response SHOULD NOT be cached unless it is indicated as
cacheable by Cache-Control or Expires header fields. Except for
responses to a HEAD request, the entity of a 303 response SHOULD
contain a short hypertext note with a hyperlink to the Location URI.
"""

I guess YMMV until this gets out of draft and is implemented, though. I
haven't actually looked to see what cache implementations do when
encountering a Cache-Control or Expires header with a 303.

//Ed

[1] http://tools.ietf.org/html/draft-ietf-httpbis-p2-semantics-09#section-8.3.4

Jonathan Rees

Mar 30, 2010, 9:49:22 AM
to Pedantic Web Group

On Mar 30, 8:25 am, Andreas Harth <ha...@kit.edu> wrote:
> Hi Martin,
>
> Martin Hepp (UniBW) wrote:
> > So you mean we should give a 302 status code pointing to the RDF/XML
> > file instead of simply serving the file?
>
> yes, a 302 or 303 would do the trick.
>
> Dublin Core uses 302 while the Linked Data crowd seems to prefer 303's.

Take another look - I believe the Dublin Core URIs do a 302 to a URI
that has a # in it: A 302-> B#C, B 200-> document describing A = B#C.
The setup is rather tricky, but it has the same effect as a 303: A
303-> B, B 300-> document describing A (but without introducing the
alias B#C for A).

If you believe in the httpRange-14 rule, as http://pedantic-web.org/fops.html#redirect
seems to, then you have to do either what DC does or use a 303.

> There's an issue with caching of redirects for those pedants who care.
>
> The HTTP spec [1] says:
>
> 302 Found:
> "This response is only cacheable if indicated by a Cache-Control or Expires
> header field."
>
> 303 See Other
> "The 303 response MUST NOT be cached, but the response to the second
> (redirected) request might be cacheable."
>
> I prefer 302s because those are cacheable.
>
> It's probably a minor issue, but hey it's the pedantic web list after all.

This is widely regarded to be a mistake in the HTTP RFC, and it has
been fixed in the HTTPbis draft. If you are aware of web clients or
proxies that fail to cache 303s I would be very interested to hear of
them.

See http://www.w3.org/2001/tag/group/track/issues/57

Best
Jonathan

Andreas Harth

Mar 30, 2010, 10:15:15 AM
to pedant...@googlegroups.com
Hi,

Jonathan Rees wrote:
> Take another look - I believe the Dublin Core URIs do a 302 to a URI
> that has a # in it: A 302-> B#C, B 200-> document describing A = B#C.
> The setup is rather tricky, but it has the same effect as a 303: A
> 303-> B, B 300-> document describing A (but without introducing the
> alias B#C for A).

Ah, OK. I didn't look that closely, but I note that the redirected
location contains a hash.

> This is widely regarded to be a mistake in the HTTP RFC, and it has
> been fixed in the HTTPbis draft. If you are aware of web clients or
> proxies that fail to cache 303s I would be very interested to hear of
> them.

Squid 2.7 didn't cache 303s last time I checked.

Best regards,
Andreas.

Jonathan Rees

Mar 30, 2010, 11:43:49 AM
to Pedantic Web Group

On Mar 30, 10:15 am, Andreas Harth <ha...@kit.edu> wrote:
> Hi,
>
> Jonathan Rees wrote:
> > Take another look - I believe the Dublin Core URIs do a 302 to a URI
> > that has a # in it: A 302-> B#C, B 200-> document describing A = B#C.
> > The setup is rather tricky, but it has the same effect as a 303: A
> > 303-> B, B 300-> document describing A (but without introducing the
> > alias B#C for A).

s/300/200/


>
> Ah, OK. I didn't look that closely, but I note that the redirected
> location contains a hash.
>
> > This is widely regarded to be a mistake in the HTTP RFC, and it has
> > been fixed in the HTTPbis draft. If you are aware of web clients or
> > proxies that fail to cache 303s I would be very interested to hear of
> > them.
>
> Squid 2.7 didn't cache 303s last time I checked.

Thanks! I've added this morsel to the TAG tracker. If all goes well
the developers will get a bug report when HTTPbis gets closer to
completion.

In this case, if there's a single RDF file that gets 303-referenced
multiple times, the cost will be at worst the multiple 303 requests
(not cacheable anyhow since they're all different), since the file,
having its own single URI, will be cached independently.

There's another option on the horizon, by the way: the LRDD protocol
or something like it. For nose-following, before doing a GET on a URI
with origin O seeking a 303 (especially when you have many URIs from
the same origin), look for a special file http://O/.well-known/host-meta,
and therein you might find one or more generic rewrite rules that map
URIs you want to nose-follow to URIs of the resources that describe
them. Then instead of doing a million GETs yielding 303s, you do a
million applications of a common rule, with no traffic at all.
Unfortunately .well-known is still working its way through IETF, and
as far as I know the linked data world hasn't got wind of this yet.
But it seems a plausible design to me, and addresses one of the big
complaints about 303, namely its network load. (Using # URIs doesn't
really help in this instance I think.)
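
Schematically, the host-meta document could contain a link template
along these lines (illustrative only -- the template URI is made up,
and the exact syntax is still in flux in the drafts):

<XRD xmlns="http://docs.oasis-open.org/ns/xri/xrd-1.0">
  <!-- hypothetical rule: the description of any URI u on this host
       lives at /describe?uri=u -->
  <Link rel="describedby"
        template="http://openean.kaufkauf.net/describe?uri={uri}" />
</XRD>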

Ed Summers

Mar 30, 2010, 12:33:36 PM
to pedant...@googlegroups.com
On Tue, Mar 30, 2010 at 11:43 AM, Jonathan Rees
<jonath...@gmail.com> wrote:

> Unfortunately .well-known is still working its way through IETF, and
> as far as I know the linked data world hasn't got wind of this yet.
> But it seems a plausible design to me, and addresses one of the big
> complaints about 303, namely its network load.

Is this the latest/greatest spec for the .well-known pattern?

http://tools.ietf.org/html/draft-nottingham-site-meta-05

//Ed

Jonathan Rees

Mar 30, 2010, 4:47:36 PM
to Pedantic Web Group
On Mar 30, 12:33 pm, Ed Summers <e...@pobox.com> wrote:
> On Tue, Mar 30, 2010 at 11:43 AM, Jonathan Rees
>
> <jonathan.r...@gmail.com> wrote:

> > Unfortunately .well-known is still working its way through IETF, and
> > as far as I know the linked data world hasn't got wind of this yet.
> > But it seems a plausible design to me, and addresses one of the big
> > complaints about 303, namely its network load.
>
> Is this the latest/greatest spec for the .well-known pattern?
>
>  http://tools.ietf.org/html/draft-nottingham-site-meta-05
>
> ?
>
> //Ed

I thought it was, but I looked and apparently it's finally an RFC
(standards track):

http://www.rfc-editor.org/authors/rfc5785.txt

host-meta, which builds on it and is informational, not standards
track, is here:

http://tools.ietf.org/id/draft-hammer-hostmeta-05.txt

It's still a draft. Example 1 shows an example of a link template. My
assumption has been that the "describedby" link relation (registered
in 5785) is what 303 is meant to communicate.

Peter Ansell

Mar 30, 2010, 7:35:50 PM
to pedant...@googlegroups.com

There is a reasonable semantic difference between the different
implications of 303 if you take caching into account, i.e.:

HTTP 303 overall: "not a substitute"; and

HTTP GET 303, cache off: "we cannot serve you <BAR>; an incomplete
description of <BAR> may be found at <FOO>" -- which implies a
semantic link between <BAR> and <FOO>, as it is defined to be a partial
description, but this may change at any time, so please come back
again to see if we have changed our mind; and

HTTP GET 303, cache on: "we cannot serve you <BAR>; an incomplete
description of <BAR> can be found at <FOO>" -- which defines a
(temporary) semantic link between <BAR> and <FOO>: within X minutes
etc., anyone who knows about this request can also tell you there is a
partial description available at <FOO>, and you can avoid <BAR>
altogether as it will not deviate; and

HTTP POST 303, cache on: "we accept and acknowledge the information in
your request to <BAR>; your next target should be <FOO>, and we
endorse <FOO> as the only place that an HTTP POST to <BAR> could follow
on to within the given cache time frame, so don't POST us the same
information at this URI even if it is important, as we may not
physically receive it anyway. Depending on how we are implemented
internally, you could still get us the information one more time using
HTTP GET if the request is short enough to fit into a URI. If you
actually want to see <FOO>, just go there, as there is no enduring
meaning to accessing <BAR> again with the same request" -- which
implies a very vague (temporary) semantic link between <BAR> and
<FOO>; and

HTTP POST 303, cache off: "we accept and acknowledge the information in
your request to <BAR>; your next target should be <FOO>, and we do not
endorse <FOO> as the only place that the same HTTP POST requests to
<BAR> could follow on to, but do not go back and try <BAR> again in
order to get to <FOO>; just go to <FOO> directly if you want the
information at <FOO>" -- which implies there is specifically not a
generalisable semantic link between <BAR> and <FOO>, as any successive
exactly equivalent POST request could have different implications on
the state of the resource, and overall, the next target could be
semantically unrelated to the first POST request, as it could be
related to the state of the server overall.

Why are GET (cache and no cache) and POST (with cache) so special in
this case that they may carry a semantic implication that POST
without cache specifically doesn't?

If a server does endorse the location as cacheable (perhaps
accidentally) for HTTP POST requests, then it would never even know
that successive POST requests occurred. That would really mess up
any servers that accidentally put cache headers on 303s, which were
unrecognised by caches in the past but now may be the cause of data
loss problems. Any implications of this change in definition for
current systems -- making the semantic link between the two locations
of a 303 hold in only some cases -- will be directly on the head of the
Linked Data/Semantic Web community and its obsession with indirect
referencing.

No wonder you are having issues trying to convince browsers that the
new draft is what they should implement. Browsers don't even show
users that they are using POST or GET unless they try to repeat an
action, and the idea of 303 with POST seemed (to me) to be to avoid
user-agents ever accidentally repeating state-changing actions, by
repositioning the user-agent, according to the server's knowledge of
the overall process, to something semantically different but safe from
damaging side-effects if the user-agent requested it with HTTP GET
without the parameters they sent to the original 303'd resource.

It is surprising that the HTTP version isn't changing given this
change in definition. How many changes in a standard are necessary for
versions to change? Are we applying a typical liberal Semantic Web
view on in-place-gradual-meaning-changes to our strict, traditional,
internet standards now? If we have to define which standard we are
following by referring to "HTTP/1.1 RFCNNNN" instead of just
"HTTP/1.1" it seems to make a mockery of the original definition, even
if the new RFC only fixes some issues or vagueness in a backwards
compatible manner in most cases.

This subject would be so much simpler if someone had just made up a
new status code for the new purpose!

Cheers,

Peter

Jonathan Rees

Mar 30, 2010, 8:06:55 PM
to Pedantic Web Group
Hi Peter - fancy meeting you here...

The change to the spec has this effect: Currently, if a 303 response
to a GET has an Expires: header, the client must ignore it, even
though the intent that the response be cacheable seems clear. Under
HTTPbis, the client may pay attention to the Expires: header and
interpret it in the usual way, just as for 302.

If there is no Expires: or Cache-control:, the response is not
cacheable, in either HTTP or HTTPbis. Probably the normal case.

I haven't studied the PUT case.

To say that a server is counting on a client ignoring an Expires:
header that the server itself provides seems a little twisted to me.
More likely the server is simply not generating an Expires: because it
thinks it would have no effect, or else is generating one because it
doesn't know about the bogus requirement to ignore it (perhaps
using code shared with 302 or whatever) and would therefore be happier
with the HTTPbis semantics.

You'll have to check my work, sorry... I'm sure you'll let me know if
I got this wrong.

http://tools.ietf.org/html/draft-ietf-httpbis-p2-semantics-08#section-8.3.4

And by the way the GET/303 case isn't even recognized by RFC 2616 as
something you'd ever do; probably another reason for its statement on
uncacheability.

If the above doesn't clear the matter up for you, you should take it
up with the HTTP WG. If you know of some deployed situation that would
be affected adversely it would be very important to call it out.

Jonathan



Peter Ansell

Mar 30, 2010, 9:46:08 PM
to pedant...@googlegroups.com
On 31 March 2010 10:06, Jonathan Rees <jonath...@gmail.com> wrote:
> Hi Peter - fancy meeting you here...

It's not like I am not pedantic. I just have different views on some
things. This mailing list is interesting.

> The change to the spec has this effect: Currently, if a 303 response
> to a GET has an Expires: header, the client must ignore it, even
> though the intent that the response be cacheable seems clear. Under
> HTTPbis, the client may pay attention to the Expires: header and
> interpret it in the usual way, just as for 302.
>
> If there is no Expires: or Cache-control:, the response is not
> cacheable, in either HTTP or HTTPbis. Probably the normal case.
>
> I haven't studied the PUT case.

I didn't think through that one either.

> To say that a server is counting on a client ignoring an Expires:
> header that the server itself provides seems a little twisted to me.
> More likely the server is simply not generating an Expires: because it
> thinks it would have no effect, or else is generating one because it
> doesn't realize about the bogus requirement to ignore it (perhaps
> using code shared with 302 or whatever) and would therefore be happier
> with the HTTPbis semantics.

That may be true, but before the thing goes to standard it would be
nice to verify that httpbis is as backwards compatible as possible.

> You'll have to check my work, sorry... I'm sure you'll let me know if
> I got this wrong.
>
> http://tools.ietf.org/html/draft-ietf-httpbis-p2-semantics-08#section-8.3.4
>
> And by the way the GET/303 case isn't even recognized by RFC 2616 as
> something you'd ever do; probably another reason for its statement on
> uncacheability.

I was mostly commenting on the way the new draft HTTP/1.1 RFC implies
new semantic relationships between a URI that responds with 303 and
its target, depending on the HTTP verb that is used. It may not be a
big issue, as it is actually the server's fault if it puts cache
headers on the results of 303 POST requests, but it doesn't seem to be
something that has been thought through very well in terms of
migration guidelines, which should be in that RFC if it is obsoleting
another one.

> If the above doesn't clear the matter up for you, you should take it
> up with the HTTP WG. If you know of some deployed situation that would
> be affected adversely it would be very important to call it out.

I don't know of any major use of 303 outside of Linked Data, but
overall, there is probably no way to know. Crawlers don't go around
submitting HTTP POST forms, and it is only after the submission that
you would ever know if the server was using 303, so statistics will be
pretty much impossible to find if you want a large sample. The issues
will only become evident after the change is implemented and
things start breaking. Turning caching off for a large website to
debug it might not be possible, and testing with caching servers
internally probably won't surface the bugs that you would find on the
internet.

Good luck contacting all of the cache and browser manufacturers to
evangelise the new changes.

I wouldn't however spend too much time on the following action item
(which appears to be closed but possibly hasn't been fixed) if there
are more important issues to attend to.

http://www.w3.org/2001/tag/group/track/actions/348

Browsers in general should put security above usability, and there are
ways for servers to abuse 303 headers that make security issues
invisible. Not knowing the location that the page was actually fetched
from (i.e., the last request that was used and that returned the HTML
document) using the address bar is a security issue in my opinion, and
it seems to be in some others' opinion as well, according to the bug on
Bugzilla. I don't personally see the benefit of being able to bookmark
the original URL and have the browser serve the page from a different
location URL that the user never sees. In particular, there may never
be a change in the URL the DOM sees as its base URL, as that would
complicate the HTML/DOM semantics, so it would be purely to aid the
act of bookmarking, and that can be done without the URL being in the
address bar to mask the actual page URL.

You could focus on trying to get the DOM model to be extended with
this piece of information but even that may still be an issue as the
rest of the page DOM may come from a different origin. You can see why
purl.org wants to have the change implemented, as otherwise they are
invisible, but that doesn't mean there aren't issues with its general
use.

Cheers,

Peter

Jonathan Rees

Mar 30, 2010, 10:09:20 PM
to Pedantic Web Group
On Mar 30, 9:46 pm, Peter Ansell <ansell.pe...@gmail.com> wrote:

> On 31 March 2010 10:06, Jonathan Rees <jonathan.r...@gmail.com> wrote:
>
> > Hi Peter - fancy meeting you here...
>
> It's not like I am not pedantic. I just have different views on some
> things. This mailing list is interesting.

Please don't take offense. Perhaps I misspoke. All I meant to say
was: Small world, nice to see you here too.

>
> > The change to the spec has this effect: Currently, if a 303 response
> > to a GET has an Expires: header, the client must ignore it, even
> > though the intent that the response be cacheable seems clear. Under
> > HTTPbis, the client may pay attention to the Expires: header and
> > interpret it in the usual way, just as for 302.
>
> > If there is no Expires: or Cache-control:, the response is not
> > cacheable, in either HTTP or HTTPbis. Probably the normal case.
>
> > I haven't studied the PUT case.
>
> I didn't think through that one either.
>
> > To say that a server is counting on a client ignoring an Expires:
> > header that the server itself provides seems a little twisted to me.
> > More likely the server is simply not generating an Expires: because it
> > thinks it would have no effect, or else is generating one because it
> > doesn't realize about the bogus requirement to ignore it (perhaps
> > using code shared with 302 or whatever) and would therefore be happier
> > with the HTTPbis semantics.
>
> That may be true, but before the thing goes to standard it would be
> nice to verify that httpbis is as backwards compatible as possible.
>
> > You'll have to check my work, sorry... I'm sure you'll let me know if
> > I got this wrong.
>

> > http://tools.ietf.org/html/draft-ietf-httpbis-p2-semantics-08#section...

I've already written a little piece to wrap up the investigation which
says pretty much what you say above. I just need about 1/2 hour to do
final edits and post it.

Best
Jonathan


Andreas Radinger

Apr 1, 2010, 4:55:51 AM
to pedant...@googlegroups.com, Andreas Harth, marti...@ebusiness-unibw.org, tobias....@unibw.de, mh...@computer.org
Dear all,

The configuration of the "OpenEAN data set for Semantic Web-based
E-Commerce" has been changed slightly.
Below you can find some examples showing which responses you will get.
Many thanks to Andreas Harth for the useful suggestions!

1.)
$ curl -I -L http://openean.kaufkauf.net/id/EanUpc_0001000000748
HTTP/1.1 303 See Other
Location: http://openean.kaufkauf.net/id/file1.owl
HTTP/1.1 200 OK
Last-Modified: Thu, 21 May 2009 01:47:42 GMT
ETag: "55001a-82197-46a625463ab80"
Content-Length: 532887

2.)
$ curl -I -L http://openean.kaufkauf.net/id/EanUpc_0004100380772
HTTP/1.1 303 See Other
Location: http://openean.kaufkauf.net/id/file3.owl
HTTP/1.1 200 OK
Last-Modified: Thu, 21 May 2009 01:47:42 GMT
ETag: "55001e-81d21-46a625463ab80"
Content-Length: 531745

3.)
$ curl -I -L http://openean.kaufkauf.net/id/businessentities/GLN_7905860000050
HTTP/1.1 303 See Other
Location: http://openean.kaufkauf.net/id/businessentities/
HTTP/1.1 200 OK
Last-Modified: Thu, 21 May 2009 01:50:44 GMT
ETag: "578002-988156-46a625f3cc500"
Content-Length: 9994582

We hope this solves some problems and enables caching.

Please let me know if you think that the following response would be the
better choice:
$ curl -I -L http://openean.kaufkauf.net/id/EanUpc_0001000000748
HTTP/1.1 302 Moved Temporarily
Expires: ......
Location: http://openean.kaufkauf.net/id/file1.owl#EanUpc_0001000000748
HTTP/1.1 200 OK
Last-Modified: Thu, 21 May 2009 01:47:42 GMT
ETag: "55001a-82197-46a625463ab80"
Content-Length: 532887


Best regards,
Andreas

------------------------------------------
Dipl.-Ing. Andreas Radinger
Professur für Allgemeine BWL, insbesondere E-Business
e-business & web science research group
Universität der Bundeswehr München

e-mail: andreas....@unibw.de
phone: +49-(0)89-6004-4218
skype: andreas.radinger


Peter Ansell

Apr 2, 2010, 5:16:58 PM
to pedant...@googlegroups.com, Andreas Harth, marti...@ebusiness-unibw.org, tobias....@unibw.de, mh...@computer.org
The way you are using 303 works given the current RFC. The only
difference is that the new draft "httpbis" RFC would allow you to also
add cache headers to the 303 response, in addition to the cache
headers on the 200 responses, so that clients do not have to repeatedly
check whether the 303 response redirects to the same location.
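
That is, under httpbis a response like this (a sketch; the max-age is
arbitrary) would let clients reuse the redirect for a while:

HTTP/1.1 303 See Other
Location: http://openean.kaufkauf.net/id/file1.owl
Cache-Control: max-age=86400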

Cheers,

Peter

