newb


jgronski

Nov 3, 2008, 12:42:52 AM
to foresite
Hello,

I am hoping to use your foresite library to parse examples of atom/rdf
resource maps. I have a couple questions about foresite:

1. Am I correct in understanding that the original intent of the
library was to parse and serialize JSTOR? If so, then do these JSTOR
resource maps exist in some form that is accessible on the web?

2. When running "from foresite import *" in Python I had to delete a
reference to "from utils import generateAtomContent" in parser.py to
import the library into Python. None of the files in the foresite
package refer to that function other than that import statement. It
seemed like a kosher thing to do at the time, but it is odd that I would
have to do so. Am I missing some sort of dependency? The Python egg
should have resolved it . . .

3. I have tried to install your library and have run into trouble
asking it to download an existing resource map.
Specifically, the one mentioned in your documentation:
"http://www.openarchives.org/ore/0.9/atom-examples/atom_dlib_maxi.atom"

I feel that perhaps I am way off track at this point and missing
something large. I am quite new to the protocol and code so please
have patience with my questions!

Performing the following commands:
+++++++++++++++
from foresite import *
remdoc = ReMDocument("http://www.openarchives.org/ore/0.9/atom-examples/atom_dlib_maxi.atom")
ap = AtomParser()
rem = ap.parse(remdoc)
+++++++++++++++

results in this trace:
++++++++++++
Traceback (most recent call last):
  File "/Users/jgronski/Desktop/foo.py", line 4, in ?
    rem = ap.parse(remdoc)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/foresite-0.9-py2.4.egg/foresite/parser.py", line 250, in parse
    rem = ResourceMap(uri_r[0])
IndexError: list index out of range
+++++++++++++

It's weird, 'cause I looked at the example and from my rough
understanding of ORE the example is correct. My thought is that
instead of using the XPath "/atom:entry/atom:link[@rel='self']/@href"
in parser.py to get the resource map, the one that seems to work is
"/atom:feed/atom:link[@rel='self']/@href".

But, again, this seems too basic to be a mistake. What am I missing?
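
For illustration, the two XPath expressions above differ only in the root element they expect. The following is a minimal sketch using Python's standard xml.etree.ElementTree, not foresite's own parser code; the sample documents, the example.org URLs, and the function name find_self_href are made up for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical, cut-down documents: one rooted at <feed>, one at <entry>.
FEED_DOC = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://example.org/rem/feed.atom"/>
</feed>"""

ENTRY_DOC = """<entry xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://example.org/rem/entry.atom"/>
</entry>"""


def find_self_href(xml_text, root_name):
    """Return the rel='self' href if the root is atom:<root_name>, else None."""
    root = ET.fromstring(xml_text)
    expected_tag = "{%s}%s" % (ATOM_NS["atom"], root_name)
    if root.tag != expected_tag:
        # Comparable to an XPath query coming back empty, after which
        # indexing the result with [0] would raise IndexError.
        return None
    link = root.find("atom:link[@rel='self']", ATOM_NS)
    return link.get("href") if link is not None else None


print(find_self_href(FEED_DOC, "feed"))    # matches: root is atom:feed
print(find_self_href(FEED_DOC, "entry"))   # no match: root is not atom:entry
```

A lookup written for an entry-rooted document finds nothing in a feed-rooted one, which would be consistent with the empty result behind the traceback above.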

- Jessica

Further possibly irrelevant details about the environment:
Mac OS X
Python 2.4.5
Foresite version 1.0 - downloaded here at http://code.google.com/p/foresite-toolkit/


Robert Sanderson

Nov 3, 2008, 7:06:19 AM
to fore...@googlegroups.com

Hi Jessica,

Thanks for your interest in Foresite and ORE!


> 1. Am I correct in understanding that the original intent of the
> library was to parse and serialize jstor? If so, then do these jstore
> resource maps exist in some form that is accessible on the web?

Yes, if you go to:
  http://foresite.cheshire3.org/stable/ore/(identifier)

you'll get the ReM for the object with the (identifier) id in JSTOR.
eg http://foresite.cheshire3.org/stable/ore/j100378

You may also be interested in my ORE wrapper around Flickr:
  http://foresite.cheshire3.org/flickr/ore/photo/(photo-id)

eg: http://foresite.cheshire3.org/flickr/ore/photo/2387359614

There are also wrappers for photosets, pools and annotations.


> 2. When running "from foresite import *" in python I had to delete a
> reference to "from utils import generateAtomContent" in parser.py to
> import the library into python.

*blush* Yes, that was what I used to call build_html_atom_content.  Have fixed and published a 1.0-1 release.  Not sure how that one got through!

> 3. I have tried to install your library and have run into troubles
> asking your library to download an existing resource map.
> Specifically, the one mentioned in your documentation:
> "http://www.openarchives.org/ore/0.9/atom-examples/atom_dlib_maxi.atom"


For this one, it's a versioning problem.  The Atom parser now works on the 1.0 specification's Atom serialization, rather than 0.9, which is very different.
If you want to parse 0.9 Atom, there's the foresite.parser.OldAtomParser class, which should do the trick.

Another versioning problem, which is also fixed in SVN, is where the parser looks for the URI of the Aggregation in the 1.0 spec.  Too much involvement with spec writing, not enough with code updating :)

Both are now fixed in the new release and in SVN, along with updated docs.

With the fixed code, try for example:
------------------------------
[cheshire@aglarond ~]$ python
Python 2.6b2 (r26b2:65082, Aug 14 2008, 14:14:00)
>>> from foresite import *
>>> rd = ReMDocument('http://www.openarchives.org/ore/1.0/atom-examples/atom_arXiv_maxi.atom')
>>> ap = AtomParser()
>>> rem = ap.parse(rd)
>>> rem.aggregation.title
[rdflib.Literal('Parametrization of K-essence and Its Kinetic Term', language=None, datatype=None)]
-------------------------------

Hope that helps!  Let us know how you get on :)

Rob

Jessica Gronski

Nov 4, 2008, 2:02:30 AM
to fore...@googlegroups.com
Rob,
Thanks for responding so promptly! I checked out the newest version
and all examples in the directory seem to work as advertised.

My question: when following your link to the JSTOR ReM (Resource Map, I
presume) in my browser, I didn't find an RDF or Atom serialization as I
expected. Also, when treating the example JSTOR link as a resource map
in the foresite library, the parser was unable to parse the URI,
making me think that this URI is not a resource map. How does one find
the serialized version of the resource map for, say,
http://foresite.cheshire3.org/stable/ore/j100378
?

-Jessica


azar...@gmail.com

Nov 4, 2008, 6:36:59 AM
to fore...@googlegroups.com
Hi Jessica,

Sorry, I should have said that the URI is for the aggregation, which should then redirect your browser to a resource map, depending on which sorts of documents the browser is set to accept. Mine (Firefox 2.x, Linux) takes me to the RDFa serialization at:

http://foresite.cheshire3.org/stable/ore/j100378/rdfa.html

If you want to go straight to a resource map, then my particular setup is the following:

URI-A +
/rdfa.html --> simple block of RDFa, suitable for importing into HTML
/rdf.xml --> RDF/XML (striped)
/pretty.xml --> RDF/XML (a more 'pretty' style)
/rem.n3 --> n3 style rdf
/rem.nt --> ntriples
/rem.turtle --> turtle
/atom.xml --> atom

I'll fix the RemDoc implementation to send appropriate accept headers and follow the redirection.
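
As a rough sketch, the suffix scheme above can be written as a small lookup table. The suffixes are the ones listed in this thread for this particular server; the dictionary keys and the function name resource_map_uri are only labels chosen for the example, and other ORE servers may lay out their URIs differently.

```python
# Suffixes as listed in this thread for the foresite.cheshire3.org setup.
# The dictionary keys are hypothetical labels, not names used by foresite.
SERIALIZATIONS = {
    "rdfa": "/rdfa.html",
    "rdfxml": "/rdf.xml",
    "rdfxml-pretty": "/pretty.xml",
    "n3": "/rem.n3",
    "ntriples": "/rem.nt",
    # Listed above as /rem.turtle; corrected to /rem.ttl in the follow-up.
    "turtle": "/rem.ttl",
    "atom": "/atom.xml",
}


def resource_map_uri(uri_a, serialization):
    """Append the suffix for one serialization to the aggregation URI (URI-A)."""
    return uri_a.rstrip("/") + SERIALIZATIONS[serialization]


print(resource_map_uri("http://foresite.cheshire3.org/stable/ore/j100378", "atom"))
```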

Rob

azar...@gmail.com

Nov 4, 2008, 7:39:21 AM
to fore...@googlegroups.com
On Nov 4, 2008 11:36am, azar...@gmail.com wrote:
> /rem.turtle --> turtle

This should be /rem.ttl, following the recommendation in the turtle spec.

> I'll fix the RemDoc implementation to send appropriate accept headers and follow the redirection.

Changes are in the Subversion repository now; however, I've also fixed the server-side code to not break if it doesn't get an Accept header, defaulting to Atom, so it's not strictly necessary.
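
For anyone fetching a resource map outside of foresite, the idea of sending an Accept header and following the redirect can be sketched with the Python 3 standard library (the thread itself uses Python 2.4/2.6, so this is not the ReMDocument code). The function names are hypothetical, and which media types a given server honours is an assumption.

```python
import urllib.request


def build_rem_request(uri_a, accept="application/atom+xml"):
    """Build (but do not send) a GET request for URI-A with an Accept header."""
    return urllib.request.Request(uri_a, headers={"Accept": accept})


def fetch_resource_map(uri_a, accept="application/atom+xml"):
    """Send the request; urllib follows HTTP redirects by default."""
    with urllib.request.urlopen(build_rem_request(uri_a, accept)) as resp:
        # geturl() gives the final URL after any redirect to the resource map.
        return resp.geturl(), resp.headers.get("Content-Type"), resp.read()
```

Keeping request construction separate from sending makes it easy to check the header without touching the network.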

Hope that helps!

Rob

Jessica Gronski

Nov 6, 2008, 1:22:20 AM
to fore...@googlegroups.com
Again, thanks for the help!

OK, so here's what seemed to work given the original JSTOR example URI

URI-A= http://foresite.cheshire3.org/stable/ore/j10037

using the canonical script

rm = ReMDocument('URI-A/[file]')
rp = RdfLibParser()
ap = AtomParser()
rd = rp.parse(rm)   # or: rd = ap.parse(rm)
:

> URI-A +
> /rdfa.html --> simple block of RDFa, suitable for importing into HTML

> /rem.n3 --> n3 style rdf
> /rem.nt --> ntriples

After setting the format type ('rdfa', 'n3', 'nt'), these parsed fine.

> /rdf.xml --> RDF/XML (striped)
> /pretty.xml --> RDF/XML (a more 'pretty' style)

The RDF parser worked without setting the ReMDocument format type.

> /rem.turtle --> ttl
This doesn't work, but that makes sense since the rdflib you use for
Python doesn't have a Turtle parser (just a generator).

> /atom.xml --> atom
Here, however, there seems to be some sort of problem. Looking at
the file pulled from URI-A/atom.xml, it doesn't seem to be an Atom
file.

To show what I did I attached a file to this message. Then it can be
completely obvious what I'm doing.

Another unrelated question:

Is there an aggregation describing all these aggregations (i.e.
URI-A)? Basically, I'm curious if your repository has some sort of
batch discovery mechanism. I see that there are some standard
approaches outlined here:
http://www.openarchives.org/ore/1.0/discovery

but all require access to a sitemap, some sort of aggregated resource.
I see there are a number of data providers out there and I imagine
there is some way to browse all those resource aggregations that's
standard.

-Jessica

foo.py

Robert Sanderson

Nov 6, 2008, 6:36:09 AM
to fore...@googlegroups.com

Hi Jessica,

>> URI-A +
>> /rdfa.html --> simple block of RDFa, suitable for importing into HTML
>> /rem.n3 --> n3 style rdf
>> /rem.nt --> ntriples
>
> after setting the format type('rdfa', 'n3', 'nt') this parsed fine


Yep, in the latest SVN code it tries to set the format and the mimeType from the response headers, so this shouldn't be necessary in the future, but not every server side implementation will do the right thing (or use the same content-types) so it's probably safest to set them explicitly anyway.
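
One way to picture guessing the format from response headers is a small lookup from Content-Type to a format name. This is only a sketch and not the SVN code: the media types in the table are assumptions about what a server might send, the 'rdfa', 'n3' and 'nt' names are the ones used in this thread, and 'xml' and 'atom' are placeholder names.

```python
# Assumed mapping; real servers may send different media types.
CONTENT_TYPE_TO_FORMAT = {
    "application/rdf+xml": "xml",      # placeholder format name
    "application/atom+xml": "atom",    # placeholder format name
    "text/rdf+n3": "n3",
    "text/plain": "nt",
    "text/html": "rdfa",
    "application/xhtml+xml": "rdfa",
}


def guess_format(content_type, default=None):
    """Map a Content-Type header value to a format name, or return default."""
    if not content_type:
        return default
    # Drop parameters such as '; charset=utf-8' and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return CONTENT_TYPE_TO_FORMAT.get(media_type, default)
```

Returning a caller-supplied default when the header is missing or unrecognised matches the advice above to set the format explicitly where a server cannot be trusted.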

 
>> /rdf.xml --> RDF/XML (striped)
>> /pretty.xml --> RDF/XML (a more 'pretty' style)
>
> the rdf parser worked without setting the ReMDocument format type
>
>> /rem.turtle --> ttl
> this doesn't work but that makes sense since the rdflib you use for
> python doesn't have a turtle parser (just a generator)


Yeah, rdflib is pretty good, but it doesn't quite have a parser for every format it can serialize.
 
>> /atom.xml --> atom
> Here however there seems to be some sort of problem here. Looking at
> the file pulled from URI-A/atom.xml it doesn't seem to be an atom
> file.

What you should be getting is an atom entry document, rather than an atom feed.
In ORE 1.0 we moved away from mapping feed -> aggregation and entry -> aggregated resource in order to allow people to put aggregations into their own feeds and copy them around.  It's the aggregation that's the unit of interest, and in atom that goes into an entry... the feed is just a convenient location to find them at.

http://foresite.cheshire3.org/stable/ore/j100378/atom.xml returns what looks like an atom entry doc, but actually doesn't map the default namespace to atom, and hence the parser doesn't work. Oops!  Will Fix.
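
To see why a missing default namespace defeats a namespace-aware parser, here is a minimal sketch with Python's standard xml.etree.ElementTree. The two tiny documents are made up for the example and are not what the server actually returns; is_atom_entry is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

# Made-up documents: identical except for the default namespace declaration.
WITH_NS = '<entry xmlns="http://www.w3.org/2005/Atom"><title>t</title></entry>'
WITHOUT_NS = '<entry><title>t</title></entry>'


def is_atom_entry(xml_text):
    """True only if the root element is <entry> in the Atom namespace."""
    return ET.fromstring(xml_text).tag == "{%s}entry" % ATOM


print(is_atom_entry(WITH_NS))
print(is_atom_entry(WITHOUT_NS))
```

Without the xmlns declaration the root is a plain, un-namespaced entry element, so any lookup for atom:entry comes back empty even though the document looks right to the eye.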

 
> Is there an aggregation describing these all these aggregations (ie.
> URI-A)? Basically, I'm curious if your repository has some sort of
> batch discovery mechanism. I see that there are some standard
> approaches outlined here:
> http://www.openarchives.org/ore/1.0/discovery


At the moment it doesn't, I'm afraid, for a few reasons:

1.  JSTOR would be (understandably) very cross if I gave away all of their data wholesale via PMH or similar.  At the moment I'm already treading on a very fine line with the license and their good will, but the idea is that they will host the service themselves (and can hence log and do whatever else they want to do)
2.  The use case for doing this is to enable visualization and exploration of known documents, rather than a discovery/publication service.

Technically: 
3.  It's all static data, so PMH wouldn't be very useful as the publication time for every document would be the same.
4.  Sitemaps have a limit of 50,000 URLs each and there are 4+ million to deal with!  The journal aggregations could go into a sitemap, however.
5.  Atom Feeds would be static and I'm not sure what would actually go in them ... a feed with just the journal aggregations as per sitemaps?
6.  As the lowest level of AR are all on JSTOR, it's the zero knowledge case for Resource Embedding, so that doesn't help either.

And:
7.  I haven't gotten around to it yet! :)
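
To put point 4 in numbers: at 50,000 URLs per sitemap file, 4 million URLs would need 80 files, which the sitemaps.org protocol allows to be tied together with a sitemap index file. The sketch below only shows the arithmetic and the batching; the function names are made up for the example, and whether this approach would suit the JSTOR data is a separate question.

```python
import math

SITEMAP_URL_LIMIT = 50000  # per-file limit mentioned in point 4


def sitemaps_needed(total_urls, limit=SITEMAP_URL_LIMIT):
    """Number of sitemap files needed to hold total_urls URLs."""
    return math.ceil(total_urls / limit)


def chunk_urls(urls, limit=SITEMAP_URL_LIMIT):
    """Yield successive lists of at most `limit` URLs, one per sitemap file."""
    batch = []
    for url in urls:
        batch.append(url)
        if len(batch) == limit:
            yield batch
            batch = []
    if batch:
        yield batch


print(sitemaps_needed(4000000))  # 4,000,000 / 50,000
```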

 
> but all require access to a sitemap, some sort of aggregated resource.
> I see there are a number of data providers out there and I imagine
> there is some way to browse all those resource aggregations that's
> standard.

You might be interested, on the browser front, in my firefox/greasemonkey plugin:

http://www.csc.liv.ac.uk/~azaroth/foresite-explorer.user.js

Which finds the jstor ID in the URL and adds an SVG viz layer into the page.  It also works (99%) for flickr and amazon wishlists, as examples which everyone can experiment with.  Any feedback would be greatly appreciated, other than about the lack of documentation, which is coming :)

I'm also prodding Ross (a PhD student in my dept who wrote the OREsome viz client for the RepoCamp challenge) to tidy up and release the processing java code for his stuff.

Hope that helps!

Rob

Jessica Gronski

Nov 8, 2008, 8:54:53 PM
to fore...@googlegroups.com
>>
>> Is there an aggregation describing these all these aggregations (ie.
>> URI-A)? Basically, I'm curious if your repository has some sort of
>> batch discovery mechanism. I see that there are some standard
>> approaches outlined here:
>> http://www.openarchives.org/ore/1.0/discovery
>
>
> At the moment it doesn't, I'm afraid, for a few reasons:
>
> 1. JSTOR would be (understandably) very cross if I gave away all of their
> data wholesale via PMH or similar. At the moment I'm already treading on a
> very fine line with the license and their good will, but the idea is that
> they will host the service themselves (and can hence log and do whatever
> else they want to do)

Ok, that makes sense that they want control. I could make the case
that all an aggregation does is give away the metadata, but I could
see how that in and of itself could be considered too much. Heck,
Facebook doesn't want its social network given away!

> 2. The use case for doing this is to enable visualization and exploration
> of known documents, rather than a discovery/publication service.

Yep! Don't want to set up jstor, just want some examples of Resource
Maps to experiment with. You wouldn't happen to have a lead on some
people who are willing to allow me to use their data . . . ? I've just
started digging into the data providers using the PMH protocol. I ran
a perl script discovering the different metadata formats available
from these data providers
(http://www.openarchives.org/Register/BrowseSites) but no one seems to
be using the ORE resource map format to publish their metadata
content.
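
For reference, the kind of survey described above can be sketched in Python rather than Perl. ListMetadataFormats is a standard OAI-PMH verb; the sample response, the base URL in the usage comment, and the function names are made up for the example and are not taken from any real repository.

```python
import xml.etree.ElementTree as ET
from collections import Counter

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up, cut-down response for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>rdf</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def list_formats_url(base_url):
    """Build a ListMetadataFormats request URL for an OAI-PMH base URL."""
    return base_url + "?verb=ListMetadataFormats"


def metadata_prefixes(response_xml):
    """Extract every metadataPrefix value from a ListMetadataFormats response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iterfind(".//oai:metadataPrefix", OAI_NS)]


# Tally prefixes across repositories; here only the one sample response.
counts = Counter()
counts.update(metadata_prefixes(SAMPLE_RESPONSE))
print(counts)
```

In a real run, each repository's response would be fetched from list_formats_url(base_url) and fed into the same counter.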

>
> Technically:
> 3. It's all static data, so PMH wouldn't be very useful as the publication
> time for every document would be the same.
> 4. Sitemaps have a limit of 50,000 URLs and there's 4+ million to deal
> with! The journal aggregations could go in to a sitemap however.
> 5. Atom Feeds would be static and I'm not sure what would actually go in
> them ... a feed with just the journal aggregations as per sitemaps?
> 6. As the lowest level of AR are all on JSTOR, it's the zero knowledge case
> for Resource Embedding, so that doesn't help either.
>
> And:
> 7. I haven't gotten around to it yet! :)
>

:P Fair enough.

I can't say that I exactly follow all your points about the difficulty
of these formats, but the foresite forum doesn't seem like the best
place to have that kind of back-and-forth chat.

>
>>
>> but all require access to a sitemap, some sort of aggregated resource.
>> I see there are a number of data providers out there and I imagine
>> there is some way to browse all those resource aggregations that's
>> standard.
>
> You might be interested, on the browser front, in my firefox/greasemonkey
> plugin:
>
> http://www.csc.liv.ac.uk/~azaroth/foresite-explorer.user.js
>
> Which finds the jstor ID in the URL and adds an SVG viz layer into the
> page. It also works (99%) for flickr and amazon wishlists, as examples
> which everyone can experiment with. Any feedback would be greatly
> appreciated, other than about the lack of documentation, which is coming :)


Hmmm. So, while I have used Greasemonkey in the past, I've never
tinkered with the scripts myself. I'm not quite clear how this script
is supposed to alter what I see. I imagine it visualizes resource maps,
but when I go to a site through Firefox (say
http://foresite.cheshire3.org/stable/ore/j100378/rdf.xml) nothing
seems to be different. Perhaps a loading problem? (Specs: Firefox
3.01, Mac OSX) The script does seem to be installed. On the off chance
the script was supposed to alter the way Flickr/JSTOR sites were seen, I
went to those sites as well but didn't notice a difference there
either.

>
> I'm also prodding Ross (a PhD student in my dept who wrote the OREsome viz
> client for the RepoCamp challenge) to tidy up and release the processing
> java code for his stuff.

I saw the page for the RepoCamp Challenge and that's what led me to
the foresite library! Congrats to Ross! I didn't realize he was your
student.


Jessica Gronski

Nov 8, 2008, 9:19:00 PM
to fore...@googlegroups.com
> Yep! Don't want to set up jstor, just want some examples of Resource
> Maps to experiment with. You wouldn't happen to have a lead on some
> people who are willing to allow me to use their data . . . ? I've just
> started digging into the data providers using the PMH protocol. I ran
> a perl script discovering the different metadata formats available
> from these data providers
> (http://www.openarchives.org/Register/BrowseSites) but no one seems to
> be using the ORE resource map format to publish their metadata
> content.


Oh yeah, here are most of the formats supported by various
repositories (see attached file; the numbers are their counts). I see
now that two support the rdf format, which may or may not be a resource
map (DSpace at MIT and edocUR - Universidad del Rosario). I guess I
should check them out and see if they contain serialized resource
maps.

-Jessica

tmp

Jessica Gronski

Nov 9, 2008, 11:51:42 PM
to fore...@googlegroups.com
>> You might be interested, on the browser front, in my firefox/greasemonkey
>> plugin:
>>
>> http://www.csc.liv.ac.uk/~azaroth/foresite-explorer.user.js
>>
>> Which finds the jstor ID in the URL and adds an SVG viz layer into the
>> page. It also works (99%) for flickr and amazon wishlists, as examples
>> which everyone can experiment with. Any feedback would be greatly
>> appreciated, other than about the lack of documentation, which is coming :)
>
>
> Hmmm. So, while I have used greasemonkey in the past I've never
> tinkered with the scripts myself. I'm not quite clear how this script
> is supposed to alter what I see. I imagine it visualizes resource maps
> but when I go to a site through firefox (say
> http://foresite.cheshire3.org/stable/ore/j100378/rdf.xml) nothing
> seems to be different. Perhaps a loading problem? (Specs: Firefox
> 3.01, Mac OSX) The script does seem to be installed. On the off chance
> the script was supposed to alter the way flikr/jstor sites were seen I
> went to those sites as well but didn't notice a difference there
> either.

Wow! Ok, I see the addition now. There's a mini button with the ORE
symbol on it when I got to this page. I like it! It's a nice way to
navigate and certainly easier than browsing the xml. It makes it
easier to think about aggregation objects.

-Jessica

It however works for this journal:
http://www.jstor.org/stable/j100001

Jessica Gronski

Nov 10, 2008, 12:10:38 AM
to fore...@googlegroups.com
On Sun, Nov 9, 2008 at 8:51 PM, Jessica Gronski <jgro...@gmail.com> wrote:
>>> You might be interested, on the browser front, in my firefox/greasemonkey
>>> plugin:
>>>
>>> http://www.csc.liv.ac.uk/~azaroth/foresite-explorer.user.js
>>>
>>> Which finds the jstor ID in the URL and adds an SVG viz layer into the
>>> page. It also works (99%) for flickr and amazon wishlists, as examples
>>> which everyone can experiment with. Any feedback would be greatly
>>> appreciated, other than about the lack of documentation, which is coming :)
>>
>>
>> Hmmm. So, while I have used greasemonkey in the past I've never
>> tinkered with the scripts myself. I'm not quite clear how this script
>> is supposed to alter what I see. I imagine it visualizes resource maps
>> but when I go to a site through firefox (say
>> http://foresite.cheshire3.org/stable/ore/j100378/rdf.xml) nothing
>> seems to be different. Perhaps a loading problem? (Specs: Firefox
>> 3.01, Mac OSX) The script does seem to be installed. On the off chance
>> the script was supposed to alter the way flikr/jstor sites were seen I
>> went to those sites as well but didn't notice a difference there
>> either.
>
> Wow! Ok, I see the addition now. There's a mini button with the ORE
> symbol on it when I got to this page. I like it! It's a nice way to
> navigate and certainly easier than browsing the xml. It makes it
> easier to think about aggregation objects.
>
> -Jessica
>

Feedback-ish thoughts:
1. On the UI level you might consider putting in a way to "exit" the
visualization mode.
2. Does this mean that citations are not included in the JSTOR
aggregations? The couple of JSTOR articles I've looked at don't seem to
have citation relationships with anything. (Looking at j100378, the
Stanford Law Review, again.)
3. The "included in" relationship makes for a strange graph because
two different nodes represent the same thing after expanding the
"included in" node.
4. When there are many issues in a journal the number of nodes seems
overwhelming. Perhaps you should be able to type in a number?

Cheers! I hope you find some of the feedback helpful.

Robert Sanderson

Nov 10, 2008, 6:55:27 AM
to fore...@googlegroups.com

Heyas,

Re Discovery:  I'm not sure if anyone is doing OAI-PMH + ORE yet.  Perhaps you might want to ask on the main ORE google list as well?


>> Wow! Ok, I see the addition now. There's a mini button with the ORE
>> symbol on it when I got to this page. I like it! It's a nice way to
>> navigate and certainly easier than browsing the xml. It makes it
>> easier to think about aggregation objects.
>
> Feedback-ish thoughts:
> 1. On the UI level you might consider putting in a way to "exit" the
> visualization mode.

There's an X in the top right of the main viz pane, but it could probably do with being more obvious, and/or on the little options pane. 

 
> 2.  does this mean that citations are not included in the jstor
> aggregations? The couple jstor articles I've looked at don't seem to
> have citation relationships with anything. (Looking at j100378, the
> stanford law review, again.)


There are some citations in JSTOR which can be exposed as links, but not all journals and articles have them.

> 3. the "included in" relationship makes for a strange graph because
> two different nodes represent the same thing after expanding the
> "included in" node.

Yes, agreed.  There's some code in there (but turned off) that makes it into a proper graph where, if you expand included-in and then includes, it links back to the original node. However, dragging the graph around and expanding/contracting other bits causes havoc.  The current thought is to somehow grey out the duplicate and make the original one highlight when you hover over the dupe (or something like that?)
 
> 4. when there are many issues in a journal the number of nodes seems
> overwhelming. Perhaps you should be able to type in a number?

Also most definitely agreed, I think this is one of the bigger issues in ORE in general -- how to deal with very large aggregations at any level. 
By type in a number, do you mean set, in the options pane, a maximum number of nodes to expand per relationship?  Or to be able to jump to the Nth expanded node by typing in N?

Many thanks!

Rob