On Wed, Aug 13, 2008 at 8:13 AM, Mark H. Wood <mw...@iupui.edu> wrote:
> Sad to say, there are probably as many automated ways of building
> batches as there are DSpace sites. What you do will depend on the
> form in which you can get the data.
This is my experience too. I wrote a tiny Python library of
DSpace-automation-stuff (with classes for building a contents file, a
dublin_core.xml file, a mapfile [yes, that's rare, but it has
happened], breaking up a namelist from a citation or from HTML, and
parsing a name) that I remix as needed for new projects. (Next on the
list to add to it: better file/folder management, because I'm so
error-prone when I write that stuff...)
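
For illustration only, here is a minimal sketch of what helpers for a contents file and a dublin_core.xml file might look like. It is not the library described above. The function names (build_dublin_core, build_contents, write_item) are hypothetical, and it assumes the DSpace Simple Archive Format layout of one directory per item holding a dublin_core.xml and a contents file.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def build_dublin_core(fields):
    """Return dublin_core.xml text from (element, qualifier, value) tuples.

    A missing qualifier is written as "none", as the Simple Archive
    Format expects. ElementTree escapes &, < and > in the values.
    """
    root = ET.Element("dublin_core")
    for element, qualifier, value in fields:
        dcvalue = ET.SubElement(
            root, "dcvalue", element=element, qualifier=qualifier or "none"
        )
        dcvalue.text = value
    return ET.tostring(root, encoding="unicode")


def build_contents(filenames):
    """Return the text of a contents file: one bitstream filename per line."""
    return "".join(f"{name}\n" for name in filenames)


def write_item(item_dir, fields, filenames):
    """Write dublin_core.xml and contents into one item directory.

    Copying the bitstreams themselves is left out of this sketch.
    """
    item_dir = Path(item_dir)
    item_dir.mkdir(parents=True, exist_ok=True)
    (item_dir / "dublin_core.xml").write_text(
        build_dublin_core(fields), encoding="utf-8"
    )
    (item_dir / "contents").write_text(build_contents(filenames), encoding="utf-8")
```

Something like write_item("batch/item_000", [("title", None, "A Title")], ["article.pdf"]) would then produce one item directory ready for the batch importer, once the PDF itself is copied in.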
I can see that I'll have to rewrite a lot of this to create SWORD
packages instead. So be it; I think SWORD is a better way and I'll be
able to do more with it. (I got to talking with some people long ago
about drop boxes for the repository, and it just plain broke my brain,
how hard that was going to be. SWORD makes it a good deal more
feasible to write drop boxes and hands-off gateways, I think.)
For my sins, I do a lot of HTML screenscraping -- back issues of
e-periodicals, mostly. That's all ad hoc, as no two e-periodicals have
the same HTML. It tends to be an 80/20 problem (give or take 10%,
depending on HTML quality and consistency): I can whack out most of the
metadata with regular expressions and my namelist/name parsers, but not
all of it. The remaining information is often lurking in PDFs, which
means handwork.
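
As a rough sketch of the regular-expression-plus-name-parser approach, under loud assumptions: the HTML structure (an h2 with class "title", a p with class "authors") is invented for the example, the function names are hypothetical, and names are assumed to arrive in "First Last" order. Real periodicals will need their own patterns.

```python
import re
from html import unescape

# Hypothetical markup; every e-periodical needs its own patterns.
SAMPLE = (
    '<h2 class="title">On Batches</h2>\n'
    '<p class="authors">Ann Lee, Bo Chen and Cy Diaz</p>'
)
TITLE_RE = re.compile(r'<h2 class="title">(.*?)</h2>', re.S)
AUTHORS_RE = re.compile(r'<p class="authors">(.*?)</p>', re.S)


def split_namelist(namelist):
    """Split a name list on commas, semicolons, '&' and the word 'and'.

    Assumes "First Last" order; "Last, First" lists would be split wrongly.
    """
    parts = re.split(r"\s*(?:;|,|\band\b|&)\s*", namelist)
    return [p.strip() for p in parts if p.strip()]


def parse_name(name):
    """Turn 'Mark H. Wood' into 'Wood, Mark H.' (last word = family name)."""
    given, _, family = name.strip().rpartition(" ")
    return f"{family}, {given}" if given else family


def scrape(html):
    """Pull title and authors out of one page; None or [] means handwork."""
    title = TITLE_RE.search(html)
    authors = AUTHORS_RE.search(html)
    return {
        "title": unescape(title.group(1)).strip() if title else None,
        "authors": (
            [parse_name(n) for n in split_namelist(unescape(authors.group(1)))]
            if authors
            else []
        ),
    }
```

Here scrape(SAMPLE) gives the title "On Batches" and the authors "Lee, Ann", "Chen, Bo" and "Diaz, Cy". A page that matches neither pattern comes back with a None title and an empty author list, which is the part that still needs handwork.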
I say all this in the hope of helping people understand where the
bounds of what's feasible lie for untalented scripters.