Tools to create XML Records for Batch ingests


Ernie Gillis

Aug 16, 2014, 1:11:38 PM
to isla...@googlegroups.com
Hi folks!
I am writing up various workflows for employees in my area to ingest content to Fedora Commons via Islandora.

One of these workflows covers how to efficiently import batches of assets along with their matching XML metadata files.

My questions are:
- What tools are people using to either create the metadata and crosswalk it to an XML file, and / or tools used to create the XML file?
- And… what are the typical XML formats being used for the metadata records that go in the batch? Dublin Core? MODS? RDF? FOXML? Something else?

I am a developer / designer / DB admin / systems admin for my library and archives, so writing the XML by hand is not a problem for me. Our archivist has a set of tools for creating the EADs and whatnot, but they are creating metadata mostly at the collection level at the moment.


Many thanks in advance!!
Ernie

Nick Ruest

Aug 16, 2014, 9:41:04 PM
to isla...@googlegroups.com
Hi Ernie-


On 14-08-16 01:11 PM, Ernie Gillis wrote:
> HI folks!
> I am writing up various workflows for employees in my area to ingest
> content to Fedora Commons via Islandora.
>
> One of these workflows is planning how to efficiently import batches of
> assets with their matching XML metadata file.
>
> My questions are:
> - What tools are people using to either create the metadata and
> crosswalk it to an XML file, and / or tools used to create the XML file?
> - And… what are the typical XML formats being use for the metadata
> records to go in the batch? Dublin Core? MODS? RDF? FOXML? Something else?
>

You're aware of Islandora XML Forms[1][2][3], right? It lets you create
whatever metadata form you would like against a given schema via the
Drupal Forms API. It also includes the ability to associate a given form
with a solution pack/content model, and you can set up additional XSLT
files to do any transforms that you'd like.

Batch ingest can be done via Islandora Batch[4]. Currently, Islandora
Batch is set up to accept DC or MODS descriptive metadata files upon
ingest. You can batch ingest just descriptive metadata, just archival
objects, or both.
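For the package layout itself, here's a minimal Python sketch of bundling objects with their metadata for a batch ingest. It assumes the batch module pairs each object file with its metadata file by shared basename (e.g. page001.tif + page001.xml) — check the docs for your version before relying on that:

```python
import zipfile
from pathlib import Path

def bundle_for_ingest(src_dir, zip_path):
    """Zip each TIFF together with its same-named .xml metadata file.

    Assumes the batch preprocessor pairs object and metadata files by
    shared basename -- an assumption to verify against your setup.
    """
    src = Path(src_dir)
    with zipfile.ZipFile(zip_path, "w") as zf:
        for tif in sorted(src.glob("*.tif")):
            xml = tif.with_suffix(".xml")
            if not xml.exists():
                raise FileNotFoundError(f"no metadata file for {tif.name}")
            zf.write(tif, tif.name)  # store at the zip root
            zf.write(xml, xml.name)
```

The hard failure on a missing metadata file is deliberate: it's easier to fix a gap before ingest than to hunt down an object with no descriptive record afterwards.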

Also, are you aware of the archival interest group?

-nruest

[1] https://github.com/Islandora/islandora_xml_forms
[2] https://wiki.duraspace.org/display/ISLANDORA713/XML+Form+Builder
[3] https://wiki.duraspace.org/display/ISLANDORA713/XML+Forms
[4] https://github.com/islandora/islandora_batch

> I am a developer / designer / db admin / systems admin for my library
> and archives, so writing the XML by hand is not problematic for myself.
> Our archivist has a set of tools for creating the EADs and what not, but
> they are creating more metadata on collection level at the moment.
>
>
> Many thanks in advance!!
> Ernie
>
> --
> For more information about using this group, please read our Listserv
> Guidelines: http://islandora.ca/content/welcome-islandora-listserv
> ---
> You received this message because you are subscribed to the Google
> Groups "islandora" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to islandora+...@googlegroups.com
> <mailto:islandora+...@googlegroups.com>.
> Visit this group at http://groups.google.com/group/islandora.
> For more options, visit https://groups.google.com/d/optout.

Ernie Gillis

Aug 16, 2014, 11:04:42 PM
to isla...@googlegroups.com
Thanks Nick!
Yes, I have used these tools before. I guess I'm trying to figure out how to avoid reinventing wheels and doing double data entry.

Say, for instance, I have 20 TIFF files and metadata pertaining to each file from an EAD (since the EAD describes the collection), and a good chunk of fields have the same values for each TIFF (e.g. "creator"). Some fields may be different for each TIFF (e.g. a caption might go in a note field or something, and it's different for each TIFF).

I am curious what the common practice is. I presume it is common for people to export a brief record 20 times from the EAD generator (e.g. Archivists' Toolkit), rename each XML file to match the name of each TIFF, and zip the files together. Once zipped, one would use the batch ingest, then manually edit the metadata for each TIFF with the form builder to add the caption field.
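That manual per-TIFF editing step could also be scripted. As a rough sketch (the CSV columns, filenames, and the choice of a MODS <note> element are all my own assumptions, not an Islandora convention): copy one collection-level template per TIFF and stamp in the per-item caption before zipping.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("", MODS_NS)

def stamp_records(template_path, csv_path, out_dir):
    """For each CSV row (columns: filename, caption), copy the
    collection-level MODS template, add the caption as a <note>,
    and save the result named to match the TIFF."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            tree = ET.parse(template_path)
            note = ET.SubElement(tree.getroot(), f"{{{MODS_NS}}}note")
            note.text = row["caption"]
            stem = row["filename"].rsplit(".", 1)[0]
            tree.write(out / f"{stem}.xml", encoding="UTF-8",
                       xml_declaration=True)
```

The idea is that student workers only ever touch the spreadsheet; the shared fields live once in the template.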

Though the process is easily accessible, thanks to the programming of the solution packs and form builder, I am always looking to streamline it as much as I can, which is why I am asking about tools (either built into Islandora solution packs or external to them).

My support staff is mostly undergrad work-study students, and (as much as I think they're awesome) the fewer steps I have to train them on and double-check, the better.

The list you gave is awesome; I also want to expand on it to learn as much as I can about what is being done, and with what :)

Sara Allain

Aug 21, 2014, 2:03:15 PM
to isla...@googlegroups.com
Hi Ernie,

We use a couple of different tools for creating XML for ingest. We're not currently deriving anything from EAD, though I expect that to change in the future, as we now have an AtoM install where all our archival descriptions will be created.

We also rely on undergraduate students to do a lot of the data entry for us. The workflow we've implemented is to have students create the metadata in a spreadsheet (we use Google Sheets to take care of version issues between students). I imagine that you could export the EAD as CSV to enter into a spreadsheet and have the students clean it up. In this way you could export the EAD once, delete everything that you don't want, and then have the students fill in the rest.

Once we have a spreadsheet that we're happy with, we use OpenRefine (aka Google Refine) to further parse the data (e.g. exploding multi-value content in the subject field). I export from OpenRefine using the custom templating export, applying a MODS template to the data so that I end up with one huge MODS file, which is then broken down and renamed. I've blogged about our workflow [1] and so far it's been good for us!
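The "broken down and renamed" step lends itself to a small script. A sketch under two assumptions about the export template (Sara's actual template may differ): the big file wraps the records in a single collection root, and each record carries an <identifier> to name the output file by.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

MODS_NS = "{http://www.loc.gov/mods/v3}"
ET.register_namespace("", "http://www.loc.gov/mods/v3")

def split_mods(big_file, out_dir):
    """Write each child <mods> record of the collection root to its own
    file, named from its <identifier> (falling back to a counter)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, rec in enumerate(ET.parse(big_file).getroot()):
        ident = rec.findtext(f"{MODS_NS}identifier") or f"record_{i:04d}"
        ET.ElementTree(rec).write(out / f"{ident}.xml",
                                  encoding="UTF-8", xml_declaration=True)
```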

Sara

jy

Aug 21, 2014, 2:07:33 PM
to isla...@googlegroups.com
OpenRefine looks awesome; I'll have to give it a go. I have a PHP script that parses a CSV into an array, cleans up the data, and creates an individual MODS record for each row. The initial script was a bit of work, but it's fairly easy to tweak for each project.

John 





--
╔══════════ ೋღ☃ღೋ ══════════╗
  ~ ~ ~ ~ ~ ~ ~ ~~ ~ ~ ~John ~~ ~ ~ ~ ~ ~ ~ ~ ~ 
╚══════════ ೋღ☃ღೋ ══════════╝

Jennifer

Aug 25, 2014, 6:56:04 PM
to isla...@googlegroups.com
I also use OpenRefine to clean and parse data from spreadsheets when we get them. I then export XML from OpenRefine and apply an XSLT to create MODS records. For EADs, I have an XSLT specific to that standard.

One of the issues with using the XML forms is that the majority of our data comes in batches, which the forms can't handle. This is why we use other tools like OpenRefine, oXygen, and a custom XSLT for MarcEdit.

Jennifer

Kelli Babcock

Aug 25, 2014, 7:28:55 PM
to isla...@googlegroups.com
This is a really interesting thread! Thanks for bringing it up, Ernie!

At the University of Toronto Libraries we use simple Dublin Core as the "common" schema to search across all collections within our main site (http://collections.library.utoronto.ca/), and then any collaborator with their own multi-site can use whichever standard they want, so long as it maps back to DC.

Right now our multi-sites mainly use DC, but we have a couple of sites that use MODS; one upcoming project will make use of the FSU Full MODS form[1] (speaking of which, does anyone have a full MODS to DC XSL handy?); and another uses Darwin Core[2].

Using DC as the standard to search across all collections seemed like the most logical thing to do when initially designing our repository, because we knew there would be multi-sites using many different metadata standards that would eventually have to map (easily) back to one common standard. But we've run across some issues relying on Dublin Core as our common standard... Islandora just seems to be designed to work better with MODS.

Tools:
- For DC, we provide a simple DC Excel template to our collaborators[3], which can then be converted to individual XML files, labeled by identifier, using a simple Java application built by our awesome, awesome programmer, Ken Yang[4]. The purpose is to enable people who aren't comfortable working in something like Google Refine to create DC XML from an easy-to-fill-in spreadsheet.

- For MODS, we are now piggybacking on Sara Allain's work with Google Refine, and our Digital Scholarship Librarian, Leslie Barnes, worked with another programmer in our department to develop a MODS split tool to grab individual XML files from the Google Refine results[5]. Using Google Refine isn't a workflow that many of our collaborators would be comfortable with, but it works great for internal workflows.
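As a rough illustration of the spreadsheet-to-DC step (a hypothetical stand-in for the Java tool described above, not its actual code — it assumes the spreadsheet is exported to CSV, the column headers are DC element names, and an `identifier` column supplies each filename):

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

def rows_to_dc(csv_path, out_dir):
    """Write one simple Dublin Core record per CSV row, saved as
    <identifier>.xml; blank cells are skipped."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            root = ET.Element(f"{{{OAI_DC}}}dc")
            for field, value in row.items():
                if value:
                    ET.SubElement(root, f"{{{DC}}}{field}").text = value
            ET.ElementTree(root).write(out / f"{row['identifier']}.xml",
                                       encoding="UTF-8",
                                       xml_declaration=True)
```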

Jennifer, would you be willing to share the EAD to MODS xslt you have? Coming down the pipeline we'll likely be ingesting EAD records from AtoM... Converting them to MODS makes good sense.

If it isn't totally obvious from this giant post, I'm always happy to talk about metadata workflows with anyone, and it would be great to get feedback on our own workflow if anyone wants to exchange ideas.

Total aside: has anyone else noticed that if you "associate" a DC-to-MODS XSL with a DC form in the form builder, and then update the DC metadata from within the form, the DC form loops and ends up populated with MODS after you save?

Take care,
Kelli

Attachments: dc_template_draft1.xlsx, mods-split.sh