Lessons from real data and actual ingest workflows: an IR scenario / late-night tests


Diego Pino

Aug 12, 2020, 1:55:33 PM
to archipelago commons
Hi,

Some sharing/bonding here. We have been helping some Library friends get started with an IR Archipelago to test some use cases (no eye candy yet, sorry), and last night I finished the 4th batch ingest (3 initial attempts while I learned) of 1208 objects into 9 collections, including over 4000 files, from MIDI music to (many) PDFs.

I used and abused this chance to test some workflows, experiment with Twig templates and some new modules, test PDF text extraction, do some patching and, of course, some debugging of Drupal 8.

In case someone is interested, the workflow:
- Data comes from a commercial repo
- It was in a spreadsheet with no formal schema and a weird format
- Files and attachments were all in folders
- A Symfony app converts the CSV to JSON; a Drush command wraps it all together and creates the Digital Objects.
- A lot of coffee, LibreOffice and Diet Coke were involved.

In detail, what I did:
1. Wrote a Symfony 5 console app that parses rows and normalizes and cleans the data. It also uses Archipelago as a Linked Open Data endpoint (because we do that) to convert simple keywords into actual WIKIDATA Linked Data elements (with labels, IRIs/URIs, etc.), and we cache results so Wikidata does not suffer from our flood. We keep the ones that had no result in WIKIDATA in our metadata anyway; maybe in the future our own tools (rolls eyes) will allow for replacement tools/approaches when that happens.

  * Learned lesson:
    A lot is in wikidata.org! (A curl sketch after this list shows the kind of lookup involved.)
    e.g. a label like "Distributed computation: the new wave of synthetic biology devices" leads to a Q item: "uri": "http:\/\/www.wikidata.org\/entity\/Q38003818"
    but it is also funny that "label": "User Oriented Design" leads to no Q item! (should not surprise me, really)

  


   - Also deals with some date formatting (always so complex, why?) and some splitting of agents, emails, etc.

   * Lesson learned.
      From a single date I generate 3 versions for ease of later computing, Solr indexing and display: human friendly but still PHP computable, full ISO 8601 to the second, and a timestamp (there is a small shell sketch after this list).
      Did you know Unix timestamps for very old things are negative? I also learned that lesson.
      Encoding: make sure your JSON uses the right character encoding! (Hint: escaping UTF-8 the JS way is the way.)
 
   - Passes the cleaned data through a custom Twig template that transforms the array (columns and values) into proper JSON and then dumps it to a file, one JSON file per object.
    All this runs via command line and is quite simple to execute.

  - Processing of the 2176 lines takes about 30 minutes. It could be faster, but the Archipelago I'm using as the LoD endpoint resolver runs on OSX/Docker (my computer is so slow sometimes), so not that fast.

2. Once that is done, there is a simple bash script that matches JSON files to folders of files, e.g. 1641181.json with a folder named 1641181/ containing a lot of different files.
   - The bash script runs inside Docker because it uses the Strawberryfield Drush service (so it depends on Drupal), which duck-tapes (good brand) files and JSON together and puts it all into the system via the JSON API.
   - This (running on AWS, so I scp-ed everything up there) was SO fast. I can not even believe we put 4000 large files into Archipelago (S3/minio) in less than an hour; some were quite large, and we did checksumming, EXIF extraction, classification, etc.

   - Ingest went quite well (even some failed ones could be fixed manually)

    * Learned lessons.
     I want to keep track of the generated UUIDs during this, add a few partial-update commands to drush, and also DO more prechecks before adding assets to Drupal via the JSON API. The JSON API has no rollbacks, and deleting is always more time consuming. Archipelago is doing a great job at removing unused files when deleting objects; happy we coded all that instead of depending on core.
     Dealing with assets in S3 can be daring if you run apps via the terminal to extract HOCR/PDF text, etc. Most of the code I have found assumes you have a tiny little Drupal 8 where all your files are in a folder, so some extra coding to fetch/save/cache files from remotes before running text extraction was needed. Full text search works quite well, honestly!
     Drupal 8 and 9 have a bug where you can not delete files that are not owned by you (even if you are a semi-god admin user)!
     In general, debugging large ingests is complex, so more verbose commands and more bash tools are needed and in the works.
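
Back to the lessons from step 1: this is roughly the kind of keyword-to-Wikidata lookup involved. A minimal sketch hitting Wikidata's public wbsearchentities API directly with curl (the actual console app goes through Archipelago's own LoD endpoints and caches the responses, so treat this purely as an illustration):

# Look up a label against Wikidata's public search API (standalone illustration only)
curl -s -G 'https://www.wikidata.org/w/api.php' \
  --data-urlencode 'action=wbsearchentities' \
  --data-urlencode 'search=Distributed computation: the new wave of synthetic biology devices' \
  --data-urlencode 'language=en' \
  --data-urlencode 'format=json' \
  --data-urlencode 'limit=5'
# The response lists candidate Q items (id, label, description, concepturi),
# or an empty "search" array when there is no match.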
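
And for the date lesson, a minimal shell sketch of the three representations. The real normalization happens inside the Symfony console app; GNU date is used here only to illustrate the idea, including the negative timestamp for pre-1970 dates:

# Hypothetical source date taken from the spreadsheet
SRC="1603-03-24"
date -u -d "$SRC" '+%B %e, %Y'                 # human friendly, still computable: "March 24, 1603"
date -u -d "$SRC" '+%Y-%m-%dT%H:%M:%S+00:00'   # full ISO 8601 to the second
date -u -d "$SRC" '+%s'                        # Unix timestamp: negative for anything before 1970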
   
Once the tech part was done, exposing these assets in cool ways is my next challenge. I did some pre-research, so IR assets have 2 types and also an extra educationalUse property (love semantics), so we can actually do some cool data-driven displays. But when you mix in the same object a MIDI sound, a PPT and a PDF, which one wins and why becomes challenging too. Working on ways people can further fine-tune via the UI how a certain display (e.g. a single one made for "article") can adapt based on the data it is given.

Conclusion: again, good, well-written metadata was the star of the show (schema.org, my old friend), and not having to wait for derivatives was great! Small footprint (and long emails, sorry).

Do you have any interesting IR scenarios / needs / ideas you would like to share, specifically about large data cleanups/ingests? Or questions?

FYI: Once we are done with this, and when/if our friends feel ready to share, we will post a link here. Don't expect fanciness (yet), just honest, well-done repo-ying. All of this is learning (I'm doing a 5th ingest now); we will make tools out of this to lower the barrier.

Cheers


Diego








 

Diego Pino

Aug 14, 2020, 10:07:29 AM
to archipela...@googlegroups.com
Continuing this thread of passing along ideas, experience and almost cookbook-like sharing:

Another thing I learned this week is that our strategy of letting IIIF deal with almost everything is great, but it requires some knowledge of what you are asking IIIF to do.
E.g. Cantaloupe has huge capabilities, can read a lot of formats/encodings, and if one is a coding superstar one can even add extra processors (not me), etc. But still, as cool as it is, it can not deal with everything. Yes, not everything. (Why do we even have so many formats?)

Let's speak about popular JPEGs (yes, old 8bit, poorly encoded, not loved by anyone really) 

If you have been playing with Archipelago you will have noticed that we do something few other repos do, and some people could even find silly: we get you a PRONOM ID (https://www.nationalarchives.gov.uk) for every file that is attached to an ADO (Archipelago Digital Object), something like this:
 "sequence": 2,
 
"flv:pronom": {
       
"label": "JPEG File Interchange Format",
       
"mimetype": "image\/jpeg",
       
"pronom_id": "info:pronom\/fmt\/44",
       
"detection_type": "signature"
       
},
 
"dr:mimetype": "image\/jpeg",
Ok. So cool. Why is that important? Well, it happens that most people/repos/systems are overconfident about another oldie, the mimetype: a variety of JPEG files share the same mimetype but are totally, absolutely different beasts! So info:pronom/fmt/645 is a JPEG too, and fmt/41 too. But both are RAW. Like RAW in more than 8 bits, no compression, full capture of a camera sensor type of RAW. And those are not readable by Cantaloupe (and I tried all my superpowers). So if you pass your super IIIF Twig template and let Archipelago build that dream manifest, you will get broken images.
But wait, no need to let things fail and look like you know nothing about digital formats. In Archipelago we can index any JSON key into Solr and filter against it. How does that happen?

Simple. JSON KEY NAME providers. They are how we tell Drupal/Solr/the world how to find data in our deeply nested JSON and put it into flat, simple Drupal fields to be indexed.

[Screenshot attachment: Screen Shot 2020-08-14 at 09.57.26.png]


So at /admin/structure/strawberry_keynameprovider you use the JMESPATH provider plugin (magic querying power), you save, add this new magic field exposed as "as_image_flv_pronom_pronom_id" to your Solr index, reindex and BUM (or tadá): you can now filter images by PRONOM ID, and fetch, display, group, highlight (or put in a carousel... I'm so 90's) exactly what can be shown. So wait... can we also do the same for widths and heights? Yes, we got that. And have a slider that filters assets by dimensions? WOW. And by colors... wait for 1.0.0!
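
For reference, the JMESPath expression behind a field like "as_image_flv_pronom_pronom_id" is short. Assuming the per-image technical metadata lives under an "as:image" key with one entry per file (my reading of the structure shown above, so double-check against your own RAW JSON), something along these lines does the job in the plugin settings:

"as:image".*."flv:pronom".pronom_id

That projects over every entry under "as:image" and returns the list of PRONOM IDs, which is what would end up in the Solr field.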

Will start building some collective-knowledge tutorials somewhere to share, including implementations of this (like a view using Bootstrap cards I built last night).

Hope someone finds this useful. It's not like the invention of black ink (that was a good one), but it is still good to know I had no need to ask a developer, import 5 YAML files, change core and reinstall everything to get this done. Archipelago keeps making me smile.

Have a good day

Diego




cshl.l...@gmail.com

Aug 17, 2020, 9:49:55 AM
to archipelago commons
Diego,

Not coming from the Drupal world, I am not seeing how Twig templates, Symfony, Drush, the LoD endpoint resolver, PRONOM and the JMESPATH provider plugin all contribute to performing a batch ingest. Putting all of that aside for now, a really simple working example of how to batch import data would really help my understanding.

If I have a CSV file with the following 5 columns: Collection, Title, Author, Keyword, File path to image file, how can I batch ingest this into Archipelago using some basic curl commands?

What API endpoint would I need to call? You said that this needs to happen within Docker, but which container, and how do I get to it? How does one use Archipelago as a LoD endpoint resolver to obtain a correct keyword?

I see there is an "Edit content as Raw JSON Metadata" option in the admin interface, but that JSON does not seem like something I would want to generate externally to Archipelago.

A simple example would go a long way.

Thanks

-Tom


PS - With regards to your comment about: "Did you know unix timestamps for very old things are negative?",  I take offense if you are putting everyone born before 1 Jan 1970 in the category of very old things. :-)

Diego Pino

Aug 17, 2020, 11:18:33 AM
to archipelago commons
Dear Tom, 

Oh, no! No offense intended. I found the timestamp issue while testing some 1603-and-older manuscripts (see also https://en.wikipedia.org/wiki/Year_2038_problem), so by "things" I meant the web-semantic concept of a "thing", and by pretty old, a few centuries old. I apologize deeply; anyone able to read and/or respond to one of my boring posts is always assumed to be young in spirit and for sure a more finely defined OWL class (with additional properties like isPatient:TRUE) than a simple "thing". Will make sure I never do this again, especially being an old thing myself; I do not need anything else reminding me how old I am (and also, because of that, how little I respect JavaScript as a language).

Now to batch ingesting. Thanks, good call. Yes, sadly these details sometimes get lost in translation (mea culpa); this post mostly reported on experimental workflows that are not yet documented or in public code, but soon will be. That said, you can do a simple API ingest via Drush to test things out right now, and I will do this in two parts: I will explain here how that works with a full example, and I will also create the first part of the documentation.


GLOSSARY:

Symfony: the core framework used to code Drupal, but also many, many other PHP apps. It's a framework because it abstracts away things I would need ages to code, by providing a routing system, an event system, a kernel, service containers, etc. There is another of these frameworks named Laravel, but we use Symfony. Drupal 8 runs Symfony 3; we run a mix of that and 4, and the code I wrote for the IR ingest is written in Symfony 5. One of the nice things is that it also allows you to write terminal tools (console commands), and I did that here.

Twig templates: made to take data, mix it with modifiers and HTML, and render dynamic webpages. Core to Drupal 8 and 9, and the essence of Archipelago, where we allow anything (data/metadata/endpoints) to be parsed and processed this way and exposed via the UI. So your RAW JSON (the one you don't want to write manually, and I agree with you there to about 70%) passes through one of the many Twig templates we provide (see admin/content: the "Metadata displays" are Twig templates). In my post here I used a Twig template to shape the data coming from a CSV into that RAW JSON pre-ingest. It allows me to iterate over many values in a single cell of a CSV, normalize case, clean and trim, etc. Twig can be used (traditionally) to output HTML, XHTML and XML (like we did for Islandora 7 in the already famous Multi Importer) but also JSON.

LoD endpoint resolvers: Archipelago exposes a few endpoints where you can pass a string as an argument (via curl, e.g.) and get JSON back with 5 possible responses from either LoC (all the endpoints there, quite cool), WIKIDATA, and also Getty AAT. More to come if anyone wants to help me. These endpoints are public and are used by the autocompletes when you are filling out a webform. I reuse them when dealing with CSV data, since most people do not do the work of manually matching concepts/subjects to URIs.

Drush is a way of executing Drupal functionality directly from a terminal (no UI). Drush is sensitive to the folder you are in to decide what can be run (context), and that can be quite important in a multisite environment like play.archipelago.nyc, which runs 4 archipelagos; it can also take arguments where you specify the URL of your site and which user is running the command. Many commands are 1:1 with things you can do via the Drupal UI; others are specific to a certain use case.

Where does Drush live in Archipelago? Inside the esmero-php docker container. (If you look at the installation instructions here https://github.com/esmero/archipelago-deployment/blob/8.x-1.0-beta3/docs/ubuntu.md#step-3-deploy-drupal-892-and-the-awesome-archipelago-modules you will notice that we call a drush command there to install Drupal!)

So, a basic Drush command is this (clearing your Archipelago caches from your Unix/GNU terminal):

docker exec -ti esmero-php bash -c "drush cr"

That will:
1.- Open a bash session in the esmero-php container.
2.- Your starting folder will be /var/www/html (where, internally, your https://github.com/esmero/archipelago-deployment clone is to be found).
3.- Run an executable (via the -c flag) named drush cache:rebuild, using its alias (cr); see https://drushcommands.com/drush-9x/cache/cache:rebuild/
4.- Ask you which cache you want to rebuild (a number).
5.- Finish and return to your host terminal session (abandoning the container's bash session).

Archipelago ships the latest Drush code (Drush 10):

docker exec -ti esmero-php bash -c "drush --version"

Drush Commandline Tool 10.1.1



Ok. Enough of context (we can go back to that anytime)


EXAMPLE INGEST (ONE OBJECT, 2 FILES)


Simple examples are shipped in the beta3 release, so you want to familiarize yourself with those first. The Drush command we added is also in the beta3 release, so make sure you are actually running webform_strawberryfield, format_strawberryfield and strawberryfield from the 8.x-1.0-beta3 branch and that everything is updated (recently).


https://github.com/esmero/archipelago-deployment/blob/8.x-1.0-beta3/docs/democontent.md (please read, don't skip)


The gist is that when you follow the instructions there, you clone inside your d8content folder a repo containing folders, .json files and binary files. Then you call this script I wrote, which for every .json file uses our Drush command to, in a single step, upload the files found in the matching folder, get their UUIDs, modify the JSON to connect the files to the metadata, and ingest the object:


https://github.com/esmero/archipelago-recyclables/blob/edge/deploy_ados.sh
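
The shape of that driver is roughly the following; a minimal sketch, not the actual deploy_ados.sh, assuming (as in the demo content) that each .json file and its folder of files share the same UUID as their name. The drush command it calls is dissected right below:

#!/bin/bash
# Minimal sketch: for every ADO json, call archipelago:jsonapi-ingest with the
# folder of the same name as the file source.
ADO_DIR=/var/www/html/d8content/archipelago-recyclables/ado
for json in "$ADO_DIR"/*.json; do
  uuid=$(basename "$json" .json)
  drush archipelago:jsonapi-ingest "$json" \
    --uuid="$uuid" \
    --bundle=digital_object \
    --uri=http://esmero-web \
    --files="$ADO_DIR/$uuid" \
    --user=jsonapi --password=jsonapi \
    --moderation_state=published
done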


Going deeper there


This is one of the commands called there


drush archipelago:jsonapi-ingest /var/www/html/d8content/archipelago-recyclables/ado/0c2dc01a-7dc2-48a9-b4fd-3f82331ec803.json --uuid=0c2dc01a-7dc2-48a9-b4fd-3f82331ec803 --bundle=digital_object --uri=http://esmero-web --files=/var/www/html/d8content/archipelago-recyclables/ado/0c2dc01a-7dc2-48a9-b4fd-3f82331ec803 --user=jsonapi --password=jsonapi --moderation_state=published;


Let's digest this

drush archipelago:jsonapi-ingest // The command


/var/www/html/d8content/archipelago-recyclables/ado/0c2dc01a-7dc2-48a9-b4fd-3f82331ec803.json // The location inside the docker container (esmero-php) of the json file. That same location from the outside (host computer) is simply the d8content/ folder (do an ls to confirm)


--uuid=0c2dc01a-7dc2-48a9-b4fd-3f82331ec803 // A UUID. I'm passing this to avoid double ingests: if you try to ingest the same object with the same UUID, it will not let you. If you don't pass one and run the command twice, you will get the same object twice.

Question: how do I generate a UUID? Good question. So many ways; we are using UUID v4, so you can go to a (free) online generator or run a Drush command.


Try it:

docker exec -ti esmero-php bash -c "drush uuid"     
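
If you prefer to generate one outside the container, uuidgen (shipped with util-linux on Linux and with macOS) typically produces a random v4 UUID by default:

uuidgen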


--bundle=digital_object --uri=http://esmero-web // These won't change unless you created custom content types, are ingesting collections, or have a different name for the Docker web container.


--files=/var/www/html/d8content/archipelago-recyclables/ado/0c2dc01a-7dc2-48a9-b4fd-3f82331ec803  // The location of a folder with any (for real) type of files you want to attach to the JSON and thus to the object. Those will be uploaded first by the drush command and later classified, exif-ied, pronom-ied, checksum-ied and persisted to S3 by strawberryfield. I patched the module recently to allow files with spaces and weird names; if you are running an older version, please use file names without spaces (please).


--user=jsonapi --password=thepassword // Your default JSON API credentials. Secure your JSON API after calling this, or protect it behind a firewall, etc.


--moderation_state=published // The moderation state. Leave it empty to ingest as a draft. If the content model is custom and not moderated, Archipelago will just skip this.


So, what is inside one of those JSON files? In other words, what do you need to ingest your first object via the API using ONE row of your CSV? (I will not go into the full CSV here yet; let's start with a single object first.)


See here: https://github.com/esmero/archipelago-recyclables/blob/edge/ado/f4a4c6ee-4ce9-4b4c-8704-e8057bad0a7d.json#L1

For that same file I will mark in RED only the things that are REALLY required, and in green the ones you already have in your CSV (meaning if they are really required/quite recommended and also in your CSV, red wins; if the colors do not survive here, the "Gist" below lists the required keys). Please read the footnote about "ismemberof":



{
    "type": "Panorama",
    "label": "Strawberry Field at Thorpes Organic Family Farm",
    "owner": "ESIE (Empire State Immersive Experiences)",
    "audios": [],
    "images": [],
    "models": [],
    "videos": [],
    "warcs": [],
    "creator": "Lund, Allison",
    "documents": [],
    "edm_agent": [],
    "ismemberof": null,
    "description": "Strawberry field at Thorpes Organic Family Farm in East Aurora, NY. Image depicts late-season strawberry \"u-pick\" fields on a late July evening, 2020.",
    "subject_loc": [
        {
            "uri": "http:\/\/id.loc.gov\/authorities\/subjects\/sh85128547",
            "label": "Strawberries"
        },
        {
            "uri": "http:\/\/id.loc.gov\/authorities\/subjects\/sh2010104552",
            "label": "Organic farming--United States"
        }
    ],
    "website_url": "",
    "as:generator": {
        "type": "Update",
        "actor": {
            "url": "https:\/\/play.archipelago.nyc\/form\/descriptive-metadata",
            "name": "descriptive_metadata",
            "type": "Service"
        },
        "endTime": "2020-07-10T14:01:08-04:00",
        "summary": "Generator",
        "@context": "https:\/\/www.w3.org\/ns\/activitystreams"
    },
    "date_published": "2020-07-02",
    "term_aat_getty": null,
    "ap:entitymapping": {
        "entity:file": [
            "images",
            "documents",
            "audios",
            "videos",
            "models",
            "warcs"
        ]
    },
    "local_identifier": "",
    "subject_wikidata": [
        {
            "uri": "http:\/\/www.wikidata.org\/entity\/Q745",
            "label": "Fragaria"
        },
        {
            "uri": "http:\/\/www.wikidata.org\/entity\/Q165647",
            "label": "organic agriculture"
        }
    ],
    "geographic_location": {
        "lat": "42.755309521944",
        "lng": "-78.509831946323",
        "city": "Wales Town",
        "state": "New York",
        "value": "12866, Strykersville Road, Wales Hollow, Wales Town, Erie, New York, 14052, United States of America",
        "county": "Erie",
        "osm_id": "337075304",
        "country": "United States of America",
        "category": "place",
        "locality": "Wales Hollow",
        "osm_type": "way",
        "postcode": "14052",
        "country_code": "us",
        "display_name": "12866, Strykersville Road, Wales Hollow, Wales Town, Erie, New York, 14052, United States of America",
        "neighbourhood": "",
        "state_district": ""
    },
    "strawberry_field_widget_id": "descriptive_metadata"
}


Gist: 

- Basically, create a valid JSON file (you can use Atom, a script, Python, PHP, TextMate, Oxygen XML Editor, Apple's property list editor, etc.; all of those can validate JSON).

You want to have the following:

- "type" key which triggers different view modes in your archipelago, but if you omit it i "think" all can be also fine (crossing fingers)

- "label" is more than important, its the title of your object. If you omit it archipelago will give you one quite silly one.

- "keywords". You can use the structure used by LoC and omit the URIS if you don't know them. Or the one by WIKIDATA, etc. or even put them as a simple list on some other key, lets name it keywords like this

"keywords": [

"super", "duper","keyword"],

I feel that is one of the MANY beautiful things of Archipelago: put your data where it makes more sense to you, be consistent, modify webforms so they can read from there, or experiment with "Edit content as Raw JSON Metadata" (another tutorial, but really intuitive), and then make sure those values appear later via the Twig templates in your MODS, Dublin Core, schema.org, etc. Iterate, refine, move forward. Disclaimer: I KNOW we don't have all the intuitive tools around yet to make this the best-ever metadata editing platform. But we will; I'm working hard and we will.

- Lastly, the audios, images, etc. empty keys and the special "ap:entitymapping": { "entity:file": [...] } structure, which includes that same list of audios, images, etc. That is an Archipelagism. Actually, anything prefixed as:, ap: or flv: is a key that is either created by us during ingest or used as a hint for something else. In this case this one is important. Why? When uploading the files you provide in your drush command as a folder, Archipelago will calculate the MIME type and (like a media router) put each file in a key named after the first part of its MIME type, pluralized (with a little bit of imagination, because I do rename application to document). So an image/jp2 will end up in images. But that is only a first pass; then the "ap:entitymapping": { "entity:file": ... } structure tells strawberryfield to resolve all those keys as files (you can add a key named "entity:node" and it will treat it as a connection to another digital object), and by doing so will trigger checksums, EXIF extraction, persistence, etc.
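
Putting the required bits above together, a bare-bones starting JSON could look something like this (a minimal sketch; "Document" is just a placeholder type, and the demo file linked above remains the authoritative example):

{
    "type": "Document",
    "label": "My first API-ingested object",
    "keywords": ["super", "duper", "keyword"],
    "audios": [],
    "documents": [],
    "images": [],
    "models": [],
    "videos": [],
    "warcs": [],
    "ap:entitymapping": {
        "entity:file": [
            "images",
            "documents",
            "audios",
            "videos",
            "models",
            "warcs"
        ]
    }
}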


Sorry for the long explanation, but I feel it is needed: the background behind all of this and why things happen. So, the next steps are:

- Please create a JSON file and add a folder with your files inside the docker container (I will also allow external files eventually, no worries).

- Try an ingest yourself first. Come back here and let's share what you did/ the JSON.


Once that has happened, we can move to what is coming soon (code started, quite advanced already): a full UI-driven ingest mechanism, following the popularity of our Islandora Multi Importer. But this is an important first step. Know your data, love your archipelago (old proverb, from negative Unix timestamp times).


Does this help in any way? Please let me know (other than writing docs for this, which is starting now) how I could make this easier for you.


Best


Diego

cshl.l...@gmail.com

Aug 24, 2020, 2:50:29 PM
to archipelago commons
Dear Diego,

Your reply was very useful. I was able to make a copy of d8content, rename the directory to mycontent, and slightly manipulate the json files. I then copied the new mycontent directory to the docker container and started up the ingest process, using the following commands:
   docker cp mycontent esmero-php:/var/www/html
   docker exec -ti esmero-php bash -c 'mycontent/archipelago-recyclables/deploy_ados.sh'
As long as I assigned a new UUID for each ingest, everything went smoothly. I was surprised that the json did not explicitly reference any .jpg/.pdf/.mp4 file, but automatically processed all of the files in the media directory for that UUID. At least for the book, the system sorts the pages by filename (which is reasonable). I am not sure, if I have a full video and a trailer, how the json would distinguish between the two .mp4 files.

Before any of the digital objects are uploaded, I need to create a hierarchy of Digital Object Collections where the ADOs can be placed. How does one use "ismemberof"? Since I am just running a remote script, I am not getting the collection ID back from the script. Do I use the UUID of the ADO Collection in the "ismemberof" field of the item? You referenced a footnote to "ismemberof" above in your answer which might answer the question, but I have not been able to track that footnote down.

Now for a more interesting question. Let's say I have collections from a number of different archival systems that I want all imported into Archipelago. Let's say the first one is Digitool. Do I try to recreate all of the metadata that is in Digitool, and have digitool-tiff, digitool-pdf, digitool-jpg and digitool-ead json formats, so that all of the metadata is captured? Then would I have to create Twig templates for each Digitool json format? I would then also do the same for some collections in Omeka, so there might be an omeka-pdf, omeka-jpg and omeka-jp2k, which have more of a Dublin Core set of metadata. And finally, I would want to upload the structure of some collections contained within ArchivesSpace, so that I can then subsequently upload the actual digitized content to Archipelago. So I might also have an archivesspace-element json format.

Should I try to keep the metadata for all of these formats/sources as close to the original as possible using the above approach, or should I be trying to normalize the metadata into a "native ADO" json, which will work with the original Twig templates? A third approach might be to use OAI-PMH to extract and import the metadata. Any and all insights appreciated.

Thanks

-Tom

Diego Pino

Aug 26, 2020, 12:34:34 PM
to archipelago commons
Hi Tom, sorry for the late response. Great you got it working!

Will reply in more detail tonight or tomorrow morning (just a little overcommitted, but no later than tomorrow, I promise) and also explain in more detail how the files v/s metadata connection works (and also why you need, or might not need, a UUID). I also feel you would make a great test user for the new UI-based ingest module (AMI) I'm writing, which should make some of this less Unix (nothing wrong with Unix) and more UX aware. It's spreadsheet-based but coded in a way that it can be expanded to other APIs, remote sources, etc., so I will explain that in more detail in another thread when it's ready for public consumption; it intersects with your question about multiple sources (e.g. migration from Digitool).

I appreciate your reply, your testing time and also that you are letting us all know it worked.

Best
Diego

Diego Pino

Aug 27, 2020, 1:52:50 PM
to archipelago commons
Hi Tom, replying in between the lines.


On Monday, August 24, 2020 at 2:50:29 PM UTC-4, cshl.l...@gmail.com wrote:
Dear Diego, 

Your reply was very useful. I was able to make a copy of d8content, renaming the directory to mycontent, slightly manipulate the json files. I then copied the new mycontent directory to the docker container and started up ingest process, using the following commands:
   docker cp mycontent esmero-php:/var/www/html
   docker exec -ti esmero-php bash -c 'mycontent/archipelago-recyclables/deploy_ados.sh'
 Great!  

Just as long as I was able to assign a new UUID, for each ingest, everything went smoothly. I was surprised that the json did not explicitly reference any .jpg/.pdf/.mp4 file, but automatically processes all of the files in the media directory for that UUID. At least for the book, the system sorts the pages by the filename (which is reasonable). 

Giving the JSON and the corresponding folder a UUID as a name is a good practice, because you can modify the script to use that UUID as an argument and only ingest a single time. Archipelago objects can carry any binary payload (like a lot of files), but the key there are the keys I marked in red in my long previous post. Basically, Archipelago is like a mail sorter (like those USPS has and someone was trying to get rid of...): for every file found in the folder you pass, it will try to route it to the JSON key that matches the first part of its MIME type. So it classifies, routes/distributes their Drupal file IDs (integers) to different places in the JSON, orders them (as you correctly noticed) by giving them a sequence number based on file name and another quite simple but clever algorithm, and creates the extra supporting structures (all the ones that start with as:images, etc.).
 Not sure if I have a full video and a trailer, how the json would distinguish between the two .mp4 files.

It won't; it will just classify both and, based on your numbering (001, 002 or 1, 2 or a few more, using a natural language ordering strategy), add sequence numbers. Then the viewer (the Video formatter you apply for that type of object/display mode) can be set up to use one or the other. You will also see that the JSON includes a tags:[] key. We can (and I will add that code during this release process) use that as a way of letting the formatter (e.g. the Video one) play only one tag, or all, etc. But there is more. Probably your trailer and full video will have different encodings, which also means different PRONOM IDs, even if both have the same mime type and the same extension. PRONOM IDs become your most trustworthy source for deciding on and excluding files that are only meant for preservation or just download. Video is a good use case, and I will make a demo asset on play.archipelago.nyc to showcase it, including realtime thumbnails, etc.
  

Before any of the digital objects are uploaded, need to create a hierarchy of Digital Objects Collections, where the ADO are able to be placed. How does one use "ismemberof"? Since I am just running a remote script, I am not getting the collection ID, back from the script. Do I use the UUID for the ADO Collection, in the "ismemberof" field of the item? You referenced a footnote to the "ismemberof" above in your answer which might answer the question, but I have not been able to track that footnote down.

Good question: ismemberof (or ispartof, or friendof, or depicts) is fed by a webform element with an autocomplete for other Digital Objects (it can be any object, or a more controlled set, like only objects that have typeof: "Magazine" if the object being edited is of type "Issue", or stuff like that). When used with the webform it will put there the internal Drupal NODE ID of the "thing" you autocompleted. Specifically, in the demo deployment ismemberof is set up to only show Collections, because it's a common use case. So if you (as we speak, there are more UX-driven solutions in the works) want to start adding objects to collections, I would first ingest only the collections, and since collections tend to be like 1:10000 in number, I would keep track of their node IDs (they can be seen when pressing the Edit tab on an ADO, or pressing the Devel tab and looking there for what I marked in red):
Drupal\node\Entity\Node { ▼
  +in_preview: null
  #values: array:21 [▼
    "nid" => array:1 [▼
      "x-default" => "2003"
    ]
Then, in the JSON files you have on the server, put that into the JSON: "ismemberof": 2003. Member of multiple collections? "ismemberof": [2003, 2004].

WIP: I do not like these node IDs. Drupal is plagued by them, and since they are autogenerated, on a large migration out (let's say in 5 years you want to use the new thing that someone invented (hopefully us too, but honestly anything; that is our main promise as a system/community) and you want to take all your data with you) those IDs are a Drupalism and make no sense outside of your current context. So I have been working for a few weeks already on a UUID resolver for those values. The issue is not really putting the UUID there, but allowing Drupal Views (the ones that list objects for a collection) to use them to connect objects when querying Solr. The idea is that you can just put a UUID there and the webform element will transform it on the fly to the internal ID (because Drupal does not know any different), and the Solr index will also expand the UUID into an ID and a fully connected entity.

Now for a more interesting question. Let's say I have collections from a number of different Archival systems, that I want all imported into Archipelago.
Well, that makes me happy and makes us want to help you. Migrations. Always such a delight!
 
Let's say the first one is Digitool.
Oh, lovely Digitool
 
Do I try to recreate all of the Metadata that is in Digitool, and have a digitool-tiff, digitool-pdf, digitool-jpg and a digitool-ead json formats, so that all of the metadata is captured. Then would I have to create twigs templates for each digitool json formats.
Let's start with your JSON data (I assume Digitool gives you that; if not, we can also use XML directly). If you want to keep your ingested data 1:1 with what you have, then yes, you need to at least modify the Object Description Twig template so you can show it/cast it to the world, plus the IIIF ones to fetch the descriptive metadata values that you want to show (access conditions, etc.), and you should at least (!) add a few extra JSON keys:
- label
- type
-  "ap:entitymapping": {
        "entity:file": [
            "images",
            "documents",
            "audios",
            "videos",
            "models",
            "warcs"
        ]
    },

and
- "images"
- "documents"
- "audios"
- "videos"
- "models"
- "warcs"
(PS: if you are going for this, I could make sure you don't need to; I could fill in the missing pieces automatically with some extra code... a good time to open a GitHub account and/or watch/add issues to https://github.com/esmero/strawberryfield/issues)
But there is more. You could also want to clean up your data before ingesting. For that you could use that other piece of code I wrote and mentioned in my first post here (still an internal app, but I can share it) to conform to a given common shape.
Or maybe you do not want to do any work upfront, and you want to see everything inside the system first, just metadata. Totally fine. Then you can use this, once merged (tested today and it made me happy!): https://github.com/esmero/strawberryfield/issues/101, to patch your data, copy JSON from one place to another, etc.
And there are the in-betweens, like generating LARGE CSV files and using the same tool I mentioned, or AMI, which provides a UI (also WIP; I move as fast as I can, but that is so, so close).
For files: if you ingest everything together (one .json file and a folder), only use your master files; e.g. if you have both a JPEG and a TIFF of the same image, go for the TIFF.
 
I would then also do the same for some collections in Omeka, so there might be a omeka-pdf, omeka-jpg and omeka-jp2k, which have more of a Dublin Core set of metadata. And finally, I would want to upload the structure of some collections contained within ArchivesSpace, so that I can then subsequently upload actual digitized content to Archipelago. So I might also have an archivesspace-element json format.

Almost the same process. Either keep each individuality in place (you can even add an extra JSON key named "system_i_moved_from" to tag them and identify them inside the server) and then clean, refine, and even create webforms that match just those formats while you figure things out. I recently created a new webform element: the strawberry transplanter (yeah... so many dad jokes/references to strawberries, I know). This webform element can take values from one structure and put them into another using a template, so basically you can also interactively move data between schemas. E.g.:
Instead of putting all the data coming from ArchivesSpace as JSON directly at the root of the document, add a key named
as:import or simply importeddata; you decide, just be consistent (ah... also write docs!).

So it would look like:
{
    "type": "thing",
    "label": "Original super long and perfectly chosen title",
    "importeddata": {
        ... your other full JSON here from ArchivesSpace ...
    },
    ...
}

So now you have the chance of A) keeping all the old data but building Twig templates (or adapting the ones we provide) that are NOT aware of anything inside "importeddata". As you start analysing your data, grouping it, even indexing it into Solr so you can search what is inside importeddata, you can create new JSON Patch actions that move data from one place to another in batch, or simply use this webform element to do that manually. (I actually like this idea; we are doing that for large EAD v2 imports.)

 

Should I try to keep the metadata between all of these formats/sources as close to the original as possible using the above approach
My metadata person says yes. Keep as much of your provenance, origin and intent as possible (even the email of whoever cataloged it; you don't want to lose contact and the ability to ask questions of those professionals). You can of course always clean up too. Many, many of those other sources won't have any linked data or any of the deeper nested data we can deal with, so you will want to reconcile against WIKIDATA, e.g. But it is also a chance to clean up what I name (lie, I just made up that name as I wrote this) "learned restrictions": basically things you were forced to put there, flatten, or reduce to meet the demands and workflows imposed by each tool.
 
, or should I be trying to normalize the metadata into a "native ADO" json, which will work with the original twigs templates. A third approach, might be to OAI-PMH to extract and import the metadata. Any and all insights appreciated.

Also an option: I wish Allison Lund could give us a hand here. A native ADO json would basically be what comes out of your perfect and ideal input webform. You build that; you create a full, perfect exemplar of what your archivists, catalogers and users could need. You experiment with the best displays, best linkings, etc. Once that is done, people are happy, and you like what you see and how it all looks, you JSON Patch your current data into this new shape. (Should I do a tutorial on this? JSON Patch and JSON Diff are great tools for it, but also some pretty well-formed Twig templates can take one document and put the values of the other into the frame; in JSON-LD this is named Framing.) It's like building your perfect house: leave room for your old things, and also build a larger kitchen. Once it's ready, you move your old furniture. Some of it may need to be modified, some may not be needed, but if the house is big enough, it will fit.
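
For the JSON Patch route, a tiny RFC 6902 sketch of what "move data out of importeddata into the native shape" can look like; the key names here are made up for illustration:

[
    { "op": "copy", "from": "/importeddata/title", "path": "/label" },
    { "op": "copy", "from": "/importeddata/subjects", "path": "/subject_loc" },
    { "op": "add", "path": "/type", "value": "Article" }
]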

OAI-PMH is also an option. We used to do OAI to CSV and then ingest (to enrich, normalize) but the CSV lacks a lot of the native data your other systems will have.

Question: would a session (2-3 hours; it takes that long) on migrating from many, many sources to Archipelago be of interest to people here? Like end of September? It could also cover use cases I have not considered but could create code to support.

Hope this helps a little more. I feel the idea of building your dream house (with a nice pool, not too large so cleaning does not become a thing) before moving (so it also informs the move) is my first inclination, since here at least you are not restricted by the platform to, e.g., only use subjects coming from LoC, or to have only one image per object.
 
Thanks


Thanks to you. Please follow up if this made no sense or if I left too many questions unanswered. Also, feel free to share source examples from any of your systems. We would love to help here.
 
-Tom
