Ode to the in-betweens, the invisible paths through the meadow (1.0.0 Release + Notes)


dp...@metro.org

Aug 15, 2022, 7:22:26 PM
to archipelago commons

Ode to the in-betweens, the invisible paths through the meadow

Archipelago, the humble tiny dream, the initial impossible thought turned into a colorful idea, the careful act of choosing, curating, planting seeds, nurturing and watering, waiting and contemplating the seasons come and go, the software that became a tiny garden, a communal space where many of you came and went and returned to stay a little bit longer, the idea turned into a community that adopted (the software) the space, hopes and care, is flowering once again in the form of a release and you are welcome to sit, read, reflect, enjoy, watch the birds (or be them), inhale deep and let the wind build a wild mix of colors and perfumes on a long hot summer evening.

The semantics of the versioning of this release (1.0.0) are confusing. Building software the way we think of software is tricky. We decided not to release often because we believe that users and we ourselves don’t deserve the pain of upgrades, so every release packs a lot. Also, we had an initial goal: some daisies, dandelions, beans and pumpkin plants needed to be present; it is not fair to have a 2.0 for a patch of dirt with a promise. Note: If this is already too much, jump to the release notes at the end (but really, keep reading).

At this point, it may feel that all previous versions and the past 3 1/2 years of public code (e.g. RC2 or Beta3 or even alphas, or some better known by their nicknames, “a Cat's Pajamas” or “All grand beauties withhold their deepest secrets”) were, semantically, continuous loops (a rehearsal) around a circularly laid path (your needs, our sometimes shared goals) with the aim of learning, revisiting each step, avoiding larger stones and having a clear start and end you can walk safely over and over (a circle of refinement). But to be honest (as someone who has been walking the main road laid in front here, filling up a hole here and there, but mostly visiting the side tracks and invisible paths between the bushes while doing so) it was never about the path; I say this in the deeper sense of two dots that connect. It was always (or became) about the slowly, patiently contemplated displacement. It is now and then about the surroundings, the changing seasons, the evolving and devolving of the landscape and mostly about the people (you, them) that we met, the shared times, the learning, narratives and tales and space, while laying new reasons to advance and finding old reasons to stay behind. The past and present tense(s) are all entangled.

Said differently, 1.0.0 might as well be version 5.7.2. Even if it is more than acceptable and desirable to have a concrete, clear goal here (CS people will agree/enforce this, and everyone trusts full versions and defined roadmaps, feature lists, in software, way more, anyway) of what this Software and Community can/wants to be and where in the grand (or small) scheme of things a Repository system even falls into, what we all have built as a community is not a technologically imposed thing where to put/extract/safely meta(meta(data)) and media -only- but a safe space to build on concretely. A platform upon which to set down, brick by brick, your needs and ideas that support your uses, that represent your core daily work efforts, a space to keep and share histories (your objects), ways, workflows, relationships that are the real gold (the yellow flowers in this meadow), your efforts and evolving needs and wishes. This is made possible, not by providing the what purely, but the hows and whys, the ability to put new and old ways in place without breaking the fragile balance of what was already done, the road (sunset, sunrise?) left behind, that is still so treasured and valid.

So. 1.0.0, “shimmering at the meadow's edge” [1], is somewhere in between a future path and the offset edge of a well walked one. And it is a happy place. We are privileged and proud (and lucky) that we have not left anyone behind and that we could stay true to our initial ideas. We have made more tools/tooling and made the existing ones more colorful (fast too) and reliable. We have created things that you would not find anywhere else, not just because we can, but because we think you might need them. We have put reflections, self-criticism, external criticism, deep care for details, time, efforts, inline comments, hundreds of slack messages and strong cups of coffee and tea into this software; and open minds and hearts into the act of community nurturing. We ensured that all that you have/had/knew is still valid and, in a huge effort at being (finally, Diego!) super concrete here, did so without introducing any deprecation or changes that could affect your precious (meta)data, the fruit of your efforts. All this means anything ingested on day one (Sep 18, 2018, when nobody knew we were even planning this) is as valid, as immutable and as plastic, portable and yours as ever and it will keep being so. Also worth mentioning, Archipelagos are running in the wild (never to be domesticated) on nearly every continent, from humble to large, with simple petals and some with complex/compound blooms, using pre-conceived use cases and data models or unexpected ones. Finally, we hope that you like it (or love it, what even is the difference?) and that your work will be easier because of it.

Thanks for reading so far into these notes.

Now to the concrete:

Local Deployment Machine (default branch):

https://github.com/esmero/archipelago-deployment/tree/1.0.0

This is now a Drupal 9 only release. 

Production Deployment Machine (also the default branch):

https://github.com/esmero/archipelago-deployment-live/blob/1.0.0

New Configurations, latest everything we could test on Drupal/associated packages and libraries. Fully rebuilt Theme for Bootstrap 5 (ouch!), updated Templates for Object Description, IIIF V3 with Annotations, Better thumbnails, Updated Webforms, Views, blocks, etc. Also a lovely and soothing new color combination and fonts.

Documentation:
https://docs.archipelago.nyc


DevOps:
  • New (arm64/amd64) PHP8 multi-platform Containers with much faster (custom-built) Tesseract 5.1.2 with JP2 support, more languages, PDFAlto and better tooling

  • Redis 6.2 for caching

  • Custom built (arm64/amd64) Cantaloupe 6.0.0, now much more capable of dealing with large media

  • New (arm64/amd64) NLP Container with FastText for Language detection and more languages too

  • Latest Solr 8, Latest Databases (MYSQL/MariaDB)

Live Preview: As a test/proof of concept, https://archipelago.nyc was updated today from 1.0.0-RC1 to 1.0.0, a span of 18 months. All data works/all images work. Check it out. I did not add new content (yet) but the team will play with it during the week.

House Keeping: All 1.0.0 and 0.4.0 branches will stay open for a week (or 8 days really), in case someone finds a really breaking error/big typo in the docs or hits that edge case scenario and that new laptop catches fire. After that, 1.1.0 begins to walk on its own feet while 1.0.0 stays back on the horizon.

New Features by Module

Strawberryfield (983 additions and 60 deletions)

  • File Composting via Queues. Other modules can notify (via an event) of leftover files (tmp/garbage). A queue will recurrently check whether those files are no longer in use and clean them up. Includes time-to-live settings to allow reuse in case of recurrent ingest/delete/ingest operations.

  • Related ADOs get their Caches cleared on any change.

  • S3 File/caching service with remote Checksum validation/existence.

  • Better/Safer MediaInfo extraction for Videos/Audios

  • New Keys for Strawberry Flavors to accommodate extra NLP extracted data, including detected language and requested language

  • Better Date/EDTF/Human parsable date exposing and Indexing. Better deduplication and safer ways to deal with the worst metadata nightmare ever (dates)

  • Strawberry Flavors get correctly removed and cleaned up on ADO removal (OCR, Webpage extractions etc)

  • New automatic (and indexable) keys: Str_flatten_keys_unfiltered and str_flatten_keys. The latter now keeps track in Solr of every existing JSON key (JSON PATH) for keys that have values. This means you can facet by “Objects where the Images were captured with a camera that exposes Exposure”, or ADOs without PDFs but with DOCX files.
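
To make the flattened-keys idea a bit more tangible, here is a minimal Python sketch. It is not the module's implementation (that lives in PHP): the helper name `flatten_keys_with_values`, the dot-separated path notation and the sample metadata are all assumptions made for illustration. It collects the path of every JSON key that ends up holding a real value, which is the kind of list one could then index and facet on.

```python
from typing import Any, Set


def flatten_keys_with_values(data: Any, prefix: str = "") -> Set[str]:
    """Collect the path of every JSON key that leads to a non-empty value.

    Hypothetical helper for illustration only. The dot-separated path
    notation is an assumption, not the format actually stored in Solr.
    """
    paths: Set[str] = set()
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else key
            child_paths = flatten_keys_with_values(value, path)
            if child_paths:
                # A key counts as "having values" if anything beneath it does.
                paths.add(path)
                paths |= child_paths
    elif isinstance(data, list):
        # List items share the path of their parent key.
        for item in data:
            paths |= flatten_keys_with_values(item, prefix)
    elif data not in (None, ""):
        # A scalar with a real value: record the path that led to it.
        if prefix:
            paths.add(prefix)
    return paths


# Made-up metadata: one image with an exposure value, no documents.
sample = {
    "label": "Meadow at dusk",
    "description": "",
    "as:image": {"exif": {"ExposureTime": "1/60"}},
    "as:document": {},
}
print(sorted(flatten_keys_with_values(sample)))
```

Empty strings, nulls and empty containers are skipped on purpose in this sketch, so a facet built on the result would only ever list keys that actually carry data.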


Format_strawberryfield (158,439 additions and 630 deletions)*

  • OpenCV Webworker with Facet Detection/Contour (mid tones, Contrast)

  • For the OSD Viewer integrated with the Annotorious Annotation tool

  • Swappable Icons for OSD. Also cool tiny cute SVG icons made by us.

  • More Annotation Tools. Choice of Polygon/Square or both + advanced Polygon Tool

  • IABookreader is now IIIF v3 compatible too.

  • Citeproc Formatter and a Twig extension. Both driven by data/Twig templates, with selectable CSL Citation Formats (All of them!) and automatic JS addition

  • Copy to Clipboard JS Twig Extension 

  • EDTF Humanizer Twig extension. Pass a date. Get the human readable dates back

  • Search API Twig extension. Now you can display facets/data from a search directly in your Twig templates.

  • Improved Pannellum JS. Now, on a Panorama Tour Scene Change, other Formatters can react. E.g. a Leaflet Map can change and focus on the geographic location of the current Scene.

  • Sub Module for Rendering Maps via Views. Feeds from current GEOJSON endpoints and allows you to build thematic/searchable maps.


* Number of additions is deceiving. SVG is also code!


Webform_strawberryfield (28 additions and 19 deletions)

  • New Europeana Endpoint/API (0.10.3) for LoD

  • Better EDTF/Date elements (bug fixes too)


AMI (2,856 additions and 740 deletions)


  • More Solr Fields are combined during Harvest, new CMODELS handled.

  • Fixed Cell Size limit for Excel Files

  • Fixes a Core PHP5/6/7/8 Bug when reading/Writing CSVs with JSON data and/or Offsets (this one was hard). This also fixes Paging issues when editing LoD

  • Checkbox for reviewed LoD Data in the EDIT form

  • LOD (processed via an AMI) can be downloaded/replaced/uploaded again via a CSV

  • Better AMI Step by Step Logic (more checks) and the form now remembers decisions that are tedious to repeat (like CMODEL to Type mappings)

  • Fixes ZIP based Spreadsheet types (XLSX)

  • Better Check on ADO Delete Operations (Form)

  • Much better remote file handling and S3 (path) based ingest with HTTP based + local based Media type identification, remote check via Checksums for S3.

  • Twig Extension that can call any LoD endpoint. Means reconciliation can be done also “Live” during ingest. E.g ami_lod_reconcile(coordinates, "nominatim;thing;reverse", "en", 1)

  • Better AMI Set Preview (includes what LoD is going to be used/present)

  • New options for AMI sets that Update data. Can be full update, Replace or append. Also a failsafe “Never touch my files” option that does exactly that. You won’t be able to (by mistake or even if you intend to) remove or mess up File level metadata and media.

  • Custom Facet Processor that allows Views that do batch processing to use those as filters. Taps deeply into Facet Module/VBO module to do the “impossible” but “required”

  • Solr plugin has basically no more memory limitations. You can harvest Multiple Remote collections and get a whole-repository-in-a-spreadsheet. It uses an adaptive approach on how many remote documents to process to avoid filling up memory and can run for hours.

  • Preserve the UUID when reusing a file between batches

  • Completely revamped and fixed Webform Search and Replace on Batches. You can add/remove single entries in multivalued JSON keys. Failsafe and extra useful.
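
As a rough illustration of what adding or removing a single entry in a multivalued JSON key means, here is a small Python sketch. It is not AMI's code: the function names `add_entry` and `remove_entry` and the sample `subjects` key are hypothetical, and the sketch only shows the intended effect on the data. It works on a copy, so the original metadata is never modified, which is one simple way to keep such an operation failsafe.

```python
import copy
from typing import Any, Dict


def add_entry(metadata: Dict[str, Any], key: str, entry: Any) -> Dict[str, Any]:
    """Return a copy of metadata with entry added to key, skipping duplicates."""
    result = copy.deepcopy(metadata)
    current = result.get(key, [])
    # Normalize a single value into a list so the key becomes multivalued.
    values = current if isinstance(current, list) else [current]
    if entry not in values:
        values.append(entry)
    result[key] = values
    return result


def remove_entry(metadata: Dict[str, Any], key: str, entry: Any) -> Dict[str, Any]:
    """Return a copy of metadata with every occurrence of entry removed from key."""
    result = copy.deepcopy(metadata)
    current = result.get(key)
    # Only touch keys that really are multivalued; leave anything else alone.
    if isinstance(current, list):
        result[key] = [value for value in current if value != entry]
    return result


# Made-up record with a multivalued key.
record = {"label": "Seed catalog", "subjects": ["gardens", "seeds"]}
updated = remove_entry(add_entry(record, "subjects", "meadows"), "subjects", "seeds")
print(updated["subjects"])
```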


Strawberry_runners (747 additions and 306 deletions)

  • OCR can also run on Single or Multiple Images per Object

  • OCR will attempt to extract existing text into ALTO before attempting image based OCR

  • OCR allows defining, via metadata, the desired/expected languages (multiple ones), will do post OCR language detection and will decide based on that what type of Natural Language processing to do. Also, requested + detected languages are stored in Solr

  • WACZ will do post HTML/page extraction language detection and will decide based on that what type of Natural Language processing to do. Detected languages are stored in Solr

  • PDFAlto can be configured

  • Tesseract 5.1.0 Language folder can be configured / swapped

  • Language detection is done via FastText now.

  • OCR also supports JP2s

  • Natural language terms are filtered to avoid false positives, not-really agents/places. 

  • Post Processing can be forced on anything via metadata ({"ap:tasks": {"ap:forcepost": true}})

  • Fixes Empties on AltoXML and MINIOCR processing for the OCR Highlight Solr processor that is now stricter

  • WACZ files now also process the extraPages.jsonl used by Browsertrix-crawler (and its cloud super version)
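
For the forced post-processing flag mentioned in the list above, a check against an object's JSON metadata could look roughly like the following Python sketch. It rests on assumptions: the flag is taken to be a boolean nested under `ap:tasks`, and `should_force_post_processing` is a hypothetical name, not a function that exists in Strawberry_runners.

```python
import json
from typing import Any, Dict


def should_force_post_processing(metadata: Dict[str, Any]) -> bool:
    """Return True only when the metadata explicitly forces post processing.

    Assumes the flag lives at metadata["ap:tasks"]["ap:forcepost"] and is a
    JSON boolean; anything else (missing key, string, null) counts as False.
    """
    tasks = metadata.get("ap:tasks")
    if not isinstance(tasks, dict):
        return False
    return tasks.get("ap:forcepost") is True


# Made-up object metadata carrying the flag.
raw = '{"label": "Scanned letter", "ap:tasks": {"ap:forcepost": true}}'
ado = json.loads(raw)
print(should_force_post_processing(ado))
```

Treating only a literal `true` as a request to reprocess is a deliberately strict choice in this sketch, so that a stray string or null never triggers expensive work by accident.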

Who walked the road with us?

If you are reading this, most likely we have met and shared at least a few words and tales, or we might even have been walking together for a while. The when and the where our story started/intersected are less important. Each and every one of you has brought significant interactions. It has been a communal effort all the way and because of that I will try to mention and thank everyone by first name when possible. I will start with those contemplating the meadow and/or those with whom we have held hands (fictionally) on the path of developing this project.

Allison, over 2 years together and I have to confess you have shaped the roadmap and ways of Archipelago more than any colleague, through questioning, constant use, migrations and many “what ifs?". Can’t say it enough, this project could not be possible without your continuous efforts, care, deeper thinking, patience and support. Admiration and sincere gracias. 

Nate. Our Director believed (and still does) in this and provided a base, ideas, wisdom, space, and found the resources (a sustainable plan, a unique experience for an OSS) so we could do what we do best: be stubborn! Thanks Nate.

Giancarlo, a good friend and valued colleague for years, understands that open code leads to open spaces and ideas, that through sharing, documenting and testing and exploring you become part of the result and the solution; pieces of your code and ideas are all over the place and core to us. Also the best bug finder ever and owner of an infinite patience. Grazie mille.

Katie, for her dedicated and mindful migration work, positive and constructive feedback, and enthusiastic approach to working through uncertainties/issues and adapting workflows. 

Al, for his testing and reviewing of docker containers and deployment processes in different environments, for wading into the development waters and getting started contributing enhancements for Archipelago users.

Martha, Jen, Chuck, Sarah, Megan, Prashanth, Lisa, Ianthe, Carl, David, Max, Brenden and Giancarlo (again) for being founding members of the Working Group, a future thinking instance of community practitioners. Thanks to you all for providing actionable feedback on improving our workflows, software and practices.

Also so grateful to our Advisory Board members for your valuable feedback on past releases. All that we learned became part of this project.

Mike, for your amazing disposition to discuss the architecture and everyday crazy ideas, for building tools and for being kind and caring all across this community. 

To the Hamilton College, Barnard College, Union College, Western Washington University, Edinburgh and Rensselaer Polytechnic Institute teams that have been working side by side with the core team, using, testing, extending and experimenting with the platform and their (meta)data. Your feedback, time and positive energies have been invaluable during this release process.

To the many intersecting projects and their members that have added core but also fun functionality to our work, Johannes, Ilya, Simon. Thanks so much. Your creativity and dedication to OSS and willingness to collaborate are legendary.

Pat, friend, thanks so much for your code, your eternal patience and bug finding/fixing. Derek and Noah for your use cases, community interactions (candies, treats, that book, and coffee too) and for making this software accessible to many more users. 

METRO Staff for your often invisible support and patience hearing every week about our progress, especially to Anne for sharing your experience with all of us.

And to all of you, new and old to the community for being here and there, early and late in the day, for being respectful, caring, dedicated and willing to explore with enthusiasm what we have to offer. 

Finally (super important) to all our pets/furry friends that have been of emotional support: Tess, Tisca, Zulu, Trewa, Calisto, Baron, FILL UP YOUR PET’S NAME HERE.

Gracias and a big hug

Diego (a.k.a. the hug machine)

1. Kimmerer, Robin Wall. Braiding Sweetgrass. Milkweed Editions, 2015.

Nate Hill

Aug 16, 2022, 9:08:51 AM
to dp...@metro.org, archipelago commons
A huge congrats to the METRO team and to everyone else who has participated in this work. Beautiful!
Nate

--
Nate Hill
Executive Director
Metropolitan New York Library Council

David Keiser-Clark

Aug 18, 2022, 10:01:19 AM
to dp...@metro.org, archipelago commons, Nate Hill
Enthusiastically cheering for you all from the sidelines :)
David

David Keiser-Clark
Academic Application Developer
Makerspace Program Manager
Office of Information Technology
Williams College

pronouns: he/him/his


Kameelah Rasheed

Aug 18, 2022, 4:26:58 PM
to David Keiser-Clark, dp...@metro.org, archipelago commons, Nate Hill
Also, cheering from the sidelines!

Sent from my iPhone // Kameelah Janan Rasheed 


