Sandbox - XML metadata import

Julie S

Dec 8, 2023, 9:17:48 AM
to archivematica
Hello,

We'd like to use the XML metadata import feature that was introduced in AM 1.14.0 but can't seem to get it working in our dev instance (now on 1.14.1). We've also attempted to configure the metadata XML validation feature, but aren't sure if it's set up properly since the XML metadata is not appearing in the METS (is enabling the validation feature required to enable the import feature?). So far, we've added metadata_xml_validation_enabled = true to /etc/archivematica/clientConfig.conf and a METADATA_XML_VALIDATION_SETTINGS_FILE.

I tried processing the sample XML metadata transfer (/home/artefactual/archivematica-sampledata/SampleTransfers/MetadataXMLValidation/small_initial_ingest) in the sandbox as well but don't see the XML metadata in the dmd sections of the METS file either (I've attached the METS below).

Is there something more or different that needs to be configured or is there an additional step in the dashboard that needs to be taken in order to get XML metadata files parsed into the METS file?

Thank you for any suggestions you can offer!
Julie
sandbox-test.xml

Douglas Cerna

Dec 8, 2023, 10:09:48 AM
to archiv...@googlegroups.com
Hello,

The XML metadata feature is not available in the sandbox because it's still running AM 1.13.2.

You could check the logs of your archivematica-mcp-client service. By default XML validation will not stop processing on errors but will log them in a message starting with "Error(s) processing and/or validating XML metadata".
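For the log check described above, a minimal Python sketch that scans a log file for that message prefix. The function name is hypothetical and the log path is left to the caller, since the location of the archivematica-mcp-client log depends on how the instance was installed.

```python
from pathlib import Path

# Message prefix quoted in this thread.
MARKER = "Error(s) processing and/or validating XML metadata"


def find_xml_metadata_errors(log_path):
    """Return (line_number, text) pairs for log lines containing MARKER."""
    hits = []
    with Path(log_path).open(encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if MARKER in line:
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling find_xml_metadata_errors() with the path to your MCPClient log returns an empty list when the message never appears, which is also what you would see if the feature never ran at all.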

You could also download and set up the directory with the sample validation settings file and schemas, along with its complementary small_initial_ingest transfer, to test whether the feature is set up and executes properly.

Hope this helps.


--
Douglas Cerna (he/him),
Software Developer, Artefactual Systems Inc.
http://www.artefactual.com

Julie S

Dec 8, 2023, 12:28:04 PM
to archivematica
Hi Douglas,

Thank you for clarifying about the sandbox version and pointing out the error message to look for in our testing. We'll try setting up the directory and running the sample transfer in our dev instance. 

Just to clarify, do we have to enable the XML validation feature in order to access the XML metadata import feature or is import independent of validation? And is there a particular location where the validation settings file and schemas directory should be placed (I couldn't quite tell from the documentation)?

Thanks so much!
Julie

Douglas Cerna

Dec 8, 2023, 2:09:02 PM
to archiv...@googlegroups.com
Hello Julie,

You need to enable the feature but can turn validation off for the kind of XML content you want to import.

Let me give you an alternative example of what we have in the Adding and validating metadata section of the documentation.

Suppose we have a transfer with a single file:

my-transfer/
└── beihai.tif

And we have our XML metadata in a beihai.xml file which contains:

<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<![CDATA[
Beihai is a city in the south of Guangxi, People's Republic of China.
]]>
</metadata>

We add it to the transfer in a new metadata directory:

my-transfer/
├── beihai.tif
└── metadata
    └── beihai.xml

Now we tell Archivematica how to associate the XML metadata to our content through a source-metadata.csv file:

filename,metadata,type
objects/beihai.tif,beihai.xml,local

We end up with:




Douglas Cerna

Dec 8, 2023, 2:12:42 PM
to archiv...@googlegroups.com
Ouch. Sorry hit Send accidentally! :D

my-transfer/
├── beihai.tif
└── metadata
    ├── source-metadata.csv
    └── beihai.xml
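
The metadata directory shown above could be assembled with a short Python script. This is only a sketch: the function name add_xml_metadata is made up for illustration, and the file contents are copied from the example in this thread.

```python
import csv
from pathlib import Path

# XML content copied from the beihai.xml example in this thread.
BEIHAI_XML = """<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<![CDATA[
Beihai is a city in the south of Guangxi, People's Republic of China.
]]>
</metadata>
"""


def add_xml_metadata(transfer_dir):
    """Create metadata/beihai.xml and metadata/source-metadata.csv."""
    metadata_dir = Path(transfer_dir) / "metadata"
    metadata_dir.mkdir(parents=True, exist_ok=True)
    (metadata_dir / "beihai.xml").write_text(BEIHAI_XML, encoding="utf-8")

    csv_path = metadata_dir / "source-metadata.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["filename", "metadata", "type"])
        # The filename column uses the objects/ prefix, as in the example.
        writer.writerow(["objects/beihai.tif", "beihai.xml", "local"])
    return csv_path
```

Running add_xml_metadata("my-transfer") next to an existing beihai.tif should reproduce the tree above.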


The last part of enabling the feature but turning off validation comes from the settings file:

XML_VALIDATION = {
    "metadata": None,
}
XML_VALIDATION_FAIL_ON_ERROR = False


Setting the top-level XML element tag (metadata) to None is what allows you to include your XML content without validating it.

Hope this helps. Sorry again for the accidental split.

Douglas Cerna

Dec 8, 2023, 2:14:30 PM
to archiv...@googlegroups.com
Oh, and the location of the settings file and schemas is arbitrary as long as the user running the archivematica-mcp-client service can read them.

Julie S

Dec 15, 2023, 12:50:19 PM
to archivematica
Hi Douglas,

Thank you for the instructions! They're very helpful. We've taken some time to test this, but unfortunately we still aren't able to get the XML metadata parsed into the METS file.

We've configured our instance as follows:
- set "metadata_xml_validation_enabled = true" in /etc/archivematica/clientConfig.conf
- create a settings file with the following content:
XML_VALIDATION = {
    "metadata": None,
}
XML_VALIDATION_FAIL_ON_ERROR = False

- store the settings file in a location that the user running the archivematica-mcp-client service can read: in this case, /lib/archivematica/MCPClient/settings/xml_validation.py

I then created a standard transfer with the structure you indicated and copied the XML metadata for beihai.xml from the Adding and validating metadata section of the documentation.
xmlMetadata-AMexample/
├── beihai.tif
└── metadata
    ├── source-metadata.csv
    └── beihai.xml

And the source-metadata.csv (encoded in UTF-8) has the following contents:
filename,metadata,type
objects/beihai.tif,beihai.xml,local

When I process the package, the XML metadata is recognized as a metadata file, but there is no corresponding dmdSec with the metadata in the METS. There is only one dmdSec for the package as a whole. 
<mets:mets xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version1121/mets.xsd">
<mets:metsHdr CREATEDATE="2023-12-15T17:10:23"/>
<mets:dmdSec ID="dmdSec_1" CREATED="2023-12-15T17:10:22" STATUS="original">
<mets:mdWrap MDTYPE="PREMIS:OBJECT">
<mets:xmlData>
<premis:object xsi:type="premis:intellectualEntity" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" version="3.0">
<premis:objectIdentifier>
<premis:objectIdentifierType>UUID</premis:objectIdentifierType>
<premis:objectIdentifierValue>9abcfbeb-47cf-4b0d-8c9b-1aee5f7354fd</premis:objectIdentifierValue>
</premis:objectIdentifier>
<premis:originalName>xmlMetadata-AMexample-9abcfbeb-47cf-4b0d-8c9b-1aee5f7354fd</premis:originalName>
</premis:object>
</mets:xmlData>
</mets:mdWrap>
</mets:dmdSec>
<mets:amdSec ID="amdSec_1">

I also don't see any errors in the MCPClient or MCPServer logs or in the stdout/stderr for the Generate METS.xml document job related to processing or validating XML metadata.

I'm not sure if there's something I'm missing or that needs adjusting, but any suggestions you can share would be much appreciated.

Thank you again for your help!
Julie

Douglas Cerna

Dec 18, 2023, 11:16:42 PM
to archiv...@googlegroups.com
Hello Julie,

Have you set your METADATA_XML_VALIDATION_SETTINGS_FILE=/lib/archivematica/MCPClient/settings/xml_validation.py environment variable in /etc/default/archivematica-mcp-client?

The rest of your config looks good to me.
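
A quick way to sanity-check that variable and the file it points to is sketched below in Python. This is a generic check, not how Archivematica itself loads the file; the function names are hypothetical, and it should be run as the user that runs the archivematica-mcp-client service so the readability test is meaningful.

```python
import os
import runpy

ENV_NAME = "METADATA_XML_VALIDATION_SETTINGS_FILE"


def settings_path_from_env(environ=os.environ):
    """Return the configured settings file path, or None if unset."""
    return environ.get(ENV_NAME)


def check_settings_file(path):
    """Return a list of problems found in an XML validation settings file."""
    if not os.access(path, os.R_OK):
        return [f"{path} is missing or not readable by the current user"]
    # Execute the file as plain Python and collect its top-level names.
    names = runpy.run_path(path)
    problems = []
    if not isinstance(names.get("XML_VALIDATION"), dict):
        problems.append("XML_VALIDATION is missing or not a dict")
    if "XML_VALIDATION_FAIL_ON_ERROR" not in names:
        problems.append("XML_VALIDATION_FAIL_ON_ERROR is not defined")
    return problems
```

An empty list from check_settings_file(settings_path_from_env()) only means the file is readable and defines the two names used in this thread; the service still needs a restart to pick up changes.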

Julie S

Dec 19, 2023, 2:29:28 PM
to archivematica
Hi Douglas,

That worked like a charm, thank you so much! I'm now seeing XML metadata parsed into the METS for standard packages for the most part (no success with zipped directories yet!).

We haven't changed anything in our settings file, however I am seeing the error that you mentioned previously in the stderr for the Generate METS.xml document job:
Error(s) processing and/or validating XML metadata: 
- XML validation schema not found for keys: ['http://www.loc.gov/standards/mods/mods.xsd', 'http://www.loc.gov/mods/v3', 'mods']
- XML validation schema not found for keys: ['', 'bag-info']

The metadata files flagged in the error don't get parsed into the METS though. Of the XML metadata samples and the examples in the official documentation, I've only been able to get metadata.txt.xml and beihai.xml transferred into the METS; the rest appear to be blocked by the above error.

I also found the numbering for the dmdSec_X in the METS a little odd. I'm not sure if it's intentional or not. In the instances where metadata.txt.xml or beihai.xml were parsed into the METS, the first block gets labelled dmdSec_1 as expected, but the following blocks with the XML metadata do not get labelled sequentially e.g., I've gotten dmdSec_66, dmdSec_19, dmdSec_69, dmdSec_78 (there have been at most 5 descriptive metadata sections in any of these METS files).

Would you happen to have any insights or suggestions about these two issues?

Apologies for all these questions and thank you again!
Julie

Douglas Cerna

Dec 20, 2023, 10:36:02 AM
to archiv...@googlegroups.com
Hello Julie,

If you want to test the small_initial_ingest transfer or the example in the official documentation, I'd recommend adjusting your configuration and setting up the sample validation settings file and schemas, which include namespaces and tags for validating all those metadata XML files. Don't forget to restart your archivematica-mcp-client service after you modify your settings, and to use the right transfer type in each case (Unzipped bag for the small_initial_ingest transfer and Standard for the example in the documentation).

Those dmdSec numbers look very strange. There's a global counter for dmdSec elements in METS creation that assigns those numbers sequentially. When I run both of those transfers I get the expected sequences (14 dmdSecs for the small ingest transfer and 4 for the example in the documentation). Try making the configuration changes I suggested above and see if that fixes this problem.
 

Sarah Romkey

Dec 20, 2023, 10:36:17 AM
to archiv...@googlegroups.com
Hi Julie,

Just a heads-up that Douglas is on holidays so he might not get back to you about those validation errors for a bit.

Regarding the dmdSec numbering, it is indeed random; you'll see the same behaviour if you do a metadata update during reingest, for example. I'm not sure why it behaves like this, but since the primary function of the METS is to be machine-readable, we've kind of figured it's not really harmful. Most important is that those numbers are unique, even if they are a bit illogical/wonky.
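
To confirm that the dmdSec IDs in a given METS file are unique regardless of their order, something like the following Python sketch could help. The function names are hypothetical; the METS namespace is the one shown in the METS excerpts in this thread.

```python
import xml.etree.ElementTree as ET
from collections import Counter

METS = "http://www.loc.gov/METS/"


def dmdsec_summary(mets_xml):
    """Return [(ID, MDTYPE), ...] for every dmdSec in a METS document."""
    root = ET.fromstring(mets_xml)
    summary = []
    for dmdsec in root.iter(f"{{{METS}}}dmdSec"):
        mdwrap = dmdsec.find(f"{{{METS}}}mdWrap")
        mdtype = mdwrap.get("MDTYPE") if mdwrap is not None else None
        summary.append((dmdsec.get("ID"), mdtype))
    return summary


def duplicate_ids(summary):
    """Return the IDs that occur more than once in a dmdsec_summary()."""
    counts = Counter(identifier for identifier, _ in summary)
    return [identifier for identifier, n in counts.items() if n > 1]
```

Pass the file contents as bytes or text, for example dmdsec_summary(open("METS.xml", "rb").read()). The MDTYPE column also makes it easier to spot which sections came from imported XML rather than PREMIS objects.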

Hope that helps!

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Head of Hosting and SaaS Products
@archivematica / @accesstomemory




Sarah Romkey

Dec 20, 2023, 10:37:13 AM
to archiv...@googlegroups.com
Oh and there is Douglas from holidays with a better answer than me! Go back to vacation Douglas! ;) 

Sarah Romkey, MAS,MLIS
Head of Hosting and SaaS Products
@archivematica / @accesstomemory



Julie S

Jan 3, 2024, 1:17:57 PM
to archivematica
Thank you for surfacing from vacation to reply to my questions, Douglas, and thank you Sarah as well for the additional insight re the dmdSec numbers! I thought I had noted that I would be away on vacation as well, but I see that I forgot, so apologies for that! Hope you both had a lovely holiday and happy new year :)

We will try replicating the example configuration exactly to test the validation/processing and dmdSec numbering issues.

In the meantime, may I clarify if the following scenario should be possible?
- We host Archivematica instances for various clients and are considering setting "metadata": None and XML_VALIDATION_FAIL_ON_ERROR = False across our instances to reduce the number of schemas we would need to monitor
- With the above settings, an XML metadata file that references a namespace will not be validated but will have its metadata parsed into the METS file

This is what I was trying to test but perhaps I misunderstood how this feature is supposed to work or there is something else not configured correctly. Processing the small_initial_ingest transfer as an unzipped bag with the above config (renamed in testing to AM-XML-sample), validation fails as expected but the metadata is not being parsed into the METS file either. 
Stdout and stderr for the Generate METS.xml document job:
Standard output (stdout)
Skipping creation of normative structmap
Deleted empty directory /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/metadataReminder/AM-XML-sample-973bd0fe-e96e-4295-bdcb-abf2f03ed98b/logs/fileMeta
Deleted empty directory /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/metadataReminder/AM-XML-sample-973bd0fe-e96e-4295-bdcb-abf2f03ed98b/logs/transfers/AM-XML-sample-ccf14d3c-cdae-45d6-a395-8a61d022487a/logs/fileMeta
METS file subsections counts:
- dmdSec entries: 6
- amdSec entries: 41
- techMD entries: 40
- rightsMD entries: 0
- digiprovMD entries: 319
- sourceMD entries: 1
Errors and diagnostics (stderr)
/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/metadataReminder/AM-XML-sample-973bd0fe-e96e-4295-bdcb-abf2f03ed98b/metadata doesn't exist
Error(s) processing and/or validating XML metadata:
- XML validation schema not found for keys: ['', 'bag-info']
- XML validation schema not found for keys: ['http://www.openarchives.org/OAI/2.0/oai_dc.xsd', 'http://www.openarchives.org/OAI/2.0/oai_dc/', 'dc']
- XML validation schema not found for keys: ['http://www.lido-schema.org', 'lidoWrap']
- XML validation schema not found for keys: ['https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd', 'http://www.loc.gov/MARC21/slim', 'record']
- XML validation schema not found for keys: ['http://www.loc.gov/standards/mods/mods.xsd', 'http://www.loc.gov/mods/v3', 'mods']
- XML validation schema not found for keys: ['https://slubarchiv.slub-dresden.de/slubarchiv/standards/rights/rights1.xsd', 'http://slubarchiv.slub-dresden.de/rights1', 'rightsRecord']
- XML validation schema not found for keys: ['http://www.loc.gov/standards/alto/alto-v2.0.xsd', 'http://www.loc.gov/standards/alto/ns-v2#', 'alto']

And all the dmdSecs in the METS for the package. Only one section corresponds to an XML metadata file:
<mets:metsHdr CREATEDATE="2024-01-03T17:55:21"/>
<mets:dmdSec ID="dmdSec_1" CREATED="2024-01-03T17:55:20" STATUS="original">

<mets:mdWrap MDTYPE="PREMIS:OBJECT">
<mets:xmlData>
<premis:object xsi:type="premis:intellectualEntity" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" version="3.0">
<premis:objectIdentifier>
<premis:objectIdentifierType>UUID</premis:objectIdentifierType>
<premis:objectIdentifierValue>973bd0fe-e96e-4295-bdcb-abf2f03ed98b</premis:objectIdentifierValue>
</premis:objectIdentifier>
<premis:originalName>AM-XML-sample-973bd0fe-e96e-4295-bdcb-abf2f03ed98b</premis:originalName>

</premis:object>
</mets:xmlData>
</mets:mdWrap>
</mets:dmdSec>
<mets:dmdSec ID="dmdSec_2" CREATED="2024-01-03T17:55:20" STATUS="original">

<mets:mdWrap MDTYPE="PREMIS:OBJECT">
<mets:xmlData>
<premis:object xsi:type="premis:intellectualEntity" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" version="3.0">
<premis:objectIdentifier>
<premis:objectIdentifierType>UUID</premis:objectIdentifierType>
<premis:objectIdentifierValue>52c2b7e3-47fb-4fb9-a280-e15255393d0f</premis:objectIdentifierValue>
</premis:objectIdentifier>
<premis:originalName>%transferDirectory%objects/images/</premis:originalName>

</premis:object>
</mets:xmlData>
</mets:mdWrap>
</mets:dmdSec>
<mets:dmdSec ID="dmdSec_3" CREATED="2024-01-03T17:55:20" STATUS="original">

<mets:mdWrap MDTYPE="PREMIS:OBJECT">
<mets:xmlData>
<premis:object xsi:type="premis:intellectualEntity" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" version="3.0">
<premis:objectIdentifier>
<premis:objectIdentifierType>UUID</premis:objectIdentifierType>
<premis:objectIdentifierValue>6fbb56da-473f-4ec0-9daf-c106cdf3f041</premis:objectIdentifierValue>
</premis:objectIdentifier>
<premis:originalName>%transferDirectory%objects/images/scans_tif/</premis:originalName>

</premis:object>
</mets:xmlData>
</mets:mdWrap>
</mets:dmdSec>
<mets:dmdSec ID="dmdSec_4" CREATED="2024-01-03T17:55:20" STATUS="original">

<mets:mdWrap MDTYPE="PREMIS:OBJECT">
<mets:xmlData>
<premis:object xsi:type="premis:intellectualEntity" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" version="3.0">
<premis:objectIdentifier>
<premis:objectIdentifierType>UUID</premis:objectIdentifierType>
<premis:objectIdentifierValue>43220661-9516-4d52-a039-87f3a5dda1e0</premis:objectIdentifierValue>
</premis:objectIdentifier>
<premis:originalName>%transferDirectory%objects/ocr/</premis:originalName>

</premis:object>
</mets:xmlData>
</mets:mdWrap>
</mets:dmdSec>
<mets:dmdSec ID="dmdSec_5" CREATED="2024-01-03T17:55:20" STATUS="original">

<mets:mdWrap MDTYPE="PREMIS:OBJECT">
<mets:xmlData>
<premis:object xsi:type="premis:intellectualEntity" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" version="3.0">
<premis:objectIdentifier>
<premis:objectIdentifierType>UUID</premis:objectIdentifierType>
<premis:objectIdentifierValue>54b18afb-5c0e-4867-b57c-bb24008b2e17</premis:objectIdentifierValue>
</premis:objectIdentifier>
<premis:originalName>
%transferDirectory%objects/ocr/RiesTaunA_1666408611-19330529_xml/

</premis:originalName>
</premis:object>
</mets:xmlData>
</mets:mdWrap>
</mets:dmdSec>
<mets:dmdSec ID="dmdSec_82" CREATED="2024-01-03T17:55:21" STATUS="original">
<mets:mdWrap MDTYPE="OTHER" OTHERMDTYPE="eb3489dd6ef1730d91560b0343098b7f">
<mets:xmlData>
<metadata original_file="metadata.txt">
TEXTFILE Encoding-Test-Pattern: Ä--ä--Ö--ö--Ü--ü--ß External-Description: valid small test example with additional metadata External-Identifier: test-sip-valid-small LDP-collection: Sammlung LDP-funder: Kostenträger LDP-lender: Besitzer LDP-project: Projekt SLUBArchiv-archivalValueDescription: Archivierungsgrund SLUBArchiv-exportToArchiveDate: 2020-07-16T13:40:17 SLUBArchiv-externalId: test-sip_2020-07-17_13-40-17_96152 SLUBArchiv-externalIsilId: DE-14 SLUBArchiv-externalWorkflow: testcases SLUBArchiv-hasConservationReason: false SLUBArchiv-rightsVersion: 1.0 SLUBArchiv-sipVersion: v2020.1 Title: valid small test example with additional metadata Bagging-Date: 2021-11-18 Bag-Software-Agent: Archive::BagIt <https://metacpan.org/pod/Archive::BagIt> Payload-Oxum: 159486.36 Bag-Size: 155.7 kB SLUBArchiv-lzaId: SLUB:LZA:testworkflow:testcases:test-sip_2020-07-17_13-40-17_96152
</metadata>

</mets:xmlData>
</mets:mdWrap>
</mets:dmdSec>

Is this expected or is there something else that should be set on our end? Apologies if I'm going around in circles a bit on this, and thank you for all your help so far!

All the best for 2024!
Julie

Douglas Cerna

Jan 3, 2024, 4:28:42 PM
to archiv...@googlegroups.com
Hello Julie,

Happy new year to you too :)

If I follow your configuration changes correctly so far, I'd say getting only one dmdSec representing the metadata.txt.xml file and warnings/errors for the rest of the XML metadata files in the small_initial_ingest transfer is expected.

Correct me if I'm wrong, but when you simplified your settings file (which you shared on December 15th) to turn validation off for the metadata XML element tag, you also removed/omitted the rest of the XML namespaces and tags. The problem is that the source-metadata.csv file in the small_initial_ingest transfer still lists all the XML metadata files, so the feature tries to process them and cannot match/validate the rest.

Also, the scenario you describe sounds reasonable to me. You should be able to share XML schema files across instances.


Julie S

Jan 8, 2024, 5:27:29 PM
to archivematica
Hi Douglas,

We configured our backend to match the sample on GitHub and the XML metadata was successfully parsed into the METS file with the small_initial_ingest transfer! I'm planning to test the other sample transfers related to XML metadata ingest this week, but I expect those to work now.

I dug into the documentation again, the below section in particular:

The goal of the validation process is to determine a validation key from the root node of each XML file in the metadata directory of the SIP and then get an XML schema file to validate against using the XML_VALIDATION dictionary. If the value for the validation key is set to None the XML metadata file is added to the METS without any validation.

The validation key of each metadata XML file is determined from its root node in the following order:
1. The XML schema location defined in the xsi:noNamespaceSchemaLocation attribute
2. The XML schema location defined in the xsi:schemaLocation attribute
3. Its XML namespace
4. Its tag name
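
The lookup order quoted above might be sketched in Python roughly as follows. This is an illustration of the documented order only, not Archivematica's actual code: the function names are hypothetical, taking the last token of xsi:schemaLocation as the schema location is an assumption, and missing values are simply skipped (the error output earlier in the thread shows an empty string among the keys for bag-info, so the real implementation evidently differs in detail).

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def candidate_keys(xml_text):
    """List possible validation keys for a document, in the quoted order."""
    root = ET.fromstring(xml_text)
    keys = []
    no_ns_location = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
    if no_ns_location:
        keys.append(no_ns_location.strip())
    schema_location = root.get(f"{{{XSI}}}schemaLocation")
    if schema_location:
        # Assumption: the attribute holds "namespace location" pairs and
        # the last token is the schema location of interest.
        keys.append(schema_location.split()[-1])
    # ElementTree spells namespaced tags as "{namespace}tag".
    if root.tag.startswith("{"):
        namespace, tag = root.tag[1:].split("}", 1)
        keys.append(namespace)
    else:
        tag = root.tag
    keys.append(tag)
    return keys


def lookup(keys, xml_validation):
    """Return (found, schema); schema None means 'add without validating'."""
    for key in keys:
        if key in xml_validation:
            return True, xml_validation[key]
    return False, None
```

The point of returning a found flag is the distinction discussed in this thread: a key mapped to None is imported without validation, while a document with no matching key at all is reported as an error.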

Just to clarify, does this mean that a key:value pair must be added to the dictionary in the validation.py file each time a new metadata namespace or tag is introduced in the metadata directory of a transfer, regardless of if we want the metadata validated? If there is no entry for a namespace or tag, the metadata will not be included in the METS?

As a related question, should we expect CSV and XML metadata imports to work in the same transfer, or would one overwrite/take precedence over the other?

Thank you as always for all your help!

Julie

Douglas Cerna

Jan 10, 2024, 9:33:13 AM
to archiv...@googlegroups.com
Hello Julie,

That's correct, you need key: value pairs in your settings file for any namespace or tag you plan to include in your source-metadata.csv, regardless of validation. As you saw in your January 3rd email, you'll get a validation error if an XML metadata file is listed in it and doesn't have a matching entry.

You should be able to combine CSV and XML metadata imports. CSV metadata is processed first.
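
Going by the point above about key: value pairs, a settings file that imports the small_initial_ingest metadata without validating any of it might look like the sketch below. The keys are the namespaces and tags from the "schema not found for keys" messages earlier in this thread; mapping each of them to None is an untested assumption based on the documentation excerpt, and the sample settings file remains the reference.

```python
# Hypothetical settings file: every namespace or tag that appears in the
# transfer's XML metadata gets an entry, and None means "add to the METS
# without validation" according to the documentation quoted in this thread.
XML_VALIDATION = {
    "metadata": None,
    "bag-info": None,
    "http://www.openarchives.org/OAI/2.0/oai_dc/": None,
    "http://www.lido-schema.org": None,
    "http://www.loc.gov/MARC21/slim": None,
    "http://www.loc.gov/mods/v3": None,
    "http://slubarchiv.slub-dresden.de/rights1": None,
    "http://www.loc.gov/standards/alto/ns-v2#": None,
}
# Keep processing even if a file cannot be matched or validated.
XML_VALIDATION_FAIL_ON_ERROR = False
```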


Julie S

Feb 9, 2024, 12:32:33 PM
to archivematica
A very delayed reply, but thank you for all your help with this Douglas! Just to update that we've got the XML import feature working now :)

Thanks again and have a lovely weekend,
Julie