Validating Shadow Metadata Files before importing

mathieu st-louis

Apr 7, 2017, 11:39:29 AM
to Alfresco Bulk Import Tool
First of all, thank you for this tool. We've been using it quite a lot to migrate documents to Alfresco.


I would like to know if there is a way to "pre-validate" Shadow Metadata files before importing the documents.
At first I thought the "dry-run" option would do that, but the properties are not validated against the metamodel.
What does the "dry-run" option do?


Also, just for clarification, what happens if an error occurs during an import?

I understand it this way:
- Any previously imported batches that completed successfully will stay in the repository.
- The batch containing the file in error will stop, and none of that batch's files will be imported into the repository.
- The bulk import will stop completely at this point, and no further batches will be processed.


I am using v2.1.0 of the tool.

Thank you very much,






Peter Monks

Apr 7, 2017, 2:55:55 PM
to alfresco-bulk-f...@googlegroups.com
G'day Mathieu,

> I would like to know if there is a way to "pre-validate" Shadow Metadata files before importing the documents.
> At first I thought the "dry-run" option would do that, but the properties are not validated against the metamodel.

Beyond basic syntactic validation of the XML against the DTD (which the tool does automatically), there's currently no pre-validation of metadata.  Part of the reason for this is that Alfresco itself doesn't provide any way to do this - the only native way to validate metadata is to attempt to write it into the repo then somehow "undo" the write (e.g. by rolling back the transaction, which is expensive and has unacceptable performance characteristics).
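
If it helps, a rough offline version of that syntactic check is easy to script yourself. The sketch below is entirely illustrative - the class name and directory-walking logic are mine, not part of the tool - and it assumes the shadow files are the standard *.metadata.properties.xml files in java.util.Properties XML format, so it simply confirms that each one parses against the properties DTD:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.stream.Stream;

// Walk the source tree and make sure every *.metadata.properties.xml file at least
// parses as java.util.Properties XML (i.e. is valid against
// http://java.sun.com/dtd/properties.dtd).
public class ShadowMetadataSyntaxCheck {
    public static void main(String[] args) throws Exception {
        Path sourceRoot = Paths.get(args[0]);   // root of the content set to be imported
        try (Stream<Path> paths = Files.walk(sourceRoot)) {
            paths.filter(p -> p.toString().endsWith(".metadata.properties.xml"))
                 .forEach(ShadowMetadataSyntaxCheck::check);
        }
    }

    private static void check(Path metadataFile) {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(metadataFile)) {
            props.loadFromXML(in);   // throws on malformed XML or DTD violations
            System.out.println("OK    " + metadataFile + " (" + props.size() + " entries)");
        } catch (Exception e) {
            System.out.println("ERROR " + metadataFile + ": " + e.getMessage());
        }
    }
}

It won't catch anything the dry run wouldn't, but it can be run against the source volume long before you have an Alfresco instance to point the tool at.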

Technically it would be possible to write a custom validator that calls the Alfresco Data Dictionary APIs to manually validate the loaded metadata, but that would involve a fair bit of code that would have to closely track the equivalent code in the core Alfresco source, and that's not something I'd be especially keen on implementing (for one thing Alfresco engineering changes those kinds of implementation details from time to time without any form of public notification, so trying to stay on top of any changes in the core validation logic is practically impossible).
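
(For anyone who does want to go down that path anyway, the general shape would be something like the sketch below. To be clear, this is my illustration rather than anything that exists in the tool: the service wiring, QName handling and depth of checking are all assumptions, and it only proves that the type and property names exist in the Data Dictionary - constraints, mandatory properties and type conversion are still only enforced by the repository at write time.)

import java.util.Map;

import org.alfresco.service.cmr.dictionary.DictionaryService;
import org.alfresco.service.cmr.dictionary.PropertyDefinition;
import org.alfresco.service.namespace.NamespaceService;
import org.alfresco.service.namespace.QName;

// Illustrative only: check already-parsed shadow metadata (a prefixed type name plus a
// map of prefixed property names to values) against the Data Dictionary, without
// writing anything to the repository.
public class MetadataDictionaryChecker {
    private final DictionaryService dictionaryService;   // injected Alfresco services
    private final NamespaceService namespaceService;

    public MetadataDictionaryChecker(DictionaryService dictionaryService,
                                     NamespaceService namespaceService) {
        this.dictionaryService = dictionaryService;
        this.namespaceService = namespaceService;
    }

    public void check(String prefixedType, Map<String, String> properties) {
        QName typeQName = QName.createQName(prefixedType, namespaceService);
        if (dictionaryService.getType(typeQName) == null) {
            throw new IllegalArgumentException("Unknown type: " + prefixedType);
        }
        for (String prefixedProperty : properties.keySet()) {
            QName propQName = QName.createQName(prefixedProperty, namespaceService);
            PropertyDefinition propDef = dictionaryService.getProperty(propQName);
            if (propDef == null) {
                throw new IllegalArgumentException("Unknown property: " + prefixedProperty);
            }
            // Existence only - constraint evaluation and type coercion happen in the
            // repository at write time, which is exactly the logic that's hard to
            // replicate faithfully outside of core Alfresco.
        }
    }
}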

> What does the "dry-run" option do?

The "dry run" feature of the tool does everything short of physically writing the data into the repository (it simply emits a log message at that point), so amongst other things it:
  1. reports everything that it finds on disk, including directories, content files, shadow metadata files and version files
  2. uncovers permission issues, performance issues (e.g. where the source is a remotely mounted volume) and certain types of corruption / data errors in the source content
  3. loads the shadow metadata files, which also syntactically validates the XML
  4. tells you which files and folders will be newly created in the target space in the repository
  5. tells you which files would "replace" an existing file in the target space
Of course, because metadata validation only occurs upon write to the repository, and "dry run" mode doesn't perform any writes, the metadata isn't validated against the data dictionary.
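
To make that concrete, a shadow metadata file like the one below (the custom:projectCode property is made up for the sake of the example) sails through the dry run's XML/DTD check, but whether custom:projectCode actually exists in your content model is only discovered when the real import tries to write it:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="type">cm:content</entry>
    <entry key="aspects">cm:titled</entry>
    <entry key="cm:title">Quarterly report</entry>
    <entry key="custom:projectCode">PRJ-042</entry>
</properties>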

> Also, just for clarification, what happens if an error occurs during an import?
> I understand it this way:
> - Any previously imported batches that completed successfully will stay in the repository.
> - The batch containing the file in error will stop, and none of that batch's files will be imported into the repository.
> - The bulk import will stop completely at this point, and no further batches will be processed.

Yep - that's exactly correct.

And FWIW the recommended triage process in this case is:
  1. Fix whatever issue caused the import to stop (if you can - there are some errors, such as those caused by unreliable network infrastructure, that are non-deterministic)
  2. Rerun the exact same import with "replace" turned off (unchecked) - in this mode the tool will resume from wherever it left off
The tool was carefully designed from day one to be both fail-fast and efficiently resumable, specifically because for large-scale operations such as content ingestions, that combination of features is the most reliable and simplest to reason about.

Cheers,
Peter




mathieu st-louis

Apr 7, 2017, 3:50:42 PM
to Alfresco Bulk Import Tool
Hi Peter,

Thank you for such a precise and complete answer!

Peter Monks

Apr 7, 2017, 4:31:56 PM
to alfresco-bulk-f...@googlegroups.com
You're welcome!  Hope the tool works well for you!

Cheers,
Peter

Apologies for spelling & grammar errors - sent from mobile device
