Pipeline to check and fix CSV metadata


Alan Orth

Jul 31, 2019, 11:08:48 AM
to DSpace Technical Support
Dear list,

For years I've used OpenRefine to do basic sanity checks and cleaning of CSV files before batch upload (either directly or with SAFBuilder). OpenRefine makes it very easy to do things like trim whitespace, facet on text values or custom patterns to eyeball outliers like invalid dates or ISSNs, and you can even write Python (though it's Python 2 and quite cumbersome). This is much more powerful and methodical than using a spreadsheet application for the same task, but still becomes tedious when you have dozens of metadata fields and hundreds or thousands of records.

To make a long story short, I've just written a metadata cleaning pipeline geared towards working with CSVs in the DSpace ecosystem. It applies a series of checks and fixes to each metadata value in sequence. For example, the order is roughly:

1. Strip leading, trailing, and excessive whitespace
2. Strip newlines
3. Remove "unnecessary" Unicode characters like non-breaking spaces
4. Fix invalid multi-value separators like "Kenya|Ethiopia" (DSpace expects "||" between values)
5. Drop duplicate metadata values
6. Validate subject terms against AGROVOC REST API
7. Validate languages against ISO 639-2 or ISO 639-3
8. Validate ISSNs and ISBNs
9. Validate dates against ISO 8601 (and warn if the date is missing)
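
To give a rough idea of what a few of these steps could look like, here is a minimal sketch in plain Python. It is not the code from the project itself: the function names (fix_unicode, fix_whitespace, fix_separators, drop_duplicates, clean, is_valid_issn, is_valid_date) are hypothetical, it assumes "||" is the multi-value separator, it only accepts the YYYY, YYYY-MM, and YYYY-MM-DD subset of ISO 8601, and it replaces non-breaking spaces before collapsing whitespace so that no stray spaces are left behind.

```python
import re
from datetime import datetime

SEPARATOR = "||"  # assumed DSpace multi-value separator


def fix_unicode(value: str) -> str:
    """Turn non-breaking spaces into plain spaces and drop zero-width spaces."""
    return value.replace("\u00a0", " ").replace("\u200b", "")


def fix_whitespace(value: str) -> str:
    """Collapse newlines and runs of whitespace into one space, then trim."""
    return re.sub(r"\s+", " ", value).strip()


def fix_separators(value: str) -> str:
    """Replace a lone "|" (not part of "||") with the assumed separator."""
    return re.sub(r"(?<!\|)\|(?!\|)", SEPARATOR, value)


def drop_duplicates(value: str) -> str:
    """Trim each value and keep only its first occurrence, preserving order."""
    parts = [part.strip() for part in value.split(SEPARATOR)]
    return SEPARATOR.join(dict.fromkeys(parts))


def clean(value: str) -> str:
    """Apply the fixes one after another, like stages in a pipeline."""
    for fix in (fix_unicode, fix_whitespace, fix_separators, drop_duplicates):
        value = fix(value)
    return value


def is_valid_issn(issn: str) -> bool:
    """Check the NNNN-NNNC shape and the ISSN check digit (mod 11)."""
    if not re.fullmatch(r"\d{4}-\d{3}[\dX]", issn):
        return False
    digits = issn.replace("-", "")
    # The first seven digits are weighted 8, 7, ..., 2.
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return digits[7] == ("X" if check == 10 else str(check))


def is_valid_date(value: str) -> bool:
    """Accept only YYYY, YYYY-MM, or YYYY-MM-DD with real calendar values."""
    fmt = {4: "%Y", 7: "%Y-%m", 10: "%Y-%m-%d"}.get(len(value))
    if fmt is None:
        return False
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True
```

With these definitions, clean("  Kenya|Ethiopia\u00a0||Kenya \n") returns "Kenya||Ethiopia". In a real run you would apply clean() to each cell of the CSV and only log a warning, rather than change anything, when one of the validators fails.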

It is slightly geared towards our repository's workflow, but I think the implementation is simple and powerful enough that many of you could benefit from it. I will keep working to extend it. If you are interested in using or improving it you can find the code on GitHub:


Regards,
--
Alan Orth
alan...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche

Bram Luyten

Aug 2, 2019, 4:37:24 AM
to Alan Orth, DSpace Technical Support
Beautiful! Thanks for sharing, Alan!

I hope that this kind of validation/cleaning, or a more advanced warning system, can get into the default uploader at some point, so we can all collectively prevent each other from shooting ourselves in the feet with these uploads.

best,

Bram

Bram Luyten
250-B Lucius Gordon Drive, Suite 3A, West Henrietta, NY 14586
Gaston Geenslaan 14, 3001 Leuven, Belgium
atmire.com

