Hi Everyone!
We're reaching out to see if others have encountered special character encoding issues in OAI-PMH records. Several institutions we support are seeing malformed metadata in fields like dc.title and dc.description.abstract, where apostrophes appear as ' for example, ABC's University renders as ABC's University.
We've addressed this by applying custom logic in our etdms.xsl and oai_dc.xsl crosswalk files.
Since many institutions have their OAI-PMH feeds harvested by third parties, this issue has downstream impact. Our team has noticed that it is widespread across commonly harvested metadata fields. Has anyone else run into this? And would there be interest in our team submitting this as a GitHub issue/PR fix?

Thank you, Community!
Best regards,
Rachel
Hi Rachel.
Yes, we are definitely experiencing the same issue. I have found that a lot of Word special characters end up in the "Abstract" field specifically, because users simply copy and paste from Word into that field during submission. I have addressed this by adding a JS-based replacement script to the submission form, and that seems to be working.

I am currently working on another issue where an item is failing on an invalid XML character, but I am struggling to find where in the metadata it is hiding. I am busy with a script that identifies the actual item that is failing, downloads all of its metadata and scrutinises it for invalid characters, but I have not found it (yet). What I have discovered is that the extracted text from a PDF document contains "smart quotes" characters, but as far as I am aware this does not get pulled into OAI-PMH.
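
For anyone wanting to try the same kind of clean-up outside the submission form, here is a minimal sketch in Python (not the JS script mentioned above). The names `WORD_CHAR_MAP` and `normalize_word_chars` are hypothetical, and the list of characters is an assumption about what Word typically introduces, so it would need adjusting to local data.

```python
# Hypothetical mapping of common Word "smart" characters to plain ASCII.
# The set of characters is an assumption; extend it for your own data.
WORD_CHAR_MAP = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / typographic apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans accepts a dict whose keys are single characters.
_TRANSLATION_TABLE = str.maketrans(WORD_CHAR_MAP)


def normalize_word_chars(text: str) -> str:
    """Return text with the mapped Word characters replaced by ASCII."""
    return text.translate(_TRANSLATION_TABLE)
```

Whether such characters should be replaced at all, or kept and served as valid UTF-8, is a local policy decision; this only shows the mechanics.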
I am continuing my investigation today and hope to resolve it.
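
A rough sketch of how such a scan could look in Python, assuming the item's metadata has already been downloaded into a dict of field name to list of values (that shape, and the names `INVALID_XML_CHARS` and `find_invalid_xml_chars`, are hypothetical). The character class follows the `Char` production of XML 1.0: tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF are allowed; everything else is flagged.

```python
import re

# Matches any character that is NOT allowed by the XML 1.0 Char production.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(metadata):
    """Scan metadata values for characters that are invalid in XML 1.0.

    metadata: dict mapping a field name (e.g. "dc.title") to a list of
    string values. This input shape is an assumption for illustration.
    Returns a list of (field, value_index, offset, code_point) tuples.
    """
    hits = []
    for field, values in metadata.items():
        for index, value in enumerate(values):
            for match in INVALID_XML_CHARS.finditer(value):
                code_point = f"U+{ord(match.group()):04X}"
                hits.append((field, index, match.start(), code_point))
    return hits


if __name__ == "__main__":
    sample = {
        "dc.title": ["ABC's University"],
        # \x0b (vertical tab) is a control character XML 1.0 does not allow.
        "dc.description.abstract": ["Line one\x0bLine two"],
    }
    for hit in find_invalid_xml_chars(sample):
        print(hit)
```

Note that smart quotes such as U+2019 are valid XML characters, so a scan like this will not flag them; it only catches control characters and other code points outside the allowed ranges.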
I would certainly appreciate you logging it as a PR.
Kind Regards.
Shaun
--
All messages to this mailing list should adhere to the Code of Conduct: https://lyrasis.org/code-of-conduct/
---
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group.
To view this discussion visit https://groups.google.com/d/msgid/dspace-tech/0f5da870-c371-4463-b2d7-66922cfc625bn%40googlegroups.com.