CDATA use for imports


Poulter, Dale

May 18, 2020, 9:11:24 AM
to DSpace Technical Support

Good morning,

 

We are migrating several items from an older system to DSpace using the simple item import. As is often the case with older systems, the data is not as clean as we would like. As a result, several items fail because of bad HTML (opening tags with no closing tags) and a few diacritic issues. One way to allow the data to migrate is to wrap the text in <![CDATA[[ ….]]>. However, it appears the import ignores anything in the CDATA section. Is this expected behavior?

 

 

--Dale

 

---------------------------------------
Dale Poulter

Director

Library Technology and Digital Services
Vanderbilt University

419 21st Avenue South, Office 812 
Nashville, TN  37203-2427
(615)343-5388
(615)207-9705 (cell)
dale.p...@vanderbilt.edu

 

Mark H. Wood

May 18, 2020, 11:18:26 AM
to DSpace Technical Support
On Mon, May 18, 2020 at 01:11:17PM +0000, Poulter, Dale wrote:
> We are migrating several items from an older system to DSpace using the simple item import. As is often the case with older systems, the data is not as clean as we would like. As a result, several items fail because of bad HTML (opening tags with no closing tags) and a few diacritic issues. One way to allow the data to migrate is to wrap the text in <![CDATA[[ ....]]>. However, it appears the import ignores anything in the CDATA section. Is this expected behavior?

I assume that it was a typo, but a CDATA section opens with
"<![CDATA[" not "<![CDATA[[".

Are you talking about the content files or the metadata? In other
words, could you describe the problem more thoroughly?
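
On the CDATA syntax noted above: a minimal Python sketch, not part of DSpace, of how a standard XML parser treats a correctly opened CDATA section. It assumes the metadata file uses dcvalue elements with element and qualifier attributes; the record and the helper name read_values are made up for illustration. It says nothing about what the DSpace importer itself does with CDATA.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the abstract holds an unclosed tag and a bare "&",
# both of which would break the file without the CDATA wrapper.
RECORD = """<dublin_core>
  <dcvalue element="title" qualifier="none">Sample thesis</dcvalue>
  <dcvalue element="description" qualifier="abstract"><![CDATA[Work on <b>R&D in Nashville]]></dcvalue>
</dublin_core>"""


def read_values(xml_text):
    """Return {(element, qualifier): text} for every dcvalue in the record."""
    root = ET.fromstring(xml_text)
    return {
        (dc.get("element"), dc.get("qualifier")): dc.text
        for dc in root.iter("dcvalue")
    }


values = read_values(RECORD)
print(values[("description", "abstract")])
```

With the doubled bracket ("<![CDATA[[") the section is still well-formed, but the parsed value then starts with a stray "[" character.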

A tool like HTML Tidy might help if you are ingesting HTML files.

For metadata, you should know that only some fields will be
interpreted as HTML, and in those only a subset of HTML is processed.
I have a small and slowly growing set of substitution rules wired into
my batch ingestion process, to take care of things like bare left
angle brackets ("<") and "R&D".

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

Poulter, Dale

May 18, 2020, 12:46:56 PM
to Mark H. Wood, DSpace Technical Support
Mark,

Thanks for the reply. The information is being pulled from a MySQL database. These are old ETD entries that were entered into the system by students. We are pulling the specific fields to create the dublin_core.xml ingest file.


-Dale
--
All messages to this mailing list should adhere to the DuraSpace Code of Conduct: https://duraspace.org/about/policies/code-of-conduct/
---
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-tech...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dspace-tech/20200518151820.GC16830%40IUPUI.Edu.

Mark H. Wood

May 18, 2020, 3:39:32 PM
to DSpace Technical Support
On Mon, May 18, 2020 at 04:46:50PM +0000, Poulter, Dale wrote:
> Thanks for the reply. The information is being pulled from a MySQL database. These are old ETD entries that were entered into the system by students. We are pulling the specific fields to create the dublin_core.xml ingest file.

OK, so the problem is with metadata values. I haven't yet found a
list of which fields can be marked up, but here's a link to which
elements can be used. Note that this only applies to XMLUI -- I don't
know what JSPUI will do with marked-up metadata.

https://wiki.lyrasis.org/display/DSDOC6x/Simple+HTML+Fragment+Markup

I haven't found my lists of things to be fixed up when building
batches. Each source of batch input seems to come with its own set of
problems anyway. I usually have to build batches, do a test (-t)
ingestion, see what is rejected, add a rule, and repeat until the test
runs without error.

Poulter, Dale

May 18, 2020, 6:30:29 PM
to Mark H. Wood, DSpace Technical Support
Mark,

Just to close the loop. Thanks for your help. I ended up just creating a list of entities and their replacements which appears to have worked -- at least in testing.
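
For anyone landing here later, a rough Python sketch of the entity-replacement idea. It is not the actual list used in this migration: the table entries and the function names are hypothetical. The background is that XML predefines only five named entities (amp, lt, gt, apos, quot), so HTML names such as &eacute; make the file not well-formed unless they are replaced or declared.

```python
import html
from xml.sax.saxutils import escape

# Hypothetical hand-maintained table; extend it as test ingests
# reject more values.
ENTITY_REPLACEMENTS = {
    "&eacute;": "é",
    "&uuml;": "ü",
    "&ouml;": "ö",
    "&nbsp;": " ",
}


def replace_entities(value):
    """Replace HTML-only named entities with literal characters."""
    for entity, replacement in ENTITY_REPLACEMENTS.items():
        value = value.replace(entity, replacement)
    return value


def to_xml_text(value):
    """Alternative: decode every HTML entity, then escape the result for XML."""
    return escape(html.unescape(value))


print(replace_entities("Caf&eacute; M&uuml;nch"))
print(to_xml_text("Caf&eacute; R&amp;D <b>"))
```

The second function avoids maintaining a table, but it also escapes every "<" and ">", so it only fits fields where the markup should be kept as literal text.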


-Dale

Paul Münch

May 19, 2020, 2:02:41 AM
to DSpace Technical Support
Hello,

Unfortunately, it is possible to add executable scripts to the description metadata of communities and collections. Even if no one intends any harm, inexperienced community or collection admins could do some damage.

Do you have a solution or a workaround for this? I've looked for the code that renders the HTML but didn’t find anything.

Many thanks in advance and kind regards,

Paul Münch

Paul Münch

May 19, 2020, 2:09:12 AM
to DSpace Technical Support

Mark H. Wood

May 19, 2020, 8:57:00 AM
to DSpace Technical Support
On Tue, May 19, 2020 at 08:09:07AM +0200, Paul Münch wrote:
> Unfortunately, it is possible to add executable scripts to the description metadata of communities and collections. Even if no one intends any harm, inexperienced community or collection admins could do some damage.
>
> Do you have a solution or a workaround for this? I've looked for the code that renders the HTML but didn’t find anything.

Have you looked at dspace-xmlui/src/main/java/org/dspace/app/xmlui/wing/element/SimpleHTMLFragment.java?

Paul Münch

May 27, 2020, 3:40:42 AM
to dspac...@googlegroups.com
Hello Mark,

Thanks for the reply. I checked SimpleHTMLFragment.java, but it
isn't used in the community or collection UI. I guess that it's an XSLT
problem.

HTML snippets in the community or collection description fields are
interpreted, but not on the item page. The only difference I see is that
item-view.xsl uses xsl:value-of, whereas community-view.xsl and
collection-view.xsl use xsl:copy-of. I changed xsl:copy-of to
xsl:value-of, but nothing changed.

I like the feature itself, but I want to prevent users from adding
script tags to description texts.

Kind regards,

Paul Münch

On 19.05.20 at 14:56, Mark H. Wood wrote:

Bram Luyten

May 27, 2020, 8:54:51 AM
to Paul Münch, DSpace Technical Support
Hi Paul,

I definitely agree that it is a potential security risk and that people editing community and collection pages have to be careful about what they do.
However, the ability to get script tags executed on those pages makes some integrations relatively lightweight.

One example is the Twitter badges you can configure via https://publish.twitter.com/
Copy and paste the resulting script tag into your collection or community description and the tweets are immediately there: https://newdemo.openrepository.com/handle/2384/582855

Maybe it would make sense to allow or disallow either the entry of such code into the description fields, or the rendering, based on a repository wide on-off switch?

with kindest regards,

Bram

Bram Luyten
250-B Suite 3A, Lucius Gordon Drive, West Henrietta, NY 14586
Gaston Geenslaan 14, 3001 Leuven, Belgium
DSpace Express Hosting - Open Repository Hosting - Custom DSpace Services



Pascal-Nicolas Becker

May 27, 2020, 10:04:10 AM
to Paul Münch, DSpace Technical Support
Hi Paul,

This issue has been discussed several times. Community/Collection descriptions can be edited only by repository administrators and Community/Collection administrators. We have always said that those are trusted. Of course you can argue that they could make mistakes without meaning to, but it would be very hard to create a system that actively protects administrators from making any mistake.

If we still feel the urge to change this, I would recommend making it configurable, so that the old behavior remains available.

Best regards,
Pascal


--
The Library Code GmbH
Pascal-Nicolas Becker

Reichsstr. 18
14052 Berlin
Germany

pas...@the-library-code.de
Tel.: +49 30 51 30 48 35
https://www.the-library-code.de

Geschäftsführer: Pascal-Nicolas Becker
Amtsgericht Charlottenburg, HRB 186457 B
USt-IdNr.: DE311762726

Paul Münch

Jun 2, 2020, 2:10:52 AM
to DSpace Technical Support
Hi Bram,
Hi Pascal,

Thanks for your replies; you are both absolutely right. In our repository of open access publications we make heavy use of this feature, and there are only a few administrators. So this is fine, and we know who they are.

On the other hand, there are, for example, research data repositories in which each institute or research group has its own collections with (possibly) changing administrators. It would be hard to monitor each description text.

Making it configurable would be a great feature. But until there is a full implementation, it would be useful for me to know how I can prevent the rendering.
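
As a stopgap for the monitoring problem, description texts could at least be checked for active content outside DSpace. The Python sketch below is only an illustration under that assumption: it does not touch DSpace or its rendering, the names ScriptFinder and find_active_content are hypothetical, and it is not a full sanitizer (for example, it does not look at javascript: URLs).

```python
from html.parser import HTMLParser


class ScriptFinder(HTMLParser):
    """Collect script elements and inline event-handler attributes."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "script":
            self.findings.append("<script> element")
        for name, _value in attrs:
            if name.startswith("on"):
                self.findings.append(f"{name} attribute on <{tag}>")


def find_active_content(description):
    """Return a list of findings for one description text."""
    finder = ScriptFinder()
    finder.feed(description)
    finder.close()
    return finder.findings


print(find_active_content('<p onclick="x()">Hi</p><SCRIPT src="a.js"></SCRIPT>'))
```

An empty list means nothing was flagged; anything else names the element or attribute to review by hand.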

Kind regards,
Paul
