delete invalid characters from xml file

Elena Razlogova

Mar 17, 2008, 8:42:24 AM
to zoter...@googlegroups.com
Hi,

Can someone suggest Windows/Linux programs that can delete invalid
control characters from XML files?

One solution to this problem is to use the Zap Gremlins feature in BBEdit,
but BBEdit is only available for the Mac. It would also be great if I
could recommend a freeware tool.

I need to include this info in the instructions for the Scribe-to-Zotero
transition here:
http://chnm.gmu.edu/tools/scribe/scribetozotero.php

Thanks!
Elena

Bruce D'Arcus

Mar 17, 2008, 3:51:12 PM
to zoter...@googlegroups.com
Elena Razlogova wrote:

> Can someone suggest Windows/Linux programs that can delete invalid
> control characters from XML files?

I'm not really sure, but you might give HTML Tidy a try. It says it can
clean up XML as well; probably just a question of using the right
parameters.

<http://tidy.sourceforge.net/docs/quickref.html>

Another (Python) option might be:

<http://www.crummy.com/software/BeautifulSoup/>

Bruce

Dan Stillman

Mar 17, 2008, 9:19:22 PM
to zoter...@googlegroups.com
On 3/17/08 8:42 AM, Elena Razlogova wrote:
> Can someone suggest Windows/Linux programs that can delete invalid
> control characters from XML files?
>

I haven't tested this, but on Linux/OS X you should be able to do it
with iconv with the source and target encodings set to UTF-8 and with
the -c flag (to discard characters that can't be converted):

iconv -c -f UTF-8 -t UTF-8 input.xml > output.xml

Robert Forkel

Mar 18, 2008, 2:42:08 AM
to zoter...@googlegroups.com
iconv doesn't help (tested on ubuntu 7.10). the problem is, these
control characters are ascii, hence also in utf-8. they just are not
allowed in xml - for no good reason at all, as far as i can see.
regards,
robert
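
Robert's point can be checked with a small Python sketch (Python is an assumption here; nothing in the thread prescribes it, and the sample document is made up). The byte 0x01 decodes as valid UTF-8, which is why a UTF-8 to UTF-8 iconv pass leaves it in place, yet an XML 1.0 parser refuses it:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a well-formed element polluted with one control byte (0x01).
raw = b"<title>War\x01 and Peace</title>"

# Decoding succeeds because every ASCII control character is valid UTF-8,
# so an encoding-level filter has nothing to discard.
text = raw.decode("utf-8")

try:
    ET.fromstring(text)
    parsed = True
except ET.ParseError:
    # The XML parser rejects the control character.
    parsed = False

print("decoded OK:", len(text), "characters; parsed as XML:", parsed)
```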

Bruce D'Arcus

Mar 18, 2008, 6:49:24 AM
to zoter...@googlegroups.com
Robert Forkel wrote:

> iconv doesn't help (tested on ubuntu 7.10). the problem is, these
> control characters are ascii, hence also in utf-8. they just are not
> allowed in xml - for no good reason at all, as far as i can see.

You mean like "?" or ">"?*

If that's the only issue, then presumably a simple search-and-replace
script would do? E.g., if one of the previously mentioned tools doesn't
also clean up this problem, just pipe its output to a simple script that
does the fix-up.

Bruce

* If you're talking about those specific examples, there are good
reasons for disallowing them: entities are delimited by "&" and elements
by "<" and ">". The restriction makes it easier to parse.

Robert Forkel

Mar 18, 2008, 7:40:59 AM
to zoter...@googlegroups.com
On Tue, Mar 18, 2008 at 11:49 AM, Bruce D'Arcus <bda...@gmail.com> wrote:
>
> Robert Forkel wrote:
>
> > iconv doesn't help (tested on ubuntu 7.10). the problem is, these
> > control characters are ascii, hence also in utf-8. they just are not
> > allowed in xml - for no good reason at all, as far as i can see.
>
> You mean like "?" or ">"?*

no. ascii control characters from the range x0000-x001F (see
http://lists.xml.org/archives/xml-dev/199804/msg00502.html ). i don't
know whether this has changed with xml 1.1.

Elena Razlogova

Mar 18, 2008, 9:07:36 AM
to zoter...@googlegroups.com
>>> iconv doesn't help (tested on ubuntu 7.10). the problem is, these
>>> control characters are ascii, hence also in utf-8. they just are not
>>> allowed in xml - for no good reason at all, as far as i can see.
>>
>> You mean like "?" or ">"?*
>
> no. ascii control characters from the range x0000-x001F (see
> http://lists.xml.org/archives/xml-dev/199804/msg00502.html ). i don't
> know whether this has changed with xml 1.1.

Yes, these are the ones I need to delete--Zotero can't import files
containing these characters.
Elena

Dan Stillman

Mar 18, 2008, 10:21:32 AM
to zoter...@googlegroups.com
On 3/18/08 7:40 AM, Robert Forkel wrote:
>> Robert Forkel wrote:
>>
>> > iconv doesn't help (tested on ubuntu 7.10). the problem is, these
>> > control characters are ascii, hence also in utf-8. they just are not
>> > allowed in xml - for no good reason at all, as far as i can see.
>>
>
> no. ascii control characters from the range x0000-x001F (see
> http://lists.xml.org/archives/xml-dev/199804/msg00502.html ). i don't
> know whether this has changed with xml 1.1.
>

OK, right. In that case, there's always a regular expression search.
E.g., in PHP:

$xml = preg_replace('/[\x00-\x1F]/', '', $xml);
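
One caveat, offered as a note rather than a correction of the approach: the range \x00-\x1F also covers tab (\x09), line feed (\x0A), and carriage return (\x0D), which XML 1.0 does allow, so the pattern above would also collapse the file onto a single line. A minimal Python sketch of a narrower pattern follows; the function name strip_invalid_xml_chars is hypothetical and not from the thread.

```python
import re

# Control characters that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A), and carriage return (0x0D).
INVALID_XML_CONTROL = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def strip_invalid_xml_chars(xml_text: str) -> str:
    """Return xml_text with the forbidden control characters removed."""
    return INVALID_XML_CONTROL.sub("", xml_text)

# Made-up sample containing a vertical tab and a NUL among legal whitespace.
sample = "<note>\n\tline one\x0b\x00\n</note>"
cleaned = strip_invalid_xml_chars(sample)
```

The same character class should carry over to the PHP call if only the forbidden characters are meant to go.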

Erik Hetzner

Mar 18, 2008, 12:21:49 PM
to zoter...@googlegroups.com, Robert Forkel
At Tue, 18 Mar 2008 12:40:59 +0100,

"Robert Forkel" <xrot...@googlemail.com> wrote:
> no. ascii control characters from the range x0000-x001F (see
> http://lists.xml.org/archives/xml-dev/199804/msg00502.html ). i don't
> know whether this has changed with xml 1.1.

What about tr? (it sounds like you are on unix)

best,
Erik Hetzner
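
With tr, the deletion would look something like `tr -d '\000-\010\013\014\016-\037' < input.xml > output.xml` (octal ranges that skip tab, line feed, and carriage return; untested here, and the file names are placeholders). The same byte-level filter can be sketched in Python with bytes.translate; the function names below are hypothetical. Working on raw bytes should be safe for UTF-8 input, since bytes below 0x20 never occur inside a multi-byte sequence.

```python
# Bytes to delete: ASCII control characters other than tab, LF, and CR.
DELETE_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))

def delete_control_bytes(data: bytes) -> bytes:
    """Drop the forbidden control bytes, much like `tr -d` would."""
    return data.translate(None, DELETE_BYTES)

def clean_file(src_path: str, dst_path: str) -> None:
    """Read src_path, strip control bytes, and write the result to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(delete_control_bytes(data))
```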

Elena Razlogova

Mar 19, 2008, 8:49:58 AM
to zoter...@googlegroups.com
> OK, right. In that case, there's always a regular expression search.
> E.g., in PHP:
>
> $xml = preg_replace('/[\x00-\x1F]/', '', $xml);

Many thanks for all the suggestions.

So in theory I can add an upload feature (like the Zotero broken-DB
upload), run PHP on a user's RDF file, and then provide a link to
download a cleaned-up version?

Most Scribe users are non-techie historians, and giving them a tr or
PHP command to run is probably not going to do any good.

Thanks,
Elena
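
The server-side step described here could be sketched roughly as follows. This uses Python rather than PHP purely for illustration, leaves out the upload and download handling entirely, and the function name clean_upload is hypothetical. It strips the forbidden characters and then checks that the result parses before a download link would be offered:

```python
import re
import xml.etree.ElementTree as ET
from typing import Tuple

# ASCII control bytes that XML 1.0 forbids (tab, LF, and CR are kept).
_INVALID = re.compile(rb"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def clean_upload(uploaded: bytes) -> Tuple[bytes, bool]:
    """Return (cleaned_bytes, parses_ok) for an uploaded XML/RDF file."""
    cleaned = _INVALID.sub(b"", uploaded)
    try:
        ET.fromstring(cleaned)
        return cleaned, True
    except ET.ParseError:
        # Something other than control characters is wrong with the file.
        return cleaned, False
```

Returning the parse result alongside the cleaned bytes lets the page tell a user when the file has some other problem that this cleanup does not address.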

