Can someone suggest Windows/Linux programs that can delete invalid
control characters from XML files?
One solution to this problem is to use the Zap Gremlins feature in BBEdit,
but BBEdit is only available for the Mac. It would also be great if I
could recommend a freeware tool.
I need to include this info in the instructions for the Scribe-to-Zotero
transition here:
http://chnm.gmu.edu/tools/scribe/scribetozotero.php
Thanks!
Elena
> Can someone suggest Windows/Linux programs that can delete invalid
> control characters from XML files?
I'm not really sure, but you might give HTML Tidy a try. It says it can
clean up XML as well; probably just a question of using the right
parameters.
<http://tidy.sourceforge.net/docs/quickref.html>
Another (Python) option might be:
<http://www.crummy.com/software/BeautifulSoup/>
Bruce
I haven't tested this, but on Linux/OS X you should be able to do it
with iconv with the source and target encodings set to UTF-8 and with
the -c flag (to discard characters that can't be converted):
iconv -c -f UTF-8 -t UTF-8 input.xml > output.xml
> iconv doesn't help (tested on ubuntu 7.10). the problem is, these
> control characters are ascii, hence also in utf-8. they just are not
> allowed in xml - for no good reason at all, as far as i can see.
You mean like "?" or ">"?*
If that's the only issue, then presumably a simple search-and-replace
script would do? E.g., if none of the previously mentioned tools also
cleans up this problem, just pipe the output to a simple script that
does this fix-up.
Bruce
* IF you're talking about those specific examples, there are good
reasons for disallowing them: entities are delimited by "&" and elements
by "<" and ">". The restriction makes it easier to parse.
no. ascii control characters from the range x0000-x001F, apart from
tab (x0009), line feed (x000A) and carriage return (x000D), which xml
1.0 does allow (see
http://lists.xml.org/archives/xml-dev/199804/msg00502.html ). i don't
know whether this has changed with xml 1.1.
Yes, these are the ones I need to delete--Zotero can't import files
containing these characters.
Elena
OK, right. In that case, there's always a regular expression search.
E.g., in PHP:
// Keep tab (\x09), LF (\x0A) and CR (\x0D), which XML 1.0 allows.
$xml = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $xml);
What about tr? (it sounds like you are on unix)
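Something along these lines might work as a rough sketch. It assumes
the only problem is the ASCII control characters that XML 1.0 forbids,
and it keeps tab, line feed and carriage return. The function name
strip_xml_controls and the file names are just placeholders I made up,
and I haven't run it on a Scribe export:

```shell
# Hypothetical helper: delete the control characters XML 1.0 forbids
# (octal 000-010, 013, 014, 016-037) from standard input.
# Tab (011), line feed (012) and carriage return (015) are kept.
# LC_ALL=C makes tr work byte by byte, so multi-byte UTF-8 sequences
# pass through unchanged.
strip_xml_controls() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Example use (placeholder file names):
#   strip_xml_controls < input.xml > output.xml
```

Writing to a new file rather than over the original means the export
can be compared before and after.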
best,
Erik Hetzner
Many thanks for all the suggestions.
So in theory I can add an upload feature (like Zotero's broken db
upload), run PHP on a user's RDF file, then provide a link to download
a cleaned-up version?
Most Scribe users are non-techie historians and giving them a tr or
php command to run probably is not going to do any good.
Thanks,
Elena