Getting rid of special characters

JustWondering

Dec 21, 2009, 12:55:21 PM
I am collecting tweets using the Twitter API. The output is an
XML file, which I process with DOMDocument.

Every once in a while, someone includes special characters or
HTML code in the tweet's body, and that trips up DOMDocument (when I call
->saveXML()). Would anyone have advice on how to strip special
characters and HTML code from a string?

Daniel Egeberg

Dec 21, 2009, 7:05:07 PM

What kind of error message do you get? What do you mean by "special
characters"? You could probably use a regular expression, something like
$string = preg_replace('#\W#', '', $string);, but to give an optimal
solution, more information would be needed.
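
To make the effect of that pattern concrete, here is a small sketch. The sample string is made up for illustration, and strip_tags() is shown only as one possible alternative for the "HTML code" part of the question; it is not something suggested in the thread. Note that \W removes every character that is not a letter, digit, or underscore, so spaces and punctuation disappear too.

```php
<?php
// Hypothetical tweet text, used only for illustration.
$string = "Hello, world! <b>RT</b> @user";

// \W matches any character that is not a letter, digit, or underscore,
// so spaces, punctuation, and the angle brackets are all removed.
$wordCharsOnly = preg_replace('#\W#', '', $string);

// Alternative (an assumption, not from the thread): strip_tags() removes
// only the HTML tags and leaves spaces and punctuation in place.
$withoutTags = strip_tags($string);

echo $wordCharsOnly, "\n"; // HelloworldbRTbuser
echo $withoutTags, "\n";   // Hello, world! RT @user
```

Whether either result is acceptable depends on how much of the original tweet text needs to survive.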

JustWondering

Dec 21, 2009, 11:36:32 PM

Warning: DOMDocument::load() [domdocument.load]: PCDATA invalid Char
value 11
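
For anyone trying to reproduce a warning like this, the following sketch may help. It assumes a made-up XML string containing character 11 and uses loadXML() on a string rather than load() on a file; libxml_use_internal_errors() collects the parser errors so they can be inspected instead of being raised as PHP warnings.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical sample: character 11 (0x0B) inside element text.
$xml = "<status><text>bad\x0Bchar</text></status>";

$doc = new DOMDocument();
$ok = $doc->loadXML($xml); // false when the document cannot be parsed

// Each entry is a LibXMLError object with message, line, and column.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d, column %d: %s\n",
        $error->line, $error->column, trim($error->message));
}
libxml_clear_errors();
```

The line and column reported by the parser point at the offending character, which makes it easier to find in a large feed.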

satya-weblog.com

Dec 22, 2009, 5:11:11 AM

What about wrapping the text in <![CDATA[ ........ ]]>,
or HTML-encoding the data?
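
As a rough sketch of what those two options look like when building a document with DOMDocument: the element names and the tweet text below are invented for the example. Both approaches deal with markup characters such as < and &; as far as I know, neither makes a control character like 0x0B legal, because XML 1.0 forbids it inside CDATA sections and escaped text alike.

```php
<?php
// Hypothetical tweet text containing markup characters.
$tweet = 'I <3 PHP & "XML"';

$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('status'));

// Option 1: a CDATA section keeps < and & literally.
$cdata = $root->appendChild($doc->createElement('cdata'));
$cdata->appendChild($doc->createCDATASection($tweet));

// Option 2: a plain text node; DOMDocument escapes < and & on output.
$text = $root->appendChild($doc->createElement('text'));
$text->appendChild($doc->createTextNode($tweet));

$xml = $doc->saveXML();

// When assembling XML by hand instead, htmlspecialchars() does the encoding.
$encoded = htmlspecialchars($tweet, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $xml;
echo $encoded, "\n";
```

With createTextNode() the escaping is automatic, so manual encoding is mainly relevant when the XML is concatenated as a string.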

"Álvaro G. Vicario"

Dec 22, 2009, 5:38:44 AM
On 22/12/2009 5:36, JustWondering wrote:

This suggests that the XML file is not valid. Can you load the XML file
in a web browser? Can you provide a sample?

--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My satin-finish humor site: http://www.demogracia.com
--

JustWondering

Dec 22, 2009, 8:19:42 AM
The XML file structure is valid. One of the elements is labeled
"Text", and it works fine except for one line that contains the
following text: " (:€:shorty:„:)"

Note that the character that follows the first " is invisible, and
that is what's causing the problem. It's an ASCII control character,
with a code below 32 (hex 20).

JustWondering

Dec 22, 2009, 9:01:31 AM
Specifically, it is ASCII 11 (hex 0B), the vertical tab character.
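
Given that diagnosis, one possible approach is to remove the control characters that XML 1.0 does not allow before handing the data to DOMDocument. The sketch below is only an illustration: strip_invalid_xml_chars() is a hypothetical helper name, the sample XML is made up, and the code assumes PHP 7 or later and UTF-8 (or ASCII) input. It removes the forbidden ASCII control characters (everything below 0x20 except tab, line feed, and carriage return) and does not try to handle other invalid code points.

```php
<?php
// Hypothetical helper: drop the ASCII control characters that XML 1.0
// forbids. Tab (0x09), line feed (0x0A), and carriage return (0x0D)
// are legal in XML and are kept.
function strip_invalid_xml_chars(string $text): string
{
    // Bytes below 0x20 never occur inside a UTF-8 multibyte sequence,
    // so a byte-level pattern is safe for UTF-8 input.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);
}

// Made-up sample with a vertical tab (0x0B) in the element text.
$raw = "<status><text>\x0B(:shorty:)</text></status>";

$doc = new DOMDocument();
$loaded = $doc->loadXML(strip_invalid_xml_chars($raw));

echo $doc->saveXML();
```

For a feed stored in a file, the same helper could be applied to the result of file_get_contents() and the cleaned string passed to loadXML() in place of calling load() directly.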
