Problem loading xml data in .txt file into Word 2003

Bob Alston

May 2, 2010, 8:16:35 PM

I have downloaded a formatted XML data stream and saved it as a *.txt
file. I normally load it into Word 2003 and then output it as an
*.xml document. The request and response both specify UTP-8.

This process has worked just great for hundreds of downloaded
documents.

Recently I have gotten this error from Word 2003:

The xml file xxxxxxxx cannot be opened because there are
problems with the contents

An invalid character was found in text content
Error location: Line: 3, col: 2034

I do not have a text editor that allows me to go to this location in
the file. Can anyone suggest such?

I did try Notepad++: with UTP-8 encoding and word wrap on, I
scrolled down to the end of the file. About 20 characters from the
end, the display showed unintelligible characters. When I changed the
encoding to ASCII, I could see the text, which looked like normal
characters.

1) How can I get to the root cause?

2) Any text editors that will let me go to precisely the specified row
and column?

Bob

Peter Flynn

May 3, 2010, 2:57:16 PM

Emacs.

Alternatively, if allowed, post the file somewhere we can retrieve it from.

///Peter
--
XML FAQ: http://xml.silmaril.ie/

Robert Aldwinckle

May 3, 2010, 4:00:54 PM

"Bob Alston" <boba...@gmail.com> wrote in message
news:7998345c-67b7-4ec5...@r34g2000yqj.googlegroups.com...

> I do not have a text editor that allows me to go to this location in the
> file. Can anyone suggest such?

If it is just a character column you could use Notepad with Wrap off and
Status bar on.

Bob Alston

May 4, 2010, 2:28:55 PM

I set up Notepad as suggested and located the character in question. I see
no problem. It is a simple lowercase character in a normal word.

Is it possible that Word is pointing me to the wrong place?

Bob

Peter Flynn

May 4, 2010, 2:44:05 PM

Yes; if the document contains multibyte characters (e.g. UTF-8) and if one
of them is corrupt, it may push the byte offsets out, so the pointer may
be on the wrong byte.

Use Emacs, or any of the large XML editors that allow you to load a
malformed file in order to repair it. Notepad is not such an editor.
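
For anyone without such an editor, here is a minimal Python sketch of the same idea: scan the raw bytes for anything that is not valid UTF-8 and report where it sits. The function name find_invalid_utf8 and the sample data are made up for illustration. It reports both a character column and a byte column, since the two drift apart as soon as a multibyte character appears earlier on the line, which is one way a reported position can look wrong in a byte-oriented editor.

```python
def find_invalid_utf8(data: bytes):
    """Return (line, char_col, byte_col, hex_value) for each byte run that is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onward is clean
        except UnicodeDecodeError as err:
            bad = pos + err.start                        # absolute offset of the bad byte
            line_start = data.rfind(b"\n", 0, bad) + 1   # offset where this line begins
            line = data.count(b"\n", 0, bad) + 1         # 1-based line number
            byte_col = bad - line_start + 1              # 1-based column counted in bytes
            # 1-based column counted in characters (earlier bad bytes count as one each)
            char_col = len(data[line_start:bad].decode("utf-8", errors="replace")) + 1
            problems.append((line, char_col, byte_col, hex(data[bad])))
            pos += err.end                               # skip past the bad run and continue
    return problems


# Illustrative data only: two lines with accented letters and a stray 0xBF byte.
sample = "héllo\nwörld ".encode("utf-8") + b"\xbf" + b" end"
for line, char_col, byte_col, value in find_invalid_utf8(sample):
    print(f"line {line}, character column {char_col}, byte column {byte_col}: {value}")
```

To check a real file, read it in binary mode (open(path, "rb").read(), where path is wherever the *.txt was saved) and pass the bytes to the function. It slices the data on every error, so it is meant for the occasional problem file rather than bulk processing.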

Robert Aldwinckle

May 4, 2010, 4:43:59 PM

Yes. The error message frequently refers to the root file when in fact the
error is in a script file that it calls. A symptom of that being the case
occurs when the reported "line number" exceeds the number of lines in the
root file. I would try using ProcMon to clarify the context of the error
better.

Good luck

Robert
---

Bob Alston

May 5, 2010, 11:08:39 PM

I downloaded a free 30-day trial of Altova XMLSpy and opened the
*.txt file using that software. It told me there was one invalid
character that should not be present using UTF-8 encoding. It said
the offending character was 0xBF, and it showed an upside-down "?" in
front of the 0xBF.

Unfortunately, while it could replace the character for me, it could
not tell me exactly where the character is or why it is there.

Can anyone tell me if this is a common invalid character in XML and
the likely cause?

Bob
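
A note on why a lone 0xBF is rejected: in UTF-8, bytes in the range 0x80 to 0xBF are continuation bytes and are only legal after a lead byte. The upside-down question mark itself (U+00BF) is a perfectly good XML character, but in UTF-8 it is written as the two bytes 0xC2 0xBF. The small Python sketch below classifies a byte by its role; the function name utf8_byte_role is hypothetical, chosen for this illustration.

```python
def utf8_byte_role(b: int) -> str:
    """Describe the role a single byte value (0-255) can play in a UTF-8 stream."""
    if b <= 0x7F:
        return "single-byte character (ASCII)"
    if b <= 0xBF:
        return "continuation byte"            # only valid after a lead byte
    if 0xC2 <= b <= 0xDF:
        return "lead byte of a 2-byte sequence"
    if 0xE0 <= b <= 0xEF:
        return "lead byte of a 3-byte sequence"
    if 0xF0 <= b <= 0xF4:
        return "lead byte of a 4-byte sequence"
    return "never valid in UTF-8"             # 0xC0, 0xC1, 0xF5-0xFF


print(hex(0xBF), "is a", utf8_byte_role(0xBF))

# The character U+00BF is encoded as two bytes, not as a bare 0xBF.
print("\u00bf".encode("utf-8"))

# A bare 0xBF with no lead byte in front of it cannot be decoded.
try:
    b"\xbf".decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err.reason)
```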

Robert Aldwinckle

May 6, 2010, 11:24:41 AM

"Bob Alston" <boba...@gmail.com> wrote in message

news:27df1fd6-4e62-4873...@h9g2000yqm.googlegroups.com...

> It said the offending character was: 0xBF and it showed an upside down
> "?" in front of the 0xBF

That's what Charmap says. So I would suspect a substitution already
occurred somewhere else--or your new tool is misinterpreting something too.
<eg>

BTW how does this correlate with what Notepad showed you ("simple lowercase
character in normal word")? E.g. consider the context, not just the
problem character.

Robert
---

Bob Alston

May 6, 2010, 1:13:03 PM

I found the offending character. It was the 0xBF character. It
immediately preceded the name of a person, at the end of a paragraph
of text. Sort of a "signature" apparently identifying the author.
The person's name is middle eastern.

In our character set, it is a box-drawing character. You can enter it
easily by holding down the ALT key while typing 191 on the keypad.

Bob
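
The "upside-down question mark" seen in Charmap and XMLSpy and the "box-drawing character" seen here can both be right: the same byte maps to different characters depending on which single-byte code page is assumed. A short Python sketch, assuming the two code pages involved are Windows-1252 and the OEM code page 437 (the thread does not say which code pages these machines actually use):

```python
BYTE = b"\xbf"

# Windows-1252 and Latin-1 both map 0xBF to U+00BF (inverted question mark).
# OEM code page 437 maps 0xBF to U+2510 (a box-drawing corner).
for codec in ("cp1252", "latin-1", "cp437"):
    char = BYTE.decode(codec)
    print(f"{codec:8s} -> U+{ord(char):04X} {char}")
```

ALT plus a keypad number without a leading zero uses the OEM code page, which would explain getting the box-drawing character from ALT+191.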

Peter Flynn

May 6, 2010, 2:08:16 PM

At a wild random guess, that part of the file was copied and pasted from
a document that had been written on an obsolete system using
Windows-1252 or something like it. A previous process found the character
(possibly part of a different multibyte character) and turned it into a 0xBF. See
http://en.wikipedia.org/wiki/Unicode_Specials for details of this
behaviour. The solution is to go back to whoever generated the document
and tell them it is not Unicode-compliant and they should edit it and
regenerate it if they want it processed. And to fix their input systems
to make sure it doesn't happen again.

Bob Alston

May 6, 2010, 2:13:49 PM

Thanks for the scenario on how it might have happened.
I think the key here is for the system generating the XML response to
use filters to ensure that, if the encoding is UTP-8, only UTP-8
characters are included. It seems to me that is the responsibility of
the system generating the XML response.

Bob
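
As a rough illustration of what such a filter on the generating side might look like, here is a minimal Python sketch. It is not how the system in question works; the function name ensure_utf8 is made up, and the fallback to Windows-1252 is only an assumption borrowed from the guess made earlier in the thread.

```python
def ensure_utf8(payload: bytes, legacy_encoding: str = "cp1252") -> bytes:
    """Return payload unchanged if it is valid UTF-8, else transcode it.

    ASSUMPTION: anything that fails UTF-8 validation is treated as text in
    legacy_encoding. That is a guess; a stricter filter would reject the
    payload and report the problem instead.
    """
    try:
        payload.decode("utf-8")
        return payload                      # already valid, pass through untouched
    except UnicodeDecodeError:
        text = payload.decode(legacy_encoding, errors="replace")
        return text.encode("utf-8")         # re-encode so every character is valid UTF-8


# Illustrative data only: a stray 0xBF in front of a name.
fixed = ensure_utf8(b"Signed: \xbfName")
print(fixed)
```

One limitation: if a payload mixes genuine multibyte UTF-8 with stray legacy bytes, transcoding the whole payload would garble the genuine UTF-8 parts, so rejecting the document and fixing it at the source is the safer choice.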

Bob Alston

May 6, 2010, 2:17:13 PM

I should have also mentioned that the document in question is
generated by a State government system.
Great fun.

Bob

Bob Alston

May 11, 2010, 2:45:07 PM

Oops. I incorrectly stated UTP-8 above. It should have been UTF-8.

Guess that just shows I know about unshielded twisted pairs <grin>

Bob