15 GB XML and wrong UTF-8 character


Iuri Cuznetov

Sep 15, 2021, 10:18:22 AM
to OpenRefine
Hi all!

I am trying to convert a huge 15 GB XML file into CSV. OpenRefine opens it quite easily on a computer with 128 GB of RAM. But when I select the tag I need and update the preview, it gives me this error:
"javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,850606237]
Message: An invalid XML character (Unicode: 0xfffe) was found in the element content of the document."

I've tried to open it with the 64-bit version of Notepad++ to correct the encoding, but it says the file is too big. Can somebody please tell me how to delete or ignore this character while parsing?

Thank you in advance,

Thad Guidry

Sep 15, 2021, 10:55:05 AM
to openr...@googlegroups.com
That character, U+FFFE, belongs to the Unicode Specials block. It is a noncharacter, and XML 1.0 does not allow it in a document.

There could be any number of things wrong with the XML file itself (corrupted, hierarchy starts and then stops abruptly because of a bad export process, etc.)

1. Try to validate the XML file first.  If it's invalid, then OpenRefine's XML importer will not be able to convert it to CSV.
2. Try to verify the encoding of the XML file, which is declared in the first lines of the file. With Linux or Git for Windows you can do "head -c 300 myXMLfile.xml" (your error reports everything on row 1, so a plain "head" would print the whole file).
3. You can definitely hack your way down the long, long road of using the line-based importer, but I would not suggest it.
4. Try to find a tool that can tell you where the XML file is broken or invalid and manually clean it up (awk, sed, or my preferred tool, Altova XMLSpy, which has a free trial).
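
If the only problem turns out to be stray characters like the one in the error message, the cleanup in step 4 can also be scripted. The following is a minimal Python sketch, not something OpenRefine provides: the function names and the chunk size are made up for illustration, and it assumes the file really is UTF-8. It streams the file in chunks, so the 15 GB never has to fit in an editor, and drops every character that the XML 1.0 "Char" rule forbids.

```python
import re

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


def clean_xml_file(src_path, dst_path, chunk_chars=1 << 20):
    """Copy src_path to dst_path without invalid XML characters.

    Assumes the input is UTF-8 (an assumption, check the XML declaration).
    Bytes that cannot be decoded are replaced with U+FFFD, which is valid XML.
    Returns the number of characters removed.
    """
    removed = 0
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding="utf-8", errors="replace", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # reads characters, not bytes
            if not chunk:
                break
            cleaned = strip_invalid_xml_chars(chunk)
            removed += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return removed
```

Calling clean_xml_file("myXMLfile.xml", "myXMLfile.clean.xml") writes a cleaned copy and leaves the original untouched. This only removes forbidden characters; it will not repair a truncated or badly nested file, so validating the result is still worthwhile.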

I could go on, but I'll stop there and await your response.


Thad Guidry

Sep 15, 2021, 11:00:51 AM
to openr...@googlegroups.com

You might also use a hex editor to check whether the file begins with the bytes 0xEF 0xBB 0xBF, which is the UTF-8 byte order mark (BOM).
If it doesn't, then an XML parser relies on the encoding declaration in the XML prolog, and assumes UTF-8 when there is none.  Read more here:
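
If no hex editor will open a file this size, the same check can be sketched in a few lines of Python, reading only the first bytes. The function names here are hypothetical, and the BOM constants come from Python's standard codecs module.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
]


def detect_bom(head):
    """Return the encoding named by a leading BOM in head (bytes), or None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def sniff_file_start(path, n=200):
    """Read the first n bytes of path; return (BOM encoding or None, raw bytes)."""
    with open(path, "rb") as f:
        head = f.read(n)
    return detect_bom(head), head
```

Printing the raw bytes returned by sniff_file_start("myXMLfile.xml") also shows the XML declaration, if the file has one, so the declared encoding can be read off directly.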