Stream a huge XML file and correct non-UTF-8 characters


jerome

Feb 27, 2015, 4:54:12 PM
to nod...@googlegroups.com
I'm working with some fairly gigantic XML files (the largest is around 30 MB, with as many as 40 per directory).

They are declared as UTF-8, but unfortunately contain some bytes that are not valid UTF-8. These should be replaced with the correct UTF-8 encoding of the intended characters.

(For example, an em dash should become &#8212; or a literal — character.)

Does anybody have experience with treating giant XML files as streams, operating on them (ideally in the manner described), and writing the corrected version of each file to disk?
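
To make the question concrete, the stream-and-rewrite idea could be sketched roughly as below using Node's built-in stream module. This is only a sketch under assumptions: it guesses that the stray bytes are Windows-1252 (where an em dash is the single byte 0x97), it needs a Node version that provides `stream.pipeline`, and the names `Utf8Repair`, `repair`, `utf8Length`, and `repairFile` are made up for illustration. It works on raw bytes and never parses the XML.

```javascript
const fs = require('fs');
const { Transform, pipeline } = require('stream');

// ASSUMPTION: invalid bytes are Windows-1252. Only 0x80-0x9F differ from
// Latin-1, so only those bytes need an explicit code point mapping.
const CP1252 = {
  0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
  0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
  0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
  0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
  0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
  0x9E: 0x017E, 0x9F: 0x0178,
};

// Length of the valid UTF-8 sequence starting at i, 0 if the byte at i does
// not start one, or -1 if the buffer ends before the sequence is complete.
function utf8Length(buf, i) {
  const b = buf[i];
  if (b < 0x80) return 1;
  let n;
  if (b >= 0xC2 && b <= 0xDF) n = 2;
  else if (b >= 0xE0 && b <= 0xEF) n = 3;
  else if (b >= 0xF0 && b <= 0xF4) n = 4;
  else return 0;
  // The second byte has a narrower range after some lead bytes
  // (this rejects overlong forms, surrogates, and values above U+10FFFF).
  let lo = 0x80;
  let hi = 0xBF;
  if (b === 0xE0) lo = 0xA0;
  if (b === 0xED) hi = 0x9F;
  if (b === 0xF0) lo = 0x90;
  if (b === 0xF4) hi = 0x8F;
  for (let k = 1; k < n; k++) {
    if (i + k >= buf.length) return -1;
    const c = buf[i + k];
    if (c < (k === 1 ? lo : 0x80) || c > (k === 1 ? hi : 0xBF)) return 0;
  }
  return n;
}

// Copies valid UTF-8 through unchanged and re-encodes each invalid byte.
// Unless atEnd is true, a sequence cut off by the end of the buffer is
// returned in `rest` so it can be retried with the next chunk.
function repair(buf, atEnd) {
  const parts = [];
  let start = 0; // start of the current run of valid bytes
  let i = 0;
  while (i < buf.length) {
    const n = utf8Length(buf, i);
    if (n > 0) { i += n; continue; }
    if (n === -1 && !atEnd) break;
    parts.push(buf.subarray(start, i));
    const codePoint = CP1252[buf[i]] || buf[i]; // fall back to Latin-1
    parts.push(Buffer.from(String.fromCodePoint(codePoint), 'utf8'));
    i += 1;
    start = i;
  }
  parts.push(buf.subarray(start, i));
  return { out: Buffer.concat(parts), rest: buf.subarray(i) };
}

class Utf8Repair extends Transform {
  constructor(options) {
    super(options);
    this.tail = Buffer.alloc(0); // bytes of a sequence split across chunks
  }
  _transform(chunk, _encoding, done) {
    const { out, rest } = repair(Buffer.concat([this.tail, chunk]), false);
    this.tail = rest;
    done(null, out);
  }
  _flush(done) {
    done(null, repair(this.tail, true).out);
  }
}

// Streams src through the repair transform into dst; cb receives any error.
function repairFile(src, dst, cb) {
  pipeline(fs.createReadStream(src), new Utf8Repair(), fs.createWriteStream(dst), cb);
}
```

Something like `repairFile('in.xml', 'out.xml', (err) => { if (err) throw err; })` (file names hypothetical) would then process one file without holding it all in memory. Writing the literal character rather than `&#8212;` keeps the transform byte-oriented; if the bad bytes turn out to come from a different encoding, the mapping table would need to change.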

Thanks!