Issue 223 in flaxcode: Entity decoding is slow for large files

codesite...@google.com

Feb 2, 2011, 12:01:25 AM
to flax-c...@googlegroups.com
Status: New
Owner: ----

New issue 223 by kevin.clark: Entity decoding is slow for large files
http://code.google.com/p/flaxcode/issues/detail?id=223

What steps will reproduce the problem?
1. Get a file full of entities
2. Try to decode the entities
3. Wait

What version of the product are you using? On what operating system?
0.7.3, on OSX and Linux

Please provide any additional information below.

decode_entities uses std::string::replace to swap in decoded entities for
their encoded brethren. Each replace is O(N) (it has to shift the rest of
the string over), so decoding M entities costs roughly O(N*M), and when
things get big, it gets ugly (we were having issues with a very odd 26M
file with way too many entities). I've got a patch up over on github. This
is the most significant commit (plus unit test):

https://github.com/Greplin/htmltotext/commit/6aa3037b93df7ef12e6df2588ae35a2c5bb5382e
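To illustrate the one-pass idea (a minimal sketch, not the code from the
commit above): build the output by appending to a fresh string, so each
input byte is copied at most once. The names decode_entities_single_pass,
lookup_entity and kMaxEntityName are made up for this example, and only a
handful of named entities are handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: maps a few named entities to their characters.
// `name` is the text between '&' and ';'. Returns false if it is unknown.
static bool lookup_entity(const std::string& name, char& out) {
    if (name == "amp")  { out = '&'; return true; }
    if (name == "lt")   { out = '<'; return true; }
    if (name == "gt")   { out = '>'; return true; }
    if (name == "quot") { out = '"'; return true; }
    return false;
}

// Assumed upper bound on an entity name, so that a stray '&' cannot
// trigger a long scan for ';'.
static const std::size_t kMaxEntityName = 8;

// Decodes entities by appending to a new string instead of calling
// replace() on the input once per entity.
std::string decode_entities_single_pass(const std::string& in) {
    std::string out;
    out.reserve(in.size());  // the decoded text is never longer than the input
    std::size_t pos = 0;
    while (pos < in.size()) {
        const std::size_t amp = in.find('&', pos);
        if (amp == std::string::npos) {
            out.append(in, pos, std::string::npos);  // copy the tail
            break;
        }
        out.append(in, pos, amp - pos);  // copy the plain text before '&'

        // Look for the closing ';' within a bounded window.
        const std::size_t limit =
            std::min(in.size(), amp + 1 + kMaxEntityName + 1);
        std::size_t semi = amp + 1;
        while (semi < limit && in[semi] != ';') ++semi;

        char decoded;
        if (semi < limit &&
            lookup_entity(in.substr(amp + 1, semi - amp - 1), decoded)) {
            out.push_back(decoded);
            pos = semi + 1;  // skip past the whole entity
        } else {
            out.push_back('&');  // not a known entity: keep the '&' as-is
            pos = amp + 1;
        }
    }
    return out;
}
```

With this shape, decode_entities_single_pass("a &amp; b") gives "a & b",
and text like "AT&T" passes through unchanged.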

The two commits that follow it (listed at
https://github.com/Greplin/htmltotext/commits/) are probably useful too;
they're minor cleanup.

I put in one other optimization (besides getting the copy down to one
pass). Instead of copying each entity into another string for use by
sscanf, I just write a NUL byte into the buffer at the end of the entity
and restore the original character afterwards.
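Roughly, the NUL-byte trick looks like the sketch below. This is an
illustration under assumptions, not the patch itself: the function name
parse_decimal_in_place and its parameters are invented here, and it only
handles the decimal digits of a numeric entity such as &#65;.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: parses the decimal digits in buf[begin, end)
// without copying them into a temporary string. buf[end] (for example
// the ';' that closes the entity) is overwritten with '\0' so that
// sscanf stops there, and is restored before returning.
// Precondition: begin <= end and end < buf.size().
bool parse_decimal_in_place(std::string& buf, std::size_t begin,
                            std::size_t end, unsigned& value) {
    const char saved = buf[end];
    buf[end] = '\0';  // terminate the digits in place
    const int matched = std::sscanf(buf.c_str() + begin, "%u", &value);
    buf[end] = saved;  // put the original byte back
    return matched == 1;
}
```

For the buffer "x &#65; y", calling it with begin = 4 and end = 6 parses
the digits "65" and leaves the buffer exactly as it was. The catch is that
the buffer has to be writable, so this does not work on a const input.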

Anyway, you're welcome to the patches. Let me know if you need changes.
