To whom it may concern,
While working on Android performance, I noticed an issue in TagSoup code involving unnecessary Reader buffering.
HTMLScanner#scan(Reader, ...) prefers a BufferedReader, presumably to avoid stalls due to reading one character at a time. The code does the following:
    if (r0 instanceof BufferedReader) {
        r = new PushbackReader(r0, 5);
    }
    else {
        r = new PushbackReader(new BufferedReader(r0), 5);
    }
This is generally fine, except that Android's Html.fromHtml method calls HTMLScanner with a StringReader as the Reader. Buffering a StringReader has no upside, only a downside: BufferedReader allocates a buffer (8192 chars by default) and copies the StringReader's characters into it.
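To make the extra copy visible, here is a minimal sketch, not TagSoup or Android code. RecordingStringReader and BufferingDemo are hypothetical names invented for this illustration. The subclass records the largest bulk read that BufferedReader requests from the underlying StringReader after a single-character read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BufferingDemo {
    /** Hypothetical helper: a StringReader that records the largest bulk read requested of it. */
    static final class RecordingStringReader extends StringReader {
        int largestRequest = 0;

        RecordingStringReader(String s) {
            super(s);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            largestRequest = Math.max(largestRequest, len);
            return super.read(cbuf, off, len);
        }
    }

    /** Reads one char through a BufferedReader and reports the largest bulk request it made. */
    static int largestRequestAfterOneRead(String text) {
        RecordingStringReader source = new RecordingStringReader(text);
        try (BufferedReader buffered = new BufferedReader(source)) {
            buffered.read(); // a single-char read makes BufferedReader fill its whole buffer
            return source.largestRequest;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: with the default buffer size this prints 8192 on OpenJDK.
        System.out.println(largestRequestAfterOneRead("<b>hi</b>"));
    }
}
```

Even for a nine-character input, the wrapper asks the StringReader to fill its entire default buffer, which is the allocation and copy described above.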
This could be addressed with this change:
    if (r0 instanceof BufferedReader || r0 instanceof StringReader) {
        r = new PushbackReader(r0, 5);
    }
    else {
        r = new PushbackReader(new BufferedReader(r0), 5);
    }
One could go a step further and add more Reader classes that do not need to be wrapped in a BufferedReader. For instance, a LineNumberReader doesn't need to be wrapped because it maintains a buffer internally (it extends BufferedReader, so the existing instanceof check already covers it). A CharArrayReader doesn't need to be wrapped for reasons similar to StringReader: it reads from an internal array.
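As a rough sketch of that broader check, assuming it were factored into a helper: the class and method names below (ReaderWrapping, isAlreadyBuffered, toPushbackReader) are hypothetical and do not exist in TagSoup. Only the pushback size of 5 and the instanceof approach come from the code quoted above.

```java
import java.io.BufferedReader;
import java.io.CharArrayReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

class ReaderWrapping {
    /** Hypothetical helper: true if the reader is already buffered or reads from memory. */
    static boolean isAlreadyBuffered(Reader r) {
        return r instanceof BufferedReader      // also matches LineNumberReader, a subclass
            || r instanceof StringReader        // reads directly from a String
            || r instanceof CharArrayReader;    // reads directly from a char[]
    }

    /** Wraps r0 in a BufferedReader only when that could actually help. */
    static PushbackReader toPushbackReader(Reader r0) {
        Reader base = isAlreadyBuffered(r0) ? r0 : new BufferedReader(r0);
        return new PushbackReader(base, 5);
    }
}
```

A Reader backed by I/O, such as an InputStreamReader, would still be wrapped, so the existing behavior for stream input is unchanged.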
This issue came up in an Android performance investigation of code that was making many calls to Html.fromHtml(String). Each call resulted in one BufferedReader buffer allocation (8192 chars by default).
AFAICT TagSoup 1.2.1 was the final release and there is no way for me to contribute a patch. A colleague encouraged me to notify this email list about the issue, hence this email.
I'm happy to answer questions.