Unnecessary buffering in HTMLScanner


Shai Barack

Jul 13, 2025, 4:55:22 AM
to tagsoup-friends
To whomever this may concern,

While working on Android performance I noticed an issue that originates in TagSoup code: unnecessary Reader buffering.

HTMLScanner#scan(Reader, ...) prefers a BufferedReader, presumably to avoid stalls due to reading one character at a time. The code does the following:

    if (r0 instanceof BufferedReader) {
        r = new PushbackReader(r0, 5);
    }
    else {
        r = new PushbackReader(new BufferedReader(r0), 5);
    }

This is generally fine, except that Android's Html.fromHtml method calls HTMLScanner with a StringReader as the Reader. Buffering a StringReader has no upside, only a downside: BufferedReader allocates a buffer (8192 chars by default) and copies the StringReader's contents into it.

This could be addressed with this change:

    if (r0 instanceof BufferedReader || r0 instanceof StringReader) {
        r = new PushbackReader(r0, 5);
    }
    else {
        r = new PushbackReader(new BufferedReader(r0), 5);
    }

One could go a step further and exempt other Reader classes that gain nothing from a BufferedReader wrapper. A CharArrayReader, for instance, doesn't need to be wrapped for much the same reason as StringReader: it reads from an internal array. (A LineNumberReader needs no special case, since it extends BufferedReader and already passes the existing instanceof check.)
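As a rough illustration of that broader check, here is a minimal, self-contained sketch. The class ReaderWrapping and the methods needsBuffering and toPushbackReader are hypothetical names invented for this example; they are not part of TagSoup, and the set of exempted Reader types is only an assumption about what would be worth listing.

```java
import java.io.BufferedReader;
import java.io.CharArrayReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not TagSoup code: decides whether a Reader
// benefits from being wrapped in a BufferedReader.
public class ReaderWrapping {

    // Returns false for readers that are already buffered or that read
    // from in-memory data. LineNumberReader extends BufferedReader, so
    // the first instanceof check covers it as well.
    public static boolean needsBuffering(Reader r) {
        return !(r instanceof BufferedReader
                || r instanceof StringReader
                || r instanceof CharArrayReader);
    }

    // Mirrors the shape of the logic in HTMLScanner#scan: wrap only when
    // useful, then add a 5-character pushback buffer.
    public static PushbackReader toPushbackReader(Reader r0) {
        Reader source = needsBuffering(r0) ? new BufferedReader(r0) : r0;
        return new PushbackReader(source, 5);
    }

    public static void main(String[] args) throws IOException {
        // A StringReader is passed through without a BufferedReader.
        PushbackReader r = toPushbackReader(new StringReader("<p>hi</p>"));
        int c = r.read();
        r.unread(c);                          // pushback still works
        System.out.println((char) r.read());  // prints "<"
    }
}
```

Any Reader type not on the list, such as an InputStreamReader or FileReader, still gets the BufferedReader wrapper, so behaviour for stream-backed input is unchanged.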

This issue came up in an Android performance investigation of code that was making many calls to Html.fromHtml(String), each of which incurred one 8192-char BufferedReader buffer allocation.

AFAICT TagSoup 1.2.1 was the final release and there is no way for me to contribute a patch. A colleague encouraged me to notify this email list about the issue, hence this email.
I'm happy to answer questions.

Shai Barack

Jul 16, 2025, 12:19:01 AM
to Dave Pawson, tagsoup-friends
Yes.

On Android the problem is exposed because android.text.Html passes a StringReader to TagSoup, which TagSoup needlessly wraps in a BufferedReader. But this is not an Android problem per se.

Even then, it has only been a notable problem in programs that call Html.fromHtml many times in quick succession with a small String, since each call amplifies that small String into a large 8192-char array allocation inside BufferedReader.

You can find out whether the same problem affects your program by profiling runtime heap allocations. If it does, then my proposed patch above, or a patch like it, should help your use case.

If there is a process to contribute this patch upstream then I'd gladly follow it for everyone's benefit. But AFAICT there have been no changes upstream since 2011 or so.

On Sun, Jul 13, 2025, 1:57 AM Dave Pawson <dave....@gmail.com> wrote:
What about users 'outside' of the Android platform / dev environment?
Does the same logic apply?

regards

--

---
You received this message because you are subscribed to the Google Groups "tagsoup-friends" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tagsoup-frien...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/tagsoup-friends/4a4cc079-5213-4986-a90d-acaa35cac1abn%40googlegroups.com.


--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.

Dave Pawson

Jul 16, 2025, 12:19:02 AM
to sha...@google.com, tagsoup-friends
What about users 'outside' of the Android platform / dev environment?
Does the same logic apply?

regards
