Separate memory init file *enlarges* overall code size

Soeren Balko

Dec 21, 2014, 12:48:23 AM

to emscripte...@googlegroups.com

I played around with the separate memory init file and was surprised to see that it actually increases the total code size. The numbers I got are:

* JS with inline memory initialization: 23186642 bytes

* JS and separate memory init file: 15250276+8988744 = 24239020 bytes

That's a bit surprising to me, as I would expect the binary memory init file to spend exactly one byte per, well, byte in HEAP8. Also, the inline memory initializer is a plain JS array, which is unnecessarily large (each value takes 1-3 bytes for its decimal digits plus 1 byte for the comma). If the initial memory values were encoded as a UTF-8 string (and retrieved at runtime using String.charCodeAt), there would be only 1-2 bytes per "entry" (= byte on the heap), or 1.5 bytes on average if the memory init values are uniformly distributed. Of course, that would produce non-printable characters in the generated JS file, and I am not sure all JS interpreters would like that. If not, base64 (or basE91 for less overhead, see http://base91.sourceforge.net/) would still use up less space in the JS file.
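The 1.5 bytes-per-entry estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes every byte value is written unescaped as the character with the same code point, and the helper name `utf8_cost` is made up for illustration.

```python
from collections import Counter

def utf8_cost(value: int) -> int:
    """File bytes needed when a heap byte is stored as the code point chr(value)."""
    return len(chr(value).encode("utf-8"))

# How many of the 256 byte values need 1 byte, and how many need 2 bytes.
histogram = Counter(utf8_cost(v) for v in range(256))

# Average cost per heap byte under a uniform distribution of values.
utf8_avg = sum(utf8_cost(v) for v in range(256)) / 256

print(dict(histogram), utf8_avg)
```

Values below 0x80 take one byte and the rest take two, which is where the average comes from. Escaping of control characters is ignored here.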

If no one objects, I would work on implementing the latter.

Soeren

Alon Zakai

Dec 21, 2014, 2:03:07 PM

to emscripte...@googlegroups.com

Chad did some investigation on base64 and other options here, I think - I believe the result was negative (worse after gzip).

But more importantly, this is surprising - I suspect in the inline version we are removing zeros at the end or something like that, but not in the binary file. Otherwise this makes no sense. We should figure that out.

- Alon

--
You received this message because you are subscribed to the Google Groups "emscripten-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to emscripten-disc...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Chad Austin

Dec 21, 2014, 4:58:26 PM

to emscripte...@googlegroups.com

Hi Soeren,

@evanw and I have done similar research in this issue: https://github.com/kripken/emscripten/issues/2188

If we represent the meminit block as a large string literal rather than an array of 8-bit numbers, it would reduce code size by about 50%, improve JavaScript parse time, AND make it more readable, as C string literals would be visible in the output.

Fixing this has been on our wishlist for some time and if you want to take a crack at it, we would be thrilled!

Let me know if there's anything we can do to help,
Chad

--

Chad Austin

Technical Director, IMVU

http://engineering.imvu.com

http://chadaustin.me

Soeren Balko

Dec 21, 2014, 11:22:43 PM

to emscripte...@googlegroups.com

Seems like these guys did a pretty thorough analysis already and ended up concluding that "ministr" seems to be the way to go. So far, I tried base64, which already gives me a massive improvement for the uncompressed Javascript (18 MB, down from 23 MB) and also a small improvement (~200 kB) for the gzip -9 files.

So far, my tryout implementation is based on a script that I run using --js-transform. It uses regular expressions to find integer arrays and replaces them with some base64 string and a function wrapper around them to turn them into an int8 array. I like the ministr approach as it preserves the (printable) byte sequences (thus benefitting readability of string literals) and apparently speeds up parsing time. If only they had provided their escaping code for non-printable characters.
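A transform of this kind could look roughly like the following Python sketch. It is not the script described above: the regular expression, the sample `allocate` call, and the JS helper name `decodeBase64` are all assumptions made for illustration.

```python
import base64
import re

# Matches a JS array literal made only of 1-3 digit integers, e.g. [1,2,255].
INT_ARRAY = re.compile(r"\[\s*\d{1,3}(?:\s*,\s*\d{1,3})*\s*\]")

def to_base64_call(match: re.Match) -> str:
    """Replace one integer array with a call that decodes a base64 string."""
    values = [int(token) for token in match.group(0)[1:-1].split(",")]
    if any(v > 255 for v in values):
        return match.group(0)  # not a byte array, leave it untouched
    encoded = base64.b64encode(bytes(values)).decode("ascii")
    # decodeBase64 is a hypothetical JS helper that rebuilds the bytes at runtime.
    return f'decodeBase64("{encoded}")'

def transform(js_source: str) -> str:
    """Rewrite every byte-array literal found in the JS source text."""
    return INT_ARRAY.sub(to_base64_call, js_source)

# Illustrative input only; real generated code may look different.
sample = "allocate([72,101,108,108,111,0], 'i8', ALLOC_STATIC);"
result = transform(sample)
print(result)
```

A plain regular expression like this would also rewrite integer arrays that are not memory initializers, so a real script would need to anchor the match on the surrounding call.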

Also, I still need to figure out where exactly the "allocate([....], ...)" calls are generated and change the code in there.

If only for the sake of speeding up the JS parser, I wonder if some basic inline RLE compression could be done as well. It would most probably not help with the gzipped file, but it would keep the uncompressed JS file smaller and potentially speed up parsing, at the expense of a small runtime overhead to expand the RLE-encoded byte sequences into a region on the heap.

Soeren

Chad Austin

Dec 22, 2014, 12:11:25 AM

to emscripte...@googlegroups.com

On Sun, Dec 21, 2014 at 10:22 PM, Soeren Balko <soe...@zfaas.com> wrote:

> So far, my tryout implementation is based on a script that I run using --js-transform. It uses regular expressions to find integer arrays and replaces them with some base64 string and a function wrapper around them to turn them into an int8 array. I like the ministr approach as it preserves the (printable) byte sequences (thus benefitting readability of string literals) and apparently speeds up parsing time. If only they had provided their escaping code for non-printable characters.

Here is the code I wrote for my tests: https://github.com/chadaustin/Web-Benchmarks/blob/master/meminit/meminit.py

Evan pointed out that my code is incorrect in the case of an octal escape followed by numeric digits, but I don't think he posted his code.

> Also, I still need to figure out where exactly the "allocate([....], ...)" calls are generated and change the code in there.

> If only for the sake of speeding up the JS parser, I wonder if some basic inline RLE compression could be done as well. It would most probably not help with the gzipped file, but it would keep the uncompressed JS file smaller and potentially speed up parsing, at the expense of a small runtime overhead to expand the RLE-encoded byte sequences into a region on the heap.

Hm, I wonder if the improved JS parse time would be offset by the more complex decoding / startup JITting. Probably worth measuring.

Either way, a straight up string literal would be a huge improvement over the status quo for people who can't or don't want to use a separate meminit binary file.

Thanks for investigating this. :)

Sören Balko

Dec 22, 2014, 12:13:10 AM

to emscripte...@googlegroups.com

I think the patch is here: https://gist.github.com/evanw/11339324

Soeren Balko, PhD

Founder & Director

zfaas Pty Ltd

Brisbane, QLD

Australia

Soeren Balko

Dec 22, 2014, 6:26:50 PM

to emscripte...@googlegroups.com

Another (minor) optimization is to use the standard JavaScript escapes \t, \b, \f, \n, \r, and \v (2 bytes each) instead of octal sequences (3 bytes, unless the next character is a digit, in which case the fixed-length 4-byte hex encoding \xYZ must be used).
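The escaping rules discussed here can be sketched in Python as follows. This is an illustrative encoder, not the code from either gist: the function name `escape_bytes` is made up, and escaping the double quote and the backslash is an addition that any string literal needs. Note that octal escapes are a legacy feature that JavaScript rejects in strict mode, so this only suits sloppy-mode output.

```python
# Two-character escapes: the short JavaScript escapes plus quote and backslash.
SHORT_ESCAPES = {
    0x08: "\\b", 0x09: "\\t", 0x0A: "\\n", 0x0B: "\\v",
    0x0C: "\\f", 0x0D: "\\r", 0x22: '\\"', 0x5C: "\\\\",
}

def escape_bytes(data: bytes) -> str:
    """Return a double-quoted JS string literal whose code points are the byte values."""
    out = []
    for i, value in enumerate(data):
        next_is_digit = i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39
        if value in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[value])
        elif 0x20 <= value <= 0x7E or value >= 0xA0:
            out.append(chr(value))  # stored as is (1 or 2 bytes in UTF-8)
        elif value < 0x20 and not next_is_digit:
            out.append("\\" + format(value, "o"))  # octal escape such as \0 or \37
        else:
            # Control characters before a digit, and 0x7F..0x9F, use fixed-length hex.
            out.append("\\x" + format(value, "02x"))
    return '"' + "".join(out) + '"'

literal = escape_bytes(b"Hi\n\x00" + b"1" + bytes([0x01, 0x80, 0xE9]))
print(literal)
```

The hex fallback for a control character followed by a digit avoids the ambiguity mentioned earlier in the thread, where an octal escape would swallow the following digit.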

Generally though, I cannot confirm that the "ministr" memory representation is smaller than base64. In my case it is, in fact, larger. Assuming a uniform distribution of byte values, the ministr representation in UTF-8 uses:

1 byte for the 95 printable "Basic Latin" (ASCII) characters with a Unicode code point between U+0020...U+007E - 37.1%

2 bytes for the 96 "Latin-1 Supplement" characters with a Unicode code point between U+00A0...U+00FF - 37.5%

2 bytes for the 7 JavaScript escape sequences \0, \t, \b, \f, \n, \r, \v (ignoring the \0 followed by digit case) - 2.7%

3 bytes for the remaining 25 characters in octal representation between U+0001...U+001F - 9.8%

4 bytes for the remaining 33 characters in hex representation between U+007F...U+009F - 12.9%

So on average, we get about 1.98 bytes per character. In turn, base64 uses 1.333 bytes per character (it only uses characters that take one byte in UTF-8), but produces a memory representation that is not human-readable. For the existing int8-array representation, we get the following:

2 bytes for 10 characters in U+0000...U+0009 (one digit and one comma) - 3.9%

3 bytes for 90 characters in U+000A...U+0063 (two digits and one comma) - 35.2%

4 bytes for the remaining 156 characters in U+0064...U+00FF (three digits and one comma) - 60.9%

On average, that yields 3.57 bytes per character.
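The three averages can be reproduced with a short Python sketch. It follows the cost classes listed in this post under the same uniform-distribution assumption, so it ignores the quote and backslash escapes and the digit-after-octal case. The function names are made up for illustration.

```python
def ministr_cost(value: int) -> int:
    """File bytes per heap byte for the escaped string representation."""
    if 0x20 <= value <= 0x7E:
        return 1  # printable ASCII, stored as is
    if value >= 0xA0:
        return 2  # two-byte UTF-8 sequence
    if value in (0x00, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D):
        return 2  # \0, \b, \t, \n, \v, \f, \r
    if value < 0x20:
        return 3  # octal escape
    return 4      # 0x7F..0x9F as \xYZ

def array_cost(value: int) -> int:
    """File bytes per heap byte in a JS array literal: digits plus one comma."""
    return len(str(value)) + 1

ministr_avg = sum(ministr_cost(v) for v in range(256)) / 256
array_avg = sum(array_cost(v) for v in range(256)) / 256
base64_avg = 4 / 3  # four output characters for every three input bytes

print(ministr_avg, base64_avg, array_avg)
```

Replacing the `range(256)` loops with the byte histogram of a real memory image would give the averages for a skewed distribution instead.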

Of course, real-world static memory content is often skewed towards certain byte values, e.g. \0 and Latin-1 text characters. In those cases, the ministr approach may yield a more compact representation than base64. Other baseX approaches (notably basE91) may be worth a try, but would need a potentially slow, pure JavaScript-based implementation.

In the program that I looked at (ffmpeg), the static memory content also seems to exhibit ranges of recurring identical byte values (often \0), which are amenable to a simple RLE encoding scheme that could be overlaid on the ministr encoding. I am not sure if this is worth doing, as this is essentially what gzip is doing anyway and it comes with a small runtime overhead to expand the RLE-encoded sequences.
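A simple run-length scheme of the kind mentioned here might look like this Python sketch. The item format (tagged tuples) and the `min_run` threshold are assumptions chosen for readability. An actual implementation would have to serialize the runs into the JS file and expand them at startup.

```python
def rle_encode(data: bytes, min_run: int = 4) -> list:
    """Split data into ("lit", bytes) and ("run", value, count) items."""
    items = []
    literal = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1  # extend the run of identical bytes
        if j - i >= min_run:
            if literal:
                items.append(("lit", bytes(literal)))
                literal = bytearray()
            items.append(("run", data[i], j - i))
        else:
            literal.extend(data[i:j])  # too short to be worth a run item
        i = j
    if literal:
        items.append(("lit", bytes(literal)))
    return items

def rle_decode(items: list) -> bytes:
    """Expand the items produced by rle_encode back into the original bytes."""
    out = bytearray()
    for item in items:
        if item[0] == "lit":
            out.extend(item[1])
        else:
            out.extend(bytes([item[1]]) * item[2])
    return bytes(out)

sample = b"abc" + bytes(10) + b"de" + b"\xff" * 5
encoded = rle_encode(sample)
print(encoded)
```

Short runs stay inside literal chunks, so data without long repeats is not inflated by per-run bookkeeping.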

Soeren


Soeren Balko

Dec 23, 2014, 10:14:01 PM

to emscripte...@googlegroups.com

I just submitted a pull request, which extends the "allocate" function to accept static memory defined as a UTF-8 string, where the Unicode character code points are the byte values: https://github.com/kripken/emscripten/pull/3106
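The idea that code points stand in for byte values can be illustrated with a small Python round trip. This only mimics the mapping and is not the code from the pull request. The helper name `string_to_bytes` is hypothetical.

```python
def string_to_bytes(text: str) -> bytes:
    """Mimic the runtime side: one heap byte per character, taken from its code point."""
    values = [ord(ch) for ch in text]
    if any(v > 0xFF for v in values):
        raise ValueError("code point above 255 cannot be a byte value")
    return bytes(values)

original = bytes([0, 72, 105, 0x7F, 0xA0, 0xFF])

# The string the literal evaluates to: one character per heap byte.
text = "".join(chr(b) for b in original)

# How a UTF-8 encoded JS file would store those characters (before any escaping).
on_disk = text.encode("utf-8")

restored = string_to_bytes(on_disk.decode("utf-8"))
print(len(original), len(on_disk), restored == original)
```

The file grows only for values of 0x80 and above, which need two bytes each in UTF-8.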

In order to replace the current representation of static memory as JavaScript arrays with compact UTF-8 strings (see my previous post), I created a "poor man's solution": a simple node script that runs regular expressions over the emscripten-generated JavaScript "binary" and replaces all "allocate([...], ...)" calls with "allocate("...", ...)" calls. The resulting reduction in code size is quite noticeable, though I did not measure the impact on parsing times: https://gist.github.com/anonymous/74196a36efbb4733a6f5

@Alon: Obviously, that functionality should be integrated into emscripten itself. However, after the change to the LLVM backend, I haven't bothered finding my way in there. Can you please suggest where to look (or simply incorporate the functionality yourself, if that's a quick addition)?

Happy holidays everyone,

Soeren

Soeren Balko

Dec 24, 2014, 11:57:53 PM

to emscripte...@googlegroups.com

@Alon: Found it (emscripten-fastcomp/lib/Target/JSBackend/JSBackend.cpp) and will add the feature myself. I would suggest hiding it behind a flag like "-s UTF8_MEMORY=1" or so.

Soeren

Soeren Balko

Dec 27, 2014, 12:52:26 AM

to emscripte...@googlegroups.com

I just opened two pull requests for the incoming branches of the emscripten-fastcomp and emscripten repositories: https://github.com/kripken/emscripten-fastcomp/pull/57, https://github.com/kripken/emscripten/pull/3106. These patches take care of rendering statically allocated memory as an (escaped) UTF8 string in the backend. In order to enable the functionality, I added the configuration option "UTF8_STATIC_MEMORY" (see settings.js). It's on by default. When set to 0, it will generate the static memory as before (i.e., as JS arrays of integers, representing byte values).

Enjoy,

Soeren

Alon Zakai

Dec 29, 2014, 2:22:40 PM

to emscripte...@googlegroups.com

Thanks, I'll take a look at those pulls now. Let's move the discussion there.

- Alon
