Performance question

Александр Гурьянов

Jan 15, 2020, 8:22:51 AM

to emscripte...@googlegroups.com

Hi guys. I'm working on the next version of js-dos, and I have some ideas
about where to go. I have a couple of questions to see which way is better.

Currently, js-dos is a WebAssembly binary (+asyncify changes) that
does emulation in 16ms frames, and stops with emscripten_sleep for
processing user input. So the typical workflow is:

16ms emulation | <process user input> | 16ms emulation | <process user
input> | etc.

Typically <user input> is empty, and performance is limited by browser
performance and by the browser scheduler that stops emulation with
emscripten_sleep.
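
A minimal sketch of this time-slicing flow, written in Python for brevity rather than as the C/asyncify code js-dos actually uses. The generator's `yield` stands in for `emscripten_sleep`; the names `emulate`, `run`, `wall_clock_ms` and the injectable `clock` parameter are hypothetical and only there for illustration.

```python
import time


def wall_clock_ms():
    """Current time in milliseconds (default clock for emulate)."""
    return time.perf_counter() * 1000.0


def emulate(total_steps, budget_ms=16, clock=wall_clock_ms):
    """Run total_steps stand-in emulation steps, yielding once per time slice.

    Each yield plays the role of emscripten_sleep: control goes back to the
    host so it can process input. The yielded value is the step count so far.
    """
    done = 0
    slice_start = clock()
    while done < total_steps:
        done += 1  # stand-in for emulating one instruction
        # Pause when the slice budget is used up, or when all work is done.
        if done == total_steps or clock() - slice_start >= budget_ms:
            yield done
            slice_start = clock()


def run(total_steps, process_input, **kwargs):
    """Drive the emulation, calling process_input between slices."""
    slices = 0
    for _ in emulate(total_steps, **kwargs):
        process_input()
        slices += 1
    return slices
```

Calling `run(1_000_000, handle_events)` with some `handle_events` callback alternates emulation slices and input processing in the same order as the diagram above.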

My first question is:
1. "Can I get any performance boost if I move the emulation core into a worker?"

From my point of view it doesn't make sense, because even in a worker I
would have to stop emulation every 16ms to process messages and send
updates to the page. Even if I don't count the frame data (a 320x200
32bpp image), I think performance will be the same as in the regular
integration. Am I wrong?

My second question:
2. "Is WASI performance better than the default WebView WebAssembly
core?" (android/ios)

One of my targets is to provide js-dos on mobile as well (android/ios).
I think I can do it as is, or I can do it with WASI. I think with
WASI clients can integrate js-dos in more ways, but what about
performance? Is it more performant to use WASI instead of the browser on
mobile, or are they the same?

Floh

Jan 16, 2020, 5:34:45 AM

to emscripten-discuss

I don't have a useful answer, but I have a question, and I can give an overview of how my emulators work on the web:

First the question:

What are you using the asyncify changes for? Just for "slicing up" the execution loop into frames, or also for other things?

And here's how I do it in my emulators (https://floooh.github.io/tiny8bit/):

- emulation and rendering happen through the "traditional" emscripten frame callback (via emscripten_request_animation_frame_loop(), which calls my own frame callback for each frame), and everything runs single-threaded on the browser thread

- in the frame callback I convert the frame duration to a number of ticks the emulator needs to run for this frame (e.g. with 60Hz frame rate and a 1 MHz system this would be around 16667 ticks), the measured frame time must be clamped to some sensible upper value, otherwise if the host system is too slow to run the emulation in realtime, it would run into a death spiral of longer and longer frames until the application freezes and is killed by the browser.

- the emulation executes for this number of ticks; depending on the complexity of the emulated system this takes from under a millisecond up to 7 milliseconds for emulators with a complex video system, like the C64 (on my mid-2014 13" MBP in Firefox; Chrome seems to be around 20% faster)

- inside the emulation, everything runs "cycle-synced": all subsystems of the emulation (the emulated chips) do their work for one cycle, then the next cycle, and so on, all in very simple sequential and single-threaded C code.

- the emulation updates a simple RGBA8 pixel buffer, and an audio sample buffer while it's executing

- in each frame, the pixel buffer is copied into a WebGL texture and rendered as a quad, and when the intermediate audio sample buffer is filled up, it is forwarded to WebAudio (via ScriptProcessorNode, I know it's deprecated, but for this use case it works perfectly)

- when all this is done, the frame callback function returns and the browser runtime is idle until the next frame starts
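
The tick budget and the cycle-synced loop from the list above could be sketched roughly as follows. This is Python for illustration only (the emulators themselves are C); `ticks_for_frame`, `run_frame`, `CounterChip` and the 33 ms clamp are assumptions made for this example, since the post only says the frame time is clamped to "some sensible upper value".

```python
def ticks_for_frame(frame_time_s, clock_hz, max_frame_time_s=0.033):
    """Convert a measured frame duration into emulator ticks.

    Long frames are clamped so that a slow host cannot spiral into
    ever-longer frames. The 33 ms default is an assumed value.
    """
    clamped = min(max(frame_time_s, 0.0), max_frame_time_s)
    return round(clamped * clock_hz)


class CounterChip:
    """Hypothetical chip that only counts the ticks it has seen."""

    def __init__(self):
        self.ticks = 0

    def tick(self):
        self.ticks += 1


def run_frame(chips, frame_time_s, clock_hz):
    """Advance every chip by one cycle at a time for the whole frame."""
    ticks = ticks_for_frame(frame_time_s, clock_hz)
    for _ in range(ticks):
        for chip in chips:
            chip.tick()
    return ticks
```

With a 60Hz frame and a 1 MHz system, `ticks_for_frame(1 / 60, 1_000_000)` gives the 16667 ticks mentioned above, while a half-second stall is clamped to 33000 ticks instead of 500000.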

I've been thinking about multi-threading, but haven't really had an idea how this would make sense. The user can't "see" anything that happens between presented frames, and the emulation itself runs bursts of cycles for one frame, the result is presented, control returns to the browser, and the whole thing repeats for the next frame.

The native versions of the emulators work exactly the same btw, same code base.

Platform abstraction happens through my sokol headers (https://github.com/floooh/sokol).

So as I said, probably not all that helpful if js-dos works entirely differently, but I've arrived at this "application model" after quite a lot of trial and error (especially on the audio side), and for the last 2 years or so it's been in a state where I'm happy with it. It's remarkable that most things which improved the "perceived quality" on the web platforms were actually simplifications (again mostly on the audio side). E.g. in the beginning I had a fairly complicated audio system in place where the emulation speed was tweaked by the audio system's buffer playback timing, hoping to eliminate audio artefacts caused by gaps between buffers, but in the end the better solution was to just "trust" the frame callback being called at exactly the right time (which it seems to be across all platforms I'm testing), generate the audio samples at the frequency expected by WebAudio, and then use ScriptProcessorNode to push new data to WebAudio. This works so well that I wonder why ScriptProcessorNode is deprecated. If you need to generate the audio data on the main thread, it's the perfect way to get the data into WebAudio.

But I digress :)
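
One simple way to generate samples at a fixed output rate from a tick-driven emulation is an integer accumulator that signals when the next sample is due. This is only a sketch of that general technique, not necessarily what these emulators do; the class name `SampleClock` and its fields are made up for the example.

```python
class SampleClock:
    """Decides on which emulator ticks an audio sample should be emitted.

    Assumes sample_rate_hz <= clock_hz, so at most one sample per tick.
    """

    def __init__(self, clock_hz, sample_rate_hz):
        self.clock_hz = clock_hz
        self.sample_rate_hz = sample_rate_hz
        self.acc = 0  # integer accumulator, so no drift over time

    def tick(self):
        """Advance one emulator tick; return True when a sample is due."""
        self.acc += self.sample_rate_hz
        if self.acc >= self.clock_hz:
            self.acc -= self.clock_hz
            return True
        return False
```

Over one emulated second (`clock_hz` ticks) this yields exactly `sample_rate_hz` samples, which is the property that lets the emulation follow the frame callback while still feeding audio at the expected rate.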

Cheers,

-Floh.

Alon Zakai

Jan 16, 2020, 12:35:00 PM

to emscripte...@googlegroups.com

About WASI, it doesn't have graphics/audio/mouse input etc. APIs yet, so it's probably too early to port something like DOSBox? Unless you create a custom embedding, but that would be a lot of work.

About performance, unless iOS supports some form of native WASI VM in a special way, I think you'd need to ship your own VM in your app. And I believe they prevent JIT compilation in such a case, so it would be slower compared to the browser (unless there is a working solution for AOT compilation perhaps? I'm not aware of a robust one yet).

Александр Гурьянов

Jan 17, 2020, 12:32:05 PM

to emscripte...@googlegroups.com

Alon, thanks for the clarification.

Floh, very interesting!

> What are you using the asyncify changes for? Just for "slicing up" the execution loop into frames, or also for other things?

I am using it just for slicing, to prevent the browser from freezing. I
think the emulation flow of js-dos is very similar:

- emulation and rendering happen inside an infinite loop. The function
running this loop can call itself recursively, which is why it's not
easy to use the emscripten frame callback instead of asyncify. So this
loop is paused by asyncify every 16ms (1 frame), and then resumes
execution via a postMessage call. setTimeout is not as effective (min
interval is 4ms). This execution flow has similar performance to
emulation through the emscripten frame callback.

- inside each frame js-dos tries to emulate N instructions. This N is
defined in the config or can be calculated automatically based on host
performance. Of course the real count of emulated instructions is
clamped to fit the 16 ms interval. In "auto" mode, js-dos uses a simple
algorithm that compares how many instructions were executed and
increases or decreases N dynamically.

- emulation updates an SDL pixel buffer that is rendered to the canvas.
Updates are smart: only changed parts are updated (not sure it has any
effect with the emscripten SDL implementation; I guess the whole canvas
is updated each time)

- sound emulation is done by the SDL mixer callback; again it's a
memory buffer that is updated each frame and pushed to the emscripten
SDL implementation (which uses AudioNode). This is not very good;
sometimes the sound can lag.

- mouse and keyboard input are handled through the emscripten SDL implementation.
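
The "auto" mode described in the list above can be illustrated with a small feedback rule. This is a hedged Python sketch and not the actual js-dos or DOSBox algorithm: the function name `adjust_cycles`, the 10% step, the 80% threshold and the min/max bounds are all assumptions picked for the example.

```python
def adjust_cycles(n, elapsed_ms, budget_ms=16.0, n_min=100, n_max=10_000_000):
    """Return the instruction count N to try in the next frame.

    If the last frame overran the budget, shrink N by about 10%.
    If it finished with clear headroom (under 80% of the budget), grow N
    by about 10%. Otherwise leave N alone. The result stays in [n_min, n_max].
    """
    if elapsed_ms > budget_ms:
        n -= n // 10
    elif elapsed_ms < 0.8 * budget_ms:
        n += n // 10
    return max(n_min, min(n, n_max))
```

Calling this once per frame with the measured emulation time makes N settle near the largest value the host can sustain inside the 16 ms slice.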

Actually, I really want to drop SDL and replace it with something
lighter like your sokol abstraction. I'm not sure it will give better
performance, but I think the code will be much easier to understand. To
do this I plan to add a message abstraction, so that I have the option
to switch to a workers/WASI env if needed.

I also have no idea how to increase performance further. The original
DOSBox has a "dynamic core" that recompiles the program for the host CPU
(x86), but it uses a lot of assembler code that can't be compiled to
WebAssembly, and I don't understand it very well yet. Maybe there is
some way to make a "dynamic core" for WebAssembly.

Also, do you think that future SIMD support can increase the performance
of the emulator? All emulated CPU instructions are emulated one by one,
so it looks like SIMD is also useless.

Floh

Jan 19, 2020, 3:03:28 PM

to emscripten-discuss

Thanks for the detailed explanation :)

> Also, do you think that future SIMD support can increase the performance
> of the emulator? All emulated CPU instructions are emulated one by one,
> so it looks like SIMD is also useless.

I did, at least for complete system emulators, but the idea I have is so far out that I don't know if I will ever get around to tackling it.

When you look at complete computer emulation, almost everything that's not the actual CPU is about counters or shifting groups of bits from one place to another, maybe applying logical operations like bitwise AND/OR/XOR. So for each emulated tick in an entire computer system, there's quite a few counters counting, and quite a few logical ops happening, and many of those could happen in parallel encoded in a single SIMD instruction.

One problem is control flow / conditional stuff. E.g. actions that need to happen when counters reach certain values, and so on...

So the crazy idea would be to map the *entire computer system* to a very wide SIMD-like bit vector (probably split across several SIMD registers), and then run a "tick program" over that very-wide SIMD vector which would perform bit twiddling, counting, moving bits from one part of the SIMD vector to another etc etc, "as parallel as possible", until the SIMD vector represents the new state of the computer system after the current tick.
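
As a toy illustration of the "many counters in one wide register" part of this idea, here is a SWAR (SIMD within a register) sketch in Python that keeps four 8-bit counters in one 32-bit word and advances all of them with a single masked addition. It uses a plain integer as a stand-in for a real SIMD register, and the helpers `pack`, `unpack`, `swar_add8` and `tick_counters` are hypothetical names for this example; it says nothing about whether the full scheme is practical.

```python
LOW7 = 0x7F7F7F7F  # low 7 bits of each 8-bit lane
HIGH = 0x80808080  # top bit of each 8-bit lane
ONES = 0x01010101  # the value 1 in every lane


def pack(lanes):
    """Pack four 8-bit values into one 32-bit word (lanes[0] is the low byte)."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack(word, n=4):
    """Split a packed word back into n 8-bit lane values."""
    return [(word >> (8 * i)) & 0xFF for i in range(n)]


def swar_add8(a, b):
    """Add lane by lane with 8-bit wraparound and no carry between lanes."""
    # Add the low 7 bits of every lane (cannot overflow the lane), then
    # restore the top bit of each lane with XOR.
    return ((a & LOW7) + (b & LOW7)) ^ ((a ^ b) & HIGH)


def tick_counters(word):
    """Increment all four counters at once."""
    return swar_add8(word, ONES)
```

For example, `unpack(tick_counters(pack([0xFF, 0x10, 0x7F, 0x00])))` gives `[0x00, 0x11, 0x80, 0x01]`: the first counter wraps without disturbing its neighbour. The conditional actions mentioned above (doing something when a counter hits a given value) are the part this sketch does not address.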

It's probably also a good idea to code-generate / compile this "tick program" from some sort of hardware description language.

TBH I have no idea if that idea is realistic with current SIMD instruction sets, or even GPUs, the idea might just be too crazy :D
