Comments helpful: new Emscripten Fetch API

Jukka Jylänki

Sep 9, 2016, 9:20:44 AM

to emscripte...@googlegroups.com

Hey all,

one of the particular points of attention we have been looking at is to ensure that it would be easy to write memory efficient applications with Emscripten. Runtime performance with asm.js, and even more so with wasm, is in the same ballpark with native, but with memory usage, it can be quite challenging to reach the same memory footprint that a native port has. A big reason for this is the in-memory filesystem MEMFS that Emscripten provides, which emulates synchronous filesystem access by storing all the files in memory all the time.

Also, since we've gained support for multithreading with SharedArrayBuffer, file I/O is something that we want to perform efficiently also in pthreads. In the current form, the MEMFS filesystem is unfortunately not shareable, so file I/O needs to be proxied between threads.

We are working towards optimizing these aspects, and as a first step, I'd like to present a proposal for a new "Emscripten Fetch" API, which is a low level built-in runtime API for managing XHRs and IndexedDB operations from within Emscripten applications.

The high level summary points for emscripten_fetch() are:

- allows performing XHRs in a flexible manner, replacing the multiple wget variants we have (but does not deprecate them)

- allows interacting with IndexedDB storage for easy persistence strategies of XHRed data

- multithread-aware, callable both in main thread and pthreads. (and works in non-multithreaded build modes as well)

- defines a concept of asynchronous, synchronous and "waitable" operations for convenient management of when blocking vs nonblocking operations are needed.

- more fine-grained control over XHRs/IndexedDB persistence of individual files, compared to the "full disk mount"-based MEMFS/IDBFS persistence behavior.

Preliminary documentation, with code samples, is available at this point at http://clb.demon.fi/dump/emscripten_fetch/doc/docs/api_reference/fetch.html . If our contributors have a moment, feedback would be greatly appreciated.

In particular, I'd like to nail the API structure down without missing use cases early on, so that we wouldn't risk having to do any kind of second iteration of emscripten_fetch_ex() immediately afterwards.

The current implementation is still in an experimental phase. A lot of unit-test-like things already work, and the development branch lives in

https://github.com/juj/emscripten/tree/emscripten_fetch

A number of unit tests exist at https://github.com/juj/emscripten/tree/emscripten_fetch/tests/fetch, which can illustrate low level details of the work.

A diff of the work is at https://github.com/kripken/emscripten/pull/4553 if anyone is interested in peeking under the hood.

For the curious, a later planned follow-up step for this work is that there will be an "ASMFS" POSIX filesystem, which builds on top of emscripten_fetch() and its synchronous IndexedDB+XHR accesses to enable memory efficient multithreaded fopen()+fread()+fwrite() semantics that does not need proxying.

Please let me know if you have questions or comments on this, either here or as comments on the PR at https://github.com/kripken/emscripten/pull/4553.

Thanks!

Jukka

Robert Goulet

Sep 9, 2016, 2:35:07 PM

to emscripten-discuss

That looks very promising!

Some early comments:

- Are you planning to add user pointer to emscripten_fetch_attr_t so that we can pass it to the callbacks?

- What about caching? i.e. emscripten_fetch decides whether or not it should do an XHR request or use IndexedDB data based on timestamp of the file from the remote web server?

Floh

Sep 10, 2016, 6:53:31 AM

to emscripten-discuss

This looks very nice!

There's 2 things that immediately popped up, but I think both are covered. Also please excuse the unstructured mess that follows :)

- Allow to add request headers, I think this is taken care of from looking at the emscripten_fetch_attr_t struct in https://github.com/juj/emscripten/blob/emscripten_fetch/system/include/emscripten/fetch.h

- Allow ranged requests (with an offset and size), this is mentioned in the docs but probably not yet in the fetch_attr struct, one question there would be how such a byte range would be represented in the IndexedDB cache and the local file system. I guess a byte-range downloaded from the HTTP server would also only update a byte-range in the local version of the file?

The main problems and technical challenges we are seeing in our games all revolve around cache-control and CDNs. On one hand we need 'content version ids', all the expiration-time based caching mechanisms are fairly useless to us, because a specific game-version (client and server versions) expects data files exactly matching a specific version.

What we usually do for content-versioning is this:

- set the cache-control max-age to some time far in the future, so that all expiration-time-based cache heuristics are (hopefully) disabled

- have some sort of version id for each data file (usually a content hash or a simple version number), and add this to the URL somehow, one way is to add the version id as an URL parameter (e.g. ?id=xxx), or make the version id part of the file name

We are using a MD5-hash of the file content as version-id (hence content-id), another option would be an incremental version number string, or the ETag provided by the web server.

- the content-id could be used as ETag to let the web server know about the file version the client already has (however, I haven't worked with ETags yet, but plan to look into this, so my understanding may be wrong)

- the content-id should be associated with the file content that's already in the IndexedDB or even in memory-filesystem (the memory-filesystem would then also act as some sort of 1st-level-cache), it would be cool to query the IndexedDB backing store by filename *and* content-id (basically do you already have *this* file with *this* specific version)
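The URL-based versioning described above can be sketched in a few lines of C. This is only an illustration: the function name `build_versioned_url` and the `id` parameter name are assumptions made for the example, not part of the proposed fetch API.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: appends a content id to a URL as an "id" query
   parameter, so that each version of a file looks like a distinct resource
   to any caching layer. Returns 0 on success, -1 if the result does not
   fit into out. */
int build_versioned_url(const char *url, const char *content_id,
                        char *out, size_t out_size)
{
    /* Use '&' if the URL already carries a query string, '?' otherwise. */
    char sep = strchr(url, '?') ? '&' : '?';
    int n = snprintf(out, out_size, "%s%cid=%s", url, sep, content_id);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

The resulting string would then be passed as the request URL, while the unversioned name could still serve as the local storage key.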

In general we use our MD5-hash content-id for all sorts of things in our own HTTP filesystem that we have layered on top of HTTP requests:

- replace all expiration-time based cache-control things with explicit versioning via content-id (we add the content-id as an URL parameter, so that the same file with 2 different content-ids appears as different 'things' to any caching layer, e.g. CDNs)

- use the content-id in our own local cache implementation to decide whether the client already has the right version in the local cache, in this case we don't need to do an HTTP request at all

- when a file is downloaded or loaded from cache, compute the actual MD5 hash and compare it to the expected hash to check whether the download or cache content is corrupted (or more likely has been tampered with)

Here's a couple (not really thought-through) ideas how such a content-id could be integrated with the fetch API:

- add a general 'ETag' behaviour flag in the fetch_attr_t flags field, in this case, the ETag coming from the HTTP response would be used as content-id from here on, otherwise, a custom content-id string can be provided through a new 'const char*' in the fetch_attr_t struct

- the content-id value should always be available / associated-with the actual data, for instance when the data is stored in the IndexedDB, the content-id should be stored with it, or the content-id should be available to user-callbacks (e.g. in the emscripten_fetch_t struct)

Provide a small set of 'fetch_cache' functions to query the IndexedDB (and in-memory-filesystem?), for instance (omitting the emscripten_ here):

fetch_cache_exists("name", "content-id"): check if an entry in the cache with matching content-id exists, if it exists, I don't need to do an HTTP request to retrieve it, content-id could be optional, in which case just the item's existence is checked

fetch_cache_get_contentid("name"): if the item exists in the local cache, get its content id

fetch_cache_invalidate("name", "content-id"): invalidate/delete an item in the caching layer, again, if content-id is not provided, only the name is checked
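To make the intended semantics of these three functions concrete, here is a minimal in-memory model. The three `fetch_cache_*` names follow the suggestion above; `fetch_cache_put`, the table layout and the fixed capacity are assumptions added for the sketch, and a real implementation would sit on top of IndexedDB rather than a static array.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_CAPACITY 16

typedef struct {
    const char *name;        /* key, e.g. a URL or path (not copied) */
    const char *content_id;  /* opaque version string (not copied) */
    int in_use;
} cache_entry;

static cache_entry g_cache[CACHE_CAPACITY];  /* stand-in for the real store */

static cache_entry *cache_find(const char *name)
{
    for (size_t i = 0; i < CACHE_CAPACITY; ++i)
        if (g_cache[i].in_use && strcmp(g_cache[i].name, name) == 0)
            return &g_cache[i];
    return NULL;
}

/* Hypothetical helper: stores or overwrites an entry.
   Returns 0 on success, -1 if the table is full. */
int fetch_cache_put(const char *name, const char *content_id)
{
    cache_entry *e = cache_find(name);
    for (size_t i = 0; !e && i < CACHE_CAPACITY; ++i)
        if (!g_cache[i].in_use) e = &g_cache[i];
    if (!e) return -1;
    e->name = name;
    e->content_id = content_id;
    e->in_use = 1;
    return 0;
}

/* content_id may be NULL, in which case only the name is checked. */
int fetch_cache_exists(const char *name, const char *content_id)
{
    cache_entry *e = cache_find(name);
    if (!e) return 0;
    return content_id == NULL || strcmp(e->content_id, content_id) == 0;
}

/* Returns the stored content id, or NULL if the item is not cached. */
const char *fetch_cache_get_contentid(const char *name)
{
    cache_entry *e = cache_find(name);
    return e ? e->content_id : NULL;
}

/* Removes the entry if it matches; returns 1 if something was removed. */
int fetch_cache_invalidate(const char *name, const char *content_id)
{
    if (!fetch_cache_exists(name, content_id)) return 0;
    cache_find(name)->in_use = 0;
    return 1;
}
```

An application would call `fetch_cache_exists(name, expected_id)` before issuing a request and skip the HTTP round trip when it returns 1.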

Ok, so that's about content-ids, what follows now is a brain dump about a way to load data from web servers, but without having implemented that yet:

There's currently 2 popular ways to load data from web servers in games:

1) as asset bundles: advantage is better compressibility, much less protocol overhead during download (HTTP request/response header overhead), much more efficient storage on local device, disadvantage: if one byte in a bundle changes, it must be completely redownloaded, and bundles usually need to be downloaded before a game or level is started, they are not useful for granular on-demand-streaming

2) as unique files: advantages are: much less overhead when files are streamed on demand during game play, if a specific texture is needed at one point during gaming, only that texture is downloaded, not the whole bundle containing that texture, same with versioning, if one specific file has been updated in a new game version, only that one file needs to be downloaded, disadvantage: much higher protocol-overhead, at least with HTTP (much better with HTTP2), usually less compressibility, and less efficient local storage

What I would *like* to try out at some point is some sort of block-based on-demand-paging all the way from the web server to the local cache (or even to the in-memory representation). HTTP downloads would work as range-based HTTP requests on blocks, or consecutive ranges of blocks, and the job of the 'HTTP filesystem' would be to gradually bring the local mirror of the 'page file' to the same state as the remote mirror, and only update blocks that are out-of-date.
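One piece of such a paging scheme is deciding which blocks to request. The sketch below assumes that each block carries a small version id on both sides (an assumption, the post does not say how staleness would be detected) and coalesces consecutive out-of-date blocks so that each run can be fetched with a single ranged request. The names `block_range` and `find_stale_ranges` are hypothetical.

```c
#include <stddef.h>

typedef struct {
    size_t first_block;  /* index of the first stale block in the run */
    size_t block_count;  /* number of consecutive stale blocks */
} block_range;

/* Compares per-block version ids of the local and remote page file and
   groups consecutive differing blocks into ranges. Writes at most max_out
   ranges to out and returns how many were written. */
size_t find_stale_ranges(const unsigned *local, const unsigned *remote,
                         size_t num_blocks, block_range *out, size_t max_out)
{
    size_t n = 0;
    size_t i = 0;
    while (i < num_blocks && n < max_out) {
        if (local[i] == remote[i]) { ++i; continue; }
        size_t start = i;
        while (i < num_blocks && local[i] != remote[i]) ++i;
        out[n].first_block = start;
        out[n].block_count = i - start;
        ++n;
    }
    return n;
}
```

Multiplying `first_block` and `block_count` by the block size gives the byte offset and length for the corresponding HTTP range request.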

However I would first try this on a native platform where I would have direct filesystem access. I'm not sure if this idea collides with a browser's own cache implementation.

Generally I think emscripten's fetch API should remain as general-purpose as possible, without any of those ideas 'baked in', but it should enable the implementation of such ideas. I think associating a content-id with a data item is still within the area of being 'general purpose' as long as no assumptions are made about what the content-id actually is.

Ok, that's all I can think of so far, apologies for the unstructured brain dump :)

Cheers,

-Floh.

Floh

Sep 10, 2016, 6:59:06 AM

to emscripten-discuss

Oh dear, reading my response again it's not really clear what I want :)

I think the main topic is: do we need more control over local caching behaviour down in the fetch API, and if yes, what's the best way while keeping the API general-purpose.

The simple expiration-time based caching model is often not enough for games where it must be guaranteed that the latest version of a file is fetched, but at the same time, redundant HTTP traffic should be avoided.

I hope that makes it clearer where I'm going ;)

-Floh.

Jukka Jylänki

Sep 11, 2016, 10:20:08 AM

to emscripte...@googlegroups.com

- Are you planning to add user pointer to emscripten_fetch_attr_t so that we can pass it to the callbacks?

Yeah, that is definitely a must-have feature, and it is already included, see https://github.com/juj/emscripten/blob/emscripten_fetch/system/include/emscripten/fetch.h#L53 . I haven't managed to document that yet, but will note to do that when completing the docs.

- What about caching? i.e. emscripten_fetch decides whether or not it should do an XHR request or use IndexedDB data based on timestamp of the file from the remote web server?

This is a good idea. The current protocol for IndexedDB-backed emscripten_fetch() is as follows:

1. Look at IndexedDB if URL/pathname exists. If it does, return that without doing any XHRs at all.

2. If the entry in IndexedDB does not exist, do the XHR and store it in IndexedDB.
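The two steps above can be modelled in plain C as follows. This is not the real implementation: the array stands in for IndexedDB, `simulated_xhr` stands in for the network, and all names are made up for the sketch.

```c
#include <stddef.h>
#include <string.h>

#define STORE_CAPACITY 8

typedef struct {
    const char *key;   /* URL or pathname */
    const char *data;  /* file contents */
} stored_file;

static stored_file g_store[STORE_CAPACITY];  /* stand-in for IndexedDB */
static size_t g_store_count = 0;
static int g_xhr_count = 0;                  /* number of simulated downloads */

/* Stand-in for an XHR; a real one would go to the network. */
static const char *simulated_xhr(const char *url)
{
    (void)url;
    ++g_xhr_count;
    return "payload";
}

const char *fetch_persistent(const char *url)
{
    /* Step 1: if the entry exists in storage, return it without any XHR. */
    for (size_t i = 0; i < g_store_count; ++i)
        if (strcmp(g_store[i].key, url) == 0)
            return g_store[i].data;

    /* Step 2: otherwise download the data and persist it. */
    const char *data = simulated_xhr(url);
    if (g_store_count < STORE_CAPACITY) {
        g_store[g_store_count].key = url;
        g_store[g_store_count].data = data;
        ++g_store_count;
    }
    return data;
}
```

Calling `fetch_persistent()` twice with the same URL leaves `g_xhr_count` at 1, which is the "download once, then never again" behaviour described above.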

This is sufficient for scenarios where application manages its own IndexedDB data lifetime scheme. Even though the above protocol looks "cache-like", the intent or semantics of the IndexedDB storage are not to be a (transient) cache, but a permanent persistent storage. If an application would like to treat the storage as a transient cache, they should do so by managing old file eviction manually. I think I'll need to add some kind of API for enumerating files in the storage for that.

For timestamp management, there's two ways:

a) emscripten_fetch() supports passing any custom HTTP request headers, so it is possible to pass an If-Modified-Since: header to the request from application code to manually do downloads only if newer than a given timestamp. The EMSCRIPTEN_FETCH_REPLACE flag can be paired to force replacing an old entry in IndexedDB if it exists. This allows applications to perform their own caching schemes if they want to do something complex.

b) The above will be enough if the application knows the modified timestamps. The timestamps will be stored with the data in IndexedDB (in posix inode format). For convenience, I think it would be good to add a flag EMSCRIPTEN_FETCH_UPDATE_IF_MODIFIED, which would change the protocol to

1. Look at IndexedDB if URL/pathname exists. If it exists, read its modified timestamp.

2. Perform an XHR to download the data, with the If-Modified-Since: time set to that timestamp if the data file did exist.

3. If the XHR comes back with new modified data, update the entry in IndexedDB, otherwise return the original data file from IndexedDB.

This would help applications not need to fire up an IndexedDB timestamp read first, but they could do updates in one fetch() request.
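A rough model of the three-step update protocol, under stated assumptions: the server is simulated by a struct, the conditional request behaves like a standard HTTP If-Modified-Since request (status 304 with no body when nothing changed), and every name below is hypothetical.

```c
#include <stddef.h>
#include <time.h>

typedef struct {
    int present;       /* entry exists in local storage */
    time_t modified;   /* timestamp stored with the data */
    const char *data;
} local_entry;

typedef struct {
    time_t modified;   /* server-side modification time */
    const char *data;
} remote_file;

static int g_bodies_downloaded = 0;  /* counts full downloads */

/* Simulated conditional XHR: returns 304 and no body when the remote file
   is not newer than `since`, otherwise 200 and the body. */
static int conditional_xhr(const remote_file *remote, int have_since,
                           time_t since, const char **body)
{
    if (have_since && remote->modified <= since) {
        *body = NULL;
        return 304;
    }
    ++g_bodies_downloaded;
    *body = remote->data;
    return 200;
}

const char *fetch_update_if_modified(local_entry *local,
                                     const remote_file *remote)
{
    const char *body;
    /* Steps 1 and 2: read the stored timestamp (if any) and send it along. */
    int status = conditional_xhr(remote, local->present, local->modified, &body);
    /* Step 3: on new data update the stored entry, else keep the old one. */
    if (status == 200) {
        local->present = 1;
        local->modified = remote->modified;
        local->data = body;
    }
    return local->data;
}
```

In this model a second call against an unchanged remote file costs one request but no body transfer.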

Slightly related,

c) Our existing IDBFS filesystem has a version number string covering the whole mount point, where a built application can set a version for the cache, and application rebuilds can bump up the number of this cache to invalidate everything existing in the storage. I'm thinking that something similar overarching could be also useful here so that application rebuilds have a scheme where they can nuke all old files if needed.

Jukka Jylänki

Sep 11, 2016, 11:21:03 AM

to emscripte...@googlegroups.com

2016-09-10 13:53 GMT+03:00 Floh <flo...@gmail.com>:

This looks very nice!

There's 2 things that immediately popped up, but I think both are covered. Also please excuse the unstructured mess that follows :)

- Allow to add request headers, I think this is taken care of from looking at the emscripten_fetch_attr_t struct in https://github.com/juj/emscripten/blob/emscripten_fetch/system/include/emscripten/fetch.h

Yeah, applications can add arbitrary request headers (arbitrary==any safe headers not enforced by browser security rules, i.e. anything you could pass in JS as well).

- Allow ranged requests (with an offset and size), this is mentioned in the docs but probably not yet in the fetch_attr struct, one question there would be how such a byte range would be represented in the IndexedDB cache and the local file system. I guess a byte-range downloaded from the HTTP server would also only update a byte-range in the local version of the file?

This is planned to be supported, though not yet implemented. This one actually is a critical feature that I've been thinking a lot about, because I want to extend the byte range requests to reads from IndexedDB as well in addition to XHRs. So no matter whether the data came from an XHR or an IndexedDB read, it should be possible to pull in partial bytes. This allows downloading large pak files, but not actually needing to pull them in to memory in full.
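For the XHR side, an offset and size map onto the standard HTTP Range header, whose byte positions are inclusive. A small sketch of that mapping follows; the function name is hypothetical and how the fetch attributes would eventually expose ranges is not settled in this thread.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: formats the value of an HTTP Range header for `size`
   bytes starting at `offset`. The last byte position is inclusive, hence
   the "- 1". Returns 0 on success, -1 for an empty range or a buffer that
   is too small. */
int format_range_header(unsigned long long offset, unsigned long long size,
                        char *out, size_t out_size)
{
    if (size == 0) return -1;
    int n = snprintf(out, out_size, "bytes=%llu-%llu",
                     offset, offset + size - 1);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

The formatted value could then be passed as a custom request header with the name `Range`.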

The main problems and technical challenges we are seeing in our games all revolve around cache-control and CDNs. ...

Thanks for all the thoughts here. These bring up good points, we should find a flexible way to implement different strategies.

On one hand we need 'content version ids', all the expiration-time based caching mechanisms are fairly useless to us, because a specific game-version (client and server versions) expects data files exactly matching a specific version.

What we usually do for content-versioning is this:

- set the cache-control max-age to some time far in the future, so that all expiration-time-based cache heuristics are (hopefully) disabled

The default IndexedDB persistence scheme is simple (see previous mail reply), i.e. a file is only downloaded and persisted on the first run and never downloaded again after that. This is practically a mechanism with no expiration time. The persistence of data in IndexedDB is not intended to be a cache, but a persistent storage, and the language in emscripten_fetch() documentation aims to explicitly avoid the use of the word 'cache'. The amount of infinite lifetime persistence guarantee varies by browser and disk quotas, but should be fairly good compared to automatically managed transient browser caches. Applications should be free to define their caching model on top of the storage.

- have some sort of version id for each data file (usually a content hash or a simple version number), and add this to the URL somehow, one way is to add the version id as an URL parameter (e.g. ?id=xxx), or make the version id part of the file name

By default files in IndexedDB are namespaced by their URLs, so query params become part of that. There is also a field "destinationPath" in the request attributes where one can say the destination filename in IndexedDB for the download, so one can download URLs to be placed into arbitrary "filenames" (or rather just keys) in IndexedDB, to e.g. strip a query param, or add a new one.
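As an illustration of the "strip a query param" case, the helper below derives a storage key from a versioned URL by cutting off everything from the first '?' or '#'. The function name is an assumption for the sketch; the result is the kind of string an application might assign to destinationPath.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies url into out without its query string or
   fragment. Returns 0 on success, -1 if out is too small. */
int storage_key_from_url(const char *url, char *out, size_t out_size)
{
    size_t len = strcspn(url, "?#");  /* length up to the first '?' or '#' */
    if (len >= out_size) return -1;
    memcpy(out, url, len);
    out[len] = '\0';
    return 0;
}
```

With this, "/assets/rock.dds?id=abc123" and "/assets/rock.dds?id=def456" would both map to the key "/assets/rock.dds", so a newer version replaces the older one instead of accumulating next to it.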

We are using a MD5-hash of the file content as version-id (hence content-id), another option would be an incremental version number string, or the ETag provided by the web server.

- the content-id could be used as ETag to let the web server know about the file version the client already has (however, I haven't worked with ETags yet, but plan to look into this, so my understanding may be wrong)

- the content-id should be associated with the file content that's already in the IndexedDB or even in memory-filesystem (the memory-filesystem would then also act as some sort of 1st-level-cache), it would be cool to query the IndexedDB backing store by filename *and* content-id (basically do you already have *this* file with *this* specific version)

In general we use our MD5-hash content-id for all sorts of things in our own HTTP filesystem that we have layered on top of HTTP requests:

- replace all expiration-time based cache-control things with explicit versioning via content-id (we add the content-id as an URL parameter, so that the same file with 2 different content-ids appears as different 'things' to any caching layer, e.g. CDNs)

- use the content-id in our own local cache implementation to decide whether the client already has the right version in the local cache, in this case we don't need to do an HTTP request at all

- when a file is downloaded or loaded from cache, compute the actual MD5 hash and compare it to the expected hash to check whether the download or cache content is corrupted (or more likely has been tampered with)

Here's a couple (not really thought-through) ideas how such a content-id could be integrated with the fetch API:

- add a general 'ETag' behaviour flag in the fetch_attr_t flags field, in this case, the ETag coming from the HTTP response would be used as content-id from here on, otherwise, a custom content-id string can be provided through a new 'const char*' in the fetch_attr_t struct

Supporting ETags is a good idea, I'll add that to the list. ETags are generally always generated by the CDN server and not the requesting client, so the ability to provide a custom ETag by the client might not be needed? If desirable, a custom ETag can be passed by setting a custom HTTP request header (If-None-Match) though, so fetch_attr_t itself would not need such a field. I can make emscripten_fetch() look at ETags when EMSCRIPTEN_FETCH_UPDATE_IF_MODIFIED is used. If ETags are being used, it is sensible to ignore If-Modified-Since semantics altogether, so when emscripten_fetch() XHR sees an ETag being served by the CDN, it can just ignore If-Modified-Since.

- the content-id value should always be available / associated-with the actual data, for instance when the data is stored in the IndexedDB, the content-id should be stored with it, or the content-id should be available to user-callbacks (e.g. in the emscripten_fetch_t struct)

Being able to store some amount of metadata along with a fetch request might be a good idea. I'll think about that a little. One strict design restriction is that we want to avoid having to ping-pong multiple IndexedDB<->application code<->IndexedDB trips per file, so the scheme will need to be implementable as a one-stop load/store op.

Provide a small set of 'fetch_cache' functions to query the IndexedDB (and in-memory-filesystem?), for instance (omitting the emscripten_ here):

fetch_cache_exists("name", "content-id"): check if an entry in the cache with matching content-id exists, if it exists, I don't need to do an HTTP request to retrieve it, content-id could be optional, in which case just the item's existence is checked

fetch_cache_get_contentid("name"): if the item exists in the local cache, get its content id

fetch_cache_invalidate("name", "content-id"): invalidate/delete an item in the caching layer, again, if content-id is not provided, only the name is checked

I think this might be (relatively) slow if the application code would query each file twice, once for the metadata and a second time for the actual data, so my gut instinct for this kind of thing would be to maintain a metadata index file at the application level, i.e. a /file_metadata.txt which would store for each file the metadata associated with it. The application could emscripten_fetch() that file into memory at startup, and persist changes to it after bundles of files are downloaded. That would avoid doubling up the number of fetches to two per file.
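A sketch of what reading such an application-level index could look like. The file format is an assumption made for the example (one "<path> <content-id>" pair per line, no spaces inside either field), as is the function name; the post does not specify a format.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PATH_LEN 128
#define MAX_ID_LEN 64

/* Looks up `path` in index text where each line is "<path> <content-id>".
   Returns 1 and writes the id to id_out if found, otherwise 0. */
int metadata_lookup(const char *index_text, const char *path,
                    char *id_out, size_t id_size)
{
    const char *line = index_text;
    while (*line) {
        char p[MAX_PATH_LEN], id[MAX_ID_LEN];
        /* Field widths are one less than the buffer sizes. */
        if (sscanf(line, "%127s %63s", p, id) == 2 && strcmp(p, path) == 0) {
            if (strlen(id) >= id_size) return 0;
            strcpy(id_out, id);
            return 1;
        }
        const char *nl = strchr(line, '\n');
        if (!nl) break;
        line = nl + 1;  /* advance to the next line */
    }
    return 0;
}
```

The index text itself would be loaded once at startup, so each per-file version check becomes an in-memory lookup instead of an extra storage round trip.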

Another way might be to have the storage always carry a bit of metadata per each file. That might be cleanly implementable, I'll give it a test.

Robert Goulet

Sep 12, 2016, 10:35:13 AM

to emscripten-discuss

On Sunday, September 11, 2016 at 10:20:08 AM UTC-4, jj wrote:

- Are you planning to add user pointer to emscripten_fetch_attr_t so that we can pass it to the callbacks?

Yeah, that is definitely a must-have feature, and it is already included, see https://github.com/juj/emscripten/blob/emscripten_fetch/system/include/emscripten/fetch.h#L53 . I haven't managed to document that yet, but will note to do that when completing the docs.

Oh yeah sorry, somehow I didn't see it in my first read.

- What about caching? i.e. emscripten_fetch decides whether or not it should do an XHR request or use IndexedDB data based on timestamp of the file from the remote web server?

This is a good idea. The current protocol for IndexedDB-backed emscripten_fetch() is as follows:

1. Look at IndexedDB if URL/pathname exists. If it does, return that without doing any XHRs at all.
2. If the entry in IndexedDB does not exist, do the XHR and store it in IndexedDB.

This is sufficient for scenarios where application manages its own IndexedDB data lifetime scheme. Even though the above protocol looks "cache-like", the intent or semantics of the IndexedDB storage are not to be a (transient) cache, but a permanent persistent storage. If an application would like to treat the storage as a transient cache, they should do so by managing old file eviction manually. I think I'll need to add some kind of API for enumerating files in the storage for that.

For timestamp management, there's two ways:

a) emscripten_fetch() supports passing any custom HTTP request headers, so it is possible to pass an If-Modified-Since: header to the request from application code to manually do downloads only if newer than a given timestamp. The EMSCRIPTEN_FETCH_REPLACE flag can be paired to force replacing an old entry in IndexedDB if it exists. This allows applications to perform their own caching schemes if they want to do something complex.

b) The above will be enough if the application knows the modified timestamps. The timestamps will be stored with the data in IndexedDB (in posix inode format). For convenience, I think it would be good to add a flag EMSCRIPTEN_FETCH_UPDATE_IF_MODIFIED, which would change the protocol to

1. Look at IndexedDB if URL/pathname exists. If it exists, read its modified timestamp.
2. Perform an XHR to download the data, with the If-Modified-Since: time set to that timestamp if the data file did exist.
3. If the XHR comes back with new modified data, update the entry in IndexedDB, otherwise return the original data file from IndexedDB.

This would help applications not need to fire up an IndexedDB timestamp read first, but they could do updates in one fetch() request.

What about avoiding downloading the file when it's not newer, as such:

1. Look at IndexedDB if URL/pathname exists. If it exists, read its modified timestamp.

2. Perform an XHR to get URL/pathname modified timestamp, without downloading the data yet.

3. If the XHR comes back with newer modified timestamp, perform an XHR to download and update the IndexedDB entry with the newer data and modified timestamp.

4. Return possibly updated IndexedDB data.

Slightly related,

c) Our existing IDBFS filesystem has a version number string covering the whole mount point, where a built application can set a version for the cache, and application rebuilds can bump up the number of this cache to invalidate everything existing in the storage. I'm thinking that something similar overarching could be also useful here so that application rebuilds have a scheme where they can nuke all old files if needed.

For production purposes, yes I guess versioning can be useful. But for usual game development, where the content can change every minute, this won't be necessary. As long as we can opt-in/out of all these caching systems that's fine.

There's another point I thought that could be very useful for managing memory more tightly: what if we could specify, somehow, a maximum amount of memory that can be used by MEMFS at any time? When MEMFS would become full, if a new file wanted to be copied in, the oldest one could be removed from memory, assuming it is already persisted in IndexedDB? That would greatly help running on lesser hardware, and make memory management much easier, since most of the time, once we have read a file content to instantiate game objects in memory, we don't need the file to reside in MEMFS anymore. That's something I believe emscripten_fetch could potentially handle?
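The memory-budget idea can be sketched as a small bookkeeping structure: files are added until the budget is reached, and then the oldest file that is already persisted is dropped to make room. Everything here (names, fixed table size, eviction by insertion order) is an assumption for illustration and says nothing about how MEMFS or emscripten_fetch would actually do it.

```c
#include <stddef.h>

#define MAX_FILES 32

typedef struct {
    int id;
    size_t size;
    int persisted;      /* already in persistent storage, safe to evict */
    unsigned long age;  /* insertion counter, smaller means older */
    int resident;       /* currently held in memory */
} mem_file;

typedef struct {
    mem_file files[MAX_FILES];
    size_t used;        /* bytes currently resident */
    size_t budget;      /* maximum bytes allowed in memory */
    unsigned long clock;
} mem_cache;

/* Drops the oldest resident file that is persisted; returns 1 on success. */
static int evict_oldest_persisted(mem_cache *c)
{
    mem_file *victim = NULL;
    for (size_t i = 0; i < MAX_FILES; ++i) {
        mem_file *f = &c->files[i];
        if (f->resident && f->persisted && (!victim || f->age < victim->age))
            victim = f;
    }
    if (!victim) return 0;
    c->used -= victim->size;
    victim->resident = 0;
    return 1;
}

/* Adds a file, evicting older persisted files while over budget.
   Returns 0 on success, -1 if the file cannot be made to fit. */
int mem_cache_add(mem_cache *c, int id, size_t size, int persisted)
{
    if (size > c->budget) return -1;
    while (c->used + size > c->budget)
        if (!evict_oldest_persisted(c)) return -1;
    for (size_t i = 0; i < MAX_FILES; ++i) {
        mem_file *f = &c->files[i];
        if (!f->resident) {
            f->id = id;
            f->size = size;
            f->persisted = persisted;
            f->age = c->clock++;
            f->resident = 1;
            c->used += size;
            return 0;
        }
    }
    return -1;  /* no free slot */
}

int mem_cache_contains(const mem_cache *c, int id)
{
    for (size_t i = 0; i < MAX_FILES; ++i)
        if (c->files[i].resident && c->files[i].id == id) return 1;
    return 0;
}
```

Files that are not yet persisted are never evicted in this model, which matches the "assuming it is already persisted in IndexedDB" condition above.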
