Hi Zhenwei,
We structure our data so that we get the maximum benefit from the automatic deduplication the blob format provides. Depending on how your data is organized, even a large dataset may become much more manageable once the common parts are normalized.
You shouldn't have to hold all of the denormalized objects in memory at once. Instead, as you instantiate individual objects, you can add them to the state engine and then drop them on the floor; once they're in the state engine, they are deduped (there's a rough sketch of this pattern below).

Alternatively, in theory there's only one data origination server per environment, so it can be a pretty hefty machine with lots of RAM (it's not unheard-of to have 100s of GB of RAM on a single instance; check out the r3.8xlarge from AWS).

One more alternative: you could shard the data origination server to produce multiple blobs which get loaded on clients, or produce multiple blobs and then combine them to again have a single blob to load on clients (there's a sketch of the sharding approach below as well).
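Here's a rough sketch of the add-and-drop pattern, in case it helps. I'm assuming Hollow-style class names (HollowWriteStateEngine, HollowObjectMapper, HollowBlobWriter); if you're on a different library or version the equivalents may be named differently. The Movie POJO and loadSomeMovies() are just placeholders for your own types and data source, so treat this as illustrative rather than copy-paste ready for your exact setup:

import com.netflix.hollow.core.write.HollowBlobWriter;
import com.netflix.hollow.core.write.HollowWriteStateEngine;
import com.netflix.hollow.core.write.objectmapper.HollowObjectMapper;

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;

public class BlobProducerSketch {

    // Placeholder for whatever your denormalized records look like.
    public static class Movie {
        int id;
        String title;
        List<String> genres;

        Movie(int id, String title, List<String> genres) {
            this.id = id;
            this.title = title;
            this.genres = genres;
        }
    }

    public static void main(String[] args) throws IOException {
        HollowWriteStateEngine writeEngine = new HollowWriteStateEngine();
        HollowObjectMapper mapper = new HollowObjectMapper(writeEngine);

        // Stream the source data through in batches -- the full denormalized
        // dataset never needs to be resident at once.
        for (Movie movie : loadSomeMovies()) {
            // add() copies the record's data into the state engine, where
            // repeated values (e.g. the same genre string across movies)
            // are stored only once...
            mapper.add(movie);
            // ...and then the Movie instance can simply go out of scope.
        }

        // Serialize the deduped state as a snapshot blob for clients to load.
        try (OutputStream os = new BufferedOutputStream(new FileOutputStream("snapshot.blob"))) {
            new HollowBlobWriter(writeEngine).writeSnapshot(os);
        }
    }

    // Placeholder data source; in practice this would page through a DB or feed.
    static List<Movie> loadSomeMovies() {
        return Arrays.asList(
                new Movie(1, "The Matrix", Arrays.asList("Action", "Sci-Fi")),
                new Movie(2, "Inception", Arrays.asList("Action", "Sci-Fi")));
    }
}

The important part is that each source object only lives long enough to be copied into the state engine; the engine holds the deduped form.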
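And a sketch of the sharding alternative: route each record to one of N independent write state engines by a stable key and write one blob per shard. The shard count, file names, and the reuse of the Movie placeholder from the sketch above are all just for illustration; whether clients load individual shards or a downstream step combines them back into a single blob depends on your setup.

import com.netflix.hollow.core.write.HollowBlobWriter;
import com.netflix.hollow.core.write.HollowWriteStateEngine;
import com.netflix.hollow.core.write.objectmapper.HollowObjectMapper;

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ShardedBlobProducerSketch {

    public static void main(String[] args) throws IOException {
        int numShards = 4; // pick this based on how much data fits comfortably per shard

        // One independent write state engine (and mapper) per shard.
        HollowWriteStateEngine[] engines = new HollowWriteStateEngine[numShards];
        HollowObjectMapper[] mappers = new HollowObjectMapper[numShards];
        for (int i = 0; i < numShards; i++) {
            engines[i] = new HollowWriteStateEngine();
            mappers[i] = new HollowObjectMapper(engines[i]);
        }

        // Route each record to a shard by a stable key, so a given record
        // always lands in (and is deduped within) the same shard.
        for (BlobProducerSketch.Movie movie : BlobProducerSketch.loadSomeMovies()) {
            int shard = Math.floorMod(movie.id, numShards);
            mappers[shard].add(movie);
        }

        // Write one snapshot blob per shard. Clients can load the shard(s)
        // they need, or a downstream step can combine them into a single blob.
        for (int i = 0; i < numShards; i++) {
            try (OutputStream os = new BufferedOutputStream(
                    new FileOutputStream("snapshot-shard-" + i + ".blob"))) {
                new HollowBlobWriter(engines[i]).writeSnapshot(os);
            }
        }
    }
}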
Usually, the bigger concern is minimizing the memory footprint on the client instances. If the deduplicated memory footprint works for you on your clients, and the tradeoffs are right for your use case, then you can definitely find ways to produce the blobs.
Drew.