A simple solution: with some small probability, add any new CSS/JS
file to a central collection (another S3 bucket?). Check each new file
against the contents of that collection; if it's there, store a
reference to it instead of the contents. After processing lots of
sites, the commonly used files will end up there with high
probability. Files referenced from off-site should probably get a
higher inclusion probability, since they are more likely to be shared
across sites.
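
A minimal Python sketch of the idea above, under some assumptions that
are not in the proposal itself: files are keyed by a SHA-256 hash of
their contents, the central collection is an in-memory dict standing in
for the S3 bucket, and the names SharedCollection, p_local and p_offsite
are made up for illustration.

```python
import hashlib
import random


class SharedCollection:
    """In-memory stand-in for the central collection (e.g. another S3 bucket)."""

    def __init__(self, p_local=0.01, p_offsite=0.1, rng=None):
        # Inclusion probabilities are illustrative values, not tuned ones.
        self.p_local = p_local
        self.p_offsite = p_offsite
        self.rng = rng or random.Random()
        self.blobs = {}  # content hash -> file contents

    def store(self, contents: bytes, offsite: bool = False):
        """Return ("ref", key) if the file is already shared, else ("inline", contents)."""
        key = hashlib.sha256(contents).hexdigest()
        if key in self.blobs:
            # Already in the collection: store a reference, not the contents.
            return ("ref", key)
        # Not shared yet: with small probability, add it for future sites.
        p = self.p_offsite if offsite else self.p_local
        if self.rng.random() < p:
            self.blobs[key] = contents
        # This occurrence is still stored inline either way.
        return ("inline", contents)
```

A file that appears on many sites gets many independent chances to be
added, which is why the common ones should end up in the collection
even when the per-file probability is small.
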
One important property is that a reader should be able to get all of
the files a page depends on without buffering too much or making
multiple passes. A basic rule that ensures this: within a domain, a
file's dependencies always come before the file itself. To bound the
buffer size, keep each domain's pages contiguous and reset
the buffer for each new domain. Off-site references could just be
pulled into the stream right before the file that first references
them on a site; the above suggestion should help ensure that this
doesn't cause massive duplication.
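
The ordering rule could be sketched roughly as follows, in Python. The
function names (order_domain, read_stream) and the record shape
(domain, url, deps) are assumptions made for illustration, and the
reader's buffer holds only URLs where a real one would hold file
contents.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shape: (site domain, file URL, URLs it depends on).
Record = Tuple[str, str, List[str]]


def order_domain(files: Dict[str, List[str]]) -> List[str]:
    """Order one domain's files so every dependency precedes its dependents.

    `files` maps URL -> URLs it depends on. A dependency that is not a key
    (e.g. an off-site file) is emitted right before the first file that
    references it.
    """
    out: List[str] = []
    emitted = set()
    in_progress = set()

    def visit(url: str) -> None:
        if url in emitted or url in in_progress:
            return  # already written, or a dependency cycle: do not recurse
        in_progress.add(url)
        for dep in files.get(url, []):
            visit(dep)
        in_progress.discard(url)
        emitted.add(url)
        out.append(url)

    for url in files:
        visit(url)
    return out


def read_stream(records: Iterable[Record]) -> Iterator[Tuple[str, List[str]]]:
    """Single pass over the stream; the buffer is reset at each new domain."""
    current = None
    buffer = set()
    for domain, url, deps in records:
        if domain != current:
            current, buffer = domain, set()
        missing = [d for d in deps if d not in buffer]
        if missing:
            raise ValueError(f"{url} appears before its dependencies: {missing}")
        buffer.add(url)
        yield url, list(deps)
```

Files in a dependency cycle cannot all come after each other, so this
sketch simply breaks the cycle at the point where it is detected; a
reader would then report the back-reference as missing.
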
Hopefully this is all just a few lines of code.
-Ken