SDCH for templated server responses

ernest.w....@gmail.com

Dec 22, 2015, 5:13:02 PM

to SDCH

Hi,

My server uses templates to build its responses, something like:

...SomeConstantPart001$(TokenToReplace001)SomeVeryLongConstantPart001$(TokenToReplace002)SomeConstantPart002

The $(...) tokens are replaced at runtime; the rest is constant. Here are my questions:

Can I cut out these "SomeConstantPartXXX" pieces and dump them all into a file that becomes my dictionary, instead of running something like femtozip to build it?
I don't think it is wise to encode the whole body at runtime. Since I know where the data resides in the dictionary, it looks like I can just replace the constant parts with delta instructions, right?
I've encountered somewhat odd behavior: relatively short strings, say 10 bytes, are not replaced. This happens even when I encode the dictionary itself using the same dictionary, which in theory should leave me with a file containing no literal strings, only delta instructions. Is that expected?

Thanks!

openvcdiff

Dec 22, 2015, 8:20:43 PM

to SDCH

Hi Ernest:

Ernest wrote:

>> My server uses templates to build its responses, something like:
>> ...SomeConstantPart001$(TokenToReplace001)SomeVeryLongConstantPart001$(TokenToReplace002)SomeConstantPart002
>> The $(...) tokens are replaced at runtime; the rest is constant. Here are my questions:
>> Can I cut out these "SomeConstantPartXXX" pieces and dump them all into a file that becomes my dictionary, instead of running something like femtozip to build it?

Yes. If you are able to extract all your static content, it is an easy and effective way to build a dictionary. However, you won't be able to encode common strings from TokenToReplaceXXX.
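As a rough illustration of the "extract the static content" idea, here is a minimal Python sketch. It assumes the `$(Name)` placeholder syntax from the question; the function names and the offset table are hypothetical, and a real SDCH dictionary also needs its header section, which is omitted here.

```python
import re

# Hypothetical token syntax, taken from the "$(Name)" placeholders in the question.
TOKEN_RE = re.compile(r"\$\([^)]*\)")


def constant_parts(template: str) -> list:
    """Return the non-empty static chunks found between $(...) tokens."""
    return [part for part in TOKEN_RE.split(template) if part]


def build_dictionary(templates: list) -> tuple:
    """Concatenate unique constant parts and remember each part's byte offset."""
    payload = bytearray()
    offsets = {}
    for template in templates:
        for part in constant_parts(template):
            if part not in offsets:
                offsets[part] = len(payload)
                payload += part.encode("utf-8")
    return bytes(payload), offsets


template = (
    "SomeConstantPart001$(TokenToReplace001)"
    "SomeVeryLongConstantPart001$(TokenToReplace002)SomeConstantPart002"
)
dictionary, offsets = build_dictionary([template])
```

Keeping the offset of every constant part is what later makes it possible to refer to dictionary data by position instead of searching for it.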

>> I don't think it is wise to encode the whole body at runtime. Since I know where the data resides in the dictionary, it looks like I can just replace the constant parts with delta instructions, right?

That is reasonable as long as you also use proper VCDIFF instructions for literal content (TokenToReplaceXXX) as well. I would still suggest using the encoder to ensure that it provides the proper checksums, byte counts, etc.
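To make the COPY-for-static, ADD-for-literal idea concrete, here is a small Python sketch. It works on an abstract instruction list only: it is not the VCDIFF wire format (no header, window, address cache, or checksum), and all names in it are hypothetical.

```python
import re

# Hypothetical "$(Name)" placeholder syntax, as used in the question.
TOKEN_RE = re.compile(r"\$\(([^)]*)\)")


def plan_instructions(template, values, dictionary):
    """Turn a template into abstract ("COPY", offset, size) / ("ADD", data) steps."""
    steps = []
    pos = 0
    for match in TOKEN_RE.finditer(template):
        static = template[pos:match.start()].encode("utf-8")
        if static:
            # bytes.index raises ValueError if the chunk is missing from the dictionary.
            steps.append(("COPY", dictionary.index(static), len(static)))
        literal = values[match.group(1)].encode("utf-8")
        if literal:
            steps.append(("ADD", literal))
        pos = match.end()
    tail = template[pos:].encode("utf-8")
    if tail:
        steps.append(("COPY", dictionary.index(tail), len(tail)))
    return steps


def apply_instructions(steps, dictionary):
    """Replay the steps against the dictionary to rebuild the response body."""
    out = bytearray()
    for step in steps:
        if step[0] == "COPY":
            _, offset, size = step
            out += dictionary[offset:offset + size]
        else:
            out += step[1]
    return bytes(out)


dictionary = b"var a=;var b=;"
steps = plan_instructions("var a=$(A);var b=$(B);", {"A": "1", "B": "2"}, dictionary)
body = apply_instructions(steps, dictionary)
```

In a setup like this the COPY steps could be computed once per template, so only the ADD payloads change per request. Producing a valid delta file from such a plan is still best left to the encoder, as suggested above.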

>> I've encountered somewhat odd behavior: relatively short strings, say 10 bytes, are not replaced. This happens even when I encode the dictionary itself using the same dictionary, which in theory should leave me with a file containing no literal strings, only delta instructions. Is that expected?

Please read the description of kBlockSize in the open-vcdiff source code:

https://github.com/google/open-vcdiff/blob/master/src/blockhash.h

Best regards,

Lincoln

ernest.w....@gmail.com

Dec 29, 2015, 7:15:57 AM

to SDCH

Hi Lincoln,

>> Yes. If you are able to extract all your static content, it is an easy and effective way to build a dictionary. However, you won't be able to encode common strings from TokenToReplaceXXX.

That's fine; I'm not interested in encoding those.

>> That is reasonable as long as you also use proper VCDIFF instructions for literal content (TokenToReplaceXXX) as well. I would still suggest using the encoder to ensure that it provides the proper checksums, byte counts, etc.

Totally agree! But I would like to skip the string-matching part. Is there a way to artificially "feed" matches to the encoder so that it does the rest of the work: creating the delta instructions, the counts, and the proper file structure as a whole?

>> Please read the description of kBlockSize in the open-vcdiff source code:

OK, 16. I have a feeling that I saw much longer strings not being encoded; I will re-check.

Christopher O'Connell

Dec 29, 2015, 12:51:42 PM

to SD...@googlegroups.com

Hi Ernest,

We considered such a scheme for generating our encoded pages for some time. Our template engine combines lots of pieces and is constructed in such a way that many pieces may be generated in a fairly static manner. We spent quite some time instrumenting it to try to create chunks in the dictionary from these pieces and then combine them as complete VCDIFF sub-streams into the final output.

It worked poorly for us. In the end, we found that we were better off generating dictionaries from the full pages, as the sub-chunks we were generating from template blocks ended up with a dictionary which was too large.

What we did find very helpful was to "cheat" on the construction of our dictionaries for CSS. Particularly with AMP/HTTP2/SPDY/etc, we've found that in-lining CSS provides a major benefit for many connections -- especially when combined with SDCH. Since our total CSS corpus is large, and parts of it may be combined with virtually any page, we were finding dictionary generation to be an exceedingly lengthy process.

Instead, we strip out the CSS during generation and process it differently, as part of a simple decoding process where we tokenize it into rules, keys and values. We then take a histogram of the most common elements and use it to generate the CSS entries in our dictionary, up to the size budget we've set for CSS entries in the dictionary.

Anecdotally, without any hard data readily to hand to back it up (not in the office right now), we found that this approach worked "almost as well" as using longest common sub-string on the CSS, and had the benefit that it can be applied in a matter of seconds on each CSS build.
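The histogram approach described above might look roughly like the Python sketch below. The tokenizer, the byte-budget policy, and all names are assumptions for illustration, not the actual implementation; it handles only flat rules (no nested blocks such as @media) and ranks tokens by raw frequency.

```python
import re
from collections import Counter

# Matches flat "selector { declarations }" rules only; nested blocks are not handled.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def css_tokens(css):
    """Yield selectors, property names, and values from flat CSS rules."""
    for selector, body in RULE_RE.findall(css):
        yield selector.strip()
        for declaration in body.split(";"):
            if ":" in declaration:
                key, value = declaration.split(":", 1)
                yield key.strip()
                yield value.strip()


def css_dictionary_entries(stylesheets, budget_bytes):
    """Pick the most frequent tokens until the byte budget is used up."""
    counts = Counter(tok for css in stylesheets for tok in css_tokens(css))
    chosen, used = [], 0
    for token, _count in counts.most_common():
        size = len(token.encode("utf-8"))
        if used + size > budget_bytes:
            continue  # too big for the remaining budget; a smaller token may still fit
        chosen.append(token)
        used += size
    return chosen


sheets = ["a{color:red;margin:0}", "b{color:red}"]
entries = css_dictionary_entries(sheets, budget_bytes=8)
```

Counting tokens is linear in the size of the CSS, which fits the observation that the step runs in seconds on each build.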

All the best,

~ Christopher

openvcdiff

Dec 29, 2015, 12:54:13 PM

to SDCH

Hi Ernest:

>> >> That is reasonable as long as you also use proper VCDIFF instructions for literal content (TokenToReplaceXXX) as well. I would still suggest using the encoder to ensure that it provides the proper checksums, byte counts, etc.
>> Totally agree! But I would like to skip the string-matching part. Is there a way to artificially "feed" matches to the encoder so that it does the rest of the work: creating the delta instructions, the counts, and the proper file structure as a whole?

You should be able to adapt VCDiffCodeTableWriter to suit your needs.

https://github.com/google/open-vcdiff/blob/master/src/encodetable.h

Best regards,

Lincoln

ernest.w....@gmail.com

Dec 29, 2015, 1:24:41 PM

to SDCH

Will check it. Thanks!

ernest.w....@gmail.com

Dec 29, 2015, 1:45:04 PM

to SDCH

Thanks for sharing, Christopher. Any chance you remember why this approach worked poorly? Was it because of the CSS? My case is much simpler: I always return JavaScript stitched together from pieces, with a small amount of dynamic data inserted into it (token, or placeholder, replacement). The response is relatively short, several kilobytes: about 7 KB on average, although responses of up to 20 KB may occur. That's why I don't see any reason to encode on the fly; I see no benefit in it, and it will definitely increase latency and demand more CPU.

ernest.w....@gmail.com

Jan 4, 2016, 8:16:24 AM

to SDCH

Hi Lincoln,

I checked kBlockSize. As I understand it, this value defines the minimum substring length that can be encoded. However, I have the following data in the dictionary:

var ebPtcl="http://",ebBigS="

and the server response contains the string

var ebPtcl="http://",ebBigS="blah-blah some long string"

However, the part before "blah-blah" is not being encoded. Is this expected behavior?

Sincerely,

E.

openvcdiff

Jan 4, 2016, 12:41:45 PM

to SDCH

Hi Ernest:

>> >> >> I've encountered somewhat odd behavior: relatively short strings, say 10 bytes, are not replaced. This happens even when I encode the dictionary itself using the same dictionary, which in theory should leave me with a file containing no literal strings, only delta instructions. Is that expected?
>> >> Please read the description of kBlockSize in the open-vcdiff source code:
>> >> https://github.com/google/open-vcdiff/blob/master/src/blockhash.h

>> I checked kBlockSize. As I understand it, this value defines the minimum substring length that can be encoded. However, I have the following data in the dictionary:
>>
>> var ebPtcl="http://",ebBigS="
>>
>> and the server response contains the string
>> var ebPtcl="http://",ebBigS="blah-blah some long string"
>>
>> However, the part before "blah-blah" is not being encoded. Is this expected behavior?

Please reread the description of kBlockSize from the link I mentioned above, in particular the comment on lines 49-53. In this case, kBlockSize is 16 and your match has size 29, so yes, this behavior is possible depending on the alignment of the match within the target text.
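The alignment effect can be worked through with a simplified model in Python. It assumes that a matching run can only be discovered if it fully contains at least one block starting on a multiple of kBlockSize; which text is indexed in aligned blocks, and every name below, is an assumption of this sketch rather than something taken from the open-vcdiff code.

```python
K_BLOCK_SIZE = 16  # value quoted in the thread


def covers_aligned_block(start, length, block_size=K_BLOCK_SIZE):
    """True if bytes [start, start + length) contain one whole aligned block."""
    # Round start up to the next block boundary (ceiling division).
    first_block = -(-start // block_size) * block_size
    return first_block + block_size <= start + length


def findable_alignments(length, block_size=K_BLOCK_SIZE):
    """Start offsets (mod block_size) at which a run of this length is findable."""
    return [s for s in range(block_size) if covers_aligned_block(s, length, block_size)]


alignments_29 = findable_alignments(29)
alignments_31 = findable_alignments(31)
alignments_15 = findable_alignments(15)
```

Under this model a 29-byte run is findable at 14 of the 16 possible alignments (all except offsets 1 and 2), a run of 31 bytes (2 * 16 - 1) is findable at every alignment, and a 15-byte run is never findable.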

Cheers,

Lincoln

ernest.w....@gmail.com

Jan 5, 2016, 12:26:40 AM

to SDCH

Thanks again, Lincoln. It still wasn't clear to me what "aligned block" means, but line 268 gives an excellent example that explains it all.

ernest.w....@gmail.com

Jan 10, 2016, 3:27:59 AM

to SDCH

Just in case anyone is interested in playing around with it: kBlockSize only affects the block size used to find matching substrings. The second important value is kMinimumMatchSize (default 32), which is defined in vcdiffengine.h. So changing kBlockSize to match shorter or longer substrings is not enough: those matches may be discarded later because of kMinimumMatchSize, which you should change accordingly.
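The interaction of the two constants can be summarized in a simplified Python model. The values 16 and 32 are the ones quoted in this thread; the function and its logic are a hypothetical reading of the behavior described here, not code from open-vcdiff, and block alignment is ignored.

```python
K_BLOCK_SIZE = 16          # from blockhash.h, as quoted in the thread
K_MINIMUM_MATCH_SIZE = 32  # from vcdiffengine.h, as quoted in the thread


def is_copy_emitted(match_len, block_size=K_BLOCK_SIZE, min_match=K_MINIMUM_MATCH_SIZE):
    """Simplified model: a match must be discoverable and long enough to be kept."""
    # Necessary (not sufficient) condition: shorter runs never contain a full block.
    can_be_found = match_len >= block_size
    # Matches below the minimum size are discarded after being found.
    is_kept = match_len >= min_match
    return can_be_found and is_kept


default_29 = is_copy_emitted(29)
smaller_block_only = is_copy_emitted(20, block_size=8)
both_lowered = is_copy_emitted(20, block_size=8, min_match=16)
```

In this model the 29-byte match from the earlier example is dropped under the default settings, and lowering only the block size does not help a 20-byte match until the minimum match size is lowered as well.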

And by the way, it was not worth trying. It looks like encoding all of your, say, 16-byte substrings increases the character entropy in such a way that gzip later gives you worse compression.

Can Selçik

Jan 10, 2016, 3:57:38 AM

to SD...@googlegroups.com

Yeah, I have attempted playing around with those values as well and only saw a decrease in compression.
