Is it safe to say that "tokens in SDCH/vcdiff dictionaries are separated by \n (0x0a)"?


Can Selcik

Dec 22, 2015, 5:13:02 PM
to SDCH
Hi,

I have a few quick questions about SDCH/vcdiff dictionaries. Aside from the SDCH header referencing the host, path, expiry etc., is it safe to say that tokens in SDCH/vcdiff dictionaries are separated by newlines (0x0a)? If that's the case, what if the token we want to encode contains a newline?

Thanks,
Can Selcik

openvcdiff

Dec 22, 2015, 8:09:48 PM
to SDCH
Hi Can:

On Tuesday, December 22, 2015 at 2:13:02 PM UTC-8, Can Selcik wrote:
> I have a few quick questions about SDCH/vcdiff dictionaries. Aside from the SDCH header referencing the host, path, expiry etc., is it safe to say that tokens in SDCH/vcdiff dictionaries are separated by newlines (0x0a)? If that's the case, what if the token we want to encode contains a newline?
 
What do you mean by "tokens"?  The SDCH specification only mentions tokens in the context of headers.

Within dictionary contents, newlines are treated the same as any other byte content.  They are not used as delimiters.

Best regards,
Lincoln

Can Selcik

Dec 22, 2015, 11:03:38 PM
to SDCH
Hi Lincoln,

Thanks for the response. I guess I shouldn't have referred to them as tokens but rather as the entries in the dictionary. I can't seem to find the specification describing the structure of a VCDIFF dictionary. The SDCH header for the dictionary is clear to me, but I'm essentially wondering how I can go from:

std::list<const char*> common_long_strings = pickStringsFromDocuments(documents);

to

writeDictToFile(filename, common_long_strings);

Thanks,
Can

Randy Smith

Dec 27, 2015, 4:38:28 PM
to SD...@googlegroups.com
I don't believe there are any separate entries in the dictionary; it's
treated as a single long string that the decoding process can copy
arbitrary contiguous substrings from. (Caveat: I'm more familiar with
the SDCH implementation on top of vcdiff than I am with vcdiff itself,
and this is basically a vcdiff question.)

-- Randy

Can Selcik

Dec 28, 2015, 4:11:32 PM
to SDCH
Thanks Randy, that makes sense and actually clarifies a lot. And yes, I guess the question relates more to VCDIFF than to SDCH. :)

Christopher O'Connell

Dec 29, 2015, 12:51:42 PM
to SD...@googlegroups.com
Other than the headers used to identify the dictionary hash as part of SDCH, there is no explicit structure to the SDCH dictionary.

If you commonly have a string like "hello\nworld" in your document bodies, then that exact string would be in the dictionary. If you also have the exact string "hello\n\nworld" with enough frequency in your document bodies, then that string might also be in the dictionary -- literally "hello\nworldhello\n\nworld" as your dictionary.

In general a dictionary is just a big stream of bytes. The compressor simply picks an offset and a length (and some more stuff) from the big stream of bytes.

You can also add any arbitrary data to the dictionary. For instance, we add a time stamp for when it was built at the very end -- it's probably not very useful in terms of helping compress pages, but for a couple of bytes it makes it really easy to see which version we're working with.
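To tie this back to the original question, here is a minimal sketch of going from a list of chosen strings to a dictionary file, assuming the layout described above: entries concatenated back to back with no separators, plus an optional trailing build stamp. The names `buildDictionary` and `build_stamp` are hypothetical, and `std::string` is used instead of `const char*` so that entries can contain any byte, including `\n` and NUL. This produces only the raw dictionary payload, not the SDCH header.

```cpp
#include <fstream>
#include <list>
#include <string>

// Concatenate the chosen strings back to back, with no separator bytes.
// build_stamp is an optional trailing marker (for example a build time);
// pass an empty string to omit it.
std::string buildDictionary(const std::list<std::string>& entries,
                            const std::string& build_stamp) {
    std::string dict;
    for (const std::string& entry : entries) {
        dict += entry;  // embedded '\n' bytes are ordinary data
    }
    dict += build_stamp;
    return dict;
}

// Write the raw bytes in binary mode so newlines are not translated.
bool writeDictToFile(const std::string& filename, const std::string& dict) {
    std::ofstream out(filename, std::ios::binary);
    out.write(dict.data(), static_cast<std::streamsize>(dict.size()));
    return static_cast<bool>(out);
}
```

With the two example strings above and an empty stamp, `buildDictionary` returns exactly the bytes `hello\nworldhello\n\nworld`.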

All the best,

~ Christopher


Jim Roskind

Dec 29, 2015, 12:51:45 PM
to SDCH

My recollection, from distant memories, is that the data inside of the dictionary is not separated into tokens. This means that although the dictionary is created from common strings, consecutive strings may actually form a portion of text that is reproduced by the decoding. The encoding specifies an arbitrary offset in the dictionary, as well as a length. There are no separators found in the dictionary.
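As a rough illustration of that point, the sketch below models a copy as nothing more than an (offset, length) pair into the dictionary bytes. It is a simplified model and not the actual RFC 3284 instruction encoding; `copyFromDictionary` is a hypothetical name.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Return `length` bytes of the dictionary starting at byte `offset`.
// The dictionary is treated as one flat byte string; nothing here knows
// where the original "entries" began or ended.
std::string copyFromDictionary(const std::string& dict,
                               std::size_t offset,
                               std::size_t length) {
    if (offset > dict.size() || length > dict.size() - offset) {
        throw std::out_of_range("copy exceeds dictionary bounds");
    }
    return dict.substr(offset, length);
}
```

For a dictionary built from "hello\nworld" followed by "hello\n\nworld", a copy at offset 6 with length 10 yields "worldhello", which straddles the boundary between the two original strings.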

YMMV,

Jim


ernest.w....@gmail.com

Dec 29, 2015, 1:17:04 PM
to SDCH
BTW, is there a tool to convert a vcdiff dictionary to an SDCH one? It is not that difficult, but I would still prefer a tool to do it.

Can Selcik

Jan 4, 2016, 2:36:04 PM
to SDCH
Thank you for the responses guys. I went over RFC 3284 and now the structure of the dictionary is very clear to me.

As for a tool to convert a VCDIFF dictionary into an SDCH one: we do have an internal solution, a fairly simple Python script. It has a pretty standard argparse interface for specifying the path to the input dictionary, the host it will be used on, its expiration, etc., and it outputs the dictionary with the SDCH header prepended to it.

Another option is to have the reverse proxy that encodes your origin's responses wrap the VCDIFF dictionary with the SDCH metadata.
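For anyone who wants to write that wrapping step themselves, a minimal sketch follows. It is not our script. It assumes the SDCH dictionary header is a block of `Name: value` lines terminated by a blank line, with fields such as `Domain`, `Path`, and `Max-Age`; please check the field names and terminator against the SDCH specification before relying on them. `wrapWithSdchHeader` is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper: prepend an SDCH-style text header to the raw
// dictionary bytes. The field names and the blank-line terminator are
// assumptions about the header format, not taken from this thread.
std::string wrapWithSdchHeader(const std::string& domain,
                               const std::string& path,
                               long max_age_seconds,
                               const std::string& payload) {
    std::string out;
    out += "Domain: " + domain + "\n";
    out += "Path: " + path + "\n";
    out += "Max-Age: " + std::to_string(max_age_seconds) + "\n";
    out += "\n";     // blank line ends the header
    out += payload;  // raw dictionary bytes, copied unchanged
    return out;
}
```

The payload is appended untouched, so any newlines inside the dictionary body stay ordinary data; only the header is line-oriented.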