As part of work on a tangential project, I have a tool suitable for
building SDCH dictionaries given sample web pages. I would be curious
if anybody can report back results or compare it to other tools available
for dictionary generation. I used the tool to generate an SDCH
dictionary for Google search results pages, training on 50 of the top
100 queries from 2010 and then benchmarking on the other 50 queries.
The dictionary achieves net compression rates that are 20% better than
the current Google SDCH dictionary (rU20-FBA.dct). Keep in mind this
test is not at all comprehensive: it is trained and benchmarked on such
a small sample set, and it is being compared to the Google dictionary,
which may have been trained on earlier or slightly different pages. It
would be great if Google opened up its dictionary-building tool to
help spur adoption, but in the meantime I think this generates a
reasonable dictionary and can serve as a good starting point for
exploring SDCH.
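
For anyone who wants to run a similar train/benchmark comparison, here is a
minimal sketch of the scoring idea. It is not the tool described above and it
does not speak SDCH: it uses zlib's preset-dictionary support as a stand-in,
builds a deliberately naive dictionary by concatenating the training pages,
and uses made-up sample pages. The function names are hypothetical, and the
exact definition of "net compression rate" here (compressed bytes over
original bytes on the held-out pages) is an assumption.

```python
import zlib


def build_naive_dictionary(training_pages, max_size=32768):
    # Hypothetical stand-in for a real dictionary builder: concatenate the
    # training pages and keep the last max_size bytes (the zlib window size).
    return b"".join(training_pages)[-max_size:]


def compressed_size(page, dictionary=b""):
    # Deflate one page, optionally primed with a preset dictionary.
    if dictionary:
        comp = zlib.compressobj(level=9, zdict=dictionary)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(page) + comp.flush())


def compression_ratio(pages, dictionary=b""):
    # Total compressed bytes divided by total original bytes (lower is better).
    total_in = sum(len(p) for p in pages)
    total_out = sum(compressed_size(p, dictionary) for p in pages)
    return total_out / total_in


def relative_improvement(baseline_ratio, candidate_ratio):
    # Fraction by which the candidate ratio beats the baseline ratio.
    return (baseline_ratio - candidate_ratio) / baseline_ratio


# Synthetic pages for illustration only; real tests would use saved pages.
pages = [
    ("<html><head><title>%s - Search</title></head><body>"
     "<div class='result'>Results for %s</div></body></html>" % (q, q)).encode()
    for q in ["weather", "news", "maps", "games", "music", "movies"]
]
train, held_out = pages[::2], pages[1::2]  # train on half, benchmark on the rest

dictionary = build_naive_dictionary(train)
plain = compression_ratio(held_out)
with_dict = compression_ratio(held_out, dictionary)
print("ratio without dictionary: %.3f" % plain)
print("ratio with dictionary:    %.3f" % with_dict)
print("relative improvement:     %.1f%%" % (100 * relative_improvement(plain, with_dict)))
```

To compare two dictionaries rather than dictionary vs. no dictionary, compute
compression_ratio on the same held-out pages with each dictionary and pass
both ratios to relative_improvement.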
To learn more, check out
https://github.com/gtoubassi/femtozip/wiki/Sdch