There are essentially three steps in the WebP encoder:
* Step I: analysis phase: every 16x16 macroblock is assessed for its complexity (basically: how many bits are removed every time the decimation factor is raised one notch? That's the susceptibility slope). Macroblocks are then grouped into classes of equivalent susceptibility (these are the 'segments' in VP8 terminology) and a quantization factor is assigned to each segment. Macroblocks with a high response to the decimation factor tend to blur quickly, so we try not to raise the quantization step too much for these. Conversely, macroblocks with a lot of signal and low susceptibility can sustain a higher QP. There's always going to be "something" left. (A rough sketch of this step follows the list.)
* Step II: so far, Step I was compression-agnostic: no bitrate was involved. In Step II, every macroblock is analysed again and each coding mode is tried. Should it be coded 4x4? 16x16? What is the distortion incurred? All modes are tried by default, and the PSNR is measured, so that the best trade-off between bit-cost and reconstruction quality can be picked. Still, during this phase, no actual encoding is done. We just record statistics about the coefficient distribution and the coding tokens that will be needed for the final coding. To evaluate the bit-cost, we use our best prior knowledge of the distribution, which may be a little off compared to what the final one will be. (See the second sketch below.)
* Step III: the VP8 spec allows the transmission of custom probability tables that best fit the observed distribution of coding tokens. This is pretty much equivalent to the "optimized Huffman tables" one can find in JPEG. So, during this final assembly phase, the statistics collected during Step II are finalized and written into the bitstream, followed by the actual coding of each macroblock. Some "RD-opt" decisions can be made there. RD-opt stands for rate-distortion optimization: since the final probabilities are now known, we can compute the exact number of bits for each possible coding mode, along with the reconstruction distortion, and pick the best possible one. (See the third sketch below.)
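To make Step I more concrete, here is a toy C sketch of the segment assignment. It is illustrative only: none of the names or numbers come from libwebp. Macroblocks are bucketed by susceptibility score, and the more fragile the segment, the finer its quantizer:

    /* Toy sketch of Step I's segment assignment (not libwebp code). */
    #include <stdio.h>

    #define NUM_SEGMENTS 4   /* VP8 allows up to 4 segments */

    /* Pretend susceptibility scores for a row of macroblocks (in the
     * real encoder, each score comes from analysing a 16x16 block). */
    static const int kSusceptibility[] = { 5, 80, 42, 12, 95, 63, 7, 30 };
    #define NUM_MBS ((int)(sizeof(kSusceptibility) / sizeof(kSusceptibility[0])))

    int main(void) {
      int min_s = kSusceptibility[0], max_s = kSusceptibility[0];
      int i;

      /* Range of observed scores. */
      for (i = 1; i < NUM_MBS; ++i) {
        if (kSusceptibility[i] < min_s) min_s = kSusceptibility[i];
        if (kSusceptibility[i] > max_s) max_s = kSusceptibility[i];
      }

      for (i = 0; i < NUM_MBS; ++i) {
        /* Bucket into equal-width susceptibility classes (the 'segments'). */
        const int seg = (kSusceptibility[i] - min_s) * NUM_SEGMENTS
                      / (max_s - min_s + 1);
        /* Fragile (high-susceptibility) segments get a finer quantizer.
         * The mapping is made up; only the direction matters. */
        const int qp = 40 - 8 * seg;
        printf("MB %d: susceptibility=%2d -> segment %d, QP=%d\n",
               i, kSusceptibility[i], seg, qp);
      }
      return 0;
    }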
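For Step II, the mode decision boils down to minimizing a score of the form distortion + lambda * bits, where the bit count is only an estimate at this point. A minimal sketch, with made-up values and lambda:

    /* Sketch of Step II's mode decision (simplified, hypothetical values). */
    #include <stdio.h>

    typedef struct {
      const char* name;   /* "I16x16", "I4x4", ... */
      double distortion;  /* e.g. sum of squared reconstruction error */
      double est_bits;    /* bit cost predicted from prior token stats */
    } ModeCandidate;

    int main(void) {
      const double lambda = 20.0;   /* rate/distortion trade-off knob */
      const ModeCandidate modes[] = {
        { "I16x16", 1500.0,  90.0 },
        { "I4x4",    700.0, 140.0 },
      };
      const int num_modes = (int)(sizeof(modes) / sizeof(modes[0]));
      int best = 0, i;
      for (i = 1; i < num_modes; ++i) {
        const double score = modes[i].distortion + lambda * modes[i].est_bits;
        const double best_score =
            modes[best].distortion + lambda * modes[best].est_bits;
        if (score < best_score) best = i;
      }
      printf("picked mode: %s\n", modes[best].name);
      return 0;
    }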
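And for Step III, the key point is that once a token's final probability p is written into the bitstream, an arithmetic (boolean) coder spends about -log2(p) bits on it, which is why bit costs become exact. A tiny illustration (the probability is made up; compile with -lm):

    /* Why Step III's bit costs are exact: with a fixed token
     * probability p, the boolean coder costs about -log2(p) bits. */
    #include <math.h>
    #include <stdio.h>

    static double BitCost(double p) {
      return -log2(p);   /* bits spent on an event of probability p */
    }

    int main(void) {
      /* Say the finalized table assigns a token probability of 200/256. */
      const double p = 200.0 / 256.0;
      printf("cost when the token occurs: %.3f bits\n", BitCost(p));
      printf("cost when it doesn't:       %.3f bits\n", BitCost(1.0 - p));
      return 0;
    }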
Afterward, there are still a few parameters to decide: filtering strength for each segment, etc.
Note that you can loop over Step II several times (using the '-pass' option in cwebp), so that the bit-cost estimate is refined until convergence.
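For instance, a typical multi-pass invocation looks like this (the quality setting is just an example value):

    cwebp -pass 6 -q 75 input.png -o output.webp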
Comparatively, JPEG encoding has very few decisions to make. The freedom you have is around the quantization tables, the Huffman code optimization, and the "coring" of the coefficients (i.e. how much you down-quantize them). The extra freedom in coding modes for VP8 allows better compression (but also means there are more modes to try out; hence, it's slower).
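As a rough illustration of coring (my reading of "down-quantizing", not actual JPEG encoder code): quantizing with a bias smaller than half the step pushes small coefficients toward zero more aggressively than plain rounding would:

    /* Hedged sketch of coefficient "coring" via a reduced quantizer bias. */
    #include <stdio.h>

    /* Plain rounding quantizer: bias = q/2. */
    static int Quantize(int coeff, int q) {
      return (coeff >= 0) ? (coeff + q / 2) / q : -((-coeff + q / 2) / q);
    }

    /* "Cored" quantizer: a smaller bias shrinks values near a step edge. */
    static int QuantizeCored(int coeff, int q, int bias) {
      return (coeff >= 0) ? (coeff + bias) / q : -((-coeff + bias) / q);
    }

    int main(void) {
      const int q = 16;     /* quantization step */
      const int bias = 4;   /* < q/2, so coefficients get "cored" down */
      int c;
      for (c = 10; c <= 42; c += 16) {
        printf("coeff=%2d  plain=%d  cored=%d\n",
               c, Quantize(c, q), QuantizeCored(c, q, bias));
      }
      return 0;
    }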
Hope it helps,
skal