Hi Brett,
Thanks for reading our book, and for your question. I hope this response addresses it effectively.
First, the numerical examples given in figures 10.3 - 10.5 are small and simple enough to fit on a page, whereas the graphical example in figure 10.6 is distinct and better represents the kinds of dimensions one might actually work with in a computer vision problem. Figure 10.6 is not the direct logical continuation of the examples in 10.3 - 10.5.
We like to think of the output from a convolutional layer with any number of filters (this output is called an activation map) as something of a latent image itself. So, moving forward with the example in figure 10.6 (a 32x32x3 image convolved with 16 filters), the activation map produced is 32x32x16. (Here we're ignoring kernel size, stride length, and padding, and so assuming the output has the same spatial dimensions as the input; this is explained in detail on pages 168-169.)
When this output is fed into the next layer (in your example, one with 10 filters), each filter there is indeed a 3D array: just as in the first layer, each filter's depth matches the depth of its input (here an activation map rather than an image), which is 16. The resulting output from this layer would be another activation map with the dimensions 32x32x10.
Extending your example, a third layer with 8 filters would yield a 32x32x8 activation map, with each of those 8 filters having a depth of 10 to match the input.
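If it helps to see the shapes worked out in code, here's a minimal sketch of the whole chain using a naive "same"-padded convolution in NumPy. Note this assumes a 3x3 kernel with stride 1 and zero padding (the book's figures abstract the kernel size away, so that choice is just for illustration), and the filter weights are random since only the shapes matter here:

```python
import numpy as np

def conv2d_same(x, filters):
    """Naive 'same'-padded 3x3 convolution, stride 1.

    x:       input of shape (H, W, C_in)
    filters: weights of shape (3, 3, C_in, C_out) -- each filter's
             depth (C_in) matches the depth of the input
    Returns an activation map of shape (H, W, C_out).
    """
    H, W, C_in = x.shape
    kh, kw, _, C_out = filters.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero padding keeps H and W
    out = np.zeros((H, W, C_out))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + kh, j:j + kw, :]             # (3, 3, C_in)
            out[i, j] = np.tensordot(patch, filters, axes=3)  # one value per filter
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))                     # the 32x32x3 image
act1 = conv2d_same(img, rng.standard_normal((3, 3, 3, 16)))    # 16 filters, depth 3
act2 = conv2d_same(act1, rng.standard_normal((3, 3, 16, 10)))  # 10 filters, depth 16
act3 = conv2d_same(act2, rng.standard_normal((3, 3, 10, 8)))   # 8 filters, depth 10
print(act1.shape, act2.shape, act3.shape)
```

Running this prints (32, 32, 16) (32, 32, 10) (32, 32, 8): the depth of each activation map equals the number of filters in the layer that produced it, and the depth of each layer's filters equals the depth of whatever that layer receives.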
Hopefully that's clear now — let us know if not!
Grant and Jon