I'm trying to understand how temporally and spatially scalable encodings are decoded. My understanding of H.264 SVC is that sub-streams are combined to build up the stream at the desired quality level.
For example (temporal scalability):
Stream 1 - Frames 0, 6, 12, 18, 24
Stream 2 - Frames 2, 4, 8, 10, 14, 16, 20, 22, 26, 28
Stream 3 - Frames 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29
So you can receive Stream 1 alone for 5 fps.
If you want 15 fps, you receive Streams 1 and 2.
If you want 30 fps, you receive Streams 1, 2 and 3.
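To make the layered model above concrete, here is a minimal sketch assuming frames 0..29 represent one second at 30 fps and the 5/15/30 fps split shown in the example. The function names (`temporal_layer`, `substreams`, `frames_for`) are hypothetical, chosen for illustration only. Each sub-stream is disjoint, and a receiver merges layers 0..N back into frame order:

```python
# Hypothetical 3-layer temporal pattern matching the example above:
# frames are numbered 0..29 (one second at 30 fps).

def temporal_layer(frame: int) -> int:
    """Return the layer (0 = base) that a frame belongs to in this example."""
    if frame % 6 == 0:
        return 0  # Stream 1: 0, 6, 12, ... (5 fps)
    if frame % 2 == 0:
        return 1  # Stream 2: 2, 4, 8, 10, ... (adds 10 fps)
    return 2      # Stream 3: odd frames (adds 15 fps)


def substreams(num_frames: int = 30) -> dict[int, list[int]]:
    """Split frame indices into disjoint per-layer sub-streams."""
    layers: dict[int, list[int]] = {0: [], 1: [], 2: []}
    for f in range(num_frames):
        layers[temporal_layer(f)].append(f)
    return layers


def frames_for(max_layer: int, num_frames: int = 30) -> list[int]:
    """Combine sub-streams 0..max_layer back into frame (decode) order."""
    layers = substreams(num_frames)
    merged = [f for layer in range(max_layer + 1) for f in layers[layer]]
    return sorted(merged)


if __name__ == "__main__":
    for max_layer in range(3):
        print(f"layers 0..{max_layer}: {len(frames_for(max_layer))} fps")
```

Running it prints 5, 15 and 30 fps for layers 0, 0..1 and 0..2 respectively, matching the three cases listed above.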
Is this how the VP8 temporal encoder works, or does each stream instead contain all the data for its quality level?
For example:
Stream 1 - Frames 0, 6, 12, 18, 24
Stream 2 - Frames 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28
Stream 3 - Frames 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29
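For contrast, the alternative model can be sketched under the same assumptions (30 frames per second, same hypothetical 5/15/30 fps layer pattern). Here each stream is self-contained: stream k carries every frame whose layer is at most k, so the lower-rate frames appear in more than one stream. This only counts frame slots; it says nothing about how a real encoder would code those frames:

```python
# Alternative model from the question: each stream is self-contained,
# i.e. stream k carries every frame whose layer is <= k.

def temporal_layer(frame: int) -> int:
    """Same hypothetical layer assignment as the example (0 = base)."""
    if frame % 6 == 0:
        return 0
    if frame % 2 == 0:
        return 1
    return 2


def self_contained_stream(max_layer: int, num_frames: int = 30) -> list[int]:
    """Frames a receiver would get from stream max_layer alone."""
    return [f for f in range(num_frames) if temporal_layer(f) <= max_layer]


if __name__ == "__main__":
    streams = [self_contained_stream(k) for k in range(3)]
    for k, s in enumerate(streams, start=1):
        print(f"Stream {k}: {len(s)} frames")
    # Frame slots across all three streams: 5 + 15 + 30 = 50,
    # versus 30 in the layered model, because lower-rate frames repeat.
    print("total frames across streams:", sum(len(s) for s in streams))
```

The difference in the totals (50 versus 30 frame slots per second) is what makes the distinction in the question matter for a sender producing all three streams.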
I guess my question is: from the decoder's perspective, to decode a temporally scaled stream at 30 fps, do I pass in the data from all streams, or just from the highest-rate stream (Stream 3)? If the former, how do you tell the decoder which stream the current packet belongs to, or is that information built into the sub-stream itself?
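To frame the first option concretely, here is a hypothetical sketch in which every packet carries a layer tag and the receiver drops layers above its target before handing the rest to the decoder in frame order. The `Packet` class, its `layer_id` field and `select_for_decoder` are invented for illustration; they are not a VP8 or libvpx API, and whether such a tag lives in the transport or in the bitstream is exactly what the question asks:

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical packet: 'layer_id' stands in for whatever per-packet
    layer indication a real transport or bitstream might carry."""
    frame: int
    layer_id: int
    payload: bytes


def layer_of(frame: int) -> int:
    """Same hypothetical 5/15/30 fps layer pattern as the example."""
    if frame % 6 == 0:
        return 0
    return 1 if frame % 2 == 0 else 2


def select_for_decoder(packets: list[Packet], max_layer: int) -> list[Packet]:
    """Keep packets at or below the target layer, sorted by frame.

    Models the "pass in all streams" option: the receiver gets every
    sub-stream and filters by layer tag before decoding."""
    kept = [p for p in packets if p.layer_id <= max_layer]
    return sorted(kept, key=lambda p: p.frame)


if __name__ == "__main__":
    packets = [Packet(f, layer_of(f), b"") for f in range(30)]
    print([p.frame for p in select_for_decoder(packets, 1)])
```

With `max_layer=1` this yields the even frames 0, 2, ..., 28, i.e. the 15 fps case.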
Hopefully this makes sense. Thank you in advance,
Ehren