Cinepak, for the unfamiliar, is basically the simplest scheme that could be reasonably described as a "video codec", developed in the 90s to play grainy video from CD-ROM on commodity computers without video acceleration. It uses vector quantization, which is similar to indexed color, except that you index whole blocks of pixels (in this case 2x2 or 4x4). It also supports basic inter-frame compression (omission of macroblocks that don't change between frames). It is kinda grainy, but at 640x480 it is mostly okay, especially when using NTSC output (any bright red objects may disagree...).
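For the unfamiliar, the core idea fits in a few lines. Here's a minimal, illustrative sketch of VQ decoding (plain Python, not the actual Spin2 driver; all names are made up) - each index in the image data pulls in a whole 2x2 pixel block from a shared codebook, rather than a single color:

```python
def vq_decode(indices, codebook, width_blocks, height_blocks):
    """Decode a vector-quantized image.

    indices:  one codebook index per 2x2 block, row-major
    codebook: list of 2x2 pixel blocks (each a 2-row list of 2 pixels)
    """
    out = [[0] * (width_blocks * 2) for _ in range(height_blocks * 2)]
    for by in range(height_blocks):
        for bx in range(width_blocks):
            # one byte of image data selects an entire 2x2 pixel block
            block = codebook[indices[by * width_blocks + bx]]
            for y in range(2):
                for x in range(2):
                    out[by * 2 + y][bx * 2 + x] = block[y][x]
    return out
```

This is the same relationship indexed color has to a palette, just with whole blocks as the palette entries - which is why the image data can get down to a byte per 4 (or 16) pixels.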
By default everything is set up for a P2EDGE 32MB with a VGA board on pin 32, but that's all configurable. Note that lower-bandwidth memory is really not fast enough for truecolor 640x480 operation, but stepping down a gear to 16bpp makes it okay again for 8 bit PSRAM (and presumably HyperRAM, too, but I haven't actually tested HyperRAM support at all). Note that the 8bpp mode is monochrome only.
Note the max_strips value. This corresponds to the same value in cinepak.spin2. If this option is omitted, it defaults to 3. More strips -> higher quality, higher bitrate. But also needs more memory to buffer codebooks (well, ffmpeg-encoded files never re-use them anyways...)
This is great work @Wuerfel_21 ! The P2 can certainly play back some reasonable-quality video with your code. I was able to get it working, playing back the bunny movie from microSD with some 16-bit-wide PSRAM fitted to the P2-EVAL. Apart from a quick test I did way back when developing my video driver, I think you are probably the first person to actually use its double-buffering capabilities (or at least the first to mention it in the forums). Also, in theory, with two independent banks of PSRAM (different data pins) you could even double the read/write bandwidth, because my video driver can be set up to read video from one memory bank while you write to the other, and then switch over automatically to the other bank per field/frame (or at least it is meant to be able to do that - never tested it yet, you'd be the first there too). That feature may end up being helpful in dual 4- or 8-bit PSRAM setups for higher performance.
Not sure about the interlace issue you mentioned, but maybe there is a bug you can identify... or something else? Playback mostly seems smooth, though I did see the scrolling credits at the end are a little jerky - is there a little frame rate variation in the decoding there? I do also see some grain/shimmer as you mention, but that may all be part of it. I'll need to read through your code in depth to see what you did.
EDIT: okay now I see some of the other videos running a bit too fast and then re-read your post again to learn there is no frame rate control put in there yet which explains it. Also very much liked the 80's Howard Jones inclusion, that takes me back to fun times.
The issue is that which of Source/AltSource ends up being the top/bottom field seems to be inconsistent based on ???. In the no_sd version, AltSource is the top field, but in the regular version, AltSource is the bottom field. If you change it in either the picture is wrong. The documentation file also doesn't specify which is supposed to be which???
It's been a while since I messed with that, since I kinda got caught up in the whole "write better encoder" thing. Turns out generating a good codebook (= the "palette" of blocks) is an NP-hard problem (as is regular color quantization, which is why most programs suck at converting images to 256 colors (or god forbid 16)).
I think I cobbled together an algorithm that is somewhat nicer than FFmpeg's, though it is mildly slower overall (but I plan to implement multi-threading to offset that). The main tangible improvement though is in the RGB -> YUV conversion. I use proper rounding (so encoded colors match input colors +/-1 instead of being kinda off) and gamma-correct the Y values to maintain the original pixel's brightness with the subsampled UV vector (reduces blockiness and edge artifacts on bright red areas).
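To illustrate the rounding point, here's a hedged sketch (not the actual encoder code, and the gamma-correction step is omitted). The reconstruction formulas are the ones common Cinepak decoders use; the forward transform shown is one plausible choice consistent with them, with and without round-to-nearest:

```python
def yuv_fast(r, g, b):
    # Truncating conversion, similar in spirit to a naive encoder:
    # every shift floors, so round-tripped colors drift from the input.
    y = (r + 2 * g + b) >> 2
    u = (b - y) >> 1
    v = (r - y) >> 1
    return y, u, v

def yuv_rounded(r, g, b):
    # Same transform with round-to-nearest (the +2/+1 bias terms),
    # keeping round-tripped colors within about +/-1 of the input.
    y = (r + 2 * g + b + 2) >> 2
    u = (b - y + 1) >> 1
    v = (r - y + 1) >> 1
    return y, u, v

def reconstruct(y, u, v):
    # Reconstruction as done by common Cinepak decoders
    # (clamping to 0..255 omitted for brevity).
    return y + 2 * v, y - (u >> 1) - v, y + 2 * u
```

Note that Python's `>>` floors on negative values just like an arithmetic shift in assembly, so the sketch matches integer-only implementations.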
The actual algorithm is a combination of fast-search PNN (with the funny k-d trees) and classic LBG (I found ELBG's shifting step to be slow (not sure why) and not actually that great at anything but reducing numerical error. It's still in there to deal with dead codewords, but not much else). I'm also using a weighted distance metric (UV get double weight, cause that's cheap to implement) to reduce banding and specks of inappropriate hue.
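A rough sketch of what one such weighted-metric LBG refinement pass could look like (illustrative Python, not the real encoder; vectors here are `(y0, y1, y2, y3, u, v)` tuples for a 2x2 block, and the dead-codeword handling is just "leave it alone"):

```python
def weighted_dist(a, b, uv_weight=2):
    # Squared-error distance with the UV components double-weighted,
    # to penalize hue errors (banding, off-color specks) more than
    # luma errors.
    dy = sum((a[i] - b[i]) ** 2 for i in range(4))
    duv = sum((a[i] - b[i]) ** 2 for i in range(4, 6))
    return dy + uv_weight * duv

def lbg_step(vectors, codebook):
    """One LBG/k-means pass: assign each training vector to its
    nearest codeword under weighted_dist, then move each codeword to
    the centroid of its cluster (dead codewords left untouched)."""
    clusters = [[] for _ in codebook]
    for v in vectors:
        best = min(range(len(codebook)),
                   key=lambda i: weighted_dist(v, codebook[i]))
        clusters[best].append(v)
    return [
        tuple(sum(c[i] for c in cl) // len(cl) for i in range(6))
        if cl else cw
        for cw, cl in zip(codebook, clusters)
    ]
```

The real thing would of course seed the codebook with PNN output and accelerate the nearest-neighbour search (e.g. with the k-d trees mentioned above) instead of scanning linearly.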
Cinepak maxes out at 2bpp for actual image data (excluding codebooks), but yeah, it seems like a way superior scheme (better quality and probably faster to decompress) than CCC. The codebooks add up to 3K per strip used, though. OTOH, image data can be reduced a lot by changing the V1/V4 coding threshold while still maintaining decent quality (bottoms out at 0.5 bits per pixel, but at that point you should just reduce the resolution).
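Back-of-envelope arithmetic behind those figures, assuming the standard Cinepak layout of two 256-entry codebooks (V1 and V4) at 6 bytes per entry (4 Y values + U + V):

```python
# Each strip carries a V1 and a V4 codebook: 256 entries x 6 bytes.
codebook_bytes_per_strip = 2 * 256 * 6   # 3072 bytes, i.e. the 3K above

# A V4-coded 4x4 macroblock spends four 1-byte indices on 16 pixels:
v4_bpp = 4 * 8 / 16                      # 2.0 bits per pixel (the ceiling)

# A V1-coded macroblock spends one byte on those same 16 pixels:
v1_bpp = 1 * 8 / 16                      # 0.5 bits per pixel (the floor)
```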
Speaking of, here's another funny APNG illustrating the aforementioned YUV conversion issue. "Fast YUV" is similar to what FFmpeg does, minus rounding errors:
(Note that this is before actual quantization is applied. Also this is an extreme case and at low resolution.)
Yeah I think this is why having dual independent PSRAM banks might be useful - though obviously more pins are required (36 IO pins). While the video driver is reading a frame from one bank you have the entire PSRAM bandwidth to yourself on the other bank for writing. This bandwidth increase could be very large if there is no reading going on and could speed up the copy to PSRAM with large transfers. I wonder then if 720p50 or 25 could even be achievable...? Is there any more headroom on the decoding side for increasing the resolution? Or could you use more COGs if required? What is the COG use currently?
There's a profiling option that will tell you some per-frame metrics. Do keep in mind that IO and FS metrics are masked by VQ due to async transfers. I think 800x480 would certainly work with 300+ MHz (current 640x480 uses the default VGA 252 MHz setting). Might try to see what's possible at 1024x768 (native resolution of the VGA monitor on the """bench"""). Probably need to reduce buffers to 16bpp (which is... fine-ish). Also, the coding efficiency kinda goes down with resolution, since the number of "interior" blocks (i.e. ones where all the Y values are approximately equal) grows quadratically with resolution while the number of "edge" blocks only grows linearly, favoring the former in the VQ algorithm.
In my WIP encoder, I already counteract that by increasing weights of interior blocks for building the V1 codebook (since V1 won't get picked for edge blocks, anyways). I think I may need to correspondingly de-weight the V4 blocks at the same locations (so the V4 book doesn't get filled with as many pointless codes), but that would require an additional copy of data. Which I guess is fine considering how slow the actual algorithm is. Which would also allow me to reset the weights after the skip/V1/V4 decisions are made, which is probably a good idea.
The decode cog is running Spin2 (and as currently written, is rebooted for every frame), so it could also do other things. You can also change it to run on Cog 0 by fiddling with the options at the top of cinetest, but that makes it slower.
I think the PSRAM access schedule isn't quite ideal yet. I might look into using request lists to make one buffer swap (i.e. uploading a 640x4 line and downloading the second-next one over the same buffer) into a single op, which might improve bandwidth utilization. But that would make the interface less nice.
Using multiple cogs would be tricky due to the variable-rate encoding (i.e. can't easily split the data). What could be done is having separate cogs for odd/even scanlines. They'd both need to decode the entire bitstream, but would only process half of each block. You could also split strip processing, but then you'd need to constrain the encoder to never use less than N strips and never re-use codewords between different strips and also figure out how to split the read bandwidth among the strips. I.e. NO.
You might also try to fiddle about with the per COG burst sizes so that your writes line up well with this burst size and you reduce the total requests issued, while also trying to maximize the use of remaining bandwidth per scan line. It's a bit of a balance there that depends on the video line frequency and the scan line size and P2 clock frequency and there are probably some sweet spots.
From purely a video and PSRAM perspective I know a 1024x768 true-colour frame buffer is achievable on a P2 at 325MHz, but whether sufficient write bandwidth remains for a video decoder to use is the main question. A dual PSRAM setup would probably allow it, and single PSRAM not. Going down to 16bpp halves the bandwidth over true-colour at the expense of video quality. Also you can update new frames just at 30Hz or 25Hz instead of 60/50Hz so only need half the write bandwidth again. Presumably the decoder is already using that trick.
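Rough sustained-bandwidth arithmetic for those numbers (a sketch that ignores blanking, PSRAM refresh, and burst overheads, so real requirements will be somewhat higher):

```python
def stream_mb_s(width, height, bpp, hz):
    # Sustained MB/s needed to move one full frame's pixels hz times
    # per second at the given depth.
    return width * height * (bpp // 8) * hz / 1e6

read_60Hz   = stream_mb_s(1024, 768, 32, 60)  # scanout, true-colour
write_30Hz  = stream_mb_s(1024, 768, 32, 30)  # decoder updating at half rate
write_16bpp = stream_mb_s(1024, 768, 16, 30)  # plus the 16bpp halving
```

Each of the two tricks (30Hz updates, 16bpp) halves the write side, so the decoder's writes end up at a quarter of the scanout bandwidth.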
One other low-processing video format maybe worth looking at is Smacker or Bink from RAD Game Tools. They used to be used for a lot of video game cut-scenes, and I remember them claiming HD video playback on an 80486.
I kinda like the Heptafon codec I developed myself, but that's kinda underkill, being designed for the P1 (no multiply, no output buffering, etc.) and thus not quite transparent. Also, it's not available in any standard tool.
All the usual ADPCM codecs are memes. Like, entirely obsoleted by the existence of the aforementioned Heptafon (maybe my encoder is just better rather than the format, but I've tried encoding Microsoft ADPCM with Audacity and ffmpeg and both suck). All of the ones that do outperform it require multiplies and have a higher bitrate. I could probably make a Heptafon derivative that does use such things, which would in turn destroy those.
Got it going on my P2-EVAL setup with the PSRAM module using P32-P47 pins. The higher resolution looks very nice and crisp. If I get a chance sometime I'll try to fit another PSRAM module on P0-P23 and put the VGA board on P24-P31, and see if I can figure out some way to hack your code to flip frames alternately from each PSRAM, assuming it can be done with the way you have it coded (you'll probably know if that is even possible). My second PSRAM board does have two PSRAM loads per data pin though (it's a 64MB module instead of 32MB); hopefully it can still run okay at 325MHz. I should try that one out separately first, I guess... If that works it'd allow 24bpp colour at this higher resolution.