PS: if I'm seeing it right, then SDL2 is using GLES2/WebGL under the hood, and an SDL_CopyRect() might be this under the hood:
...and if this is true, then each CopyRect is one dynamic buffer update (via glBufferData() or glBufferSubData()) plus one draw call, and that's repeated for each rectangle (so it looks like no batching happens at all).
This sort of stuff is very fast in native GL implementations, and very slow in WebGL (e.g. it's even more expensive than the 'trivial draw call' example in my post above). IME this explains the performance difference you are seeing, and also why the software renderer is faster: the WebGL calling overhead is much more expensive than the actual drawing operations.
The way to make this scenario fast in WebGL is via "sprite batching": have two big vertex buffers (alternating each frame, one 'in flight' for rendering, the other currently being filled by the CPU), write all the rectangle vertices for one frame into a memory chunk, do a single glBufferSubData() to copy that chunk into a GL vertex buffer, followed by a single glDrawArrays(). Ideally use a single texture atlas which has the images for all used sprites, or if that's not possible, have very few atlases, and sort sprites within a "depth layer" by texture. I.e. do everything you can to minimize the number of draw calls.
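
To make that a bit more concrete, here's a rough sketch of the CPU side of such a sprite batcher in C. This is not SDL2's code or anybody's actual API: the `Sprite`, `Vertex` and `Batch` structs and the `build_batches()` function are made up for illustration, and I'm assuming plain GL_TRIANGLES with 6 vertices per rectangle. It sorts sprites by layer and then by texture, writes all vertices into one memory chunk, and records one batch per run of sprites that share a texture. The GL calls are only shown in the trailing comment because they need a live context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types for illustration only (not part of SDL2 or GL). */
typedef struct {
    float x, y, w, h;       /* screen rectangle */
    float u0, v0, u1, v1;   /* texture coordinates inside the atlas */
    unsigned texture;       /* GL texture name of the atlas */
    int layer;              /* depth layer, lower is drawn first */
} Sprite;

typedef struct { float x, y, u, v; } Vertex;

typedef struct {
    unsigned texture;
    int first_vertex;
    int vertex_count;
} Batch;

/* Order by depth layer first, then by texture within a layer. */
static int compare_sprites(const void *a, const void *b) {
    const Sprite *sa = a, *sb = b;
    if (sa->layer != sb->layer)
        return (sa->layer > sb->layer) - (sa->layer < sb->layer);
    return (sa->texture > sb->texture) - (sa->texture < sb->texture);
}

/* Sorts the sprites in place, writes 6 vertices (two triangles) per sprite
 * into verts, and starts a new batch whenever the texture changes.
 * verts must have room for num_sprites * 6 entries.
 * Returns the number of batches, i.e. the number of draw calls needed.
 * Note: qsort() is not stable, so sprites sharing layer and texture may
 * end up in a different relative order than they were submitted in. */
static int build_batches(Sprite *sprites, int num_sprites,
                         Vertex *verts, Batch *batches, int max_batches) {
    qsort(sprites, (size_t)num_sprites, sizeof(Sprite), compare_sprites);
    int num_batches = 0;
    for (int i = 0; i < num_sprites; i++) {
        const Sprite *s = &sprites[i];
        if (num_batches == 0 || batches[num_batches - 1].texture != s->texture) {
            assert(num_batches < max_batches);
            batches[num_batches].texture = s->texture;
            batches[num_batches].first_vertex = i * 6;
            batches[num_batches].vertex_count = 0;
            num_batches++;
        }
        float x0 = s->x, y0 = s->y, x1 = s->x + s->w, y1 = s->y + s->h;
        Vertex *v = &verts[i * 6];
        v[0] = (Vertex){ x0, y0, s->u0, s->v0 };
        v[1] = (Vertex){ x1, y0, s->u1, s->v0 };
        v[2] = (Vertex){ x1, y1, s->u1, s->v1 };
        v[3] = (Vertex){ x0, y0, s->u0, s->v0 };
        v[4] = (Vertex){ x1, y1, s->u1, s->v1 };
        v[5] = (Vertex){ x0, y1, s->u0, s->v1 };
        batches[num_batches - 1].vertex_count += 6;
    }
    return num_batches;
}

/* Per frame, with a GL context and two pre-allocated buffers vbo[2]
 * (names assumed, not run here):
 *
 *   glBindBuffer(GL_ARRAY_BUFFER, vbo[frame & 1]);
 *   glBufferSubData(GL_ARRAY_BUFFER, 0,
 *                   num_sprites * 6 * sizeof(Vertex), verts);
 *   for (int i = 0; i < num_batches; i++) {
 *       glBindTexture(GL_TEXTURE_2D, batches[i].texture);
 *       glDrawArrays(GL_TRIANGLES, batches[i].first_vertex,
 *                    batches[i].vertex_count);
 *   }
 */
```

With a single atlas this comes out as one buffer update and one draw call per frame, no matter how many rectangles there are. With a few atlases you get one draw call per texture switch, which is why sorting by texture within a layer matters.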