Desktop Hardware Encoding for WebRTC


Alexandre GOUAILLARD

Dec 4, 2019, 5:39:05 AM12/4/19
to discuss...@googlegroups.com
Dear all,

I'm putting together information about hardware acceleration for WebRTC on desktop (no IoT, no UWP, ...).
At this stage, Intel HW acceleration is pretty complete, thanks to the Intel WebRTC team's contributions, but the section on Nvidia (NvPipe/NVENC) and AMD AMF is pretty slim.

Note that we are NOT talking about support for pre-encoded frames, but rather about adding HW-encoder support in libwebrtc through the VideoEncoderFactory design pattern.
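For anyone new to that pattern, a minimal sketch of the factory idea follows. The class shapes below are simplified stand-ins for the real libwebrtc interfaces under api/video_codecs/, and NvencH264Encoder is a hypothetical placeholder, not an actual implementation:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins for the libwebrtc interfaces; names and shapes here
// are illustrative only, not the real headers.
struct SdpVideoFormat {
  std::string name;  // e.g. "H264", "VP8"
};

class VideoEncoder {
 public:
  virtual ~VideoEncoder() = default;
  virtual std::string ImplementationName() const = 0;
};

class VideoEncoderFactory {
 public:
  virtual ~VideoEncoderFactory() = default;
  virtual std::vector<SdpVideoFormat> GetSupportedFormats() const = 0;
  virtual std::unique_ptr<VideoEncoder> CreateVideoEncoder(
      const SdpVideoFormat& format) = 0;
};

// Hypothetical hardware-backed encoder; a real one would wrap NVENC/AMF/etc.
class NvencH264Encoder : public VideoEncoder {
 public:
  std::string ImplementationName() const override { return "NVENC H264"; }
};

class SoftwareH264Encoder : public VideoEncoder {
 public:
  std::string ImplementationName() const override { return "OpenH264"; }
};

// The factory decides at creation time whether to hand out a HW encoder,
// falling back to software when the hardware is unavailable.
class HardwareFirstEncoderFactory : public VideoEncoderFactory {
 public:
  explicit HardwareFirstEncoderFactory(bool hw_available)
      : hw_available_(hw_available) {}

  std::vector<SdpVideoFormat> GetSupportedFormats() const override {
    return {{"H264"}};
  }

  std::unique_ptr<VideoEncoder> CreateVideoEncoder(
      const SdpVideoFormat& format) override {
    if (format.name == "H264" && hw_available_)
      return std::make_unique<NvencH264Encoder>();
    return std::make_unique<SoftwareH264Encoder>();
  }

 private:
  bool hw_available_;
};
```

In the real API, an application passes its custom factory into CreatePeerConnectionFactory, and libwebrtc calls it whenever a send stream needs an encoder.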

Is there anybody, either from Nvidia / AMD or with experience integrating them in libwebrtc, who would be willing to exchange on the matter? Please contact me off-list at agouaillard (at) gmail.com. All results will be shared with those who contributed.

Thanks in advance, 

Regards, 

Dr Alex.


--
Alex. Gouaillard, PhD, PhD, MBA
------------------------------------------------------------------------------------
President - CoSMo Software Consulting, Singapore
------------------------------------------------------------------------------------

Sebastian Kunz

Dec 4, 2019, 8:01:03 AM12/4/19
to discuss...@googlegroups.com
Hello Alex,
That is exciting news, and changes I have been waiting for! Does this also provide a direct pipeline with the desktop_capture modules? I believe it is not possible to use the WebRTC desktop_capturer (which on Windows 8+ uses the Desktop Duplication API) together with a hardware-accelerated VideoEncoder (e.g. NVENC on Nvidia GPUs) without unnecessarily copying between RAM and VRAM. Please correct me if I am wrong; I wasn't able to find a way to leave the captured frame on the GPU and feed it to the hardware encoder.
Thank you very much. Excited for the future!

Regards,
Sebastian

--

---
You received this message because you are subscribed to the Google Groups "discuss-webrtc" group.
To unsubscribe from this group and stop receiving emails from it, send an email to discuss-webrt...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/discuss-webrtc/CAHgZEq4tx9x8ENcRtu3KNgb_MUk4Ejv%2BLzTFHUKRw%3DX5Ja1AXg%40mail.gmail.com.

Alexandre GOUAILLARD

Dec 4, 2019, 11:48:02 AM12/4/19
to discuss...@googlegroups.com
I'm not sure about this specific case,
but in the case of Intel H.265 HW support, for example, they take a non-native handle (i.e. a raw I420 frame in RAM), and there seems to be an improvement over the default case anyway.

How did your implementation with NVENC, even with the copy, compare to the default case?
Is your NVENC-enabled VideoEncoder class available somewhere?

Sebastian Kunz

Dec 5, 2019, 3:24:09 AM12/5/19
to discuss...@googlegroups.com
So in my specific case I want to encode an ID3D11Texture2D, which is provided by the Windows Desktop Duplication API. I implemented my own DDA capturer to send the newly acquired frames to the VideoSink. For the video encoder I am using the following code: https://github.com/WonderMediaProductions/webrtc-dotnet-core/blob/master/webrtc-native/NvEncoderH264.cpp. Please note that I don't take credit for this code, since I didn't write it. If you care about the actual implementation using NVENC, you can look at https://github.com/WonderMediaProductions/webrtc-dotnet-core/tree/master/webrtc-native-nvenc. This mostly makes use of the examples already provided by Nvidia; you can find them here: https://github.com/NVIDIA/NvPipe/tree/master/src/Video_Codec_SDK_9.0.20/Samples/NvCodec/NvEncoder.
So with my own capture device this works pretty well. I also tried the desktop_capture module with a hardware encoder using NvPipe. The performance was terrible.
However, I never tried the WebRTC desktop_capturer with the NvEncoder that I am currently using (see first link), simply because the desktop capturer doesn't provide the frame as an ID3D11Texture2D. I also have the code for the RGB encoder using NvPipe, in case you're interested.
I also tried WebRTC's desktop_capturer with a non-hardware-accelerated encoder. That was the worst of all. This might be because I didn't bother looking into the encoder settings for that one; I just used the default encoder, with no configuration made.

Alexandre GOUAILLARD

Dec 5, 2019, 7:12:04 AM12/5/19
to discuss...@googlegroups.com


On Thu, Dec 5, 2019 at 9:24 AM Sebastian Kunz <sebasti...@precipoint.de> wrote:
So in my specific case I want to encode an ID3D11Texture2D, which is provided by the Windows Desktop Duplication API. I implemented my own DDA capturer to send the newly acquired frames to the VideoSink.

 

Yes, that's the entry point I was looking for. Simple (no simulcast support, ...), but it seems to be complete for a single stream.

So with my own capture device this works pretty well. I also tried the desktop_capture module with a hardware encoder using NvPipe. The performance was terrible.

I thought that NvPipe was just a wrapper around the Video Codec SDK? How come the results are so different?
 
However, I never tried the WebRTC desktop_capturer with the NvEncoder that I am currently using (see first link), simply because the desktop capturer doesn't provide the frame as an ID3D11Texture2D.

Hmm. The Intel HW acceleration support code has some provisions for dealing with D3D11 textures and transforming them into something WebRTC understands (I420 or NV12).
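To make that conversion step concrete, here is a minimal CPU-side BGRA-to-I420 sketch (BT.601 limited range, 2x2 chroma subsampling). This is illustrative only and not lifted from the Intel code; real capturers use libyuv or a GPU shader, and the fixed-point constants below are the common approximations:

```cpp
#include <cstdint>
#include <vector>

// Plain-CPU BGRA -> I420 conversion: the kind of readback-and-convert work a
// capturer must do when the encoder cannot accept a GPU texture directly.
struct I420Frame {
  int width, height;
  std::vector<uint8_t> y, u, v;
};

I420Frame BgraToI420(const uint8_t* bgra, int width, int height) {
  I420Frame f{width, height,
              std::vector<uint8_t>(width * height),
              std::vector<uint8_t>(width * height / 4),
              std::vector<uint8_t>(width * height / 4)};
  for (int row = 0; row < height; ++row) {
    for (int col = 0; col < width; ++col) {
      const uint8_t* p = bgra + 4 * (row * width + col);
      int b = p[0], g = p[1], r = p[2];
      // BT.601 limited-range luma, common 8-bit fixed-point approximation.
      f.y[row * width + col] =
          static_cast<uint8_t>((66 * r + 129 * g + 25 * b + 128) / 256 + 16);
      if (row % 2 == 0 && col % 2 == 0) {  // one chroma sample per 2x2 block
        int i = (row / 2) * (width / 2) + (col / 2);
        f.u[i] = static_cast<uint8_t>(
            (-38 * r - 74 * g + 112 * b + 128) / 256 + 128);
        f.v[i] = static_cast<uint8_t>(
            (112 * r - 94 * g - 18 * b + 128) / 256 + 128);
      }
    }
  }
  return f;
}
```

Every pixel of every frame goes through this loop (plus the VRAM-to-RAM readback before it), which is exactly the cost a native-handle path avoids.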
You can look here if it helps, and at the corresponding files in the same folder:
 
I also have the code for the RGB encoder using NvPipe, in case you're interested.

Right now, I'm collecting all possible resources, so yes, any link you have is more than welcome.

Thank you again.
 

Sebastian Kunz

Dec 5, 2019, 8:09:21 AM12/5/19
to discuss...@googlegroups.com
Yes, NvPipe is just a wrapper around Nvidia's Codec SDK. The performance issues don't come from NvPipe; they come from copying between RAM and VRAM. I was capturing from a 4K monitor, which led to massive frame sizes.
The reason I didn't bother investigating the desktop_capture module + NVENC any further is that the application we are developing is all about latency. We can't afford to convert from one format to another; it's simply too expensive. In the best-case scenario you have one copy operation from the capture loop to the encoder.
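Some back-of-the-envelope numbers show why the 4K case hurts (assumptions are mine: 4K BGRA at 60 fps; actual readback throughput depends on GPU, PCIe generation, and driver):

```cpp
#include <cstdint>

// Uncompressed size of a 4K desktop frame as delivered by desktop
// duplication (BGRA, 4 bytes per pixel), and the resulting readback
// bandwidth at 60 fps. Illustrative arithmetic only.
constexpr int64_t kWidth = 3840;
constexpr int64_t kHeight = 2160;
constexpr int64_t kBytesPerPixel = 4;   // BGRA
constexpr int64_t kFps = 60;

constexpr int64_t kBytesPerFrame = kWidth * kHeight * kBytesPerPixel;
constexpr int64_t kBytesPerSecond = kBytesPerFrame * kFps;
// kBytesPerFrame  == 33,177,600  (~33 MB per frame)
// kBytesPerSecond == 1,990,656,000 (~2 GB/s of VRAM -> RAM copies)
```

Roughly 2 GB/s of sustained copies, before any color conversion, is a plausible explanation for the latency Sebastian describes.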

Happy to hear that the links helped!


Barry Li

Dec 5, 2019, 10:13:19 AM12/5/19
to discuss-webrtc
Wow, cool. What about macOS?

Alexandre GOUAILLARD

Dec 5, 2019, 11:37:32 AM12/5/19
to discuss...@googlegroups.com
What about it? What are you trying to do?
- device capture or screen capture?
- which hardware do you want to support (GPU or other)?
- which codec?


David Collins

Dec 11, 2019, 11:49:05 AM12/11/19
to discuss-webrtc
There are at least two options I'm aware of for macOS, possibly three. The one I have direct experience with is the VideoToolbox API. It supports only H.264 at the time of this writing, but it works on both iOS and macOS. Like the other solutions discussed here, the result is a handle to a block of memory in VRAM. Ideally, it should stay there without being copied to RAM more than the once required for it to be packaged for network delivery to the remote H.264 app. Similarly, I'd be interested to know whether on decode, where the product is an OpenGL texture, it can stay on the card and be composited using OpenGL for display. H.264 is supported, but I don't believe it is universally implemented on browsers, so until Apple stops snubbing VP8/VP9, or until some new standard overshadows all of them, VideoToolbox is probably not a great solution. Also, it is VERY poorly documented. There is a session from WWDC where it is discussed carefully, but even so, it is a strange interface.

The other option is that Intel must have macOS APIs to their on-chip hardware encode/decode. This will only work on macOS, but if that's OK, it's probably the best solution, since it produces VP8/VP9 and will be similar to the solutions that others produce as a model for Linux and Windows.

The third is pure conjecture on my part because I have no direct knowledge about it, but I would be shocked if AMD and Nvidia do not have APIs for the Mac to access their GPU facilities. Not all Macs have GPUs, so you would want to create an abstract implementation that can manage either if you want to take advantage of the GPU when it is available. It is virtually guaranteed to be more performant.

Personally, it's the "plumbing" I'm more interested in. Since encoding by definition produces very small products, and since they have to be in RAM ultimately, I'm not very concerned with that end of things. The performance improvements and parallelism provided by hardware encoding will outweigh the hit for moving the compressed frames back across the memory bus from VRAM. On the decoding side, the performance/power improvement will be less of a win, and dragging uncompressed textures from VRAM to RAM could outweigh the performance improvements. Ideally, I'd like to be able to treat the decoded frame as an abstract image handle that allows me to polymorphically do any necessary compositing/blitting with OpenGL. In theory, the image ought to be able to stay in VRAM until it is released, i.e. it never HAS to be copied to RAM.

David Collins

Dec 11, 2019, 11:49:05 AM12/11/19
to discuss-webrtc
Thanks for raising this topic. It's of great interest to me as well. I have some experience converting a software encode/decode pipeline to a hardware-enabled one in a non-WebRTC conferencing app on iOS, using VideoToolbox. The key issue (especially on decode, where the products are big, whole-frame textures, and the performance/power improvements are modest) is being able to control the lifespan of the encoded/decoded frames in order to keep them in VRAM. Sebastian Kunz is alluding to that process, so I'll watch this thread and experiment. Hopefully, I'll be able to add something substantive to the discussion, at least from the perspective of the application I'm working on.

Alexandre GOUAILLARD

Dec 11, 2019, 4:21:24 PM12/11/19
to discuss...@googlegroups.com
VTB is supported in Google's standalone WebRTC code, and also in the WebKit-specific H.264 simulcast implementation. Here is a diagram below of the Obj-C framework implementation and how it integrates with the C++ layer. Check the code of the corresponding classes for implementation details.

[Attachment: H264 Hardware Coding (ObjC).jpg]

The curious might have noted this commit in WebKit almost two years ago:
https://trac.webkit.org/changeset/225761/webkit The VCP API mentioned there is a real-time version/extension of VTB which is, at this time, still private (it can only be used by Apple products).


H.265 HW encoding and decoding are working on Mac as well; this was done during the IETF Hackathon one month ago thanks to a code contribution by the Intel team from Shanghai. They provide support for iOS, macOS, Android, and Windows through different OS frameworks.

H.264 is supported, but I don't believe it is universally implemented on browsers,

Today it is, except for Firefox 68, and only temporarily.
 
so until Apple stops snubbing VP8/VP9,

Apple has supported VP8 in Safari since March:

VP9 is not mandatory to implement for WebRTC 1.0.

The other option is that Intel must have macOS APIs to their on-chip hardware encode/decode.

It works the other way around: VTB wraps the hardware.
 
[...]
 
Not all Macs have GPUs, so you would want to create an abstract implementation that can manage either if you want to take advantage of the GPU when it is available. It is virtually guaranteed to be more performant.

https://cs.chromium.org/chromium/src/third_party/blink/renderer/platform/peerconnection/rtc_video_encoder_factory.h is the glue class in Chrome which wraps Chrome's media::GpuVideoAcceleratorFactories into libwebrtc's VideoEncoderFactory. You can check the Chrome GpuVideoAccelerator classes for more details, and you will see, e.g., some frame types with channels that are specific to Mac GPUs.

Personally, it's the "plumbing" I'm more interested in. [...]

Look at the "kNative" type of frame in the media engine implementation, and the supports_native_handle member of the EncoderInfo structure in the VideoEncoder class, for more details.
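For readers following along, that kNative / supports_native_handle contract can be sketched with simplified stand-in types. The real interfaces are webrtc::VideoFrameBuffer and webrtc::VideoEncoder::EncoderInfo; everything below is an illustrative mock of the idea, not the actual API:

```cpp
// A frame buffer either carries CPU-mapped I420 data or an opaque GPU handle
// (kNative). An encoder that reports supports_native_handle can consume the
// GPU handle directly; any other encoder must download the pixels first.
enum class BufferType { kI420, kNative };

class FrameBuffer {
 public:
  virtual ~FrameBuffer() = default;
  virtual BufferType type() const = 0;
  // In real libwebrtc this is ToI420(), returning an I420BufferInterface;
  // here a bool stands in for "a CPU copy was produced".
  virtual bool DownloadToI420() = 0;
};

// Hypothetical GPU-texture-backed buffer (e.g. wrapping an ID3D11Texture2D).
class GpuTextureBuffer : public FrameBuffer {
 public:
  BufferType type() const override { return BufferType::kNative; }
  bool DownloadToI420() override { return true; }  // readback happened
};

// Encoder-side decision: consume the native handle when supported,
// otherwise force the VRAM -> RAM readback.
bool EncodeNeedsReadback(FrameBuffer& buf, bool supports_native_handle) {
  if (buf.type() == BufferType::kNative && supports_native_handle)
    return false;  // zero-copy: hand the GPU texture straight to the HW encoder
  return buf.DownloadToI420();
}
```

This is the mechanism that would let a D3D11 capturer and NVENC share a texture without the RAM round-trip discussed earlier in the thread.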

Hope this helps.