Implementing E2EE for H.264 in WebRTC and FreeSWITCH


Ivan Hutomo

Jul 19, 2024, 5:49:53 AM
to discuss-webrtc
I'm trying to implement E2EE between Avaya and WebRTC. Since I can't modify Avaya's code, I'm using FreeSWITCH as a gateway that encrypts and decrypts the media going to and from Avaya. Current status: audio encryption and decryption work properly, and I can hear the sound clearly.

On FreeSWITCH, I can decrypt the video, and it displays correctly after I run switch_core_codec_decode_video on the decrypted data. However, when I encrypt the video on FreeSWITCH, it can't be displayed in my WebRTC client, even though all the payloads are decrypted correctly there.

I also tried leaving SPS and PPS out of encryption and decryption, and leaving the first header unencrypted as well, but the video still isn't shown. I have checked all the payloads, and after decryption they appear to match the original payloads exactly.

My current suspicion is that WebRTC's H.264 decoder cannot decode the video properly even after I have decrypted it. I implemented the frame decryptor using FrameDecryptorInterface, and when two WebRTC endpoints talk to each other, the result is fine. This is really strange. How can I get WebRTC's H.264 decoder to decode my video once it has been decrypted successfully?

Thank you so much for all of your help.

Jason Gu

Jul 19, 2024, 11:04:38 PM
to discuss-webrtc
As far as I know, the WebRTC code checks the PPS ID in PpsParser::ParsePpsIdFromSlice(), and on the receiver side this happens before decryption. So you may also need to leave the first 2 bytes of the NALU payload unencrypted. You can refer to the LiveKit implementation:
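For reference, the first three fields of an H.264 slice header (first_mb_in_slice, slice_type and pic_parameter_set_id) are unsigned Exp-Golomb ue(v) codes that immediately follow the NAL header byte, which is why encrypting those bytes breaks the receiver's PPS lookup. The sketch below parses them. It is not libwebrtc's PpsParser; BitReader and ParseSliceHeaderPrefix are hypothetical names, and it skips emulation-prevention removal for brevity. Because ue(v) codes are variable-length, any fixed clear offset is a heuristic that only works when these fields happen to fit inside it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// MSB-first bit reader. Sketch only: it does not strip emulation-prevention
// bytes (00 00 03), which a real slice-header parser must do.
class BitReader {
 public:
  explicit BitReader(std::vector<uint8_t> data) : data_(std::move(data)) {}

  std::optional<uint32_t> ReadBit() {
    if (pos_ >= data_.size() * 8) return std::nullopt;
    const uint32_t bit = (data_[pos_ / 8] >> (7 - pos_ % 8)) & 1u;
    ++pos_;
    return bit;
  }

  // Unsigned Exp-Golomb ue(v): N leading zeros, a 1, then N suffix bits.
  std::optional<uint32_t> ReadUe() {
    int leading_zeros = 0;
    for (;;) {
      const auto bit = ReadBit();
      if (!bit) return std::nullopt;
      if (*bit == 1) break;
      if (++leading_zeros > 31) return std::nullopt;
    }
    uint32_t suffix = 0;
    for (int i = 0; i < leading_zeros; ++i) {
      const auto bit = ReadBit();
      if (!bit) return std::nullopt;
      suffix = (suffix << 1) | *bit;
    }
    return ((1u << leading_zeros) - 1u) + suffix;
  }

 private:
  std::vector<uint8_t> data_;
  size_t pos_ = 0;
};

struct SliceHeaderPrefix {
  uint32_t first_mb_in_slice;
  uint32_t slice_type;
  uint32_t pic_parameter_set_id;
};

// `nalu` starts at the one-byte NAL header (start code already removed).
std::optional<SliceHeaderPrefix> ParseSliceHeaderPrefix(
    const std::vector<uint8_t>& nalu) {
  if (nalu.size() < 2) return std::nullopt;
  const uint8_t type = nalu[0] & 0x1F;
  if (type < 1 || type > 5) return std::nullopt;  // not a coded slice
  BitReader reader(std::vector<uint8_t>(nalu.begin() + 1, nalu.end()));
  const auto first_mb = reader.ReadUe();
  const auto slice_type = reader.ReadUe();
  const auto pps_id = reader.ReadUe();
  if (!first_mb || !slice_type || !pps_id) return std::nullopt;
  return SliceHeaderPrefix{*first_mb, *slice_type, *pps_id};
}
```

For an IDR slice starting 65 88 80, this yields first_mb_in_slice = 0, slice_type = 7 (I) and pic_parameter_set_id = 0; if ciphertext replaces 88 80, the parse fails or returns meaningless values.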

Khun V

Jul 20, 2024, 12:06:41 AM
to discuss...@googlegroups.com
Encrypting H.264 is tricky, and I'm not sure LiveKit got it 100% right:
1. A 2-byte offset is definitely not enough, because many WebRTC implementations (libwebrtc included) parse the slice_header to extract slice_qp_delta. To make it work, follow Apple's recommendations: the older one was to leave the first 32 bytes in the clear (encrypted H.264 in MPEG-TS); the other is to parse the slice_header entirely and start encrypting right after it (encrypted H.264 in (f)MP4);
2. I don't see any start code emulation prevention: encryption can produce sequences like 00 00 01, and if you don't check for and escape them, you'll get very rare, random glitches (these are fun to debug, lol).
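Point 2 can be handled the same way H.264 itself avoids start-code emulation: after encrypting, insert an emulation-prevention byte 0x03 wherever two zero bytes are followed by a byte in 0x00..0x03, and strip those bytes again on the receiving side before decrypting. This is a minimal sketch; AddEmulationPrevention and RemoveEmulationPrevention are hypothetical names, and it assumes the sender escapes the ciphertext after encrypting and the receiver unescapes it before decrypting. It ignores the corner case of a payload that ends in 0x00, which the spec handles by appending a trailing 0x03.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Escapes a payload: whenever two 0x00 bytes are followed by a byte in
// 0x00..0x03, insert 0x03 first, so 00 00 00/01/02 can never appear.
std::vector<uint8_t> AddEmulationPrevention(const std::vector<uint8_t>& raw) {
  std::vector<uint8_t> out;
  out.reserve(raw.size() + raw.size() / 2);  // worst case is about 1.5x
  int zeros = 0;
  for (uint8_t b : raw) {
    if (zeros >= 2 && b <= 0x03) {
      out.push_back(0x03);
      zeros = 0;
    }
    out.push_back(b);
    zeros = (b == 0x00) ? zeros + 1 : 0;
  }
  return out;
}

// Inverse: drops every 0x03 that directly follows two 0x00 bytes.
std::vector<uint8_t> RemoveEmulationPrevention(
    const std::vector<uint8_t>& escaped) {
  std::vector<uint8_t> out;
  out.reserve(escaped.size());
  int zeros = 0;
  for (uint8_t b : escaped) {
    if (zeros >= 2 && b == 0x03) {
      zeros = 0;  // skip the emulation-prevention byte itself
      continue;
    }
    out.push_back(b);
    zeros = (b == 0x00) ? zeros + 1 : 0;
  }
  return out;
}
```

Escaping can grow the frame, so the sender's maximum ciphertext size has to leave room for the inserted bytes.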


孙泽

Jul 20, 2024, 12:12:59 AM
to discuss-webrtc
Yeah, different frames need different offsets, which makes handling all of them difficult.


Philipp Hancke

Jul 20, 2024, 11:26:12 PM
to discuss...@googlegroups.com
If you encrypt SPS and PPS, you will have a lot of fun (not). With the changes coming in M128, it also does not seem prudent to fiddle with the beginning of the slice, which is going to be inspected to determine whether the slice is the start of a frame.

Khun V

Jul 20, 2024, 11:56:52 PM
to discuss...@googlegroups.com
The start-of-frame check is covered by a 1- or 2-byte offset like LiveKit's, since first_mb_in_slice is the very first field in the slice header.

SPS and PPS should definitely not be touched, and neither should other non-VCL NAL units such as SEIs. The logic should be: 1) parse nal_unit_type (see PS); 2) encrypt the NAL unit if its type is in the 1-5 range, and leave it alone otherwise.
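The two-step rule can be written down directly. A minimal sketch; NalUnitType and ShouldEncryptNal are hypothetical helper names, and the argument is assumed to be the one-byte NAL header that follows a start code.

```cpp
#include <cassert>
#include <cstdint>

// One-byte H.264 NAL header: forbidden_zero_bit (1 bit) | nal_ref_idc (2 bits)
// | nal_unit_type (5 bits). The type therefore needs the mask 0x1F.
constexpr uint8_t NalUnitType(uint8_t nal_header) {
  return static_cast<uint8_t>(nal_header & 0x1F);
}

// Coded slices (types 1..5) are encrypted; SEI (6), SPS (7), PPS (8),
// access unit delimiters (9) and every other non-VCL type stay in the clear.
constexpr bool ShouldEncryptNal(uint8_t nal_header) {
  const uint8_t type = NalUnitType(nal_header);
  return type >= 1 && type <= 5;
}
```

For example, 0x65 (IDR slice) and 0x41 (non-IDR slice) are encrypted, while 0x67 (SPS), 0x68 (PPS) and 0x06 (SEI) are not. With a 0x0F mask, a header such as 0x74 (type 20) would be misread as type 4 and encrypted by mistake.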

PS. Philipp, you have a bug in your H.264 E2EE implementation prototype, namely "const nalType = view.getUint8(i + 4) & 0b1111": nal_unit_type is 5 bits long, so the mask has to be 0x1f / 0b11111. There are some other subtle issues, mostly originating from the extremely bad decision by the H.264/AVC (and HEVC, and VVC) standardization committee to incorporate transport/muxing into the video compression standard. These 00 00 00 01 start codes are a massive pain when you do encryption: they can also be 3 bytes long (00 00 01), and some encoders even mix 3- and 4-byte start codes within the same access unit (aka frame), making it even trickier.
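A scanner that tolerates mixed 3- and 4-byte start codes can treat 00 00 01 as the start code and fold one preceding zero byte into it when present. A minimal sketch, assuming a plain Annex B access unit whose ciphertext has already been escaped so it cannot contain a fake start code; NalUnit and SplitAnnexB are hypothetical names. Any extra trailing zero bytes before a start code stay attached to the previous NAL unit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct NalUnit {
  size_t start_code_offset;  // first byte of 00 00 01 or 00 00 00 01
  size_t payload_offset;     // first byte after the start code (NAL header)
  size_t size;               // NAL header + payload, start code excluded
  uint8_t type;              // nal_unit_type = header & 0x1F
};

// Splits an Annex B access unit on 3-byte and 4-byte start codes, which may be
// mixed within the same frame. A zero byte directly before 00 00 01 is treated
// as part of a 4-byte start code.
std::vector<NalUnit> SplitAnnexB(const std::vector<uint8_t>& frame) {
  std::vector<NalUnit> nalus;
  const size_t n = frame.size();
  size_t i = 0;
  while (i + 3 <= n) {
    if (frame[i] != 0x00 || frame[i + 1] != 0x00 || frame[i + 2] != 0x01) {
      ++i;
      continue;
    }
    const size_t start_code = (i > 0 && frame[i - 1] == 0x00) ? i - 1 : i;
    if (!nalus.empty()) {
      // The previous NAL unit ends where this start code begins.
      nalus.back().size = start_code - nalus.back().payload_offset;
    }
    const size_t payload = i + 3;
    const uint8_t type =
        payload < n ? static_cast<uint8_t>(frame[payload] & 0x1F) : 0;
    nalus.push_back({start_code, payload, n - payload, type});
    i = payload;
  }
  return nalus;
}
```

Each entry's type can then be checked against the 1-5 range to decide whether that NAL unit gets encrypted.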

Ivan Hutomo

Jul 22, 2024, 7:26:17 AM
to discuss-webrtc

The key is to avoid encrypting SPS and PPS. Tracing the packets with Wireshark is very useful for this. In WebRTC you will see the pattern 0,0,0,1,x at the start of each NAL unit; ANDing x with 0x1f gives the NAL unit type, which tells you whether it is an SPS or a PPS. In WebRTC, the IDR NAL unit follows the SPS and PPS in the same frame. You can encrypt the IDR NAL unit, but not the SPS and PPS.

Ivan Hutomo

Jul 22, 2024, 7:27:07 AM
to discuss-webrtc
I finally got it working. I traced every packet in Wireshark, which gave me some ideas. Here is my implementation, in case you are interested:

Decryptor.cc
#include "api/crypto/octon_frame_decryptor.h"

#include <openssl/bio.h>
#include <openssl/evp.h>
#include <stddef.h>
#include <stdio.h>

#include <algorithm>
#include <numeric>
#include <vector>

#include "rtc_base/logging.h"

namespace webrtc {

OctonFrameDecryptor::OctonFrameDecryptor(const std::vector<uint8_t>& new_keys)
    : key_bytes(new_keys) {
  RTC_LOG(LS_VERBOSE) << "XXX OctonFrameDecryptor " << key_bytes.size();
}

// AES-256-CTR decryption with OpenSSL's EVP API. CTR is a stream mode, so the
// plaintext is exactly as long as the ciphertext and no padding is removed.
// `key` must point to 32 bytes and `iv` to 16 bytes.
int decrypt(unsigned char* key,
            unsigned char* ciphertext,
            int ciphertext_len,
            unsigned char* iv,
            unsigned char* plaintext) {
  EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
  if (!ctx) {
    RTC_LOG(LS_INFO) << "Error in Decryption 100";
    return -1;
  }

  int len;
  int plaintext_len;

  if (EVP_DecryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv) != 1) {
    RTC_LOG(LS_INFO) << "Error in Decryption 200";
    EVP_CIPHER_CTX_free(ctx);
    return -1;
  }

  if (EVP_DecryptUpdate(ctx, plaintext, &len, ciphertext, ciphertext_len) !=
      1) {
    RTC_LOG(LS_INFO) << "Error in Decryption 300";
    EVP_CIPHER_CTX_free(ctx);
    return -1;
  }

  plaintext_len = len;

  if (EVP_DecryptFinal_ex(ctx, plaintext + len, &len) != 1) {
    RTC_LOG(LS_INFO) << "Error in Decryption 400";
    EVP_CIPHER_CTX_free(ctx);
    return -1;
  }

  plaintext_len += len;
  EVP_CIPHER_CTX_free(ctx);

  return plaintext_len;
}

bool matchesPattern(const std::vector<uint8_t>& data,
                    uint8_t nal_unit_type,
                    size_t nal_header_size = 5) {
  // Check for the start code 0x00000001 and NAL unit type. The pattern is
  // 5 bytes, so shorter headers or buffers can never match.
  if (nal_header_size < 5 || data.size() < nal_header_size)
    return false;
  return data[0] == 0x00 && data[1] == 0x00 && data[2] == 0x00 &&
         data[3] == 0x01 && (data[4] & 0x1F) == nal_unit_type;
}

int find_index_after_pattern(const std::vector<uint8_t>& encrypted_data,
                             uint8_t nal_unit_type,
                             size_t nal_header_size) {
  const size_t encrypted_size = encrypted_data.size();

  // `i + nal_header_size <= encrypted_size` avoids the size_t underflow of
  // `encrypted_size - nal_header_size` on frames shorter than one header.
  for (size_t i = 0; i + nal_header_size <= encrypted_size; i++) {
    if (matchesPattern(
            std::vector<uint8_t>(encrypted_data.begin() + i,
                                 encrypted_data.begin() + i + nal_header_size),
            nal_unit_type, nal_header_size)) {
      if (i + nal_header_size < encrypted_size) {
        return static_cast<int>(i);
      } else {
        return -1;  // Pattern found, but no value follows
      }
    }
  }
  return -1;  // Pattern not found
}

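std::vector<uint8_t> GenerateIVDecryptor(const uint8_t* unencrypted_data,
                                         size_t unencrypted_bytes) {
  // implement by yourself
  // It must return a 16-byte IV for AES-256-CTR, derived from the clear
  // bytes exactly the way the FreeSWITCH encryptor derives it.
}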

OctonFrameDecryptor::Result OctonFrameDecryptor::Decrypt(
    cricket::MediaType media_type,
    const std::vector<uint32_t>& csrcs,
    rtc::ArrayView<const uint8_t> additional_data,
    rtc::ArrayView<const uint8_t> encrypted_frame,
    rtc::ArrayView<uint8_t> frame) {
  RTC_LOG(LS_VERBOSE) << "[Decryptor] Ivan decrypting";

  const uint8_t* encrypted_data = encrypted_frame.data();
  const size_t encrypted_size = encrypted_frame.size();
  const size_t frame_size = frame.size();

  // Bytes sent in the clear at the start of the frame. For video: 4-byte
  // start code + 1-byte NAL header, then 7 more bytes that feed the IV.
  uint8_t unencrypted_bytes = 1;
  size_t nal_header_size = 0;
  enum nal_unit_type {
    NAL_UNIT_TYPE_NON_IDR = 1,
    NAL_UNIT_TYPE_IDR = 5,
    NAL_UNIT_TYPE_SPS = 7,
    NAL_UNIT_TYPE_PPS = 8
  };

  if (media_type == cricket::MEDIA_TYPE_VIDEO) {
    unencrypted_bytes = 12;
    nal_header_size = 5;
  } else if (media_type == cricket::MEDIA_TYPE_AUDIO) {
    unencrypted_bytes = 1;
  }

  // The output buffer must hold the whole plaintext (same size as the input).
  if (encrypted_size < unencrypted_bytes || frame_size < encrypted_size) {
    RTC_LOG(LS_INFO) << "[Decryptor] Ivan, Memory error";
    return Result(Status::kFailedToDecrypt, 0);
  }

  // Keyframes start with SPS/PPS: copy them unchanged and decrypt only the
  // IDR NAL unit that follows them.
  if (media_type == cricket::MEDIA_TYPE_VIDEO) {
    const std::vector<uint8_t> first_header(encrypted_data,
                                            encrypted_data + nal_header_size);
    if (matchesPattern(first_header, NAL_UNIT_TYPE_SPS) ||
        matchesPattern(first_header, NAL_UNIT_TYPE_PPS)) {
      RTC_LOG(LS_INFO)
          << "[Decryptor] Ivan, Skip Decryption on SPS/PPS with length "
          << encrypted_size;

      // Find the encrypted IDR NAL unit.
      const int payload_start = find_index_after_pattern(
          std::vector<uint8_t>(encrypted_data, encrypted_data + encrypted_size),
          NAL_UNIT_TYPE_IDR, nal_header_size);

      // No IDR (or too short for its clear bytes): pass the frame through.
      if (payload_start == -1 ||
          static_cast<size_t>(payload_start) + unencrypted_bytes >
              encrypted_size) {
        std::copy(encrypted_data, encrypted_data + encrypted_size,
                  frame.begin());
        return Result(Status::kOk, encrypted_size);
      }

      // SPS, PPS and the clear bytes of the IDR header are copied as-is.
      const size_t clear_end = payload_start + unencrypted_bytes;
      std::copy(encrypted_data, encrypted_data + clear_end, frame.begin());

      // Only the bytes after the clear IDR header are ciphertext.
      const size_t payload_length = encrypted_size - clear_end;
      std::vector<uint8_t> payload(encrypted_data + clear_end,
                                   encrypted_data + encrypted_size);

      std::vector<uint8_t> iv = GenerateIVDecryptor(
          encrypted_data + payload_start + nal_header_size,
          unencrypted_bytes - nal_header_size);
      std::vector<unsigned char> decryptedtext(payload_length);
      const int decryptedtext_len =
          decrypt(key_bytes.data(), payload.data(),
                  static_cast<int>(payload_length), iv.data(),
                  decryptedtext.data());

      if (decryptedtext_len > 0) {
        std::copy(decryptedtext.begin(),
                  decryptedtext.begin() + decryptedtext_len,
                  frame.begin() + clear_end);
        return Result(Status::kOk, encrypted_size);
      }
      RTC_LOG(LS_INFO) << "[Decryptor] Ivan, Decryption Failed";
      return Result(Status::kFailedToDecrypt, 0);
    }
  }

  // All other frames: the first unencrypted_bytes are in the clear and the
  // rest is AES-256-CTR ciphertext.
  std::copy(encrypted_data, encrypted_data + unencrypted_bytes, frame.begin());

  const size_t payload_length = encrypted_size - unencrypted_bytes;
  std::vector<uint8_t> payload(encrypted_data + unencrypted_bytes,
                               encrypted_data + encrypted_size);

  std::vector<uint8_t> iv = GenerateIVDecryptor(
      encrypted_data + nal_header_size, unencrypted_bytes - nal_header_size);
  std::vector<unsigned char> decryptedtext(payload_length);
  const int decryptedtext_len =
      decrypt(key_bytes.data(), payload.data(),
              static_cast<int>(payload_length), iv.data(),
              decryptedtext.data());

  if (decryptedtext_len > 0) {
    std::copy(decryptedtext.begin(), decryptedtext.begin() + decryptedtext_len,
              frame.begin() + unencrypted_bytes);
    return Result(Status::kOk, encrypted_size);
  }
  RTC_LOG(LS_INFO) << "[Decryptor] Ivan, Decryption Failed";
  return Result(Status::kFailedToDecrypt, 0);
}

size_t OctonFrameDecryptor::GetMaxPlaintextByteSize(
    cricket::MediaType media_type,
    size_t encrypted_frame_size) {
  return encrypted_frame_size;
}

}  // namespace webrtc



Philipp Hancke

Jul 24, 2024, 8:17:25 PM
to discuss...@googlegroups.com
On Sat, Jul 20, 2024 at 20:56, Khun V <witaly...@gmail.com> wrote:
> PS. Philipp, you have a bug in your H.264 E2EE implementation prototype, namely "const nalType = view.getUint8(i + 4) & 0b1111": nal_unit_type is 5 bits long, so the mask has to be 0x1f / 0b11111. There are some other subtle issues, mostly originating from the extremely bad decision by the H.264/AVC (and HEVC, and VVC) standardization committee to incorporate transport/muxing into the video compression standard. These 00 00 00 01 start codes are a massive pain when you do encryption: they can also be 3 bytes long (00 00 01), and some encoders even mix 3- and 4-byte start codes within the same access unit (aka frame), making it even trickier.

Good catch, thank you! Fixed (initially, of course, in only one of the two places). I agree on start code emulation, and yeah, H.264... my condolences to anyone who has to use it.
