Sniffing for WebM

Philip Jägenstedt

Jan 27, 2011, 5:04:56 PM

to webm-d...@webmproject.org

All browsers except Firefox and to some extent Opera are relying on
sniffing to determine the media type for <video>. However, exactly how to
sniff for WebM hasn't been defined, until
<http://tools.ietf.org/html/draft-ietf-websec-mime-sniff-01> was updated
recently:

So, as you can see, only the EBML header is checked, so the doctype needs
checking too. What is the best and safest way to sniff for WebM? Can the
doctype appear anywhere or will it always be at a fixed offset?

You can see what Opera (using GStreamer) does in ebml_check_header in
<http://cgit.freedesktop.org/gstreamer/gst-plugins-base/tree/gst/typefind/gsttypefindfunctions.c>,
but I'm kind of hoping there's a cleaner way.

--
Philip Jägenstedt
Core Developer
Opera Software

Matthew Gregan

Jan 27, 2011, 5:16:24 PM

to webm-d...@webmproject.org

At 2011-01-27T23:04:56+0100, Philip Jägenstedt wrote:
> So, as you can see, only the EBML header is checked, so the doctype
> needs checking too. What is the best and safest way to sniff for
> WebM? Can the doctype appear anywhere or will it always be at a
> fixed offset?

As far as I can tell, the doctype can appear anywhere within the first ~130
bytes after the EBML ID. If it's possible to require a set of EBML header
elements with a fixed order and encoding size, I'm happy to support that in
Firefox/libnestegg (and sniff for it, when it comes to that), but I think
it's too late for that kind of change now.

> You can see what Opera (using GStreamer) does in ebml_check_header
> in <http://cgit.freedesktop.org/gstreamer/gst-plugins-base/tree/gst/typefind/gsttypefindfunctions.c>,
> but I'm kind of hoping there's a cleaner way.

That code should at least check that the doctype string is preceded by a
doctype ID and element size.

Cheers,
-mjg
--
Matthew Gregan |/
/| kin...@flim.org

Steve Lhomme

Jan 28, 2011, 8:51:14 AM

to webm-discuss

On Thu, Jan 27, 2011 at 11:16 PM, Matthew Gregan <kin...@flim.org> wrote:
>> You can see what Opera (using GStreamer) does in ebml_check_header
>> in <http://cgit.freedesktop.org/gstreamer/gst-plugins-base/tree/gst/typefind/gsttypefindfunctions.c>,
>> but I'm kind of hoping there's a cleaner way.
>
> That code should at least check that the doctype string is preceded by a
> doctype ID and element size.

Yes, and also keep in mind that the size in EBML can be coded in 1 to
8 bytes. So you could have:

[42][82][84]webm
[42][82][40][04]webm
[42][82][20][00][04]webm
[42][82][10][00][00][04]webm
[42][82][08][00][00][00][04]webm
[42][82][04][00][00][00][00][04]webm
[42][82][02][00][00][00][00][00][04]webm
[42][82][01][00][00][00][00][00][00][04]webm
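
A minimal Python sketch of how a size field like the ones above could be decoded. The name `read_vint_size` is made up for illustration and is not taken from any existing library; the sketch assumes the usual EBML rule that the number of leading zero bits in the first byte, plus one, gives the encoded length.

```python
from __future__ import annotations


def read_vint_size(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode an EBML element size starting at buf[pos].

    Returns (value, length_in_bytes). Hypothetical helper for illustration.
    """
    if pos >= len(buf):
        raise ValueError("no data at this position")
    first = buf[pos]
    if first == 0:
        # No marker bit in the first byte: would need more than 8 bytes.
        raise ValueError("size coded with more than 8 bytes")
    # Leading zero bits + 1 = encoded length (1 to 8 bytes).
    length = 8 - first.bit_length() + 1
    if pos + length > len(buf):
        raise ValueError("truncated size field")
    # Clear the length marker bit, then append the remaining bytes.
    value = first & ((1 << (8 - length)) - 1)
    for b in buf[pos + 1 : pos + length]:
        value = (value << 8) | b
    return value, length
```

With this, `[84]`, `[40][04]` and `[01][00][00][00][00][00][00][04]` all decode to a size of 4, with encoded lengths of 1, 2 and 8 bytes respectively.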

--
Steve Lhomme
Matroska association Chairman

Philip Jägenstedt

Jan 28, 2011, 9:57:16 AM

to webm-discuss, Steve Lhomme

On Fri, 28 Jan 2011 14:51:14 +0100, Steve Lhomme <slh...@matroska.org>
wrote:

Is there any limitation to the size of the EBML header, or does one in
theory have to sniff an arbitrary amount of data? The sniffing algorithm
uses at most 512 bytes of data by default.

In the context of <video> there is no problem with assuming that all EBML
files are WebM -- those that aren't will just fail decoding a little
later. However, when navigating directly to a Matroska file, it wouldn't
be great if one sniffs it as WebM, tries playing it using <video> and that
just fails.

Also, what about a doctype that is >4 bytes long and zero-padding, e.g.
"webm\0" ? I'm guessing lots of software will handle that as WebM due to
using strcmp, but is it something that exists in the wild?

Frank Galligan

Jan 28, 2011, 10:32:51 AM

to webm-d...@webmproject.org, Steve Lhomme

On Fri, Jan 28, 2011 at 9:57 AM, Philip Jägenstedt <phi...@opera.com> wrote:
> On Fri, 28 Jan 2011 14:51:14 +0100, Steve Lhomme <slh...@matroska.org> wrote:
>> On Thu, Jan 27, 2011 at 11:16 PM, Matthew Gregan <kin...@flim.org> wrote:
>>>> You can see what Opera (using GStreamer) does in ebml_check_header
>>>> in <http://cgit.freedesktop.org/gstreamer/gst-plugins-base/tree/gst/typefind/gsttypefindfunctions.c>,
>>>> but I'm kind of hoping there's a cleaner way.
>>>
>>> That code should at least check that the doctype string is preceded by a
>>> doctype ID and element size.
>>
>> Yes, and also keep in mind that the size in EBML can be coded in 1 to
>> 8 bytes. So you could have:
>>
>> [42][82][84]webm
>> [42][82][40][04]webm
>> [42][82][20][00][04]webm
>> [42][82][10][00][00][04]webm
>> [42][82][08][00][00][00][04]webm
>> [42][82][04][00][00][00][00][04]webm
>> [42][82][02][00][00][00][00][00][04]webm
>> [42][82][01][00][00][00][00][00][00][04]webm
>
> Is there any limitation to the size of the EBML header, or does one in
> theory have to sniff an arbitrary amount of data? The sniffing algorithm
> uses at most 512 bytes of data by default.

In theory it could be an arbitrary amount of data. In theory a muxer could put a void element after the EBML id.

> In the context of <video> there is no problem with assuming that all EBML
> files are WebM -- those that aren't will just fail decoding a little
> later. However, when navigating directly to a Matroska file, it wouldn't
> be great if one sniffs it as WebM, tries playing it using <video> and that
> just fails.
>
> Also, what about a doctype that is >4 bytes long and zero-padding, e.g.
> "webm\0" ? I'm guessing lots of software will handle that as WebM due to
> using strcmp, but is it something that exists in the wild?

I think we saw some files that had a doctype like this "webm\0\0\0\0" . Basically the muxer reserved enough space for "matroska" but put in "webm" with padding.


Steve Lhomme

Jan 28, 2011, 10:33:31 AM

to Philip Jägenstedt, webm-discuss

On Fri, Jan 28, 2011 at 3:57 PM, Philip Jägenstedt <phi...@opera.com> wrote:
> On Fri, 28 Jan 2011 14:51:14 +0100, Steve Lhomme <slh...@matroska.org>
> wrote:
>
>> On Thu, Jan 27, 2011 at 11:16 PM, Matthew Gregan <kin...@flim.org> wrote:
>>>>
>>>> You can see what Opera (using GStreamer) does in ebml_check_header
>>>> in
>>>> <http://cgit.freedesktop.org/gstreamer/gst-plugins-base/tree/gst/typefind/gsttypefindfunctions.c>,
>>>> but I'm kind of hoping there's a cleaner way.
>>>
>>> That code should at least check that the doctype string is preceded by a
>>> doctype ID and element size.
>>
>> Yes, and also keep in mind that the size in EBML can be coded in 1 to
>> 8 bytes. So you could have:
>>
>> [42][82][84]webm
>> [42][82][40][04]webm
>> [42][82][20][00][04]webm
>> [42][82][10][00][00][04]webm
>> [42][82][08][00][00][00][04]webm
>> [42][82][04][00][00][00][00][04]webm
>> [42][82][02][00][00][00][00][00][04]webm
>> [42][82][01][00][00][00][00][00][00][04]webm
>>
>
> Is there any limitation to the size of the EBML header, or does one in
> theory have to sniff an arbitrary amount of data? The sniffing algorithm
> uses at most 512 bytes of data by default.

By convention, EBML IDs are never longer than 4 bytes and the size is
never coded with more than 8 bytes. So the EBML header of a Matroska
file is at most 138 bytes, and 134 bytes for WebM. That's provided
there are no additional junk or custom elements in there, which is
very unlikely as of today.
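
To make the bound concrete, here is a rough Python sketch that walks the EBML header inside a fixed sniff buffer and returns the DocType payload. The names `find_doctype`, `EBML_MAGIC` and `DOCTYPE_ID` are invented for illustration; this is not what GStreamer, libnestegg or the sniffing draft does. It assumes IDs of at most 4 bytes and sizes of at most 8 bytes, and simply gives up if the DocType does not fit in the buffer.

```python
from __future__ import annotations

EBML_MAGIC = b"\x1a\x45\xdf\xa3"  # ID of the top-level EBML header element
DOCTYPE_ID = b"\x42\x82"          # ID of the DocType element


def _vint_length(first_byte: int, max_len: int) -> int:
    """Encoded length implied by the leading zero bits of the first byte."""
    length = 8 - first_byte.bit_length() + 1
    if first_byte == 0 or length > max_len:
        raise ValueError("invalid length marker")
    return length


def _read_size(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (size value, position just after the size field)."""
    length = _vint_length(buf[pos], 8)
    if pos + length > len(buf):
        raise ValueError("truncated size field")
    value = buf[pos] & ((1 << (8 - length)) - 1)
    for b in buf[pos + 1 : pos + length]:
        value = (value << 8) | b
    return value, pos + length


def find_doctype(buf: bytes) -> bytes | None:
    """Return the raw DocType payload, or None if it is not found in buf."""
    if not buf.startswith(EBML_MAGIC):
        return None
    try:
        header_size, pos = _read_size(buf, len(EBML_MAGIC))
        end = min(pos + header_size, len(buf))
        while pos < end:
            id_len = _vint_length(buf[pos], 4)
            element_id = buf[pos : pos + id_len]
            size, pos = _read_size(buf, pos + id_len)
            if element_id == DOCTYPE_ID:
                payload = buf[pos : pos + size]
                return payload if len(payload) == size else None
            pos += size  # skip elements that come before the DocType
    except (ValueError, IndexError):
        return None
    return None
```

A caller would pass only the bytes it is willing to sniff, e.g. `find_doctype(data[:512])`. Elements in front of the DocType are skipped by their declared size rather than searched byte by byte, so a stray "webm" string elsewhere in the buffer is not mistaken for the doctype.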

There were plans to add a DTD-like system in the EBML header, but if
that ever happens it should go after the current DocType.

> In the context of <video> there is no problem with assuming that all EBML
> files are WebM -- those that aren't will just fail decoding a little later.
> However, when navigating directly to a Matroska file, it wouldn't be great
> if one sniffs it as WebM, tries playing it using <video> and that just
> fails.
>
> Also, what about a doctype that is >4 bytes long and zero-padding, e.g.
> "webm\0" ? I'm guessing lots of software will handle that as WebM due to
> using strcmp, but is it something that exists in the wild?

As specified in the EBML specs, 0 padding is legal:
String - Printable ASCII (0x20 to 0x7E), zero-padded when needed
UTF-8 - Unicode string, zero padded when needed (RFC 2279)

IIRC in the early days of WebM there were some tools outputting
"matroska" in the EBML header, and it was later edited to replace it
with "webm" and four 0x00.
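
One way a sniffer could tolerate this, sketched in Python: strip trailing zero bytes from the DocType payload before comparing. The function name `doctype_is_webm` is hypothetical, and the choice to reject bytes after the padding is an assumption of this sketch, not something the EBML spec text quoted above spells out.

```python
def doctype_is_webm(payload: bytes) -> bool:
    """True if the DocType payload is "webm", optionally zero-padded.

    Only trailing 0x00 bytes are ignored; anything else after "webm"
    (including data following a zero byte) makes the check fail.
    """
    return payload.rstrip(b"\x00") == b"webm"
```

Note that this is slightly stricter than a C `strcmp`, which stops at the first zero byte and would also accept a payload such as `webm\0x`.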

> --
> Philip Jägenstedt
> Core Developer
> Opera Software
>

--

Philip Jägenstedt

Feb 22, 2011, 3:56:12 AM

to Steve Lhomme, webm-d...@webmproject.org

On Fri, 28 Jan 2011 16:33:31 +0100, Steve Lhomme <slh...@matroska.org>
wrote:

OK, sounds good.

> There were plans to add a DTD-like system in the EBML header, but if
> that ever happens it should go after the current DocType.
>
>> In the context of <video> there is no problem with assuming that all
>> EBML files are WebM -- those that aren't will just fail decoding a
>> little later. However, when navigating directly to a Matroska file,
>> it wouldn't be great if one sniffs it as WebM, tries playing it using
>> <video> and that just fails.
>>
>> Also, what about a doctype that is >4 bytes long and zero-padding, e.g.
>> "webm\0" ? I'm guessing lots of software will handle that as WebM due to
>> using strcmp, but is it something that exists in the wild?
>
> As specified in the EBML specs, 0 padding is legal:
> String - Printable ASCII (0x20 to 0x7E), zero-padded when needed
> UTF-8 - Unicode string, zero padded when needed (RFC 2279)
>
> IIRC in the early days of WebM there were some tools outputting
> "matroska" in the EBML header, and it was later edited to replace it
> with "webm" and four 0x00.

Indeed, I made some files like that myself, so the question is if the
sniffing needs to allow for it.

Steve Lhomme

Feb 22, 2011, 4:02:58 AM

to Philip Jägenstedt, webm-d...@webmproject.org

Since the spec says it's legal, I'd say yes.
