+-------------------+-------------------+-----------------+------------+
| FF FF FF FF | 1A 45 DF A3 | video/webm | Safe |
| Comment: The WebM signature [TODO: Use more octets?] |
+-------------------+-------------------+-----------------+------------+
So, as you can see, only the EBML header is checked, so the doctype needs
checking too. What is the best and safest way to sniff for WebM? Can the
doctype appear anywhere or will it always be at a fixed offset?
You can see what Opera (using GStreamer) does in ebml_check_header in
<http://cgit.freedesktop.org/gstreamer/gst-plugins-base/tree/gst/typefind/gsttypefindfunctions.c>,
but I'm kind of hoping there's a cleaner way.
--
Philip Jägenstedt
Core Developer
Opera Software
As far as I can tell, the doctype can appear anywhere within the first ~130
bytes after the EBML ID. If it's possible to require a set of EBML header
elements with a fixed order and encoding size, I'm happy to support that in
Firefox/libnestegg (and sniff for it, when it comes to that), but I think
it's too late for that kind of change now.
> You can see what Opera (using GStreamer) does in ebml_check_header
> in <http://cgit.freedesktop.org/gstreamer/gst-plugins-base/tree/gst/typefind/gsttypefindfunctions.c>,
> but I'm kind of hoping there's a cleaner way.
That code should at least check that the doctype string is preceded by a
doctype ID and element size.
Cheers,
-mjg
--
Matthew Gregan |/
/| kin...@flim.org
Yes, and also keep in mind that the size in EBML can be coded in 1 to
8 bytes. So you could have:
[42][82][84]webm
[42][82][40][04]webm
[42][82][20][00][04]webm
[42][82][10][00][00][04]webm
[42][82][08][00][00][00][04]webm
[42][82][04][00][00][00][00][04]webm
[42][82][02][00][00][00][00][00][04]webm
[42][82][01][00][00][00][00][00][00][04]webm
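
To make the eight variants above concrete, here is a minimal sketch of decoding an EBML size field. The function name decode_ebml_size is made up for illustration and is not taken from GStreamer or libnestegg. It assumes the usual EBML rule that the count of leading zero bits in the first byte, plus one, gives the field length.

```python
from typing import Tuple


def decode_ebml_size(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode the EBML size field at data[pos]; return (value, field_length)."""
    first = data[pos]
    if first == 0:
        # No marker bit in the first byte would mean a field longer than 8 bytes.
        raise ValueError("size fields longer than 8 bytes are not allowed")
    length = 9 - first.bit_length()      # 0x80 -> 1 byte, 0x01 -> 8 bytes
    if pos + length > len(data):
        raise ValueError("truncated size field")
    value = first & (0xFF >> length)     # clear the length marker bit
    for byte in data[pos + 1:pos + length]:
        value = (value << 8) | byte      # remaining bytes are big-endian
    return value, length
```

Fed the size bytes from each line of the list (everything between [42][82] and "webm"), this should return a value of 4 every time, with the field length running from 1 to 8.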
--
Steve Lhomme
Matroska association Chairman
Is there any limitation to the size of the EBML header, or does one in
theory have to sniff an arbitrary amount of data? The sniffing algorithm
uses at most 512 bytes of data by default.
In the context of <video> there is no problem with assuming that all EBML
files are WebM -- those that aren't will just fail decoding a little
later. However, when navigating directly to a Matroska file, it wouldn't
be great if one sniffs it as WebM, tries playing it using <video> and that
just fails.
Also, what about a doctype that is >4 bytes long and zero-padding, e.g.
"webm\0" ? I'm guessing lots of software will handle that as WebM due to
using strcmp, but is it something that exists in the wild?
On Fri, 28 Jan 2011 14:51:14 +0100, Steve Lhomme <slh...@matroska.org> wrote:
> On Thu, Jan 27, 2011 at 11:16 PM, Matthew Gregan <kin...@flim.org> wrote:
>>> You can see what Opera (using GStreamer) does in ebml_check_header
>>> in <http://cgit.freedesktop.org/gstreamer/gst-plugins-base/tree/gst/typefind/gsttypefindfunctions.c>,
>>> but I'm kind of hoping there's a cleaner way.
>> That code should at least check that the doctype string is preceded by a
>> doctype ID and element size.
> Yes, and also keep in mind that the size in EBML can be coded in 1 to
> 8 bytes. So you could have:
> [42][82][84]webm
> [42][82][40][04]webm
> [42][82][20][00][04]webm
> [42][82][10][00][00][04]webm
> [42][82][08][00][00][00][04]webm
> [42][82][04][00][00][00][00][04]webm
> [42][82][02][00][00][00][00][00][04]webm
> [42][82][01][00][00][00][00][00][00][04]webm
--
Philip Jägenstedt
Core Developer
Opera Software
By convention EBML IDs are never bigger than 4 bytes and the size is
never coded with more than 8 bytes. So the EBML header is at most 138
bytes for a Matroska file and 134 bytes for WebM. That's provided
there are no additional junk/custom elements in there, which is very
unlikely as of today.
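
Given that bound, a sniffer can walk the header elements inside its 512-byte window instead of searching for the string "webm". The following is only a sketch under those assumptions. The names read_vint and find_doctype are hypothetical, and this is not what GStreamer or libnestegg actually does. It returns the raw DocType payload only when it is preceded by the DocType ID (0x4282) and a valid size.

```python
EBML_ID = b"\x1a\x45\xdf\xa3"   # EBML header ID, the signature from the table
DOCTYPE_ID = 0x4282             # DocType element ID


def read_vint(data, pos, keep_marker):
    """Read one EBML variable-length integer; return (value, next_pos) or None."""
    if pos >= len(data) or data[pos] == 0:
        return None
    length = 9 - data[pos].bit_length()
    if pos + length > len(data):
        return None
    # IDs keep their length marker bit, sizes drop it.
    value = data[pos] if keep_marker else data[pos] & (0xFF >> length)
    for byte in data[pos + 1:pos + length]:
        value = (value << 8) | byte
    return value, pos + length


def find_doctype(data, limit=512):
    """Return the raw DocType payload of an EBML header, or None if not found."""
    data = data[:limit]
    if not data.startswith(EBML_ID):
        return None
    header = read_vint(data, len(EBML_ID), keep_marker=False)
    if header is None:
        return None
    header_size, pos = header
    end = min(pos + header_size, len(data))
    while pos < end:
        elem_id = read_vint(data, pos, keep_marker=True)
        if elem_id is None:
            return None
        elem_size = read_vint(data, elem_id[1], keep_marker=False)
        if elem_size is None:
            return None
        size, payload = elem_size
        if payload + size > end:
            return None          # element runs past the header or the window
        if elem_id[0] == DOCTYPE_ID:
            return data[payload:payload + size]
        pos = payload + size     # skip over any other header element
    return None
```

Because every element is skipped by its declared size, the position of DocType within the header does not matter, and a stray "webm" that is not wrapped in a DocType element is not matched.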
There were plans to add a DTD-like system in the EBML header, but if
that ever happens it should go after the current DocType.
> In the context of <video> there is no problem with assuming that all EBML
> files are WebM -- those that aren't will just fail decoding a little later.
> However, when navigating directly to a Matroska file, it wouldn't be great
> if one sniffs it as WebM, tries playing it using <video> and that just
> fails.
>
> Also, what about a doctype that is >4 bytes long and zero-padding, e.g.
> "webm\0" ? I'm guessing lots of software will handle that as WebM due to
> using strcmp, but is it something that exists in the wild?
As specified in the EBML specs, 0 padding is legal:
String - Printable ASCII (0x20 to 0x7E), zero-padded when needed
UTF-8 - Unicode string, zero padded when needed (RFC 2279)
IIRC in the early days of WebM there were some tools outputting
"matroska" in the EBML header and it was later edited to replace it
with "webm" and four 0x00.
--
OK, sounds good.
> There were plans to add a DTD-like system in the EBML header, but if
> that ever happens it should go after the current DocType.
>
>> In the context of <video> there is no problem with assuming that all EBML
>> files are WebM -- those that aren't will just fail decoding a little later.
>> However, when navigating directly to a Matroska file, it wouldn't be great
>> if one sniffs it as WebM, tries playing it using <video> and that just
>> fails.
>>
>> Also, what about a doctype that is >4 bytes long and zero-padding, e.g.
>> "webm\0" ? I'm guessing lots of software will handle that as WebM due to
>> using strcmp, but is it something that exists in the wild?
>
> As specified in the EBML specs, 0 padding is legal:
> String - Printable ASCII (0x20 to 0x7E), zero-padded when needed
> UTF-8 - Unicode string, zero padded when needed (RFC 2279)
>
> IIRC in the early days of WebM there were some tools outputting
> "matroska" in the EBML header and it was later edited to replace it
> with "webm" and four 0x00.
Indeed, I made some files like that myself, so the question is if the
sniffing needs to allow for it.
Since the spec says it's legal, I'd say yes.
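
For what it's worth, allowing for the padding costs very little once the DocType payload has been extracted. A minimal sketch, where the name doctype_is is made up for illustration:

```python
def doctype_is(raw: bytes, expected: bytes) -> bool:
    """Compare a DocType payload with an expected name, ignoring zero padding."""
    # Only trailing 0x00 bytes are stripped, so b"webm\x00\x00\x00\x00"
    # (an edited "matroska" header) matches b"webm", but a payload with
    # data after the first NUL does not.
    return raw.rstrip(b"\x00") == expected
```

Note that this is slightly stricter than a strcmp-style check, which would also accept bytes following the first NUL. Whether that difference matters for files in the wild is not something the thread settles.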