New API Proposal for Video Thumbnail Generation

Blake Wu

unread,

Dec 19, 2013, 4:34:10 AM12/19/13

to dev-w...@lists.mozilla.org

Hi All,

I would like to propose a new API for video thumbnail generation and need your opinions and review.

Currently thumbnail is generated by the following steps in Gaia.
1. Seek to 1/10th of the duration.
2. Use Canvas to draw image and save it as a file.

For advantages to have a dedicated API is
1. Remove unnecessary audio seeking in the step#1.
2. Better to improve, like applying software codec and skip many unnecessary things, like going through media stage machine and reader in Gecko.
3. Easy and clear to use for Gaia developers.

Proposal:
Introduce a new Web API, mozGetThumbnail (double thumbnailTime), and a event(ongetthumbnail) or other async mechanisms to notify that thumbnail is ready.
ThumbnailTime is optional. If it is negative, a default thumbnail generation mechanism in the gecko will apply. Otherwise, an I-frame close to that time will be chosen as thumbnail.

For more info, please go to bug 942078 (https://bugzilla.mozilla.org/show_bug.cgi?id=942078)
Thanks.

Best Wishes,
Blake Wu
System Team, Device Engineering
Mozilla Taiwan

Blake Wu

unread,

Dec 19, 2013, 6:29:09 AM12/19/13

to dev-w...@lists.mozilla.org

Hi All,

Add one more advantage as #4.

1. Remove unnecessary audio seeking in the step#1.
2. Better to improve, like applying software codec and skip many unnecessary things, like going through media stage machine and reader in Gecko.
3. Easy and clear to use for Gaia developers.

4. Reduce thumbnail generating time (current seek is to seek to previous I-frame and do decoding until reaching the desired seeking time)

Best Wishes,
Blake Wu
System Team, Device Engineering
Mozilla Taiwan

----- 原始郵件 -----
寄件者: "Blake Wu" <b...@mozilla.com>
收件者: dev-w...@lists.mozilla.org
寄件備份: 2013 12 月 19 星期四下午 5:34:10
主旨: New API Proposal for Video Thumbnail Generation

Ehsan Akhgari

unread,

Dec 19, 2013, 10:27:27 AM12/19/13

to Blake Wu, dev-w...@lists.mozilla.org

I think that even if we expose such an API we still need to do all of
the decoding work on the Gecko side, including audio seeking, going
through the decoder state machine, finding the previous key frame and
decoding to the requested time, etc. It will just change where this
work happens (gaia versus gecko), so I think that the only actual
advantage would be #3 below. Have you tried to implement this API in
JavaScript as a self-contained module? I don't see any reason why that
can't be done...

Cheers,
Ehsan

> _______________________________________________
> dev-webapi mailing list
> dev-w...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-webapi
>

Sotaro Ikeda

unread,

Dec 19, 2013, 2:52:45 PM12/19/13

to b...@mozilla.com, dev-w...@lists.mozilla.org, ehsan....@gmail.com

There is another bug for thumbnail generation that created in last spring.
- Bug 873959 - Create an API for generating thumbnails for HTML videos

Current b2g's thumbnail generation is very verbose. If we want to becomes competitive to android in video thumbnail generation area. Such kind of dedicated API is necessary.
To speed up the video thumbnail generation, the following things in gecko side becomes necessary.
- No Audio Track Handling
+ Do not load audio track at all.
+ MediaDecoderStateMachine allocate Audio out, it spends a time.
- No new thread creation for each thumbnail generation
- Thumbnail generation seek point is decided by gecko side
+ gecko side could know exact sync frame points
- Get Thumbnail only from sync frame
+ other type of frame decoding takes a time.
- Decode only thumbnail video frame
+ MediaDecoderReader decode first video frame after when metadata is loaded.
After the first video frame decodeing, seek and get video frame for thumbnail spend a lot of time.

sotaro

Ehsan Akhgari

unread,

Dec 19, 2013, 5:02:34 PM12/19/13

to Sotaro Ikeda, b...@mozilla.com, dev-w...@lists.mozilla.org

On 12/19/2013, 2:52 PM, Sotaro Ikeda wrote:
>
> There is another bug for thumbnail generation that created in last spring.
> - Bug 873959 - Create an API for generating thumbnails for HTML videos

Hmm interesting. I think what Blake proposed was a little different.
Bug 873959 seems to be about an API which generates a thumbnail leaving
all of the details to Gecko (e.g., Gecko would decide where in the video
to take the thumbnail from, etc.)

> Current b2g's thumbnail generation is very verbose. If we want to becomes competitive to android in video thumbnail generation area. Such kind of dedicated API is necessary.
> To speed up the video thumbnail generation, the following things in gecko side becomes necessary.
> - No Audio Track Handling
> + Do not load audio track at all.
> + MediaDecoderStateMachine allocate Audio out, it spends a time.

Can't we fix the HTMLMediaElement to not seek the audio track if the
video element is muted for example?

> - No new thread creation for each thumbnail generation

Which thread are you talking about?

> - Thumbnail generation seek point is decided by gecko side
> + gecko side could know exact sync frame points

Hmm, actually here's a question. Why do we seek into the video at all?
Why don't we stick to the first frame? Also, why is it that Gecko is
better able to decide where the thumbnail frame should be? Is it just
because you want the screenshot to be taken from a key frame so that the
decoding is cheaper? (And if that's the case, why not the first video
frame? :-)

> - Get Thumbnail only from sync frame
> + other type of frame decoding takes a time.

See above.

> - Decode only thumbnail video frame
> + MediaDecoderReader decode first video frame after when metadata is loaded.
> After the first video frame decodeing, seek and get video frame for thumbnail spend a lot of time.

I'm not sure if I understand this point, sorry.

Last but not least, have we talked to other browser vendors to see if
they're interested in this API? Presumably it can be used to render the
thumbnail for a video that is not playing yet, so if this is indeed a
useful API then it may be useful to other browser vendors as well.

Cheers,
Ehsan

Blake Wu

unread,

Dec 19, 2013, 10:23:27 PM12/19/13

to Ehsan Akhgari, dev-w...@lists.mozilla.org, Sotaro Ikeda

Hi Ehsan,
My comment is inline with "B>".

Hi Sotaro,
Please correct me if I misunderstood what you mean.

Best Wishes,
Blake Wu
System Team, Device Engineering
Mozilla Taiwan

----- 原始郵件 -----

寄件者: "Ehsan Akhgari" <ehsan....@gmail.com>
收件者: "Sotaro Ikeda" <sik...@mozilla.com>, b...@mozilla.com
副本: dev-w...@lists.mozilla.org
寄件備份: 2013 12 月 20 星期五上午 6:02:34
主旨: Re: New API Proposal for Video Thumbnail Generation

On 12/19/2013, 2:52 PM, Sotaro Ikeda wrote:
>
> There is another bug for thumbnail generation that created in last spring.
> - Bug 873959 - Create an API for generating thumbnails for HTML videos

Hmm interesting. I think what Blake proposed was a little different.
Bug 873959 seems to be about an API which generates a thumbnail leaving
all of the details to Gecko (e.g., Gecko would decide where in the video
to take the thumbnail from, etc.)
> Current b2g's thumbnail generation is very verbose. If we want to becomes competitive to android in video thumbnail generation area. Such kind of dedicated API is necessary.
> To speed up the video thumbnail generation, the following things in gecko side becomes necessary.
> - No Audio Track Handling
> + Do not load audio track at all.
> + MediaDecoderStateMachine allocate Audio out, it spends a time.

Can't we fix the HTMLMediaElement to not seek the audio track if the
video element is muted for example?

B> That may impact ordinary use case. There exits an observable latency between mute-> unmute if we uninit audio stuffs when mute. The user experience would be a concern.

> - No new thread creation for each thumbnail generation

Which thread are you talking about?

> - Thumbnail generation seek point is decided by gecko side
> + gecko side could know exact sync frame points

Hmm, actually here's a question. Why do we seek into the video at all?
Why don't we stick to the first frame? Also, why is it that Gecko is
better able to decide where the thumbnail frame should be? Is it just
because you want the screenshot to be taken from a key frame so that the
decoding is cheaper? (And if that's the case, why not the first video
frame? :-)

B> That is because the first video frame may be a black screen or dull frame. We plan to pick the biggest key frame as thumbnail which should contain rich information.
For this kind of frame-level process, it should be proper to be done in Gecko.

> - Get Thumbnail only from sync frame
> + other type of frame decoding takes a time.

See above.

> - Decode only thumbnail video frame
> + MediaDecoderReader decode first video frame after when metadata is loaded.
> After the first video frame decodeing, seek and get video frame for thumbnail spend a lot of time.

I'm not sure if I understand this point, sorry.

B> For APP part, seeking time is in the time level. In Gecko, that time will be interpreted in frame level. If the corresponding frame is not key frame, it requires to decode from the previous key frame to the corresponding frame. That is why it takes time and how accurate seeking works. And there is a Bug 778077 related (https://bugzilla.mozilla.org/show_bug.cgi?id=778077).

Sotaro Ikeda

unread,

Dec 20, 2013, 10:11:39 AM12/20/13

to Ehsan Akhgari, b...@mozilla.com, dev-w...@lists.mozilla.org

Hi Ehsan and Blake,

>> Current b2g's thumbnail generation is very verbose. If we want to becomes competitive to android in video thumbnail generation area. Such kind of dedicated API is necessary.
>> To speed up the video thumbnail generation, the following things in gecko side becomes necessary.
>> - No Audio Track Handling
>> + Do not load audio track at all.
>> + MediaDecoderStateMachine allocate Audio out, it spends a time.
>
>Can't we fix the HTMLMediaElement to not seek the audio track if the
>video element is muted for example?

Yeah, somehow we need a way to disable audio loading at all for the video tag before start loading the video.

>> - No new thread creation for each thumbnail generation
>
>Which thread are you talking about?

MediaDecoderStateMachine's decoding thread. It is created by each MediaDecoderStateMachine.

>> - Thumbnail generation seek point is decided by gecko side
>> + gecko side could know exact sync frame points
>
>Hmm, actually here's a question. Why do we seek into the video at all?
> Why don't we stick to the first frame? Also, why is it that Gecko is
>better able to decide where the thumbnail frame should be? Is it just
>because you want the screenshot to be taken from a key frame so that the
>decoding is cheaper? (And if that's the case, why not the first video
>frame? :-)

A lot of video's first frames are just black or movie companies logo.
As a thumbnail, user want to identify a content of the video.
In android, a biggest video sample within fist 20 sync video samples is chosen as thumbnail in SampleTable::findThumbnailSample().

http://androidxref.com/4.4_r1/xref/frameworks/av/media/libstagefright/SampleTable.cpp#727

A Reason of choosing sync frame is just because of decoding cost.

>> - Decode only thumbnail video frame
>> + MediaDecoderReader decode first video frame after when metadata is loaded.
>> After the first video frame decodeing, seek and get video frame for thumbnail spend a lot of time.
>
>I'm not sure if I understand this point, sorry.

For thumbnail generation, Gaia side(video/galleary/camera app) at first load video and then seek to thumbnail time.
This causes a lot of unnecessary video decoding.
When video is lodaded, MediaDecoderStateMachine decodes the video's metadata at first.
During the metadata loading, current implementation of MediaDecoderStateMachine decodes first video frame by calling MediaDecoderReader::FindStartTime()

http://mxr.mozilla.org/mozilla-central/source/content/media/MediaDecoderReader.cpp#454

After the first video frame decode, MediaDecoderStateMachine seek to thumbnail time and decode video frame for thumbnail.
It is not efficient form thumbnail generation point of view.

sotaro

----- Original Message -----
From: "Ehsan Akhgari" <ehsan....@gmail.com>
To: "Sotaro Ikeda" <sik...@mozilla.com>, b...@mozilla.com
Cc: dev-w...@lists.mozilla.org
Sent: Thursday, December 19, 2013 5:02:34 PM
Subject: Re: New API Proposal for Video Thumbnail Generation

On 12/19/2013, 2:52 PM, Sotaro Ikeda wrote:
>
> There is another bug for thumbnail generation that created in last spring.
> - Bug 873959 - Create an API for generating thumbnails for HTML videos

Hmm interesting. I think what Blake proposed was a little different.
Bug 873959 seems to be about an API which generates a thumbnail leaving
all of the details to Gecko (e.g., Gecko would decide where in the video
to take the thumbnail from, etc.)

> Current b2g's thumbnail generation is very verbose. If we want to becomes competitive to android in video thumbnail generation area. Such kind of dedicated API is necessary.
> To speed up the video thumbnail generation, the following things in gecko side becomes necessary.
> - No Audio Track Handling
> + Do not load audio track at all.
> + MediaDecoderStateMachine allocate Audio out, it spends a time.

Can't we fix the HTMLMediaElement to not seek the audio track if the
video element is muted for example?

> - No new thread creation for each thumbnail generation

Which thread are you talking about?

> - Thumbnail generation seek point is decided by gecko side
> + gecko side could know exact sync frame points

Hmm, actually here's a question. Why do we seek into the video at all?
Why don't we stick to the first frame? Also, why is it that Gecko is
better able to decide where the thumbnail frame should be? Is it just
because you want the screenshot to be taken from a key frame so that the
decoding is cheaper? (And if that's the case, why not the first video
frame? :-)

> - Get Thumbnail only from sync frame
> + other type of frame decoding takes a time.

See above.

> - Decode only thumbnail video frame
> + MediaDecoderReader decode first video frame after when metadata is loaded.
> After the first video frame decodeing, seek and get video frame for thumbnail spend a lot of time.

I'm not sure if I understand this point, sorry.

Sotaro Ikeda

unread,

Dec 20, 2013, 10:24:48 AM12/20/13

to Ehsan Akhgari, cdo...@mozilla.com, b...@mozilla.com, dev-w...@lists.mozilla.org

Hi Ehsan,

> Last but not least, have we talked to other browser vendors to see if
> they're interested in this API? Presumably it can be used to render the
>thumbnail for a video that is not playing yet, so if this is indeed a
> useful API then it may be useful to other browser vendors as well.

IIRC, there is no taling to other browser vendors :-( After Bug 873959 creation, there is no work for the bug.

double, is there such taking to other browser vendors in the past?

Sotaro Ikeda

unread,

Dec 20, 2013, 10:36:14 AM12/20/13

to Ehsan Akhgari, b...@mozilla.com, dev-w...@lists.mozilla.org

Hi Ehsan and Blake,

>>Hmm, actually here's a question. Why do we seek into the video at all?
>> Why don't we stick to the first frame? Also, why is it that Gecko is
>>better able to decide where the thumbnail frame should be? Is it just
>>because you want the screenshot to be taken from a key frame so that the
>>decoding is cheaper? (And if that's the case, why not the first video
>>frame? :-)
>

>A lot of video's first frames are just black or movie companies logo.
>As a thumbnail, user want to identify a content of the video.
>In android, a biggest video sample within fist 20 sync video samples is chosen as thumbnail in SampleTable::findThumbnailSample().

There is an assumption that bigger size video sample might include a more meaningful video frame.

sotaro

----- Original Message -----
From: "Sotaro Ikeda" <sik...@mozilla.com>
To: "Ehsan Akhgari" <ehsan....@gmail.com>, b...@mozilla.com
Cc: dev-w...@lists.mozilla.org
Sent: Friday, December 20, 2013 10:11:39 AM
Subject: Re: New API Proposal for Video Thumbnail Generation

Hi Ehsan and Blake,

>> Current b2g's thumbnail generation is very verbose. If we want to becomes competitive to android in video thumbnail generation area. Such kind of dedicated API is necessary.
>> To speed up the video thumbnail generation, the following things in gecko side becomes necessary.
>> - No Audio Track Handling
>> + Do not load audio track at all.
>> + MediaDecoderStateMachine allocate Audio out, it spends a time.
>
>Can't we fix the HTMLMediaElement to not seek the audio track if the
>video element is muted for example?

Yeah, somehow we need a way to disable audio loading at all for the video tag before start loading the video.

>> - No new thread creation for each thumbnail generation
>
>Which thread are you talking about?

MediaDecoderStateMachine's decoding thread. It is created by each MediaDecoderStateMachine.

>> - Thumbnail generation seek point is decided by gecko side
>> + gecko side could know exact sync frame points
>
>Hmm, actually here's a question. Why do we seek into the video at all?
> Why don't we stick to the first frame? Also, why is it that Gecko is
>better able to decide where the thumbnail frame should be? Is it just
>because you want the screenshot to be taken from a key frame so that the
>decoding is cheaper? (And if that's the case, why not the first video
>frame? :-)

A lot of video's first frames are just black or movie companies logo.
As a thumbnail, user want to identify a content of the video.
In android, a biggest video sample within fist 20 sync video samples is chosen as thumbnail in SampleTable::findThumbnailSample().

http://androidxref.com/4.4_r1/xref/frameworks/av/media/libstagefright/SampleTable.cpp#727

A Reason of choosing sync frame is just because of decoding cost.

>> - Decode only thumbnail video frame
>> + MediaDecoderReader decode first video frame after when metadata is loaded.
>> After the first video frame decodeing, seek and get video frame for thumbnail spend a lot of time.
>
>I'm not sure if I understand this point, sorry.

For thumbnail generation, Gaia side(video/galleary/camera app) at first load video and then seek to thumbnail time.
This causes a lot of unnecessary video decoding.
When video is lodaded, MediaDecoderStateMachine decodes the video's metadata at first.
During the metadata loading, current implementation of MediaDecoderStateMachine decodes first video frame by calling MediaDecoderReader::FindStartTime()

http://mxr.mozilla.org/mozilla-central/source/content/media/MediaDecoderReader.cpp#454

After the first video frame decode, MediaDecoderStateMachine seek to thumbnail time and decode video frame for thumbnail.
It is not efficient form thumbnail generation point of view.

Ehsan Akhgari

unread,

Dec 20, 2013, 11:42:10 AM12/20/13

to Sotaro Ikeda, b...@mozilla.com, dev-w...@lists.mozilla.org

On 12/20/2013, 10:11 AM, Sotaro Ikeda wrote:
> Hi Ehsan and Blake,
>
>>> Current b2g's thumbnail generation is very verbose. If we want to becomes competitive to android in video thumbnail generation area. Such kind of dedicated API is necessary.
>>> To speed up the video thumbnail generation, the following things in gecko side becomes necessary.
>>> - No Audio Track Handling
>>> + Do not load audio track at all.
>>> + MediaDecoderStateMachine allocate Audio out, it spends a time.
>>
>> Can't we fix the HTMLMediaElement to not seek the audio track if the
>> video element is muted for example?
>
> Yeah, somehow we need a way to disable audio loading at all for the video tag before start loading the video.

Do you mean audio loading or decoding here? I'm not very familiar with
different container formats so I don't know if this is actually
implementable.

Do you have a link to a profile which shows the audio decoding part ot
be very expensive here?

>>> - No new thread creation for each thumbnail generation
>>
>> Which thread are you talking about?
>
> MediaDecoderStateMachine's decoding thread. It is created by each MediaDecoderStateMachine.

Two questions:

1. How are we planning to avoid that?
2. That really seems like a problem with our current implementation.
This on itself is not a good argument for adding a new API. Can we just
address this in our implementation somehow?

>>> - Thumbnail generation seek point is decided by gecko side
>>> + gecko side could know exact sync frame points
>>
>> Hmm, actually here's a question. Why do we seek into the video at all?
>> Why don't we stick to the first frame? Also, why is it that Gecko is
>> better able to decide where the thumbnail frame should be? Is it just
>> because you want the screenshot to be taken from a key frame so that the
>> decoding is cheaper? (And if that's the case, why not the first video
>> frame? :-)
>
> A lot of video's first frames are just black or movie companies logo.
> As a thumbnail, user want to identify a content of the video.
> In android, a biggest video sample within fist 20 sync video samples is chosen as thumbnail in SampleTable::findThumbnailSample().
>
> http://androidxref.com/4.4_r1/xref/frameworks/av/media/libstagefright/SampleTable.cpp#727
>
> A Reason of choosing sync frame is just because of decoding cost.

So, what if somebody comes with a better idea on how to generate a
thumbnail later on? We wouldn't want to introduce a new API every time
that happens. I think this is actually a great reason to do this in
gaia instead of gecko.

>>> - Decode only thumbnail video frame
>>> + MediaDecoderReader decode first video frame after when metadata is loaded.
>>> After the first video frame decodeing, seek and get video frame for thumbnail spend a lot of time.
>>
>> I'm not sure if I understand this point, sorry.
>
> For thumbnail generation, Gaia side(video/galleary/camera app) at first load video and then seek to thumbnail time.
> This causes a lot of unnecessary video decoding.
> When video is lodaded, MediaDecoderStateMachine decodes the video's metadata at first.
> During the metadata loading, current implementation of MediaDecoderStateMachine decodes first video frame by calling MediaDecoderReader::FindStartTime()
>
> http://mxr.mozilla.org/mozilla-central/source/content/media/MediaDecoderReader.cpp#454
>
> After the first video frame decode, MediaDecoderStateMachine seek to thumbnail time and decode video frame for thumbnail.
> It is not efficient form thumbnail generation point of view.

This is also a limitation of our implementation it seems.

My point here is that if we fixed all of these limitations, and perhaps
provided an API to give you frame numbers and their corresponding times,
you would be able to implement the rest in gaia. Do you agree?

Cheers,
Ehsan

Sotaro Ikeda

unread,

Dec 20, 2013, 12:19:41 PM12/20/13

to Ehsan Akhgari, b...@mozilla.com, dev-w...@lists.mozilla.org

>> Yeah, somehow we need a way to disable audio loading at all for the video tag before start loading the video.
>
> Do you mean audio loading or decoding here? I'm not very familiar with
> different container formats so I don't know if this is actually
> implementable.
>
> Do you have a link to a profile which shows the audio decoding part ot
> be very expensive here?

I mean both audio loading and decoding. A cost of just parsing audio track is not expensive if the file is a local file.
Cost of audio codec allocation and decoding are expensive.
And if the file have a audio track, MediaDecoderStateMachine create a audio output.
In b2g, android::AudioTrack is allocated as audio output.
Cost of allocating AudioTrack is also expensive.

>> MediaDecoderStateMachine's decoding thread. It is created by each MediaDecoderStateMachine.
>Two questions:
>
>1. How are we planning to avoid that?

media playback related engineers are recognized about some problem of MediaDecoderStateMachine.
There seems no concrete plan about it.
Lat November, media playback work week was organized in Auckland office.
There was a discussion about current problems of gecko's media playback framework. The following is a note about it.
https://etherpad.mozilla.org/media-decoder-refactoring-ideas-2013

But media framework related engineers seems very busy for adding new capabilities next year.
So, fixing the existing problems are low priority now:-(
This discussion is very good to understand the necessity of fixing the problem:-)

>2. That really seems like a problem with our current implementation.
>This on itself is not a good argument for adding a new API. Can we just
>address this in our implementation somehow?

Yeah, I think so. But we might need to add some attributes to video tag.

> So, what if somebody comes with a better idea on how to generate a
> thumbnail later on? We wouldn't want to introduce a new API every time
> that happens. I think this is actually a great reason to do this in
> gaia instead of gecko.

Yeah, this is a problem of creating the whole new API.
Adding new things to API should be as minimum as possible.

> My point here is that if we fixed all of these limitations, and perhaps
> provided an API to give you frame numbers and their corresponding times,
> you would be able to implement the rest in gaia. Do you agree?

Yes, I agree.

sotaro
----- Original Message -----
From: "Ehsan Akhgari" <ehsan....@gmail.com>
To: "Sotaro Ikeda" <sik...@mozilla.com>, b...@mozilla.com
Cc: dev-w...@lists.mozilla.org
Sent: Friday, December 20, 2013 11:42:10 AM
Subject: Re: New API Proposal for Video Thumbnail Generation

Ehsan Akhgari

unread,

Dec 20, 2013, 4:05:01 PM12/20/13

to Sotaro Ikeda, b...@mozilla.com, dev-w...@lists.mozilla.org

On 12/20/2013, 12:19 PM, Sotaro Ikeda wrote:
>
>>> Yeah, somehow we need a way to disable audio loading at all for the video tag before start loading the video.
>>
>> Do you mean audio loading or decoding here? I'm not very familiar with
>> different container formats so I don't know if this is actually
>> implementable.
>>
>> Do you have a link to a profile which shows the audio decoding part ot
>> be very expensive here?
>
> I mean both audio loading and decoding. A cost of just parsing audio track is not expensive if the file is a local file.
> Cost of audio codec allocation and decoding are expensive.
> And if the file have a audio track, MediaDecoderStateMachine create a audio output.
> In b2g, android::AudioTrack is allocated as audio output.
> Cost of allocating AudioTrack is also expensive.

I see.

>>> MediaDecoderStateMachine's decoding thread. It is created by each MediaDecoderStateMachine.
>> Two questions:
>>
>> 1. How are we planning to avoid that?
>
> media playback related engineers are recognized about some problem of MediaDecoderStateMachine.
> There seems no concrete plan about it.
> Lat November, media playback work week was organized in Auckland office.
> There was a discussion about current problems of gecko's media playback framework. The following is a note about it.
> https://etherpad.mozilla.org/media-decoder-refactoring-ideas-2013
>
> But media framework related engineers seems very busy for adding new capabilities next year.
> So, fixing the existing problems are low priority now:-(
> This discussion is very good to understand the necessity of fixing the problem:-)

Yes, agreed.

>> 2. That really seems like a problem with our current implementation.
>> This on itself is not a good argument for adding a new API. Can we just
>> address this in our implementation somehow?
>
> Yeah, I think so. But we might need to add some attributes to video tag.

We can just use HTMLMediaElement.muted right?

>> So, what if somebody comes with a better idea on how to generate a
>> thumbnail later on? We wouldn't want to introduce a new API every time
>> that happens. I think this is actually a great reason to do this in
>> gaia instead of gecko.
>
> Yeah, this is a problem of creating the whole new API.
> Adding new things to API should be as minimum as possible.

Great, so let's see if we can just get away with exposing an API to give
you the times for each frame, and get something working well on top of
it. But we should coordinate that with other browser vendors any way.

Cheers,
Ehsan

Sotaro Ikeda

unread,

Dec 20, 2013, 4:21:32 PM12/20/13

to Ehsan Akhgari, b...@mozilla.com, dev-w...@lists.mozilla.org

>>> 2. That really seems like a problem with our current implementation.
>>> This on itself is not a good argument for adding a new API. Can we just
>>> address this in our implementation somehow?
>>
>> Yeah, I think so. But we might need to add some attributes to video tag.
>
>We can just use HTMLMediaElement.muted right?

About audio, it is yes. I am just not sure how to get sync video frame time, before decoding it by using video tag.

sotaro
----- Original Message -----
From: "Ehsan Akhgari" <ehsan....@gmail.com>
To: "Sotaro Ikeda" <sik...@mozilla.com>
Cc: b...@mozilla.com, dev-w...@lists.mozilla.org
Sent: Friday, December 20, 2013 4:05:01 PM
Subject: Re: New API Proposal for Video Thumbnail Generation

Ehsan Akhgari

unread,

Dec 20, 2013, 4:40:20 PM12/20/13

to Sotaro Ikeda, b...@mozilla.com, dev-w...@lists.mozilla.org

On 12/20/2013, 4:21 PM, Sotaro Ikeda wrote:
>
>>>> 2. That really seems like a problem with our current implementation.
>>>> This on itself is not a good argument for adding a new API. Can we just
>>>> address this in our implementation somehow?
>>>
>>> Yeah, I think so. But we might need to add some attributes to video tag.
>>
>> We can just use HTMLMediaElement.muted right?
>
> About audio, it is yes. I am just not sure how to get sync video frame time, before decoding it by using video tag.

That information is not exposed yet as far as I can tell, which is why I
think we need to expose that through an API.

Cheers,
Ehsan

Jonas Sicking

unread,

Dec 20, 2013, 4:48:45 PM12/20/13

to Ehsan Akhgari, Robert O'Callahan, b...@mozilla.com, dev-webapi, Sotaro Ikeda

We generally have a problem with that image scaling and thumbnail
generation requires the use of canvas which requires a lot of extra
resource usage. Not only does it require decoding fullsized images.
Then you have to create a copy when pasting into the canvas, and
another copy if you want to create a compressed image.

This applies both to still images and video.

I know Facebook has in the past asked for an API to handle this,
though I'm not clear on why they were rescaling images. A guess would
be that they want to show a thumbnail to the user when doing uploads.
Or to resize before uploading as to save bandwidth. I can ask to get
more details here.

Another usecase would be showing preview pictures when hovering the
timeline of a video. YouTube does this for some videos and I've always
found it really awesome for seeking.

David Flanagan the other day posted to the whatwg list about this as
well with a similar use case. Basically resizing an image withough
going through a fullsized decoded image.

Other browser vendors have generally showed little interest so far
here. Possibly because they aren't targetting as low-memory/cpu
devices as we are.

I think we should put together a proposal and send to other browser
vendors. If they aren't interested in this use case at all, I believe
our API policy allows us to implement as long as we publish a spec and
make an effort to create good API.

Something like this might be all that's needed:

loads URL, resizes and encodes as jpg. If the url returns a video the
first frame is used. Or maybe we should default to a more useful
frame?
navigator.createResizedImage({ src: "url", width: x, height: y,
encoding: "image/jpg" });

loads URL, resizes and encodes as jpg. If the url returns a video it
uses the timeslot 43 seconds into the video.
navigator.createResizedImage({ src: "url", width: x, height: y, time:
43, encoding: "image/jpg" });

Same as above, but allows an approximate time. I.e. the UA can choose
a keyframe close to 43 seconds.
navigator.createResizedImage({ src: "url", width: x, height: y, time:
43, timeApproximate: true, encoding: "image/jpg" });

loads blob contents, resizes and encodes as jpg
navigator.createResizedImage({ srcObj: myBlob, width: x, height: y,
encoding: "image/jpg" });

can also allow setting srcObj to an HTMLImageElement, HTMLVideoElement
or HTMLCanvasElement. Possibly also ImageData objects etc, I'm not
sure which classes are floating around for this these days.

All of the above functions would return a Promise which resolves to a
Blob object once encoding has been done. Possibly we should also allow
encoding directly "to screen" somehow. As to allow not having to go
through an encode-decode step and as to not have to have both the
encoded and decoded data around.

Is there a object which represents a pile of pixels and which can
displayed without having to create further copies?

This could be triggered by setting encoding to "none".

/ Jonas

On Dec 19, 2013 2:02 PM, "Ehsan Akhgari" <ehsan....@gmail.com> wrote:

Chris Double

unread,

Dec 22, 2013, 5:46:07 AM12/22/13

to Sotaro Ikeda, b...@mozilla.com, dev-w...@lists.mozilla.org, Ehsan Akhgari

On Sat, Dec 21, 2013 at 4:24 AM, Sotaro Ikeda <sik...@mozilla.com> wrote:
>
> double, is there such taking to other browser vendors in the past?
>

No discussion has been made that I'm aware of.

> From: "Ehsan Akhgari" <ehsan....@gmail.com>
> Hmm, actually here's a question. Why do we seek into the video at all?
> Why don't we stick to the first frame?

Because the first frame is usually the worst frame to choose. For
movie trailers it's often blank or the censor's age note. For many
videos it's a partial frame as they slowly zoom into a shot, etc. When
I first ran tinyvid.tv I originally used the first frame for
thumbnails and it was pretty bad. It was suggested to me by users to
pick the largest keyframe or similar heuristic. I'm not an expert on
what the best choice is though.

Ehsan Akhgari

unread,

Jan 3, 2014, 12:31:41 PM1/3/14

to Jonas Sicking, Robert O'Callahan, b...@mozilla.com, dev-webapi, Sotaro Ikeda

On 12/20/2013, 4:48 PM, Jonas Sicking wrote:
> We generally have a problem with that image scaling and thumbnail
> generation requires the use of canvas which requires a lot of extra
> resource usage. Not only does it require decoding fullsized images.
> Then you have to create a copy when pasting into the canvas, and
> another copy if you want to create a compressed image.
>
> This applies both to still images and video.
>
> I know Facebook has in the past asked for an API to handle this,
> though I'm not clear on why they were rescaling images. A guess would
> be that they want to show a thumbnail to the user when doing uploads.
> Or to resize before uploading as to save bandwidth. I can ask to get
> more details here.

IIRC the Facebook use case was resizing the images before uploading to
save bandwidth (or to make the uploads faster.)

> Another usecase would be showing preview pictures when hovering the
> timeline of a video. YouTube does this for some videos and I've always
> found it really awesome for seeking.
>
> David Flanagan the other day posted to the whatwg list about this as
> well with a similar use case. Basically resizing an image withough
> going through a fullsized decoded image.
>
> Other browser vendors have generally showed little interest so far
> here. Possibly because they aren't targetting as low-memory/cpu
> devices as we are.
>
> I think we should put together a proposal and send to other browser
> vendors. If they aren't interested in this use case at all, I believe
> our API policy allows us to implement as long as we publish a spec and
> make an effort to create good API.

Fair enough.

> Something like this might be all that's needed:
>
> loads URL, resizes and encodes as jpg. If the url returns a video the
> first frame is used. Or maybe we should default to a more useful
> frame?
> navigator.createResizedImage({ src: "url", width: x, height: y,
> encoding: "image/jpg" });

Hmm, what is the return type of this method? An HTMLImageElement or a
Blob? Isn't it better to provide another API to encode an ImageBitmap
to a Blob, and just take David's proposal to whatwg?

> loads URL, resizes and encodes as jpg. If the url returns a video it
> uses the timeslot 43 seconds into the video.
> navigator.createResizedImage({ src: "url", width: x, height: y, time:
> 43, encoding: "image/jpg" });
>
> Same as above, but allows an approximate time. I.e. the UA can choose
> a keyframe close to 43 seconds.
> navigator.createResizedImage({ src: "url", width: x, height: y, time:
> 43, timeApproximate: true, encoding: "image/jpg" });

Both of these can be implemented on top of David's suggestion by just
seeking the video before calling createImageBitmap.

> loads blob contents, resizes and encodes as jpg
> navigator.createResizedImage({ srcObj: myBlob, width: x, height: y,
> encoding: "image/jpg" });

Ditto.

> can also allow setting srcObj to an HTMLImageElement, HTMLVideoElement
> or HTMLCanvasElement. Possibly also ImageData objects etc, I'm not
> sure which classes are floating around for this these days.

Here is the argument type to createImageBitmap:

typedef (HTMLImageElement or
HTMLVideoElement or
HTMLCanvasElement or
Blob or
ImageData or
CanvasRenderingContext2D or
ImageBitmap) ImageBitmapSource;

That should capture everything that is interesting here.

> All of the above functions would return a Promise which resolves to a
> Blob object once encoding has been done. Possibly we should also allow
> encoding directly "to screen" somehow. As to allow not having to go
> through an encode-decode step and as to not have to have both the
> encoded and decoded data around.

For the to screen case, we can just paint the ImageBitmap onto a canvas,
right?

> Is there a object which represents a pile of pixels and which can
> displayed without having to create further copies?

I guess that object would be ImageBitmap. But painting that to a canvas
would need at least one copy. One idea that comes to mind is to extend
URL.createObjectURL to accept an ImageBitmap in addition to a blob and
then set the src of an img to such a URL object.

Also, note that all of this is orthogonal to the other issue raised in
this thread, that is, finding an appropriate keyframe into the video and
seeking to it. The "easy" solution for that is to add an API which maps
keyframes to times, but that won't give you any information about
heuristics such as picking the largest keyframe, etc.

Cheers,
Ehsan

Ehsan Akhgari

unread,

Jan 3, 2014, 12:32:37 PM1/3/14

to Blake Wu, Jonas Sicking, Robert O'Callahan, dev-webapi, Chiajung Hung, Sotaro Ikeda, Shih-Chiang Chien

Does David's proposal address these concerns?

On 12/24/2013, 10:06 AM, Blake Wu wrote:
> Yes. Using canvas requires some resource, color conversion and memory copy.
> In current design, the memory decoded frame occupys and decoder are not
> released until canvas finishes scaling and drawing in Gaia. To optimize
> and improve performance, a new API should be necessary to finish all
> things, like decode, rescale, and etc in one functioninstead of passing
> some temporary result to Gaia to procedd further. Since video decoding
> needs to be done in Gecko, so it would be better to implement all these
> processes in Gecko layer.
>
> Inside this API, fullsized frame memory and decoder can be released as
> soon as possible. Extra memory copy should be able to avoid. Without
> using canvas, GL could be used for rescale.
>
> Best Wishes,
> Blake Wu
>
> 於 2013/12/21 上午 05:48, Jonas Sicking 提到:

>
> We generally have a problem with that image scaling and thumbnail
> generation requires the use of canvas which requires a lot of extra
> resource usage. Not only does it require decoding fullsized images.
> Then you have to create a copy when pasting into the canvas, and
> another copy if you want to create a compressed image.
>
> This applies both to still images and video.
>
> I know Facebook has in the past asked for an API to handle this,
> though I'm not clear on why they were rescaling images. A guess would
> be that they want to show a thumbnail to the user when doing uploads.
> Or to resize before uploading as to save bandwidth. I can ask to get
> more details here.
>

> Another usecase would be showing preview pictures when hovering the
> timeline of a video. YouTube does this for some videos and I've always
> found it really awesome for seeking.
>
> David Flanagan the other day posted to the whatwg list about this as
> well with a similar use case. Basically resizing an image withough
> going through a fullsized decoded image.
>
> Other browser vendors have generally showed little interest so far
> here. Possibly because they aren't targetting as low-memory/cpu
> devices as we are.
>
> I think we should put together a proposal and send to other browser
> vendors. If they aren't interested in this use case at all, I believe
> our API policy allows us to implement as long as we publish a spec and
> make an effort to create good API.
>

> Something like this might be all that's needed:
>
> loads URL, resizes and encodes as jpg. If the url returns a video the
> first frame is used. Or maybe we should default to a more useful
> frame?
> navigator.createResizedImage({ src: "url", width: x, height: y,
> encoding: "image/jpg" });
>

> loads URL, resizes and encodes as jpg. If the url returns a video it
> uses the timeslot 43 seconds into the video.
> navigator.createResizedImage({ src: "url", width: x, height: y, time:
> 43, encoding: "image/jpg" });
>
> Same as above, but allows an approximate time. I.e. the UA can choose
> a keyframe close to 43 seconds.
> navigator.createResizedImage({ src: "url", width: x, height: y, time:
> 43, timeApproximate: true, encoding: "image/jpg" });
>

> loads blob contents, resizes and encodes as jpg
> navigator.createResizedImage({ srcObj: myBlob, width: x, height: y,
> encoding: "image/jpg" });
>

> can also allow setting srcObj to an HTMLImageElement, HTMLVideoElement
> or HTMLCanvasElement. Possibly also ImageData objects etc, I'm not
> sure which classes are floating around for this these days.
>

> All of the above functions would return a Promise which resolves to a
> Blob object once encoding has been done. Possibly we should also allow
> encoding directly "to screen" somehow. As to allow not having to go
> through an encode-decode step and as to not have to have both the
> encoded and decoded data around.
>

> Is there a object which represents a pile of pixels and which can
> displayed without having to create further copies?
>

Blake Wu

unread,

Jan 6, 2014, 6:33:44 AM1/6/14

to Ehsan Akhgari, Jonas Sicking, Robert O'Callahan, dev-webapi, Chiajung Hung, Sotaro Ikeda, Shih-Chiang Chien

From my understanding, David's proposal is intended to reduce the
memory usage for fullsized frame.
I am just thinking if it is possible to finish all the things in gecko,
besides memory usage memory copy could be avoided and resource (memory
and decoder) could be released as soon as possible.

IMO, if it is necessary to use canvas, at least it would be better to
first do scale in gecko to avoid creating canvas with fullsized-frame
memory.

Best Wishes,
Blake Wu

於 2014/1/4 上午 01:32, Ehsan Akhgari 提到:

Ehsan Akhgari

unread,

Jan 9, 2014, 8:34:30 PM1/9/14

to Blake Wu, Jonas Sicking, Robert O'Callahan, dev-webapi, Chiajung Hung, Sotaro Ikeda, Shih-Chiang Chien

Can you please list the individual problems that you would like to solve
and a proposal for each one of them? I'm getting a bit confused as to
which direction we're moving towards in this discussion.

The original mozGetThumbnail proposal in the first email in this thread
doesn't meet the bar for what you're trying to achieve, as it lacks
things such as converting between times and keyframes, resizing the
thumbnail, etc. Since that original email there have been a lot of good
ideas proposed on this thread and I'm trying to understand which ones
you think are helpful to you, and which ones are not, so that we can
move towards something which addresses your needs.

Thanks!
Ehsan

Blake Wu

unread,

Jan 10, 2014, 2:16:35 AM1/10/14

to Ehsan Akhgari, Jonas Sicking, Robert O'Callahan, sch...@mozilla.com, ch...@mozilla.com, dev-webapi, Sotaro Ikeda

Hi Ehsan,

Sorry to make you confused in another mail thread.
Let me align and merge it here with my comments as below.

於 2014/1/4 上午1:31, Ehsan Akhgari 提到:

> On 12/20/2013, 4:48 PM, Jonas Sicking wrote:

>> We generally have a problem with that image scaling and thumbnail
>> generation requires the use of canvas which requires a lot of extra
>> resource usage. Not only does it require decoding fullsized images.
>> Then you have to create a copy when pasting into the canvas, and
>> another copy if you want to create a compressed image.
>>
>> This applies both to still images and video.
>>
>> I know Facebook has in the past asked for an API to handle this,
>> though I'm not clear on why they were rescaling images. A guess would
>> be that they want to show a thumbnail to the user when doing uploads.
>> Or to resize before uploading as to save bandwidth. I can ask to get
>> more details here.
>

> IIRC the Facebook use case was resizing the images before uploading to
> save bandwidth (or to make the uploads faster.)
>

>> Another usecase would be showing preview pictures when hovering the
>> timeline of a video. YouTube does this for some videos and I've always
>> found it really awesome for seeking.
>>
>> David Flanagan the other day posted to the whatwg list about this as
>> well with a similar use case. Basically resizing an image withough
>> going through a fullsized decoded image.
>>
>> Other browser vendors have generally showed little interest so far
>> here. Possibly because they aren't targetting as low-memory/cpu
>> devices as we are.
>>
>> I think we should put together a proposal and send to other browser
>> vendors. If they aren't interested in this use case at all, I believe
>> our API policy allows us to implement as long as we publish a spec and
>> make an effort to create good API.
>

> Fair enough.
Agree to have a more general API.

>
>> Something like this might be all that's needed:
>>
>> loads URL, resizes and encodes as jpg. If the url returns a video the
>> first frame is used. Or maybe we should default to a more useful
>> frame?
>> navigator.createResizedImage({ src: "url", width: x, height: y,
>> encoding: "image/jpg" });
>

> Hmm, what is the return type of this method? An HTMLImageElement or a
> Blob? Isn't it better to provide another API to encode an ImageBitmap
> to a Blob, and just take David's proposal to whatwg?

or using the same API, we can set "encoding" as none, or something like
that, for not ecnoding case?

>> loads URL, resizes and encodes as jpg. If the url returns a video it
>> uses the timeslot 43 seconds into the video.
>> navigator.createResizedImage({ src: "url", width: x, height: y, time:
>> 43, encoding: "image/jpg" });
>>
>> Same as above, but allows an approximate time. I.e. the UA can choose
>> a keyframe close to 43 seconds.
>> navigator.createResizedImage({ src: "url", width: x, height: y, time:
>> 43, timeApproximate: true, encoding: "image/jpg" });
>

> Both of these can be implemented on top of David's suggestion by just
> seeking the video before calling createImageBitmap.
>

Using seeking may be not simple and straightforward enough. What if
gaia/web developer forget to call seek before calling createImageBitmap
:)? createResizedImage's "time" may be sufficient.
Sorry. I am not familiar with createImageBitmap and ImageBitmap part.
If my understanding is correct, you suggestion using createImageBitmap
to do the same things in createResizedImage, right? If yes, will
createImageBitmap still need to allocate fullsized memory?

>> loads blob contents, resizes and encodes as jpg
>> navigator.createResizedImage({ srcObj: myBlob, width: x, height: y,
>> encoding: "image/jpg" });
>

> Ditto.

>
>> can also allow setting srcObj to an HTMLImageElement, HTMLVideoElement
>> or HTMLCanvasElement. Possibly also ImageData objects etc, I'm not
>> sure which classes are floating around for this these days.
>

> Here is the argument type to createImageBitmap:
>
> typedef (HTMLImageElement or
> HTMLVideoElement or
> HTMLCanvasElement or
> Blob or
> ImageData or
> CanvasRenderingContext2D or
> ImageBitmap) ImageBitmapSource;
>
> That should capture everything that is interesting here.
>

>> All of the above functions would return a Promise which resolves to a
>> Blob object once encoding has been done. Possibly we should also allow
>> encoding directly "to screen" somehow. As to allow not having to go
>> through an encode-decode step and as to not have to have both the
>> encoded and decoded data around.

Agree to not have both the encoded and decoded data around.

>
> For the to screen case, we can just paint the ImageBitmap onto a
> canvas, right?
>

>> Is there a object which represents a pile of pixels and which can
>> displayed without having to create further copies?
>

> I guess that object would be ImageBitmap. But painting that to a
> canvas would need at least one copy. One idea that comes to mind is to
> extend URL.createObjectURL to accept an ImageBitmap in addition to a
> blob and then set the src of an img to such a URL object.
>
>
> Also, note that all of this is orthogonal to the other issue raised in
> this thread, that is, finding an appropriate keyframe into the video
> and seeking to it. The "easy" solution for that is to add an API which
> maps keyframes to times, but that won't give you any information about
> heuristics such as picking the largest keyframe, etc.
>
> Cheers,
> Ehsan

In Firefox OS, the most concerned should be when to release memory and
codec, since memory budget is low and there is only one video codec
instance on most devices which means once it is occupied, others cannot
use it. That's why I mentioned it would be better to finish decoding,
resizing, necessary memory copy (or avoid memory copy) in the gecko to
release those resources immediately. Current createResizedImage looks
good to me.

Best Wishes,
Blake Wu

Ehsan Akhgari

unread,

Jan 10, 2014, 11:58:40 AM1/10/14

to Blake Wu, Jonas Sicking, Robert O'Callahan, sch...@mozilla.com, ch...@mozilla.com, dev-webapi, Sotaro Ikeda

On 1/10/2014, 2:16 AM, Blake Wu wrote:
> Hi Ehsan,
>
> Sorry to make you confused in another mail thread.

No worries!

Hmm, what does the non-encoding case mean? It's still not clear to me
what createResizedImage in Jonas' proposal returns, and depending on
that, it may not make sense to talk about the non-encoding case. For
example, there is no canonical way to represent non-encoded bitmap data
as an HTMLImageElement.

>>> loads URL, resizes and encodes as jpg. If the url returns a video it
>>> uses the timeslot 43 seconds into the video.
>>> navigator.createResizedImage({ src: "url", width: x, height: y, time:
>>> 43, encoding: "image/jpg" });
>>>
>>> Same as above, but allows an approximate time. I.e. the UA can choose
>>> a keyframe close to 43 seconds.
>>> navigator.createResizedImage({ src: "url", width: x, height: y, time:
>>> 43, timeApproximate: true, encoding: "image/jpg" });
>>
>> Both of these can be implemented on top of David's suggestion by just
>> seeking the video before calling createImageBitmap.
>>
> Using seeking may be not simple and straightforward enough. What if
> gaia/web developer forget to call seek before calling createImageBitmap
> :)?

Then they will get an image representing the current frame of the video
(which could be the first frame)?

> createResizedImage's "time" may be sufficient.

Yeah, it could be. Note that this proposed API doesn't operate on a
video, it operates on a URL. This may or may not be a desirable
property of its own, but of course if this doesn't reference a video
then seeking the video doesn't mean anything. :-)

> Sorry. I am not familiar with createImageBitmap and ImageBitmap part.

Please see
<http://www.whatwg.org/specs/web-apps/current-work/multipage/timers.html#dom-createimagebitmap>.

> If my understanding is correct, you suggestion using createImageBitmap
> to do the same things in createResizedImage, right?

Yeah. Basically I'm interested to see if we can use createImageBitmap
and build something on top of it.

> If yes, will
> createImageBitmap still need to allocate fullsized memory?

Do you mean for the non-compressed frame? "Maybe". If there is a way
to resize an image without storing the entire non-compressed frame in
memory we could of course improve the implementation.

Well, note that in some cases createResizedImage as proposed may lead to
more resource usage. For example, if you want to render something like
a progress bar for a video which is ready to be played back (similar to
what youtube does when seeking) then with this proposal the downloading
and decoding of the video may need to happen twice, once inside the
video element and once as part of createResizedImage.

Cheers,
Ehsan

Blake Wu

unread,

Feb 5, 2014, 4:08:42 AM2/5/14

to Ehsan Akhgari, Jonas Sicking, Robert O'Callahan, sch...@mozilla.com, ch...@mozilla.com, dev-webapi, Sotaro Ikeda

Hi Ehsan,

Got your point about using createImageBitmap to achieve the same goal.
IIRC, The remaining problem would be how to handle u nnecesary audio
part (init and seek) for video thumbnail case. For what Youtube does
when seeking, audio part is necessary for next continuing to play. But
in the video thumbnail case audio part is unnecessary. Using
HTMLMediaElement.muted may not be sufficient to tell these two cases.

Besides adding new API, another idea is to do both finding the proper
key-frame and decoding (SW decoder) it via JavaScript to be a
self-contained module which should be able to save memory (current HW
decoder requires to allocate many memory buffers even there is only one
frame for decoding) and may be reused cross-platform. If it is doable,
then the performance (parsing and decoding time) of this module would be
needed to check. I have put this idea in the same bugzilla# 942078
<https://bugzilla.mozilla.org/show_bug.cgi?id=942078> to futher discuss
the feasability.
If SW decoding via JavaScript takes too long, we may need to have some
way to let Gecko to do that.

Best Wishes,
Blake

於 2014/1/11 上午12:58, Ehsan Akhgari 提到:

Sotaro Ikeda

unread,

Feb 7, 2014, 3:18:23 PM2/7/14

to Blake Wu, Ehsan Akhgari, dev-webapi, Robert O'Callahan, sch...@mozilla.com, ch...@mozilla.com, Jonas Sicking

Hi blake,

Android's default sw codec supports only H.264 profile. It can not decode high profile. Even if sw codec is supported, decoding time becomes a problem. It could take very long time. High profile and HEVC seems to become a problem.

Sotaro

----- Original Message -----
From: "Blake Wu" <b...@mozilla.com>
To: "Ehsan Akhgari" <ehsan....@gmail.com>, "Jonas Sicking" <jo...@sicking.cc>, "Robert O'Callahan" <rob...@ocallahan.org>, sch...@mozilla.com, ch...@mozilla.com
Cc: "Sotaro Ikeda" <sik...@mozilla.com>, "dev-webapi" <dev-w...@lists.mozilla.org>
Sent: Wednesday, February 5, 2014 4:08:42 AM
Subject: Re: New API Proposal for Video Thumbnail Generation

Sotaro Ikeda

unread,

Feb 7, 2014, 3:23:32 PM2/7/14

to Blake Wu, Ehsan Akhgari, dev-webapi, Jonas Sicking, sch...@mozilla.com, ch...@mozilla.com, Robert O'Callahan

> High profile and HEVC seems to become a problem.

Sorry, I intend to say CABAC (Context-based Adaptive Binary Arithmetic Coding).

sotaro

Blake Wu

unread,

Feb 13, 2014, 3:22:44 AM2/13/14

to Sotaro Ikeda, Ehsan Akhgari, dev-webapi, Jonas Sicking, sch...@mozilla.com, ch...@mozilla.com, Chris Pearce, Robert O'Callahan

*_Hi Sotaro,_*

Thanks for your information.

_*Hi Ehsan, Sotaro, and All,*_
I would like to recap what we have discussed so far.
There are three main items targeted to be improved for having a
dedicated API for video thumbnail generation. And some discussions are
done to see if it is necessary to have a new API or not.

*1. Find the biggest key-frame*
Some discussions in bug 971645 to check if it is doable at application
layer via JavaScript are ongoing.
*2. Fullsized memory usage reduce
*By using createImageBitmap api, we should be able to have a
non-fullsized memory copy.
*3. Avoid audio part *
Video thumbnail generation has nothing to do with audio. This remains an
unsolved problem. HTMLMediaElement.muted may not be helpful. I am
checking how long audio init in Firefox OS takes.

Do you have other ideas for item#3?
I am still in favor of having a new API for vidoe thumbnail generation
:) That would be more simpler and can fix all the problems in this new
api.

Best Wishes,
Blake
於 2014/2/8 上午4:23, Sotaro Ikeda 提到:

Sotaro Ikeda

unread,

Feb 14, 2014, 5:12:12 PM2/14/14

to Blake Wu, Ehsan Akhgari, dev-webapi, Jonas Sicking, sch...@mozilla.com, ch...@mozilla.com, Chris Pearce, Robert O'Callahan

Hi blake

> *3. Avoid audio part *
> Video thumbnail generation has nothing to do with audio. This remains an
> unsolved problem. HTMLMediaElement.muted may not be helpful. I am
> checking how long audio init in Firefox OS takes.

Effect of audio track is not only init. Audio data and video data could be in different place in the file. It could affect to the speed of video data read from file system.

Blake Wu

unread,

Feb 19, 2014, 5:45:55 AM2/19/14

to Sotaro Ikeda, Ehsan Akhgari, dev-webapi, Jonas Sicking, sch...@mozilla.com, ch...@mozilla.com, Chris Pearce, Robert O'Callahan

Hi Ehsan and All,

Since audio part still cannot be avoided in current method (.mute, seek,
and createImageBitmap) and as Sotaro also added audio could affect the
speed of video data read, do you have other concerns for the
introduction of a new API, mozCreateThumbnail(double thumbnailTime,
unsigned long width, unsigned long height) ?

This redefined API has two more parameters (width and height) to avoid
having a full-size memory allocation for canvas.
For thumbnailTime, it is open for application to decide.

Example:

var video = document.createElement('video');
video.mozCreateThumbnail(time, width, height); //new
video.src = url;
video.ongetthumbnail = function() { //new
:
captureFrame(video, metadata,callback);
:
}

I have created a wiki page to keep having a current summary
https://wiki.mozilla.org/TPESystem/Media/VideoThumbnail

Best Wishes,
Blake
於 2014/2/15 上午6:12, Sotaro Ikeda 提到:

Ehsan Akhgari

unread,

Feb 19, 2014, 11:05:42 AM2/19/14

to Blake Wu, Sotaro Ikeda, dev-webapi, Robert O'Callahan, sch...@mozilla.com, ch...@mozilla.com, Chris Pearce, Jonas Sicking

Hi Blake,

On 2/19/2014, 5:45 AM, Blake Wu wrote:
> Hi Ehsan and All,
>
> Since audio part still cannot be avoided in current method (.mute, seek,
> and createImageBitmap) and as Sotaro also added audio could affect the
> speed of video data read, do you have other concerns for the
> introduction of a new API, mozCreateThumbnail(double thumbnailTime,
> unsigned long width, unsigned long height) ?
>
> This redefined API has two more parameters (width and height) to avoid
> having a full-size memory allocation for canvas.
> For thumbnailTime, it is open for application to decide.
>
> Example:
>
> var video = document.createElement('video');
> video.mozCreateThumbnail(time, width, height); //new
> video.src = url;
> video.ongetthumbnail = function() { //new
> :
> captureFrame(video, metadata,callback);
> :
> }
>
> I have created a wiki page to keep having a current summary
> https://wiki.mozilla.org/TPESystem/Media/VideoThumbnail

So here are the questions that I have for you. Some of them have
already been discussed but I'm asking them again because I still don't
understand the reasoning behind them.

1. What is the issue with audio muting exactly? If audio is set to
muted, why do we _need_ to decode the audio track? I understand if this
is what Gecko _currently_ does, but I'm not very concerned with our
current behavior. I'm trying to understand why using the muted
attribute is the wrong thing here.

2. What happened with the requirement of selecting a key frame? Earlier
in the thread it seemed like that is a hard requirement for getting a
quality thumbnail generated, is that no longer the case? I'm asking
this because the API you're proposing above has no notion of key frames.

3. What is the issue with seeking the video through the usual
HTMLVideoElement API, and why is that unacceptable but a time argument
to mozCreateThumbnail is acceptable? To me, they both seem like two
ways to do the same thing, and I don't understand why we cannot rely on
normal seeking here.

Now, let's talk about your proposal specifically. We are working on an
API for resizing images without incurring a high memory usage cost (see
Jonas' post earlier this thread for a rough sketch.) It seems to me
like there is no reason why we cannot integrate HTMLVideoElement with
that API. That gives you the ability to rescale the captured thumbnail
to the requested size, which is part of your proposal above. That API
will be based on Promises which will give you a better syntax for
handling the asynchronous response. I'm actively working on the API for
now, focusing on the image resizing issue for now, but I would like to
start thinking about the video thumbnail generation use case as well, so
I would really appreciate if we can discuss the questions above.

Does that sound good?

Cheers,
Ehsan

Chris Pearce

unread,

Feb 19, 2014, 5:14:17 PM2/19/14

to Ehsan Akhgari, Blake Wu, Sotaro Ikeda, dev-webapi, Jonas Sicking, ch...@mozilla.com, Robert O'Callahan, sch...@mozilla.com

Hi all,

The issue with audio muting is that we want to still start up the audio
decoder even when audio is muted so that there's no delay when playback
is started if audio is unmuted.

due to our current behaviour there is memory overhead as well to buffer
the decoded data (currently we buffer 1s of decoded audio in advance of
the current playback position in order to avoid playback glitches, which
for 48KHz stero audio is 192,000 bytes required for this buffering. So
if we start up the audio decoder as well as the video decoder, we'll do
a bunch of work and allocate a bunch of memory that we don't need to.

There is an API for deselecting streams on video elements (which would
do what you're implying mute should do), the track API, but we don't
implement it yet, and AFAIK no other browser does either.

>
> 2. What happened with the requirement of selecting a key frame?

I think they dropped it because they thought they were more likely to
get something agreed upon if they dropped that requirement.

I think we should have the thumbnail API choose the keyframe for us, and
provide no way for the caller to specify what the keyframe should be.
The implementation should just pick the right frame. Because we control
the implementation, we can ensure that.

> Earlier in the thread it seemed like that is a hard requirement for
> getting a quality thumbnail generated, is that no longer the case?

Now they're discussing writing a demuxer in JavaScript that parses the
MP4 file's index and finds the keyframe they want.

I think this is a bad solution, because it means the code we're using
playback and for getting thumbnails is different, so it's ineveitable
that the behaviour will differ for these, and we'll fail to get a
thumbnail for some files. We also don't really want to have to write a
parser for all the other formats we want to support, like WebM, Ogg, etc.

Despite this being a bad solution, it's one that can be gotten to work
"well enough" quickly for the next B2G milestone, so I'm OK with it as a
short term solution until we have a useful thumbnail API implementation
in Gecko.

> I'm asking this because the API you're proposing above has no notion
> of key frames.

I think Jonas' earlier proposal is a good one, and we should allow video
URLs in it, and it for video the thumbnail API should select the
keyframe which has the largest encoded size out of the first 10
keyframes (what Android does). I think we should not require or even
allow the caller to specify which frame should be the thumbnail, because
it's much more reliable to get the video stack to do that.

>
> 3. What is the issue with seeking the video through the usual
> HTMLVideoElement API, and why is that unacceptable but a time argument
> to mozCreateThumbnail is acceptable?

We don't want to use normal seeking path because that requires us to
load the video to be thumbnailed in our regular pipeline, and if we're
not going to be playing back that means we've done a lot of extra work
that we don't need to. This increases latency of the API, and the amount
of work we have to do.

Additionally, on some B2G devices we can only have one hardware decoder
active at once (the software decoder is buggy and can only handle H.264
Baseline, so it's no use here). If we use the regular <video> path we
lock the hardware decoder for the entire time we've got a <video> active
in the active document. So while we're doing all that extra work that we
don't need to do (decoding audio and video frames for buffering under
the assumption that we'll be playing back the media) , we can't
thumbnail another video. Or play another video for that matter, though
you could argue the Player App should also tiptoe around this limitation.

Whereas if we have a dedicated thumbnail API we can initialize and lock
the HW decoder only while we decode the desired thumbnail keyframe, we
don't need to keep the HW decoder locked while we're demuxing/parsing
the file to find the desired thumbnail keyframe as in the normal <video>
path.

The primary use case here is the Video Player App "Gallery" screen. On
first run we'll have a number of unthumbnailed videos on disk, and we
need to iterate through them all and generate thumbnails. We want the
thumbnails to be generated ASAP.

> To me, they both seem like two ways to do the same thing, and I don't
> understand why we cannot rely on normal seeking here.
>
> Now, let's talk about your proposal specifically. We are working on
> an API for resizing images without incurring a high memory usage cost
> (see Jonas' post earlier this thread for a rough sketch.) It seems to
> me like there is no reason why we cannot integrate HTMLVideoElement
> with that API. That gives you the ability to rescale the captured
> thumbnail to the requested size, which is part of your proposal
> above. That API will be based on Promises which will give you a
> better syntax for handling the asynchronous response. I'm actively
> working on the API for now, focusing on the image resizing issue for
> now, but I would like to start thinking about the video thumbnail
> generation use case as well, so I would really appreciate if we can
> discuss the questions above.

I think it is completely reasonable to have the image thumbnail API also
work on videos. The sketch Jonas proposed earlier seems a reasonable API
to me.

As I said above, I think for video we should let the
navigator.createResizedImage() implementation chose which frame to use
for the thumbnail of a video based on the size of keyframe. We shouldn't
force the JS App to decide what frame should be the keyframe, or try to
expose the sizes of keyframes to the web somehow.

>
> Does that sound good?
>

Does to me! Did what I said make sense?

Cheers,
Chris P.

Chris Pearce

unread,

Feb 19, 2014, 5:14:17 PM2/19/14

to Ehsan Akhgari, Blake Wu, Sotaro Ikeda, dev-webapi, Jonas Sicking, ch...@mozilla.com, Robert O'Callahan, sch...@mozilla.com

Hi all,

On 2/20/2014 5:05 AM, Ehsan Akhgari wrote:

The issue with audio muting is that we want to still start up the audio
decoder even when audio is muted so that there's no delay when playback
is started if audio is unmuted.

due to our current behaviour there is memory overhead as well to buffer
the decoded data (currently we buffer 1s of decoded audio in advance of
the current playback position in order to avoid playback glitches, which
for 48KHz stero audio is 192,000 bytes required for this buffering. So
if we start up the audio decoder as well as the video decoder, we'll do
a bunch of work and allocate a bunch of memory that we don't need to.

There is an API for deselecting streams on video elements (which would
do what you're implying mute should do), the track API, but we don't
implement it yet, and AFAIK no other browser does either.

>

> 2. What happened with the requirement of selecting a key frame?

I think they dropped it because they thought they were more likely to
get something agreed upon if they dropped that requirement.

I think we should have the thumbnail API choose the keyframe for us, and
provide no way for the caller to specify what the keyframe should be.
The implementation should just pick the right frame. Because we control
the implementation, we can ensure that.

> Earlier in the thread it seemed like that is a hard requirement for
> getting a quality thumbnail generated, is that no longer the case?

Now they're discussing writing a demuxer in JavaScript that parses the
MP4 file's index and finds the keyframe they want.

I think this is a bad solution, because it means the code we're using
playback and for getting thumbnails is different, so it's ineveitable
that the behaviour will differ for these, and we'll fail to get a
thumbnail for some files. We also don't really want to have to write a
parser for all the other formats we want to support, like WebM, Ogg, etc.

Despite this being a bad solution, it's one that can be gotten to work
"well enough" quickly for the next B2G milestone, so I'm OK with it as a
short term solution until we have a useful thumbnail API implementation
in Gecko.

> I'm asking this because the API you're proposing above has no notion
> of key frames.

I think Jonas' earlier proposal is a good one, and we should allow video
URLs in it, and it for video the thumbnail API should select the
keyframe which has the largest encoded size out of the first 10
keyframes (what Android does). I think we should not require or even
allow the caller to specify which frame should be the thumbnail, because
it's much more reliable to get the video stack to do that.

>

> 3. What is the issue with seeking the video through the usual
> HTMLVideoElement API, and why is that unacceptable but a time argument
> to mozCreateThumbnail is acceptable?