Simplification and separation ideas


Vittorio Giovara

May 12, 2016, 5:00:27 PM
to spatial-media-discuss
Hello,
please find my comments about the spherical specification RFC.

- Why is stereoscopic information specified in the PRHD box? While I was working on the first version of the spec I was pointed to the fact that the spherical projection and the stereoscopic one are completely distinct and separate. Rendering takes place at two different stages for spherical and for stereoscopic, and you can have spherical videos that are not stereoscopic and stereoscopic videos that are not spherical. Since the stereoscopic information seems mostly unrelated, I would propose to separate it from this specification, moving it to a separate box out of the spherical video realm. Alternatively we could simply refer to already existing stereoscopic specifications (such as the frame packing arrangement from h264/h265) as this would further simplify parsing this box and avoid ambiguity in case both the container and the stream specify stereoscopic rendering.

- The roll, pitch, and yaw are expressed in degrees, but this is rather limiting. A better approach would be to use a display matrix for each element: this would allow us to specify any rotation in any direction without having to specify "clockwise" or "counter-clockwise". Moreover the display matrix extends the rendering capabilities so that if we have to rescale or resize the input view in any way, we would be able to do so without adding fields to the specification.
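To make the comparison concrete, here is a rough sketch (not from the RFC; the rotation order and sign conventions are placeholders) of how yaw/pitch/roll in degrees expand into the kind of 3x3 display matrix proposed here:

    import math

    def rot_x(a):  # pitch
        c, s = math.cos(a), math.sin(a)
        return [[1, 0, 0], [0, c, -s], [0, s, c]]

    def rot_y(a):  # yaw
        c, s = math.cos(a), math.sin(a)
        return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

    def rot_z(a):  # roll
        c, s = math.cos(a), math.sin(a)
        return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]

    def display_matrix(yaw_deg, pitch_deg, roll_deg):
        y, p, r = map(math.radians, (yaw_deg, pitch_deg, roll_deg))
        # one possible convention: roll about Z, then pitch about X, then yaw about Y
        return matmul(rot_z(r), matmul(rot_x(p), rot_y(y)))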

- I find the names for the Equirectangular Projection Box extremely confusing: h264 and several other codecs use a similar naming scheme for completely different purposes (bitstream cropping), while here they refer to a region of interest that should be presented first when loading the video. I would simply suggest changing the names to something different to avoid any doubt about what they are for.

Best regards
Vittorio

Robert Suderman

May 12, 2016, 6:35:46 PM
to spatial-media-discuss


On Thursday, May 12, 2016 at 2:00:27 PM UTC-7, Vittorio Giovara wrote:
Hello,
please find my comments about the spherical specification RFC.

- Why is stereoscopic information specified in the PRHD box? While I was working on the first version of the spec I was pointed to the fact that the spherical projection and the stereoscopic one are completely distinct and separate. Rendering takes place at two different stages for spherical and for stereoscopic, and you can have spherical videos that are not stereoscopic and stereoscopic videos that are not spherical. Since the stereoscopic information seems mostly unrelated, I would propose to separate it from this specification, moving it to a separate box out of the spherical video realm. Alternatively we could simply refer to already existing stereoscopic specifications (such as the frame packing arrangement from h264/h265) as this would further simplify parsing this box and avoid ambiguity in case both the container and the stream specify stereoscopic rendering.


Just a few thoughts for why we chose to include it in the spherical header. 
  1. We needed to avoid codec-specific stereo metadata (as for h264 and h265). We have many creators working with content in other formats (ProRes for instance) that do not have their own stereo metadata, so we needed some form of new metadata for those formats.
  2. I agree that it would be preferable to have stereo metadata in a separate box adjacent to sv3d.
 Overall I am definitely in favor of having stereo metadata specified separately though there would be additional work in introducing a second spec.

- The roll, pitch, and yaw are expressed in degrees, but this is rather limiting. A better approach would be to use a display matrix for each element: this would allow us to specify any rotation in any direction without having to specify "clockwise" or "counter-clockwise". Moreover the display matrix extends the rendering capabilities so that if we have to rescale or resize the input view in any way, we would be able to do so without adding fields to the specification.

A few concerns with specifying the rotation as a transform matrix:
  1. While the current method does require an extremely verbose description of yaw, pitch, and roll, it does avoid describing each projection as XYZ
  2. Many operations would have little meaning for a spherical video. For instance the effect of a translation does not have meaning without knowing the expected distance of each point of the projection from the observer.
I am not heavily opposed to including the rotation as a transform, just concerned it could introduce unexpected behavior in unspecified cases.
 
- I find the names for the Equirectangular Projection Box extremely confusing: h264 and several other codecs use a similar naming scheme for completely different purposes (bitstream cropping), while here they refer to a region of interest that should be presented first when loading the video. I would simply suggest changing the names to something different to avoid any doubt about what they are for.

Would updating the naming to something like "frame_crop_left" or "projection_crop_left" help with the clarity? If you have a suggested naming, I would appreciate it.
 
Best regards
Vittorio

Chip Brown

May 13, 2016, 8:54:45 PM
to spatial-media-discuss
While I understand that a rotation matrix is nice and general, it is of course far too general. We really want to restrict to simple rotations only, not even any flips IMHO. This is painful to enforce mathematically due to precision issues, so it would likely be best left as "if it is anything other than a simple rotation the results are undefined". 
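For what it's worth, a rough sketch of the kind of check a parser would need if the spec tried to enforce "pure rotation, no flips"; the tolerance is arbitrary, which is exactly the precision problem above:

    def is_pure_rotation(m, tol=1e-4):
        # a 3x3 matrix is a rotation iff m * m^T == I and det(m) == +1 (no flips)
        mt = [[m[j][i] for j in range(3)] for i in range(3)]
        mmt = [[sum(m[i][k] * mt[k][j] for k in range(3)) for j in range(3)]
               for i in range(3)]
        orthonormal = all(abs(mmt[i][j] - (1.0 if i == j else 0.0)) < tol
                          for i in range(3) for j in range(3))
        det = (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
               - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
               + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
        return orthonormal and abs(det - 1.0) < tol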

I am not sure what a scale of the input could accomplish, unless you wanted it to crop or something. Perhaps it would be better to specify that explicitly? I may be missing something here though.

However, the bigger problems are usability:
  • There's the usual row/column major question, but also the confusion of -Z being the "forward" direction. Even a mathematician like me will have to slow down to get the matrix right... :)
  • Most people can't enter a rotation matrix by hand. You would typically use a tool that took Euler angles or yaw/pitch/roll to build them, so why not leave it there?
  • It's really hard to look at a list of 9 floats and figure out what kind of transform it describes. Even the 3x3 identity is a little strange IMHO: 1 0 0 0 1 0 0 0 1. Anything else gets rapidly harder to read.
In fact I might argue we go further and forbid roll != 0 and |pitch| > 90. I am not aware of any video technique which is capable of doing stereo in more than one orientation, see ODS for instance. With current videos a roll would essentially always make people queasy unless it was very small or the video was generated in some very strange way. The reason for my pitch restriction is that pitch morphs into a 180 degree roll when it goes past 90. It comes back of course. :)

I am OK with leaving roll in for completeness and similarly for allowing large pitches, but with a note saying you likely don't want to use them.

Just to go even one step further in the direction of simplicity, I'd make the yaw/pitch/roll 2-byte integers rather than 16.16 fixed. Given that the whole point of 360 degree video is that people can look around, there is really no need for extreme precision here. People will only be pointed in that initial direction for a frame or two. I'll happily concede this last point though.
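To illustrate the trade-off (a sketch only; these are not spec'd field names or widths): 16.16 fixed point stores each angle in 4 bytes with fractional-degree resolution, while a signed 16-bit integer stores whole degrees in 2 bytes:

    def to_fixed_16_16(deg):
        # signed 16.16 fixed point: integer part in the high 16 bits, fraction in the low 16
        return int(round(deg * 65536))

    def from_fixed_16_16(raw):
        return raw / 65536.0

    def to_int16_degrees(deg):
        # whole degrees only: plenty for an initial viewing direction
        return int(round(deg))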

Vittorio Giovara

May 17, 2016, 3:24:57 PM
to spatial-media-discuss


On Thursday, May 12, 2016 at 6:35:46 PM UTC-4, Robert Suderman wrote:


On Thursday, May 12, 2016 at 2:00:27 PM UTC-7, Vittorio Giovara wrote:
Hello,
please find my comments about the spherical specification RFC.

- Why is stereoscopic information specified in the PRHD box? While I was working on the first version of the spec I was pointed to the fact that the spherical projection and the stereoscopic one are completely distinct and separate. Rendering takes place at two different stages for spherical and for stereoscopic, and you can have spherical videos that are not stereoscopic and stereoscopic videos that are not spherical. Since the stereoscopic information seems mostly unrelated, I would propose to separate it from this specification, moving it to a separate box out of the spherical video realm. Alternatively we could simply refer to already existing stereoscopic specifications (such as the frame packing arrangement from h264/h265) as this would further simplify parsing this box and avoid ambiguity in case both the container and the stream specify stereoscopic rendering.


Just a few thoughts for why we chose to include it in the spherical header. 
  1. We needed to avoid codec-specific stereo metadata (as for h264 and h265). We have many creators working with content in other formats (ProRes for instance) that do not have their own stereo metadata, so we needed some form of new metadata for those formats.
 I assumed this was the case, thanks for clarifying it.
  2. I agree that it would be preferable to have stereo metadata in a separate box adjacent to sv3d.
 Overall I am definitely in favor of having stereo metadata specified separately though there would be additional work in introducing a second spec.

Actually I would keep everything in this specification, but further simplify the RFC by not specifying the SV3D box: I would keep the SVHD, PROJ and the new 3D box at the same level, and have the 3D box contain only "stereo_mode" (monoscopic/left-right/top-bottom) and "aspect_ratio" (SAR of a single view, so that the user can know whether it's a squashed video or two full-size views). There could be other more generic attributes we could think of, but right now I would still keep it tied to our use case.

What do you think? This should avoid the delays of introducing a second spec and at the same time keep a high-level separation of the rendering sequence.
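For concreteness, the nesting I have in mind would look roughly like this (illustration only, written as a Python dict; the FourCC of the new box is deliberately left open):

    proposed_layout = {
        "svhd": {},                     # spherical video header, unchanged
        "proj": {},                     # projection box and its children, unchanged
        "new_3d_box": {                 # stereo metadata as a sibling box, instead of fields in prhd
            "stereo_mode": "left-right",   # monoscopic / left-right / top-bottom
            "aspect_ratio": (1, 2),        # SAR of a single view (example value)
        },
    }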

- The roll, pitch, and yaw are expressed in degrees, but this is rather limiting. A better approach would be to use a display matrix for each element: this would allow us to specify any rotation in any direction without having to specify "clockwise" or "counter-clockwise". Moreover the display matrix extends the rendering capabilities so that if we have to rescale or resize the input view in any way, we would be able to do so without adding fields to the specification.

A few concerns with specifying the rotation as a transform matrix:
  1. While the current method does require an extremely verbose description of yaw, pitch, and roll, it does avoid describing each projection as XYZ
  2. Many operations would have little meaning for a spherical video. For instance the effect of a translation does not have meaning without knowing the expected distance of each point of the projection from the observer.
I am not heavily opposed to including the rotation as a transform, just concerned it could introduce unexpected behavior in unspecified cases.

I agree that specifying the three different rotations is preferable to XYZ. Since there could be two half-resolution videos or two full-size videos, I had thought we could have expressed horizontal or vertical stretching to recreate the original size with the display matrix; on the other hand it would probably overcomplicate the specification, and if we actually split the stereoscopic part into its own box, it's more elegant to use an aspect ratio instead. So overall I do not have a strong argument for using a matrix rather than an angle, except that it would make the specification more future-proof, in case a new projection transform is needed.
 
 
- I find the names for the Equirectangular Projection Box extremely confusing: h264 and several other codecs use a similar naming scheme for completely different purposes (bitstream cropping), while here they refer to a region of interest that should be presented first when loading the video. I would simply suggest changing the names to something different to avoid any doubt about what they are for.

Would updating the naming to something like "frame_crop_left" or "projection_crop_left" help with the clarity? If you have a suggested naming, I would appreciate it.

How about simply "initial_view_{left,top,right,bottom}"? We could still keep the verb crop in the description to explain what they are for.

Best regards,
Vittorio

Robert Suderman

May 19, 2016, 1:30:26 PM
to spatial-media-discuss


On Tuesday, May 17, 2016 at 12:24:57 PM UTC-7, Vittorio Giovara wrote:


On Thursday, May 12, 2016 at 6:35:46 PM UTC-4, Robert Suderman wrote:


On Thursday, May 12, 2016 at 2:00:27 PM UTC-7, Vittorio Giovara wrote:
Hello,
please find my comments about the spherical specification RFC.

- Why is stereoscopic information specified in the PRHD box? While I was working on the first version of the spec I was pointed to the fact that the spherical projection and the stereoscopic one are completely distinct and separate. Rendering takes place at two different stages for spherical and for stereoscopic, and you can have spherical videos that are not stereoscopic and stereoscopic videos that are not spherical. Since the stereoscopic information seems mostly unrelated, I would propose to separate it from this specification, moving it to a separate box out of the spherical video realm. Alternatively we could simply refer to already existing stereoscopic specifications (such as the frame packing arrangement from h264/h265) as this would further simplify parsing this box and avoid ambiguity in case both the container and the stream specify stereoscopic rendering.


Just a few thoughts for why we chose to include it in the spherical header. 
  1. We needed to avoid codec-specific stereo metadata (as for h264 and h265). We have many creators working with content in other formats (ProRes for instance) that do not have their own stereo metadata, so we needed some form of new metadata for those formats.
 I assumed this was the case, thanks for clarifying it.
  2. I agree that it would be preferable to have stereo metadata in a separate box adjacent to sv3d.
 Overall I am definitely in favor of having stereo metadata specified separately though there would be additional work in introducing a second spec.

Actually I would keep everything in this specification, but further simplify the RFC by not specifying the SV3D box: I would keep the SVHD, PROJ and the new 3D box at the same level, and have the 3D box contain only "stereo_mode" (monoscopic/left-right/top-bottom) and "aspect_ratio" (SAR of a single view, so that the user can know whether it's a squashed video or two full-size views). There could be other more generic attributes we could think of, but right now I would still keep it tied to our use case.

What do you think? This should avoid the delays of introducing a second spec and at the same time keep a high-level separation of the rendering sequence.


A couple of thoughts
- Moving the stereo metadata up one level
- I can't see the benefit of the "aspect_ratio" of the sphere. Given that those meshes need to be 3D rendered, it would appear that SAR loses much of its normal meaning. Could you clarify why it would be useful?
- We have tried experimenting with future projections that may have unique ways of handling stereo metadata (beyond left-right, top-bottom), and we would need a way to use the projection-specific stereo mode.
- We have also considered including multiple projections and supporting switching between them depending on the frame. Grouping the stereo mode and projection would simplify switching to a completely new format, as only a single format change would need to occur.
- One other realization we had is that detecting stereo_mode without knowing the projection is not actually useful in most cases. Trying to view most projections on a stereo display actually breaks down, as the left/right offsets in the projection are not guaranteed to line up with the left/right side of the frames (e.g. cubemaps have up / down faces). Even with equirectangular projections (which do line up left / right), the substantial warping on the top / bottom of the frame makes them nearly impossible to fuse.

Sorry for the long list. 

- The roll, pitch, and yaw are expressed in degrees, but this is rather limiting. A better approach would be to use a display matrix for each element: this would allow us to specify any rotation in any direction without having to specify "clockwise" or "counter-clockwise". Moreover the display matrix extends the rendering capabilities so that if we have to rescale or resize the input view in any way, we would be able to do so without adding fields to the specification.

A few concerns with specifying the rotation as a transform matrix:
  1. While the current method does require an extremely verbose description of yaw, pitch, and roll, it does avoid describing each projection as XYZ
  2. Many operations would have little meaning for a spherical video. For instance the effect of a translation does not have meaning without knowing the expected distance of each point of the projection from the observer.
I am not heavily opposed to including the rotation as a transform, just concerned it could introduce unexpected behavior in unspecified cases.

I agree that specifying the three different rotations is preferable to XYZ. Since there could be two half-resolution videos or two full-size videos, I had thought we could have expressed horizontal or vertical stretching to recreate the original size with the display matrix; on the other hand it would probably overcomplicate the specification, and if we actually split the stereoscopic part into its own box, it's more elegant to use an aspect ratio instead. So overall I do not have a strong argument for using a matrix rather than an angle, except that it would make the specification more future-proof, in case a new projection transform is needed.
 

There are actually some neat tricks you can perform with stretching / translating the projection (e.g. you can bias the projection to place more pixels at different regions); however, for the first version we should probably avoid it. I'm going to sketch out a few more drafts of how the rotation matrix may be stored / defined, as it does avoid strange arbitrary conventions.
 
 
- I find the names for the Equirectangular Projection Box extremely confusing: h264 and several other codecs use a similar naming scheme for completely different purposes (bitstream cropping), while here they refer to a region of interest that should be presented first when loading the video. I would simply suggest changing the names to something different to avoid any doubt about what they are for.

Would updating the naming to something like "frame_crop_left" or "projection_crop_left" help with the clarity? If you have a suggested naming, I would appreciate it.

How about simply "initial_view_{left,top,right,bottom}"? We could still keep the verb crop in the description to explain what they are for.


initial_view_ matches the naming of some GPano metadata which controlled the pano orientation, and since GPano was the basis for the V1 spec we would want to avoid conflicts. Would something like "projection_bounds_" or "content_bounds_" conflict less with other mp4 cropping metadata?
 
Best regards,
Vittorio

Vittorio Giovara

May 19, 2016, 5:20:15 PM
to spatial-media-discuss


On Thursday, May 19, 2016 at 1:30:26 PM UTC-4, Robert Suderman wrote:


On Tuesday, May 17, 2016 at 12:24:57 PM UTC-7, Vittorio Giovara wrote:
Actually I would keep everything in this specification, but further simplify the RFC by not specifying the SV3D box: I would keep the SVHD, PROJ and the new 3D box at the same level, and have the 3D box contain only "stereo_mode" (monoscopic/left-right/top-bottom) and "aspect_ratio" (SAR of a single view, so that the user can know whether it's a squashed video or two full-size views). There could be other more generic attributes we could think of, but right now I would still keep it tied to our use case.

What do you think? This should avoid the delays of introducing a second spec and at the same time keep a high-level separation of the rendering sequence.


A couple of thoughts
- Moving the stereo metadata up one level
- I can't see the benefit of the "aspect_ratio" of the sphere. Given that those meshes need to be 3D rendered, it would appear that SAR loses much of its normal meaning. Could you clarify why it would be useful?

Normal stereoscopic videos are created by stitching together two files, but these two files might (or might not) be further squashed to fit in a "standard resolution" frame: for example, a left-right file can be created either by combining two 1920x1080 videos into a 3840x1080 file, or by squashing each view down to 960x1080 before stitching, generating a "frame packed" video of 1920x1080. Since the sample aspect ratio refers *only* to the container frame, a second aspect ratio would be helpful in distinguishing stitched from squashed+stitched. In the previous example, the first case would have a standard aspect ratio of 2:1 and a stereoscopic aspect ratio of 1:1 (meaning that the two embedded frames represent two full-size videos), while the second case would have a standard aspect ratio of 1:1 and a stereoscopic one of 1:2 (meaning that each sample of the embedded frames has to be upsampled to recreate the original videos). If you would prefer to keep it simple, we could just use a boolean value that, when true, notifies the user that the reported SAR actually refers to the upsampled video rather than the container one.
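In rough pseudocode (the per-view flag/field name below is hypothetical), the ambiguity the extra value would resolve is:

    # a 1920x1080 side-by-side file always splits into two 960x1080 halves
    frame_w, frame_h = 1920, 1080
    half_w = frame_w // 2                 # 960
    # Case A: each half really is a 960x1080 view -> render as-is
    # Case B: each half is a 1920x1080 view squashed to 960x1080 -> stretch it back first
    view_is_squashed = True               # the hypothetical flag / second aspect ratio
    view_w = 1920 if view_is_squashed else half_w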
 
- We have tried experimenting with future projections that may have unique ways of handling stereo metadata (beyond left-right, top-bottom), and we would need a way to use the projection-specific stereo mode.
- We have also considered including multiple projections and supporting switching between them depending on the frame. Grouping the stereo mode and projection would simplify switching to a completely new format, as only a single format change would need to occur.
 
I can't comment on unreleased experimental techniques without additional detail, but I would still assume that if there is a need to further expand the box containing stereoscopic information, it would be much simpler if that box were separate and independent of the spherical side. Also I am wondering how you can obtain frame-accurate switching while defining a container specification, and, even in that case, it does sound like something that should be kept well apart from this specification.

- One other realization we had is that detecting stereo_mode without knowing the projection is not actually useful in most cases. Trying to view most projections on a stereo display actually breaks down, as the left/right offsets in the projection are not guaranteed to line up with the left/right side of the frames (e.g. cubemaps have up / down faces). Even with equirectangular projections (which do line up left / right), the substantial warping on the top / bottom of the frame makes them nearly impossible to fuse.

While this is something that might happen, the spherical and stereoscopic renderings are two different presentation steps that affect the output video in different ways, and they bear no correlation to the device they are performed on; for the case at hand, this is usually fixed by watching spherical 3d video with a head-mounted display and distorting lenses. However this looks more like a property of the spherical viewing, or a limitation of the rendering device, than a characteristic of the stereoscopic format. In other words, the stereoscopic box should just signal that two (or more) views are present in the same stream, and the spherical box should just signal that the views have to be mapped to a sphere.
 
I am not heavily opposed to including the rotation as a transform, just concerned it could introduce unexpected behavior in unspecified cases.

I agree that specifying the three different rotations is preferable to XYZ. Since there could be two half-resolution videos or two full-size videos, I had thought we could have expressed horizontal or vertical stretching to recreate the original size with the display matrix; on the other hand it would probably overcomplicate the specification, and if we actually split the stereoscopic part into its own box, it's more elegant to use an aspect ratio instead. So overall I do not have a strong argument for using a matrix rather than an angle, except that it would make the specification more future-proof, in case a new projection transform is needed.
 

There are actually some neat tricks you can perform with stretching / translating the projection (e.g. you can bias the projection to place more pixels at different regions); however, for the first version we should probably avoid it. I'm going to sketch out a few more drafts of how the rotation matrix may be stored / defined, as it does avoid strange arbitrary conventions.

Agreed, let's drop this suggestion.
 
 
- I find the names for the Equirectangular Projection Box extremely confusing: h264 and several other codecs use a similar naming scheme for completely different purposes (bitstream cropping), while here they refer to a region of interest that should be presented first when loading the video. I would simply suggest changing the names to something different to avoid any doubt about what they are for.

Would updating the naming to something like "frame_crop_left" or "projection_crop_left" help with the clarity? If you have a suggested naming, I would appreciate it.

How about simply "initial_view_{left,top,right,bottom}"? We could still keep the verb crop in the description to explain what they are for.


initial_view_ matches the naming of some GPano metadata which controlled the pano orientation, and since GPano was the basis for the V1 spec we would want to avoid conflicts. Would something like "projection_bounds_" or "content_bounds_" conflict less with other mp4 cropping metadata?

The two specs are inherently incompatible, but if you want to avoid any possible naming conflict, projection_bounds_* is fine with me.

Best regards.
Vittorio

Robert Suderman

May 26, 2016, 8:09:19 PM
to spatial-media-discuss
Going to stop inlining to make this a little more succinct.

Not having the container format expect per-frame metadata makes sense, so we can consider per-frame metadata a moot point.

projection_bounds_* sounds good.

Our perspective has been that stereo layout is part of the projection, as we interpret it as a transformation from texture space to the spheres for each eye. I see an ideological improvement in that stereo and projection can be interpreted independently; however, I don't see a technical improvement, as projection-agnostic stereo metadata will likely be limited to left-right, top-bottom, or perhaps per-frame interlacing. I do see the possibility of projection-specific stereo modes, though I imagine their details would be stored in the projection box. Other than the ideological reason, is there something that separating the stereo and projection metadata may provide?

As for the SAR metadata: I still can't see how it would be used during the rendering process. As the frame is split into per-eye views and 3D rendered, I don't see how knowing the SAR would change how a 3D 360 video is rendered.

Vittorio Giovara

Jun 2, 2016, 5:30:08 PM
to spatial-media-discuss
Hello,


On Thursday, May 26, 2016 at 8:09:19 PM UTC-4, Robert Suderman wrote:
Our perspective has been that stereo layout is part of the projection, as we interpret it as a transformation from texture space to the spheres for each eye. I see an ideological improvement in that stereo and projection can be interpreted independently; however, I don't see a technical improvement, as projection-agnostic stereo metadata will likely be limited to left-right, top-bottom, or perhaps per-frame interlacing. I do see the possibility of projection-specific stereo modes, though I imagine their details would be stored in the projection box. Other than the ideological reason, is there something that separating the stereo and projection metadata may provide?

I don't agree with calling these "ideological reasons": the stereoscopic rendering and spherical rendering are two very distinct practical processes that the player has to interpret very differently. However I can see the case for keeping the 3d information and the spherical information in the same specification, but only for a very limited, very narrow use case. The semantics of the "stereo_mode" field only apply to current video files and only to frame-packed stereoscopic videos; this is the fundamental problem to my eyes: a specification has to describe an algorithm or some sort of procedure that is not bound by, and does not artificially limit itself to, what is technically possible at the time of the draft.

What if in the future there is a way to embed two video streams that are not frame-packed but that describe a stereoscopic scene? With the embedded stereo_mode flag there is *no* way to signal this format, and any value we could store there would be either a lie (since it would not be 0. monoscopic, nor 1. left-right, nor 2. top-bottom) or extremely confusing to describe (3. two streams in the same container, 4. single stream with multiple layers, or 5. implementation dependent). I can see the opposing argument that these kinds of files could be described at a later stage, but why should we ratify a draft that already envisions updates? Also why should the *spherical* specification be modified for new *stereoscopic* formats?

If you need an actual example, how would an MVC (or HSVC or VP10-Multiview) file work with the current draft? Right now stereo_mode would be incorrect for any value used. On the other hand, if the stereoscopic metadata were in a different specification, or at least in a different box, the container would just need to be marked as spherical and the decoder would take care of the stereoscopic rendering independently. This in my opinion respects the goal of keeping the specification separate and simple, and would allow keeping the stereoscopic and spherical sections well distinct.
 
As for the SAR metadata. I still can't see how the metadata would be used during the rendering process. As the frame is split to either eye independently and is 3D rendered I don't see how knowing the SAR would change how a 3D 360 video is rendered.

This value would mostly apply to traditional frame-packed video; here is an example where it would matter the most: you receive a 1920x1080 file and you know it's a side-by-side stereoscopic file. How can you tell whether the video surface contains two 960x1080 squashed videos (https://www.youtube.com/watch?v=vjxd3TxQ-s0) or two full-size anamorphic videos (https://www.youtube.com/watch?v=5IcOrnpqFmU)? You need to look at the SAR of course, but you can never be sure whether it applies to the container frame or to the two independent views. In the previous example, in the first case you'd have a SAR of 2:1 which clearly applies to the container frame, but in the second case you might get a SAR value of either 1:1 (meaning you don't have to resize the views) or of 16:9 (meaning the single view is anamorphic). In that second case you'd want a secondary SAR or a boolean value that helps you select the correct SAR.

Robert Suderman

Jun 10, 2016, 3:50:48 PM
to spatial-media-discuss


On Thursday, June 2, 2016 at 2:30:08 PM UTC-7, Vittorio Giovara wrote:
Hello,

On Thursday, May 26, 2016 at 8:09:19 PM UTC-4, Robert Suderman wrote:
Our perspective has been that stereo layout is part of the projection, as we interpret it as a transformation from texture space to the spheres for each eye. I see an ideological improvement in that stereo and projection can be interpreted independently; however, I don't see a technical improvement, as projection-agnostic stereo metadata will likely be limited to left-right, top-bottom, or perhaps per-frame interlacing. I do see the possibility of projection-specific stereo modes, though I imagine their details would be stored in the projection box. Other than the ideological reason, is there something that separating the stereo and projection metadata may provide?

I don't agree with calling these "ideological reasons": the stereoscopic rendering and spherical rendering are two very distinct practical processes that the player has to interpret very differently. However I can see the case for keeping the 3d information and the spherical information in the same specification, but only for a very limited, very narrow use case. The semantics of the "stereo_mode" field only apply to current video files and only to frame-packed stereoscopic videos; this is the fundamental problem to my eyes: a specification has to describe an algorithm or some sort of procedure that is not bound by, and does not artificially limit itself to, what is technically possible at the time of the draft.

What if in the future there is a way to embed two video streams that are not frame-packed but that describe a stereoscopic scene? With the embedded stereo_mode flag there is *no* way to signal this format, and any value we could store there would be either a lie (since it would not be 0. monoscopic, nor 1. left-right, nor 2. top-bottom) or extremely confusing to describe (3. two streams in the same container, 4. single stream with multiple layers, or 5. implementation dependent). I can see the opposing argument that these kinds of files could be described at a later stage, but why should we ratify a draft that already envisions updates? Also why should the *spherical* specification be modified for new *stereoscopic* formats?

If you need an actual example, how would an MVC (or HSVC or VP10-Multiview) file work with the current draft? Right now stereo_mode would be incorrect for any value used. On the other hand, if the stereoscopic metadata were in a different specification, or at least in a different box, the container would just need to be marked as spherical and the decoder would take care of the stereoscopic rendering independently. This in my opinion respects the goal of keeping the specification separate and simple, and would allow keeping the stereoscopic and spherical sections well distinct.
 

Could you write up an example of the container layout for the Spherical Metadata with the stereo changes? I think it may help me better understand the changes you are proposing.
 
Also, could you give me an example of how this new spherical video metadata would work with MVC? 

As for the SAR metadata: I still can't see how it would be used during the rendering process. As the frame is split into per-eye views and 3D rendered, I don't see how knowing the SAR would change how a 3D 360 video is rendered.

This value would mostly apply to traditional frame-packed video; here is an example where it would matter the most: you receive a 1920x1080 file and you know it's a side-by-side stereoscopic file. How can you tell whether the video surface contains two 960x1080 squashed videos (https://www.youtube.com/watch?v=vjxd3TxQ-s0) or two full-size anamorphic videos (https://www.youtube.com/watch?v=5IcOrnpqFmU)? You need to look at the SAR of course, but you can never be sure whether it applies to the container frame or to the two independent views. In the previous example, in the first case you'd have a SAR of 2:1 which clearly applies to the container frame, but in the second case you might get a SAR value of either 1:1 (meaning you don't have to resize the views) or of 16:9 (meaning the single view is anamorphic). In that second case you'd want a secondary SAR or a boolean value that helps you select the correct SAR.

I see how your example above affects regular 2D video files; however, I do not see how it affects 360 videos. For now can we defer on SAR until we resolve stereo metadata?

Vittorio Giovara

Jun 20, 2016, 3:35:52 PM
to spatial-media-discuss


On Friday, June 10, 2016 at 3:50:48 PM UTC-4, Robert Suderman wrote:


On Thursday, June 2, 2016 at 2:30:08 PM UTC-7, Vittorio Giovara wrote:
Hello,

On Thursday, May 26, 2016 at 8:09:19 PM UTC-4, Robert Suderman wrote:
Our perspective has been that stereo layout is part of the projection, as we interpret it as a transformation from texture space to the spheres for each eye. I see an ideological improvement in that stereo and projection can be interpreted independently; however, I don't see a technical improvement, as projection-agnostic stereo metadata will likely be limited to left-right, top-bottom, or perhaps per-frame interlacing. I do see the possibility of projection-specific stereo modes, though I imagine their details would be stored in the projection box. Other than the ideological reason, is there something that separating the stereo and projection metadata may provide?

I don't agree with calling these "ideological reasons": the stereoscopic rendering and spherical rendering are two very distinct practical processes that the player has to interpret very differently. However I can see the case for keeping the 3d information and the spherical information in the same specification, but only for a very limited, very narrow use case. The semantics of the "stereo_mode" field only apply to current video files and only to frame-packed stereoscopic videos; this is the fundamental problem to my eyes: a specification has to describe an algorithm or some sort of procedure that is not bound by, and does not artificially limit itself to, what is technically possible at the time of the draft.

What if in the future there is a way to embed two video streams that are not frame-packed but that describe a stereoscopic scene? With the embedded stereo_mode flag there is *no* way to signal this format, and any value we could store there would be either a lie (since it would not be 0. monoscopic, nor 1. left-right, nor 2. top-bottom) or extremely confusing to describe (3. two streams in the same container, 4. single stream with multiple layers, or 5. implementation dependent). I can see the opposing argument that these kinds of files could be described at a later stage, but why should we ratify a draft that already envisions updates? Also why should the *spherical* specification be modified for new *stereoscopic* formats?

If you need an actual example, how would an MVC (or HSVC or VP10-Multiview) file work with the current draft? Right now stereo_mode would be incorrect for any value used. On the other hand, if the stereoscopic metadata were in a different specification, or at least in a different box, the container would just need to be marked as spherical and the decoder would take care of the stereoscopic rendering independently. This in my opinion respects the goal of keeping the specification separate and simple, and would allow keeping the stereoscopic and spherical sections well distinct.
 

Could you write up an example of the container layout for the Spherical Metadata with the stereo changes? I think it may help me better understand the changes you are proposing.

I created a PR on github; hopefully it conveys what I'd like to see changed.
https://github.com/google/spatial-media/pull/104
I would be ok even if this was specified one level down (e.g. inside sv3d).
 
Also, could you give me an example of how this new spherical video metadata would work with MVC? 

Sure: the decoder would determine whether to enable stereoscopic rendering based on the SEI messages present in the bitstream, then it would read the container metadata needed for spherical rendering. If the decoder does not support MVC, then it would only read the container spherical metadata and obtain a monoscopic version of the video.
A similar reasoning may be applied to a "classic" frame-packed 3d video: the decoder would first need to determine if there is any stereoscopic information in the bitstream (for example an H264 frame packing SEI message); if absent, it would then check for the presence of the new stereoscopic box. Finally it would read the rest of the spherical metadata as needed and enable the required rendering.

In all cases described the new specification is not violated, and there is no ambiguity.
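A rough sketch of that fallback order (all names below are placeholders, not real parser APIs):

    def select_stereo_mode(bitstream_stereo, container_stereo):
        # bitstream_stereo: frame packing SEI / MVC signalling from the codec, or None
        # container_stereo: value of the proposed container-level stereo box, or None
        if bitstream_stereo is not None:     # the bitstream takes priority (MVC, SEI, ...)
            return bitstream_stereo
        if container_stereo is not None:     # otherwise fall back to the container box
            return container_stereo
        return "monoscopic"

    # an MVC file with no container stereo box: the decoder handles stereo on its own
    print(select_stereo_mode("mvc-multiview", None))   # -> "mvc-multiview"
    # a ProRes frame-packed file relying on the new stereo box
    print(select_stereo_mode(None, "left-right"))      # -> "left-right"
    # the spherical metadata is read separately in every case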
 
As for the SAR metadata: I still can't see how it would be used during the rendering process. As the frame is split into per-eye views and 3D rendered, I don't see how knowing the SAR would change how a 3D 360 video is rendered.

This value would mostly apply to traditional frame-packed video; here is an example where it would matter the most: you receive a 1920x1080 file and you know it's a side-by-side stereoscopic file. How can you tell whether the video surface contains two 960x1080 squashed videos (https://www.youtube.com/watch?v=vjxd3TxQ-s0) or two full-size anamorphic videos (https://www.youtube.com/watch?v=5IcOrnpqFmU)? You need to look at the SAR of course, but you can never be sure whether it applies to the container frame or to the two independent views. In the previous example, in the first case you'd have a SAR of 2:1 which clearly applies to the container frame, but in the second case you might get a SAR value of either 1:1 (meaning you don't have to resize the views) or of 16:9 (meaning the single view is anamorphic). In that second case you'd want a secondary SAR or a boolean value that helps you select the correct SAR.

I see how your example above affects regular 2D video files; however, I do not see how it affects 360 videos. For now can we defer on SAR until we resolve stereo metadata?

Yes, it's something that should be addressed only if/when there is consensus on separating the metadata.

Thank you,
Vittorio

Robert Suderman

Jun 28, 2016, 12:48:24 PM
to spatial-media-discuss
Hey Vittorio,

The change looks good to us, so I'm going to do a quick read-through and merge it in. Thanks!
