Hello,
please find my comments about the spherical specification RFC.
- Why is stereoscopic information specified in the PRHD box? While I was working on the first version of the spec I was pointed to the fact that the spherical projection and the stereoscopic one are completely distinct and separate. Rendering takes place at two different stages for spherical and for stereoscopic, and you can have spherical videos that are not stereoscopic and stereoscopic videos that are not spherical. Since the stereoscopic information seems mostly unrelated, I would propose separating it from this specification and moving it to a separate box outside the spherical video realm. Alternatively, we could simply refer to already existing stereoscopic specifications (such as the frame packing arrangement from h264/h265), as this would further simplify parsing this box and avoid ambiguity in case both the container and the stream specify stereoscopic rendering.
- The roll, pitch, and yaw are expressed in degrees, but this is rather limiting. A better approach would be to use a display matrix for each element: this would allow us to specify any rotation in any direction without having to specify "clockwise" or "counter-clockwise" (see the sketch after this list). Moreover, the display matrix extends the rendering capabilities, so that if we have to rescale or resize the input view in any way, we would be able to do so without adding fields to the specification.
- I find the names in the Equirectangular Projection Box extremely confusing: h264 and several other codecs use a similar naming scheme for completely different purposes (bitstream cropping), while here they refer to a region of interest that should be presented first when loading the video. I would simply suggest changing the names to something different to avoid any doubt about what they are for.
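To make the display matrix point above concrete, here is a quick sketch of folding yaw/pitch/roll in degrees into a single 3x3 rotation matrix; the axis assignment and composition order are purely illustrative (the spec would have to pin them down), not something taken from the RFC.

    import math

    def rotation_matrix(yaw_deg, pitch_deg, roll_deg):
        # Compose a 3x3 rotation from yaw (Y axis), pitch (X axis) and roll
        # (Z axis), applied in that order. With a matrix there is no need to
        # spell out clockwise vs counter-clockwise: the signs of the entries
        # already carry that information.
        y, p, r = (math.radians(a) for a in (yaw_deg, pitch_deg, roll_deg))
        ry = [[math.cos(y), 0.0, math.sin(y)],
              [0.0, 1.0, 0.0],
              [-math.sin(y), 0.0, math.cos(y)]]
        rx = [[1.0, 0.0, 0.0],
              [0.0, math.cos(p), -math.sin(p)],
              [0.0, math.sin(p), math.cos(p)]]
        rz = [[math.cos(r), -math.sin(r), 0.0],
              [math.sin(r), math.cos(r), 0.0],
              [0.0, 0.0, 1.0]]
        mul = lambda a, b: [[sum(a[i][k] * b[k][j] for k in range(3))
                             for j in range(3)] for i in range(3)]
        return mul(mul(ry, rx), rz)

The same nine numbers could also absorb a per-view rescale, which is what makes the matrix approach more future proof.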
Best regards
Vittorio
On Thursday, May 12, 2016 at 2:00:27 PM UTC-7, Vittorio Giovara wrote:
Hello,
please find my comments about the spherical specification RFC.
- Why is stereoscopic information specified in the PRHD box? While I was working on the first version of the spec I was pointed to the fact that the spherical projection and the stereoscopic one are completely distinct and separate. Rendering takes place at two different stages for spherical and for stereoscopic, and you can have spherical videos that are not stereoscopic and stereoscopic videos that are not spherical. Since the stereoscopic information seems mostly unrelated, I would propose separating it from this specification and moving it to a separate box outside the spherical video realm. Alternatively, we could simply refer to already existing stereoscopic specifications (such as the frame packing arrangement from h264/h265), as this would further simplify parsing this box and avoid ambiguity in case both the container and the stream specify stereoscopic rendering.
Just a few thoughts for why we chose to include it in the spherical header.
- We needed to avoid codec-specific stereo metadata (as for h264 and h265). We have many creators working with content in other formats (ProRes, for instance) that do not have their own stereo metadata. Since we needed a solution for these formats, some form of new metadata was required.
- I agree that it would be preferable to have stereo metadata in a separate box adjacent to sv3d.
Overall I am definitely in favor of having stereo metadata specified separately, though there would be additional work in introducing a second spec.
- The roll, pitch, and yaw are expressed in degrees, but this is rather limiting. A better approach would be to use a display matrix for each element: this would allow us to specify any rotation in any direction without having to specify "clockwise" or "counter-clockwise". Moreover, the display matrix extends the rendering capabilities, so that if we have to rescale or resize the input view in any way, we would be able to do so without adding fields to the specification.
A few concerns with specifying the rotation as a transform matrix:
- While the current method does require an extremely verbose description of yaw, pitch, and roll, it does avoid describing each projection in terms of XYZ axes.
- Many operations would have little meaning for a spherical video. For instance, the effect of a translation has no meaning without knowing the expected distance of each point of the projection from the observer.
I am not heavily opposed to including the rotation as a transform, just concerned it could introduce unexpected behavior in unspecified cases.
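If a matrix were adopted, one way to contain that concern would be to require players to validate it and reject anything that is not a pure rotation. A minimal sketch, assuming a plain row-major 3x3 matrix (my assumption for illustration, not something the draft defines):

    def is_pure_rotation(m, eps=1e-6):
        # A rotation matrix is orthonormal (M * M^T == I) and has determinant +1.
        # Anything else (scale, shear, reflection) gets rejected, so undefined
        # behavior never reaches the renderer.
        for i in range(3):
            for j in range(3):
                dot = sum(m[i][k] * m[j][k] for k in range(3))
                if abs(dot - (1.0 if i == j else 0.0)) > eps:
                    return False
        det = (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
               - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
               + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
        return abs(det - 1.0) <= eps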
- I find the names in the Equirectangular Projection Box extremely confusing: h264 and several other codecs use a similar naming scheme for completely different purposes (bitstream cropping), while here they refer to a region of interest that should be presented first when loading the video. I would simply suggest changing the names to something different to avoid any doubt about what they are for.
Would updating the naming to something like "frame_crop_left" or "projection_crop_left" help with the clarity? If you have a suggested naming, I would appreciate it.
On Thursday, May 12, 2016 at 6:35:46 PM UTC-4, Robert Suderman wrote:
On Thursday, May 12, 2016 at 2:00:27 PM UTC-7, Vittorio Giovara wrote:
Hello,
please find my comments about the spherical specification RFC.
- Why is stereoscopic information specified in the PRHD box? While I was working on the first version of the spec I was pointed to the fact that the spherical projection and the stereoscopic one are completely distinct and separate. Rendering takes place at two different stages for spherical and for stereoscopic, and you can have spherical videos that are not stereoscopic and stereoscopic videos that are not spherical. Since the stereoscopic information seems mostly unrelated, I would propose separating it from this specification and moving it to a separate box outside the spherical video realm. Alternatively, we could simply refer to already existing stereoscopic specifications (such as the frame packing arrangement from h264/h265), as this would further simplify parsing this box and avoid ambiguity in case both the container and the stream specify stereoscopic rendering.
Just a few thoughts for why we chose to include it in the spherical header.
- We needed to avoid codec-specific stereo metadata (as for h264 and h265). We have many creators working with content in other formats (ProRes, for instance) that do not have their own stereo metadata. Since we needed a solution for these formats, some form of new metadata was required.
I assumed this was the case, thanks for clarifying it.
- I agree that it would be preferable to have stereo metadata in a separate box adjacent to sv3d.
Overall I am definitely in favor of having stereo metadata specified separately, though there would be additional work in introducing a second spec.
Actually I would keep everything in this specification but further simplify the RFC by not specifying the SV3D box: I would keep the SVHD, PROJ, and the new 3D box at the same level, and have the 3D box contain only "stereo_mode" (monoscopic/left-right/top-bottom) and "aspect_ratio" (the SAR of a single view, so that the user can know whether it's a squashed video or two full-size views). There could be other more generic attributes we could think of, but right now I would still keep it tied to our use case.
What do you think? This should avoid introducing the delays of a second spec and at the same time keep a high-level separation of the rendering sequence.
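For illustration, this is roughly the flattened layout I am picturing; the "st3d" name, the field encodings, and the nesting of the projection children are placeholders of mine, not something the draft defines:

    # Hypothetical sketch: SVHD, PROJ and the new 3D box become siblings and the
    # SV3D wrapper goes away. Names other than svhd/proj/prhd/equi are invented
    # here purely for discussion.
    proposed_layout = {
        "svhd": {},                        # spherical video header, as in the draft
        "proj": {"prhd": {}, "equi": {}},  # projection boxes; nesting assumed
        "st3d": {                          # new stereoscopic box, same level
            "stereo_mode": 1,              # 0 = monoscopic, 1 = left-right, 2 = top-bottom
            "aspect_ratio": (16, 9),       # SAR of a single view
        },
    }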
- The roll, pitch, and yaw are expressed in degrees, but this is rather limiting. A better approach would be to use a display matrix for each element: this would allow us to specify any rotation in any direction without having to specify "clockwise" or "counter-clockwise". Moreover, the display matrix extends the rendering capabilities, so that if we have to rescale or resize the input view in any way, we would be able to do so without adding fields to the specification.
A few concerns with specifying the rotation as a transform matrix:
- While the current method does require an extremely verbose description of yaw, pitch, and roll, it does avoid describing each projection in terms of XYZ axes.
- Many operations would have little meaning for a spherical video. For instance, the effect of a translation has no meaning without knowing the expected distance of each point of the projection from the observer.
I am not heavily opposed to including the rotation as a transform, just concerned it could introduce unexpected behavior in unspecified cases.
I agree that specifying the three different rotations is preferable to XYZ. Since there could be two half-resolution videos or two full-size videos, I had thought we could have used the display matrix to express the horizontal or vertical stretching needed to recreate the original size; on the other hand it would probably overcomplicate the specification, and if we actually split the stereoscopic part into its own box, it's more elegant to use an aspect ratio instead. So overall I do not have a strong argument for using a matrix rather than an angle, except that it would make the specification more future proof in case a new projection transform is needed.
- I find the names in the Equirectangular Projection Box extremely confusing: h264 and several other codecs use a similar naming scheme for completely different purposes (bitstream cropping), while here they refer to a region of interest that should be presented first when loading the video. I would simply suggest changing the names to something different to avoid any doubt about what they are for.
Would updating the naming to something like "frame_crop_left" or "projection_crop_left" help with the clarity? If you have a suggested naming, I would appreciate it.
How about simply "initial_view_{left,top,right,bottom}"? We could still keep the verb "crop" in the description to explain what they are for.
Best regards,
Vittorio
On Tuesday, May 17, 2016 at 12:24:57 PM UTC-7, Vittorio Giovara wrote:
Actually I would keep everything in this specification but further simplify the RFC by not specifying the SV3D box: I would keep the SVHD, PROJ, and the new 3D box at the same level, and have the 3D box contain only "stereo_mode" (monoscopic/left-right/top-bottom) and "aspect_ratio" (the SAR of a single view, so that the user can know whether it's a squashed video or two full-size views). There could be other more generic attributes we could think of, but right now I would still keep it tied to our use case.
What do you think? This should avoid introducing the delays of a second spec and at the same time keep a high-level separation of the rendering sequence.
A couple of thoughts on moving the stereo metadata up one level:
- I can't see the benefit of the "aspect_ratio" of the sphere. Given those meshes need to be 3D rendered, it would appear that SAR loses much of its normal meaning. Could you clarify why it would be useful?
- We have tried experimenting with future projections that may have unique ways of handling stereo metadata (beyond left-right and top-bottom), and we would need a way to use a projection-specific stereo mode.
- We have also considered including multiple projections and supporting switching between them depending on the frame. Grouping the stereo mode and projection would simplify switching to a completely new format, as only a single format change would need to occur.
- One other realization was that detecting stereo_mode without knowing the projection is not actually useful in most cases. Trying to view most projections on a stereo display actually breaks down, as the left/right offsets in the projection are not guaranteed to line up with the left/right sides of the frames (e.g. cubemaps have up/down faces). Even with equirectangular projections (which do line up left/right), the substantial warping at the top/bottom of the frame makes them nearly impossible to fuse.
I am not heavily opposed to including the rotation as a transform, just concerned it could introduce unexpected behavior in unspecified cases.
I agree that specifying the three different rotations is preferable to XYZ. Since there could be two half-resolution videos or two full-size videos, I had thought we could have used the display matrix to express the horizontal or vertical stretching needed to recreate the original size; on the other hand it would probably overcomplicate the specification, and if we actually split the stereoscopic part into its own box, it's more elegant to use an aspect ratio instead. So overall I do not have a strong argument for using a matrix rather than an angle, except that it would make the specification more future proof in case a new projection transform is needed.
There are actually some neat tricks you can perform by stretching/translating the projection (e.g. you can bias the projection to place more pixels in different regions); however, for the first version we should probably avoid it. I'm going to sketch up a few more drafts of how the rotation matrix may be stored/defined, as it does avoid strange arbitrary conventions.
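Just so there is something concrete to react to, one possible storage format would be nine signed 16.16 fixed-point values, loosely modeled on the fixed-point matrix fields already used elsewhere in ISO BMFF; this is only a sketch of mine, not a draft of the box:

    import struct

    def pack_rotation_16_16(m):
        # Serialize a row-major 3x3 rotation matrix as nine signed 16.16
        # fixed-point values, big-endian (36 bytes).
        vals = [int(round(v * 65536)) for row in m for v in row]
        return struct.pack(">9i", *vals)

    def unpack_rotation_16_16(buf):
        vals = struct.unpack(">9i", buf)
        return [[vals[r * 3 + c] / 65536.0 for c in range(3)] for r in range(3)]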
- I find the names in the Equirectangular Projection Box extremely confusing: h264 and several other codecs use a similar naming scheme for completely different purposes (bitstream cropping), while here they refer to a region of interest that should be presented first when loading the video. I would simply suggest changing the names to something different to avoid any doubt about what they are for.
Would updating the naming to something like "frame_crop_left" or "projection_crop_left" help with the clarity? If you have a suggested naming, I would appreciate it.
How about simply "initial_view_{left,top,right,bottom}"? We could still keep the verb "crop" in the description to explain what they are for.
initial_view_ matches the naming of some GPano metadata which controlled the pano orientation, and considering it was the basis for the V1 spec we would want to avoid conflicts. Would something like "projection_bounds_" or "content_bounds_" conflict less with other mp4 cropping metadata?
Our perspective has been that stereo layout is part of the projection, as we interpret it as a transformation from texture space to the spheres for each eye. I see an ideological improvement in that stereo and projection can be interpreted independently; however, I don't see a technical improvement, as projection-agnostic stereo metadata will likely be limited to left-right, top-bottom, or perhaps per-frame interlacing. I do see the possibility of projection-specific stereo modes, though I imagine their details would be stored in the projection box. Other than the ideological reason, is there something that separating the stereo and projection metadata may provide?
As for the SAR metadata: I still can't see how it would be used during the rendering process. As the frame is split to either eye independently and is 3D rendered, I don't see how knowing the SAR would change how a 3D 360 video is rendered.
Hello,
On Thursday, May 26, 2016 at 8:09:19 PM UTC-4, Robert Suderman wrote:
Our perspective has been that stereo layout is part of the projection, as we interpret it as a transformation from texture space to the spheres for each eye. I see an ideological improvement in that stereo and projection can be interpreted independently; however, I don't see a technical improvement, as projection-agnostic stereo metadata will likely be limited to left-right, top-bottom, or perhaps per-frame interlacing. I do see the possibility of projection-specific stereo modes, though I imagine their details would be stored in the projection box. Other than the ideological reason, is there something that separating the stereo and projection metadata may provide?
I don't agree with framing these as "ideological reasons": stereoscopic rendering and spherical rendering are two very distinct practical processes that the player has to interpret very differently. That said, I can see keeping the 3D information and the spherical information in the same specification, but only for a very limited, very narrow use case. The semantics of the "stereo_mode" field only apply to current video files and only to frame-packed stereoscopic videos; this is the fundamental problem in my eyes: a specification has to describe an algorithm or some sort of procedure that is not bound to, and does not artificially limit itself to, what is technically possible at the time of the draft.
What if in the future there is a way to embed two video streams that are not frame-packed but that describe a stereoscopic scene? With the embedded stereo_mode flag there is *no* way to signal this format, and any value we could store there would be either a lie (since it would be neither 0/monoscopic, nor 1/left-right, nor 2/top-bottom) or extremely confusing to describe (3: two streams in the same container, 4: single stream with multiple layers, or 5: implementation dependent). I can see the opposing argument that these kinds of files could be described at a later phase, but why should we ratify a draft that already envisions updates? Also, why should the *spherical* specification be modified for new *stereoscopic* formats?
If you need an actual example, how would an MVC (or HSVC or VP10-Multiview) file work with the current draft? Right now stereo_mode would be incorrect for any value used. On the other hand, if the stereoscopic metadata were in a different specification, or at least in a different box, the container would just need to be marked as spherical and the decoder would take care of the stereoscopic rendering independently. This, in my opinion, respects the semantics of keeping the specification separate and simple, and would keep the stereoscopic and spherical sections well distinct.
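To make the separation concrete, here is a toy dispatch a player could use if the frame-packing flag stayed container-level while multiview stereo stayed with the decoder; the names and values are mine, purely for illustration:

    # Frame-packing values as in the current draft; multiview (MVC-like)
    # streams simply cannot be expressed by them.
    FRAME_PACKED_MODES = {0: "monoscopic", 1: "left-right", 2: "top-bottom"}

    def pick_stereo_handler(container_stereo_mode, decoder_is_multiview):
        if decoder_is_multiview:
            # An MVC-style stream already outputs one picture per eye, so any
            # container-level frame-packing value would be wrong or misleading.
            return "decoder"
        return FRAME_PACKED_MODES.get(container_stereo_mode, "unknown")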
As for the SAR metadata: I still can't see how it would be used during the rendering process. As the frame is split to either eye independently and is 3D rendered, I don't see how knowing the SAR would change how a 3D 360 video is rendered.
This value would mostly apply to traditional frame-packed video; here is an example where it would matter the most: you receive a 1920x1080 file and you know it's a side-by-side stereoscopic file. How can you tell whether the video surface contains two 960x1080 squashed videos (https://www.youtube.com/watch?v=vjxd3TxQ-s0) or two full-size anamorphic videos (https://www.youtube.com/watch?v=5IcOrnpqFmU)? You need to look at the SAR of course, but you can never be sure whether it applies to the container frame or to the two independent views. In the previous example, in the first case you'd have a SAR of 2:1 which clearly applies to the container frame, but in the second case you might get a SAR value of either 1:1 (meaning you don't have to resize the views) or 16:9 (meaning the single view is anamorphic). In the second case you'd want a secondary SAR, or a boolean value that helps you select the correct SAR.
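A small worked sketch of the arithmetic, under the assumption (which is exactly what the metadata cannot currently confirm) that the signalled SAR applies to a single view rather than to the packed frame:

    def view_display_size(frame_w, frame_h, packing, sar):
        # Display size of one view of a frame-packed picture, applying a
        # per-view SAR of num:den to its width. Whether the SAR really is
        # per-view is the ambiguity described above.
        view_w = frame_w // 2 if packing == "left-right" else frame_w
        view_h = frame_h // 2 if packing == "top-bottom" else frame_h
        num, den = sar
        return (view_w * num // den, view_h)

    # 1920x1080 side-by-side with per-view SAR 2:1: each 960x1080 half is
    # stretched back to a full 1920x1080 view.
    print(view_display_size(1920, 1080, "left-right", (2, 1)))  # (1920, 1080)
    # Same geometry with SAR 1:1: the halves are shown as-is, 960x1080.
    print(view_display_size(1920, 1080, "left-right", (1, 1)))  # (960, 1080)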
On Thursday, June 2, 2016 at 2:30:08 PM UTC-7, Vittorio Giovara wrote:
Hello,
On Thursday, May 26, 2016 at 8:09:19 PM UTC-4, Robert Suderman wrote:
Our perspective has been that stereo layout is part of the projection, as we interpret it as a transformation from texture space to the spheres for each eye. I see an ideological improvement in that stereo and projection can be interpreted independently; however, I don't see a technical improvement, as projection-agnostic stereo metadata will likely be limited to left-right, top-bottom, or perhaps per-frame interlacing. I do see the possibility of projection-specific stereo modes, though I imagine their details would be stored in the projection box. Other than the ideological reason, is there something that separating the stereo and projection metadata may provide?
I don't agree with framing these as "ideological reasons": stereoscopic rendering and spherical rendering are two very distinct practical processes that the player has to interpret very differently. That said, I can see keeping the 3D information and the spherical information in the same specification, but only for a very limited, very narrow use case. The semantics of the "stereo_mode" field only apply to current video files and only to frame-packed stereoscopic videos; this is the fundamental problem in my eyes: a specification has to describe an algorithm or some sort of procedure that is not bound to, and does not artificially limit itself to, what is technically possible at the time of the draft.
What if in the future there is a way to embed two video streams that are not frame-packed but that describe a stereoscopic scene? With the embedded stereo_mode flag there is *no* way to signal this format, and any value we could store there would be either a lie (since it would be neither 0/monoscopic, nor 1/left-right, nor 2/top-bottom) or extremely confusing to describe (3: two streams in the same container, 4: single stream with multiple layers, or 5: implementation dependent). I can see the opposing argument that these kinds of files could be described at a later phase, but why should we ratify a draft that already envisions updates? Also, why should the *spherical* specification be modified for new *stereoscopic* formats?
If you need an actual example, how would an MVC (or HSVC or VP10-Multiview) file work with the current draft? Right now stereo_mode would be incorrect for any value used. On the other hand, if the stereoscopic metadata were in a different specification, or at least in a different box, the container would just need to be marked as spherical and the decoder would take care of the stereoscopic rendering independently. This, in my opinion, respects the semantics of keeping the specification separate and simple, and would keep the stereoscopic and spherical sections well distinct.
Could you write up an example of the container layout for the Spherical Metadata with the stereo changes? I think it may help me better understand the changes you are proposing.
Also, could you give me an example of how this new spherical video metadata would work with MVC?
As for the SAR metadata: I still can't see how it would be used during the rendering process. As the frame is split to either eye independently and is 3D rendered, I don't see how knowing the SAR would change how a 3D 360 video is rendered.
This value would mostly apply to traditional frame-packed video; here is an example where it would matter the most: you receive a 1920x1080 file and you know it's a side-by-side stereoscopic file. How can you tell whether the video surface contains two 960x1080 squashed videos (https://www.youtube.com/watch?v=vjxd3TxQ-s0) or two full-size anamorphic videos (https://www.youtube.com/watch?v=5IcOrnpqFmU)? You need to look at the SAR of course, but you can never be sure whether it applies to the container frame or to the two independent views. In the previous example, in the first case you'd have a SAR of 2:1 which clearly applies to the container frame, but in the second case you might get a SAR value of either 1:1 (meaning you don't have to resize the views) or 16:9 (meaning the single view is anamorphic). In the second case you'd want a secondary SAR, or a boolean value that helps you select the correct SAR.
I see how your example above affects regular 2D video files; however, I do not see how it affects 360 videos. For now, can we defer on SAR until we resolve stereo metadata?