Vimeo90k Dataset

Edilma Howard

Aug 3, 2024, 11:20:15 AM8/3/24
to toppgafchiecchar

Vimeo-90K is a large-scale, high-quality video dataset for low-level video processing. It targets three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.

Many video processing algorithms rely on optical flow to register different frames within a sequence. However, a precise estimation of optical flow is often neither tractable nor optimal for a particular task. In this paper, we propose task-oriented flow (TOFlow), a flow representation tailored for specific video processing tasks. We design a neural network with a motion estimation component and a video processing component. These two parts can be jointly trained in a self-supervised manner to facilitate learning of the proposed TOFlow. We demonstrate that TOFlow outperforms the traditional optical flow on three different video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution. We also introduce Vimeo-90K, a large-scale, high-quality video dataset for video processing to better evaluate the proposed algorithm.

Out of curiosity, I looked a bit deeper into the Vimeo90K dataset. From what I could gather, it was created at MIT ( ) and consists of roughly 89,800 clips semi-randomly downloaded from Vimeo. The link has the list. I checked three of them out: one was from a random video production company, another was out-takes of a wedding video, and the third was actually an animation.

This allows for fast training, since the image patches have already been extracted and shuffled. The numpy array is expected to have the following size: NxHxWx3, with N the number of samples and H and W the image dimensions.
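A minimal sketch of building and reading such an array, assuming illustrative sizes and the hypothetical file name "patches.npy" (not specified in the original):

```python
import numpy as np

# Hypothetical sketch: pack pre-extracted, shuffled RGB patches into the
# expected N x H x W x 3 array; sizes and file name are assumptions.
N, H, W = 1000, 64, 64
patches = np.random.randint(0, 256, size=(N, H, W, 3), dtype=np.uint8)
np.save("patches.npy", patches)

# mmap_mode="r" loads lazily, so the full array never has to fit in RAM.
data = np.load("patches.npy", mmap_mode="r")
batch = data[:32]  # first mini-batch of 32 patches
print(batch.shape)  # (32, 64, 64, 3)
```

Memory-mapping is what makes the "fast training" claim practical: each mini-batch is read from disk on demand instead of materializing all N patches at once.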

This dataset of 3D CAD models of objects was introduced by [Wu2015] and consists of 10 or 40 classes, with 4,899 and 12,311 aligned items, respectively. Each 3D model is represented in the OFF file format by a triangle mesh (i.e. faces) and has a single label (e.g. airplane). To convert the triangle meshes to point clouds, one may use a mesh sampling method (e.g. SamplePoints).
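Mesh-sampling transforms like SamplePoints typically draw points uniformly by surface area. A minimal numpy sketch of that idea, assuming the mesh has already been parsed from OFF into a (V, 3) vertex array and an (F, 3) face-index array:

```python
import numpy as np

def sample_points(vertices, faces, n):
    """Area-weighted uniform sampling of n points from a triangle mesh.
    vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    # Triangle areas from the cross product of two edge vectors
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    idx = np.random.choice(len(faces), size=n, p=areas / areas.sum())
    # Random barycentric coordinates, folded back inside the triangle
    u, v = np.random.rand(n, 1), np.random.rand(n, 1)
    flip = (u + v) > 1
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]
    return v0[idx] + u * (v1[idx] - v0[idx]) + v * (v2[idx] - v0[idx])

# a single unit right triangle in the z=0 plane
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=float)
tris = np.array([[0, 1, 2]])
pts = sample_points(verts, tris, 1024)
print(pts.shape)  # (1024, 3)
```

The fold step (u + v > 1) is what keeps samples inside each triangle; without it, points would land in the parallelogram spanned by the two edges.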

The Stanford 3D Indoor Scene Dataset (S3DIS), introduced by [Armeni2012], contains 3D point clouds of 6 large-scale indoor areas. There are multiple rooms (e.g. office, lounge, hallway) per area. See [ProjectPage_S3DIS] for a visualization.

The KITTI dataset, introduced by [Geiger2012], contains sequences of 3D point clouds (i.e. video) of LiDAR sensor data from the perspective of a driving vehicle. The SemanticKITTI dataset, introduced by [Behley2019] and [Behley2021], provides semantic annotations for all 22 sequences of the KITTI odometry task [Odometry_KITTI]. See [ProjectPage_SemanticKITTI] for a visualization. Note that the test set is unlabelled and must be evaluated on the server, as mentioned at [ProjectPageTasks_SemanticKITTI].
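The on-disk formats are simple: KITTI LiDAR scans are raw float32 binaries (x, y, z, reflectance per point), and SemanticKITTI labels are one uint32 per point with the class in the lower 16 bits. A small sketch (the demo file name is an assumption):

```python
import numpy as np

def load_kitti_scan(path):
    """KITTI LiDAR scans are raw float32 binaries with 4 values per
    point: x, y, z and reflectance."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)

def split_semantickitti_label(raw):
    """SemanticKITTI stores one uint32 per point: the lower 16 bits are
    the semantic class, the upper 16 bits the instance id."""
    return raw & 0xFFFF, raw >> 16

# tiny synthetic scan (2 points) just to demonstrate the layout
demo = np.array([[1.0, 2.0, 3.0, 0.5],
                 [4.0, 5.0, 6.0, 0.9]], dtype=np.float32)
demo.tofile("demo_scan.bin")
scan = load_kitti_scan("demo_scan.bin")
print(scan.shape)  # (2, 4)
```

Splitting the label word matters for the server evaluation mentioned above: predictions are scored on the semantic part, while the instance bits support panoptic tasks.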

Quantitative (PSNR/SSIM) comparison. We compare our boosted models to representative state-of-the-art methods on the Vimeo90K, DAVIS and SNU-FILM benchmarks. Both optimisation approaches exhibit a substantial improvement in performance. Note that FLAVR and VFIT take multiple frames as input, yet our boosted models still outperform them. RED: best performance, BLUE: second-best performance.

Qualitative comparison against state-of-the-art VFI algorithms. We show visualizations on the Vimeo90K, SNU-FILM and DAVIS benchmarks. The patches for close comparison are marked in red in the original images. Our boosted models generate higher-quality results with clearer structures and fewer distortions.

Quantitative (PSNR/SSIM) comparison of adaptation strategies. The experiments on the Vimeo90K dataset show that cycle-consistency adaptation steadily boosts VFI models by fully leveraging inter-frame consistency to learn the motion characteristics of the test sequence.

Ablation study on end-to-end versus plug-in adapter adaptation. Models boosted by our proposed plug-in adapter require minimal finetuning parameters for adaptation, resulting in a twofold improvement in adaptation efficiency while maintaining comparable inference speed and performance.

ModelScope without the watermark, trained at 320x320 from the original weights, with no skipped frames for less flicker. This updated version fixes the stretching issues present in v1, but produces different results overall. The model was trained on a subset of the Vimeo90K dataset plus a selection of music videos.

Videos are sequences of frames displayed continuously within a time frame, which creates the illusion of motion. FPS, the number of frames per second, is crucial in determining the smoothness of motion and scene changes in a video. To improve the appearance of videos, we can use a technique called Frame Rate Enhancement: an approach that inserts generated frames between pairs of existing frames using Generative Adversarial Networks. There are traditional techniques based on Convolutional Neural Networks and optical flow, but they create unwanted artifacts such as blurring or ghosting that can reduce video quality. Hence, we used the ability of GANs to generate high-quality intermediate frames to increase the frame rate. We trained the model on two datasets, UCF-101 and Vimeo90K, which enabled it to learn temporal dependencies and generate realistic frames. Our research proposes a solution to enhance the frame rate of videos with applicability in various scenarios, from capturing everyday moments to video surveillance, film production and sports streaming. We evaluated the model using two metrics: Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM).
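Of the two metrics, PSNR is simple enough to compute directly; a minimal numpy sketch (SSIM needs windowed statistics and is usually taken from a library such as scikit-image):

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images of equal shape."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# illustrative data: a flat gray image plus small uniform noise
ref = np.full((64, 64, 3), 128, dtype=np.uint8)
noise = np.random.randint(-5, 6, ref.shape)
noisy = np.clip(ref.astype(np.int64) + noise, 0, 255).astype(np.uint8)
print(psnr(ref, ref))    # inf
print(psnr(ref, noisy))  # roughly 38 dB for this noise level
```

Higher PSNR means the interpolated frame is closer to the ground-truth frame; interpolation papers typically report it averaged over all test triplets.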

This dataset consists of 89,800 video clips downloaded from vimeo.com, covering a large variety of scenes and actions. It is designed for the following four video processing tasks: temporal frame interpolation, video denoising, video deblocking, and video super-resolution.

Video super-resolution originates from image super-resolution, which aims to recover high-resolution (HR) images from one or more low-resolution (LR) images. The difference is that a video is composed of multiple frames, so video super-resolution usually exploits inter-frame information for restoration. Here we provide the video super-resolution models EDVR, BasicVSR, IconVSR, BasicVSR++, and PP-MSVSR.

PP-MSVSR is a multi-stage VSR deep architecture with a local fusion module, an auxiliary loss and a refined align module that refine the enhanced result progressively. Specifically, to strengthen the fusion of features across frames during feature propagation, a local fusion module is designed in stage 1 to perform local feature fusion before propagation. Moreover, an auxiliary loss is introduced in stage 2 to make the features obtained by the propagation module retain more information correlated with the HR space, and a refined align module is introduced in stage 3 to make full use of the feature information from the previous stage. Extensive experiments substantiate that PP-MSVSR achieves promising performance on the Vid4 dataset, reaching a PSNR of 28.13 with only 1.45M parameters.

EDVR won the championship and outperformed the second place by a large margin in all four tracks of the NTIRE19 video restoration and enhancement challenges. The main difficulties of video super-resolution arise from two aspects: (1) how to align multiple frames given large motions, and (2) how to effectively fuse different frames with diverse motion and blur. First, to handle large motions, EDVR devises a Pyramid, Cascading and Deformable (PCD) alignment module, in which frame alignment is done at the feature level using deformable convolutions in a coarse-to-fine manner. Second, EDVR proposes a Temporal and Spatial Attention (TSA) fusion module, in which attention is applied both temporally and spatially to emphasize important features for subsequent restoration.

BasicVSR reconsiders the most essential components of VSR, guided by four basic functionalities: Propagation, Alignment, Aggregation, and Upsampling. By reusing existing components with minimal redesign, the succinct BasicVSR pipeline achieves appealing improvements in speed and restoration quality over many state-of-the-art algorithms. By adding an information-refill mechanism and a coupled propagation scheme to facilitate information aggregation, BasicVSR can be extended to IconVSR; both can serve as strong baselines for future VSR approaches.

BasicVSR++ redesigns BasicVSR by proposing second-order grid propagation and flow-guided deformable alignment. By empowering the recurrent framework with enhanced propagation and alignment, BasicVSR++ exploits spatiotemporal information across misaligned video frames more effectively. The new components improve performance under a similar computational constraint: in particular, BasicVSR++ surpasses BasicVSR by 0.82 dB in PSNR with a similar number of parameters. In NTIRE 2021, BasicVSR++ obtained three championships and one runner-up in the Video Super-Resolution and Compressed Video Enhancement Challenges.

Here are four commonly used video super-resolution datasets: REDS, Vimeo90K, Vid4, and UDM10. REDS and Vimeo90K include both training and test sets; Vid4 and UDM10 are test sets only. Download and decompress the required dataset and place it under PaddleGAN/data.

REDS (download) is a newly proposed high-quality (720p) video dataset from the NTIRE19 competition. REDS consists of 240 training clips, 30 validation clips and 30 testing clips (each with 100 consecutive frames). Since the test ground truth is not available, we select four representative clips ('000', '011', '015' and '020', with diverse scenes and motions) as our test set, denoted REDS4. The remaining training and validation clips are re-grouped as our training dataset (a total of 266 clips).

Vimeo90K (download) was designed by Tianfan Xue et al. for the following four video processing tasks: temporal frame interpolation, video denoising, video deblocking, and video super-resolution. Vimeo90K is a large-scale, high-quality video dataset consisting of 89,800 video clips downloaded from vimeo.com, covering a large variety of scenes and actions.
