Most existing video face super-resolution (VFSR) methods are trained and evaluated on VoxCeleb1, which is designed for speaker identification, and the frames in this dataset are of low quality. As a consequence, VFSR models trained on this dataset cannot produce visually pleasing results. In this paper, we develop an automatic and scalable pipeline to collect a high-quality video face dataset (VFHQ), which contains over 16,000 high-fidelity clips of diverse interview scenarios. To verify the necessity of VFHQ, we further conduct experiments and demonstrate that VFSR models trained on VFHQ can generate results with sharper edges and finer textures than those trained on VoxCeleb1. In addition, we show that temporal information plays a pivotal role in eliminating video consistency issues and in further improving visual quality. Based on VFHQ, we also conduct a benchmarking study of several state-of-the-art algorithms under bicubic and blind settings.
The clips in VFHQ are high-quality.
The scenarios in VFHQ are diverse.
As shown in (a), VFHQ includes persons from more than 20 distinct countries. In (b), we observe that the proportions of men and women are roughly equal.
Figure (c) shows that the clip-resolution distribution of VFHQ differs from that of VoxCeleb1 and that VFHQ clips have much higher resolution. The number above each bar is the number of clips. Note that we use the length of the shortest side as the clip resolution. Figure (d) shows quantitatively that the quality of VFHQ is higher than that of VoxCeleb1.
We provide a processing script that extracts high-resolution faces from the meta info. We also provide the processed VFHQ dataset and a resized 512x512 version; an illustrative resize sketch is given after the table below. Note that the usage of VFHQ must comply with the agreement mentioned in the next section.
Name | Size | Clips | Links | Description |
---|---|---|---|---|
vfhq-dataset | 4.2 TB | | Main folder | |
├ meta_info | 170 MB | 15,381 | Baidu Netdisk | Metadata including video id, face landmarks, etc. |
├ VFHQ_zips | 2.8 TB | 15,204 | Baidu Netdisk | Zips of VFHQ (w/o the resize operation). |
├ VFHQ-512 | 1.2 TB | 15,381 | Baidu Netdisk | Resized 512x512 version of VFHQ. |
├ VFHQ-Test | 2.37 GB | 100 | Baidu Netdisk | Test dataset adopted in the paper. |
└ resize.py | | | Baidu Netdisk | Resize script. |
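As a simple illustration of the resize step, below is a minimal sketch of downscaling extracted face frames to 512x512. The directory layout, PNG frame format, and Lanczos interpolation are assumptions for illustration only; this is not the released resize.py.

```python
# Minimal sketch of resizing extracted face frames to 512x512.
# Assumptions: frames are stored as individual PNG files under clip_dir;
# this is NOT the released resize.py, only an illustration.
import glob
import os

import cv2


def resize_clip(clip_dir: str, out_dir: str, size: int = 512) -> None:
    os.makedirs(out_dir, exist_ok=True)
    for frame_path in sorted(glob.glob(os.path.join(clip_dir, "*.png"))):
        img = cv2.imread(frame_path, cv2.IMREAD_COLOR)
        # Lanczos interpolation keeps edges reasonably sharp when downscaling face crops.
        resized = cv2.resize(img, (size, size), interpolation=cv2.INTER_LANCZOS4)
        cv2.imwrite(os.path.join(out_dir, os.path.basename(frame_path)), resized)


if __name__ == "__main__":
    # Hypothetical clip folder names, shown only to illustrate the call.
    resize_clip("VFHQ/Clip_000001", "VFHQ-512/Clip_000001")
```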
We release the trained models used in our benchmarking experiments.
For the bicubic setting (scaling factors X4 and X8), we release the pre-trained weights of RRDB, ESRGAN, EDVR (frame=5), EDVR+GAN (frame=5), BasicVSR (frame=7), and BasicVSR-GAN (frame=7).
For the blind setting (scaling factor X4), we release the pre-trained weights of EDVR (frame=5), EDVR+GAN (frame=5), BasicVSR (frame=7), and BasicVSR-GAN (frame=7).
All of the models are trained with the BasicSR framework, and the detailed training settings also follow BasicSR; a hedged loading example is given after the table below.
Name | Links | Description |
---|---|---|
pretrained_models | Main folder | |
├ Cubic-Setting-X4 | Baidu Netdisk | Pre-trained weights of models for the bicubic setting with scale X4. |
├ Cubic-Setting-X8 | Baidu Netdisk | Pre-trained weights of models for the bicubic setting with scale X8. |
└ Blind-Setting-X4 | Baidu Netdisk | Pre-trained weights of models for the blind setting with scale X4. |
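As an illustration of how these checkpoints could be used, below is a minimal sketch of loading an RRDB checkpoint with the BasicSR RRDBNet architecture in PyTorch. The checkpoint filename and the `params` key are assumptions based on common BasicSR conventions, not the exact contents of the released files.

```python
# Hedged sketch of loading one of the released bicubic-setting X4 checkpoints.
# The filename below is a placeholder; adjust it to the downloaded weight file.
import torch
from basicsr.archs.rrdbnet_arch import RRDBNet

# X4 is the default upscale factor for this architecture.
model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32)

ckpt = torch.load("Cubic-Setting-X4/RRDB_X4.pth", map_location="cpu")
# BasicSR checkpoints usually store weights under 'params' (or 'params_ema').
model.load_state_dict(ckpt.get("params", ckpt), strict=True)
model.eval()
```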
For the blind setting, we provide the dataset degradation pipeline used in the training phase. We also provide a YAML file that contains the detailed ranges of the different degradation types.
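For illustration, below is a minimal sketch of a typical blind degradation pipeline (blur, downsampling, noise, JPEG compression). The degradation order and the parameter ranges are placeholders and do not reproduce the values in the released YAML file.

```python
# Hedged sketch of a blind degradation pipeline: blur -> downsample -> noise -> JPEG.
# The ranges below are illustrative placeholders, not the released configuration.
import random

import cv2
import numpy as np


def degrade(img: np.ndarray, scale: int = 4) -> np.ndarray:
    """img: HxWx3 uint8 HQ face frame; returns a degraded LQ frame."""
    # 1. Isotropic Gaussian blur with a randomly sampled sigma.
    sigma = random.uniform(0.2, 3.0)
    img = cv2.GaussianBlur(img, (21, 21), sigmaX=sigma)
    # 2. Downsample by the scaling factor.
    h, w = img.shape[:2]
    img = cv2.resize(img, (w // scale, h // scale), interpolation=cv2.INTER_LINEAR)
    # 3. Additive Gaussian noise.
    noise_level = random.uniform(0, 10)
    img = np.clip(img + np.random.normal(0, noise_level, img.shape), 0, 255).astype(np.uint8)
    # 4. JPEG compression with a random quality factor.
    quality = random.randint(50, 95)
    _, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```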
If you find this helpful, please cite our work:
@InProceedings{xie2022vfhq,
author = {Liangbin Xie and Xintao Wang and Honglun Zhang and Chao Dong and Ying Shan},
title = {VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution},
    booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
year = {2022}
}