VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution

1 Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 ARC Lab, Tencent PCG



Most existing video face super-resolution (VFSR) methods are trained and evaluated on VoxCeleb1, which was designed specifically for speaker identification and whose frames are of low quality. As a consequence, VFSR models trained on this dataset cannot produce visually pleasing results. In this paper, we develop an automatic and scalable pipeline to collect a high-quality video face dataset (VFHQ), which contains over 16,000 high-fidelity clips of diverse interview scenarios. To verify the necessity of VFHQ, we further conduct experiments and demonstrate that VFSR models trained on VFHQ generate results with sharper edges and finer textures than those trained on VoxCeleb1. In addition, we show that temporal information plays a pivotal role in eliminating video consistency issues and in further improving visual performance. Based on VFHQ, we analyze a benchmarking study of several state-of-the-art algorithms under bicubic and blind settings.

Characteristics — High Quality and Diverse

The clips in VFHQ are high-quality.

The scenarios in VFHQ are diverse.



As shown in (a), VFHQ includes people from more than 20 distinct countries. In (b), the proportion of men and women is roughly the same.


Figure (c) shows that the distribution of clip resolutions in VFHQ differs from that of VoxCeleb1 and that VFHQ clips have much higher resolution. The number above each bar is the number of clips. Note that we use the length of the shortest side as the clip resolution. Figure (d) shows quantitatively that the quality of VFHQ is higher than that of VoxCeleb1.
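The shortest-side convention is easy to reproduce when comparing VFHQ against other datasets. The following is a minimal sketch, assuming each clip's width and height in pixels are already known; the function name `clip_resolution` and the sample sizes are illustrative and are not part of the VFHQ tooling.

```python
def clip_resolution(width: int, height: int) -> int:
    """Return the clip resolution as the length of the shortest side, in pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return min(width, height)


# Hypothetical clip sizes (width, height), for illustration only.
sizes = [(1280, 720), (600, 800), (512, 512)]
resolutions = [clip_resolution(w, h) for w, h in sizes]
print(resolutions)  # [720, 600, 512]
```

Under this convention a portrait clip and a landscape clip with the same shortest side fall into the same resolution bin.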


We provide a processing script that extracts high-resolution faces using the meta info. We also provide the processed VFHQ dataset and a resized 512x512 version. Note that use of VFHQ must comply with the agreement described below.

Dataset Structure

Note: Due to the transfer instability of large files, there may be a few empty folders. All four download links are valid.

Name          Size     Clips    Links           Description
vfhq-dataset  4.2 TB   -        -               Main folder
meta_info     170 MB   15,381   Baidu Netdisk   Metadata including video ID, face landmarks, etc.
VFHQ1         1.4 TB   7,543    Baidu Netdisk   Part 1 of VFHQ, cropped from the YouTube videos.
VFHQ2         1.6 TB   8,228    Baidu Netdisk   Part 2 of VFHQ, cropped from the YouTube videos.
VFHQ-512      1.2 TB   15,381   Baidu Netdisk   Resized 512x512 version of VFHQ.
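
Because a transfer may leave a few empty folders, it can help to scan a downloaded part before training. The following is a minimal sketch using only the Python standard library; the helper name `find_empty_dirs` and the example path are assumptions for illustration, not part of the released processing script.

```python
from pathlib import Path


def find_empty_dirs(root: str) -> list[Path]:
    """Return all directories below `root` that contain no files or subfolders."""
    root_path = Path(root)
    if not root_path.is_dir():
        raise NotADirectoryError(root)
    # rglob("*") walks the whole tree; keep only directories with no entries.
    return sorted(
        p for p in root_path.rglob("*")
        if p.is_dir() and not any(p.iterdir())
    )


# Example usage (the path is hypothetical and depends on where you unpacked the data):
# for folder in find_empty_dirs("vfhq-dataset/VFHQ1"):
#     print("empty:", folder)
```

Any folder reported by the scan can then be downloaded again from the corresponding link.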


  • The VFHQ dataset is available to download only for non-commercial research purposes. The copyright remains with the original owners of the videos. A complete version of the license can be found here; we also refer to the license of VoxCeleb.
  • All videos in the VFHQ dataset are obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos.
  • You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the videos and any portion of derived data. You agree not to further copy, publish or distribute any portion of the VFHQ dataset.
  • The distribution of identities in the VFHQ dataset may not be representative of the global human population. Please be careful of unintended societal, gender, racial, and other biases when training or deploying models trained on this data.


If you find this helpful, please cite our work:

@InProceedings{xie2022vfhq,
      author = {Liangbin Xie and Xintao Wang and Honglun Zhang and Chao Dong and Ying Shan},
      title = {VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution},
      booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
      year = {2022}
}