VFHQ: A High-Quality Dataset and Benchmark

for Video Face Super Resolution

Liangbin Xie 1,2,3      Xintao Wang 3      Honglun Zhang 3       Chao Dong 1*       Ying Shan 3
1 Shenzhen Key Lab of Computer Vision and Pattern Recognition,
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 ARC Lab, Tencent PCG

Visual comparison between BasicVSR-GAN models trained on VoxCeleb1 and the
VFHQ dataset, respectively. The high-quality VFHQ dataset helps recover
more visually pleasing results with finer details.

Visual comparisons between the two datasets:
VoxCeleb1 (top) and VFHQ (bottom). Frames
are randomly selected from each dataset.
VFHQ frames have much higher quality.


Most existing video face super-resolution (VFSR) methods are trained and evaluated on VoxCeleb1, which was designed specifically for speaker identification, and the frames in this dataset are of low quality. As a consequence, VFSR models trained on this dataset cannot produce visually pleasing results. In this paper, we develop an automatic and scalable pipeline to collect a high-quality video face dataset (VFHQ), which contains over $16,000$ high-fidelity clips of diverse interview scenarios. To verify the necessity of VFHQ, we further conduct experiments and demonstrate that VFSR models trained on our VFHQ dataset can generate results with sharper edges and finer textures than those trained on VoxCeleb1. In addition, we show that temporal information plays a pivotal role in eliminating video consistency issues and further improving visual quality. Based on VFHQ, we conduct a benchmarking study of several state-of-the-art algorithms under bicubic and blind settings.
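The "bicubic setting" mentioned above refers to the standard practice of synthesizing low-resolution inputs by bicubic downsampling of the high-resolution frames. Below is a minimal sketch of that degradation using Pillow; the ×4 scale factor is an assumption for illustration, not a value taken from the text.

```python
from PIL import Image


def bicubic_lr(frame: Image.Image, scale: int = 4) -> Image.Image:
    """Synthesize a low-resolution frame by bicubic downsampling.

    A minimal sketch of the bicubic degradation setting; the default
    scale of 4 is an assumption, not specified in the text above.
    """
    w, h = frame.size
    return frame.resize((w // scale, h // scale), Image.BICUBIC)


# Example: a 512x512 HR frame becomes a 128x128 LR frame under x4 downsampling.
hr = Image.new("RGB", (512, 512))
lr = bicubic_lr(hr)
print(lr.size)  # (128, 128)
```

In the blind setting, by contrast, the degradation applied to the high-resolution frames is unknown to the model, so more complex degradations than this single resize are used.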


The VFHQ dataset is available for download for research purposes only, under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the videos. A complete version of the license can be found here; we also refer to the license of VoxCeleb.

Caution: We note that the distribution of identities in the VFHQ dataset may not be representative of the global human population. Please be mindful of unintended societal, gender, racial, and other biases when training or deploying models on this data.

Dataset Description

Distribution of the properties of the celebrities in our VFHQ dataset. As shown in (a), VFHQ includes persons from more than 20 distinct countries. In (b), we see that the proportions of men and women are roughly equal. (c) demonstrates that the distribution of clip resolutions in VFHQ differs from VoxCeleb1, and that the resolution of VFHQ is much higher. The number above each bar is the number of clips; note that we use the length of the shortest side as the clip resolution. (d) shows quantitatively that the quality of VFHQ is higher than that of VoxCeleb1.
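The clip-resolution convention above (the length of the shortest side) can be expressed directly; this tiny helper is illustrative, not code from the dataset pipeline.

```python
def clip_resolution(width: int, height: int) -> int:
    """Clip resolution as defined in the text: the length of the shortest side.

    E.g. a 1920x1080 clip has resolution 1080, and a 720x1280
    portrait clip has resolution 720.
    """
    return min(width, height)


print(clip_resolution(1920, 1080))  # 1080
print(clip_resolution(720, 1280))   # 720
```

This convention makes resolutions comparable across landscape and portrait clips, since the shortest side bounds the usable face detail regardless of orientation.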


@InProceedings{xie2022vfhq,
          author = {Liangbin Xie and Xintao Wang and Honglun Zhang and Chao Dong and Ying Shan},
          title = {VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution},
          booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
          year = {2022}
}


If you have any questions, please contact Liangbin Xie at lb.xie@siat.ac.cn.