Responsive Listening Head Generation
A Benchmark Dataset and Baseline

1Harbin Institute of Technology 2JD Explore Academy, Beijing, China

Abstract

We present a new listening head generation benchmark for synthesizing responsive feedback of a listener (e.g., a nod or a smile) during a face-to-face conversation. As the indispensable complement to talking head generation, listening head generation has seldom been studied in the literature. Automatically synthesizing listening behavior that actively responds to a talking head is critical to applications such as digital humans, virtual agents, and social robots. In this work, we propose a novel dataset, "ViCo", highlighting listening head generation during a face-to-face conversation. A total of 92 identities (67 speakers and 76 listeners) are involved in ViCo, featuring 483 clips in a paired "speaking-listening" pattern, where listeners show three listening styles based on their attitudes: positive, neutral, and negative. Different from traditional speech-to-gesture or talking head generation, listening head generation takes as input both the audio and visual signals from the speaker and produces non-verbal feedback (e.g., head motions, facial expressions) in real time. Our dataset supports a wide range of applications such as human-to-human interaction, video-to-video translation, cross-modal understanding, and generation. To encourage further research, we also release a listening head generation baseline conditioned on different listening attitudes. Code & ViCo dataset: https://project.mhzhou.com/vico.


Dataset Video Samples

[Video samples of listeners showing positive, negative, and neutral attitudes]


Dataset Details

ViCo is the first dataset for listener modeling: over 95 minutes of footage, 92 identities, 483 clips, and 3 attitudes.

Our dataset can be accessed through OneDrive. In the Conversational Head Generation Challenge, we use a subset of ViCo together with newly collected talking head videos.


Guidelines

The dataset consists of three parts:
  • videos/*.mp4: all videos, without an audio track
  • audios/*.wav: all audio files
  • *.csv: metadata about all videos/audios, with the following fields (see the loading sketch after this list):

    Name         Type  Description
    attitude     str   Attitude of the video; possible values: positive, negative, neutral
    audio        str   Audio filename; the file is located at /audios/{audio}.wav
    listener     str   Listener video filename; the file is located at /videos/{listener}.mp4
    speaker      str   Speaker video filename; the file is located at /videos/{speaker}.mp4
    listener_id  int   ID of the listener; values range over [0, 91]
    speaker_id   int   ID of the speaker; values range over [0, 91]
    data_split   str   Data split of the record; possible values: train, test, ood
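A minimal loading sketch is shown below. The metadata filename ("meta.csv"), the dataset root, and the filtering choices are illustrative assumptions; adjust them to whatever your download actually contains.

```python
# Sketch: read the ViCo metadata CSV and resolve the paired media paths.
# Assumption: the CSV shipped with the dataset is named "meta.csv" here.
from pathlib import Path

import pandas as pd

DATA_ROOT = Path("ViCo")                      # hypothetical dataset root
meta = pd.read_csv(DATA_ROOT / "meta.csv")    # columns as listed in the table above

# Example: keep only training-split clips with a positive listener attitude.
train_pos = meta[(meta["data_split"] == "train") & (meta["attitude"] == "positive")]

for _, row in train_pos.iterrows():
    speaker_video  = DATA_ROOT / "videos" / f"{row['speaker']}.mp4"
    listener_video = DATA_ROOT / "videos" / f"{row['listener']}.mp4"
    audio          = DATA_ROOT / "audios" / f"{row['audio']}.wav"
    # ... load the paired speaker/listener clip and the shared audio track here
```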
We also release a GitHub repo for training and evaluating responsive listening heads.

Generations

Comparison of generations with the ground truth

Generation with different attitudes


Method Details

[Figure: model architecture]
The overall pipeline of our responsive listening head generation baseline. The speaker encoder encodes the speaker's head motion, facial expression, and audio features. Starting from the fused feature of the reference listener image, the listener decoder receives signals from the speaker encoder in temporal order and predicts head motion and facial expression features. These features, together with the reference listener's identity-dependent features, are used to reconstruct 3DMM coefficients, which are then fed to a neural renderer to generate a realistic listening video.
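The sketch below illustrates the encode-then-decode structure described above in PyTorch. All module names, feature dimensions, and the choice of an LSTM decoder are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a speaker-encoder / listener-decoder pipeline (assumed design).
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Fuses per-frame head motion, facial expression, and audio features of the speaker."""
    def __init__(self, motion_dim=6, exp_dim=64, audio_dim=128, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(motion_dim + exp_dim + audio_dim, hidden_dim)

    def forward(self, motion, exp, audio):              # each: (B, T, dim)
        return torch.relu(self.proj(torch.cat([motion, exp, audio], dim=-1)))

class ListenerDecoder(nn.Module):
    """Predicts the listener's head motion and expression in temporal order,
    conditioned on speaker features and an attitude embedding."""
    def __init__(self, hidden_dim=256, motion_dim=6, exp_dim=64, n_attitudes=3):
        super().__init__()
        self.attitude_emb = nn.Embedding(n_attitudes, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim * 2, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, motion_dim + exp_dim)

    def forward(self, speaker_feat, listener_ref_feat, attitude):
        B, T, _ = speaker_feat.shape
        att = self.attitude_emb(attitude).unsqueeze(1).expand(B, T, -1)
        x = torch.cat([speaker_feat, att], dim=-1)
        # Initialise the recurrent state from the fused reference listener image feature.
        h0 = listener_ref_feat.unsqueeze(0)              # (1, B, hidden_dim)
        out, _ = self.rnn(x, (h0, torch.zeros_like(h0)))
        return self.head(out)                            # (B, T, motion_dim + exp_dim)

# The predicted motion/expression coefficients are merged with the reference
# listener's identity-dependent 3DMM coefficients and passed to a neural
# renderer (not shown) to produce the final listening-head video.
```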
For more details, read our paper on arXiv.

Citation

If our dataset / code helps your research, please cite our paper.
@InProceedings{zhou2022responsive,
    title={Responsive Listening Head Generation: A Benchmark Dataset and Baseline},
    author={Zhou, Mohan and Bai, Yalong and Zhang, Wei and Yao, Ting and Zhao, Tiejun and Mei, Tao},
    booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
    year={2022}
}

Ethical Use

The ViCo dataset is released only for research purposes under a restricted license. The responsive listening patterns are identity-independent, which reduces the risk of facial data abuse. The only potential social harm is "fake content"; however, unlike talking head synthesis, responsive listening can hardly harm information fidelity.


Contact

Mohan Zhou, mhzhou99[at]outlook[dot]com