Impersonator++
Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis
Wen Liu, Zhixin Piao, Zhi Tu, Wenhan Luo, Lin Ma, Shenghua Gao

## Abstract

We tackle human image synthesis, including human motion imitation, appearance transfer, and novel view synthesis, within a unified framework: once trained, the model can handle all of these tasks. Existing task-specific methods mainly use 2D keypoints (pose) to estimate the human body structure. However, keypoints only express position information; they cannot characterize the personalized shape of the person or model limb rotations. In this paper, we propose to use a 3D body mesh recovery module to disentangle pose and shape. It can not only model joint locations and rotations but also characterize the personalized body shape. To preserve source information such as texture, style, color, and face identity, we propose an Attentional Liquid Warping GAN with an Attentional Liquid Warping Block (AttLWB) that propagates the source information in both image and feature spaces to the synthesized reference. Specifically, the source features are extracted by a denoising convolutional auto-encoder so that they characterize the source identity well. Furthermore, our proposed method supports more flexible warping from multiple sources. To further improve generalization to unseen source images, one/few-shot adversarial learning is applied. In detail, it first trains a model on an extensive training set. Then, it fine-tunes the model on one or a few unseen images in a self-supervised way to generate high-resolution ($$512 \times 512$$ and $$1024 \times 1024$$) results. We also build a new dataset, the Impersonator (iPER) dataset, for the evaluation of human motion imitation, appearance transfer, and novel view synthesis. Extensive experiments demonstrate the effectiveness of our methods in terms of preserving face identity, shape consistency, and clothes details.

## Framework Overview

The training pipeline of our method. We randomly sample a pair of images from a video, denoting the source and the reference image as $$I_{s_i}$$ and $$I_r$$. (a) A body mesh recovery module estimates the 3D mesh of each image and renders their correspondence maps, $$C_s$$ and $$C_t$$; (b) The flow composition module first calculates the transformation flow $$T$$ from the two correspondence maps and their projected vertices in the image space. It then separates the source image $$I_{s_i}$$ into a foreground image $$I^{ft}_{s_i}$$ and a masked background $$I_{bg}$$. Finally, it warps the source image with the transformation flow $$T$$ and produces a warped image $$I_{syn}$$; (c) In the last GAN module, the generator consists of three streams, which separately generate the background image $$\hat{I}_{bg}$$ by $$G_{BG}$$, reconstruct the source image $$\hat{I}_s$$ by $$G_{SID}$$, and synthesize the target image $$\hat{I}_t$$ under the reference condition by $$G_{TSF}$$. To preserve the details of the source image, we propose the novel LWB and AttLWB, which propagate the source features of $$G_{SID}$$ into $$G_{TSF}$$ at several layers and preserve the source information in terms of texture, style, and color.
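The warping in step (b) can be illustrated with a small bilinear sampler. This is a minimal NumPy sketch, not the project's implementation: the function names `bilinear_sample` and `identity_flow` are hypothetical, and it assumes that the flow stores, for every target pixel, the (column, row) coordinates to read from the source image, with out-of-range coordinates clamped to the border.

```python
import numpy as np

def bilinear_sample(image, flow):
    """Warp `image` (H, W, C) with `flow` (H_out, W_out, 2).

    Assumption: flow[y, x] = (u, v) gives the source column u and row v
    to sample for target pixel (x, y). Coordinates are clamped to the image.
    """
    h, w = image.shape[:2]
    u = np.clip(flow[..., 0], 0, w - 1)
    v = np.clip(flow[..., 1], 0, h - 1)
    # Integer corners surrounding each sampling location.
    x0 = np.floor(u).astype(int)
    y0 = np.floor(v).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional offsets, broadcast over the channel axis.
    wx = (u - x0)[..., None]
    wy = (v - y0)[..., None]
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

def identity_flow(h, w):
    """Flow that maps every target pixel to the same source pixel."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.stack([xs, ys], axis=-1).astype(float)

# Toy example: a 3x4 single-channel "source image".
source = np.arange(12, dtype=float).reshape(3, 4, 1)
unchanged = bilinear_sample(source, identity_flow(3, 4))

# Sample half a pixel to the right of every target location.
shifted_flow = identity_flow(3, 4) + np.array([0.5, 0.0])
shifted = bilinear_sample(source, shifted_flow)
```

With the identity flow the output equals the input; with the half-pixel shift each output value is the average of two horizontal neighbours, except at the right border where the coordinate is clamped.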

## LWB and AttLWB

Illustration of our LWB and AttLWB. They share the structure illustrated in (b) but use either AddWB (illustrated in (a)) or AttWB (illustrated in (c)). (a) is the structure of AddWB. Through AddWB, $$\widehat{X}_t^{l}$$ is obtained by aggregating the warped source features and the features from $$G_{TSF}$$. (b) is the shared structure of the (Attentional) Liquid Warping Block. $$\{X^{l}_{s_1}, X^{l}_{s_2}, ..., X^{l}_{s_n}\}$$ are the feature maps of the different sources extracted by $$G_{SID}$$ at the $$l^{th}$$ layer. $$\{T_{s_1\to t}, T_{s_2\to t},...,T_{s_n\to t}\}$$ are the transformation flows from the different sources to the target. $$X^{l}_t$$ is the feature map of $$G_{TSF}$$ at the $$l^{th}$$ layer. (c) is the architecture of AttWB. Through AttWB, the final output feature map $$\widehat{X}_t^{l}$$ is obtained with SPADE by denormalizing the feature map from $$G_{TSF}$$ with a weighted combination of the source features, each warped by a bilinear sampler (BS) with respect to its corresponding flow $$T_{s_i\to t}$$.
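The two aggregation schemes can be contrasted in a few lines of NumPy. This is a simplified sketch under stated assumptions, not the released implementation: the source features are taken as already warped to the target, the attention logits are a plain scaled dot product between target and source features (a stand-in for the learned attention), and `w_gamma` and `w_beta` are hypothetical 1x1-convolution weights standing in for the SPADE modulation layers.

```python
import numpy as np

def add_wb(warped_sources, x_t):
    """AddWB-style aggregation: add all warped source features to X_t.

    warped_sources: (n, C, H, W), x_t: (C, H, W).
    """
    return x_t + warped_sources.sum(axis=0)

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def att_wb(warped_sources, x_t, w_gamma, w_beta, eps=1e-5):
    """AttWB-style aggregation (simplified, see assumptions in the text).

    Returns the modulated target features and the per-pixel attention
    weights over the n sources.
    """
    c = x_t.shape[0]
    # Assumed attention: per-pixel scaled dot product with each source.
    logits = np.einsum("chw,nchw->nhw", x_t, warped_sources) / np.sqrt(c)
    attn = softmax(logits, axis=0)                      # (n, H, W)
    # Weighted combination of the warped source features.
    agg = np.einsum("nhw,nchw->chw", attn, warped_sources)
    # SPADE-like step: normalize X_t per channel, then denormalize with
    # a spatially varying scale and shift predicted from `agg`.
    mean = x_t.mean(axis=(1, 2), keepdims=True)
    std = x_t.std(axis=(1, 2), keepdims=True)
    x_norm = (x_t - mean) / (std + eps)
    gamma = np.einsum("oc,chw->ohw", w_gamma, agg)      # hypothetical 1x1 conv
    beta = np.einsum("oc,chw->ohw", w_beta, agg)        # hypothetical 1x1 conv
    return x_norm * (1.0 + gamma) + beta, attn

rng = np.random.default_rng(0)
warped = rng.normal(size=(2, 4, 3, 3))   # two sources, 4 channels, 3x3 maps
x_t = rng.normal(size=(4, 3, 3))
w_gamma = 0.1 * rng.normal(size=(4, 4))
w_beta = 0.1 * rng.normal(size=(4, 4))

out_add = add_wb(warped, x_t)
out_att, attn = att_wb(warped, x_t, w_gamma, w_beta)
```

The additive variant treats every source equally, while the attentional variant lets each target pixel weight the sources differently before modulating the target features.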

## Network Architectures

Details of the network architectures of our Attentional Liquid Warping GAN, including the generator and the discriminator. Here $$s$$ denotes the stride of the convolution and transposed convolution layers.
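To make the role of the stride concrete, the helper below computes the spatial output size of a convolution and of a transposed convolution using the standard formulas (dilation 1). The kernel sizes and paddings in the example are illustrative assumptions, not values taken from the architecture figure.

```python
def conv_out_size(n, kernel, stride, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def conv_transpose_out_size(n, kernel, stride, padding=0, output_padding=0):
    """Output size of a transposed convolution along one spatial dimension."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

# Assumed 3x3 kernels with padding 1: stride 2 halves the resolution,
# and a stride-2 transposed convolution restores it.
down = conv_out_size(512, kernel=3, stride=2, padding=1)
up = conv_transpose_out_size(down, kernel=3, stride=2, padding=1, output_padding=1)
same = conv_out_size(512, kernel=3, stride=1, padding=1)
```

Under these assumptions a stride of 2 maps 512 to 256 and back to 512, while a stride of 1 keeps the resolution unchanged.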

## Citation

If you find this useful, please cite our work as follows:
@misc{liu2020liquid,
title={Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis},
author={Wen Liu and Zhixin Piao and Zhi Tu and Wenhan Luo and Lin Ma and Shenghua Gao},
year={2020},
eprint={2011.09055},
archivePrefix={arXiv},
primaryClass={cs.CV}
}

@InProceedings{lwb2019,
title={Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis},
author={Wen Liu and Zhixin Piao and Min Jie and Wenhan Luo and Lin Ma and Shenghua Gao},
booktitle={The IEEE International Conference on Computer Vision (ICCV)},
year={2019}
}