We propose GAN-Supervised Learning, a framework for learning discriminative models and their GAN-generated training data jointly end-to-end. We apply our framework to the dense visual alignment problem. Inspired by the classic Congealing method, our GANgealing algorithm trains a Spatial Transformer to warp random samples from a GAN trained on unaligned data to a common, jointly-learned target mode. The target mode is updated to make the Spatial Transformer's job "as easy as possible."
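For readers who prefer code, here is a minimal, self-contained sketch of the GAN-Supervised Learning loop described above. Everything in it is a stand-in assumption rather than our released implementation: a toy generator and affine-only Spatial Transformer replace the pretrained StyleGAN2 and the paper's similarity-plus-flow warp, a plain latent interpolation stands in for StyleGAN's layer-wise mixing of the learned target latent, and an MSE loss stands in for the perceptual loss. The point is the structure of the loop: the Spatial Transformer and the target mode are optimized jointly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, image_size = 64, 32

# Stand-in "generator": maps a latent code to an image. It is frozen, as in the
# paper, where a pretrained StyleGAN2 plays this role.
generator = nn.Sequential(
    nn.Linear(latent_dim, 3 * image_size * image_size),
    nn.Unflatten(1, (3, image_size, image_size)),
)
for p in generator.parameters():
    p.requires_grad_(False)

# Stand-in Spatial Transformer: predicts an affine warp from the input image and
# applies it with grid_sample (the real STN also predicts a residual flow field).
class SpatialTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * image_size ** 2, 6))
        # Initialize to the identity transform.
        self.net[1].weight.data.zero_()
        self.net[1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.net(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.shape, align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

stn = SpatialTransformer()
c = nn.Parameter(torch.zeros(1, latent_dim))  # jointly-learned target mode latent
opt = torch.optim.Adam(list(stn.parameters()) + [c], lr=1e-3)

for step in range(100):
    w = torch.randn(8, latent_dim)      # random latents -> unaligned GAN samples
    x = generator(w)                    # the GAN-generated "training data"
    # The target keeps each sample's appearance but pulls its pose toward the
    # mode defined by c. (The paper mixes c into a subset of StyleGAN's layers;
    # plain latent interpolation stands in for that here.)
    target = generator(0.5 * c + 0.5 * w)
    loss = F.mse_loss(stn(x), target)   # the paper uses a perceptual (LPIPS) loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```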
GANgealing significantly outperforms past self-supervised correspondence algorithms and performs on par with (and sometimes exceeds) state-of-the-art supervised correspondence algorithms on several datasets, all without using any correspondence supervision or data augmentation and despite being trained exclusively on GAN-generated data. For precise correspondence, we improve upon state-of-the-art supervised methods by as much as 3x.
Our Spatial Transformer is trained exclusively on GAN images and generalizes to real images at test time automatically. Here we show our Spatial Transformer's learned transformations of real LSUN Cat images, which bring them into joint alignment. We can then find dense correspondences between all of the images via the average congealed image.
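A hedged sketch of this test-time procedure (the function name and tensor shapes are our own assumptions, not the released API): the trained Spatial Transformer is run on real images exactly as it was on GAN samples, and averaging the warped outputs yields the average congealed image through which dense correspondence is defined.

```python
import torch

@torch.no_grad()
def congeal_real_images(stn, real_images):
    """Warp real images into the joint alignment and form the average image.

    `stn` is a trained Spatial Transformer (e.g., the stand-in sketched above)
    and `real_images` is a float tensor of shape (N, 3, H, W).
    """
    congealed = stn(real_images)        # same forward pass used on GAN samples
    average = congealed.mean(dim=0)     # the "average congealed image"
    # Because every image is warped into one shared frame, a pixel of the
    # average image indexes the same object part in every congealed image,
    # which is what gives dense correspondence across the whole set.
    return congealed, average
```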
Anything can be propagated from our average congealed images. By dragging a Batman mask onto our average congealed cat image, a user can effortlessly edit a huge number of images automatically.
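The propagation step can be summarized with a short sketch (again with assumed interfaces: the paper inverts the full similarity-plus-flow warp, while here we invert only an affine warp like the stand-in above). An annotation painted once in the congealed frame is pushed onto each original image by inverting that image's congealing warp.

```python
import torch
import torch.nn.functional as F

def invert_affine(theta):
    """Invert a batch of 2x3 affine warp matrices (normalized coordinates)."""
    n = theta.shape[0]
    bottom = theta.new_tensor([0., 0., 1.]).expand(n, 1, 3)
    return torch.linalg.inv(torch.cat([theta, bottom], dim=1))[:, :2, :]

def propagate_annotation(mask, theta, height, width):
    """Warp `mask` (N, C, h, w), defined in the congealed frame, into each
    original image's frame of size (height, width) using the inverse of that
    image's congealing warp `theta` (N, 2, 3). A single user annotation is
    simply repeated N times along the batch dimension.
    """
    inv = invert_affine(theta)
    grid = F.affine_grid(inv, (mask.shape[0], mask.shape[1], height, width),
                         align_corners=False)
    return F.grid_sample(mask, grid, align_corners=False)
```

With the per-image warps from the Spatial Transformer in hand, editing an entire batch of images is a single call like this: there is no per-image optimization or annotation.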
GANgealing produces surprisingly smooth results when applied per-frame to videos, without leveraging any temporal information. This makes our method well-suited for mixed reality applications compared to supervised optical flow algorithms like RAFT, which suffer from compounding errors. And GANgealing doesn't require any per-video annotations to propagate objects: just annotate the average congealed image once and you're good to go for any image or video.
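Because no temporal information is used, the video pipeline is just the image pipeline in a loop. A hedged sketch, reusing the hypothetical propagate_annotation helper from the previous sketch and assuming a predict_theta helper that exposes the per-frame affine warp the Spatial Transformer predicts (neither name is part of the released code):

```python
import torch

@torch.no_grad()
def edit_video(predict_theta, propagate_annotation, frames, mask):
    """Per-frame video editing with no temporal information.

    `predict_theta(frame)` is assumed to return the 1x2x3 congealing warp for a
    single frame; `propagate_annotation` is the hypothetical helper sketched
    above; `frames` is (T, 3, H, W); `mask` is a (1, 4, h, w) RGBA overlay
    painted once on the average congealed image.
    """
    edited = []
    for frame in frames:                                    # purely per-frame
        theta = predict_theta(frame[None])                  # this frame's warp
        h, w = frame.shape[-2], frame.shape[-1]
        overlay = propagate_annotation(mask, theta, h, w)[0]  # (4, H, W)
        rgb, alpha = overlay[:3], overlay[3:4]
        edited.append(frame * (1 - alpha) + rgb * alpha)    # alpha-composite
    return torch.stack(edited)
```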
@inproceedings{peebles2022gansupervised,
  title={GAN-Supervised Dense Visual Alignment},
  author={William Peebles and Jun-Yan Zhu and Richard Zhang and Antonio Torralba and Alexei Efros and Eli Shechtman},
  booktitle={CVPR},
  year={2022}
}
We thank Tim Brooks for his antialiased sampling code; Tete Xiao, Ilija Radosavovic, Taesung Park, Assaf Shocher, Phillip Isola, Angjoo Kanazawa, Shubham Goel, Allan Jabri, Shubham Tulsiani and Dave Epstein for helpful discussions; Rockey Hester for his assistance with testing our model on video. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 2146752. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. Additional funding provided by Berkeley DeepDrive, SAP and Adobe. Thanks to Matt Tancik for letting us use his project page template. Also check out concurrent work from NVIDIA that takes a different angle on using GANs to find correspondences.