IDLab-MEDIA
Deepfake Video Detection Using Generative Convolutional Vision Transformer

We are excited to share our latest paper, “Deepfake Video Detection Using Generative Convolutional Vision Transformer,” which introduces a new model designed to combat the growing challenge of manipulated media.

As deepfake technology advances, its potential for misuse, from spreading misinformation to compromising personal identity, threatens the integrity of digital media. This creates an urgent need for reliable detection tools that can keep pace with newly developed deepfake generation techniques.

Many current deepfake detection models perform well on the kinds of manipulation they were trained on, but they often struggle to generalize. When faced with a new or more advanced deepfake method not seen during training, their accuracy can drop significantly, leaving a critical gap in our defenses.

To address this challenge, our paper introduces the Generative Convolutional Vision Transformer (GenConViT), a hybrid architecture designed for more robust deepfake detection. GenConViT follows a dual strategy: it combines ConvNeXt and Swin Transformer models to extract features that expose subtle visual artifacts, and it uses an Autoencoder and a Variational Autoencoder (VAE) to learn the latent data distribution of real media. By pairing artifact detection with a learned model of what constitutes “normal” data, GenConViT can better identify the inconsistencies that mark a video as fake.

Figure: Images generated by the VAE.
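
To make the dual strategy concrete, here is a minimal sketch of how the outputs of two branches could be combined into a single decision for a video. It is an illustration under stated assumptions, not the GenConViT implementation: the function names (`fuse_branches`, `classify_video`), the averaging of branch probabilities, the per-frame averaging, and the 0.5 threshold are all hypothetical choices, and the two branches are passed in as plain callables that stand in for the real networks.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a preprocessed face crop; a real pipeline
# would pass image tensors here.
Frame = Sequence[float]
# A branch maps a frame to two logits: [real, fake].
Branch = Callable[[Frame], Sequence[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(frame: Frame, artifact_branch: Branch,
                  generative_branch: Branch) -> List[float]:
    """Average the [real, fake] probabilities of the two branches.

    Averaging is an assumption made for this sketch; other fusion
    rules (summing logits, a learned layer) are equally plausible.
    """
    p_artifact = softmax(artifact_branch(frame))
    p_generative = softmax(generative_branch(frame))
    return [(a + g) / 2 for a, g in zip(p_artifact, p_generative)]


def classify_video(frames: Sequence[Frame], artifact_branch: Branch,
                   generative_branch: Branch,
                   threshold: float = 0.5) -> Tuple[str, float]:
    """Average the fused fake probability over frames, then threshold it."""
    if not frames:
        raise ValueError("at least one frame is required")
    fake_probs = [fuse_branches(f, artifact_branch, generative_branch)[1]
                  for f in frames]
    score = sum(fake_probs) / len(fake_probs)
    return ("FAKE" if score >= threshold else "REAL", score)


if __name__ == "__main__":
    # Toy branches with fixed logits, used only to exercise the code path.
    toy_artifact = lambda frame: [0.0, 2.0]
    toy_generative = lambda frame: [0.0, 0.0]
    print(classify_video([[0.1], [0.2]], toy_artifact, toy_generative))
```

In this sketch each branch votes with a probability rather than a hard label, so a confident branch can outweigh an uncertain one. Whether that matches the fusion used in the released code should be checked against the repository.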

Trained and evaluated on five diverse datasets, including large-scale benchmarks such as DFDC and FaceForensics++, GenConViT identifies a wide variety of manipulated videos with high accuracy.

Our study also underscores how difficult it remains to generalize to entirely new deepfake techniques. Our experiments show that performance can decrease when the model is tested on manipulation methods it never encountered during training, which highlights a crucial area for future research in the deepfake detection community.
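
One common way to measure this kind of generalization gap is a leave-one-method-out split, where every fake video of one manipulation method is withheld from training and used only for testing. The sketch below shows the idea. It is not the evaluation protocol from our paper: the `Sample` type, the function names, the method labels in the usage example, and the rule for dividing real videos between train and test are assumptions made for illustration.

```python
from typing import Callable, Iterator, List, NamedTuple, Tuple

REAL = "real"  # label used in this sketch for unmanipulated videos


class Sample(NamedTuple):
    video_id: str
    method: str  # manipulation method name, or REAL


Split = Tuple[str, List[Sample], List[Sample]]


def leave_one_method_out(samples: List[Sample]) -> Iterator[Split]:
    """Yield (held_out_method, train, test) for each manipulation method.

    Fake videos of the held-out method appear only in the test set.
    Real videos are divided by alternating index (an assumption of this
    sketch) so that both sets contain real examples.
    """
    reals = [s for s in samples if s.method == REAL]
    fakes = [s for s in samples if s.method != REAL]
    train_reals, test_reals = reals[::2], reals[1::2]
    for held_out in sorted({s.method for s in fakes}):
        train = train_reals + [s for s in fakes if s.method != held_out]
        test = test_reals + [s for s in fakes if s.method == held_out]
        yield held_out, train, test


def accuracy(predict_is_fake: Callable[[Sample], bool],
             samples: List[Sample]) -> float:
    """Fraction of samples whose predicted label matches the true one."""
    if not samples:
        raise ValueError("cannot score an empty sample list")
    correct = sum(predict_is_fake(s) == (s.method != REAL) for s in samples)
    return correct / len(samples)


if __name__ == "__main__":
    # Illustrative data only; the method labels are placeholders.
    data = [
        Sample("r1", REAL), Sample("r2", REAL),
        Sample("a1", "Deepfakes"), Sample("a2", "Deepfakes"),
        Sample("b1", "FaceSwap"),
    ]
    for held_out, train, test in leave_one_method_out(data):
        print(held_out, len(train), len(test))
```

Comparing the accuracy on each held-out method with the accuracy on methods seen during training gives a direct estimate of how much a detector depends on having seen a manipulation before.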

The development of more generalizable detectors like GenConViT is an important step toward safeguarding digital information, empowering fact-checkers, and preserving trust in online media. To support this goal, we have made the code for GenConViT open-source to encourage further research and collaboration in this vital field.

The paper and code are available on GitHub.com.