#333: Deepfake Video Detection Based on Spatial, Spectral, and Temporal Inconsistencies Using Multimodal Deep Learning


Authenticating digital media has become a pressing necessity for modern society. Since the introduction of Generative Adversarial Networks (GANs), synthetic media has become increasingly difficult to identify. Synthetic videos that contain altered faces and/or voices of a person are known as deepfakes, and they threaten trust and privacy in digital media. Deepfakes can be weaponized to gain political advantage, to slander, and to undermine the reputation of public figures. Despite the imperfections of deepfakes, people struggle to distinguish between authentic and manipulated images and videos. Consequently, it is important to have automated systems that accurately and efficiently classify the authenticity of digital content. Many recent deepfake detection methods operate on single video frames and rely on the spatial information in the image to infer the authenticity of the video. Some promising approaches exploit the temporal inconsistencies of manipulated videos; however, research still focuses primarily on spatial features. We propose a hybrid deep learning approach that couples spatial, spectral, and temporal content in a consistent way to differentiate real and fake videos. We show that the Discrete Cosine Transform (DCT) can improve deepfake detection by capturing spectral features of individual frames. In this work, we build a multimodal network that explores new features to detect deepfake videos, achieving 61.95% accuracy on the Facebook Deepfake Detection Challenge (DFDC) dataset.
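
To illustrate what a spectral feature of an individual frame can look like, the following is a minimal sketch of a 2D Discrete Cosine Transform applied to one grayscale frame. It is not the network or preprocessing described in this work: the function names `dct_matrix` and `frame_log_spectrum`, the 64x64 frame size, the random stand-in frame, and the log-magnitude scaling are all assumptions made for illustration.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of shape (n, n)."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] /= np.sqrt(2.0)  # rescale the DC row so the matrix is orthonormal
    return basis


def frame_log_spectrum(frame: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """2D DCT-II of a grayscale frame, returned as log-magnitude coefficients.

    The log scaling is an illustrative assumption, not a step taken from the paper.
    """
    h, w = frame.shape
    # Separable 2D transform: DCT along rows, then along columns.
    coeffs = dct_matrix(h) @ frame.astype(np.float64) @ dct_matrix(w).T
    return np.log(np.abs(coeffs) + eps)


# Hypothetical stand-in for a cropped grayscale face frame.
rng = np.random.default_rng(0)
frame = rng.random((64, 64))
spectrum = frame_log_spectrum(frame)  # same shape as the input frame
```

Low-frequency coefficients sit in the top-left corner of `spectrum` and high-frequency coefficients toward the bottom-right, so a map like this could be fed to a classifier alongside the spatial frame.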