
Can You Spot AI Video? New Study Shows Even Best AI Models Fail

December 17, 2025

Video Reality Test Study | Wang et al. / arXiv (CC BY-SA 4.0)


Video Reality Test: Oxford study reveals AI detection challenges with synthetic video

Researchers from Oxford, the Chinese University of Hong Kong, the National University of Singapore, and Video Rebirth published findings today that reveal a troubling gap: even the most advanced AI vision models struggle to distinguish real videos from AI generated ones. The best performing model, Google's Gemini 2.5 Pro, achieved only 56% accuracy, barely better than random guessing, while human experts reached 81.25%.

The study, titled "Video Reality Test: Can AI Generated ASMR Videos fool VLMs and Humans?", tested 13 different video generation models against 10 vision language models (VLMs) using ASMR videos as the benchmark. The choice of ASMR content proves strategic: these videos require tight audiovisual coupling with fine grained object interactions, creating a demanding test of perceptual realism.

Test Yourself: Can You Identify the AI Videos?

Before diving into the findings, test your own detection abilities. Among these 12 videos, only 4 are real. Can you identify which ones? Click each video to play with sound.

Video 1: Can you tell if this is real or AI generated?
Video 2: Real or fake?
Video 3: Can you detect if this is synthetic?
Video 4: Real footage or AI generation?
Video 5: What's your verdict?
Video 6: Can you identify this one?
Video 7: Real or AI generated?
Video 8: Make your guess.
Video 9: Synthetic or authentic?
Video 10: Can you tell the difference?
Video 11: Real or generated?
Video 12: Final test: what's your call?

Answer: Real videos are 2, 6, 7, and 11. AI generated videos are 1, 3, 4, 5, 8, 9, 10, and 12.

How did you perform? If you struggled, you're in good company. Even Google's most advanced vision model, Gemini 2.5 Pro, correctly identifies fake videos only 56% of the time when tested against Veo 3.1 Fast generations—barely better than flipping a coin.

Why ASMR Videos Make the Perfect Test

ASMR (Autonomous Sensory Meridian Response) videos require precise synchronization between audio and visual elements. The genre focuses on subtle sounds—tapping, scratching, crinkling—paired with closeup footage of objects and hands. This tight audiovisual coupling creates a demanding benchmark for testing whether AI generated content can match perceptual realism.

The researchers built their dataset from 149 carefully curated real ASMR videos, targeting fine-grained action-object interactions with diversity across objects, actions, and backgrounds. They then used 13 different video generation models to create synthetic versions, testing whether these AI creations could fool both automated detection systems and human observers.

The methodology follows a "peer review" framework: video generation models act as "creators" attempting to fool reviewers, while vision language models serve as "reviewers" seeking to identify fakeness. This adversarial setup mirrors real world scenarios where synthetic content creators continuously improve their techniques while detection systems race to keep pace.

The Models Tested: Generators and Detectors

Video Generation Models (Creators):

  • Google Veo 3.1-Fast
  • OpenAI Sora 2
  • Tencent HunyuanVideo
  • Wan AI 2.2 (14B and 5B parameter versions)
  • OpenSora V2
  • StepFun StepVideo
  • Plus 7 additional model configurations

Vision Language Models (Reviewers):

  • Google Gemini 2.5-Pro
  • Google Gemini 2.5-Flash
  • OpenAI GPT-4o
  • OpenAI GPT-4o-mini
  • OpenAI GPT-5 (Preview)
  • Qwen 2.5-VL (multiple parameter sizes)
  • GLM-4.5V
  • Plus 3 additional open-source models

Each VLM evaluated videos generated by each creation model, rating whether videos were real or AI generated. The setup tested both video-only and video-with-audio conditions to determine whether audio information improves detection accuracy.
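A minimal sketch of that evaluation loop, assuming hypothetical model identifiers and a placeholder query_vlm() helper rather than the authors' released code, looks like this:

```python
# Sketch of the "creators vs. reviewers" loop: every VLM judges every clip from
# every generator, and the detection rate is the share of synthetic clips
# correctly flagged as fake. Model names and query_vlm() are placeholders.
from itertools import product

GENERATORS = ["veo-3.1-fast", "sora-2", "hunyuanvideo"]      # creators (hypothetical IDs)
REVIEWERS = ["gemini-2.5-pro", "gpt-4o", "qwen2.5-vl-72b"]   # reviewers (hypothetical IDs)
CONDITIONS = ["video_only", "video_with_audio"]

def query_vlm(reviewer: str, video_path: str, condition: str) -> str:
    """Placeholder for an API call that returns 'real' or 'fake'."""
    raise NotImplementedError

def detection_rates(videos_by_generator: dict[str, list[str]]) -> dict:
    """Share of synthetic clips each reviewer flags as fake, per generator and condition."""
    results = {}
    for reviewer, generator, condition in product(REVIEWERS, GENERATORS, CONDITIONS):
        clips = videos_by_generator[generator]
        flagged = sum(query_vlm(reviewer, clip, condition) == "fake" for clip in clips)
        results[(reviewer, generator, condition)] = flagged / len(clips)
    return results
```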

Key Findings: The Detection Gap

Finding 1: Even Advanced VLMs Perform Poorly

The top performing detector, Gemini 2.5-Pro, achieved only 67.34% average accuracy across all generator models. Against the best creator (Veo 3.1-Fast), this drops to just 51.56%—essentially random guessing. Human experts, by contrast, reached 81.25% accuracy across the same test set and 89.11% average performance.

This gap reveals that current vision language models lack the perceptual sophistication to reliably detect high-quality synthetic video, even when specifically designed for multimodal understanding. The failure occurs across proprietary and open-source models, with most open-source VLMs performing significantly worse.

GPT-4o achieved 60.22% average accuracy, while GPT-4o-mini managed only 57.28%. Open-source models fared even worse: Qwen 2.5-VL-72B reached 58.87%, and GLM-4.5V achieved 63.61%. The performance ceiling appears fundamentally limited by current architectural approaches to video understanding.

Finding 2: Veo 3.1-Fast Dominates Video Generation Quality

Among creator models, Google's Veo 3.1-Fast produced the most convincing fakes. Only 12.54% of its image-to-video generations were correctly identified as fake by VLMs on average. Even the strongest detector, Gemini 2.5-Pro, caught only 17% of Veo 3.1-Fast fakes.

Sora 2 performed surprisingly poorly despite being from OpenAI. VLMs detected Sora 2 generations as fake 46.58% of the time when using image-to-video mode. However, this detection rate dropped dramatically to 16.55% when researchers removed the Sora 2 watermark, suggesting models relied heavily on superficial cues rather than understanding actual video quality or realism.

Open-source models showed competitive performance. Tencent's HunyuanVideo text-to-video achieved a 14.77% detection rate, while Wan AI's 2.2-A14B model reached 16.10% for image-to-video generation. This narrow gap between proprietary and open-source models indicates the democratization of high quality synthetic video generation.

Finding 3: Audio Helps Detection But Not Dramatically

Adding audio to the video samples improved detection accuracy modestly for Gemini models. Gemini 2.5-Pro's accuracy increased from 67.34% to 68.44% when audio was included. Gemini 2.5-Flash showed similar gains, moving from 61.59% to 63.15%.

The improvement occurs because current generation models struggle to produce coherent audio that matches video content. Sora 2, for instance, generates human speech instead of the continuous ambient sounds characteristic of ASMR content. These audiovisual mismatches provide additional detection cues that VLMs can exploit.

However, the modest improvement suggests audio analysis remains underdeveloped in current vision language models. Even with audio providing additional information, detection accuracy remains far below human performance, indicating fundamental limitations in how these models process multimodal content.

Finding 4: Superficial Cues Mislead Models

The watermark experiment proved particularly revealing. Sora 2 videos with watermarks were detected as fake 46.58% of the time. When researchers removed the watermark, detection dropped to 16.55%—a 30 percentage point decrease. This dramatic shift demonstrates that VLMs rely heavily on obvious artifacts rather than subtle perceptual cues that indicate synthetic generation.

Human experts don't experience the same confusion. They focus on physical plausibility, temporal consistency, and fine-grained details that reveal generation artifacts. VLMs, by contrast, appear to use simple heuristics: if a video has a watermark from a known AI company, mark it fake; otherwise, assume it's real.

This limitation has serious implications for real world deployment. As generation models improve and creators learn to remove or avoid obvious tells, detection systems that rely on superficial cues will fail catastrophically.

Finding 5: Model Scale Matters for Generation Quality

Among Wan AI's models, the 14-billion parameter version significantly outperformed the 5-billion parameter variant. Wan 2.2-A14B achieved 16.10% detection rate, while Wan 2.2-5B reached 20.97%. Larger models capture more nuanced patterns in real video, enabling more convincing synthesis.

However, scale alone doesn't guarantee success. StepVideo, despite presumably having substantial parameters given its commercial deployment, produced easily detectable fakes with 83.62% detection rate. Architecture, training data quality, and optimization techniques matter as much as raw parameter count.

The relationship between scale and quality suggests that as computational resources increase and training techniques improve, synthetic video will become increasingly difficult to distinguish from real footage. The detection challenge will worsen, not improve, over time.

Implications for AI Filmmaking

For filmmakers working with AI video generation, these findings provide practical guidance on tool selection and workflow integration.

Veo 3.1-Fast Produces the Most Realistic Output

Filmmakers prioritizing realism should consider Veo 3.1-Fast for projects where synthetic video must pass as authentic. The model's 12.54% detection rate indicates it generates video that matches human perception of reality better than competing options. This quality makes it suitable for background replacement, establishing shots, or B-roll where scrutiny is less intense.

However, Veo 3.1-Fast requires Google Cloud access and operates as a closed source commercial service. Cost and availability may limit adoption for independent creators or small studios.

Open-Source Models Close the Gap

HunyuanVideo and Wan AI 2.2-A14B demonstrate that open-source alternatives can approach proprietary quality. Both models achieved detection rates under 17%, making them viable for professional production where slight quality differences are acceptable trade-offs for full control and zero usage fees.

These open-source options enable customization, local deployment, and integration into proprietary pipelines, advantages that may outweigh marginal quality differences for many production scenarios.

Audio Remains a Weak Point

All tested models struggle to generate convincing audio that synchronizes properly with video content. Filmmakers should plan to replace AI generated audio with recorded sound, foley work, or separately synthesized audio designed specifically for the visual content.

The accuracy gain when VLMs analyze audio indicates that audiovisual mismatches create additional detection cues. Professional production workflows should treat AI video generation as a visual-only tool, handling audio through traditional methods.
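As one illustration, swapping in separately recorded audio can be scripted. The snippet below is a generic ffmpeg invocation with placeholder file names, not a workflow from the study, and assumes ffmpeg is installed locally:

```python
# Replace the generated audio track with separately recorded foley, keeping the
# video stream untouched. File names are placeholders.
import subprocess

subprocess.run([
    "ffmpeg",
    "-i", "generated_clip.mp4",    # AI-generated video (its audio is discarded)
    "-i", "foley_recording.wav",   # recorded or separately synthesized audio
    "-map", "0:v", "-map", "1:a",  # video from input 0, audio from input 1
    "-c:v", "copy",                # copy video without re-encoding
    "-c:a", "aac",                 # encode the new audio track
    "-shortest",                   # trim to the shorter of the two streams
    "output_clip.mp4",
], check=True)
```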

Watermarks and Artifacts Must Be Removed

The 30 point drop in detection when watermarks were removed demonstrates that obvious artifacts severely compromise perceived realism. Post-generation cleanup (removing watermarks, fixing temporal inconsistencies, color grading for consistency) should be standard practice when using synthetic video in professional contexts.

This requirement adds production time but proves necessary for maintaining the illusion of authentic footage. Audiences may not consciously notice watermarks, but detection systems certainly do, and as AI literacy increases, human viewers will develop similar sensitivities.

Detection Systems Lag Behind Generation

The performance gap between human experts (81.25% accuracy) and the best VLM (56% accuracy) reveals that automated detection systems cannot reliably identify high quality synthetic video. This has significant implications for content moderation, misinformation prevention, and authentication systems.

Current detection approaches fail because they rely on pattern matching and superficial cues rather than understanding physical plausibility and perceptual consistency. When generation models learn to avoid obvious tells—temporal jitter, physical impossibilities, lighting inconsistencies—detection systems trained on these artifacts become useless.

The study suggests detection research needs new directions:

Multimodal Consistency Analysis

Rather than analyzing video or audio independently, detection systems should evaluate whether audio and visual information cohere according to physical principles. Does the sound of fabric moving match the visual representation of fabric deformation? Do footsteps correspond to surface materials shown in video?

This approach requires physics informed models that understand material properties, acoustics, and causal relationships between events. Current VLMs lack this grounding in physical reality.
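A toy illustration of what such a consistency signal could look like, under the assumption that loudness and on-screen motion should rise and fall together in genuine ASMR footage (an illustrative heuristic, not the study's method):

```python
# Correlate the audio loudness envelope with frame-to-frame visual motion.
# Well-synchronized real clips should score higher than clips whose audio was
# generated independently of the visuals. Inputs are plain NumPy arrays.
import numpy as np

def audiovisual_sync_score(frames: np.ndarray, audio: np.ndarray, fps: int, sr: int) -> float:
    """frames: (T, H, W) grayscale video; audio: mono waveform sampled at sr."""
    # Visual motion energy: mean absolute difference between consecutive frames.
    motion = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    # Audio energy per video frame: RMS over the samples belonging to each frame.
    samples_per_frame = sr // fps
    n = min(len(motion), len(audio) // samples_per_frame - 1)
    rms = np.array([
        np.sqrt(np.mean(audio[i * samples_per_frame:(i + 1) * samples_per_frame] ** 2))
        for i in range(1, n + 1)
    ])
    # Pearson correlation between the two envelopes; higher = better coupled.
    return float(np.corrcoef(motion[:n], rms)[0, 1])
```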

Temporal Coherence Evaluation

Synthetic video often exhibits subtle temporal inconsistencies—object positions that don't maintain proper trajectories, lighting that shifts unnaturally, or motion that violates momentum conservation. Specialized architectures focused on temporal consistency rather than frame level classification might detect these artifacts more reliably.
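As a rough sketch of the idea, and again an assumption rather than anything proposed in the paper, one could flag clips whose frame-to-frame motion signal changes abruptly:

```python
# Score how abruptly the motion signal changes between consecutive frames.
# Smoothly captured footage should yield low scores; popping or jitter in
# synthetic video shows up as spikes. Purely illustrative.
import numpy as np

def temporal_jitter_score(frames: np.ndarray) -> float:
    """frames: (T, H, W) grayscale video; higher score = less coherent motion."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    # Normalize the motion signal, then take the largest step change in it.
    diffs = (diffs - diffs.mean()) / (diffs.std() + 1e-8)
    return float(np.abs(np.diff(diffs)).max())
```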

Adversarial Training With Latest Models

Detection systems trained on older generation models fail against newer ones. Continuous adversarial training, where detection systems explicitly learn to identify artifacts from state-of-the-art generators, could maintain effectiveness as generation quality improves.

However, this approach requires access to proprietary models like Veo 3.1-Fast and Sora 2, creating an asymmetry that favors attackers over defenders in the synthetic media arms race.

The Broader Context: Synthetic Media Detection

This study arrives as synthetic video generation reaches a critical threshold. Multiple models can now produce video that fools most people most of the time. The implications extend beyond filmmaking into misinformation, evidence authentication, and trust in digital media.

Legal and Evidentiary Concerns

If synthetic video becomes indistinguishable from real footage, video evidence loses reliability in legal proceedings. Courts have traditionally treated video as highly reliable evidence, but this study demonstrates that even experts armed with advanced AI tools struggle to verify authenticity.

Legal systems will need new authentication frameworks, perhaps based on cryptographic verification at capture time rather than post-hoc analysis. Cameras might need to digitally sign footage with hardware-secured keys, creating verifiable provenance chains.
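A conceptual sketch of what capture-time signing might look like, using the third-party Python cryptography package and placeholder file names (a real deployment would keep the private key in tamper-resistant hardware inside the camera):

```python
# Sign the hash of a clip at capture time, then verify it later against the
# camera's public key. Illustrative only; "clip.mp4" is a placeholder path.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

camera_key = Ed25519PrivateKey.generate()          # stands in for a hardware-held key
footage_hash = hashlib.sha256(open("clip.mp4", "rb").read()).digest()
signature = camera_key.sign(footage_hash)          # stored alongside the footage

# Later, anyone holding the camera's public key can check provenance.
try:
    camera_key.public_key().verify(signature, footage_hash)
    print("Footage matches the signed capture.")
except InvalidSignature:
    print("Footage was altered or did not come from this camera.")
```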

Misinformation and Trust

When people cannot distinguish real from synthetic, trust in all video content erodes. This affects journalism, documentary filmmaking, citizen reporting, and public discourse. The "liar's dividend", where real evidence can be dismissed as fake, becomes more powerful as detection ambiguity increases.

Media organizations may need to implement verification systems showing full provenance: where footage originated, who captured it, what modifications occurred. This transparency layer could maintain trust even when visual verification becomes impossible.

Content Moderation Challenges

Platforms hosting user generated content cannot rely on automated systems to detect synthetic video reliably. With accuracy hovering around 56%, nearly half of all fake content passes undetected. This necessitates different moderation approaches, perhaps focusing on behavioral patterns, source reputation, and content metadata rather than trying to verify each video's authenticity.

Research Methodology and Limitations

The study's focus on ASMR videos provides both strengths and limitations. ASMR content requires precise audiovisual synchronization with fine grained details, creating a demanding test. However, ASMR represents a specific domain with particular characteristics. Performance on ASMR videos may not generalize to other content types.

The researchers collected 149 real ASMR videos, extracted frames, generated text descriptions, and clustered content to ensure diversity. They then used 13 different generation model configurations to create synthetic versions, producing a dataset of 149 real videos plus 1,937 synthetic variants (149 scenes × 13 models).
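One plausible way to perform that diversity clustering, assuming per-clip frame embeddings and scikit-learn's KMeans as stand-ins for whatever the authors actually used:

```python
# Cluster candidate clips by a representative frame embedding and keep one clip
# per cluster to enforce diversity. Embeddings here are random stand-ins.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frame_embeddings = rng.normal(size=(500, 512))   # one embedding per candidate clip

kmeans = KMeans(n_clusters=149, n_init=10, random_state=0).fit(frame_embeddings)
# Pick the clip closest to each cluster centre as its representative.
selected = [
    int(np.argmin(np.linalg.norm(frame_embeddings - c, axis=1)))
    for c in kmeans.cluster_centers_
]
print(f"Curated {len(set(selected))} diverse clips from {len(frame_embeddings)} candidates.")
```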

Testing focused on short clips optimized for online platforms rather than long form content. Longer videos might reveal different detection patterns as subtle artifacts accumulate over extended duration.

The human evaluation involved expert reviewers rather than general population samples. Expert performance (81.25%) likely exceeds average human capability, so the human/AI gap might be smaller when comparing to non-expert viewers.

What Comes Next

The research team released the complete dataset including real videos, extracted images, text prompts, and generated outputs from all 13 model configurations. This enables other researchers to develop improved detection methods or test new generation models against the benchmark.

The leaderboard structure encourages ongoing evaluation as new models emerge. Researchers can submit results for updated models via email to the team, creating a living benchmark that tracks the state of the art in both generation and detection.

For filmmakers, the clear message: AI video generation has reached production quality for many use cases, but audio generation lags behind, and post-processing remains essential for removing artifacts. Detection systems cannot reliably flag synthetic content, placing responsibility on creators to disclose AI usage when appropriate.

For researchers, the findings delineate clear directions: multimodal consistency analysis, physics informed detection, and continuous adversarial training represent promising approaches for improving detection as generation quality advances.

For society, the implications are stark: video can no longer serve as unquestionable truth. Authentication systems, provenance tracking, and trust frameworks need fundamental rethinking for an era when seeing is no longer believing.

Research Team and Acknowledgments

This research was conducted by Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, and Kevin Qinghong Lin from:

  • The Chinese University of Hong Kong
  • National University of Singapore
  • Video Rebirth
  • University of Oxford

The complete research paper, dataset, code, and model weights have been released publicly by the team.

For updates to the leaderboard or to submit new model results, contact the research team at wjqkoko@foxmail.com.

Sources