# Image Authenticity Detector: Detailed Project Summary

This document explains the architecture, design decisions, and internal logic of the Image Authenticity Detector project in detail.

## Overview

The goal of this project is to determine accurately whether a given image is **REAL** (an authentic photograph) or **FAKE** (AI-generated, deepfake, or digitally manipulated).

Because no single detection model is perfect, and because generative AI techniques change rapidly, this project uses a **5-Model Confidence-Weighted Ensemble**. It runs the input image through five completely different analysis paradigms (ranging from large neural networks to pure mathematics), evaluates how confident each model is, and dynamically combines their votes into a robust final verdict.

---

## 1. The Five Detection Engines

The ensemble consists of four AI-based engines and one purely mathematical (non-AI) engine.

### Engine 1: Primary ViT Classifier (`dima806/ai_vs_real_image_detection`)

- **Type**: Vision Transformer (ViT) fine-tuned for binary classification.
- **Base Weight**: 35%
- **Role**: This is the primary heavy-lifter. It has been trained extensively on millions of real photos and AI-generated images, and it excels at spotting artifacts from standard diffusion models (such as Midjourney and DALL-E).

### Engine 2: Secondary Deepfake ViT (`prithivMLmods/Deep-Fake-Detector-v2-Model`)

- **Type**: Specialized Vision Transformer.
- **Base Weight**: 25%
- **Role**: While the primary model is strong on "AI art", this secondary model is a specialist trained primarily to detect **manipulated human faces** (deepfakes), GAN artifacts, and structural blending errors.

### Engine 3: Zero-Shot Semantics (OpenAI `CLIP ViT-L/14`)

- **Type**: Foundational contrastive language-image pretraining model.
- **Base Weight**: 20%
- **Role**: CLIP was not trained to catch deepfakes. Instead, we use a technique called **Zero-Shot Prompting**: we give it sets of descriptive sentences.
  - *Real prompts*: "a raw candid photo with natural lighting and depth of field", "a genuine photo..."
  - *Fake prompts*: "an AI-generated image with flawlessly smooth skin", "a synthetic image..."
- **Logic**: We embed the image and the text prompts into the same mathematical space and check which set of prompts the image sits closer to (measured via cosine similarity). This acts as a semantic "vibe check". The temperature is set to `40.0` to amplify small similarity differences; a minimal sketch of this approach follows this section.
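The snippet below is a minimal sketch of this zero-shot scoring using the Hugging Face `transformers` library; the prompt lists are abbreviated and the function name is illustrative rather than the project's actual code.

```python
# Minimal sketch of CLIP zero-shot scoring. Assumes the `transformers`, `torch`,
# and `Pillow` packages; prompt lists are abbreviated, names are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-large-patch14"  # CLIP ViT-L/14
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

REAL_PROMPTS = ["a raw candid photo with natural lighting and depth of field"]
FAKE_PROMPTS = ["an AI-generated image with flawlessly smooth skin"]
TEMPERATURE = 40.0  # amplifies small cosine-similarity gaps before the softmax

def clip_fake_probability(image: Image.Image) -> float:
    """Return a fake probability from temperature-scaled prompt similarities."""
    prompts = REAL_PROMPTS + FAKE_PROMPTS
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalise embeddings so the dot product equals cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)                   # one similarity per prompt
    probs = torch.softmax(sims * TEMPERATURE, dim=0)  # temperature-scaled softmax
    return float(probs[len(REAL_PROMPTS):].sum())     # mass on the fake prompts

# Usage: clip_fake_probability(Image.open("photo.jpg").convert("RGB"))
```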
### Engine 4: Frequency & Forensic Analysis (Non-AI)

- **Type**: Classical computer vision and mathematics.
- **Base Weight**: 15%
- **Role**: Generative AI models can produce convincing pixels, but they struggle to reproduce the *invisible physical properties* of those pixels correctly. This engine looks at four mathematical features:
  1. **Spectral Alpha `1/f` Rule (FFT)**: Real photographs show a very specific energy drop-off from low to high frequencies (1/f noise) caused by physical lens and sensor limitations. We compute the radial frequency spectrum and fit its slope (alpha). If alpha deviates from the natural range (~1.5–2.5), it points to an AI generator.
  2. **Benford's Law (DCT)**: Standard JPEG compression in real cameras pushes the first digits of the Discrete Cosine Transform (DCT) coefficients toward Benford's Law. AI models tend to produce coefficient distributions that heavily violate this law.
  3. **Error Level Analysis (ELA)**: Re-saves the image and maps the error residual. Real images compress non-uniformly (sky vs. trees); diffusion model outputs compress with unnaturally high uniformity.
  4. **Texture CV**: AI images tend to be globally smooth but locally ultra-sharp, whereas real images have far more varied local textures (measured via the coefficient of variation of Laplacian responses across tiles).

### Engine 5: Conventional CNN (`EfficientNet-B4`)

- **Type**: Convolutional Neural Network.
- **Base Weight**: 5%
- **Role**: Currently serves as a structural feature backbone. This model's primary contribution is generating the **GradCAM heatmap** shown in the Web UI, which highlights *where* in the image the network is looking. Its voting weight is kept low unless the user provides fine-tuned custom weights (`weights/cnn_detector.pth`).

---

## 2. The Voting Process (Confidence-Aware Weighting)

If we blindly averaged the five scores, a model that is completely confused (50/50) would drag the final score toward the middle, degrade accuracy, and cause too many "Uncertain" verdicts.

To fix this, the project uses **Confidence-Aware Weighting** (see the sketch at the end of this document):

1. Every model outputs a probability in `[0.0, 1.0]`.
2. The distance from `0.50` determines the model's confidence: `C = max(|p - 0.5| * 2, 0.10)`.
3. The model's allocated base weight is multiplied by `C`.
4. Therefore, if a model outputs exactly `0.51`, its voting power is slashed drastically; if a model outputs `0.98`, its vote counts at almost full strength.

This allows the ensemble to "listen" only to the experts that are highly confident about a particular image.

---

## 3. The Final Verdict

The dynamically weighted votes are summed to produce a final `Fake Probability` percentage.

- **0.00 - 0.45: REAL (Authentic Photograph)**
- **0.45 - 0.55: UNCERTAIN**
  - This narrow band indicates that the models fiercely disagreed, or that the image is so heavily compressed or small that there is simply not enough data for either humans or AI to make a definitive call.
- **0.55 - 1.00: FAKE (AI-Generated / Manipulated)**

The internal default threshold sits at `0.46`, and the system is tuned to favor flagging images as real unless there is compelling evidence to the contrary.

---

## 4. Standalone POC vs. Web UI

The project is built around two primary entry points:

1. **[app.py](file:///c:/Users/PRAG/Desktop/newproject/app.py)**: The fully interactive Gradio Web Interface. It handles image uploading, builds the dark/light-mode CSS layouts, visualizes the confidence of all five models using HTML progress bars, and generates and renders the CNN GradCAM heatmap over the image.
2. **[single_file_poc.py](file:///c:/Users/PRAG/Desktop/newproject/single_file_poc.py)**: A ~450-line monolithic, highly portable script. It strips away all the UI, bundles the internal logic of all six Python files together, and runs the entire ensemble directly in the terminal (or a Jupyter notebook) with zero local imports.
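As a reference for Sections 2 and 3, here is a minimal sketch of the confidence-aware weighting and the verdict bands. The base weights, the confidence formula, and the thresholds come from this document; the dictionary keys, function names, and the example scores are illustrative placeholders, not the project's actual code.

```python
# Minimal sketch of confidence-aware weighting (Section 2) and the verdict bands
# (Section 3). Weights, formula, and thresholds are taken from this document;
# the keys, names, and example values are illustrative only.
BASE_WEIGHTS = {
    "vit_primary": 0.35,       # dima806/ai_vs_real_image_detection
    "vit_deepfake": 0.25,      # prithivMLmods/Deep-Fake-Detector-v2-Model
    "clip_zero_shot": 0.20,    # OpenAI CLIP ViT-L/14
    "forensics": 0.15,         # FFT / Benford / ELA / texture features
    "cnn_efficientnet": 0.05,  # EfficientNet-B4 backbone
}

def ensemble_fake_probability(scores: dict[str, float]) -> float:
    """Combine per-model fake probabilities with confidence-aware weights."""
    weighted_sum, total_weight = 0.0, 0.0
    for name, p in scores.items():
        confidence = max(abs(p - 0.5) * 2, 0.10)  # 0.50 -> 0.10 floor, 0.0/1.0 -> 1.0
        weight = BASE_WEIGHTS[name] * confidence
        weighted_sum += weight * p
        total_weight += weight
    return weighted_sum / total_weight

def verdict(p_fake: float) -> str:
    """Map the ensemble probability onto the REAL / UNCERTAIN / FAKE bands."""
    if p_fake < 0.45:
        return "REAL"
    if p_fake <= 0.55:
        return "UNCERTAIN"
    return "FAKE"

# Example: two confident ViTs outvote an undecided forensic engine.
example = {
    "vit_primary": 0.98,
    "vit_deepfake": 0.90,
    "clip_zero_shot": 0.65,
    "forensics": 0.51,
    "cnn_efficientnet": 0.55,
}
print(verdict(ensemble_fake_probability(example)))  # -> FAKE
```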