VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction
Abstract
A new dataset and language-based framework for logical anomaly detection in industrial inspection, addressing challenges from visual distractions through text-description-based contrastive learning.
Logical anomaly detection in industrial inspection remains challenging due to variations in visual appearance (e.g., background clutter, illumination shift, and blur), which often distract vision-centric detectors from identifying rule-level violations. Yet existing benchmarks rarely provide controlled settings in which logical states are fixed while such nuisance factors vary. To address this gap, we introduce VID-AD, a dataset for logical anomaly detection under vision-induced distraction. It comprises 10 manufacturing scenarios and five capture conditions, totaling 50 one-class tasks and 10,395 images. Each scenario is defined by two logical constraints selected from quantity, length, type, placement, and relation, with anomalies including both single-constraint and combined violations. We further propose a language-based anomaly detection framework that relies solely on text descriptions generated from normal images. Using contrastive learning with positive texts and contradiction-based negative texts synthesized from these descriptions, our method learns embeddings that capture logical attributes rather than low-level features. Extensive experiments demonstrate consistent improvements over baselines across the evaluated settings. The dataset is available at: https://github.com/nkthiroto/VID-AD.
Community
This paper introduces VID-AD (Vision-Induced Distraction Anomaly Dataset), a new benchmark for image-level logical anomaly detection. Unlike existing datasets, VID-AD specifically evaluates robustness against visual distractions (background clutter, illumination shifts, and blur) while focusing on five logical constraints: Quantity, Length, Type, Placement, and Relation. The dataset comprises 10 manufacturing scenarios and 5 capture conditions, totaling 10,395 images.
We also propose a novel vision-to-text anomaly detection framework. It leverages a frozen VLM to convert images into structured descriptions and fine-tunes a BERT encoder through contrastive learning. By synthesizing contradictory negative texts from normal descriptions, our method learns to prioritize global logical consistency over irrelevant pixel-level fluctuations.
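Below is a minimal sketch of what such a text-contrastive objective could look like, assuming PyTorch and Hugging Face transformers. The model choice (`bert-base-uncased`), the example descriptions, the quantity-flipping negation rule, the triplet loss with margin 0.5, and the cosine-distance scoring at inference are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")  # fine-tuned; the VLM that writes descriptions stays frozen

def embed(texts):
    # Masked mean pooling over token embeddings, L2-normalized.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

def contradict(description):
    # Placeholder negation rule: flip one logical attribute (here, quantity).
    # The paper synthesizes contradiction-based negatives from normal texts;
    # the concrete rewriting rules are not specified on this page.
    return description.replace("two screws", "three screws")

# Two VLM descriptions of the same normal scene act as anchor/positive;
# the synthesized contradiction is the negative. Example texts are invented.
anchor = embed(["a tray with two screws placed left of a long cable"])
positive = embed(["a tray containing two screws to the left of a long cable"])
negative = embed([contradict("a tray with two screws placed left of a long cable")])

# Pull consistent descriptions together, push logical contradictions apart.
loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.5)
loss.backward()

# Inference (assumed scoring rule): describe the test image with the frozen
# VLM, then score it by embedding distance to a prototype of normal texts.
with torch.no_grad():
    prototype = embed(["a tray with two screws placed left of a long cable"]).mean(dim=0)
    test_emb = embed(["a tray with three screws placed left of a long cable"])[0]
    anomaly_score = 1.0 - torch.dot(test_emb, prototype)  # cosine distance
```

Because the encoder only ever sees text, appearance nuisances such as clutter or blur enter the pipeline only through the frozen VLM's descriptions, which is what lets the learned embedding focus on logical attributes.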
Submitted at the suggestion of Niels Rogge (Hugging Face) to share this benchmark and method with the community.
Resources:
- Dataset: https://github.com/nkthiroto/VID-AD
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer (2026)
- HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection (2026)
- TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection (2026)
- VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection (2026)
- SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling (2026)
- Multi-Cue Anomaly Detection and Localization under Data Contamination (2026)
- DevPrompt: Deviation-Based Prompt Learning for One-Normal-Shot Image Anomaly Detection (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend