VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction
Abstract
A new dataset and language-based framework for logical anomaly detection in industrial inspection, addressing challenges from visual distractions through text-description-based contrastive learning.
Logical anomaly detection in industrial inspection remains challenging due to variations in visual appearance (e.g., background clutter, illumination shift, and blur), which often distract vision-centric detectors from identifying rule-level violations. Yet existing benchmarks rarely provide controlled settings in which logical states are fixed while such nuisance factors vary. To address this gap, we introduce VID-AD, a dataset for logical anomaly detection under vision-induced distraction. It comprises 10 manufacturing scenarios and five capture conditions, totaling 50 one-class tasks and 10,395 images. Each scenario is defined by two logical constraints selected from quantity, length, type, placement, and relation, with anomalies including both single-constraint and combined violations. We further propose a language-based anomaly detection framework that relies solely on text descriptions generated from normal images. Using contrastive learning with positive texts and contradiction-based negative texts synthesized from these descriptions, our method learns embeddings that capture logical attributes rather than low-level features. Extensive experiments demonstrate consistent improvements over baselines across the evaluated settings. The dataset is available at: https://github.com/nkthiroto/VID-AD.
Community
This paper introduces VID-AD (Vision-Induced Distraction Anomaly Dataset), a new benchmark for image-level logical anomaly detection. Unlike existing datasets, VID-AD specifically evaluates robustness against visual distractions (background clutter, illumination shifts, and blur) while focusing on five logical constraints: Quantity, Length, Type, Placement, and Relation. The dataset comprises 10 manufacturing scenarios and 5 capture conditions, totaling 10,395 images.
We also propose a novel vision-to-text anomaly detection framework. It leverages a frozen VLM to convert images into structured descriptions and fine-tunes a BERT encoder through contrastive learning. By synthesizing contradictory negative texts from normal descriptions, our method learns to prioritize global logical consistency over irrelevant pixel-level fluctuations.
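Below is a minimal sketch of what such a text-contrastive objective could look like, assuming PyTorch and Hugging Face transformers. The model choice (`bert-base-uncased`), the example descriptions, the quantity-flipping negation rule, the triplet loss with margin 0.5, and the cosine-distance scoring at inference are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")  # fine-tuned; the VLM that writes descriptions stays frozen

def embed(texts):
    # Masked mean pooling over token embeddings, L2-normalized.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

def contradict(description):
    # Placeholder negation rule: flip one logical attribute (here, quantity).
    # The paper synthesizes contradiction-based negatives from normal texts;
    # the concrete rewriting rules are not specified on this page.
    return description.replace("two screws", "three screws")

# Two VLM descriptions of the same normal scene act as anchor/positive;
# the synthesized contradiction is the negative. Example texts are invented.
anchor = embed(["a tray with two screws placed left of a long cable"])
positive = embed(["a tray containing two screws to the left of a long cable"])
negative = embed([contradict("a tray with two screws placed left of a long cable")])

# Pull consistent descriptions together, push logical contradictions apart.
loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.5)
loss.backward()

# Inference (assumed scoring rule): describe the test image with the frozen
# VLM, then score it by embedding distance to a prototype of normal texts.
with torch.no_grad():
    prototype = embed(["a tray with two screws placed left of a long cable"]).mean(dim=0)
    test_emb = embed(["a tray with three screws placed left of a long cable"])[0]
    anomaly_score = 1.0 - torch.dot(test_emb, prototype)  # cosine distance
```

Because the encoder only ever sees text, appearance nuisances such as clutter or blur enter the pipeline only through the frozen VLM's descriptions, which is what lets the learned embedding focus on logical attributes.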
Submitted at the suggestion of Niels Rogge (Hugging Face) to share this benchmark and method with the community.
Resources:
- Dataset: https://github.com/nkthiroto/VID-AD
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer (2026)
- HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection (2026)
- TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection (2026)
- VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection (2026)
- SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling (2026)
- Multi-Cue Anomaly Detection and Localization under Data Contamination (2026)
- DevPrompt: Deviation-Based Prompt Learning for One-Normal-Shot Image Anomaly Detection (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend