Detection Heads

Detection head architectures on frozen EUPE-ViT-B features. The backbone is frozen throughout; only the head trains. The premise behind the repository is that a strong modern ViT encoder already carries enough semantic and spatial structure to support detection with a lightweight decoder on top, and that the detection head's architecture — not the backbone — is where the interesting design trade-offs live. Every result here is obtained from features extracted once and cached; training iterates over those cached tensors, which makes it possible to run architecture studies in minutes rather than the days required for end-to-end backbone training.

Contents

The reference detection head is a standard FCOS detector wrapped around a ViTDet-style simple feature pyramid (16.14M parameters, 41.0 mAP on COCO val2017). It is the detection head used in phanerozoic/argus, a multi-task perception model that attaches classification, segmentation, depth, correspondence, and detection heads to a single frozen EUPE-ViT-B backbone. Anyone building a similar multi-task system who wants a clean detection baseline, or anyone who wants to see how far a standard FCOS head gets on frozen ViT features, can take it off the shelf.

Alongside the reference head sits a library of seventeen experimental architectural scaffolds under heads/. Each is a single head.py file implementing a forward pass with no trained checkpoint — hypothesis-stage designs that formulate detection in unconventional ways: wavelet decomposition in place of FPN, optimal-transport label assignment, compositional patch assembly, tropical-semiring operations, corner-pair relational reasoning, and others. A driver script (arena.py) instantiates and trains any of them against the cached backbone features, so trying a new formulation is a one-command operation. The scaffolds are recorded so alternative formulations are not lost, and so readers looking for detection ideas have somewhere to browse.

Finally, one checkpoint is included from a parallel research program described in the next section: a parameter-efficient detection head that achieves 24.6 mAP on COCO val2017 with 4.07 million learnable parameters. This corresponds to approximately 60 percent of the FCOS baseline's mAP (41.0) at roughly one quarter of its parameter budget (16.14 million). The head replaces FCOS's learned feature pyramid network with a cofiber decomposition of the backbone patch features — a zero-parameter multi-scale operation that iteratively subtracts the downsampled-then-upsampled component of a feature map to produce frequency-separated scale bands. Four such bands are computed (at spatial strides of 16, 32, 64, and 128 pixels) and augmented with a finer-resolution level (stride 8) synthesized by a single transposed convolution applied to the stride-16 band; these five levels match the scale coverage of the FCOS simple feature pyramid. Each level is processed by separate classification and regression convolutional towers — five standard 3×3 convolutions followed by four depthwise residual blocks at 192 hidden channels, with shared weights across levels — and top-down lateral connections pass information from coarser bands into finer ones before the towers run. The head is included in this repository as a direct architectural comparison against the FCOS baseline on the same frozen backbone. The full parameter-versus-mAP scaling curve for the cofiber line of work, from 105-parameter minimal circuits up through variants at four million parameters, is hosted in the separate repository linked below.

Reference Results (COCO val2017)

| Head | Parameters | mAP | mAP@0.50 | mAP@0.75 |
|------|------------|-----|----------|----------|
| Baseline FCOS (simple feature pyramid) | 16,138,074 | 41.0 | 64.8 | 43.2 |
| Split-tower with multi-scale decomposition (5 levels) | 4,068,954 | 24.6 | 37.1 | 27.0 |

The FCOS baseline uses a simple feature pyramid that synthesizes five scale levels — P3 at stride 8 through P7 at stride 128, each with 256 channels and GroupNorm — from the backbone's stride-16 spatial features, followed by two shared four-layer convolutional towers (one for classification, one for box regression plus centerness) and three 1×1 prediction heads. It trains in eight epochs at 640-pixel input. The configuration is standard and the implementation is direct; the point of having it in this repository is to provide a clean, comparable reference.
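A minimal sketch of the simple-feature-pyramid construction described above. The layer choices (transposed conv for upsampling, max-pool plus a shared 1×1 projection for the coarser levels) are illustrative assumptions, not the repository's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFeaturePyramid(nn.Module):
    """Sketch: five levels P3 (stride 8) .. P7 (stride 128), 256 channels
    with GroupNorm, all derived from one stride-16 feature map. Layer
    choices here are illustrative, not the repository's exact code."""

    def __init__(self, in_ch=768, out_ch=256):
        super().__init__()
        # P3: transposed conv doubles resolution (stride 16 -> 8)
        self.p3 = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2),
            nn.GroupNorm(32, out_ch))
        # P4: lateral 1x1 projection at stride 16
        self.p4 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.GroupNorm(32, out_ch))
        # P5-P7: shared 1x1 projection applied after successive 2x max-pools
        self.proj = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.GroupNorm(32, out_ch))

    def forward(self, f):                      # f: (B, 768, 40, 40) at stride 16
        outs = [self.p3(f), self.p4(f)]
        x = f
        for _ in range(3):                     # strides 32, 64, 128
            x = F.max_pool2d(x, 2)
            outs.append(self.proj(x))
        return outs                            # [P3, P4, P5, P6, P7]
```

At 640-pixel input the five outputs have spatial sizes 80, 40, 20, 10, and 5, matching strides 8 through 128.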

The split-tower head reaches 24.6 mAP with 4.07M parameters. It differs from the FCOS baseline in two main ways: the multi-scale structure comes from a zero-parameter decomposition of the backbone features rather than from a learned FPN, and the classification and regression towers use separate weights throughout rather than sharing a tower. It is the current best parameter-efficient head against the same backbone. The full parameter-versus-mAP curve for this line of work — from 105-parameter minimal circuits up to 4M-parameter split-tower variants — is in the separate repository linked below.

Cofiber decomposition (separate repository)

A large parallel line of work on this backbone replaces the FCOS feature pyramid with cofiber decomposition: a zero-parameter multi-scale operation that iteratively subtracts the downsampled-then-upsampled component of a feature map to produce frequency-separated scale bands. Given backbone features f, the decomposition produces f − upsample(avgpool(f)) as a high-frequency band, then recurses on the low-frequency remainder. The resulting bands are pairwise orthogonal in frequency content, which lets a lightweight head attend to them separately without needing a learned FPN to mediate between scales.
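As a concrete sketch (NumPy, with nearest-neighbour upsampling assumed for simplicity; the actual operator choice may differ), the band recursion and its exact-reconstruction property look like:

```python
import numpy as np

def avgpool2x(x):
    # 2x2 average pooling over (C, H, W) features
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    # Nearest-neighbour upsampling back to the finer grid
    return x.repeat(2, axis=1).repeat(2, axis=2)

def cofiber_bands(f, n_levels=4):
    """High-frequency band = f - upsample(avgpool(f)); recurse on the
    low-frequency remainder. Zero parameters, zero training."""
    bands, low = [], f
    for _ in range(n_levels - 1):
        pooled = avgpool2x(low)
        bands.append(low - upsample2x(pooled))  # residual at this scale
        low = pooled
    bands.append(low)                           # coarsest remainder
    return bands

f = np.random.randn(768, 40, 40)
bands = cofiber_bands(f)                        # spatial sizes 40, 20, 10, 5

# Exactness: folding the bands back together reconstructs the input,
# which holds by construction for any choice of upsampling operator.
recon = bands[-1]
for b in reversed(bands[:-1]):
    recon = b + upsample2x(recon)
assert np.allclose(recon, f)
```

Each band lives at half the resolution of the previous one, so on the 40×40 stride-16 grid the four bands sit at strides 16, 32, 64, and 128.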

The decomposition has been machine-checked in Rocq/HoTT. The proof frames average pooling and bilinear upsampling as an adjoint pair whose counit gives a short exact sequence in a semi-additive category; the cofiber bands are the kernels of the resulting projections, and any input is uniquely expressible as a sum of its bands. The practical consequence is that the multi-scale construction which typically costs 11M parameters in an FPN is free — no parameters, no training, no FLOPs beyond pooling and interpolation.
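In symbols, writing p for average pooling and u for upsampling (notation ours, a sketch of the statement rather than the Rocq development's formulation), the bands and remainder after K levels are

```latex
b_k = (\mathrm{id} - u\,p)\,p^{k} f \quad (k = 0,\dots,K-1), \qquad r = p^{K} f,

\sum_{k=0}^{K-1} u^{k} b_k + u^{K} r
  = \sum_{k=0}^{K-1}\bigl(u^{k}p^{k}f - u^{k+1}p^{k+1}f\bigr) + u^{K}p^{K}f
  = f .
```

The reconstruction identity is a telescoping sum and holds for any u; the adjointness of pooling and upsampling is what the proof uses to additionally establish that the projections are well-behaved and the bands frequency-orthogonal.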

This has accumulated enough work around it to warrant its own home. The sibling repository phanerozoic/cofiber-detection contains:

  • Closed-form analytical heads derived from least-squares regression on the decomposed features (zero training, 70K parameters, 1.6 mAP)
  • Trained neural heads at multiple scales, from 70K parameters (5.2 mAP) through 3.85M-parameter split-tower variants (20.3 mAP) and 4.27M-parameter variants with top-down lateral connections (19.9 mAP), up to the 4.07M-parameter split-tower included here (24.6 mAP)
  • INT8 threshold-logic heads using Heaviside step functions (92K parameters, 5.9 mAP) and their pruned variants down to 46K nonzero parameters
  • Synthesizable Verilog circuit implementations, including a 93-parameter person image classifier that achieves 99.8% recall on COCO val images
  • Evolutionary dimension search (GPU-batched, 200 generations per second) that identifies informative feature subsets; the evolved 10-dimension person-detection circuit outperforms a greedy 100-dimension circuit at 10× fewer gates
  • Sheaf-cohomology regression features (directional feature differences at spatial boundaries, interpretable as Čech 1-cocycles)
  • The Rocq/HoTT proof of decomposition exactness
  • Full training and evaluation scripts

Readers interested in any of the above — parameter-efficient detection, formalized multi-scale operations, hardware circuits, evolutionary feature selection, topological regression features — should go there. This repository keeps the top-performing head as a comparison point but does not attempt to mirror the full research lineage.

Experimental Architectural Scaffolds

The seventeen scaffolds under heads/ each explore a distinct hypothesis about how detection could be formulated on frozen ViT features. None has been trained to convergence; the point is to record architectural options rather than to benchmark them:

| Scaffold | Core idea |
|----------|-----------|
| adaptive_query | Learned query vectors that adaptively attend to backbone patches, producing detections as attention outputs rather than per-location predictions |
| cascade_pool | A cascade of pooling operations to synthesize multi-scale features without an FPN |
| centernet | CenterNet-style keypoint detection: predict object centers as a heatmap, then regress size and offset at center locations |
| compression | Head operating on compressed backbone features, testing how aggressively features can be reduced before detection quality collapses |
| curvature | Object boundaries as high curvature of the feature manifold; detection as curvature estimation |
| depth_fusion | Joint detection and depth estimation sharing features, exploiting the alignment between depth gradients and object boundaries |
| feature_graph | Detection as a graph problem over patch tokens, with edges indicating co-occurrence or spatial adjacency |
| mutual_attention | Cross-attention between patch features to model inter-object relationships before detection |
| optimal_transport | Label assignment via optimal transport between ground-truth boxes and prediction locations |
| patch_assembly | Compositional object construction from overlapping patch detections, inspired by self-assembly tile systems |
| prototype | Replacing learned classification weights with fixed class prototypes (text-embedded or image-mean) |
| relational_corners | Corner-pair detection: predict corner heatmaps, then pair them via learned embeddings |
| scale_classify | Explicit scale classification as an auxiliary head to produce scale-consistent boxes |
| sparse_query | Sparse query mechanism to limit prediction locations and reduce background noise |
| threshold_prototype | Prototype classification with a threshold gate |
| tropical | Tropical-semiring (min-plus) operations in the head |
| wavelet | Wavelet-based multi-scale decomposition as an alternative to FPN and to cofiber decomposition |

Each is a one-file architectural commitment. Most will not beat the FCOS baseline; some may suggest directions that do. The arena script makes training any of them a one-command operation against the cached features.

FCOS Variants

Three FCOS-family heads are present as controlled variants on the baseline, each differing from the reference in one isolated way:

  • heads/baseline_fcos/ — the full simple-feature-pyramid FCOS head at 16.14M parameters, the 41.0-mAP reference.
  • heads/slim_fcos/ — a reduced-parameter variant with narrower towers and fewer FPN channels. It exists to help locate how much of the baseline's capacity is load-bearing.
  • heads/hook_fcos/ — an FCOS head that reads intermediate backbone features via forward hooks rather than only the final spatial output. This follows the pattern used by the DPT depth decoder in the multi-task Argus model, where intermediate ViT blocks (2, 5, 8, 11) carry information that the final block does not expose; this variant tests whether the same access pattern improves detection.

Arena

arena.py and multi_domain_arena.py are lightweight driver scripts that instantiate any head from the heads/ directory, train it for a configurable number of epochs on cached COCO features, and evaluate with pycocotools. The multi-domain version extends evaluation across the twenty RF100-VL cross-domain datasets in addition to COCO, to test whether a head's performance generalizes beyond the benchmark it was tuned on.

Shared Infrastructure

  • losses/ contains the FCOS-style focal loss for classification, GIoU loss for boxes, BCE for centerness, and a CenterNet-style focal-loss variant for keypoint heads.
  • utils/ contains the feature caching code (shard format, manifest, shard loader), the decoder that converts per-location predictions to COCO-format boxes, and the evaluation wrapper around pycocotools.
  • eval_coco_map.py is the standalone evaluation entry point; it can be pointed at any saved checkpoint.
  • train_split_tower.py and eval_split_tower.py reproduce the split-tower result above.
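The per-location decoding in utils/ is FCOS-style; a hedged single-level sketch (shapes, score threshold, and helper names are assumptions, and NMS is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_fcos_level(cls_logits, ltrb, centerness, stride, score_thresh=0.3):
    """Decode one pyramid level. Assumed shapes: cls_logits (C, H, W),
    ltrb (4, H, W) distances in pixels, centerness (H, W).
    Returns xyxy boxes, scores, and class labels for kept locations."""
    C, H, W = cls_logits.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cx = (xs + 0.5) * stride            # location centres in image coordinates
    cy = (ys + 0.5) * stride
    # Centerness multiplicatively down-weights off-centre predictions
    scores = sigmoid(cls_logits) * sigmoid(centerness)
    labels = scores.argmax(axis=0)
    best = scores.max(axis=0)
    keep = best > score_thresh
    # ltrb are distances from the location centre to the four box edges
    l, t, r, b = ltrb
    boxes = np.stack([cx - l, cy - t, cx + r, cy + b], axis=-1)
    return boxes[keep], best[keep], labels[keep]
```

Decoding each pyramid level this way, concatenating, and applying NMS yields COCO-format detections for pycocotools evaluation.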

Backbone

All results use EUPE-ViT-B at 640-pixel input with letterbox padding. The backbone produces 768-channel spatial features at stride 16 (40×40 patches). It has 86M parameters and is frozen; features are cached once before any head training begins. The framework is backbone-agnostic: any ViT-scale encoder producing a consistent spatial grid can be substituted by replacing the feature extraction step, and the head library will run without modification. EUPE-ViT-B is the default because its multi-teacher distillation produces features that work well across detection, segmentation, depth, and correspondence, which makes it a sensible proving ground for heads that might eventually share a backbone with other tasks.
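The extract-once, train-many pattern can be sketched as follows (an illustrative sketch only; the repository's actual shard format and loader live in utils/, and the stand-in backbone function here is hypothetical):

```python
import os
import tempfile
import numpy as np

def extract_features(image):
    # Stand-in for the frozen backbone: (3, 640, 640) -> (768, 40, 40),
    # i.e. 768-channel features on the stride-16 patch grid.
    return np.zeros((768, 40, 40), dtype=np.float32)

# Run the frozen 86M-parameter encoder exactly once over the dataset
# and persist the feature grids.
cache_path = os.path.join(tempfile.mkdtemp(), "features.npy")
images = [np.zeros((3, 640, 640), dtype=np.float32) for _ in range(4)]
np.save(cache_path, np.stack([extract_features(im) for im in images]))

# Every head-training run then iterates over the memory-mapped cache,
# never touching the backbone again.
feats = np.load(cache_path, mmap_mode="r")
assert feats.shape == (4, 768, 40, 40)
```

Swapping in another ViT-scale encoder means replacing only `extract_features`; everything downstream of the cache is unchanged.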

References

  • Tian, Shen, Chen, He. FCOS: Fully Convolutional One-Stage Object Detection. ICCV 2019.
  • Li, Mao, Girshick, He. Exploring Plain Vision Transformer Backbones for Object Detection. ECCV 2022.
  • Zhou, Wang, Krähenbühl. Objects as Points. arXiv:1904.07850, 2019.

License

Apache 2.0. Users are responsible for complying with the license terms of whatever backbone they substitute for EUPE-ViT-B.
