Improve model card: add pipeline tag, library, GitHub link, and usage example

#1
by nielsr (HF Staff) - opened
Files changed (1):
1. README.md (+97 −3)
README.md CHANGED
@@ -2,13 +2,107 @@
  license: mit
  tags:
  - arxiv:2507.18405
  ---

- Pre-trained Iwin Transformer models on ImageNet-1k and ImageNet-22k

- ## Citation
  ```
  @misc{huo2025iwin,
  title={Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows},
  author={Simin Huo and Ning Li},
@@ -18,4 +112,4 @@ Pre-trained Iwin Transformer models on ImageNet-1k and ImageNet-22k
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.18405},
  }
- ```
 
  license: mit
  tags:
  - arxiv:2507.18405
+ pipeline_tag: image-feature-extraction
+ library_name: transformers
  ---

+ # Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows

+ This repository contains the pre-trained Iwin Transformer models on ImageNet-1k and ImageNet-22k, presented in the paper [Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows](https://huggingface.co/papers/2507.18405).

+ **Official Code:** [https://github.com/Cominder/Iwin-Transformer](https://github.com/Cominder/Iwin-Transformer)
+
+ ## Introduction
+ Iwin Transformer (the name `Iwin` stands for **I**nterleaved **win**dow) is a novel position-embedding-free hierarchical vision transformer that can be fine-tuned directly from low to high resolution. It combines an innovative interleaved window attention with depthwise separable convolution: attention connects distant tokens while convolution links neighboring tokens, enabling global information exchange within a single module and overcoming the Swin Transformer's limitation of requiring two consecutive blocks to approximate global attention.
+
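To make the interleaving idea concrete, here is a 1-D toy illustration (my own sketch, not the official implementation; the model itself partitions 2-D feature maps). Contiguous windows group neighboring tokens, while interleaved windows sample tokens at a fixed stride, so a single attention window spans distant positions:

```python
import numpy as np

tokens = np.arange(8)                 # 8 token indices along one axis
window_size = 2
num_windows = len(tokens) // window_size

# Contiguous (Swin-style) windows: each window sees only neighboring tokens.
contiguous = tokens.reshape(num_windows, window_size)

# Interleaved windows: tokens are sampled with a stride of num_windows,
# so each window connects tokens that are far apart in the sequence.
interleaved = tokens.reshape(window_size, num_windows).T

print(contiguous.tolist())   # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(interleaved.tolist())  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Attention within an interleaved window thus mixes distant tokens, and the depthwise convolution handles the neighbors that the window skips over.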
18
+ ## Key Highlights
19
+ - **Hierarchical Design:** A position-embedding-free hierarchical architecture.
20
+ - **Interleaved Window Attention & Depthwise Separable Convolution:** Combines attention for distant tokens and convolution for neighboring tokens, enabling global information exchange within a single module.
21
+ - **Flexible Fine-tuning:** Can be fine-tuned directly from low to high resolution.
22
+ - **Strong Performance:** Exhibits strong competitiveness in tasks such as image classification, semantic segmentation, and video action recognition.
23
+ - **Modular Component:** The core component can be used as a standalone module to replace self-attention in class-conditional image generation.
24
+
25
+ ## Usage
26
+ You can use the `Iwin Transformer` with the Hugging Face `transformers` library for image feature extraction.
27
+
28
+ ```python
29
+ from transformers import AutoModel, AutoImageProcessor
30
+ from PIL import Image
31
+ import requests
32
+
33
+ # Load the model and image processor
34
+ model_name = "Cominder/iwin_base_patch4_window7_224" # Example model, choose from available checkpoints
35
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
36
+ processor = AutoImageProcessor.from_pretrained(model_name)
37
+
38
+ # Example image input
39
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
40
+ image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
41
+
42
+ # Preprocess the image and get model outputs
43
+ inputs = processor(images=image, return_tensors="pt")
44
+ outputs = model(**inputs)
45
+
46
+ # The last hidden state can be used as image features
47
+ last_hidden_state = outputs.last_hidden_state
48
+ print("Last hidden state shape:", last_hidden_state.shape)
49
+ # For classification or other tasks, you might use pooled outputs or apply further layers.
50
  ```
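The pooling step hinted at in the final comment could look like the following sketch. A dummy tensor stands in for `outputs.last_hidden_state` here; the actual token count and hidden size depend on the checkpoint and input resolution:

```python
import torch

# Stand-in for outputs.last_hidden_state of shape (batch, num_tokens, hidden_dim)
last_hidden_state = torch.randn(1, 49, 1024)

# Mean-pool over the token dimension to get one feature vector per image,
# e.g. for retrieval or as input to a linear classification head.
image_embedding = last_hidden_state.mean(dim=1)
print(image_embedding.shape)  # torch.Size([1, 1024])
```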
+
+ ## Results on ImageNet with Pretrained Models
+
+ **ImageNet-1K and ImageNet-22K Pretrained Iwin Models**
+
+ | name | pretrain | resolution | acc@1 | #params | FLOPs | 22K model | 1K model |
+ | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Iwin-T | ImageNet-1K | 224x224 | 82.0 | 30.2M | 4.7G | - | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_tiny_patch4_window7_224.pth)/[config](configs/iwin/iwin_tiny_patch4_window7_224.yaml) |
+ | Iwin-S | ImageNet-1K | 224x224 | 83.4 | 51.6M | 9.0G | - | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_small_patch4_window7_224.pth)/[config](configs/iwin/iwin_small_patch4_window7_224.yaml) |
+ | Iwin-S | ImageNet-1K | 384x384 | 84.3 | 51.6M | 27.7G | - | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_small_patch4_window12_384.pth)/[config](configs/iwin/iwin_small_patch4_window12_384_finetune.yaml) |
+ | Iwin-S | ImageNet-1K | 512x512 | 84.4 | 51.6M | 52.0G | - | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_small_patch4_window16_512.pth)/[config](configs/iwin/iwin_small_patch4_window16_512_finetune.yaml) |
+ | Iwin-S | ImageNet-1K | 1024x1024 | 83.8 | 51.6M | 207.9G | - | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_small_patch4_window16_1024.pth)/[config](configs/iwin/iwin_small_patch4_window16_1024_finetune.yaml) |
+ | Iwin-B | ImageNet-1K | 224x224 | 83.5 | 91.2M | 15.9G | - | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window7_224.pth)/[config](configs/iwin/iwin_base_patch4_window7_224.yaml) |
+ | Iwin-B | ImageNet-1K | 384x384 | 84.9 | 91.2M | 48.3G | - | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window12_384.pth)/[config](configs/iwin/iwin_base_patch4_window12_384_finetune.yaml) |
+ | Iwin-B | ImageNet-1K | 512x512 | 85.1 | 91.3M | 89.5G | - | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window16_512.pth)/[config](configs/iwin/iwin_base_patch4_window16_512_finetune.yaml) |
+ | Iwin-B | ImageNet-1K | 1024x1024 | 85.0 | 91.3M | 358.2G | - | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window16_1024.pth)/[config](configs/iwin/iwin_base_patch4_window16_1024_finetune.yaml) |
+ | Iwin-B | ImageNet-22K | 224x224 | 85.5 | 91.2M | 15.9G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window7_224_22k.pth)/[config](configs/iwin/iwin_base_patch4_window7_224_22k.yaml) | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window7_224_22kto1k.pth)/[config](configs/iwin/iwin_base_patch4_window7_224_22kto1k_finetune.yaml) |
+ | Iwin-B | ImageNet-22K | 384x384 | 86.6 | 91.2M | 48.3G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window12_384_22k.pth)/[config](configs/iwin/iwin_base_patch4_window12_384_22k.yaml) | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window12_384_22kto1k.pth)/[config](configs/iwin/iwin_base_patch4_window12_384_22kto1k_finetune.yaml) |
+ | Iwin-B | ImageNet-22K | 512x512 | 86.1 | 91.2M | 89.5G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window16_512_22k.pth)/[config](configs/iwin/iwin_base_patch4_window16_512_22k.yaml) | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window16_512_22kto1k.pth)/[config](configs/iwin/iwin_base_patch4_window16_512_22kto1k_finetune.yaml) |
+ | Iwin-B | ImageNet-22K | 1024x1024 | 85.6 | 91.2M | 358.2G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window16_1024_22k.pth)/[config](configs/iwin/iwin_base_patch4_window16_1024_22k.yaml) | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window16_1024_22kto1k.pth)/[config](configs/iwin/iwin_base_patch4_window16_1024_22kto1k_finetune.yaml) |
+ | Iwin-L | ImageNet-22K | 224x224 | 86.4 | 204.3M | 35.4G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_large_patch4_window7_224_22k.pth)/[config](configs/iwin/iwin_large_patch4_window7_224_22k.yaml) | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_large_patch4_window7_224_22kto1k.pth)/[config](configs/iwin/iwin_large_patch4_window7_224_22kto1k_finetune.yaml) |
+ | Iwin-L | ImageNet-22K | 384x384 | 87.4 | 204.3M | 106.6G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_large_patch4_window12_384_22k.pth)/[config](configs/iwin/iwin_large_patch4_window12_384_22k.yaml) | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_large_patch4_window12_384_22kto1k.pth)/[config](configs/iwin/iwin_large_patch4_window12_384_22kto1k_finetune.yaml) |
+
+ ## Results on Downstream Tasks
+
+ **COCO Object Detection (2017 val)**
+
+ | Backbone | Method | pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | model |
+ | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Iwin-T | Mask R-CNN | ImageNet-1K | 1x | 42.2 | 38.9 | 48M | 268G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_tiny_window7_mask_rcnn_1x_coco.pth) |
+ | Iwin-S | Mask R-CNN | ImageNet-1K | 1x | 43.7 | 40.0 | 69M | 358G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_small_window7_mask_rcnn_1x_coco.pth) |
+ | Iwin-T | Mask R-CNN | ImageNet-1K | 3x | 44.7 | 40.9 | 48M | 268G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_tiny_window7_mask_rcnn_3x_coco.pth) |
+ | Iwin-S | Mask R-CNN | ImageNet-1K | 3x | 45.5 | 41.0 | 69M | 358G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_small_window7_mask_rcnn_3x_coco.pth) |
+ | Iwin-T | Cascade Mask R-CNN | ImageNet-1K | 1x | 47.2 | 40.9 | 86M | 747G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_tiny_window7_cascade_mask_rcnn_1x_coco.pth) |
+ | Iwin-T | Cascade Mask R-CNN | ImageNet-1K | 3x | 49.4 | 42.9 | 86M | 747G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_tiny_window7_cascade_mask_rcnn_3x_coco.pth) |
+ | Iwin-S | Cascade Mask R-CNN | ImageNet-1K | 3x | 49.4 | 43.0 | 107M | 837G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_small_window7_cascade_mask_rcnn_3x_coco.pth) |
+
+ **ADE20K Semantic Segmentation (val)**
+
+ | Backbone | Method | pretrain | Crop Size | Lr Schd | mIoU | #params | FLOPs | model |
+ | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Iwin-T | UPerNet | ImageNet-1K | 512x512 | 160K | 44.70 | 61.9M | 946G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_tiny_patch4_window7_512_ade20k_1k.pth) |
+ | Iwin-S | UPerNet | ImageNet-1K | 512x512 | 160K | 47.50 | 83.2M | 1038G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_small_patch4_window7_512_ade20k_1k.pth) |
+ | Iwin-B | UPerNet | ImageNet-1K | 512x512 | 160K | 48.90 | 124.8M | 1189G | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_base_patch4_window7_512_ade20k_1k.pth) |
+
+ **Kinetics 400 Recognition**
+
+ | Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
+ | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Iwin-T | ImageNet-1K | 30ep | 224 | 79.1 | 93.8 | 29.8M | 74G | [config](video_recognition/configs/recognition/iwin/iwin_tiny_patch244_window77_kinetics400_1k.py) | [github](https://github.com/Cominder/Iwin-Transformer/releases/download/v1.0/iwin_tiny_patch244_window77_kinetics400_1k.pth) |
+ | Iwin-S | ImageNet-1K | 30ep | 224 | 80.0 | 94.1 | 51.1M | 140G | [config](video_recognition/configs/recognition/iwin/iwin_small_patch244_window77_kinetics400_1k.py) | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0/iwin_small_patch244_window77_kinetics400_1k.pth) |
+
+ ## Citation
+ If you find our work useful or helpful for your research, please consider citing our paper:
+ ```bibtex
  @misc{huo2025iwin,
  title={Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows},
  author={Simin Huo and Ning Li},

  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.18405},
  }
+ ```