---
license: gpl-3.0
tags:
- human-pose-estimation
- pose-estimation
- instance-segmentation
- detection
- person-detection
- computer-vision
datasets:
- COCO
- AIC
- MPII
- OCHuman
metrics:
- mAP
pipeline_tag: keypoint-detection
library_name: bboxmaskpose
---
![image](https://cdn-uploads.huggingface.co/production/uploads/64bfa064b7375f6b84ad58e9/7wuB6cVsvjWsPf7B56TE4.png)
The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and instance segmentation into a self-improving loop by conditioning each task on the others, enhancing all three tasks simultaneously. Using segmentation masks instead of bounding boxes improves performance in crowded scenes and makes top-down methods competitive with bottom-up approaches.

Key contributions:

1. **MaskPose**: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters. Download pre-trained weights below.
2. **PMPose**: a pose estimation model conditioned on segmentation masks that also predicts a full description of each keypoint; a combination of MaskPose and ProbPose (CVPR'25).
3. **BBox-MaskPose (BMP)**: a method linking bounding boxes, segmentation masks, and poses to jointly address multi-body detection, segmentation, and pose estimation. Try the demo!
4. Fine-tuned RTMDet adapted for iterative detection (ignoring "holes"). Download pre-trained weights below.
5. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent from MMPose.
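The detect → pose → mask-out cycle above can be sketched in a few lines. This is a toy illustration only: `detect_one` and `estimate_pose` are hypothetical stand-ins for the fine-tuned RTMDet detector and MaskPose, and the "instances" are plain strings rather than image crops and masks.

```python
# Toy sketch of the BMP loop: detect one person, estimate their pose
# conditioned on their mask, mask them out, and repeat until the
# detector finds nobody new.

def detect_one(instances, masked_out):
    """Stand-in detector: return the first instance not yet masked out.

    The fine-tuned RTMDet ignores already masked-out instances.
    """
    for inst in instances:
        if inst not in masked_out:
            return inst
    return None

def estimate_pose(inst, mask):
    """Stand-in for MaskPose: pose estimation conditioned on a mask."""
    return f"pose_of_{inst}"

def bmp_loop(instances):
    """Detect, estimate pose, mask out, repeat until nothing is detected."""
    masked_out, poses = set(), []
    while (inst := detect_one(instances, masked_out)) is not None:
        poses.append(estimate_pose(inst, mask=inst))
        masked_out.add(inst)  # the next detection pass ignores this person
    return poses

print(bmp_loop(["front_person", "occluded_person"]))
# -> ['pose_of_front_person', 'pose_of_occluded_person']
```

The point of the loop is the last line of its body: once an instance is masked out, the next detection pass can surface people the first pass missed, which is why the detector is fine-tuned to ignore "holes".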
[![arXiv](https://img.shields.io/badge/arXiv-2412.01562-b31b1b?style=flat)](https://arxiv.org/abs/2412.01562)           [![GitHub repository](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/MiraPurkrabek/BBoxMaskPose)           [![Project Website](https://img.shields.io/badge/Project%20Website-blue?style=flat&logo=google-chrome&logoColor=white)](https://mirapurkrabek.github.io/BBox-Mask-Pose/)
For more details, see the [GitHub repository](https://github.com/MiraPurkrabek/BBoxMaskPose).

## 📝 Models List

1. **ViTPose-B multi-dataset**
2. **MaskPose**
3. **PMPose**
4. fine-tuned **RTMDet-L**

See details of each model below.

-----------------------------------------

## 1. ViTPose-B [multi-dataset]

- **Model type**: ViT-b backbone with multi-layer decoder
- **Input**: RGB images (192x256)
- **Output**: keypoint coordinates (48x64 heatmap per keypoint, 21 keypoints)
- **Language(s)**: not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMPose

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475)
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 210
- **Batch size**: 64
- **Learning rate**: 5e-5
- **Hardware**: 4x NVIDIA A100

**What's new?** ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-only ViTPose. The authors previously trained the model in a multi-dataset setup; this is a reproduction compatible with MMPose 2.0.

-----------------------------------------

## 2. MaskPose-1.1.0

- **Model type**: ViT-b backbone with multi-layer decoder
- **Input**: RGB images (192x256) + estimated instance segmentation
- **Output**: keypoint coordinates (48x64 heatmap per keypoint, 23 keypoints)
- **Language(s)**: not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMPose
- **Size(s)**: -S, -B, -L, -H

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475) + SAM-estimated instance masks
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 210
- **Batch size**: 64
- **Learning rate**: 5e-5
- **Hardware**: 4x NVIDIA A100

**What's new?** Compared to ViTPose, MaskPose takes an instance segmentation mask as input and is even better at distinguishing instances in multi-body scenes, with no computational overhead compared to ViTPose.

**V1.0.0 vs. V1.1.0** The previous version (v1.0.0) predicted 21 keypoints and used a different training recipe. V1.1.0 predicts 23 keypoints and uses an improved recipe with dataset balancing, which improves the numbers.

-----------------------------------------
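Both ViTPose and MaskPose output one 48x64 heatmap per keypoint; converting these to coordinates in the 192x256 input crop is essentially a per-channel argmax plus rescaling. The minimal numpy sketch below illustrates the idea; MMPose's actual codecs additionally apply sub-pixel refinement, so treat this as a conceptual decoder, not the library's implementation.

```python
import numpy as np

def decode_heatmaps(heatmaps, input_size=(192, 256)):
    """Decode a (K, 64, 48) heatmap stack into (x, y) crop coordinates.

    `input_size` is (width, height) of the model's input crop.
    Argmax-only decoding; real codecs refine the peak to sub-pixel accuracy.
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    idx = flat.argmax(axis=1)                    # peak index per keypoint
    ys, xs = np.unravel_index(idx, (H, W))
    scores = flat.max(axis=1)                    # peak value as confidence
    # rescale heatmap coordinates to input-crop coordinates
    coords = np.stack([xs * input_size[0] / W,
                       ys * input_size[1] / H], axis=1)
    return coords, scores

# toy check: a single peak at (x=10, y=20) on one 64x48 map
hm = np.zeros((1, 64, 48), dtype=np.float32)
hm[0, 20, 10] = 1.0
coords, scores = decode_heatmaps(hm)
print(coords[0])  # -> [40. 80.]  (10 * 192/48, 20 * 256/64)
```

Mapping the crop coordinates back to the full image then only requires inverting the crop-and-resize transform used when preparing the input.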
## 3. PMPose-1.0.0

- **Model type**: ViT-b backbone with multi-layer decoder
- **Input**: RGB images (192x256) + estimated instance segmentation
- **Output**: keypoint coordinates (48x64 probmap per keypoint, 23 keypoints), presence probabilities, visibilities, and expected OKS for each keypoint
- **Language(s)**: not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMPose
- **Size(s)**: -S, -B, -L, -H

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475) + SAM-estimated instance masks
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 20
- **Batch size**: 64
- **Learning rate**: 5e-5
- **Frozen backbone**
- **Hardware**: 4x NVIDIA A100

**What's new?** PMPose combines MaskPose-1.1.0 and [ProbPose (CVPR'25)](https://mirapurkrabek.github.io/ProbPose/). It is conditioned on masks and has the same superior in-crowd performance as MaskPose, while also predicting presence probabilities and visibilities like ProbPose.

-----------------------------------------

## 4. Fine-tuned RTMDet-L

- **Model type**: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
- **Input**: RGB images
- **Output**: detected instances: bbox, instance mask, and class for each
- **Language(s)**: not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMDetection

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home) with randomly masked-out instances
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 10
- **Batch size**: 16
- **Learning rate**: 2e-2
- **Hardware**: 4x NVIDIA A100

**What's new?** RTMDet fine-tuned to ignore masked-out instances is designed for iterative detection. It is especially effective in multi-body scenes where people in the background would otherwise not be detected.

## 📄 Citation

If you use our work, please cite:

```bibtex
@InProceedings{BMPv2,
  author    = {Purkrabek, Miroslav and Kolomiiets, Constantin and Matas, Jiri},
  title     = {BBoxMaskPose v2: Expanding Mutual Conditioning to 3D},
  booktitle = {arXiv preprint arXiv:to be added},
  year      = {2026}
}
```

```bibtex
@InProceedings{Purkrabek2025ICCV,
  author    = {Purkrabek, Miroslav and Matas, Jiri},
  title     = {Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025}
}
```

```bibtex
@InProceedings{Kolomiiets2026CVWW,
  author    = {Kolomiiets, Constantin and Purkrabek, Miroslav and Matas, Jiri},
  title     = {SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds},
  booktitle = {Computer Vision Winter Workshop (CVWW)},
  year      = {2026}
}
```

## 🧑‍💻 Authors

- Miroslav Purkrabek ([personal website](https://github.com/MiraPurkrabek))
- Constantin Kolomiiets
- Jiri Matas ([personal website](https://cmp.felk.cvut.cz/~matas/))