---
license: gpl-3.0
tags:
- human-pose-estimation
- pose-estimation
- instance-segmentation
- detection
- person-detection
- computer-vision
datasets:
- COCO
- AIC
- MPII
- OCHuman
metrics:
- mAP
pipeline_tag: keypoint-detection
library_name: bboxmaskpose
---
![image](https://cdn-uploads.huggingface.co/production/uploads/64bfa064b7375f6b84ad58e9/7wuB6cVsvjWsPf7B56TE4.png)
The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and instance segmentation into a self-improving loop by conditioning each task on the others, enhancing all three tasks simultaneously. Using segmentation masks instead of bounding boxes improves performance in crowded scenes and makes top-down methods competitive with bottom-up approaches.

Key contributions:

1. **MaskPose**: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters. Download pre-trained weights below.
2. **PMPose**: a pose estimation model conditioned on segmentation masks that also predicts a full description of each keypoint; a combination of MaskPose and ProbPose (CVPR'25).
3. **BBox-MaskPose (BMP)**: a method linking bounding boxes, segmentation masks, and poses to jointly address multi-body detection, segmentation, and pose estimation. Try the demo!
4. Fine-tuned RTMDet adapted for iterative detection (ignoring "holes"). Download pre-trained weights below.
5. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent from MMPose.
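The detect → pose → mask-out cycle above can be sketched in a few lines. This is a toy illustration only: `detect_one` and `estimate_pose` are hypothetical stand-ins for the fine-tuned RTMDet detector and MaskPose, and the "instances" are plain strings rather than image crops and masks.

```python
# Toy sketch of the BMP loop: detect one person, estimate their pose
# conditioned on their mask, mask them out, and repeat until the
# detector finds nobody new.

def detect_one(instances, masked_out):
    """Stand-in detector: return the first instance not yet masked out.

    The fine-tuned RTMDet ignores already masked-out instances.
    """
    for inst in instances:
        if inst not in masked_out:
            return inst
    return None

def estimate_pose(inst, mask):
    """Stand-in for MaskPose: pose estimation conditioned on a mask."""
    return f"pose_of_{inst}"

def bmp_loop(instances):
    """Detect, estimate pose, mask out, repeat until nothing is detected."""
    masked_out, poses = set(), []
    while (inst := detect_one(instances, masked_out)) is not None:
        poses.append(estimate_pose(inst, mask=inst))
        masked_out.add(inst)  # the next detection pass ignores this person
    return poses

print(bmp_loop(["front_person", "occluded_person"]))
# -> ['pose_of_front_person', 'pose_of_occluded_person']
```

The point of the loop is the last line of its body: once an instance is masked out, the next detection pass can surface people the first pass missed, which is why the detector is fine-tuned to ignore "holes".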
[![arXiv](https://img.shields.io/badge/arXiv-2412.01562-b31b1b?style=flat)](https://arxiv.org/abs/2412.01562)           [![GitHub repository](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/MiraPurkrabek/BBoxMaskPose)           [![Project Website](https://img.shields.io/badge/Project%20Website-blue?style=flat&logo=google-chrome&logoColor=white)](https://mirapurkrabek.github.io/BBox-Mask-Pose/)
For more details, see the [GitHub repository](https://github.com/MiraPurkrabek/BBoxMaskPose).

## 📝 Models List

1. **ViTPose-B multi-dataset**
2. **MaskPose**
3. **PMPose**
4. fine-tuned **RTMDet-L**

See details of each model below.

-----------------------------------------

## 1. ViTPose-B [multi-dataset]

- **Model type**: ViT-b backbone with multi-layer decoder
- **Input**: RGB images (192x256)
- **Output**: keypoint coordinates (48x64 heatmap per keypoint, 21 keypoints)
- **Language(s)**: not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMPose

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475)
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 210
- **Batch size**: 64
- **Learning rate**: 5e-5
- **Hardware**: 4x NVIDIA A100

**What's new?** ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-only ViTPose. The authors previously trained the model in a multi-dataset setup; this is a reproduction compatible with MMPose 2.0.

-----------------------------------------

## 2. MaskPose-1.1.0

- **Model type**: ViT-b backbone with multi-layer decoder
- **Input**: RGB images (192x256) + estimated instance segmentation
- **Output**: keypoint coordinates (48x64 heatmap per keypoint, 23 keypoints)
- **Language(s)**: not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMPose
- **Size(s)**: -S, -B, -L, -H

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475) + SAM-estimated instance masks
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 210
- **Batch size**: 64
- **Learning rate**: 5e-5
- **Hardware**: 4x NVIDIA A100

**What's new?** Compared to ViTPose, MaskPose takes an instance segmentation mask as input and is even better at distinguishing instances in multi-body scenes, with no computational overhead compared to ViTPose.

**V1.0.0 vs. V1.1.0** The previous version (v1.0.0) predicted 21 keypoints and used a different training recipe. V1.1.0 predicts 23 keypoints and uses an improved recipe with dataset balancing, which improves the numbers.

-----------------------------------------
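Both ViTPose and MaskPose output one 48x64 heatmap per keypoint; converting these to coordinates in the 192x256 input crop is essentially a per-channel argmax plus rescaling. The minimal numpy sketch below illustrates the idea; MMPose's actual codecs additionally apply sub-pixel refinement, so treat this as a conceptual decoder, not the library's implementation.

```python
import numpy as np

def decode_heatmaps(heatmaps, input_size=(192, 256)):
    """Decode a (K, 64, 48) heatmap stack into (x, y) crop coordinates.

    `input_size` is (width, height) of the model's input crop.
    Argmax-only decoding; real codecs refine the peak to sub-pixel accuracy.
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    idx = flat.argmax(axis=1)                    # peak index per keypoint
    ys, xs = np.unravel_index(idx, (H, W))
    scores = flat.max(axis=1)                    # peak value as confidence
    # rescale heatmap coordinates to input-crop coordinates
    coords = np.stack([xs * input_size[0] / W,
                       ys * input_size[1] / H], axis=1)
    return coords, scores

# toy check: a single peak at (x=10, y=20) on one 64x48 map
hm = np.zeros((1, 64, 48), dtype=np.float32)
hm[0, 20, 10] = 1.0
coords, scores = decode_heatmaps(hm)
print(coords[0])  # -> [40. 80.]  (10 * 192/48, 20 * 256/64)
```

Mapping the crop coordinates back to the full image then only requires inverting the crop-and-resize transform used when preparing the input.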
## 3. PMPose-1.0.0

- **Model type**: ViT-b backbone with multi-layer decoder
- **Input**: RGB images (192x256) + estimated instance segmentation
- **Output**: keypoint coordinates (48x64 probmap per keypoint, 23 keypoints), presence probabilities, visibilities, and expected OKS for each keypoint
- **Language(s)**: not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMPose
- **Size(s)**: -S, -B, -L, -H

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home), [MPII Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/software-and-datasets/mpii-human-pose-dataset), [AIC Dataset](https://arxiv.org/abs/1711.06475) + SAM-estimated instance masks
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 20
- **Batch size**: 64
- **Learning rate**: 5e-5
- **Frozen backbone**
- **Hardware**: 4x NVIDIA A100

**What's new?** PMPose combines MaskPose-1.1.0 and [ProbPose (CVPR'25)](https://mirapurkrabek.github.io/ProbPose/). It is conditioned on masks and has the same superior in-crowd performance as MaskPose, while also predicting presence probabilities and visibilities like ProbPose.

-----------------------------------------

## 4. Fine-tuned RTMDet-L

- **Model type**: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
- **Input**: RGB images
- **Output**: detected instances: bbox, instance mask, and class for each
- **Language(s)**: not language-dependent (vision model)
- **License**: GPL-3.0
- **Framework**: MMDetection

#### Training Details

- **Training data**: [COCO Dataset](https://cocodataset.org/#home) with randomly masked-out instances
- **Training script**: [GitHub - BBoxMaskPose_code](https://github.com/MiraPurkrabek/BBoxMaskPose)
- **Epochs**: 10
- **Batch size**: 16
- **Learning rate**: 2e-2
- **Hardware**: 4x NVIDIA A100

**What's new?** RTMDet fine-tuned to ignore masked-out instances is designed for iterative detection. It is especially effective in multi-body scenes where people in the background would otherwise not be detected.

## 📄 Citation

If you use our work, please cite:

```bibtex
@InProceedings{BMPv2,
  author    = {Purkrabek, Miroslav and Kolomiiets, Constantin and Matas, Jiri},
  title     = {BBoxMaskPose v2: Expanding Mutual Conditioning to 3D},
  booktitle = {arXiv preprint arXiv:to be added},
  year      = {2026}
}
```

```bibtex
@InProceedings{Purkrabek2025ICCV,
  author    = {Purkrabek, Miroslav and Matas, Jiri},
  title     = {Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025}
}
```

```bibtex
@InProceedings{Kolomiiets2026CVWW,
  author    = {Kolomiiets, Constantin and Purkrabek, Miroslav and Matas, Jiri},
  title     = {SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds},
  booktitle = {Computer Vision Winter Workshop (CVWW)},
  year      = {2026}
}
```

## 🧑‍💻 Authors

- Miroslav Purkrabek ([personal website](https://github.com/MiraPurkrabek))
- Constantin Kolomiiets
- Jiri Matas ([personal website](https://cmp.felk.cvut.cz/~matas/))