merve PRO
-
Running • 25
YOLO26
Process images with advanced object detection and segmentation
-
Running • Featured • 61
YOLO26 WebGPU
Real-time object detection & pose estimation in your browser
-
onnx-community/yolo26x-ONNX
Updated • 680 • 5
openvision/yoloe26-n-seg
Zero-Shot Object Detection • Updated • 177 • 2
-
Wuli-art/Qwen-Image-2512-Turbo-LoRA
Text-to-Image • Updated • 9.35k • 200
miromind-ai/MiroThinker-v1.5-235B
Text Generation • 235B • Updated • 837 • 248
prithivMLmods/Qwen-Image-Edit-2511-Object-Remover
Image-to-Image • Updated • 11.5k • 49
tencent/Youtu-LLM-2B-Base
Text Generation • Updated • 1.38k • 41
-
facebook/sam3
Mask Generation • 0.9B • Updated • 1.8M • 1.55k
Sleeping • Featured • 106
SAM3 Video Segmentation
Track and label objects in videos using text prompts or clicks
-
onnx-community/sam3-tracker-ONNX
Mask Generation • Updated • 2.95k • 28
Running • 23
SAM3 Tracker WebGPU
Segment and extract parts from images by clicking
-
Kwaipilot/KAT-Dev-72B-Exp
Text Generation • 73B • Updated • 34 • 159
LiquidAI/LFM2-8B-A1B
Text Generation • 8B • Updated • 14.6k • 299
yanolja/YanoljaNEXT-Rosetta-12B-2510
Translation • 12B • Updated • 363 • 30
NeuML/colbert-muvera-femto
Sentence Similarity • 243k • Updated • 1 • 20
-
bytedance-research/HuMo
Image-to-Video • Updated • 149 • 212
facebook/MobileLLM-R1-950M
Text Generation • 0.9B • Updated • 317 • 280
tencent/POINTS-Reader
Image-Text-to-Text • 4B • Updated • 477k • 100
baidu/ERNIE-4.5-21B-A3B-Thinking
Text Generation • 22B • Updated • 636 • 772
-
microsoft/VibeVoice-1.5B
Text-to-Speech • 3B • Updated • 228k • 2.21k
OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview
Image-Text-to-Text • 0.4B • Updated • 36.8k • 82
apple/FastVLM-1.5B
Text Generation • 2B • Updated • 1.37k • 75
stepfun-ai/Step-Audio-2-mini
Any-to-Any • 8B • Updated • 1.96k • 247
-
openai/gpt-oss-120b
Text Generation • 120B • Updated • 3.33M • 4.47k
openai/gpt-oss-20b
Text Generation • 22B • Updated • 5.68M • 4.35k
openai/BrowseCompLongContext
Viewer • Updated • 295 • 596 • 46
baichuan-inc/Baichuan-M2-32B
Text Generation • 33B • Updated • 119k • 118
-
Wan-AI/Wan2.2-I2V-A14B
Image-to-Video • Updated • 12.1k • 604
allenai/olmOCR-7B-0725
Image-Text-to-Text • 8B • Updated • 432 • 63
Wan-AI/Wan2.2-T2V-A14B
Text-to-Video • Updated • 3.58k • 419
Qwen/Qwen3-235B-A22B-Thinking-2507
Text Generation • 235B • Updated • 38.6k • 397
-
HuggingFaceTB/SmolLM3-3B
Text Generation • 3B • Updated • 71k • 896
moonshotai/Kimi-K2-Instruct
Text Generation • 1T • Updated • 277k • 2.32k
fal/Realism-Detailer-Kontext-Dev-LoRA
Image-to-Image • Updated • 124 • 53
Alibaba-NLP/WebSailor-3B
3B • Updated • 46 • 74
-
nari-labs/Dia-1.6B-0626
Text-to-Speech • 2B • Updated • 25.7k • 124
google/gemma-3n-E4B-it
Image-Text-to-Text • 8B • Updated • 140k • 865
ByteDance/XVerse
Text-to-Image • Updated • 56 • 89
nvidia/llama-nemoretriever-colembed-3b-v1
Visual Document Retrieval • 4B • Updated • 681 • 74
-
opendatalab/OmniDocBench
Viewer • Updated • 1.36k • 10.9k • 68
nanonets/Nanonets-OCR-s
Image-Text-to-Text • 4B • Updated • 29.5k • 1.58k
echo840/MonkeyOCR
Image-Text-to-Text • Updated • 244 • 514
Running on Zero • MCP • Featured • 140
Multimodal OCR2
nanonets ocr / smoldocling / monkey ocr / typhoon ocr
-
ByteDance-Seed/BAGEL-7B-MoT
Any-to-Any • 15B • Updated • 594 • 1.18k
mistralai/Devstral-Small-2505
24B • Updated • 56.2k • 861
ByteDance/Dolphin
Image-Text-to-Text • 0.4B • Updated • 3.29k • 513
moondream/moondream-2b-2025-04-14-4bit
Image-Text-to-Text • 1B • Updated • 4.46k • 62
-
moonshotai/Kimi-VL-A3B-Thinking
Image-Text-to-Text • 16B • Updated • 80.9k • 445
agentica-org/DeepCoder-14B-Preview
Text Generation • 15B • Updated • 323 • 680
HiDream-ai/HiDream-I1-Full
Text-to-Image • Updated • 25.1k • 986
OpenGVLab/InternVL3-78B
Image-Text-to-Text • 78B • Updated • 837k • 231
-
docling-project/SmolDocling-256M-preview
Image-Text-to-Text • 0.3B • Updated • 40.6k • 1.61k
sesame/csm-1b
Text-to-Speech • Updated • 103k • 2.33k
mistralai/Mistral-Small-3.1-24B-Instruct-2503
24B • Updated • 97.7k • 1.35k
tencent/Hunyuan3D-2mini
Image-to-3D • Updated • 4.66k • 114
-
OpenGVLab/InternVideo2_5_Chat_8B
Video-Text-to-Text • 8B • Updated • 2.3k • 88
AIDC-AI/Ovis2-34B
Image-Text-to-Text • 35B • Updated • 1.07k • 152
open-r1/OpenR1-Qwen-7B
Text Generation • 8B • Updated • 20 • 54
nomic-ai/nomic-embed-text-v2-moe
Sentence Similarity • 0.5B • Updated • 1.22M • 452
-
allenai/Llama-3.1-Tulu-3-405B
Text Generation • 406B • Updated • 43 • 110
Qwen/Qwen2.5-VL-72B-Instruct
Image-Text-to-Text • 73B • Updated • 144k • 592
mistralai/Mistral-Small-24B-Instruct-2501
24B • Updated • 749k • 953
deepseek-ai/Janus-Pro-7B
Any-to-Any • Updated • 19.3k • 3.56k
-
ostris/Flex.1-alpha
Text-to-Image • Updated • 515 • 481
Qwen/Qwen2.5-Math-PRM-72B
Text Classification • 73B • Updated • 252 • 72
HuggingFaceTB/SmolVLM-500M-Instruct
Image-Text-to-Text • 0.5B • Updated • 14.7k • 186
deepseek-ai/DeepSeek-R1
Text Generation • 685B • Updated • 511k • 13k
-
HuggingFaceTB/SmolVLM-Instruct
Image-Text-to-Text • 2B • Updated • 34.7k • 577
Qwen/QwQ-32B-Preview
Text Generation • 33B • Updated • 6.64k • 1.74k
nvidia/Hymba-1.5B-Base
Text Generation • 2B • Updated • 446 • 157
vidore/colsmolvlm-v0.1
Visual Document Retrieval • Updated • 4 • 55
-
microsoft/LLM2CLIP-EVA02-L-14-336
Zero-Shot Image Classification • Updated • 47 • 60
microsoft/LLM2CLIP-EVA02-B-16
Updated • 52 • 10
PleIAs/common_corpus
Viewer • Updated • 517M • 33.3k • 341
Qwen/Qwen2.5-Coder-32B-Instruct
Text Generation • 33B • Updated • 682k • 1.99k
-
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 74
BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 19
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 48
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 121
-
Runtime error • Featured • 100
LOTUS Normal
Generate high-quality predictions from images
-
Runtime error • 78
LOTUS Depth
Generate depth maps from images and videos
-
jingheya/lotus-depth-g-v1-0
Depth Estimation • Updated • 8.31k • 27
jingheya/lotus-depth-d-v1-0
Depth Estimation • Updated • 259 • 5
-
facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 560k • 102
google/flan-t5-xl
3B • Updated • 171k • 527
google/siglip-large-patch16-384
Zero-Shot Image Classification • 0.7B • Updated • 14.6k • 11
google/vit-huge-patch14-224-in21k
Image Feature Extraction • 0.6B • Updated • 75.2k • 22
-
facebook/deit-base-distilled-patch16-384
Image Classification • 87.6M • Updated • 57.6k • 7
facebook/convnextv2-base-1k-224
Image Classification • 88.7M • Updated • 1.47k • 4
facebook/deit-base-distilled-patch16-224
Image Classification • Updated • 5.91k • 32
google/vit-base-patch32-384
Image Classification • 88.3M • Updated • 6.85k • 23
-
facebook/maskformer-swin-large-coco
Image Segmentation • 0.2B • Updated • 667 • 27
nvidia/segformer-b0-finetuned-ade-512-512
Image Segmentation • 3.75M • Updated • 552k • 179
facebook/detr-resnet-50-dc5-panoptic
Image Segmentation • 43M • Updated • 21 • 3
nvidia/segformer-b5-finetuned-cityscapes-1024-1024
Image Segmentation • Updated • 99k • 38
-
timbrooks/instruct-pix2pix
Image-to-Image • Updated • 105k • 1.17k
TencentARC/t2i-adapter-canny-sdxl-1.0
Image-to-Image • Updated • 3.56k • 52
TencentARC/t2i-adapter-sketch-sdxl-1.0
Image-to-Image • Updated • 4.77k • 75
CrucibleAI/ControlNetMediaPipeFace
Image-to-Image • Updated • 994 • 575
-
Salesforce/blip-image-captioning-large
Image-to-Text • 0.5B • Updated • 683k • 1.45k
Salesforce/blip-image-captioning-base
Image-to-Text • Updated • 2.3M • 842
microsoft/trocr-base-handwritten
Image-to-Text • 0.3B • Updated • 229k • 476
microsoft/git-large-coco
Image-to-Text • 0.4B • Updated • 3.16k • 104
-
Running • 114
Grounding DINO Demo
Cutting edge open-vocabulary object detection app
-
Running • Featured • 96
Owlv2
State-of-the-art Zero-shot Object Detection
-
Runtime error • Featured • 41
BLIP2 with transformers
BLIP2 (cutting edge image captioning) in 🤗 transformers
-
Build error • Featured • 378
IDEFICS Playground
-
Running • Featured • 96
Owlv2
State-of-the-art Zero-shot Object Detection
-
Runtime error • Featured • 64
Owl Tracking
Powerful foundation model for zero-shot object tracking
-
Running • 26
Search and Detect (CLIP/OWL-ViT)
Search and detect objects in images using text queries
-
Running on Zero • Featured • 109
OWLSAM
State-of-the-art open-vocabulary image segmentation ⚡
-
Runtime error • Featured • 84
UDOP
Generate text from document images
-
Runtime error • 40
Pix2struct
Play with all the pix2struct variants in this d
-
Sleeping • 26
Compare Docvqa Models
Compare different visual question answering
-
Runtime error • Featured • 289
DocQuery - Document Query Engine
-
Improved Baselines with Visual Instruction Tuning
Paper • 2310.03744 • Published • 39
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 49
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 11
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Paper • 2404.01331 • Published • 27
-
google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 102k • 145
google/owlvit-base-patch16
Zero-Shot Object Detection • Updated • 11.1k • 13
google/owlvit-large-patch14
Zero-Shot Object Detection • Updated • 14.6k • 29
google/owlv2-base-patch16
Zero-Shot Object Detection • 0.2B • Updated • 132k • 29
-
depth-anything/Depth-Anything-V2-Small
Depth Estimation • Updated • 9.96k • 76
depth-anything/Depth-Anything-V2-Large
Depth Estimation • Updated • 89.3k • 150
Running on Zero • 627
Depth Anything V2
Generate depth map from any photo
-
depth-anything/DA-2K
Viewer • Updated • 1.04k • 322 • 16
-
Running • 196
Vidore Leaderboard
Compare and rank visual document retrieval models across different benchmarks
-
Running on CPU Upgrade • 988
Open VLM Leaderboard
VLMEvalKit Evaluation Results Collection
-
Running • Featured • 560
Vision Arena (Testing VLMs side-by-side)
Explore AI vision models by uploading an image
-
Running • Featured • 85
SEED-Bench Leaderboard
Submit model evaluation results to leaderboard
-
vidore/colpali-v1.2
Visual Document Retrieval • Updated • 24.3k • 113
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • 8B • Updated • 1.7M • 1.26k
Qwen/Qwen2-VL-2B-Instruct
Image-Text-to-Text • 2B • Updated • 2.02M • 487
Qwen/Qwen2-72B-Instruct
Text Generation • 73B • Updated • 30.3k • 719
-
robbyant/lingbot-world-base-cam
Image-to-Video • Updated • 307
nvidia/C-RADIOv4-H
Feature Extraction • 0.7B • Updated • 7.42k • 47
deepseek-ai/DeepSeek-OCR-2
Image-Text-to-Text • 3B • Updated • 888k • 742
arcee-ai/Trinity-Large-Base
Text Generation • 399B • Updated • 503 • 50
-
google/translategemma-27b-it
Image-Text-to-Text • 29B • Updated • 40.1k • 302
kakaocorp/kanana-2-30b-a3b-mid-2601
Text Generation • 31B • Updated • 130 • 30
black-forest-labs/FLUX.2-klein-base-4B
Image-to-Image • Updated • 56.3k • 83
google/translategemma-12b-it
Image-Text-to-Text • 13B • Updated • 252k • 248
-
LiquidAI/LFM2.5-VL-1.6B
Image-Text-to-Text • 2B • Updated • 94.7k • 217
openbmb/AgentCPM-Explore
Text Generation • 4B • Updated • 3.65k • 408
Phr00t/LTX2-Rapid-Merges
Image-Text-to-Video • Updated • 298
LiquidAI/LFM2.5-1.2B-Base
Text Generation • 1B • Updated • 12.4k • 111
-
PekingU/rtdetr_v2_r50vd
Object Detection • 43M • Updated • 16.6k • 26
ustc-community/dfine-xlarge-obj365
Object Detection • 63.4M • Updated • 1.07k • 4
PekingU/rtdetr_v2_r101vd
Object Detection • 76.8M • Updated • 5.44k • 13
Running on T4 • 119
RF-DETR
SOTA real-time object detection model
-
facebook/metaclip-2-worldwide-s16
Zero-Shot Image Classification • 0.4B • Updated • 52 • 8
facebook/metaclip-2-worldwide-m16
Zero-Shot Image Classification • 0.5B • Updated • 58 • 3
facebook/metaclip-2-worldwide-l14
Zero-Shot Image Classification • 1B • Updated • 149k • 12
facebook/metaclip-2-worldwide-b32
Zero-Shot Image Classification • 0.6B • Updated • 138 • 6
-
ibm-granite/granite-docling-258M
Image-Text-to-Text • 0.3B • Updated • 205k • 1.12k
XiaomiMiMo/MiMo-Audio-7B-Base
Any-to-Any • 8B • Updated • 41 • 46
decart-ai/Lucy-Edit-Dev
Video-to-Video • Updated • 505 • 326
OpenGVLab/ScaleCUA-3B
Image-Text-to-Text • 4B • Updated • 258 • 11
-
openbmb/MiniCPM4.1-8B
Text Generation • 8B • Updated • 18.4k • 385
tencent/Hunyuan-MT-7B
Translation • 8B • Updated • 12.4k • 549
google/embeddinggemma-300m
Sentence Similarity • 0.3B • Updated • 1.28M • 1.46k
moonshotai/Kimi-K2-Instruct-0905
Text Generation • 1T • Updated • 13.5k • 670
-
stepfun-ai/step3
Image-Text-to-Text • 321B • Updated • 34.4k • 166
nunchaku-ai/nunchaku-flux.1-krea-dev
Text-to-Image • Updated • 6.62k • 119
fdtn-ai/Foundation-Sec-8B-Instruct
Text Generation • 8B • Updated • 5.69k • 67
Wan-AI/Wan2.2-TI2V-5B-Diffusers
Text-to-Video • Updated • 33.5k • 108
-
nvidia/OpenReasoning-Nemotron-32B
Text Generation • 33B • Updated • 729 • 122
ByteDance-Seed/Seed-X-RM-7B
Translation • Updated • 118 • 30
LGAI-EXAONE/EXAONE-4.0-32B
Text Generation • 32B • Updated • 13.1k • 277
vidore/colqwen-omni-v0.1
Visual Document Retrieval • Updated • 4.15k • 93
-
Qwen/WorldPM-72B
Text Classification • 73B • Updated • 72 • 81
Running on Zero • MCP • Featured • 1.48k
LTX Video Fast
ultra-fast video model, LTX 0.9.8 13B distilled
-
BLIP3o/BLIP3o-Pretrain-Long-Caption
Viewer • Updated • 27.2M • 32.4k • 57
BLIP3o/BLIP3o-Model-8B
14B • Updated • 480 • 101
-
OpenGVLab/InternVL3-1B-hf
Image-Text-to-Text • 0.9B • Updated • 111k • 10
OpenGVLab/InternVL3-2B-hf
Image-Text-to-Text • 2B • Updated • 8.24k • 3
OpenGVLab/InternVL3-8B-hf
Image-Text-to-Text • 8B • Updated • 13.4k • 9
OpenGVLab/InternVL3-14B-hf
Image-Text-to-Text • 15B • Updated • 3.9k
-
deepseek-ai/DeepSeek-V3-0324
Text Generation • 685B • Updated • 234k • 3.09k
Qwen/Qwen2.5-Omni-7B
Any-to-Any • 11B • Updated • 285k • 1.86k
google/txgemma-27b-chat
Text Generation • 27B • Updated • 1.01k • 58
Running • Featured • 366
Qwen2.5 Omni 7B Demo
Chat with an AI using text, audio, image, or video and hear responses
-
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • 8B • Updated • 1.7M • 1.26k
Qwen/Qwen2-VL-2B-Instruct
Image-Text-to-Text • 2B • Updated • 2.02M • 487
CohereLabs/aya-vision-8b
Image-Text-to-Text • 9B • Updated • 46.2k • 316
CohereLabs/aya-vision-32b
Image-Text-to-Text • 33B • Updated • 168 • 220
-
Running on Zero • 266
Qwen2-VL-7B
Answer questions about any uploaded image
-
Running • 67
UI-TARS
Find click coordinates on images based on instructions
-
Running • 98
Qwen2.5-1M Demo
Ask questions about uploaded documents with 1M-token AI
-
Qwen/Qwen2.5-14B-Instruct-1M
Text Generation • 15B • Updated • 10.6k • 333
-
meta-llama/Llama-3.3-70B-Instruct
Text Generation • 71B • Updated • 920k • 2.65k
Qwen/Qwen2-VL-72B
Image-Text-to-Text • 73B • Updated • 64 • 80
google/paligemma2-3b-pt-224
Image-Text-to-Text • 3B • Updated • 30.3k • 162
tencent/HunyuanVideo
Text-to-Video • Updated • 1.09k • 2.12k
-
ibm-granite/granite-3.0-8b-instruct
Text Generation • 8B • Updated • 17.1k • 205
ibm-granite/granite-3.0-2b-instruct
Text Generation • 3B • Updated • 4.28k • 47
CohereLabs/aya-expanse-8b
Text Generation • 8B • Updated • 89.8k • 421
CohereLabs/aya-expanse-32b
Text Generation • 32B • Updated • 5.04k • 288
-
Running on Zero • Featured • 198
DepthCrafter
a super consistent video depth model
-
Paused • Featured • 223
Depth Pro
Generate an inverse depth map from an image
-
Runtime error • 78
LOTUS Depth
Generate depth maps from images and videos
-
apple/DepthPro
Depth Estimation • Updated • 4.01k • 498
-
microsoft/resnet-50
Image Classification • 25.6M • Updated • 173k • 477
google/vit-base-patch16-224-in21k
Image Feature Extraction • 86.4M • Updated • 1.07M • 393
google/vit-base-patch32-224-in21k
Image Feature Extraction • 88M • Updated • 6.42k • 19
facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 560k • 102
-
facebook/detr-resnet-50
Object Detection • 41.6M • Updated • 511k • 933
facebook/detr-resnet-101-dc5
Object Detection • 60.7M • Updated • 1.67k • 19
facebook/detr-resnet-50-dc5
Object Detection • 41.6M • Updated • 1.74k • 6
google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 102k • 145
-
openai/clip-vit-large-patch14
Zero-Shot Image Classification • 0.4B • Updated • 7.88M • 1.96k
openai/clip-vit-base-patch32
Zero-Shot Image Classification • Updated • 17.2M • 856
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Zero-Shot Image Classification • Updated • 55.3k • 306
kakaobrain/align-base
Zero-Shot Image Classification • Updated • 11.6k • 30
-
microsoft/xclip-base-patch32
Video Classification • 0.2B • Updated • 164k • 108
facebook/timesformer-base-finetuned-k400
Video Classification • Updated • 31.5k • 42
facebook/timesformer-base-finetuned-k600
Video Classification • Updated • 576 • 12
google/vivit-b-16x2
Video Classification • Updated • 1.61k • 11
-
stabilityai/stable-diffusion-xl-base-1.0
Text-to-Image • Updated • 2.05M • 7.43k
warp-ai/wuerstchen
Text-to-Image • Updated • 146 • 176
Deci/DeciDiffusion-v1-0
Text-to-Image • Updated • 29 • 140
stabilityai/stable-diffusion-xl-refiner-1.0
Image-to-Image • Updated • 336k • 2.02k
-
Running on Zero • Featured • 72
Draw To Search Art
Draw/upload image and search among WikiART using SigLIP
-
Running on CPU Upgrade • 23
Compare Clip Siglip
Compare strong zero-shot image classification models
-
Running on Zero • 13
Multilingual Zero Shot Image Clf
Comparing powerful multilingual zero-shot image clf models
-
BAAI/bunny-phi-2-siglip-lora
Text Generation • Updated • 259 • 48
-
google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 102k • 145
google/owlvit-base-patch16
Zero-Shot Object Detection • Updated • 11.1k • 13
google/owlvit-large-patch14
Zero-Shot Object Detection • Updated • 14.6k • 29
google/owlv2-base-patch16
Zero-Shot Object Detection • 0.2B • Updated • 132k • 29
-
google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 102k • 145
google/owlvit-base-patch16
Zero-Shot Object Detection • Updated • 11.1k • 13
google/owlvit-large-patch14
Zero-Shot Object Detection • Updated • 14.6k • 29
google/owlv2-base-patch16
Zero-Shot Object Detection • 0.2B • Updated • 132k • 29
-
Paused • 21
Video Llava
Generate descriptions by uploading images or videos
-
llava-hf/LLaVA-NeXT-Video-7B-hf
Video-Text-to-Text • 7B • Updated • 84.5k • 122
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf
Video-Text-to-Text • 7B • Updated • 1.46k • 11
llava-hf/LLaVA-NeXT-Video-7B-32K-hf
Image-Text-to-Text • 8B • Updated • 449 • 8
-
NVEagle/Eagle-X5-13B
Image-Text-to-Text • 15B • Updated • 5 • 15
NVEagle/Eagle-X5-13B-Chat
Image-Text-to-Text • 15B • Updated • 4 • 28
NVEagle/Eagle-X5-7B
Image-Text-to-Text • 9B • Updated • 26 • 26
Runtime error • 64
Eagle X5 13B Chat
Combine text and images to generate responses
-
robbyant/lingbot-world-base-cam
Image-to-Video β’ Updated β’ 307 -
nvidia/C-RADIOv4-H
Feature Extraction β’ 0.7B β’ Updated β’ 7.42k β’ 47 -
deepseek-ai/DeepSeek-OCR-2
Image-Text-to-Text β’ 3B β’ Updated β’ 888k β’ 742 -
arcee-ai/Trinity-Large-Base
Text Generation β’ 399B β’ Updated β’ 503 β’ 50
-
google/translategemma-27b-it
Image-Text-to-Text β’ 29B β’ Updated β’ 40.1k β’ 302 -
kakaocorp/kanana-2-30b-a3b-mid-2601
Text Generation β’ 31B β’ Updated β’ 130 β’ 30 -
black-forest-labs/FLUX.2-klein-base-4B
Image-to-Image β’ Updated β’ 56.3k β’ β’ 83 -
google/translategemma-12b-it
Image-Text-to-Text β’ 13B β’ Updated β’ 252k β’ 248
-
Running25
YOLO26
π25Process images with advanced object detection and segmentation
-
RunningFeatured61
YOLO26 WebGPU
π61Real-time object detection & pose estimation in your browser
-
onnx-community/yolo26x-ONNX
Updated β’ 680 β’ 5 -
openvision/yoloe26-n-seg
Zero-Shot Object Detection β’ Updated β’ 177 β’ 2
-
LiquidAI/LFM2.5-VL-1.6B
Image-Text-to-Text β’ 2B β’ Updated β’ 94.7k β’ 217 -
openbmb/AgentCPM-Explore
Text Generation β’ 4B β’ Updated β’ 3.65k β’ 408 -
Phr00t/LTX2-Rapid-Merges
Image-Text-to-Video β’ Updated β’ 298 -
LiquidAI/LFM2.5-1.2B-Base
Text Generation β’ 1B β’ Updated β’ 12.4k β’ 111
-
Wuli-art/Qwen-Image-2512-Turbo-LoRA
Text-to-Image β’ Updated β’ 9.35k β’ 200 -
miromind-ai/MiroThinker-v1.5-235B
Text Generation β’ 235B β’ Updated β’ 837 β’ 248 -
prithivMLmods/Qwen-Image-Edit-2511-Object-Remover
Image-to-Image β’ Updated β’ 11.5k β’ β’ 49 -
tencent/Youtu-LLM-2B-Base
Text Generation β’ Updated β’ 1.38k β’ 41
-
PekingU/rtdetr_v2_r50vd
Object Detection β’ 43M β’ Updated β’ 16.6k β’ 26 -
ustc-community/dfine-xlarge-obj365
Object Detection β’ 63.4M β’ Updated β’ 1.07k β’ 4 -
PekingU/rtdetr_v2_r101vd
Object Detection β’ 76.8M β’ Updated β’ 5.44k β’ 13 -
Running on T4119
RF-DETR
π₯119SOTA real-time object detection model
-
facebook/sam3
Mask Generation β’ 0.9B β’ Updated β’ 1.8M β’ 1.55k -
SleepingFeatured106
SAM3 Video Segmentation
π106Track and label objects in videos using text prompts or clicks
-
onnx-community/sam3-tracker-ONNX
Mask Generation β’ Updated β’ 2.95k β’ 28 -
Running23
SAM3 Tracker WebGPU
π―23Segment and extract parts from images by clicking
-
facebook/metaclip-2-worldwide-s16
Zero-Shot Image Classification β’ 0.4B β’ Updated β’ 52 β’ 8 -
facebook/metaclip-2-worldwide-m16
Zero-Shot Image Classification β’ 0.5B β’ Updated β’ 58 β’ 3 -
facebook/metaclip-2-worldwide-l14
Zero-Shot Image Classification β’ 1B β’ Updated β’ 149k β’ 12 -
facebook/metaclip-2-worldwide-b32
Zero-Shot Image Classification β’ 0.6B β’ Updated β’ 138 β’ 6
-
Kwaipilot/KAT-Dev-72B-Exp
Text Generation β’ 73B β’ Updated β’ 34 β’ 159 -
LiquidAI/LFM2-8B-A1B
Text Generation β’ 8B β’ Updated β’ 14.6k β’ 299 -
yanolja/YanoljaNEXT-Rosetta-12B-2510
Translation β’ 12B β’ Updated β’ 363 β’ 30 -
NeuML/colbert-muvera-femto
Sentence Similarity β’ 243k β’ Updated β’ 1 β’ 20
-
ibm-granite/granite-docling-258M
Image-Text-to-Text β’ 0.3B β’ Updated β’ 205k β’ 1.12k -
XiaomiMiMo/MiMo-Audio-7B-Base
Any-to-Any β’ 8B β’ Updated β’ 41 β’ 46 -
decart-ai/Lucy-Edit-Dev
Video-to-Video β’ Updated β’ 505 β’ 326 -
OpenGVLab/ScaleCUA-3B
Image-Text-to-Text β’ 4B β’ Updated β’ 258 β’ 11
-
bytedance-research/HuMo
Image-to-Video β’ Updated β’ 149 β’ 212 -
facebook/MobileLLM-R1-950M
Text Generation β’ 0.9B β’ Updated β’ 317 β’ 280 -
tencent/POINTS-Reader
Image-Text-to-Text β’ 4B β’ Updated β’ 477k β’ 100 -
baidu/ERNIE-4.5-21B-A3B-Thinking
Text Generation β’ 22B β’ Updated β’ 636 β’ β’ 772
-
openbmb/MiniCPM4.1-8B
Text Generation β’ 8B β’ Updated β’ 18.4k β’ 385 -
tencent/Hunyuan-MT-7B
Translation β’ 8B β’ Updated β’ 12.4k β’ 549 -
google/embeddinggemma-300m
Sentence Similarity β’ 0.3B β’ Updated β’ 1.28M β’ β’ 1.46k -
moonshotai/Kimi-K2-Instruct-0905
Text Generation β’ 1T β’ Updated β’ 13.5k β’ β’ 670
-
microsoft/VibeVoice-1.5B
Text-to-Speech β’ 3B β’ Updated β’ 228k β’ 2.21k -
OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview
Image-Text-to-Text β’ 0.4B β’ Updated β’ 36.8k β’ 82 -
apple/FastVLM-1.5B
Text Generation β’ 2B β’ Updated β’ 1.37k β’ 75 -
stepfun-ai/Step-Audio-2-mini
Any-to-Any β’ 8B β’ Updated β’ 1.96k β’ 247
-
openai/gpt-oss-120b
Text Generation β’ 120B β’ Updated β’ 3.33M β’ β’ 4.47k -
openai/gpt-oss-20b
Text Generation β’ 22B β’ Updated β’ 5.68M β’ β’ 4.35k -
openai/BrowseCompLongContext
Viewer β’ Updated β’ 295 β’ 596 β’ 46 -
baichuan-inc/Baichuan-M2-32B
Text Generation β’ 33B β’ Updated β’ 119k β’ β’ 118
-
stepfun-ai/step3
Image-Text-to-Text β’ 321B β’ Updated β’ 34.4k β’ 166 -
nunchaku-ai/nunchaku-flux.1-krea-dev
Text-to-Image β’ Updated β’ 6.62k β’ 119 -
fdtn-ai/Foundation-Sec-8B-Instruct
Text Generation β’ 8B β’ Updated β’ 5.69k β’ β’ 67 -
Wan-AI/Wan2.2-TI2V-5B-Diffusers
Text-to-Video β’ Updated β’ 33.5k β’ 108
-
Wan-AI/Wan2.2-I2V-A14B
Image-to-Video β’ Updated β’ 12.1k β’ β’ 604 -
allenai/olmOCR-7B-0725
Image-Text-to-Text β’ 8B β’ Updated β’ 432 β’ 63 -
Wan-AI/Wan2.2-T2V-A14B
Text-to-Video β’ Updated β’ 3.58k β’ β’ 419 -
Qwen/Qwen3-235B-A22B-Thinking-2507
Text Generation β’ 235B β’ Updated β’ 38.6k β’ β’ 397
-
nvidia/OpenReasoning-Nemotron-32B
Text Generation β’ 33B β’ Updated β’ 729 β’ β’ 122 -
ByteDance-Seed/Seed-X-RM-7B
Translation β’ Updated β’ 118 β’ 30 -
LGAI-EXAONE/EXAONE-4.0-32B
Text Generation β’ 32B β’ Updated β’ 13.1k β’ 277 -
vidore/colqwen-omni-v0.1
Visual Document Retrieval β’ Updated β’ 4.15k β’ 93
-
HuggingFaceTB/SmolLM3-3B
Text Generation β’ 3B β’ Updated β’ 71k β’ β’ 896 -
moonshotai/Kimi-K2-Instruct
Text Generation β’ 1T β’ Updated β’ 277k β’ β’ 2.32k -
fal/Realism-Detailer-Kontext-Dev-LoRA
Image-to-Image β’ Updated β’ 124 β’ β’ 53 -
Alibaba-NLP/WebSailor-3B
3B β’ Updated β’ 46 β’ 74
-
nari-labs/Dia-1.6B-0626
Text-to-Speech β’ 2B β’ Updated β’ 25.7k β’ 124 -
google/gemma-3n-E4B-it
Image-Text-to-Text β’ 8B β’ Updated β’ 140k β’ β’ 865 -
ByteDance/XVerse
Text-to-Image β’ Updated β’ 56 β’ 89 -
nvidia/llama-nemoretriever-colembed-3b-v1
Visual Document Retrieval β’ 4B β’ Updated β’ 681 β’ 74
-
opendatalab/OmniDocBench
Viewer β’ Updated β’ 1.36k β’ 10.9k β’ 68 -
nanonets/Nanonets-OCR-s
Image-Text-to-Text β’ 4B β’ Updated β’ 29.5k β’ 1.58k -
echo840/MonkeyOCR
Image-Text-to-Text β’ Updated β’ 244 β’ 514 -
Running on ZeroMCPFeatured140
Multimodal OCR2
π»140nanonets ocr / smoldocling / monkey ocr / typhoon ocr
-
ByteDance-Seed/BAGEL-7B-MoT
Any-to-Any β’ 15B β’ Updated β’ 594 β’ 1.18k -
mistralai/Devstral-Small-2505
24B β’ Updated β’ 56.2k β’ 861 -
ByteDance/Dolphin
Image-Text-to-Text β’ 0.4B β’ Updated β’ 3.29k β’ 513 -
moondream/moondream-2b-2025-04-14-4bit
Image-Text-to-Text β’ 1B β’ Updated β’ 4.46k β’ 62
-
Qwen/WorldPM-72B
Text Classification β’ 73B β’ Updated β’ 72 β’ 81 -
Running on ZeroMCPFeatured1.48k
LTX Video Fast
π₯1.48kultra-fast video model, LTX 0.9.8 13B distilled
-
BLIP3o/BLIP3o-Pretrain-Long-Caption
Viewer β’ Updated β’ 27.2M β’ 32.4k β’ 57 -
BLIP3o/BLIP3o-Model-8B
14B β’ Updated β’ 480 β’ 101
-
OpenGVLab/InternVL3-1B-hf
Image-Text-to-Text β’ 0.9B β’ Updated β’ 111k β’ 10 -
OpenGVLab/InternVL3-2B-hf
Image-Text-to-Text β’ 2B β’ Updated β’ 8.24k β’ 3 -
OpenGVLab/InternVL3-8B-hf
Image-Text-to-Text β’ 8B β’ Updated β’ 13.4k β’ 9 -
OpenGVLab/InternVL3-14B-hf
Image-Text-to-Text β’ 15B β’ Updated β’ 3.9k
-
moonshotai/Kimi-VL-A3B-Thinking
Image-Text-to-Text β’ 16B β’ Updated β’ 80.9k β’ 445 -
agentica-org/DeepCoder-14B-Preview
Text Generation β’ 15B β’ Updated β’ 323 β’ β’ 680 -
HiDream-ai/HiDream-I1-Full
Text-to-Image β’ Updated β’ 25.1k β’ β’ 986 -
OpenGVLab/InternVL3-78B
Image-Text-to-Text β’ 78B β’ Updated β’ 837k β’ 231
-
deepseek-ai/DeepSeek-V3-0324
Text Generation β’ 685B β’ Updated β’ 234k β’ β’ 3.09k -
Qwen/Qwen2.5-Omni-7B
Any-to-Any β’ 11B β’ Updated β’ 285k β’ 1.86k -
google/txgemma-27b-chat
Text Generation β’ 27B β’ Updated β’ 1.01k β’ 58 -
RunningFeatured366
Qwen2.5 Omni 7B Demo
π366Chat with an AI using text, audio, image, or video and hear responses
-
docling-project/SmolDocling-256M-preview
Image-Text-to-Text β’ 0.3B β’ Updated β’ 40.6k β’ 1.61k -
sesame/csm-1b
Text-to-Speech β’ Updated β’ 103k β’ 2.33k -
mistralai/Mistral-Small-3.1-24B-Instruct-2503
24B β’ Updated β’ 97.7k β’ 1.35k -
tencent/Hunyuan3D-2mini
Image-to-3D β’ Updated β’ 4.66k β’ 114
-
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text β’ 8B β’ Updated β’ 1.7M β’ β’ 1.26k -
Qwen/Qwen2-VL-2B-Instruct
Image-Text-to-Text β’ 2B β’ Updated β’ 2.02M β’ 487 -
CohereLabs/aya-vision-8b
Image-Text-to-Text β’ 9B β’ Updated β’ 46.2k β’ 316 -
CohereLabs/aya-vision-32b
Image-Text-to-Text β’ 33B β’ Updated β’ 168 β’ β’ 220
-
OpenGVLab/InternVideo2_5_Chat_8B
Video-Text-to-Text β’ 8B β’ Updated β’ 2.3k β’ 88 -
AIDC-AI/Ovis2-34B
Image-Text-to-Text β’ 35B β’ Updated β’ 1.07k β’ 152 -
open-r1/OpenR1-Qwen-7B
Text Generation β’ 8B β’ Updated β’ 20 β’ β’ 54 -
nomic-ai/nomic-embed-text-v2-moe
Sentence Similarity β’ 0.5B β’ Updated β’ 1.22M β’ 452
-
allenai/Llama-3.1-Tulu-3-405B
Text Generation β’ 406B β’ Updated β’ 43 β’ 110 -
Qwen/Qwen2.5-VL-72B-Instruct
Image-Text-to-Text β’ 73B β’ Updated β’ 144k β’ β’ 592 -
mistralai/Mistral-Small-24B-Instruct-2501
24B β’ Updated β’ 749k β’ 953 -
deepseek-ai/Janus-Pro-7B
Any-to-Any β’ Updated β’ 19.3k β’ 3.56k
-
Running on Zero266
Qwen2-VL-7B
π₯266Answer questions about any uploaded image
-
Running67
UI-TARS
π67Find click coordinates on images based on instructions
-
Running98
Qwen2.5-1M Demo
π»98Ask questions about uploaded documents with 1Mβtoken AI
-
Qwen/Qwen2.5-14B-Instruct-1M
Text Generation β’ 15B β’ Updated β’ 10.6k β’ β’ 333
-
ostris/Flex.1-alpha
Text-to-Image β’ Updated β’ 515 β’ 481 -
Qwen/Qwen2.5-Math-PRM-72B
Text Classification β’ 73B β’ Updated β’ 252 β’ 72 -
HuggingFaceTB/SmolVLM-500M-Instruct
Image-Text-to-Text β’ 0.5B β’ Updated β’ 14.7k β’ 186 -
deepseek-ai/DeepSeek-R1
Text Generation β’ 685B β’ Updated β’ 511k β’ β’ 13k
-
meta-llama/Llama-3.3-70B-Instruct
Text Generation β’ 71B β’ Updated β’ 920k β’ β’ 2.65k -
Qwen/Qwen2-VL-72B
Image-Text-to-Text β’ 73B β’ Updated β’ 64 β’ 80 -
google/paligemma2-3b-pt-224
Image-Text-to-Text β’ 3B β’ Updated β’ 30.3k β’ 162 -
tencent/HunyuanVideo
Text-to-Video β’ Updated β’ 1.09k β’ β’ 2.12k
-
HuggingFaceTB/SmolVLM-Instruct
Image-Text-to-Text β’ 2B β’ Updated β’ 34.7k β’ 577 -
Qwen/QwQ-32B-Preview
Text Generation β’ 33B β’ Updated β’ 6.64k β’ β’ 1.74k -
nvidia/Hymba-1.5B-Base
Text Generation β’ 2B β’ Updated β’ 446 β’ 157 -
vidore/colsmolvlm-v0.1
Visual Document Retrieval β’ Updated β’ 4 β’ 55
-
microsoft/LLM2CLIP-EVA02-L-14-336
Zero-Shot Image Classification β’ Updated β’ 47 β’ 60 -
microsoft/LLM2CLIP-EVA02-B-16
Updated β’ 52 β’ 10 -
PleIAs/common_corpus
Viewer β’ Updated β’ 517M β’ 33.3k β’ 341 -
Qwen/Qwen2.5-Coder-32B-Instruct
Text Generation β’ 33B β’ Updated β’ 682k β’ β’ 1.99k
-
NVLM: Open Frontier-Class Multimodal LLMs
Paper β’ 2409.11402 β’ Published β’ 74 -
BRAVE: Broadening the visual encoding of vision-language models
Paper β’ 2404.07204 β’ Published β’ 19 -
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper β’ 2403.18814 β’ Published β’ 48 -
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper β’ 2409.17146 β’ Published β’ 121
-
ibm-granite/granite-3.0-8b-instruct
Text Generation β’ 8B β’ Updated β’ 17.1k β’ 205 -
ibm-granite/granite-3.0-2b-instruct
Text Generation β’ 3B β’ Updated β’ 4.28k β’ 47 -
CohereLabs/aya-expanse-8b
Text Generation β’ 8B β’ Updated β’ 89.8k β’ 421 -
CohereLabs/aya-expanse-32b
Text Generation β’ 32B β’ Updated β’ 5.04k β’ β’ 288
-
Runtime errorFeatured100
LOTUS Normal
π100Generate high-quality predictions from images
-
Runtime error78
LOTUS Depth
π78Generate depth maps from images and videos
-
jingheya/lotus-depth-g-v1-0
Depth Estimation β’ Updated β’ 8.31k β’ 27 -
jingheya/lotus-depth-d-v1-0
Depth Estimation β’ Updated β’ 259 β’ 5
-
Running on Zero • Featured • 198
DepthCrafter
a super consistent video depth model
-
Paused • Featured • 223
Depth Pro
Generate an inverse depth map from an image
-
Runtime error • 78
LOTUS Depth
Generate depth maps from images and videos
-
apple/DepthPro
Depth Estimation • Updated • 4.01k • 498
-
facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 560k • 102 -
google/flan-t5-xl
3B • Updated • 171k • 527 -
google/siglip-large-patch16-384
Zero-Shot Image Classification • 0.7B • Updated • 14.6k • 11 -
google/vit-huge-patch14-224-in21k
Image Feature Extraction • 0.6B • Updated • 75.2k • 22
-
microsoft/resnet-50
Image Classification • 25.6M • Updated • 173k • 477 -
google/vit-base-patch16-224-in21k
Image Feature Extraction • 86.4M • Updated • 1.07M • 393 -
google/vit-base-patch32-224-in21k
Image Feature Extraction • 88M • Updated • 6.42k • 19 -
facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 560k • 102
-
facebook/deit-base-distilled-patch16-384
Image Classification • 87.6M • Updated • 57.6k • 7 -
facebook/convnextv2-base-1k-224
Image Classification • 88.7M • Updated • 1.47k • 4 -
facebook/deit-base-distilled-patch16-224
Image Classification • Updated • 5.91k • 32 -
google/vit-base-patch32-384
Image Classification • 88.3M • Updated • 6.85k • 23
-
facebook/detr-resnet-50
Object Detection • 41.6M • Updated • 511k • 933 -
facebook/detr-resnet-101-dc5
Object Detection • 60.7M • Updated • 1.67k • 19 -
facebook/detr-resnet-50-dc5
Object Detection • 41.6M • Updated • 1.74k • 6 -
google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 102k • 145
-
facebook/maskformer-swin-large-coco
Image Segmentation • 0.2B • Updated • 667 • 27 -
nvidia/segformer-b0-finetuned-ade-512-512
Image Segmentation • 3.75M • Updated • 552k • 179 -
facebook/detr-resnet-50-dc5-panoptic
Image Segmentation • 43M • Updated • 21 • 3 -
nvidia/segformer-b5-finetuned-cityscapes-1024-1024
Image Segmentation • Updated • 99k • 38
-
openai/clip-vit-large-patch14
Zero-Shot Image Classification • 0.4B • Updated • 7.88M • 1.96k -
openai/clip-vit-base-patch32
Zero-Shot Image Classification • Updated • 17.2M • 856 -
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Zero-Shot Image Classification • Updated • 55.3k • 306 -
kakaobrain/align-base
Zero-Shot Image Classification • Updated • 11.6k • 30
-
timbrooks/instruct-pix2pix
Image-to-Image • Updated • 105k • 1.17k -
TencentARC/t2i-adapter-canny-sdxl-1.0
Image-to-Image • Updated • 3.56k • 52 -
TencentARC/t2i-adapter-sketch-sdxl-1.0
Image-to-Image • Updated • 4.77k • 75 -
CrucibleAI/ControlNetMediaPipeFace
Image-to-Image • Updated • 994 • 575
-
microsoft/xclip-base-patch32
Video Classification • 0.2B • Updated • 164k • 108 -
facebook/timesformer-base-finetuned-k400
Video Classification • Updated • 31.5k • 42 -
facebook/timesformer-base-finetuned-k600
Video Classification • Updated • 576 • 12 -
google/vivit-b-16x2
Video Classification • Updated • 1.61k • 11
-
Salesforce/blip-image-captioning-large
Image-to-Text • 0.5B • Updated • 683k • 1.45k -
Salesforce/blip-image-captioning-base
Image-to-Text • Updated • 2.3M • 842 -
microsoft/trocr-base-handwritten
Image-to-Text • 0.3B • Updated • 229k • 476 -
microsoft/git-large-coco
Image-to-Text • 0.4B • Updated • 3.16k • 104
-
stabilityai/stable-diffusion-xl-base-1.0
Text-to-Image • Updated • 2.05M • 7.43k -
warp-ai/wuerstchen
Text-to-Image • Updated • 146 • 176 -
Deci/DeciDiffusion-v1-0
Text-to-Image • Updated • 29 • 140 -
stabilityai/stable-diffusion-xl-refiner-1.0
Image-to-Image • Updated • 336k • 2.02k
-
Running • 114
Grounding DINO Demo
Cutting edge open-vocabulary object detection app
-
Running • Featured • 96
Owlv2
State-of-the-art Zero-shot Object Detection
-
Runtime error • Featured • 41
BLIP2 with transformers
BLIP2 (cutting edge image captioning) in 🤗 transformers
-
Build error • Featured • 378
IDEFICS Playground
-
Running • Featured • 96
Owlv2
State-of-the-art Zero-shot Object Detection
-
Runtime error • Featured • 64
Owl Tracking
Powerful foundation model for zero-shot object tracking
-
Running • 26
Search and Detect (CLIP/OWL-ViT)
Search and detect objects in images using text queries
-
Running on Zero • Featured • 109
OWLSAM
State-of-the-art open-vocabulary image segmentation
-
Running on Zero • Featured • 72
Draw To Search Art
Draw/upload image and search among WikiART using SigLIP
-
Running on CPU Upgrade • 23
Compare Clip Siglip
Compare strong zero-shot image classification models
-
Running on Zero • 13
Multilingual Zero Shot Image Clf
Comparing powerful multilingual zero-shot image clf models
-
BAAI/bunny-phi-2-siglip-lora
Text Generation • Updated • 259 • 48
-
Runtime error • Featured • 84
UDOP
Generate text from document images
-
Runtime error • 40
Pix2struct
Play with all the pix2struct variants in this d
-
Sleeping • 26
Compare Docvqa Models
Compare different visual question answering
-
Runtime error • Featured • 289
DocQuery - Document Query Engine
-
Improved Baselines with Visual Instruction Tuning
Paper • 2310.03744 • Published • 39 -
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 49 -
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 11 -
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Paper • 2404.01331 • Published • 27
-
google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 102k • 145 -
google/owlvit-base-patch16
Zero-Shot Object Detection • Updated • 11.1k • 13 -
google/owlvit-large-patch14
Zero-Shot Object Detection • Updated • 14.6k • 29 -
google/owlv2-base-patch16
Zero-Shot Object Detection • 0.2B • Updated • 132k • 29
-
depth-anything/Depth-Anything-V2-Small
Depth Estimation • Updated • 9.96k • 76 -
depth-anything/Depth-Anything-V2-Large
Depth Estimation • Updated • 89.3k • 150 -
Running on Zero • 627
Depth Anything V2
Generate depth map from any photo
-
depth-anything/DA-2K
Viewer • Updated • 1.04k • 322 • 16
-
Running • 196
Vidore Leaderboard
Compare and rank visual document retrieval models across different benchmarks
-
Running on CPU Upgrade • 988
Open VLM Leaderboard
VLMEvalKit Evaluation Results Collection
-
Running • Featured • 560
Vision Arena (Testing VLMs side-by-side)
Explore AI vision models by uploading an image
-
Running • Featured • 85
SEED-Bench Leaderboard
Submit model evaluation results to leaderboard
-
Paused • 21
Video Llava
Generate descriptions by uploading images or videos
-
llava-hf/LLaVA-NeXT-Video-7B-hf
Video-Text-to-Text • 7B • Updated • 84.5k • 122 -
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf
Video-Text-to-Text • 7B • Updated • 1.46k • 11 -
llava-hf/LLaVA-NeXT-Video-7B-32K-hf
Image-Text-to-Text • 8B • Updated • 449 • 8
-
NVEagle/Eagle-X5-13B
Image-Text-to-Text • 15B • Updated • 5 • 15 -
NVEagle/Eagle-X5-13B-Chat
Image-Text-to-Text • 15B • Updated • 4 • 28 -
NVEagle/Eagle-X5-7B
Image-Text-to-Text • 9B • Updated • 26 • 26 -
Runtime error • 64
Eagle X5 13B Chat
Combine text and images to generate responses
-
vidore/colpali-v1.2
Visual Document Retrieval • Updated • 24.3k • 113 -
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • 8B • Updated • 1.7M • 1.26k -
Qwen/Qwen2-VL-2B-Instruct
Image-Text-to-Text • 2B • Updated • 2.02M • 487 -
Qwen/Qwen2-72B-Instruct
Text Generation • 73B • Updated • 30.3k • 719