ToriiGate-0.5

ToriiGate-0.5 is an advanced model for generating captions and descriptions for digital art, screenshots, and various images.

Carefully prepared dataset (based on Anime-Art-Multicaptions-v5.0 filtered and expanded) covers a wide range of images, compositions, sfw/nsfw art, comics, and more. The model possesses high awareness of numerous concepts and actions without restrictions or censorship, can recognize characters and use their names in descriptions, and distinguishes individual attributes. By volume of knowledge and description accuracy in its domain, the model delivers state-of-the-art performance, outperforming all nearest competitors.

The model can be used for creating accurate datasets for generative models, preparing prompts for individual images, and descriptions/labeling for further sorting and classification.

Key Advantages

High awareness and knowledge important for digital illustrations and anime art across a wide range
Multiple caption format variants (including intermediate reasoning) tailored for different use cases with detailed structuring options
Optimized for character name usage and recognition
Flexible grounding options for high-precision automated dataset processing
Despite its tiny size, outperforms larger and closed alternatives
No censorship

Capabilities

Zero-shot Character Recognition

The model provides state of the art level of knowledge for popular characters represented in mass media and booru galleries up to 12.2025.

Chart methodology

Chart obtained by processing 4000 recent Danbooru images with single character in each category without any grounding. Characters are considered as popular when has over 1200 pictures, more them 200 but less then 800 for medium, less then 100 for rare. Character recognition was evaluated by Kimi k2.5 judge model; character similarity for mismatches was assessed using cosine distance between their averaged embeddings.

For rare characters, confusion with similar popular ones may occur. This is resolved through various grounding. For example, the model excellently extracts character names from the full list of booru tags or descriptions and correctly applies them thanks to its existing knowledge.

Chart methodology

Chart obtained by processing 4000 recent Danbooru images with single character in each category with list of booru tags provided. Character recognition was evaluated by Kimi k2.5 judge model.

For cases where you need to achieve best recognition of multiple characters simultaneously without mixing their traits and names, a mode is provided that allows describing the individual appearance and attributes of each one. This allows the model effectively distinguishes unknown characters or objects considering given intel.

Example for characters description

Original Picture

1. Thoughts about characters

The two characters are a cute anthropomorphic panda and a lion, both depicted in a chibi, plush-like style. The panda is the black-and-white furry character with typical panda ears and markings, while the lion has a fluffy orange mane and a small tufted tail. They are both named in the tags as "Bamboo" (panda) and "Sunny" (lion).

2. Key details

The scene is a warm, peaceful outdoor picnic at sunset with a soft golden glow.

The panda is sitting on a picnic blanket eating a sandwich, smiling with closed eyes and blushing cheeks.

The lion is holding a steaming teacup with both paws, also smiling contentedly with closed eyes.

There is a traditional teapot and a plate of cookies on the blanket between them.

A bento box with assorted food is placed near the panda.

A cozy campfire burns beside a tent and a lantern in the background, adding to the camping atmosphere.

The surrounding environment is a lush grassy field dotted with wildflowers, with a winding path and distant mountains under a colorful sky.

3. Long description

This charming and heartwarming illustration shows two anthropomorphic animal characters enjoying a peaceful picnic outdoors during the golden hour before sunset. The panda, seated on a floral-patterned picnic blanket, is mid-bite into a sandwich, eyes closed in delight, cheeks flushed with a gentle blush. Next to the panda is a small brown backpack and a bento box filled with a variety of foods, including rice and vegetables. The lion, sitting on a blue cushion nearby, holds a floral teacup with both paws, steam rising from the hot beverage inside. The lions eyes are also closed, and its expression is one of contentment and happiness, with a small smile and rosy cheeks. Between them on the blanket is a matching floral teapot and a plate of two cookies.

The background reveals a lush meadow filled with wildflowers and tall grass, extending toward a winding path leading to rolling green hills and distant mountains. The sky is painted with soft pastel colors, featuring fluffy clouds tinged with the warm hues of the setting sun. To the right, a beige canvas tent with a window is pitched near a small campfire surrounded by logs and a lantern, further emphasizing the camping trip vibe. The overall style is soft and painterly, with delicate linework and warm, inviting colors that evoke a cozy and joyful atmosphere.

4. Detailed description for each character

Character 1: Panda (Bamboo)

The panda is a small, chubby, and adorable anthropomorphic character with classic black-and-white fur patterning. It has round black ears with white insides, large black eyes closed in a happy expression, and rosy blush on its cheeks. The panda wears a green neck scarf tied around its neck and is sitting upright on the picnic blanket with its legs spread comfortably. It holds a sandwich with both paws, taking a big bite with a joyful smile. Its pink paw pads are visible on its feet, and a brown backpack rests behind it. The pandas overall posture and expression convey pure delight and contentment.

Character 2: Lion (Sunny)

The lion is a similarly small, plush-like anthropomorphic character with fluffy orange fur and a thick, mane-like tuft of hair framing its face. It has round ears with light inner fur and a small tufted tail visible curling behind it. The lions eyes are closed in a happy smile, cheeks also blushed. It holds a floral patterned teacup with both front paws, steam gently rising from the hot tea inside. The lion wears a yellow scarf around its neck and sits on a blue pillow on the picnic blanket. Its round paw pads are visible as it clasps the cup, radiating warmth and relaxation.This image beautifully captures a serene, joyful moment of friendship and nature appreciation during a peaceful camping trip at sunset.

No Restrictions and Deep NSFW Knowledge

Chart methodology

Chart obtained by processing 2000 recent Danbooru images from the several explicit categories, zero shot. Accuracy was evaluated by Kimi k2.5 judge model through tag comparison and semantic analysis.

The model understands many types of activities and does not require jailbreaks or numerous additional instructions for working with such images. Descriptions are sufficiently detailed and rich in details, avoiding abstract purple prose and vague terms like intimate activity.

Flexible Grounding Options

The model can work fully zero-shot or use available ground-truth for maximum accuracy. Flexible grounding options are available and can be used in any combination:

List of character names to use in the description
Individual attributes for each character for recognition on images with multiple characters (both popular tags and regular descriptions are supported)
Separate booru tags or brief descriptions of what's happening (from one or two words to a long list)

For convenience, you can use these datasets (will be uploaded soon), which contain descriptions for most characters represented in booru galleries.

Multiple Caption Formats

Formats are oriented toward different tasks. In general, two directions can be distinguished: detailed structured descriptions for further processing, and ready-to-use caption variants for use as prompts. And some general lagecy stuff.

More details with examples HERE.

What the Model is Not Suitable For

Multi-turn conversations and multiple image input
OCR
Text-only input and regular chat
Any general LLM task except image description

Do not even try to use is with SillyTavern, OpenWebui, or regular nodes you may used to, the model requires special handling. All other functionality was lobotomized in favor of the primary use case. Possibly, a general-purpose model will be created in the future if there is interest.

Usage

Running the Model / Backend

For efficient operation, use:

VLLM (full weights in this repository, fp8 quant/soon)
Llamacpp (quants/soon)
Exllama (quants/soon)

Running via Transformers is possible, but speed will be extremely low (script example).

Casual Use

Ready-to-use Gradio-based interface
HF Space (may be slow due to Transformes implementation!) /soon
ComfyUI nodes (coming soon)

Batch Processing

Sample script. Inside you will find settings for modes and grounding configurations.

Prompts

The model is designed to work with specific prompts and structure to ensure optimal results. They are presented in the scripts/prompts.py file; deviating from them is highly discouraged.

Caption Examples

SFW examples

NSFW examples on Rentry / soon

Using Complex Formats

Detailed formats with large structures are designed to obtain the most comprehensive information. They can subsequently be restructured into the desired format via stream processing with any LLM.

Individual formats (list) are ready to use as prompts for generation or training in their original form. Simple processors can be implemented for specific models, for example, individual prompts for NovelAI Image v4+.

Known Issues

Some formats with reasoning blocks are designed only for working with named characters; disabling names in them will not yield results. Use other formats for descriptions without names
The model was trained with average resolution of about 1 Mpx, avoid using too high resolution images for raw input (provided scripts will handle automatic resizing)
...

Warning

The model may produce inaccurate and provocative results

Discussion and support

Discord

BTC: bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c

ETH/USDT(e): 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db

XMR: 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ

Acknowledgments

Thanks to all fellow brothers who supported me and my project. Special thanks: Anonymous person, NeuroSenko, OpenRoot-Compute, Sv1.

Downloads last month: 27

Safetensors

Model size

5B params

Tensor type

BF16

Model tree for Minthy/ToriiGate-0.5

Base model

Qwen/Qwen3.5-4B-Base

Finetuned

(45)

this model

Quantizations

3 models