Using multiple quality tags (NovelAI/Pony/etc) is a good idea
Full tutorial link: https://www.youtube.com/watch?v=DPX3eBTuO_Y
Info
This is a comprehensive, step-by-step tutorial on how to train Qwen Image models. It covers both LoRA training and full fine-tuning / DreamBooth training, for both the Qwen Image base model and the Qwen Image Edit Plus 2509 model. The tutorial is the product of 21 days of R&D and over $800 spent on cloud services to find the best training configurations. Furthermore, we have developed an amazing, ultra-easy-to-use Gradio app for running the legendary Kohya Musubi Tuner trainer with ease. You will be able to train locally on your Windows computer, for both LoRA and fine-tuning, on GPUs with as little as 6 GB of VRAM. Finally, I show how to train a character (a person), a product (a perfume), and a style (GTA 5 artworks).
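The actual training in the tutorial is done through the Gradio app and Kohya Musubi Tuner, so the sketch below is not that tool's API or config. It is only a minimal, hypothetical illustration of what LoRA training does under the hood: the base weights stay frozen and only a small low-rank adapter is trained. All layer sizes and names here are made up for the example.

```python
# Conceptual LoRA sketch (illustrative only; not the Musubi Tuner API or the tutorial's config).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the original weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)     # start as a no-op, so training begins at the base model
        self.scale = alpha / rank

    def forward(self, x):
        # frozen base output + scaled low-rank update; only lora_a/lora_b receive gradients
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Usage on a hypothetical 3072-wide projection layer of a diffusion transformer block:
layer = LoRALinear(nn.Linear(3072, 3072), rank=16)
print(layer(torch.randn(2, 3072)).shape)  # torch.Size([2, 3072])
```

Because only the two small adapter matrices are trainable, optimizer state and gradients stay tiny, which is why LoRA fits on low-VRAM GPUs far more easily than full fine-tuning.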
Resources
The post used in the tutorial to download the zip file: https://www.patreon.com/posts/qwen-trainer-app-137551634
Requirements tutorial: https://youtu.be/DrhUHnYfwC0
SwarmUI tutorial: https://youtu.be/c3gEoAyL2IE
Used Prompts
https://gist.github.com/FurkanGozukara/069523015d18a3e63d74c59257447f5b
Comparison Images Full Size
Chat Template?
Really Fun Model
What does A14 mean? Could we get details of the Qwen MoE architecture?
What is the context size of this model? Also, it does not appear to handle JSON or function calling well.
The 4k versions load and work in KoboldCpp, but the 128k versions don't.
<|eot_id|> in aphrodite-engine
The issue is that all those experts have to be very diverse and trained more or less simultaneously.
If you are going to use sparse MoE, your router has to be able to predict the fittest expert for the upcoming token, which means the router has to be trained together with the experts. That wouldn't be an issue for classic MoE, but both kinds of models also rely on the experts having a uniform "understanding" of the cached context. I don't think a 100x2B model would work well enough without that. That's why fine-tuning Mixtral is such a complicated task.
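To make the coupling concrete, here is a minimal, hypothetical sparse-MoE layer sketch (sizes and top-k are made up, not taken from any real model). The router's scores both choose which experts see a token and weight their outputs, so gradients flow through router and experts jointly; that is exactly why you can't train them separately.

```python
# Minimal sparse-MoE sketch: router + experts trained jointly (illustrative sizes only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # predicts per-token expert fitness
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        logits = self.router(x)                 # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e           # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

moe = SparseMoE()
print(moe(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```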
On top of that, we don't really have a good 2B base model. Sure, Phi exists... with 2K context length, no GQA, coherency issues, and very limited knowledge. I don't think the point of an "expert" is to provide domain-specific capabilities to the composite model; I think the trick is overcoming diminishing returns in training, plus some bandwidth optimizations for inference. So among your 100 experts, one might have both an analog of a grandmother cell and some weights associated with division. Another expert could be good at both kinds of ERP - Enterprise Resource Planning and the main excuse for creating Frankenmerges, lol. Model distillation keeps getting better, but I don't think any modern 2B model can compete with GPT-4. Perhaps a 16x34B could, but good luck training that from scratch as a relatively small business, let alone as a nonprofit or a private individual.
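A rough back-of-the-envelope count shows why those sizes are out of reach for small players. The sketch below naively treats every expert as a full copy of the base model (real MoEs share attention and embeddings and only replicate the FFN blocks, so these are upper bounds), and assumes top-2 routing.

```python
# Naive total vs. active parameter counts for the hypothetical MoE sizes mentioned above.
def moe_params(n_experts, expert_size_b, top_k=2):
    total = n_experts * expert_size_b   # parameters you must train and store (upper bound)
    active = top_k * expert_size_b      # parameters touched per token at inference
    return total, active

for n, size in [(100, 2), (16, 34)]:
    total, active = moe_params(n, size)
    print(f"{n}x{size}B -> ~{total}B total params, ~{active}B active per token")
# 100x2B -> ~200B total params, ~4B active per token
# 16x34B -> ~544B total params, ~68B active per token
```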