Keep on Pruning
Can we get this bad boy down to 40B, please 🙏
lol i hear you. 40B params? or 40GB? some details for anyone curious ..
I don't think REAP goes much further than a 40% cut .. that's already very aggressive, and it's what the REAP author did (lkevincc0 -- not me)
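for anyone wondering what REAP actually does: roughly, it scores each expert by how much router-weighted output it contributes on calibration data, then drops the weakest ones. a toy numpy sketch of that scoring idea (my paraphrase, not the official code -- all sizes and tensors are made up):

```python
# Toy sketch of REAP-style expert saliency scoring (not the official code).
# Idea: score each expert by the router-weighted output it actually
# contributes on calibration tokens, then drop the weakest 40%.
import numpy as np

rng = np.random.default_rng(0)
n_experts, n_tokens = 64, 4096                   # made-up sizes

# Pretend these came from a forward pass over calibration text:
gate = rng.random((n_tokens, n_experts))         # router weights per token
out_norm = rng.random((n_tokens, n_experts))     # ||expert output|| per token
topk = np.argsort(gate, axis=1)[:, -8:]          # say, top-8 routing

# Saliency: average (gate weight * output norm) over the tokens
# where the expert was actually selected by the router.
saliency = np.zeros(n_experts)
for j in range(n_experts):
    mask = (topk == j).any(axis=1)
    if mask.any():
        saliency[j] = (gate[mask, j] * out_norm[mask, j]).mean()

keep = np.argsort(saliency)[int(0.4 * n_experts):]  # drop lowest 40%
print(f"keeping {len(keep)}/{n_experts} experts")
```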
40GB is probably feasible with ik_llama.cpp quants (2 or 3-bit?) .. but the 40% REAP cut costs 5-15% quality, and Q4 normally costs another 5-10%, probably worse on top of REAP. going past that .. 🤷. I actually haven't tried this Q4 GGUF yet but it should work ok.
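the file-size math is just params × bits-per-weight / 8. quick sketch (the 130B post-prune count and the bpw figures are rough placeholders, not measurements):

```python
# Back-of-envelope GGUF size: params * bits-per-weight / 8.
# 130e9 is a placeholder for "big MoE after a 40% REAP cut" --
# plug in the real post-prune parameter count. bpw values are rough.
params = 130e9
for name, bpw in [("Q2_K", 2.6), ("IQ3_XXS", 3.1), ("Q4_K_M", 4.8)]:
    gb = params * bpw / 8 / 1e9
    print(f"{name:8s} ~{bpw} bpw -> ~{gb:.0f} GB")
```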
I don't think 40B params is even close to feasible. would be fun to try with some RL healing tho but I don't have a good DPO setup.
40B parameters, lol
I have seen a team drop Qwen 3 Coder Next 80B down to a 20B model (60% pruned), and it is pretty smart for a pruned and then quantized model.
Not saying it wouldn't be usable -- but at that big of a drop it would need some RL healing to be useful, IMO.
but 40B + baby DPO would probably get you a pretty great flash model. point me at a fresh DPO dataset 🙏
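something like trl-lib/ultrafeedback_binarized could work as a first pass. a minimal baby-DPO sketch with TRL, untested -- the model id is a placeholder, and for anything 40B-sized you'd want LoRA/QLoRA on top rather than a full finetune:

```python
# Minimal "baby DPO" sketch with TRL (untested; model id is a placeholder).
# The dataset just needs prompt / chosen / rejected columns.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "your-pruned-model"                    # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Example preference dataset with prompt/chosen/rejected fields.
ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(output_dir="dpo-heal",
                 per_device_train_batch_size=1,
                 gradient_accumulation_steps=8,
                 beta=0.1)                        # DPO temperature
trainer = DPOTrainer(model=model, args=args, train_dataset=ds,
                     processing_class=tokenizer)  # recent TRL arg name
trainer.train()
```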