FP4 quantization for inference optimization
#19
by O96a - opened
The A12B active-parameter configuration with FP4 quantization is an interesting tradeoff for 120B-scale models. We've been exploring quantization strategies for running large models on Oracle Free Tier instances, and FP4 is compelling for memory efficiency. The key question is whether FP4 degrades reasoning capabilities more than INT4 does at this scale; both are 4-bit formats, but they allocate their bits differently (exponent/mantissa vs. uniform integer levels), so accuracy can diverge on outlier-heavy weights. Has anyone compared FP4 vs INT4 on reasoning benchmarks like GSM8K or HumanEval? The 220 likes show strong early interest, but likes aren't validation; real-world inference benchmarks would help teams decide whether the memory savings justify the precision tradeoff.
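For anyone set up to run that comparison, here's a minimal sketch using bitsandbytes 4-bit loading through transformers, toggling between its `"fp4"` and `"nf4"` quant types. Assumptions to flag: the model id and prompt are placeholders, bitsandbytes' `"fp4"` is its own 4-bit float layout and may not match whatever FP4 variant this release ships with, and bitsandbytes doesn't expose a plain INT4 type, so NF4 stands in as the non-FP4 4-bit baseline here.

```python
# Sketch: load the same checkpoint under two bitsandbytes 4-bit quant types
# and compare generations on a GSM8K-style prompt. Illustrative only; a real
# benchmark would score the full eval set, not a single prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "your-org/your-120b-checkpoint"  # placeholder, substitute the real repo id
PROMPT = "Q: A farmer has 12 sheep and buys 7 more. How many sheep does he have?\nA:"

def generate_with_quant(quant_type: str) -> str:
    # bnb_4bit_quant_type accepts "fp4" or "nf4" for bitsandbytes-backed loading
    config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type=quant_type,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, quantization_config=config, device_map="auto"
    )
    inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

for qt in ("fp4", "nf4"):
    print(f"--- {qt} ---")
    print(generate_with_quant(qt))
    torch.cuda.empty_cache()  # release cached blocks before the next full load
```

Loading a 120B checkpoint twice is obviously expensive; in practice you'd run each config as a separate job and score the outputs with an eval harness rather than eyeballing generations.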