Model Card for RWKV-Qwen3-30B-A3B-2507-Hybrid-GGUF

This model requires a custom fork of llama.cpp with the RWKV079 implementation.

This is a preview release.

This model currently has lower output quality than the Dense Hybrid variant. We will continue to improve it!

Model Overview

Model Name: RWKV-Qwen3-30B-A3B-2507-Hybrid-GGUF
Repository: OpenMOSE/RWKV-Qwen3-30B-A3B-2507-Instruct-hxa079
Format: GGUF (for llama.cpp) with imatrix quantization
Year: 2025
Release phase: alpha

Description

RWKV-Qwen3-30B-A3B-2507-Hybrid-GGUF is an experimental large language model that combines the strengths of traditional transformer architecture with the efficiency of RWKV (Receptance Weighted Key Value) mechanisms. This model is specifically optimized for inference in memory-constrained environments while maintaining excellent context length capabilities.

Technical Specifications

Model Parameters

  • Parameter Count: 30 Billion parameters
  • Architecture: RWKV079 + GQA (Grouped-Query Attention) Hybrid Linear Attention + Mixture of Experts
  • Base Model: Alibaba Qwen3-30B-A3B-2507
  • Suitable Context Length: 32768 tokens (passkey retrieval up to 110k)
  • Layers: 39 RWKV, 9 NoPE GQA

Key Innovation

The model achieves remarkable efficiency by:

  • Converting 81.25% of attention layers from the base Qwen3-30B-A3B model to RWKV architecture
  • Reducing KV (Key-Value) cache size to 1/5.33 of the original
  • Enabling superior long-context inference in VRAM-limited environments
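
The quoted figures follow directly from the layer counts above: 39 of the 48 layers are converted to RWKV, and only the 9 remaining GQA layers keep a KV cache. A quick sanity check:

```shell
# 39 of 48 attention layers converted to RWKV; only the 9 remaining
# GQA layers keep a KV cache, shrinking it by a factor of 48/9.
awk 'BEGIN { printf "converted: %.2f%%\n", 39/48*100 }'
awk 'BEGIN { printf "KV cache:  1/%.2f of original\n", 48/9 }'
```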

Performance Benefits

Compared to the base model, RWKV-Qwen3-30B-A3B-2507-Hybrid offers:

  • 2x longer context length capability (theoretical)
  • 2x larger batch size for simultaneous inference
  • Significantly reduced memory footprint while maintaining model quality

Installation and Usage

Prerequisites

This model requires a custom fork of llama.cpp with the RWKV079 implementation, based on mollysophia's RWKV7 implementation.

Setup Instructions

  1. Clone the repository and check out the hxa079 branch:
git clone https://github.com/OpenMOSE/llama.cpp
cd llama.cpp
git checkout hxa079

Building the Project (Linux)

For CUDA (NVIDIA GPUs):

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

For ROCm (AMD GPUs):

First, identify your GPU architecture:

  • AMD Radeon RX 79xx series → gfx1100
  • AMD Instinct MI300 series → gfx942
  • AMD Instinct MI100 → gfx908
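
If your GPU is not in the list above, you can query the architecture directly on a machine with ROCm installed (note: the exact `rocminfo` output formatting may vary across ROCm versions):

```shell
# Extract the first gfx target reported by rocminfo
rocminfo | grep -o 'gfx[0-9a-f]*' | head -n 1
```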

Then build with the appropriate target:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16

Note: Replace gfx1100 with your GPU's architecture code.

Running the Model

Standard Inference:

./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja -fa 1

With KV Cache Quantization:

./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja -fa 1 -ctv q8_0 -ctk q8_0
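
As a rough estimate (assuming ggml's standard q8_0 layout: 32 one-byte quantized values plus one fp16 scale per block, i.e. about 1.06 bytes/element versus 2 bytes/element for f16), q8_0 KV quantization shrinks the KV cache to roughly half its f16 size:

```shell
# q8_0 block assumption: 32 quantized bytes + one 2-byte fp16 scale,
# compared against 32 f16 elements at 2 bytes each.
awk 'BEGIN { printf "q8_0 KV size vs f16: %.4f%%\n", (32*1+2)/(32*2)*100 }'
```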

Extreme Low VRAM Mode (fits on a 16 GB GPU):

./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja -fa 1 -ctv q8_0 -ctk q8_0 --override-tensor "time_mix_g1=CPU,time_mix_g2=CPU,time_mix_w1=CPU,time_mix_w2=CPU"

Extreme Low VRAM Mode (fits on a 4 GB GPU):

./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja -fa 1 -c 4096 --n-cpu-moe 48
./build/bin/llama-server -m YOUR_MODEL_PATH --jinja -fa 1 --port 4096 -np 1 -c 65536 --top-k 20 --top-p 0.3 --temp 0.6 --repeat-penalty 1.1 --n-cpu-moe 48

Important: For better output quality, try the sampling settings --top-k 20 --top-p 0.3 --temp 0.6 --repeat-penalty 1.1.
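
With llama-server running as shown above, it exposes an OpenAI-compatible HTTP API; a minimal chat request using the recommended sampling settings might look like this (port 4096 per the example command; adjust to your setup):

```shell
curl http://localhost:4096/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.6,
        "top_p": 0.3
      }'
```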

Important Limitations and Notes

Current Limitations:

  1. Model Compatibility: This branch exclusively supports RWKV079 models; other model types will not function.

Supported Hardware:

  • ✅ NVIDIA GPUs (via CUDA)
  • ✅ AMD GPUs (via ROCm)
  • ✅ CPU inference
  • ❌ Apple Silicon (Metal)

Acknowledgments

We extend our heartfelt gratitude to all contributors and supporters who made this experimental model possible.

Disclaimer

EXPERIMENTAL MODEL: This model is created purely for experimental and research purposes.

No Warranty: The creators make no guarantees regarding:

  • Model performance
  • Output quality
  • Suitability for any particular use case
  • Results accuracy

Users should thoroughly evaluate the model for their specific needs before deployment in any application.

License

Apache-2.0. Please refer to the repository for specific license information. As this model is based on Qwen3-30B-A3B-2507, users should also comply with the original Qwen model's licensing terms.

Contact and Support

For issues, questions, or contributions, please visit the GitHub repository or open an issue in the project's issue tracker.


2025 OpenMOSE
