lapa-instruct-bidi-grpo

Bidirectional English-Ukrainian translation LoRA adapter for lapa-v0.1.2-instruct (Gemma-3 12B).

Created by linearly combining the best direction-specific adapters:

  • en->uk: Condition E (LLM judge + calibrated guardrails, step 300)
  • uk->en: Condition A (chrF + BLEU, step 300)

Combined via peft.add_weighted_adapter(combination_type='linear', weights=[1.0, 1.0]).
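
A minimal sketch of that merge, assuming the two specialist adapters are available locally (the paths and adapter names below are placeholders, not the actual training artifacts):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("lapa-llm/lapa-v0.1.2-instruct", device_map="auto")
# Attach the en->uk specialist, then load the uk->en specialist as a second adapter.
model = PeftModel.from_pretrained(base, "path/to/enuk-condition-e", adapter_name="enuk")
model.load_adapter("path/to/uken-condition-a", adapter_name="uken")
# Equal-weight linear combination, as described above.
model.add_weighted_adapter(
    adapters=["enuk", "uken"],
    weights=[1.0, 1.0],
    adapter_name="bidi",
    combination_type="linear",
)
model.set_adapter("bidi")
model.save_pretrained("lapa-instruct-bidi-grpo")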

Results (FLoRes+ devtest)

Direction   BLEU    chrF    Mistral Judge   Aya Judge
en->uk      33.88   61.54   83.87           91.94
uk->en      43.28   67.97   88.59           94.32

Results (WMT24)

Direction   BLEU    chrF    Mistral Judge   Aya Judge
en->uk      31.37   57.15   80.31           88.88
uk->en      35.69   60.71   84.62           91.75

Comparison to Baseline and Specialists

Benchmark            Baseline   en->uk specialist (E)   uk->en specialist (A)   Bidi
FLoRes en->uk BLEU   33.44      34.02                   33.80                   33.88
FLoRes uk->en BLEU   42.02      42.30                   43.15                   43.28
WMT24 en->uk BLEU    31.12      31.24                   30.93                   31.37
WMT24 uk->en BLEU    34.60      34.78                   35.75                   35.69

Improves over the baseline on all four benchmarks and retains most of each specialist's gains in its respective direction.
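
The BLEU and chrF columns are standard corpus-level scores. The exact scorer settings used for this card aren't stated; the sketch below assumes sacrebleu with default signatures, on hypothetical toy data:

import sacrebleu

# Hypothetical toy data; in practice these are the model's translations
# and the benchmark references.
hyps = ["Сьогодні гарна погода."]
refs = [["Сьогодні чудова погода."]]

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs)
print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}")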

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in bfloat16 and attach the merged bidirectional adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "lapa-llm/lapa-v0.1.2-instruct",
    device_map="auto",
    dtype="bfloat16",
)
model = PeftModel.from_pretrained(
    base_model,
    "iamthewalrus67/lapa-instruct-bidi-grpo",
    adapter_name="bidi",
    subfolder="bidi",
)
tokenizer = AutoTokenizer.from_pretrained("lapa-llm/lapa-v0.1.2-instruct")

# Translator-style instruction prompt; the model returns only the translation.
messages = [
    {
        "role": "user",
        "content": (
            "You are a professional translator. You give only the translated "
            "text and nothing else. Translate the following text into Ukrainian:\n"
            "The weather is nice today."
        ),
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# top_k=1 makes sampling effectively greedy despite do_sample=True.
output = model.generate(input_ids, max_new_tokens=256, temperature=0.1, top_k=1, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
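
For the uk->en direction, the same prompt template applies with the target language swapped (this assumes symmetric prompting, consistent with the bidirectional setup described above):

messages = [{"role": "user", "content": "You are a professional translator. You give only the translated text and nothing else. Translate the following text into English:\nСьогодні гарна погода."}]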

Training Details

  • Base model: lapa-v0.1.2-instruct (Gemma-3 12B, Ukrainian-specialized)
  • Method: GRPO with direction-specific reward configurations (a minimal reward sketch follows this list)
  • LoRA: r=128, alpha=256, targets=q/k/v/o projections
  • Data: WikiMatrix en-uk (132K pairs)
  • Training: 300 steps, DeepSpeed ZeRO-2, 4x RTX 6000 Ada
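
As an illustration of the reward configurations, Condition A scored completions with chrF + BLEU. The exact weighting and trainer interface aren't documented here, so this is a minimal sketch assuming TRL's GRPOTrainer reward-function convention and an equal-weight average of sentence-level sacrebleu scores rescaled to [0, 1]:

import sacrebleu

def chrf_bleu_reward(completions, references, **kwargs):
    # One reward per completion: the mean of sentence-level chrF and BLEU,
    # each rescaled from sacrebleu's [0, 100] range to [0, 1].
    rewards = []
    for hyp, ref in zip(completions, references):
        chrf = sacrebleu.sentence_chrf(hyp, [ref]).score
        bleu = sacrebleu.sentence_bleu(hyp, [ref]).score
        rewards.append((chrf + bleu) / 200.0)
    return rewards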

See reward-driven-translation for full reproduction code.
