lapa-instruct-bidi-grpo

Bidirectional English-Ukrainian translation LoRA adapter for lapa-v0.1.2-instruct (Gemma-3 12B).

Created by linearly combining the best direction-specific adapters:

  • en->uk: Condition E (LLM judge + calibrated guardrails, step 300)
  • uk->en: Condition A (chrF + BLEU, step 300)

Combined via peft.add_weighted_adapter(combination_type='linear', weights=[1.0, 1.0]).
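
A minimal sketch of that merge, assuming the two specialist adapters are available locally (the paths and adapter names below are placeholders, not the actual training artifacts):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("lapa-llm/lapa-v0.1.2-instruct", device_map="auto")
# Attach the en->uk specialist, then load the uk->en specialist as a second adapter.
model = PeftModel.from_pretrained(base, "path/to/enuk-condition-e", adapter_name="enuk")
model.load_adapter("path/to/uken-condition-a", adapter_name="uken")
# Equal-weight linear combination, as described above.
model.add_weighted_adapter(
    adapters=["enuk", "uken"],
    weights=[1.0, 1.0],
    adapter_name="bidi",
    combination_type="linear",
)
model.set_adapter("bidi")
model.save_pretrained("lapa-instruct-bidi-grpo")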

Results (FLoRes+ devtest)

Direction   BLEU    chrF    Mistral Judge   Aya Judge
en->uk      33.88   61.54   83.87           91.94
uk->en      43.28   67.97   88.59           94.32

Results (WMT24)

Direction   BLEU    chrF    Mistral Judge   Aya Judge
en->uk      31.37   57.15   80.31           88.88
uk->en      35.69   60.71   84.62           91.75

Comparison to Baseline and Specialists

Benchmark            Baseline   en->uk specialist (E)   uk->en specialist (A)   Bidi
FLoRes en->uk BLEU   33.44      34.02                   33.80                   33.88
FLoRes uk->en BLEU   42.02      42.30                   43.15                   43.28
WMT24 en->uk BLEU    31.12      31.24                   30.93                   31.37
WMT24 uk->en BLEU    34.60      34.78                   35.75                   35.69

Improves over the baseline on all four benchmarks and retains most of each specialist's gains in its respective direction.
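
The BLEU and chrF columns are standard corpus-level scores. The exact scorer settings used for this card aren't stated; the sketch below assumes sacrebleu with default signatures, on hypothetical toy data:

import sacrebleu

# Hypothetical toy data; in practice these are the model's translations
# and the benchmark references.
hyps = ["Сьогодні гарна погода."]
refs = [["Сьогодні чудова погода."]]

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs)
print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}")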

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in bfloat16 and attach the merged bidirectional adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "lapa-llm/lapa-v0.1.2-instruct",
    device_map="auto",
    dtype="bfloat16",
)
model = PeftModel.from_pretrained(
    base_model,
    "iamthewalrus67/lapa-instruct-bidi-grpo",
    adapter_name="bidi",
    subfolder="bidi",
)
tokenizer = AutoTokenizer.from_pretrained("lapa-llm/lapa-v0.1.2-instruct")

# Translator-style instruction prompt; the model returns only the translation.
messages = [
    {
        "role": "user",
        "content": (
            "You are a professional translator. You give only the translated "
            "text and nothing else. Translate the following text into Ukrainian:\n"
            "The weather is nice today."
        ),
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# top_k=1 makes sampling effectively greedy despite do_sample=True.
output = model.generate(input_ids, max_new_tokens=256, temperature=0.1, top_k=1, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
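
For the uk->en direction, the same prompt template applies with the target language swapped (this assumes symmetric prompting, consistent with the bidirectional setup described above):

messages = [{"role": "user", "content": "You are a professional translator. You give only the translated text and nothing else. Translate the following text into English:\nСьогодні гарна погода."}]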

Training Details

  • Base model: lapa-v0.1.2-instruct (Gemma-3 12B, Ukrainian-specialized)
  • Method: GRPO with direction-specific reward configurations (a minimal reward sketch follows this list)
  • LoRA: r=128, alpha=256, targets=q/k/v/o projections
  • Data: WikiMatrix en-uk (132K pairs)
  • Training: 300 steps, DeepSpeed ZeRO-2, 4x RTX 6000 Ada
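
As an illustration of the reward configurations, Condition A scored completions with chrF + BLEU. The exact weighting and trainer interface aren't documented here, so this is a minimal sketch assuming TRL's GRPOTrainer reward-function convention and an equal-weight average of sentence-level sacrebleu scores rescaled to [0, 1]:

import sacrebleu

def chrf_bleu_reward(completions, references, **kwargs):
    # One reward per completion: the mean of sentence-level chrF and BLEU,
    # each rescaled from sacrebleu's [0, 100] range to [0, 1].
    rewards = []
    for hyp, ref in zip(completions, references):
        chrf = sacrebleu.sentence_chrf(hyp, [ref]).score
        bleu = sacrebleu.sentence_bleu(hyp, [ref]).score
        rewards.append((chrf + bleu) / 200.0)
    return rewards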

See reward-driven-translation for full reproduction code.
