---
license: mit
base_model:
- FacebookAI/xlm-roberta-large
language:
- ru
tags:
- reasoning
- logical-analysis
- text-classification
- ai-safety
- evaluation
- judge-model
- argumentation
pipeline_tag: text-classification
---
# RQA — Reasoning Quality Analyzer (R2)
**RQA-R2** is a **judge model** for reasoning-quality evaluation.
It does **not** generate, rewrite, or explain text. Instead, it determines whether a text contains a reasoning problem, whether that problem is **hidden** or **explicit**, and which explicit error types are present.
> RQA is a judge, not a teacher and not a generator.
---
## What Is New in R2 Compared to R1
R2 is not just a retrain of R1. It is a full methodological upgrade.
### Core differences
- **R1** used a more limited 2-signal setup.
- **R2** uses a strict **3-head ontology**:
- `has_issue`
- `is_hidden`
- `error_types`
### Key improvements in R2
- explicit hidden-problem modeling instead of weaker implicit logic
- strict `logical / hidden / explicit` inference contract
- honest `train / val / calib / test` split
- separate calibration split for temperatures and thresholds
- per-class thresholds for error types
- uncertainty-aware inference with `status=uncertain` and `review_required`
- duplicate and conflict-duplicate filtering in the loader
- truncation audit and richer evaluation reports
- better optimizer setup for transformer fine-tuning
- staged encoder fine-tuning with freeze/unfreeze
- stronger schema/version safety for inference artifacts
In short:
> **R1** was a strong prototype.
> **R2** is the first version that behaves like a full training + calibration + inference pipeline.
---
## What Problem RQA-R2 Solves
Texts written by humans or LLMs can:
- sound coherent
- use correct vocabulary
- appear persuasive
...while still containing **reasoning problems** that are:
- subtle
- structural
- hidden in argumentation
RQA-R2 focuses specifically on **reasoning quality**, not on style, grammar, sentiment, or factual correctness.
---
## Model Overview
| Property | Value |
|---|---|
| Model Type | Judge / Evaluator |
| Base Encoder | [XLM-RoBERTa Large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
| Pooling | Mean pooling |
| Heads | 3 (`has_issue`, `is_hidden`, `error_types`) |
| Language | Russian |
| License | MIT |
---
## What the Model Predicts
RQA-R2 predicts three connected outputs.
### 1. Logical Issue Detection
- `has_logical_issue ∈ {false, true}`
- calibrated probability available
### 2. Hidden Problem Detection
- `is_hidden_problem ∈ {false, true}`
- evaluated only when a reasoning issue exists
### 3. Explicit Error Type Classification
If the text is classified as `explicit`, the model may assign one or more of the following error types:
- `false_causality`
- `unsupported_claim`
- `overgeneralization`
- `missing_premise`
- `contradiction`
- `circular_reasoning`
This is a **multi-label** prediction head.
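Decoding a multi-label head like this typically means applying a sigmoid to each class logit and comparing the result against that class's calibrated threshold. A minimal sketch (the decoding function, logits, and thresholds here are illustrative, not the shipped inference code):

```python
import math

ERROR_TYPES = [
    "false_causality", "unsupported_claim", "overgeneralization",
    "missing_premise", "contradiction", "circular_reasoning",
]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decode_error_types(logits, thresholds):
    """Keep every error type whose probability clears its own per-class threshold."""
    out = []
    for name, logit, thr in zip(ERROR_TYPES, logits, thresholds):
        p = sigmoid(logit)
        if p >= thr:
            out.append({"type": name, "probability": round(p, 3), "threshold": thr})
    return out
```

Because the head is multi-label, zero, one, or several types can clear their thresholds for the same text.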
---
## Ontology
R2 uses a strict three-class reasoning ontology.
### `logical`
- no reasoning issue
- no hidden problem
- no explicit errors
### `hidden`
- reasoning problem exists
- no explicit labeled fallacy
- the issue is structural, implicit, or argumentative
### `explicit`
- reasoning problem exists
- at least one explicit error type is present
This ontology is enforced in both training and inference.
---
## Inference Contract
RQA-R2 uses gated inference:
- if `has_issue = false` -> class is `logical`, no errors are returned
- if `has_issue = true` and `is_hidden = true` -> class is `hidden`, no explicit errors are returned
- if `has_issue = true` and `is_hidden = false` -> class is `explicit`, explicit errors may be returned
R2 also supports:
- calibrated thresholds
- `uncertain` mode
- `review_required` for borderline cases
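The gating rules above can be sketched as a small decision function. The default thresholds and the uncertainty band are illustrative assumptions, not the calibrated values shipped with the model:

```python
def classify(p_issue: float, p_hidden: float,
             thr_issue: float = 0.5, thr_hidden: float = 0.5,
             band: float = 0.05):
    """Map calibrated head probabilities to the R2 ontology with an uncertainty band."""
    # A gate probability inside the band around its threshold is flagged
    # for review instead of being forced into a class.
    if abs(p_issue - thr_issue) < band:
        return {"class": None, "status": "uncertain", "review_required": True}
    if p_issue < thr_issue:
        return {"class": "logical", "status": "ok", "review_required": False}
    if abs(p_hidden - thr_hidden) < band:
        return {"class": None, "status": "uncertain", "review_required": True}
    if p_hidden >= thr_hidden:
        return {"class": "hidden", "status": "ok", "review_required": False}
    return {"class": "explicit", "status": "ok", "review_required": False}
```

Note that the hidden gate is only consulted once the issue gate has fired, mirroring the contract: explicit error types can only be returned on the `explicit` branch.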
---
## Architecture
RQA-R2 is built on top of **XLM-RoBERTa Large** with:
- mean pooling
- separate projections per task
- separate dropout per head
- 3 task-specific heads
- uncertainty-weighted multi-task training
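The head layout can be sketched roughly as below. Hidden size matches XLM-RoBERTa Large and the six error classes match the model card; the module name, projection shapes, and dropout rate are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RQAHeads(nn.Module):
    """Sketch: mask-aware mean pooling plus three task-specific heads."""
    def __init__(self, hidden: int = 1024, n_errors: int = 6, p_drop: float = 0.1):
        super().__init__()
        self.issue_head = nn.Sequential(nn.Dropout(p_drop), nn.Linear(hidden, 1))
        self.hidden_head = nn.Sequential(nn.Dropout(p_drop), nn.Linear(hidden, 1))
        self.error_head = nn.Sequential(nn.Dropout(p_drop), nn.Linear(hidden, n_errors))

    def forward(self, last_hidden_state, attention_mask):
        # Mean pooling over real (non-padding) tokens only.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return {
            "issue_logit": self.issue_head(pooled).squeeze(-1),
            "hidden_logit": self.hidden_head(pooled).squeeze(-1),
            "error_logits": self.error_head(pooled),
        }
```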
Training is hierarchical:
- `has_issue` is trained on all samples
- `is_hidden` is trained only on problem samples
- `error_types` are trained only on explicit samples
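The hierarchical scheme above can be sketched as masked per-head losses: the hidden head only receives gradient from problem samples, and the error head only from explicit samples. This sketch uses a plain sum of losses rather than the uncertainty-weighted variant:

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(out, y_issue, y_hidden, y_errors):
    """Masked multi-task loss sketch: each head sees only the samples it is defined on."""
    loss = F.binary_cross_entropy_with_logits(out["issue_logit"], y_issue)

    issue_mask = y_issue.bool()  # is_hidden is defined only for problem samples
    if issue_mask.any():
        loss = loss + F.binary_cross_entropy_with_logits(
            out["hidden_logit"][issue_mask], y_hidden[issue_mask])

    explicit_mask = issue_mask & ~y_hidden.bool()  # errors only for explicit samples
    if explicit_mask.any():
        loss = loss + F.binary_cross_entropy_with_logits(
            out["error_logits"][explicit_mask], y_errors[explicit_mask])
    return loss
```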
---
## Training and Calibration
R2 uses an honest experimental structure:
- `train` for fitting
- `val` for model selection
- `calib` for temperature scaling and threshold tuning
- `test` for final held-out evaluation
Calibration includes:
- issue temperature
- hidden temperature
- per-class error temperatures
- threshold selection for `has_issue`
- threshold selection for `is_hidden`
- per-class thresholds for error types
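Temperature scaling on the `calib` split can be sketched as a grid search over `T` that minimizes binary NLL of `sigmoid(logit / T)`; the grid range and the data are illustrative assumptions, not the project's actual calibration code:

```python
import numpy as np

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 151)):
    """Pick the temperature T minimizing binary NLL on a calibration set."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        p = 1.0 / (1.0 + np.exp(-logits / t))
        p = np.clip(p, 1e-7, 1 - 1e-7)
        nll = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

On overconfident logits the fitted `T` comes out above 1, flattening the probabilities; the same routine can be run per class for the error head.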
---
## Held-Out Synthetic Benchmark
The following metrics were obtained on the current held-out synthetic test split used for R2:
- `Issue`: `F1 = 0.988`, `FPR = 0.029`, `PR-AUC = 0.999`
- `Hidden`: `F1 = 0.960`, `PR-AUC = 0.994`
- `Errors`: `macro-F1 = 0.822`, `micro-F1 = 0.813`, `samples-F1 = 0.838`
- `Top-level class macro-F1 = 0.964`
- `Coverage = 95.6%`
- `Uncertain rate = 4.4%`
These are strong results for the current data regime.
Important:
> These metrics are measured on a held-out split from the current synthetic dataset.
> They demonstrate that the R2 design works very well in-distribution, but they should not be interpreted as proof of universal real-world reasoning performance.
---
## Training Data
RQA-R2 was trained on a custom reasoning-quality dataset with:
- `7292` total samples
- `3150` logical texts
- `4142` problematic texts
- `1242` hidden problems
- `2900` explicit cases
Error-label counts:
- `false_causality`: `518`
- `unsupported_claim`: `524`
- `overgeneralization`: `599`
- `missing_premise`: `537`
- `contradiction`: `475`
- `circular_reasoning`: `540`
Multi-label explicit cases:
- `293`
The current dataset is sufficient for training and benchmarking R2, but it remains primarily **synthetic** and should be expanded with real-world data in future versions.
---
## Intended Use
### Recommended for
- reasoning-quality evaluation
- LLM output auditing
- AI safety pipelines
- judge/reranker pipelines
- pre-filtering for downstream review
- analytical tooling around argument structure
### Not intended for
- text generation
- explanation generation
- automatic rewriting or correction
- factual verification
- legal or scientific truth adjudication
---
## Output Example
```json
{
"class": "explicit",
"status": "ok",
"review_required": false,
"has_logical_issue": true,
"has_issue_probability": 0.9993,
"is_hidden_problem": false,
"hidden_probability": 0.021,
"errors": [
{
"type": "missing_premise",
"probability": 0.923,
"threshold": 0.54
}
]
}
```
---
## Limitations
RQA-R2 still has important limits:
- it evaluates reasoning structure, not factual truth
- hidden problems remain partly subjective by nature
- the current benchmark is still synthetic and in-distribution
- real human-written texts and outputs from other LLMs may be harder
- the model should still be validated externally before being treated as a fully general reasoning judge
Also note:
- R2 was optimized toward low false positives, but on the current held-out synthetic test set the observed `Issue FPR` is `2.9%`, not `1.0%`
- if strict false-positive control is critical, threshold tuning may need to be tightened further for the target deployment environment
---
## Recommended Next Step
The best next step after R2 is external validation on:
- human-written argumentative texts
- outputs from other LLM families
- paraphrased and adversarially reworded samples
- harder hidden-problem cases
That is the correct way to turn a strong in-distribution result into a robust real-world system.
---
## Summary
RQA-R2 is a major upgrade over R1:
- better ontology
- better training logic
- better calibration
- better inference safety
- stronger held-out synthetic performance
R1 proved the idea.
**R2 is the first version that fully validates it.**