---
license: mit
base_model:
  - FacebookAI/xlm-roberta-large
language:
  - ru
tags:
  - reasoning
  - logical-analysis
  - text-classification
  - ai-safety
  - evaluation
  - judge-model
  - argumentation
pipeline_tag: text-classification
---

# RQA — Reasoning Quality Analyzer (R2)

**RQA-R2** is a **judge model** for reasoning-quality evaluation.  
It does **not** generate, rewrite, or explain text. Instead, it determines whether a text contains a reasoning problem, whether that problem is **hidden** or **explicit**, and which explicit error types are present.

> RQA is a judge, not a teacher and not a generator.

---

## What Is New in R2 Compared to R1

R2 is not just a retrain of R1. It is a full methodological upgrade.

### Core differences

- **R1** used a more limited 2-signal setup.
- **R2** uses a strict **3-head ontology**:
  - `has_issue`
  - `is_hidden`
  - `error_types`

### Key improvements in R2

- explicit hidden-problem modeling instead of weaker implicit logic
- strict `logical / hidden / explicit` inference contract
- honest `train / val / calib / test` split
- separate calibration split for temperatures and thresholds
- per-class thresholds for error types
- uncertainty-aware inference with `status=uncertain` and `review_required`
- duplicate and conflict-duplicate filtering in the loader
- truncation audit and richer evaluation reports
- better optimizer setup for transformer fine-tuning
- staged encoder fine-tuning with freeze/unfreeze
- stronger schema/version safety for inference artifacts

In short:

> **R1** was a strong prototype.  
> **R2** is the first version that behaves like a full training + calibration + inference pipeline.

---

## What Problem RQA-R2 Solves

Texts written by humans or LLMs can:

- sound coherent
- use correct vocabulary
- appear persuasive

...while still containing **reasoning problems** that are:

- subtle
- structural
- hidden in argumentation

RQA-R2 focuses specifically on **reasoning quality**, not on style, grammar, sentiment, or factual correctness.

---

## Model Overview

| Property | Value |
|---|---|
| Model Type | Judge / Evaluator |
| Base Encoder | [XLM-RoBERTa Large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
| Pooling | Mean pooling |
| Heads | 3 (`has_issue`, `is_hidden`, `error_types`) |
| Language | Russian |
| License | MIT |

---

## What the Model Predicts

RQA-R2 predicts three connected outputs.

### 1. Logical Issue Detection

- `has_logical_issue ∈ {false, true}`
- calibrated probability available

### 2. Hidden Problem Detection

- `is_hidden_problem ∈ {false, true}`
- evaluated only when a reasoning issue exists

### 3. Explicit Error Type Classification

If the text is classified as `explicit`, the model may assign one or more of the following error types:

- `false_causality`
- `unsupported_claim`
- `overgeneralization`
- `missing_premise`
- `contradiction`
- `circular_reasoning`

This is a **multi-label** prediction head.

---

## Ontology

R2 uses a strict three-class reasoning ontology.

### `logical`

- no reasoning issue
- no hidden problem
- no explicit errors

### `hidden`

- reasoning problem exists
- no explicit labeled fallacy
- the issue is structural, implicit, or argumentative

### `explicit`

- reasoning problem exists
- at least one explicit error type is present

This ontology is enforced in both training and inference.

---

## Inference Contract

RQA-R2 uses gated inference:

- if `has_issue = false` → class is `logical`, no errors are returned
- if `has_issue = true` and `is_hidden = true` → class is `hidden`, no explicit errors are returned
- if `has_issue = true` and `is_hidden = false` → class is `explicit`, explicit errors may be returned

R2 also supports:

- calibrated thresholds
- `uncertain` mode
- `review_required` for borderline cases
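
The contract above can be sketched as a small gating function. This is a minimal illustration, not the model's actual inference code: the threshold values, the uncertainty band, and the output field names are assumptions chosen to mirror the output example later in this card.

```python
# Hedged sketch of the gated inference contract: logical / hidden / explicit.
# All threshold constants below are illustrative; R2 tunes its real
# thresholds on the dedicated calibration split.

ISSUE_THRESHOLD = 0.5    # assumed threshold for has_issue
HIDDEN_THRESHOLD = 0.5   # assumed threshold for is_hidden
UNCERTAIN_BAND = 0.1     # probabilities this close to the issue threshold -> uncertain

def gate(p_issue, p_hidden, error_probs, error_thresholds):
    """Map calibrated probabilities to the logical / hidden / explicit contract."""
    # Borderline issue probability: abstain and flag for review.
    if abs(p_issue - ISSUE_THRESHOLD) < UNCERTAIN_BAND:
        return {"class": None, "status": "uncertain", "review_required": True}
    # No issue: class is logical, no errors returned.
    if p_issue < ISSUE_THRESHOLD:
        return {"class": "logical", "status": "ok", "errors": []}
    # Issue present and hidden: no explicit errors returned.
    if p_hidden >= HIDDEN_THRESHOLD:
        return {"class": "hidden", "status": "ok", "errors": []}
    # Issue present and not hidden: return error types over their per-class thresholds.
    errors = [
        {"type": name, "probability": p, "threshold": error_thresholds[name]}
        for name, p in error_probs.items()
        if p >= error_thresholds[name]
    ]
    return {"class": "explicit", "status": "ok", "errors": errors}
```

Note how the per-class error thresholds only come into play after both gates have fired, which is exactly what makes the contract hierarchical.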

---

## Architecture

RQA-R2 is built on top of **XLM-RoBERTa Large** with:

- mean pooling
- separate projections per task
- separate dropout per head
- 3 task-specific heads
- uncertainty-weighted multi-task training
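
Mean pooling over the encoder output can be sketched as follows; this is a generic attention-mask-aware implementation (shapes and names are illustrative), not the repository's exact code.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    hidden_states: (batch, seq_len, hidden) encoder output
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1e-9)  # avoid division by zero on empty rows
    return summed / counts
```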

Training is hierarchical:

- `has_issue` is trained on all samples
- `is_hidden` is trained only on problem samples
- `error_types` are trained only on explicit samples
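
The hierarchical masking can be illustrated with a toy loss computation. Binary cross-entropy here is a stand-in for whatever loss and weighting the actual pipeline uses; the point is only that the `is_hidden` loss is averaged over problem samples alone (the `error_types` head would be masked to explicit samples the same way).

```python
import numpy as np

def bce(p: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Elementwise binary cross-entropy on probabilities."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def hierarchical_loss(p_issue, y_issue, p_hidden, y_hidden):
    """Toy two-head version of the hierarchical training objective."""
    issue_loss = bce(p_issue, y_issue).mean()       # computed on all samples
    mask = y_issue == 1                             # problem samples only
    hidden_loss = bce(p_hidden[mask], y_hidden[mask]).mean() if mask.any() else 0.0
    return issue_loss + hidden_loss
```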

---

## Training and Calibration

R2 uses an honest experimental structure:

- `train` for fitting
- `val` for model selection
- `calib` for temperature scaling and threshold tuning
- `test` for final held-out evaluation

Calibration includes:

- issue temperature
- hidden temperature
- per-class error temperatures
- threshold selection for `has_issue`
- threshold selection for `is_hidden`
- per-class thresholds for error types
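
Temperature scaling on a single binary head reduces to dividing the logit by a fitted temperature before the sigmoid. The sketch below shows the mechanism only; the actual temperatures are fitted on the calib split and are not published in this card.

```python
import math

def calibrated_prob(logit: float, temperature: float) -> float:
    """Sigmoid of a temperature-scaled logit.

    T > 1 softens overconfident probabilities toward 0.5;
    T < 1 sharpens them. T is fitted per head (and per error class).
    """
    return 1.0 / (1.0 + math.exp(-logit / temperature))
```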

---

## Held-Out Synthetic Benchmark

The following metrics were obtained on the current held-out synthetic test split used for R2:

- `Issue`: `F1 = 0.988`, `FPR = 0.029`, `PR-AUC = 0.999`
- `Hidden`: `F1 = 0.960`, `PR-AUC = 0.994`
- `Errors`: `macro-F1 = 0.822`, `micro-F1 = 0.813`, `samples-F1 = 0.838`
- `Top-level class macro-F1 = 0.964`
- `Coverage = 95.6%`
- `Uncertain rate = 4.4%`

These are strong results for the current data regime.

Important:

> These metrics are measured on a held-out split from the current synthetic dataset.  
> They demonstrate that the R2 design works very well in-distribution, but they should not be interpreted as proof of universal real-world reasoning performance.

---

## Training Data

RQA-R2 was trained on a custom reasoning-quality dataset with:

- `7292` total samples
- `3150` logical texts
- `4142` problematic texts
- `1242` hidden problems
- `2900` explicit cases

Error-label counts:

- `false_causality`: `518`
- `unsupported_claim`: `524`
- `overgeneralization`: `599`
- `missing_premise`: `537`
- `contradiction`: `475`
- `circular_reasoning`: `540`

Multi-label explicit cases:

- `293`

The current dataset is already strong enough for training and benchmarking R2, but it remains primarily **synthetic** and should be expanded with real-world data in future versions.

---

## Intended Use

### Recommended for

- reasoning-quality evaluation
- LLM output auditing
- AI safety pipelines
- judge/reranker pipelines
- pre-filtering for downstream review
- analytical tooling around argument structure

### Not intended for

- text generation
- explanation generation
- automatic rewriting or correction
- factual verification
- legal or scientific truth adjudication

---

## Output Example

```json
{
  "class": "explicit",
  "status": "ok",
  "review_required": false,
  "has_logical_issue": true,
  "has_issue_probability": 0.9993,
  "is_hidden_problem": false,
  "hidden_probability": 0.021,
  "errors": [
    {
      "type": "missing_premise",
      "probability": 0.923,
      "threshold": 0.54
    }
  ]
}
```

---

## Limitations

RQA-R2 still has important limits:

- it evaluates reasoning structure, not factual truth
- hidden problems remain partly subjective by nature
- the current benchmark is still synthetic and in-distribution
- real human-written texts and outputs from other LLMs may be harder
- the model should still be validated externally before being treated as a fully general reasoning judge

Also note:

- R2 was optimized toward low false positives, but on the current held-out synthetic test set the observed `Issue FPR` is `2.9%`, not `1.0%`
- if strict false-positive control is critical, threshold tuning may need to be tightened further for the target deployment environment

---

## Recommended Next Step

The best next step after R2 is external validation on:

- human-written argumentative texts
- outputs from other LLM families
- paraphrased and adversarially reworded samples
- harder hidden-problem cases

That is the correct way to turn a strong in-distribution result into a robust real-world system.

---

## Summary

RQA-R2 is a major upgrade over R1:

- better ontology
- better training logic
- better calibration
- better inference safety
- stronger held-out synthetic performance

R1 proved the idea.  
**R2 is the first version that fully validates it.**