---
license: mit
base_model:
- FacebookAI/xlm-roberta-large
language:
- ru
tags:
- reasoning
- logical-analysis
- text-classification
- ai-safety
- evaluation
- judge-model
- argumentation
pipeline_tag: text-classification
---
# RQA — Reasoning Quality Analyzer (R2)
**RQA-R2** is a **judge model** for reasoning-quality evaluation.
It does **not** generate, rewrite, or explain text. Instead, it determines whether a text contains a reasoning problem, whether that problem is **hidden** or **explicit**, and which explicit error types are present.
> RQA is a judge, not a teacher and not a generator.
---
## What Is New in R2 Compared to R1
R2 is not just a retrain of R1. It is a full methodological upgrade.
### Core differences
- **R1** used a more limited 2-signal setup.
- **R2** uses a strict **3-head ontology**:
- `has_issue`
- `is_hidden`
- `error_types`
### Key improvements in R2
- explicit hidden-problem modeling instead of weaker implicit logic
- strict `logical / hidden / explicit` inference contract
- honest `train / val / calib / test` split
- separate calibration split for temperatures and thresholds
- per-class thresholds for error types
- uncertainty-aware inference with `status=uncertain` and `review_required`
- duplicate and conflict-duplicate filtering in the loader
- truncation audit and richer evaluation reports
- better optimizer setup for transformer fine-tuning
- staged encoder fine-tuning with freeze/unfreeze
- stronger schema/version safety for inference artifacts
In short:
> **R1** was a strong prototype.
> **R2** is the first version that behaves like a full training + calibration + inference pipeline.
---
## What Problem RQA-R2 Solves
Texts written by humans or LLMs can:
- sound coherent
- use correct vocabulary
- appear persuasive
...while still containing **reasoning problems** that are:
- subtle
- structural
- hidden in argumentation
RQA-R2 focuses specifically on **reasoning quality**, not on style, grammar, sentiment, or factual correctness.
---
## Model Overview
| Property | Value |
|---|---|
| Model Type | Judge / Evaluator |
| Base Encoder | [XLM-RoBERTa Large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
| Pooling | Mean pooling |
| Heads | 3 (`has_issue`, `is_hidden`, `error_types`) |
| Language | Russian |
| License | MIT |
---
## What the Model Predicts
RQA-R2 predicts three connected outputs.
### 1. Logical Issue Detection
- `has_logical_issue ∈ {false, true}`
- calibrated probability available
### 2. Hidden Problem Detection
- `is_hidden_problem ∈ {false, true}`
- evaluated only when a reasoning issue exists
### 3. Explicit Error Type Classification
If the text is classified as `explicit`, the model may assign one or more of the following error types:
- `false_causality`
- `unsupported_claim`
- `overgeneralization`
- `missing_premise`
- `contradiction`
- `circular_reasoning`
This is a **multi-label** prediction head.
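Decoding a multi-label head like this typically means applying a sigmoid to each class logit and comparing the result against that class's calibrated threshold. A minimal sketch (the decoding function, logits, and thresholds here are illustrative, not the shipped inference code):

```python
import math

ERROR_TYPES = [
    "false_causality", "unsupported_claim", "overgeneralization",
    "missing_premise", "contradiction", "circular_reasoning",
]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decode_error_types(logits, thresholds):
    """Keep every error type whose probability clears its own per-class threshold."""
    out = []
    for name, logit, thr in zip(ERROR_TYPES, logits, thresholds):
        p = sigmoid(logit)
        if p >= thr:
            out.append({"type": name, "probability": round(p, 3), "threshold": thr})
    return out
```

Because the head is multi-label, zero, one, or several types can clear their thresholds for the same text.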
---
## Ontology
R2 uses a strict three-class reasoning ontology.
### `logical`
- no reasoning issue
- no hidden problem
- no explicit errors
### `hidden`
- reasoning problem exists
- no explicit labeled fallacy
- the issue is structural, implicit, or argumentative
### `explicit`
- reasoning problem exists
- at least one explicit error type is present
This ontology is enforced in both training and inference.
---
## Inference Contract
RQA-R2 uses gated inference:
- if `has_issue = false` -> class is `logical`, no errors are returned
- if `has_issue = true` and `is_hidden = true` -> class is `hidden`, no explicit errors are returned
- if `has_issue = true` and `is_hidden = false` -> class is `explicit`, explicit errors may be returned
R2 also supports:
- calibrated thresholds
- `uncertain` mode
- `review_required` for borderline cases
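The gating rules above can be sketched as a small decision function. The default thresholds and the uncertainty band are illustrative assumptions, not the calibrated values shipped with the model:

```python
def classify(p_issue: float, p_hidden: float,
             thr_issue: float = 0.5, thr_hidden: float = 0.5,
             band: float = 0.05):
    """Map calibrated head probabilities to the R2 ontology with an uncertainty band."""
    # A gate probability inside the band around its threshold is flagged
    # for review instead of being forced into a class.
    if abs(p_issue - thr_issue) < band:
        return {"class": None, "status": "uncertain", "review_required": True}
    if p_issue < thr_issue:
        return {"class": "logical", "status": "ok", "review_required": False}
    if abs(p_hidden - thr_hidden) < band:
        return {"class": None, "status": "uncertain", "review_required": True}
    if p_hidden >= thr_hidden:
        return {"class": "hidden", "status": "ok", "review_required": False}
    return {"class": "explicit", "status": "ok", "review_required": False}
```

Note that the hidden gate is only consulted once the issue gate has fired, mirroring the contract: explicit error types can only be returned on the `explicit` branch.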
---
## Architecture
RQA-R2 is built on top of **XLM-RoBERTa Large** with:
- mean pooling
- separate projections per task
- separate dropout per head
- 3 task-specific heads
- uncertainty-weighted multi-task training
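The head layout can be sketched roughly as below. Hidden size matches XLM-RoBERTa Large and the six error classes match the model card; the module name, projection shapes, and dropout rate are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RQAHeads(nn.Module):
    """Sketch: mask-aware mean pooling plus three task-specific heads."""
    def __init__(self, hidden: int = 1024, n_errors: int = 6, p_drop: float = 0.1):
        super().__init__()
        self.issue_head = nn.Sequential(nn.Dropout(p_drop), nn.Linear(hidden, 1))
        self.hidden_head = nn.Sequential(nn.Dropout(p_drop), nn.Linear(hidden, 1))
        self.error_head = nn.Sequential(nn.Dropout(p_drop), nn.Linear(hidden, n_errors))

    def forward(self, last_hidden_state, attention_mask):
        # Mean pooling over real (non-padding) tokens only.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return {
            "issue_logit": self.issue_head(pooled).squeeze(-1),
            "hidden_logit": self.hidden_head(pooled).squeeze(-1),
            "error_logits": self.error_head(pooled),
        }
```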
Training is hierarchical:
- `has_issue` is trained on all samples
- `is_hidden` is trained only on problem samples
- `error_types` are trained only on explicit samples
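The hierarchical scheme above can be sketched as masked per-head losses: the hidden head only receives gradient from problem samples, and the error head only from explicit samples. This sketch uses a plain sum of losses rather than the uncertainty-weighted variant:

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(out, y_issue, y_hidden, y_errors):
    """Masked multi-task loss sketch: each head sees only the samples it is defined on."""
    loss = F.binary_cross_entropy_with_logits(out["issue_logit"], y_issue)

    issue_mask = y_issue.bool()  # is_hidden is defined only for problem samples
    if issue_mask.any():
        loss = loss + F.binary_cross_entropy_with_logits(
            out["hidden_logit"][issue_mask], y_hidden[issue_mask])

    explicit_mask = issue_mask & ~y_hidden.bool()  # errors only for explicit samples
    if explicit_mask.any():
        loss = loss + F.binary_cross_entropy_with_logits(
            out["error_logits"][explicit_mask], y_errors[explicit_mask])
    return loss
```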
---
## Training and Calibration
R2 uses an honest experimental structure:
- `train` for fitting
- `val` for model selection
- `calib` for temperature scaling and threshold tuning
- `test` for final held-out evaluation
Calibration includes:
- issue temperature
- hidden temperature
- per-class error temperatures
- threshold selection for `has_issue`
- threshold selection for `is_hidden`
- per-class thresholds for error types
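Temperature scaling on the `calib` split can be sketched as a grid search over `T` that minimizes binary NLL of `sigmoid(logit / T)`; the grid range and the data are illustrative assumptions, not the project's actual calibration code:

```python
import numpy as np

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 151)):
    """Pick the temperature T minimizing binary NLL on a calibration set."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        p = 1.0 / (1.0 + np.exp(-logits / t))
        p = np.clip(p, 1e-7, 1 - 1e-7)
        nll = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

On overconfident logits the fitted `T` comes out above 1, flattening the probabilities; the same routine can be run per class for the error head.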
---
## Held-Out Synthetic Benchmark
The following metrics were obtained on the current held-out synthetic test split used for R2:
- `Issue`: `F1 = 0.988`, `FPR = 0.029`, `PR-AUC = 0.999`
- `Hidden`: `F1 = 0.960`, `PR-AUC = 0.994`
- `Errors`: `macro-F1 = 0.822`, `micro-F1 = 0.813`, `samples-F1 = 0.838`
- `Top-level class macro-F1 = 0.964`
- `Coverage = 95.6%`
- `Uncertain rate = 4.4%`
These are strong results for the current data regime.
Important:
> These metrics are measured on a held-out split from the current synthetic dataset.
> They demonstrate that the R2 design works very well in-distribution, but they should not be interpreted as proof of universal real-world reasoning performance.
---
## Training Data
RQA-R2 was trained on a custom reasoning-quality dataset with:
- `7292` total samples
- `3150` logical texts
- `4142` problematic texts
- `1242` hidden problems
- `2900` explicit cases
Error-label counts:
- `false_causality`: `518`
- `unsupported_claim`: `524`
- `overgeneralization`: `599`
- `missing_premise`: `537`
- `contradiction`: `475`
- `circular_reasoning`: `540`
Multi-label explicit cases:
- `293`
The current dataset is sufficient for training and benchmarking R2, but it remains primarily **synthetic** and should be expanded with real-world data in future versions.
---
## Intended Use
### Recommended for
- reasoning-quality evaluation
- LLM output auditing
- AI safety pipelines
- judge/reranker pipelines
- pre-filtering for downstream review
- analytical tooling around argument structure
### Not intended for
- text generation
- explanation generation
- automatic rewriting or correction
- factual verification
- legal or scientific truth adjudication
---
## Output Example
```json
{
"class": "explicit",
"status": "ok",
"review_required": false,
"has_logical_issue": true,
"has_issue_probability": 0.9993,
"is_hidden_problem": false,
"hidden_probability": 0.021,
"errors": [
{
"type": "missing_premise",
"probability": 0.923,
"threshold": 0.54
}
]
}
```
---
## Limitations
RQA-R2 still has important limits:
- it evaluates reasoning structure, not factual truth
- hidden problems remain partly subjective by nature
- the current benchmark is still synthetic and in-distribution
- real human-written texts and outputs from other LLMs may be harder
- the model should still be validated externally before being treated as a fully general reasoning judge
Also note:
- R2 was optimized toward low false positives, but on the current held-out synthetic test set the observed `Issue FPR` is `2.9%`, not `1.0%`
- if strict false-positive control is critical, threshold tuning may need to be tightened further for the target deployment environment
---
## Recommended Next Step
The best next step after R2 is external validation on:
- human-written argumentative texts
- outputs from other LLM families
- paraphrased and adversarially reworded samples
- harder hidden-problem cases
That is the correct way to turn a strong in-distribution result into a robust real-world system.
---
## Summary
RQA-R2 is a major upgrade over R1:
- better ontology
- better training logic
- better calibration
- better inference safety
- stronger held-out synthetic performance
R1 proved the idea.
**R2 is the first version that fully validates it.**