---
license: mit
base_model:
- FacebookAI/xlm-roberta-large
language:
- ru
tags:
- reasoning
- logical-analysis
- text-classification
- ai-safety
- evaluation
- judge-model
- argumentation
pipeline_tag: text-classification
---

# RQA — Reasoning Quality Analyzer (R2)

**RQA-R2** is a **judge model** for reasoning-quality evaluation. It does **not** generate, rewrite, or explain text. Instead, it determines whether a text contains a reasoning problem, whether that problem is **hidden** or **explicit**, and which explicit error types are present.

> RQA is a judge, not a teacher and not a generator.

---

## What Is New in R2 Compared to R1

R2 is not just a retrain of R1; it is a full methodological upgrade.

### Core differences

- **R1** used a more limited 2-signal setup.
- **R2** uses a strict **3-head ontology**:
  - `has_issue`
  - `is_hidden`
  - `error_types`

### Key improvements in R2

- explicit hidden-problem modeling instead of weaker implicit logic
- strict `logical / hidden / explicit` inference contract
- honest `train / val / calib / test` split
- separate calibration split for temperatures and thresholds
- per-class thresholds for error types
- uncertainty-aware inference with `status=uncertain` and `review_required`
- duplicate and conflict-duplicate filtering in the loader
- truncation audit and richer evaluation reports
- better optimizer setup for transformer fine-tuning
- staged encoder fine-tuning with freeze/unfreeze
- stronger schema/version safety for inference artifacts

In short:

> **R1** was a strong prototype.
> **R2** is the first version that behaves like a full training + calibration + inference pipeline.
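The duplicate and conflict-duplicate filtering listed above can be illustrated with a small standalone sketch. This is not the loader from the RQA codebase; the sample field names (`text`, `has_issue`, `is_hidden`, `error_types`) are assumptions that mirror the head names used in this card.

```python
from typing import Dict, List, Tuple

def filter_duplicates(samples: List[Dict]) -> List[Dict]:
    """Drop exact duplicate texts, keeping the first copy, and drop every
    copy of a text that appears with conflicting labels (a conflict duplicate).

    Illustrative sketch only; field names are assumptions, not the RQA loader.
    """
    # Collect every label assignment seen for each text.
    by_text: Dict[str, List[Tuple]] = {}
    for s in samples:
        label = (s["has_issue"], s["is_hidden"], tuple(sorted(s["error_types"])))
        by_text.setdefault(s["text"], []).append(label)

    kept, seen = [], set()
    for s in samples:
        text = s["text"]
        if len(set(by_text[text])) > 1:  # conflicting labels -> drop all copies
            continue
        if text in seen:                 # plain duplicate -> keep first copy only
            continue
        seen.add(text)
        kept.append(s)
    return kept
```

Dropping all copies of a conflict duplicate (rather than picking one label arbitrarily) keeps contradictory supervision out of every split.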
---

## What Problem RQA-R2 Solves

Texts written by humans or LLMs can:

- sound coherent
- use correct vocabulary
- appear persuasive

...while still containing **reasoning problems** that are:

- subtle
- structural
- hidden in argumentation

RQA-R2 focuses specifically on **reasoning quality**, not on style, grammar, sentiment, or factual correctness.

---

## Model Overview

| Property | Value |
|---|---|
| Model Type | Judge / Evaluator |
| Base Encoder | [XLM-RoBERTa Large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
| Pooling | Mean pooling |
| Heads | 3 (`has_issue`, `is_hidden`, `error_types`) |
| Language | Russian |
| License | MIT |

---

## What the Model Predicts

RQA-R2 predicts three connected outputs.

### 1. Logical Issue Detection

- `has_logical_issue ∈ {false, true}`
- calibrated probability available

### 2. Hidden Problem Detection

- `is_hidden_problem ∈ {false, true}`
- evaluated only when a reasoning issue exists

### 3. Explicit Error Type Classification

If the text is classified as `explicit`, the model may assign one or more of the following error types:

- `false_causality`
- `unsupported_claim`
- `overgeneralization`
- `missing_premise`
- `contradiction`
- `circular_reasoning`

This is a **multi-label** prediction head.

---

## Ontology

R2 uses a strict three-class reasoning ontology.

### `logical`

- no reasoning issue
- no hidden problem
- no explicit errors

### `hidden`

- a reasoning problem exists
- no explicitly labeled fallacy
- the issue is structural, implicit, or argumentative

### `explicit`

- a reasoning problem exists
- at least one explicit error type is present

This ontology is enforced in both training and inference.
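The three-class ontology above can be expressed as a small validation helper that maps the three head labels onto a single class and rejects inconsistent combinations. This is an illustrative sketch, not code from the RQA repository; the function and argument names are assumptions.

```python
from typing import FrozenSet

# Error types from the multi-label head described in this card.
ERROR_TYPES = frozenset({
    "false_causality", "unsupported_claim", "overgeneralization",
    "missing_premise", "contradiction", "circular_reasoning",
})

def ontology_class(has_issue: bool, is_hidden: bool,
                   errors: FrozenSet[str]) -> str:
    """Map (has_issue, is_hidden, error_types) onto the strict
    logical / hidden / explicit ontology, enforcing its invariants."""
    unknown = errors - ERROR_TYPES
    if unknown:
        raise ValueError(f"unknown error types: {sorted(unknown)}")
    if not has_issue:
        if is_hidden or errors:
            raise ValueError("logical texts carry no hidden flag and no errors")
        return "logical"
    if is_hidden:
        if errors:
            raise ValueError("hidden problems carry no explicit error labels")
        return "hidden"
    if not errors:
        raise ValueError("explicit cases need at least one error type")
    return "explicit"
```

Raising on inconsistent combinations is one way to make the "enforced in both training and inference" claim concrete: label rows that violate the ontology fail fast instead of silently training a contradictory target.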
---

## Inference Contract

RQA-R2 uses gated inference:

- if `has_issue = false` -> class is `logical`; no errors are returned
- if `has_issue = true` and `is_hidden = true` -> class is `hidden`; no explicit errors are returned
- if `has_issue = true` and `is_hidden = false` -> class is `explicit`; explicit errors may be returned

R2 also supports:

- calibrated thresholds
- an `uncertain` mode
- `review_required` for borderline cases

---

## Architecture

RQA-R2 is built on top of **XLM-RoBERTa Large** with:

- mean pooling
- separate projections per task
- separate dropout per head
- 3 task-specific heads
- uncertainty-weighted multi-task training

Training is hierarchical:

- `has_issue` is trained on all samples
- `is_hidden` is trained only on problem samples
- `error_types` are trained only on explicit samples

---

## Training and Calibration

R2 uses an honest experimental structure:

- `train` for fitting
- `val` for model selection
- `calib` for temperature scaling and threshold tuning
- `test` for final held-out evaluation

Calibration includes:

- an issue temperature
- a hidden temperature
- per-class error temperatures
- threshold selection for `has_issue`
- threshold selection for `is_hidden`
- per-class thresholds for error types

---

## Held-Out Synthetic Benchmark

The following metrics were obtained on the current held-out synthetic test split used for R2:

- `Issue`: `F1 = 0.988`, `FPR = 0.029`, `PR-AUC = 0.999`
- `Hidden`: `F1 = 0.960`, `PR-AUC = 0.994`
- `Errors`: `macro-F1 = 0.822`, `micro-F1 = 0.813`, `samples-F1 = 0.838`
- `Top-level class macro-F1 = 0.964`
- `Coverage = 95.6%`
- `Uncertain rate = 4.4%`

These are strong results for the current data regime. Important:

> These metrics are measured on a held-out split from the current synthetic dataset.
> They demonstrate that the R2 design works very well in-distribution, but they should not be interpreted as proof of universal real-world reasoning performance.
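The gated inference contract and the calibration steps described above can be sketched together: temperature-scaled sigmoids, per-head thresholds, an uncertainty band that triggers `status=uncertain` and `review_required`, and the `has_issue -> is_hidden -> error_types` gating. This is a minimal sketch under assumed default temperatures, thresholds, and band width, not the RQA decoding code.

```python
import math

def _sigmoid(logit: float, temperature: float) -> float:
    # Temperature scaling: divide the logit by a calibrated T before the sigmoid.
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def gated_inference(issue_logit, hidden_logit, error_logits,
                    t_issue=1.0, t_hidden=1.0, error_temps=None,
                    thr_issue=0.5, thr_hidden=0.5, error_thrs=None,
                    band=0.1):
    """Illustrative gated decode: has_issue -> is_hidden -> error_types.
    All temperatures, thresholds, and the band width are placeholder values."""
    error_temps = error_temps or {k: 1.0 for k in error_logits}
    error_thrs = error_thrs or {k: 0.5 for k in error_logits}

    p_issue = _sigmoid(issue_logit, t_issue)
    if abs(p_issue - thr_issue) < band:  # borderline -> abstain
        return {"class": None, "status": "uncertain", "review_required": True,
                "has_issue_probability": round(p_issue, 4)}
    if p_issue < thr_issue:
        return {"class": "logical", "status": "ok", "review_required": False,
                "has_issue_probability": round(p_issue, 4)}

    p_hidden = _sigmoid(hidden_logit, t_hidden)
    if p_hidden >= thr_hidden:
        return {"class": "hidden", "status": "ok", "review_required": False,
                "has_issue_probability": round(p_issue, 4),
                "hidden_probability": round(p_hidden, 4)}

    # Explicit case: per-class temperatures and thresholds for the error head.
    errors = [
        {"type": k,
         "probability": round(_sigmoid(v, error_temps[k]), 4),
         "threshold": error_thrs[k]}
        for k, v in error_logits.items()
        if _sigmoid(v, error_temps[k]) >= error_thrs[k]
    ]
    return {"class": "explicit", "status": "ok", "review_required": False,
            "has_issue_probability": round(p_issue, 4),
            "hidden_probability": round(p_hidden, 4), "errors": errors}
```

Note how the coverage and uncertain-rate numbers above correspond to the `band` check: widening the band trades coverage for fewer borderline decisions.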
---

## Training Data

RQA-R2 was trained on a custom reasoning-quality dataset with:

- `7292` total samples
- `3150` logical texts
- `4142` problematic texts
- `1242` hidden problems
- `2900` explicit cases

Error-label counts:

- `false_causality`: `518`
- `unsupported_claim`: `524`
- `overgeneralization`: `599`
- `missing_premise`: `537`
- `contradiction`: `475`
- `circular_reasoning`: `540`

Multi-label explicit cases: `293`

The current dataset is useful and already strong enough for training and benchmarking R2, but it is still primarily **synthetic** and should be expanded with real-world data in future versions.

---

## Intended Use

### Recommended for

- reasoning-quality evaluation
- LLM output auditing
- AI safety pipelines
- judge/reranker pipelines
- pre-filtering for downstream review
- analytical tooling around argument structure

### Not intended for

- text generation
- explanation generation
- automatic rewriting or correction
- factual verification
- legal or scientific truth adjudication

---

## Output Example

```json
{
  "class": "explicit",
  "status": "ok",
  "review_required": false,
  "has_logical_issue": true,
  "has_issue_probability": 0.9993,
  "is_hidden_problem": false,
  "hidden_probability": 0.021,
  "errors": [
    {
      "type": "missing_premise",
      "probability": 0.923,
      "threshold": 0.54
    }
  ]
}
```

---

## Limitations

RQA-R2 still has important limits:

- it evaluates reasoning structure, not factual truth
- hidden problems remain partly subjective by nature
- the current benchmark is still synthetic and in-distribution
- real human-written texts and outputs from other LLMs may be harder
- the model should still be validated externally before being treated as a fully general reasoning judge

Also note:

- R2 was optimized toward low false positives, but on the current held-out synthetic test set the observed `Issue FPR` is `2.9%`, not `1.0%`
- if strict false-positive control is critical, thresholds may need to be tightened further for the target deployment environment

---

## Recommended Next Step

The best next step after R2 is external validation on:

- human-written argumentative texts
- outputs from other LLM families
- paraphrased and adversarially reworded samples
- harder hidden-problem cases

That is the correct way to turn a strong in-distribution result into a robust real-world system.

---

## Summary

RQA-R2 is a major upgrade over R1:

- better ontology
- better training logic
- better calibration
- better inference safety
- stronger held-out synthetic performance

R1 proved the idea. **R2 is the first version that fully validates it.**