---
license: mit
base_model:
- microsoft/deberta-v3-small
datasets:
- tgupj/tiny-router-data
---

# tiny-router

`tiny-router` is a compact experimental multi-head routing classifier for short, domain-neutral messages with optional interaction context. It predicts four separate signals that downstream systems or agents can use for update handling, action routing, memory policy, and prioritization.

## What it predicts

```
relation_to_previous: new | follow_up | correction | confirmation | cancellation | closure
actionability: none | review | act
retention: ephemeral | useful | remember
urgency: low | medium | high
```

At inference time the model predicts each head independently and returns a calibrated per-head confidence plus an aggregate `overall_confidence`.
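
For downstream code, the label spaces can be captured as plain constants. The names below are taken verbatim from the schema above; the dict itself is an illustrative consumer-side convenience, not part of this repo.

```python
# Label spaces for the four heads, as listed in the schema above.
LABEL_SPACES = {
    "relation_to_previous": ["new", "follow_up", "correction",
                             "confirmation", "cancellation", "closure"],
    "actionability": ["none", "review", "act"],
    "retention": ["ephemeral", "useful", "remember"],
    "urgency": ["low", "medium", "high"],
}
```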

## Intended use

- Route short user messages into lightweight automation tiers.
- Detect whether a message updates prior context or starts something new.
- Decide whether action is required, review is safer, or no action is needed.
- Separate disposable details from short-term useful context and longer-term memory candidates.
- Prioritize items by urgency.

Good use cases:

- routing message-like requests in assistants or productivity tools
- triaging follow-ups, corrections, confirmations, and closures
- conservative automation with review fallback

Not good use cases:

- fully autonomous high-stakes action without guardrails
- domains that need expert reasoning or regulated decisions

## Training data

This checkpoint was trained on the synthetic dataset splits in:

- `data/synthetic/train.jsonl`
- `data/synthetic/validation.jsonl`
- `data/synthetic/test.jsonl`

The data follows a structured JSONL schema with:

- `current_text`
- optional `interaction.previous_text`
- optional `interaction.previous_action`
- optional `interaction.previous_outcome`
- optional `interaction.recency_seconds`
- four label heads under `labels`
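
An illustrative record in this schema (field names are from the list above; the text and label values are hypothetical):

```json
{
  "current_text": "Actually next Monday",
  "interaction": {
    "previous_text": "Set a reminder for Friday",
    "previous_action": "created_reminder",
    "previous_outcome": "success",
    "recency_seconds": 45
  },
  "labels": {
    "relation_to_previous": "correction",
    "actionability": "act",
    "retention": "useful",
    "urgency": "medium"
  }
}
```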

## Model details

- Base encoder: `microsoft/deberta-v3-small`
- Architecture: encoder-only multitask classifier
- Pooling: learned attention pooling
- Structured features:
  - canonicalized `previous_action` embedding
  - `previous_outcome` embedding
  - learned projection of `log1p(recency_seconds)`
- Head structure:
  - dependency-aware multitask heads
  - later heads condition on learned summaries of earlier head predictions
- Calibration:
  - post-hoc per-head temperature scaling fit on validation logits
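
Per-head temperature scaling is a standard post-hoc calibration step: a single scalar `T` per head is fit on held-out validation logits, and that head's logits are divided by `T` before the softmax at inference. A minimal numpy sketch, not this repo's implementation (which fits `T` however its training code chooses); the grid search below is one simple way to fit it:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    # Mean negative log-likelihood of the true labels at temperature T.
    probs = softmax(logits / T)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    # Pick the temperature that minimizes validation NLL for one head.
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))
```

At inference, each head's logits are divided by its fitted temperature before taking confidences; `T > 1` flattens an overconfident head.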

This checkpoint was trained with:

- `batch_size = 32`
- `epochs = 20`
- `max_length = 128`
- `encoder_lr = 2e-5`
- `head_lr = 1e-4`
- `dropout = 0.1`
- `pooling_type = attention`
- `use_head_dependencies = true`

## Current results

Held-out test results from `artifacts/tiny-router/eval.json`:

- `macro_average_f1 = 0.7848`
- `exact_match = 0.4570`
- `automation_safe_accuracy = 0.6230`
- `automation_safe_coverage = 0.5430`
- `ECE = 0.3440`

Per-head macro F1:

- `relation_to_previous = 0.8415`
- `actionability = 0.7982`
- `retention = 0.7809`
- `urgency = 0.7187`

Ablations:

- `current_text_only = 0.7058`
- `current_plus_previous_text = 0.7478`
- `full_interaction = 0.7848`

Interpretation:

- interaction context helps
- actionability and urgency are usable but still imperfect
- high-confidence automation is possible only with conservative thresholds

## Limitations

- The benchmark is task-specific and internal to this repo.
- The dataset is synthetic, so distribution shift to real product traffic is likely.
- Label quality on subtle class boundaries (e.g. `review` vs `act`, `useful` vs `remember`) still limits accuracy.
- Confidence calibration is improved but not strong enough to justify broad unattended automation.

## Example inference

```json
{
  "relation_to_previous": { "label": "correction", "confidence": 0.94 },
  "actionability": { "label": "act", "confidence": 0.97 },
  "retention": { "label": "useful", "confidence": 0.76 },
  "urgency": { "label": "medium", "confidence": 0.81 },
  "overall_confidence": 0.87
}
```
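
Given the ECE reported above, a consumer of this output would typically gate automation on confidence rather than acting on every `act` prediction. A minimal illustrative sketch of such a gate; the function name and threshold values are hypothetical, not tuned against this checkpoint:

```python
def route(prediction, act_threshold=0.95, overall_threshold=0.80):
    """Map one prediction dict (shaped like the JSON above) to an automation tier."""
    action = prediction["actionability"]
    if action["label"] == "none":
        return "ignore"
    # Only automate when both the head and the overall confidence are high.
    if (action["label"] == "act"
            and action["confidence"] >= act_threshold
            and prediction["overall_confidence"] >= overall_threshold):
        return "auto"
    return "human_review"
```

Conservative thresholds like these trade coverage for safety, matching the `automation_safe_coverage` figure reported in the results.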

## How to load

This repo uses a custom checkpoint format; load it with this project's loader:

```python
from tiny_router.io import load_checkpoint
from tiny_router.runtime import get_device

device = get_device(requested_device="cpu")
model, tokenizer, config = load_checkpoint("artifacts/tiny-router", device=device)
```

Or run inference with:

```bash
uv run python predict.py \
  --model-dir artifacts/tiny-router \
  --input-json '{"current_text":"Actually next Monday","interaction":{"previous_text":"Set a reminder for Friday","previous_action":"created_reminder","previous_outcome":"success","recency_seconds":45}}' \
  --pretty
```