# Chrome taxonomy-tiny: Observed Facts

**Date:** 2026-04-19
**Analysts:** Local research (reverse engineering) + public Chromium sources

---

## 1. The Chrome Component

### Identity

| Field | Value | Source |
|-------|-------|--------|
| CRX ID | `eidcjfoningnkhpoelgpjemmhmopkeoi` | `verified_contents.json` |
| Name | `Optimization Guide On Device Taxonomy Model` | `manifest.json` |
| Version | `2026.2.12.1554` | `manifest.json` |
| BaseModelSpec.name | `taxonomy-tiny` | `manifest.json` |
| BaseModelSpec.version | `0.0.0.0` | `manifest.json` |
| Performance hints | `[3]` | `manifest.json` |

### Signed Files (CRX verified_contents)

The component contains exactly 3 files signed by Google:

| File | Size | Content |
|------|------|---------|
| `manifest.json` | 247 bytes | Component metadata |
| `on_device_model_execution_config.pb` | **0 bytes** | Empty |
| `weights.bin` | 126,025,728 bytes (120.2 MB) | LITERTLM container |

No `adaptation_weights.bin`, `adapter.bin`, `lora.bin`, or `model-info.pb` file exists in the component.

### LITERTLM Container (weights.bin)

| Entry | Size | Description |
|-------|------|-------------|
| `embedder.tflite` | 8,601,600 bytes | TFLite v3 |
| `prefill_decode.tflite` | 117,391,360 bytes | TFLite v3 |
| `tokenizer.spm` | 312,918 bytes (LITERTLM header; extracted file: 312,917 bytes) | SentencePiece |
| `model_version` | 5 bytes | String `"1.0.1"` |

---

## 2. Model Architecture

### Transformer Specifications

| Parameter | Value | Measurement Method |
|-----------|-------|--------------------|
| Family | Gemma (named "TinyGemma" in Chromium code) | Chromium C++ source |
| Layers | 18 transformer decoder layers | TFLite inspection |
| Parameters | ~319M estimated (INT4 quantized) | Estimated: embedding 16384x1024 + 18 layers x (gate 2048x1024 + up 2048x1024 + down 1024x2048 + o_proj 1024x1024 + norms) + tied LM head |
| Embedding dim | 1024 | Embedder output tensor |
| KV head dim | 256, 1 head per layer | KV cache tensors |
| Quantization | INT4 stored as INT8 (15 unique values) | Weight analysis |
| Embedding table vs LM head | Byte-for-byte identical matrices | Full matrix comparison |
| Execution | CPU only | Chromium C++ source |

### TFLite Signatures

**embedder.tflite:**

| Input/Output | Shape | Type |
|--------------|-------|------|
| Input: `token_ids` | `[1, 1]` | int32 |
| Output: `embeddings` | `[1, 1, 1024]` | float32 |

**prefill_decode.tflite -- `prefill` signature:**

| Input/Output | Shape | Type |
|--------------|-------|------|
| Input: `embeddings` | `[1, N, 1024]` | float32 |
| Input: `input_pos` | `[N]` | int32 |
| Input: `mask` | `[1, 1, N, N]` | float32 |
| Input: 36 KV caches | `[1, N, 1, 256]` | float32 |
| Output: 36 KV caches (updated) | `[1, N, 1, 256]` | float32 |
| Output: **no logits** | -- | -- |

**prefill_decode.tflite -- `decode` signature:**

| Input/Output | Shape | Type |
|--------------|-------|------|
| Input: `embeddings` | `[1, 1, 1024]` | float32 |
| Input: `input_pos` | `[1]` | int32 |
| Input: `mask` | `[1, 1, 1, 1]` | float32 |
| Input: 36 KV caches | `[1, ?, 1, 256]` | float32 |
| Output: 36 KV caches (updated) | `[1, ?, 1, 256]` | float32 |
| Output: `logits` | `[1, 1, 16384]` | float32 |

### KV Cache: Observed Behavior

The model writes new KV values **at position `input_pos` within the existing buffer**. Output has the same size as input.
The cache must be **replaced** with the output, not concatenated.

---

## 3. Tokenizer

### Vocabulary Structure (16,384 tokens)

| Range | Tokens | Role |
|-------|--------|------|
| 0 | `` | Padding |
| 1 | `` | End of sequence |
| 2 | `` | Beginning of sequence (BOS) |
| 3 | `` | Unknown |
| 4--259 | `<0x00>`--`<0xFF>` | Byte fallback |
| 260--15999 | Text tokens | Standard SentencePiece vocabulary |
| 16000--16001 | `▁`, `▁` | Audio markers |
| 16002--16003 | `▁`, `▁` | Image markers |
| 16004--16013 | `▁` -- `▁` | Control tokens |
| 16014--16383 | `▁` -- `▁` | Custom tokens |

### Tokenization of ``

- `" "` (with preceding space) -> single token 16004
- `""` (without space) -> decomposed into individual characters (byte fallback)
- `sp.PieceToId('')` returns 3 (UNK) because the actual piece name is `▁`

---

## 4. Chromium Source Code

### File: `optimization_guide_on_device_model_installer.cc`

Confirmed via URL provided by the user:
`https://chromium.googlesource.com/experimental/chromium/src/+/refs/heads/main/chrome/browser/component_updater/optimization_guide_on_device_model_installer.cc`

```cpp
// Extension id is eidcjfoningnkhpoelgpjemmhmopkeoi.
constexpr char kClassifierModelManifestName[] =
    "Optimization Guide On Device Taxonomy Model";
constexpr base::FilePath::CharType kClassifierModelInstallationRelativePath[] =
    FILE_PATH_LITERAL("OptGuideOnDeviceClassifierModel");
```

The component is registered through an `OptimizationGuideOnDeviceClassifierModelInstallerPolicy` class that inherits from `OnDeviceModelInstallerPolicy`.
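The tokenization behavior recorded in §3 can be re-checked with the `sentencepiece` Python bindings. A minimal sketch, assuming `tokenizer.spm` has already been extracted from `weights.bin` into the working directory; the inspected ids come from the vocabulary table above, and nothing else is assumed:

```python
import os

def inspect_tokenizer(path="tokenizer.spm"):
    """Dump the vocabulary facts reported in section 3."""
    import sentencepiece as spm  # pip install sentencepiece

    sp = spm.SentencePieceProcessor(model_file=path)
    print("vocab size:", sp.vocab_size())        # expected 16384
    # Specials (0-3), byte-fallback bounds (4, 259), first control token (16004).
    for i in (0, 1, 2, 3, 4, 259, 16004):
        print(i, repr(sp.id_to_piece(i)))
    # Word-initial pieces carry a leading "▁" (U+2581); looking a piece up
    # without it returns UNK (id 3), matching the PieceToId observation above.
    print(sp.encode("stock market investment trading", add_bos=True))

if os.path.exists("tokenizer.spm"):
    inspect_tokenizer()
```

The `add_bos=True` flag prepends id 2 the same way the Chrome pipeline does before prefill.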
The `OnDeviceModelType` enum distinguishes two model types:

```cpp
enum OnDeviceModelType { kBaseModel, kClassifierModel };
```

### File: `on_device_model_classifier_controller.cc`

Hardcoded classification config (Chromium source, `experimental/chromium/src` branch, Copyright 2026 in file header):

```cpp
// Prompt template
substitution->set_string_template("%s ");

// Proto types
// Request:  optimization_guide.proto.ClassifyApiRequest  { string text = 1; }
// Response: optimization_guide.proto.ClassifyApiResponse { string output = 1; }

// Feature ID
config.set_feature(proto::ModelExecutionFeature::MODEL_EXECUTION_FEATURE_CLASSIFIER);  // = 33

// Execution
params->backend_type = ml::ModelBackendType::kCpuBackend;
params->max_tokens = kOnDeviceModelMaxTokens;  // 10240

// Log on disconnect
LOG(ERROR) << "TinyGemma model disconnected unexpectedly.";
```

### Proto: `substitution.proto`

```protobuf
enum ControlToken {
  CONTROL_TOKEN_UNSPECIFIED = 0;
  SYSTEM = 1;  //
  MODEL = 2;   //
  USER = 3;    //
  END = 4;     //
}
```

### Execution Pipeline in Chrome

```
Input text
  -> ClassifyApiRequest { text: "..." }
  -> FormatString: "%s " -> "text "
  -> BOS + tokenize -> embedder -> prefill/decode -> logits -> sample
  -> ParseResponse -> ClassifyApiResponse { output: "..." }
```

---

## 5. Observed Inference Behavior

### Text Generation

When provided with text followed by ``, the model **echoes and reformats** the input:

| Input | Generated Output |
|-------|-----------------|
| `best laptop deals gaming electronics` | `Best laptop deals: gaming electronics` |
| `Italian cooking recipes pasta carbonara` | `Italian cooking recipes Pasta carbonara` |
| `NBA basketball scores live updates` | `NBA Basketball Scores Live Update` |
| `stock market investment trading` | `Stock market investment trading` |

Across the 4 tests performed, the output contains capitalization and punctuation changes relative to the input. No classification labels or category identifiers were observed in any tested output.
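The embedder -> prefill -> decode flow above, together with the cache-replacement rule from §2, can be sketched with TensorFlow's TFLite interpreter. This is a sketch under stated assumptions, not Chrome's implementation: the KV-cache input names (matched here by the substring `"cache"`), the cache length sized for the whole run, the greedy sampling, and the local file paths are all guesses; only the signature names and tensor shapes come from the §2 tables.

```python
import os

def generate(token_ids, weights_dir=".", steps=16):
    """Sketch of the embedder -> prefill -> decode loop.

    Key observed behavior: the 36 KV caches returned by each call REPLACE
    the previous caches; they are never concatenated.
    """
    import numpy as np
    import tensorflow as tf

    embedder = tf.lite.Interpreter(
        model_path=os.path.join(weights_dir, "embedder.tflite")).get_signature_runner()
    pd = tf.lite.Interpreter(
        model_path=os.path.join(weights_dir, "prefill_decode.tflite"))
    prefill = pd.get_signature_runner("prefill")
    decode = pd.get_signature_runner("decode")

    def embed(tok):  # [1, 1] int32 -> [1, 1, 1024] float32
        return embedder(token_ids=np.array([[tok]], dtype=np.int32))["embeddings"]

    # Prefill every prompt token except the last; decode then starts from it
    # (prefill emits no logits, so the last token must go through decode).
    n = len(token_ids) - 1
    embs = np.concatenate([embed(t) for t in token_ids[:n]], axis=1)
    mask = np.triu(np.full((n, n), -1e9, dtype=np.float32), k=1)[None, None]

    # Zero-init caches sized for the whole run (an assumption; the §2 tables
    # tie the prefill cache length to N), then replace after every call.
    kv_names = [k for k in prefill.get_input_details() if "cache" in k.lower()]
    kv = {k: np.zeros((1, n + steps, 1, 256), dtype=np.float32) for k in kv_names}
    out = prefill(embeddings=embs, input_pos=np.arange(n, dtype=np.int32),
                  mask=mask, **kv)
    kv = {k: out[k] for k in kv_names}

    tokens, pos = list(token_ids), n
    for _ in range(steps):
        out = decode(embeddings=embed(tokens[-1]),
                     input_pos=np.array([pos], dtype=np.int32),
                     mask=np.zeros((1, 1, 1, 1), dtype=np.float32), **kv)
        kv = {k: out[k] for k in kv_names}                  # replace, again
        tokens.append(int(out["logits"][0, -1].argmax()))   # greedy sampling
        pos += 1
    return tokens

if os.path.exists("embedder.tflite") and os.path.exists("prefill_decode.tflite"):
    print(generate([2, 260, 261]))  # BOS (id 2) plus arbitrary text ids
```

Swapping the two dict rebuilds for concatenation is what breaks generation, which is how the replace-not-concatenate behavior in §2 was established.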
### Custom Tokens in Generation

| Metric | Value |
|--------|-------|
| Custom token max logit | -19 to -21 |
| Regular token max logit | +19 to +27 |
| Gap | ~37 to 46 logit units |
| Custom token entropy ratio | 1.0000 (uniform) |

Across all tests performed, custom tokens (16014-16383) had logits between -19 and -21, producing negligible probability after softmax.

### Embedding Quality (Layer 17 KV cache as representation)

| Metric | Last Position | Mean-Pooled |
|--------|--------------|-------------|
| Within-category similarity | 0.849 | 0.783 |
| Across-category similarity | 0.805 | 0.756 |
| Separation gap | 0.044 | 0.027 |

Within-category similarity exceeds across-category similarity by 0.044 (last position) and 0.027 (mean-pooled).

### Performance

| Operation | Duration (~16 tokens) |
|-----------|-----------------------|
| Decode only (token by token) | ~2.0--2.4s |
| Prefill batch | ~1.3s |
| Prefill + full decode | ~3.0s |

---

## 6. Weight Analysis

### Embedding Table: Custom vs Regular Tokens

| Metric | Custom (370) | Regular (370) | Multimodal (14) |
|--------|-------------|---------------|-----------------|
| Intra-group cosine | **0.999937** | 0.051 | 0.886 |
| L2 norm mean | 32.045 | 29.616 | 32.029 |
| L2 norm std | **0.007** | 3.019 | 0.077 |
| Per-dimension variance (mean) | **0.0004** | 6.600 | -- |
| Regular/custom variance ratio | **14,817x** | -- | -- |
| K-means k=50: distinct clusters | 33 (collapse) | 50 | -- |
| DBSCAN: clusters (eps x1.0) | 2 | 1 | 1 |

K-means at k=50 collapses to 33 clusters (ConvergenceWarning). DBSCAN at eps x1.0 finds 2 clusters. No significant intra/inter-group separation detected for the grid patterns tested (10x37, 37x10, 74x5, 185x2). Multimodal tokens (16000-16013) show 2 sub-clusters with cosine ~0.20 between groups. Custom tokens show no sub-clusters.
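The intra-group cosine and norm statistics in the table above reduce to a few lines of NumPy. A self-contained sketch on synthetic stand-ins (the real inputs would be the corresponding rows of the embedding table extracted from `weights.bin`): a block of near-identical rows mimics the custom tokens, independent random rows mimic regular tokens.

```python
import numpy as np

def group_stats(rows):
    """Mean pairwise cosine similarity plus L2-norm mean/std for a group of
    embedding rows (the metrics in the custom/regular/multimodal table)."""
    norms = np.linalg.norm(rows, axis=1)
    unit = rows / norms[:, None]
    cos = unit @ unit.T
    upper = np.triu_indices(len(rows), k=1)   # all pairs, excluding the diagonal
    return cos[upper].mean(), norms.mean(), norms.std()

rng = np.random.default_rng(0)
# Synthetic stand-ins: near-identical rows mimic the custom-token block,
# independent rows mimic ordinary trained tokens.
base = rng.normal(size=(1, 1024))
custom = base + 1e-3 * rng.normal(size=(370, 1024))
regular = rng.normal(size=(370, 1024))

c_cos, c_mean, c_sd = group_stats(custom)
r_cos, r_mean, r_sd = group_stats(regular)
print(f"custom : cos={c_cos:.6f}  norm={c_mean:.3f} +/- {c_sd:.4f}")
print(f"regular: cos={r_cos:.6f}  norm={r_mean:.3f} +/- {r_sd:.4f}")
```

Run against the real matrix (rows 16014-16383 versus a 370-row sample of regular tokens), the same function should yield the contrast reported above (0.999937 vs 0.051).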
### Transformer Layers: Weight Distribution

| Metric | Value (54 FFN tensors, 18 layers) |
|--------|-----------------------------------|
| Mean std | 2.6252 |
| Std of stds | 0.0093 |
| Coefficient of variation | **0.36%** |
| % outliers | 0.00% (all layers) |
| % zeros | 15.3--16.4% |

0.36% coefficient of variation across the 54 FFN tensors. Differences between early, mid, and late layers:

| Weight type | Early (0-5) | Mid (6-11) | Late (12-17) | Variation |
|-------------|-------------|------------|--------------|-----------|
| gate_proj std | 2.6319 | 2.6243 | 2.6307 | 0.8% |
| up_proj std | 2.6314 | 2.6246 | 2.6318 | 0.8% |
| down_proj std | 2.6217 | 2.6090 | 2.6214 | 1.3% |

`o_proj` has std of 18.49 (7x the FFN layers), constant across all 18 layers (18.492--18.496).

### Normalization Layers

- Layer 0: all RMSNorm values are zero
- Layers 1-17: RMSNorm values range from 3.89 to 14.02

---

## 7. Comparison with Other Chrome Targets

### Target 72 vs Targets 43 and 74

| Attribute | Target 43 (Embedder) | Target 72 (Classifier) | Target 74 (Shopping) |
|-----------|---------------------|----------------------|---------------------|
| Optimization target | `PASSAGE_EMBEDDER` | `CLASSIFIER` | `SHOPPING_CLASSIFIER` |
| Timestamp in model-info.pb / version | 2024-06-25 (timestamp) | 2026-02-12 (inferred from version `2026.2.12.1554`) | 2026-03-30 (timestamp) |
| Config populated | Yes (147 bytes) | **No (0 bytes)** | Yes (121 bytes) |
| model-info.pb | Present | **Absent** | Present |
| Metadata type | `PassageEmbeddingsModelMetadata` | None | `CategoryClassifierMetadata` |
| Architecture | Sentence embedder TFLite | TinyGemma LLM 18 layers | Linear classifier head TFLite |
| Model size | 112 MB | 120 MB | 789 KB |
| Input | int32[1,64] token IDs | Text + `` | float32[1,1536] embeddings |
| Output | float32[1,768] embedding | Generated text | float32[1,1] score |
| Tokenizer | sentencepiece.model (794 KB) | tokenizer.spm (313 KB) | Uses target 43's |
| Input window | 64 tokens | ~10240 tokens | N/A |
| Depends on another target | No | No | Yes (target 43, v >= 2026-02-02) |

Target 74's metadata declares a dependency on target 43 via `required_embedder_version`. Target 74's input shape (1536-dim) matches the concatenation of two 768-dim embeddings from target 43. Target 72 has no declared dependency.

### Target 72 vs Target 62 (Proofreader)

| Attribute | Target 62 | Target 72 |
|-----------|-----------|-----------|
| Own model | No (uses shared Gemini Nano ~1.7 GB) | Yes (standalone TinyGemma 120 MB) |
| LoRA adapter | Yes (`adaptation_weights.bin`, 16.6 MB) | No |
| Config | Non-empty | Empty (0 bytes) |

### Legacy Topics System: PAGE_TOPICS_V2

| Attribute | PAGE_TOPICS_V2 | Target 72 |
|-----------|---------------|-----------|
| Target ID | 15 | 72 |
| Architecture | BERT (BertNLClassifier) | TinyGemma (Gemma) |
| Size | ~2.7 MB | ~120 MB |
| Input | Cleaned hostname | Full page content |
| Output | 469 topic probabilities | Generated text |
| Taxonomy | v2, 469 topics (IDs 1-629) | None in the component |
| Override list | ~47,128 domain-to-topic mappings | None |

---

## 8. Verified External Context

### Topics API: Deprecation Status

- **October 17, 2025**: Google announces retirement of the Topics API, Protected Audience, and Attribution Reporting
- **Deprecation** scheduled for Chrome 144
- **Full removal** scheduled for Chrome 150
- **Reason**: insufficient ecosystem adoption
- Source: [Privacy Sandbox Update](https://privacysandbox.google.com/blog/update-on-plans-for-privacy-sandbox-technologies)

### Classifier API: Public Proposal (GitHub)

An explainer published by the Chrome Built-in AI Team exists on GitHub:

- **Repo**: [`explainers-by-googlers/classifier-api`](https://github.com/explainers-by-googlers/classifier-api)
- **Description**: "A proposal for a high-performance, on-device browser API to classify text."
- **Status**: "a tentative and early design sketch [...] to solicit feedback"

Proposal contents:

| Element | Detail |
|---------|--------|
| JavaScript API | `Classifier.availability()`, `Classifier.create()`, `classifier.classify(text)` |
| Taxonomy | IAB Content Taxonomy V3.1 (experimental default) |
| IAB v3.1 size | 706 total categories, 588 leaf nodes ([source TSV](https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/Content%20Taxonomies/Content%20Taxonomy%203.1.tsv)) |
| Output | Array of `{id: string, confidence: number}` (ID-based, not human-readable) |
| Model | Described as "dedicated, on-device expert model optimized for high-speed classification" |
| Target latency | "Millisecond-level inference" |
| vs generic LLM | "A generic LLM is 'overkill' for classification" |
| Privacy | "Raw text never leaves the browser", "stateless by design" |
| Model updates | "managed by the browser component updater" |
| Non-goals | Summarization, translation, sentiment analysis; human-readable string outputs |
| Out of MVP scope | Streaming, multilingual, quota API, download progress, AbortSignal |
| Language | English only (MVP) |
| Author | Chrome Built-in AI Team |

### Shopping Classifier: Working Pipeline (Target 43+74)

Documented by Dejan Petrovic ([blog](https://dejan.ai/blog/google-shopping-classifier/), [HuggingFace](https://huggingface.co/dejanseo/chrome_models)):

| Step | Detail |
|------|--------|
| Chunking | 100 words max per passage, 10 passages max, truncated to 64 tokens |
| Embedding | Target 43: text -> 768-dim embedding |
| Title+URL | Concatenated into a single embedding |
| Classifier input | `[title_emb(768) \|\| mean_pool(passage_embs)(768)]` = 1536 dim |
| Output | Score 0--1 (shopping probability) |
| Storage | `VisitContentAnnotations` in Chrome's history database |
| Usage | Aggregated via `OPTIMIZATION_TARGET_SEGMENTATION_SHOPPING_USER` to enable commerce features |

### LiteRT-LM

Open-source C++ framework by Google for on-device language model inference.
Used in Chrome, Chromebook Plus, and Pixel Watch according to Google. Handles KV cache, session cloning, and CPU/GPU/NPU acceleration. The `LITERTLM` container format in which taxonomy-tiny is packaged corresponds to this framework.

Source: [Google Developers Blog](https://developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/)

---

## 9. What the Facts Do Not Tell Us

The following points are not resolved by available data:

1. **The link between target 72 and the Classifier API** -- The GitHub explainer does not name the underlying model. Target 72 has no reference to the Classifier API in its files. The correspondence is plausible but unconfirmed.
2. **The intended use of the 370 custom tokens** -- They exist in the vocabulary and are untrained. IAB Content Taxonomy V3.1 has 706 categories (588 leaf nodes); 370 does not match either count. Their correspondence to any taxonomy is not documented anywhere.
3. **Why a generative LLM rather than an embedding-based classifier** -- The explainer says "expert model", not "LLM". It says "millisecond-level inference", which does not match the ~3s observed on target 72. The explainer's output format is structured (`{id, confidence}`), while the Chromium controller reads a plain string (`ClassifyApiResponse { string output = 1; }`).
4. **The relationship with the Topics API** -- Target 72 appeared (2026-02-12) after the Topics API retirement announcement (2025-10-17). Both involve taxonomies and page classification, but no source code explicitly links them.
5. **When or whether fine-tuning will be deployed** -- The model is a pre-trained base model. No public information exists on a fine-tuning timeline.
6. **The semantics of `BaseModelSpec.version = "0.0.0.0"`** -- This format is not documented anywhere. Other Chrome components use normal numeric versions.

---

## 10. References

| Source | Type | URL / Path |
|--------|------|------------|
| Chrome component (CRX) | Local file | `classifier_analysis/classifier.crx3` |
| Manifest.json | Local file | `classifier_analysis/extracted/manifest.json` |
| Installer source | Chromium | [optimization_guide_on_device_model_installer.cc](https://chromium.googlesource.com/experimental/chromium/src/+/refs/heads/main/chrome/browser/component_updater/optimization_guide_on_device_model_installer.cc) |
| Controller source | Chromium | `on_device_model_classifier_controller.cc` (read via Chromium Code Search, `experimental/chromium/src` branch, no stable direct URL) |
| Classifier API explainer | GitHub | [explainers-by-googlers/classifier-api](https://github.com/explainers-by-googlers/classifier-api) |
| Shopping classifier blog | Web | [dejan.ai/blog/google-shopping-classifier](https://dejan.ai/blog/google-shopping-classifier/) |
| Privacy Sandbox update | Web | [privacysandbox.google.com/blog/update-on-plans](https://privacysandbox.google.com/blog/update-on-plans-for-privacy-sandbox-technologies) |
| Topics deprecation | Chromium Groups | [topics-api-announce](https://groups.google.com/a/chromium.org/g/topics-api-announce/c/iQX7PC3S0Ds) |
| LiteRT-LM blog | Web | [developers.googleblog.com](https://developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/) |
| Mission 1 report | Local file | `results/MISSION_01/target_72_inventory.md` |
| Mission 2 report | Local file | `results/MISSION_02/target_configs_comparison.md` |
| Mission 4 report | Local file | `results/MISSION_04/weight_variance_analysis.md` |
| Mission 5 report | Local file | `results/MISSION_05/custom_tokens_clustering.md` |