# Chrome taxonomy-tiny: Observed Facts

**Date:** 2026-04-19
**Analysts:** Local research (reverse engineering) + public Chromium sources

---

## 1. The Chrome Component

### Identity

| Field | Value | Source |
|-------|-------|--------|
| CRX ID | `eidcjfoningnkhpoelgpjemmhmopkeoi` | `verified_contents.json` |
| Name | `Optimization Guide On Device Taxonomy Model` | `manifest.json` |
| Version | `2026.2.12.1554` | `manifest.json` |
| BaseModelSpec.name | `taxonomy-tiny` | `manifest.json` |
| BaseModelSpec.version | `0.0.0.0` | `manifest.json` |
| Performance hints | `[3]` | `manifest.json` |

### Signed Files (CRX verified_contents)

The component contains exactly 3 files signed by Google:

| File | Size | Content |
|------|------|---------|
| `manifest.json` | 247 bytes | Component metadata |
| `on_device_model_execution_config.pb` | **0 bytes** | Empty |
| `weights.bin` | 126,025,728 bytes (120.2 MB) | LITERTLM container |

No `adaptation_weights.bin`, `adapter.bin`, `lora.bin`, or `model-info.pb` file exists in the component.

### LITERTLM Container (weights.bin)

| Entry | Size | Description |
|-------|------|-------------|
| `embedder.tflite` | 8,601,600 bytes | TFLite v3 |
| `prefill_decode.tflite` | 117,391,360 bytes | TFLite v3 |
| `tokenizer.spm` | 312,918 bytes (LITERTLM header; extracted file: 312,917 bytes) | SentencePiece |
| `model_version` | 5 bytes | String `"1.0.1"` |

---

## 2. Model Architecture

### Transformer Specifications

| Parameter | Value | Measurement Method |
|-----------|-------|--------------------|
| Family | Gemma (named "TinyGemma" in Chromium code) | Chromium C++ source |
| Layers | 18 transformer decoder layers | TFLite inspection |
| Parameters | ~319M estimated (INT4 quantized) | Estimated: embedding 16384x1024 + 18 layers x (gate 2048x1024 + up 2048x1024 + down 1024x2048 + o_proj 1024x1024 + norms) + tied LM head |
| Embedding dim | 1024 | Embedder output tensor |
| KV head dim | 256, 1 head per layer | KV cache tensors |
| Quantization | INT4 stored as INT8 (15 unique values) | Weight analysis |
| Embedding table vs LM head | Byte-for-byte identical matrices | Full matrix comparison |
| Execution | CPU only | Chromium C++ source |

### TFLite Signatures

**embedder.tflite:**

| Input/Output | Shape | Type |
|--------------|-------|------|
| Input: `token_ids` | `[1, 1]` | int32 |
| Output: `embeddings` | `[1, 1, 1024]` | float32 |

**prefill_decode.tflite -- `prefill` signature:**

| Input/Output | Shape | Type |
|--------------|-------|------|
| Input: `embeddings` | `[1, N, 1024]` | float32 |
| Input: `input_pos` | `[N]` | int32 |
| Input: `mask` | `[1, 1, N, N]` | float32 |
| Input: 36 KV caches | `[1, N, 1, 256]` | float32 |
| Output: 36 KV caches (updated) | `[1, N, 1, 256]` | float32 |
| Output: **no logits** | -- | -- |

**prefill_decode.tflite -- `decode` signature:**

| Input/Output | Shape | Type |
|--------------|-------|------|
| Input: `embeddings` | `[1, 1, 1024]` | float32 |
| Input: `input_pos` | `[1]` | int32 |
| Input: `mask` | `[1, 1, 1, 1]` | float32 |
| Input: 36 KV caches | `[1, ?, 1, 256]` | float32 |
| Output: 36 KV caches (updated) | `[1, ?, 1, 256]` | float32 |
| Output: `logits` | `[1, 1, 16384]` | float32 |

### KV Cache: Observed Behavior

The model writes new KV values **at position `input_pos` within the existing buffer**. Output has the same size as input.
The cache must be **replaced** with the output, not concatenated.

---

## 3. Tokenizer

### Vocabulary Structure (16,384 tokens)

| Range | Tokens | Role |
|-------|--------|------|
| 0 | `` | Padding |
| 1 | `` | End of sequence |
| 2 | `` | Beginning of sequence (BOS) |
| 3 | `` | Unknown |
| 4--259 | `<0x00>`--`<0xFF>` | Byte fallback |
| 260--15999 | Text tokens | Standard SentencePiece vocabulary |
| 16000--16001 | `▁`, `▁` | Audio markers |
| 16002--16003 | `▁`, `▁` | Image markers |
| 16004--16013 | `▁` -- `▁` | Control tokens |
| 16014--16383 | `▁` -- `▁` | Custom tokens |

### Tokenization of ``

- `" "` (with preceding space) -> single token 16004
- `""` (without space) -> decomposed into individual characters (byte fallback)
- `sp.PieceToId('')` returns 3 (UNK) because the actual piece name is `▁`

---

## 4. Chromium Source Code

### File: `optimization_guide_on_device_model_installer.cc`

Confirmed via URL provided by the user:
`https://chromium.googlesource.com/experimental/chromium/src/+/refs/heads/main/chrome/browser/component_updater/optimization_guide_on_device_model_installer.cc`

```cpp
// Extension id is eidcjfoningnkhpoelgpjemmhmopkeoi.
constexpr char kClassifierModelManifestName[] =
    "Optimization Guide On Device Taxonomy Model";
constexpr base::FilePath::CharType kClassifierModelInstallationRelativePath[] =
    FILE_PATH_LITERAL("OptGuideOnDeviceClassifierModel");
```

The component is registered through an `OptimizationGuideOnDeviceClassifierModelInstallerPolicy` class that inherits from `OnDeviceModelInstallerPolicy`.
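The tokenization behavior recorded in §3 can be re-checked with the `sentencepiece` Python bindings. A minimal sketch, assuming `tokenizer.spm` has already been extracted from `weights.bin` into the working directory; the inspected ids come from the vocabulary table above, and nothing else is assumed:

```python
import os

def inspect_tokenizer(path="tokenizer.spm"):
    """Dump the vocabulary facts reported in section 3."""
    import sentencepiece as spm  # pip install sentencepiece

    sp = spm.SentencePieceProcessor(model_file=path)
    print("vocab size:", sp.vocab_size())        # expected 16384
    # Specials (0-3), byte-fallback bounds (4, 259), first control token (16004).
    for i in (0, 1, 2, 3, 4, 259, 16004):
        print(i, repr(sp.id_to_piece(i)))
    # Word-initial pieces carry a leading "▁" (U+2581); looking a piece up
    # without it returns UNK (id 3), matching the PieceToId observation above.
    print(sp.encode("stock market investment trading", add_bos=True))

if os.path.exists("tokenizer.spm"):
    inspect_tokenizer()
```

The `add_bos=True` flag prepends id 2 the same way the Chrome pipeline does before prefill.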
The `OnDeviceModelType` enum distinguishes two model types:

```cpp
enum OnDeviceModelType { kBaseModel, kClassifierModel };
```

### File: `on_device_model_classifier_controller.cc`

Hardcoded classification config (Chromium source, `experimental/chromium/src` branch, Copyright 2026 in file header):

```cpp
// Prompt template
substitution->set_string_template("%s ");

// Proto types
// Request:  optimization_guide.proto.ClassifyApiRequest  { string text = 1; }
// Response: optimization_guide.proto.ClassifyApiResponse { string output = 1; }

// Feature ID
config.set_feature(proto::ModelExecutionFeature::MODEL_EXECUTION_FEATURE_CLASSIFIER);  // = 33

// Execution
params->backend_type = ml::ModelBackendType::kCpuBackend;
params->max_tokens = kOnDeviceModelMaxTokens;  // 10240

// Log on disconnect
LOG(ERROR) << "TinyGemma model disconnected unexpectedly.";
```

### Proto: `substitution.proto`

```protobuf
enum ControlToken {
  CONTROL_TOKEN_UNSPECIFIED = 0;
  SYSTEM = 1;  //
  MODEL = 2;   //
  USER = 3;    //
  END = 4;     //
}
```

### Execution Pipeline in Chrome

```
Input text
  -> ClassifyApiRequest { text: "..." }
  -> FormatString: "%s " -> "text "
  -> BOS + tokenize -> embedder -> prefill/decode -> logits -> sample
  -> ParseResponse -> ClassifyApiResponse { output: "..." }
```

---

## 5. Observed Inference Behavior

### Text Generation

When provided with text followed by ``, the model **echoes and reformats** the input:

| Input | Generated Output |
|-------|-----------------|
| `best laptop deals gaming electronics` | `Best laptop deals: gaming electronics` |
| `Italian cooking recipes pasta carbonara` | `Italian cooking recipes Pasta carbonara` |
| `NBA basketball scores live updates` | `NBA Basketball Scores Live Update` |
| `stock market investment trading` | `Stock market investment trading` |

Across the 4 tests performed, the output contains capitalization and punctuation changes relative to the input. No classification labels or category identifiers were observed in any tested output.
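The embedder -> prefill -> decode flow above, together with the cache-replacement rule from §2, can be sketched with TensorFlow's TFLite interpreter. This is a sketch under stated assumptions, not Chrome's implementation: the KV-cache input names (matched here by the substring `"cache"`), the cache length sized for the whole run, the greedy sampling, and the local file paths are all guesses; only the signature names and tensor shapes come from the §2 tables.

```python
import os

def generate(token_ids, weights_dir=".", steps=16):
    """Sketch of the embedder -> prefill -> decode loop.

    Key observed behavior: the 36 KV caches returned by each call REPLACE
    the previous caches; they are never concatenated.
    """
    import numpy as np
    import tensorflow as tf

    embedder = tf.lite.Interpreter(
        model_path=os.path.join(weights_dir, "embedder.tflite")).get_signature_runner()
    pd = tf.lite.Interpreter(
        model_path=os.path.join(weights_dir, "prefill_decode.tflite"))
    prefill = pd.get_signature_runner("prefill")
    decode = pd.get_signature_runner("decode")

    def embed(tok):  # [1, 1] int32 -> [1, 1, 1024] float32
        return embedder(token_ids=np.array([[tok]], dtype=np.int32))["embeddings"]

    # Prefill every prompt token except the last; decode then starts from it
    # (prefill emits no logits, so the last token must go through decode).
    n = len(token_ids) - 1
    embs = np.concatenate([embed(t) for t in token_ids[:n]], axis=1)
    mask = np.triu(np.full((n, n), -1e9, dtype=np.float32), k=1)[None, None]

    # Zero-init caches sized for the whole run (an assumption; the §2 tables
    # tie the prefill cache length to N), then replace after every call.
    kv_names = [k for k in prefill.get_input_details() if "cache" in k.lower()]
    kv = {k: np.zeros((1, n + steps, 1, 256), dtype=np.float32) for k in kv_names}
    out = prefill(embeddings=embs, input_pos=np.arange(n, dtype=np.int32),
                  mask=mask, **kv)
    kv = {k: out[k] for k in kv_names}

    tokens, pos = list(token_ids), n
    for _ in range(steps):
        out = decode(embeddings=embed(tokens[-1]),
                     input_pos=np.array([pos], dtype=np.int32),
                     mask=np.zeros((1, 1, 1, 1), dtype=np.float32), **kv)
        kv = {k: out[k] for k in kv_names}                  # replace, again
        tokens.append(int(out["logits"][0, -1].argmax()))   # greedy sampling
        pos += 1
    return tokens

if os.path.exists("embedder.tflite") and os.path.exists("prefill_decode.tflite"):
    print(generate([2, 260, 261]))  # BOS (id 2) plus arbitrary text ids
```

Swapping the two dict rebuilds for concatenation is what breaks generation, which is how the replace-not-concatenate behavior in §2 was established.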
### Custom Tokens in Generation

| Metric | Value |
|--------|-------|
| Custom token max logit | -19 to -21 |
| Regular token max logit | +19 to +27 |
| Gap | ~37 to 46 logit units |
| Custom token entropy ratio | 1.0000 (uniform) |

Across all tests performed, custom tokens (16014-16383) had logits between -19 and -21, producing negligible probability after softmax.

### Embedding Quality (Layer 17 KV cache as representation)

| Metric | Last Position | Mean-Pooled |
|--------|--------------|-------------|
| Within-category similarity | 0.849 | 0.783 |
| Across-category similarity | 0.805 | 0.756 |
| Separation gap | 0.044 | 0.027 |

Within-category similarity exceeds across-category similarity by 0.044 (last position) and 0.027 (mean-pooled).

### Performance

| Operation | Duration (~16 tokens) |
|-----------|-----------------------|
| Decode only (token by token) | ~2.0--2.4s |
| Prefill batch | ~1.3s |
| Prefill + full decode | ~3.0s |

---

## 6. Weight Analysis

### Embedding Table: Custom vs Regular Tokens

| Metric | Custom (370) | Regular (370) | Multimodal (14) |
|--------|-------------|---------------|-----------------|
| Intra-group cosine | **0.999937** | 0.051 | 0.886 |
| L2 norm mean | 32.045 | 29.616 | 32.029 |
| L2 norm std | **0.007** | 3.019 | 0.077 |
| Per-dimension variance (mean) | **0.0004** | 6.600 | -- |
| Regular/custom variance ratio | **14,817x** | -- | -- |
| K-means k=50: distinct clusters | 33 (collapse) | 50 | -- |
| DBSCAN: clusters (eps x1.0) | 2 | 1 | 1 |

K-means at k=50 collapses to 33 clusters (ConvergenceWarning). DBSCAN at eps x1.0 finds 2 clusters. No significant intra/inter-group separation detected for the grid patterns tested (10x37, 37x10, 74x5, 185x2). Multimodal tokens (16000-16013) show 2 sub-clusters with cosine ~0.20 between groups. Custom tokens show no sub-clusters.
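The intra-group cosine and norm statistics in the table above reduce to a few lines of NumPy. A self-contained sketch on synthetic stand-ins (the real inputs would be the corresponding rows of the embedding table extracted from `weights.bin`): a block of near-identical rows mimics the custom tokens, independent random rows mimic regular tokens.

```python
import numpy as np

def group_stats(rows):
    """Mean pairwise cosine similarity plus L2-norm mean/std for a group of
    embedding rows (the metrics in the custom/regular/multimodal table)."""
    norms = np.linalg.norm(rows, axis=1)
    unit = rows / norms[:, None]
    cos = unit @ unit.T
    upper = np.triu_indices(len(rows), k=1)   # all pairs, excluding the diagonal
    return cos[upper].mean(), norms.mean(), norms.std()

rng = np.random.default_rng(0)
# Synthetic stand-ins: near-identical rows mimic the custom-token block,
# independent rows mimic ordinary trained tokens.
base = rng.normal(size=(1, 1024))
custom = base + 1e-3 * rng.normal(size=(370, 1024))
regular = rng.normal(size=(370, 1024))

c_cos, c_mean, c_sd = group_stats(custom)
r_cos, r_mean, r_sd = group_stats(regular)
print(f"custom : cos={c_cos:.6f}  norm={c_mean:.3f} +/- {c_sd:.4f}")
print(f"regular: cos={r_cos:.6f}  norm={r_mean:.3f} +/- {r_sd:.4f}")
```

Run against the real matrix (rows 16014-16383 versus a 370-row sample of regular tokens), the same function should yield the contrast reported above (0.999937 vs 0.051).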
### Transformer Layers: Weight Distribution

| Metric | Value (54 FFN tensors, 18 layers) |
|--------|-----------------------------------|
| Mean std | 2.6252 |
| Std of stds | 0.0093 |
| Coefficient of variation | **0.36%** |
| % outliers | 0.00% (all layers) |
| % zeros | 15.3--16.4% |

0.36% coefficient of variation across the 54 FFN tensors. Differences between early, mid, and late layers:

| Weight type | Early (0-5) | Mid (6-11) | Late (12-17) | Variation |
|-------------|-------------|------------|--------------|-----------|
| gate_proj std | 2.6319 | 2.6243 | 2.6307 | 0.8% |
| up_proj std | 2.6314 | 2.6246 | 2.6318 | 0.8% |
| down_proj std | 2.6217 | 2.6090 | 2.6214 | 1.3% |

`o_proj` has std of 18.49 (7x the FFN layers), constant across all 18 layers (18.492--18.496).

### Normalization Layers

- Layer 0: all RMSNorm values are zero
- Layers 1-17: RMSNorm values range from 3.89 to 14.02

---

## 7. Comparison with Other Chrome Targets

### Target 72 vs Targets 43 and 74

| Attribute | Target 43 (Embedder) | Target 72 (Classifier) | Target 74 (Shopping) |
|-----------|---------------------|----------------------|---------------------|
| Optimization target | `PASSAGE_EMBEDDER` | `CLASSIFIER` | `SHOPPING_CLASSIFIER` |
| Timestamp in model-info.pb / version | 2024-06-25 (timestamp) | 2026-02-12 (inferred from version `2026.2.12.1554`) | 2026-03-30 (timestamp) |
| Config populated | Yes (147 bytes) | **No (0 bytes)** | Yes (121 bytes) |
| model-info.pb | Present | **Absent** | Present |
| Metadata type | `PassageEmbeddingsModelMetadata` | None | `CategoryClassifierMetadata` |
| Architecture | Sentence embedder TFLite | TinyGemma LLM 18 layers | Linear classifier head TFLite |
| Model size | 112 MB | 120 MB | 789 KB |
| Input | int32[1,64] token IDs | Text + `` | float32[1,1536] embeddings |
| Output | float32[1,768] embedding | Generated text | float32[1,1] score |
| Tokenizer | sentencepiece.model (794 KB) | tokenizer.spm (313 KB) | Uses target 43's |
| Input window | 64 tokens | ~10240 tokens | N/A |
| Depends on another target | No | No | Yes (target 43, v >= 2026-02-02) |

Target 74's metadata declares a dependency on target 43 via `required_embedder_version`. Target 74's input shape (1536-dim) matches the concatenation of two 768-dim embeddings from target 43. Target 72 has no declared dependency.

### Target 72 vs Target 62 (Proofreader)

| Attribute | Target 62 | Target 72 |
|-----------|-----------|-----------|
| Own model | No (uses shared Gemini Nano ~1.7 GB) | Yes (standalone TinyGemma 120 MB) |
| LoRA adapter | Yes (`adaptation_weights.bin`, 16.6 MB) | No |
| Config | Non-empty | Empty (0 bytes) |

### Legacy Topics System: PAGE_TOPICS_V2

| Attribute | PAGE_TOPICS_V2 | Target 72 |
|-----------|---------------|-----------|
| Target ID | 15 | 72 |
| Architecture | BERT (BertNLClassifier) | TinyGemma (Gemma) |
| Size | ~2.7 MB | ~120 MB |
| Input | Cleaned hostname | Full page content |
| Output | 469 topic probabilities | Generated text |
| Taxonomy | v2, 469 topics (IDs 1-629) | None in the component |
| Override list | ~47,128 domain-to-topic mappings | None |

---

## 8. Verified External Context

### Topics API: Deprecation Status

- **October 17, 2025**: Google announces retirement of the Topics API, Protected Audience, and Attribution Reporting
- **Deprecation** scheduled for Chrome 144
- **Full removal** scheduled for Chrome 150
- **Reason**: insufficient ecosystem adoption
- Source: [Privacy Sandbox Update](https://privacysandbox.google.com/blog/update-on-plans-for-privacy-sandbox-technologies)

### Classifier API: Public Proposal (GitHub)

An explainer published by the Chrome Built-in AI Team exists on GitHub:

- **Repo**: [`explainers-by-googlers/classifier-api`](https://github.com/explainers-by-googlers/classifier-api)
- **Description**: "A proposal for a high-performance, on-device browser API to classify text."
- **Status**: "a tentative and early design sketch [...] to solicit feedback"

Proposal contents:

| Element | Detail |
|---------|--------|
| JavaScript API | `Classifier.availability()`, `Classifier.create()`, `classifier.classify(text)` |
| Taxonomy | IAB Content Taxonomy V3.1 (experimental default) |
| IAB v3.1 size | 706 total categories, 588 leaf nodes ([source TSV](https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/Content%20Taxonomies/Content%20Taxonomy%203.1.tsv)) |
| Output | Array of `{id: string, confidence: number}` (ID-based, not human-readable) |
| Model | Described as "dedicated, on-device expert model optimized for high-speed classification" |
| Target latency | "Millisecond-level inference" |
| vs generic LLM | "A generic LLM is 'overkill' for classification" |
| Privacy | "Raw text never leaves the browser", "stateless by design" |
| Model updates | "managed by the browser component updater" |
| Non-goals | Summarization, translation, sentiment analysis; human-readable string outputs |
| Out of MVP scope | Streaming, multilingual, quota API, download progress, AbortSignal |
| Language | English only (MVP) |
| Author | Chrome Built-in AI Team |

### Shopping Classifier: Working Pipeline (Target 43+74)

Documented by Dejan Petrovic ([blog](https://dejan.ai/blog/google-shopping-classifier/), [HuggingFace](https://huggingface.co/dejanseo/chrome_models)):

| Step | Detail |
|------|--------|
| Chunking | 100 words max per passage, 10 passages max, truncated to 64 tokens |
| Embedding | Target 43: text -> 768-dim embedding |
| Title+URL | Concatenated into a single embedding |
| Classifier input | `[title_emb(768) \|\| mean_pool(passage_embs)(768)]` = 1536 dim |
| Output | Score 0--1 (shopping probability) |
| Storage | `VisitContentAnnotations` in Chrome's history database |
| Usage | Aggregated via `OPTIMIZATION_TARGET_SEGMENTATION_SHOPPING_USER` to enable commerce features |

### LiteRT-LM

Open-source C++ framework by Google for on-device language model inference.
Used in Chrome, Chromebook Plus, and Pixel Watch according to Google. Handles KV cache, session cloning, and CPU/GPU/NPU acceleration. The `LITERTLM` container format in which taxonomy-tiny is packaged corresponds to this framework.

Source: [Google Developers Blog](https://developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/)

---

## 9. What the Facts Do Not Tell Us

The following points are not resolved by available data:

1. **The link between target 72 and the Classifier API** -- The GitHub explainer does not name the underlying model. Target 72 has no reference to the Classifier API in its files. The correspondence is plausible but unconfirmed.
2. **The intended use of the 370 custom tokens** -- They exist in the vocabulary and are untrained. IAB Content Taxonomy V3.1 has 706 categories (588 leaf nodes); 370 does not match either count. Their correspondence to any taxonomy is not documented anywhere.
3. **Why a generative LLM rather than an embedding-based classifier** -- The explainer says "expert model", not "LLM". It says "millisecond-level inference", which does not match the ~3s observed on target 72. The explainer's output format is structured (`{id, confidence}`), while the Chromium controller reads a plain string (`ClassifyApiResponse { string output = 1; }`).
4. **The relationship with the Topics API** -- Target 72 appeared (2026-02-12) after the Topics API retirement announcement (2025-10-17). Both involve taxonomies and page classification, but no source code explicitly links them.
5. **When or whether fine-tuning will be deployed** -- The model is a pre-trained base model. No public information exists on a fine-tuning timeline.
6. **The semantics of `BaseModelSpec.version = "0.0.0.0"`** -- This format is not documented anywhere. Other Chrome components use normal numeric versions.

---

## 10. References

| Source | Type | URL / Path |
|--------|------|------------|
| Chrome component (CRX) | Local file | `classifier_analysis/classifier.crx3` |
| Manifest.json | Local file | `classifier_analysis/extracted/manifest.json` |
| Installer source | Chromium | [optimization_guide_on_device_model_installer.cc](https://chromium.googlesource.com/experimental/chromium/src/+/refs/heads/main/chrome/browser/component_updater/optimization_guide_on_device_model_installer.cc) |
| Controller source | Chromium | `on_device_model_classifier_controller.cc` (read via Chromium Code Search, `experimental/chromium/src` branch, no stable direct URL) |
| Classifier API explainer | GitHub | [explainers-by-googlers/classifier-api](https://github.com/explainers-by-googlers/classifier-api) |
| Shopping classifier blog | Web | [dejan.ai/blog/google-shopping-classifier](https://dejan.ai/blog/google-shopping-classifier/) |
| Privacy Sandbox update | Web | [privacysandbox.google.com/blog/update-on-plans](https://privacysandbox.google.com/blog/update-on-plans-for-privacy-sandbox-technologies) |
| Topics deprecation | Chromium Groups | [topics-api-announce](https://groups.google.com/a/chromium.org/g/topics-api-announce/c/iQX7PC3S0Ds) |
| LiteRT-LM blog | Web | [developers.googleblog.com](https://developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/) |
| Mission 1 report | Local file | `results/MISSION_01/target_72_inventory.md` |
| Mission 2 report | Local file | `results/MISSION_02/target_configs_comparison.md` |
| Mission 4 report | Local file | `results/MISSION_04/weight_variance_analysis.md` |
| Mission 5 report | Local file | `results/MISSION_05/custom_tokens_clustering.md` |