Multilingual V-Score Analysis with SAE Features

An MVA project by Yannis Kolodziej, Tom Mariani, Hai Pham, and Valentin Smague.

This project investigates how Sparse Autoencoder (SAE) features inside large language models (LLMs) encode language identity. The core metric is the v-score: a per-feature, per-language score that measures how much a SAE feature activates on one language compared to the average across all other languages.

What is the V-Score?

For a given SAE feature f and language L (over a set of K languages):

v(f, L) = mean_activation(f, L) - mean( mean_activation(f, L') for L' ≠ L )

Features with a high v-score for language L are considered language-specific to L. By sorting features by their v-score, we obtain a ranked list of the most language-discriminative SAE features per layer.
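Given the formula above, v-scores for all features and languages can be computed in one vectorized step. A minimal sketch (not the repo's actual implementation), assuming a precomputed [K, F] matrix of per-language mean activations:

```python
import torch

def v_scores(mean_acts: torch.Tensor) -> torch.Tensor:
    """mean_acts: [K, F] mean activation of each SAE feature per language.
    Returns [K, F]: each entry is the activation on language L minus the
    mean activation over the other K-1 languages."""
    K = mean_acts.shape[0]
    total = mean_acts.sum(dim=0, keepdim=True)      # [1, F]
    other_mean = (total - mean_acts) / (K - 1)      # mean over L' != L
    return mean_acts - other_mean

# Toy example: 3 languages, 4 features; feature 0 fires only on language 0.
acts = torch.tensor([[1.0, 0.2, 0.2, 0.2],
                     [0.0, 0.2, 0.2, 0.2],
                     [0.0, 0.2, 0.2, 0.2]])
v = v_scores(acts)
print(v[0])  # feature 0 gets a high v-score for language 0; the rest are 0
```

Sorting each language's row of `v` in descending order yields the ranked feature lists that the repo saves per layer.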

Project Structure

MVA-SNLP/
├── compute_v_scores.py         # CLI: compute & save v-scores for any model/SAE/language set
├── visualize_v_scores.ipynb    # Visualize saved v-score runs (Figure 1 reproduction + insights)
├── sae_feature_exploration.ipynb     # Hugging Face-based interactive SAE feature exploration
├── extended_visualization.ipynb      # Extended visualizations and additional analyses
├── code_switching_analysis.ipynb     # Code-switching analysis on specific words
│
├── scripts/                          # Ready-to-run bash scripts for each experiment
│   ├── run_gemma_reprod.sh           # Reproduce Figure 1 (Gemma-2B, 10 languages, 100 texts)
│   ├── run_gemma_diverse_langs_all.sh    # Insight 1: all texts, 10 diverse languages
│   ├── run_gemma_diverse_langs_small.sh  # Insight 1: quick run (25 texts/language)
│   ├── run_gemma_similar_langs.sh    # Insight 2: similar languages (es/pt/gl/ca)
│   ├── run_gemma_underrepresented_langs.sh  # Insight 4: underrepresented languages
│   └── run_qwen_reprod.sh            # Reproduction with Qwen3-0.6B
│
└── v_score_runs/                     # Saved results (meta.json + v_scores.pt per run)
    ├── run_reprod_fig_1/
    ├── run_insight_1_all/
    ├── run_insight_1_small/
    ├── run_insight_2/
    ├── run_insight_4/
    └── qwen_run_reprod_fig_1/

Supported Models

| Alias | Model | SAE Release |
|---|---|---|
| gemma-2b | google/gemma-2-2b | gemma-scope-2b-pt-res-canonical |
| qwen3-0.6b | Qwen/Qwen3-0.6B | mwhanna-qwen3-0.6b-transcoders-lowl0 |

Quick Start

1. Install dependencies

pip install torch transformers datasets sae-lens matplotlib

2. Compute v-scores (CLI)

Reproduce Figure 1 (Gemma-2B, 10 languages):

bash scripts/run_gemma_reprod.sh

Or run directly with custom settings:

python compute_v_scores.py compute \
    --model gemma-2b \
    --languages eng_Latn,fra_Latn,jpn_Jpan,cmn_Hans \
    --layers 0,5,10,15,20 \
    --n-texts-per-lang 100 \
    --out-dir ./v_score_runs/my_run

Use a custom model/SAE not in the presets:

python compute_v_scores.py compute \
    --model custom \
    --model-id "your/hf-model-id" \
    --sae-release "your-sae-release" \
    --sae-id-template "layer_{layer}" \
    --languages eng_Latn,fra_Latn \
    --layers 0,5,10 \
    --out-dir ./v_score_runs/custom_run

3. Visualize results

Open visualize_v_scores.ipynb and point it to any v_score_runs/<run_name>/ directory. The notebook loads meta.json and v_scores.pt and renders:

  • Top language-specific features per layer
  • Feature activation heatmaps across languages
  • V-score distributions

CLI Reference

python compute_v_scores.py compute [OPTIONS]

Options:
  --model              Preset: gemma-2b | qwen3-0.6b | custom
  --model-id           Override HuggingFace model ID (for --model custom)
  --sae-release        Override sae_lens release name
  --sae-id-template    Template string with {layer}, e.g. "layer_{layer}/width_16k/canonical"
  --languages          Comma-separated flores_plus language codes
  --layers             Comma-separated layer indices to analyse
  --n-texts-per-lang   Number of FLORES+ texts per language (default: 100, -1 = all)
  --split              FLORES+ split: dev | devtest (default: dev)
  --out-dir            Output directory for meta.json and v_scores.pt
  --device             cuda | cpu (default: auto-detect)

Language Codes

Languages are specified as FLORES+ codes (lang_Script). Examples:

| Code | Language |
|---|---|
| eng_Latn | English |
| fra_Latn | French |
| spa_Latn | Spanish |
| por_Latn | Portuguese |
| jpn_Jpan | Japanese |
| cmn_Hans | Chinese (Simplified) |
| kor_Hang | Korean |
| tha_Thai | Thai |
| vie_Latn | Vietnamese |
| kas_Arab | Kashmiri (Arabic script) |
| wuu_Hans | Wu Chinese |
| azb_Arab | South Azerbaijani |
| nus_Latn | Nuer |
| arg_Latn | Aragonese |
| glg_Latn | Galician |
| cat_Latn | Catalan |

Full list: FLORES+ dataset

Saved Run Format

Each run in v_score_runs/ contains:

  • meta.json β€” run configuration (model, languages, layers, etc.)
  • v_scores.pt β€” PyTorch file with structure:
    {
      "layers": {
        "0":  {"top_index_per_lan": Tensor[K, F], "top_values_per_lan": Tensor[K, F]},
        "5":  {...},
        ...
      }
    }
    
    where K = number of languages and F = number of SAE features, sorted by v-score descending.

Load a saved run programmatically:

from compute_v_scores import load_v_score_run
meta, layers = load_v_score_run("./v_score_runs/run_reprod_fig_1")
top_indices = layers[10]["top_index_per_lan"]   # layer 10, shape [K, F]
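Once a run is loaded, extracting the most language-specific features is a slice along the feature axis. A minimal sketch with dummy tensors standing in for a real run (`top_k_features` is a hypothetical helper, not part of the repo):

```python
import torch

# Dummy stand-in for a loaded run: K=2 languages, F=6 features.
# Each row is already sorted by v-score descending, as saved by compute_v_scores.py.
layers = {
    10: {
        "top_index_per_lan": torch.tensor([[3, 0, 5, 1, 2, 4],
                                           [4, 2, 0, 5, 1, 3]]),
        "top_values_per_lan": torch.tensor([[2.1, 1.4, 0.9, 0.1, -0.2, -0.8],
                                            [1.7, 1.1, 0.6, 0.0, -0.3, -0.9]]),
    }
}

def top_k_features(layers, layer, lang_idx, k=3):
    """Return the k most language-specific feature indices for one language at one layer."""
    return layers[layer]["top_index_per_lan"][lang_idx, :k].tolist()

print(top_k_features(layers, 10, 0))  # the 3 highest-v-score features for language 0
```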

Experiments

| Script | Description | Languages | Layers |
|---|---|---|---|
| run_gemma_reprod.sh | Reproduce Figure 1 | 10 diverse | 0,2,5,10,15,20 |
| run_gemma_diverse_langs_all.sh | Insight 1: all FLORES texts | 10 diverse | 0,2,5,10,15,20 |
| run_gemma_diverse_langs_small.sh | Insight 1: quick (25 texts) | 10 diverse | 0,2,5,10,15,20 |
| run_gemma_similar_langs.sh | Insight 2: similar languages | es/pt/gl/ca | 0,2,5,10,15,20 |
| run_gemma_underrepresented_langs.sh | Insight 4: low-resource languages | 4 rare | 0,2,5,10,15,20 |
| run_qwen_reprod.sh | Same setup on Qwen3-0.6B | 10 diverse | 0,2,5,10,15,20 |

Related: Part 6 — Steering Vectors

The v-score runs produced by this repo feed into a companion project that extends the analysis into active language steering using SAE-gated steering vectors:

siemovit/snlp — Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders (Part 6 experiments)

That repo implements three experiments on top of the v-scores:

| Experiment | Entry point | What it does |
|---|---|---|
| Baseline steering | part_6/baseline_experiment.py | One-layer toy steering demo |
| Adversarial Language Identification (LID) | part_6/lid_experiment.py | Steers the model to generate in a target language; measures first-token LID accuracy |
| Cross-Lingual Continuation (CLC) | part_6/clc_experiment.py | Prompts in one language, steers the continuation into another |

Using v-score runs from this repo in the steering repo

The steering repo can import saved v-score runs directly via its export utility:

cd /path/to/siemovit-snlp
uv run python -m part_6.export_v_scores \
  --run-dir ../MVA-SNLP/v_score_runs/run_reprod_fig_1 \
  --top-k 5
# → results/csv/v_scores_run_reprod_fig_1_top5.csv

Quick start for the steering experiments

git clone https://github.com/siemovit/snlp.git
cd snlp
uv sync
uv run python download.py --model-name gemma-2-2b

# Adversarial Language Identification (French → English, Gemma-2B)
uv run python -m part_6.lid_experiment \
  --model-name gemma-2-2b \
  --source-lang fr \
  --target-lang en \
  --base-layer 20 \
  --alpha 0.5

# Cross-Lingual Continuation (French → English, Qwen3)
uv run python -m part_6.clc_experiment \
  --model-name qwen \
  --source-lang fr \
  --target-lang en \
  --base-layer 18 \
  --alpha 10.0

Note: The LID experiment is memory-heavy. On a Tesla V100, --train-n 100 can cause OOM; start with the default --train-n 20 and scale up carefully.

Related: Extended Analysis — Ablation, Clustering & Synergy

VSmague/NLP — Extended experiments by Valentin Smague covering ablation studies, feature clustering, and cross-language synergy analysis built on top of the v-scores from this repo.

That repo covers four additional directions:

| Analysis | Script / Notebook | What it does |
|---|---|---|
| Feature ablation | ablation.py, SNLP_ablation_clean.ipynb | Ablates top language-specific SAE features and measures the effect on model behavior; produces per-language specificity plots |
| Language clustering | compute_clusters.py, compute_matrix.py | Clusters languages by their v-score feature overlap using MDS and similarity matrices |
| Cross-language synergy | cross_language_synergy.py | Measures how much top features for one language also activate on other languages (feature sharing / synergy) |
| Visualization | visualisation.py, reprod.py | Reproduces v-score bar charts (Figure 1 style) and generates additional plots |

Key outputs stored in the repo:

  • v_scores.png β€” reproduced v-score figure
  • ablation_fr.png, ablation_specificity.png β€” ablation results for French
  • clustering_best.png, clustering_comparison.png, clustering_mds.png β€” language clustering visualizations
  • plots/, plots_interaction/, plots_synergy/ β€” full plot collections
  • sae_features/ β€” saved SAE feature data
  • figures_section5/ β€” figures for section 5 of the report