# Multilingual V-Score Analysis with SAE Features
MVA project.

**Contributors:** Yannis Kolodziej, Tom Mariani, Hai Pham, Valentin Smague
This project investigates how Sparse Autoencoder (SAE) features inside large language models (LLMs) encode language identity. The core metric is the v-score: a per-feature, per-language score that measures how much an SAE feature activates on one language compared to the average across all other languages.
## What is the V-Score?

For a given SAE feature `f` and language `L` (over a set of `K` languages):

```
v(f, L) = mean_activation(f, L) - mean( mean_activation(f, L') for L' ≠ L )
```
Features with a high v-score for language L are considered language-specific to L. By sorting features by their v-score, we obtain a ranked list of the most language-discriminative SAE features per layer.
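The definition above can be sketched in a few lines of plain Python. This is a toy illustration only (the repo's actual implementation works on torch tensors); `mean_act` is a hypothetical `K × F` matrix of per-language mean activations:

```python
def v_scores(mean_act):
    """mean_act: K x F nested list, mean activation of each SAE feature per language.
    Returns a K x F matrix of v-scores:
    v(f, L) = mean_act[L][f] - mean(mean_act[L'][f] for L' != L)."""
    K = len(mean_act)
    v = []
    for L in range(K):
        row = []
        for f in range(len(mean_act[L])):
            # Mean activation of feature f over all *other* languages
            others = sum(mean_act[Lp][f] for Lp in range(K) if Lp != L) / (K - 1)
            row.append(mean_act[L][f] - others)
        v.append(row)
    return v

# Feature 0 fires only on language 0, feature 1 on languages 1 and 2:
v = v_scores([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
# v[0][0] == 1.0 -> feature 0 is highly specific to language 0
```

Sorting each language's row of `v` in descending order yields exactly the ranked feature lists the repo saves per layer.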
## Project Structure

```
MVA-SNLP/
├── compute_v_scores.py            # CLI: compute & save v-scores for any model/SAE/language set
├── visualize_v_scores.ipynb       # Visualize saved v-score runs (Figure 1 reproduction + insights)
├── sae_feature_exploration.ipynb  # Hugging Face-based interactive SAE feature exploration
├── extended_visualization.ipynb   # Extended visualizations and additional analyses
├── code_switching_analysis.ipynb  # Code-switching analysis on specific words
│
├── scripts/                       # Ready-to-run bash scripts for each experiment
│   ├── run_gemma_reprod.sh                 # Reproduce Figure 1 (Gemma-2B, 10 languages, 100 texts)
│   ├── run_gemma_diverse_langs_all.sh      # Insight 1: all texts, 10 diverse languages
│   ├── run_gemma_diverse_langs_small.sh    # Insight 1: quick run (25 texts/language)
│   ├── run_gemma_similar_langs.sh          # Insight 2: similar languages (es/pt/gl/ca)
│   ├── run_gemma_underrepresented_langs.sh # Insight 4: underrepresented languages
│   └── run_qwen_reprod.sh                  # Reproduction with Qwen3-0.6B
│
└── v_score_runs/                  # Saved results (meta.json + v_scores.pt per run)
    ├── run_reprod_fig_1/
    ├── run_insight_1_all/
    ├── run_insight_1_small/
    ├── run_insight_2/
    ├── run_insight_4/
    └── qwen_run_reprod_fig_1/
```
## Supported Models

| Alias | Model | SAE Release |
|---|---|---|
| `gemma-2b` | `google/gemma-2-2b` | `gemma-scope-2b-pt-res-canonical` |
| `qwen3-0.6b` | `Qwen/Qwen3-0.6B` | `mwhanna-qwen3-0.6b-transcoders-lowl0` |
## Quick Start

### 1. Install dependencies

```bash
pip install torch transformers datasets sae-lens matplotlib
```
### 2. Compute v-scores (CLI)

Reproduce Figure 1 (Gemma-2B, 10 languages):

```bash
bash scripts/run_gemma_reprod.sh
```

Or run directly with custom settings:

```bash
python compute_v_scores.py compute \
    --model gemma-2b \
    --languages eng_Latn,fra_Latn,jpn_Jpan,cmn_Hans \
    --layers 0,5,10,15,20 \
    --n-texts-per-lang 100 \
    --out-dir ./v_score_runs/my_run
```
Use a custom model/SAE not in the presets:

```bash
python compute_v_scores.py compute \
    --model custom \
    --model-id "your/hf-model-id" \
    --sae-release "your-sae-release" \
    --sae-id-template "layer_{layer}" \
    --languages eng_Latn,fra_Latn \
    --layers 0,5,10 \
    --out-dir ./v_score_runs/custom_run
```
### 3. Visualize results

Open `visualize_v_scores.ipynb` and point it to any `v_score_runs/<run_name>/` directory. The notebook loads `meta.json` and `v_scores.pt` and renders:

- Top language-specific features per layer
- Feature activation heatmaps across languages
- V-score distributions
## CLI Reference

```
python compute_v_scores.py compute [OPTIONS]

Options:
  --model              Preset: gemma-2b | qwen3-0.6b | custom
  --model-id           Override HuggingFace model ID (for --model custom)
  --sae-release        Override sae_lens release name
  --sae-id-template    Template string with {layer}, e.g. "layer_{layer}/width_16k/canonical"
  --languages          Comma-separated flores_plus language codes
  --layers             Comma-separated layer indices to analyse
  --n-texts-per-lang   Number of FLORES+ texts per language (default: 100, -1 = all)
  --split              FLORES+ split: dev | devtest (default: dev)
  --out-dir            Output directory for meta.json and v_scores.pt
  --device             cuda | cpu (default: auto-detect)
```
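The interplay between `--model` presets and the `--model custom` overrides can be pictured as a small lookup table. This is a hypothetical sketch built from the Supported Models table above, not the CLI's actual code (`PRESETS` and `resolve_model` are illustrative names):

```python
# Hypothetical preset table mirroring the --model aliases; with
# --model custom, the explicit --model-id / --sae-release flags win.
PRESETS = {
    "gemma-2b": ("google/gemma-2-2b", "gemma-scope-2b-pt-res-canonical"),
    "qwen3-0.6b": ("Qwen/Qwen3-0.6B", "mwhanna-qwen3-0.6b-transcoders-lowl0"),
}

def resolve_model(model, model_id=None, sae_release=None):
    """Return (model_id, sae_release) for a preset alias, or the explicit overrides."""
    if model == "custom":
        return model_id, sae_release
    return PRESETS[model]

resolve_model("gemma-2b")
# -> ("google/gemma-2-2b", "gemma-scope-2b-pt-res-canonical")
```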
## Language Codes

Languages are specified as FLORES+ codes (`lang_Script`). Examples:

| Code | Language |
|---|---|
| `eng_Latn` | English |
| `fra_Latn` | French |
| `spa_Latn` | Spanish |
| `por_Latn` | Portuguese |
| `jpn_Jpan` | Japanese |
| `cmn_Hans` | Chinese (Simplified) |
| `kor_Hang` | Korean |
| `tha_Thai` | Thai |
| `vie_Latn` | Vietnamese |
| `kas_Arab` | Kashmiri (Arabic script) |
| `wuu_Hans` | Wu Chinese |
| `azb_Arab` | South Azerbaijani |
| `nus_Latn` | Nuer |
| `arg_Latn` | Aragonese |
| `glg_Latn` | Galician |
| `cat_Latn` | Catalan |

Full list: FLORES+ dataset
## Saved Run Format

Each run in `v_score_runs/` contains:

- `meta.json`: run configuration (model, languages, layers, etc.)
- `v_scores.pt`: PyTorch file with structure:

```
{
  "layers": {
    "0": {"top_index_per_lan": Tensor[K, F], "top_values_per_lan": Tensor[K, F]},
    "5": {...},
    ...
  }
}
```

where `K` = number of languages and `F` = number of SAE features, sorted by v-score descending.
Load a saved run programmatically:

```python
from compute_v_scores import load_v_score_run

meta, layers = load_v_score_run("./v_score_runs/run_reprod_fig_1")
top_indices = layers[10]["top_index_per_lan"]  # layer 10, shape [K, F]
```
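Because each language's row in `top_index_per_lan` is already sorted by v-score, extracting the top-k language-specific features is just a slice. A toy example with plain lists standing in for tensors (`top_k_features` is a hypothetical helper, not part of the repo's API):

```python
def top_k_features(layers, layer, lang_idx, k=5):
    """Top-k feature indices for one language at one layer.
    Assumes rows are already sorted by v-score descending, as in v_scores.pt."""
    return layers[layer]["top_index_per_lan"][lang_idx][:k]

# Toy stand-in for a loaded run (real runs store torch tensors):
layers = {10: {"top_index_per_lan": [[7, 3, 1, 0, 2, 5]]}}
top_k_features(layers, 10, 0, k=3)  # -> [7, 3, 1]
```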
## Experiments

| Script | Description | Languages | Layers |
|---|---|---|---|
| `run_gemma_reprod.sh` | Reproduce Figure 1 | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_diverse_langs_all.sh` | Insight 1: all FLORES texts | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_diverse_langs_small.sh` | Insight 1: quick (25 texts) | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_similar_langs.sh` | Insight 2: similar languages | es/pt/gl/ca | 0,2,5,10,15,20 |
| `run_gemma_underrepresented_langs.sh` | Insight 4: low-resource languages | 4 rare | 0,2,5,10,15,20 |
| `run_qwen_reprod.sh` | Same setup on Qwen3-0.6B | 10 diverse | 0,2,5,10,15,20 |
## Related: Part 6 (Steering Vectors)

The v-score runs produced by this repo feed into a companion project that extends the analysis into active language steering using SAE-gated steering vectors:

`siemovit/snlp`: Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders (Part 6 experiments)
That repo implements three experiments on top of the v-scores:
| Experiment | Entry point | What it does |
|---|---|---|
| Baseline steering | `part_6/baseline_experiment.py` | One-layer toy steering demo |
| Adversarial Language Identification (LID) | `part_6/lid_experiment.py` | Steers the model to generate in a target language; measures first-token LID accuracy |
| Cross-Lingual Continuation (CLC) | `part_6/clc_experiment.py` | Prompts in one language, steers the continuation into another |
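The common thread in all three experiments is additive activation steering: shift a hidden state at one layer along a language direction, scaled by `alpha` (the same knob as the `--alpha` CLI flag below). A minimal illustrative sketch, not the actual SAE-gated implementation in `part_6`:

```python
def steer(hidden, direction, alpha):
    """Shift a hidden-state vector along a language direction, scaled by alpha.
    Toy list version; real steering operates on per-layer activation tensors."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

steer([1.0, 2.0], [0.5, -0.5], alpha=2.0)  # -> [2.0, 1.0]
```

Larger `alpha` pushes generations harder toward the target language, at the cost of fluency; the SAE gating in the companion repo decides *where* (which tokens/positions) the shift is applied.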
### Using v-score runs from this repo in the steering repo

The steering repo can import saved v-score runs directly via its export utility:

```bash
cd /path/to/siemovit-snlp
uv run python -m part_6.export_v_scores \
    --run-dir ../MVA-SNLP/v_score_runs/run_reprod_fig_1 \
    --top-k 5
# → results/csv/v_scores_run_reprod_fig_1_top5.csv
```
### Quick start for the steering experiments

```bash
git clone https://github.com/siemovit/snlp.git
cd snlp
uv sync
uv run python download.py --model-name gemma-2-2b

# Adversarial Language Identification (French → English, Gemma-2B)
uv run python -m part_6.lid_experiment \
    --model-name gemma-2-2b \
    --source-lang fr \
    --target-lang en \
    --base-layer 20 \
    --alpha 0.5

# Cross-Lingual Continuation (French → English, Qwen3)
uv run python -m part_6.clc_experiment \
    --model-name qwen \
    --source-lang fr \
    --target-lang en \
    --base-layer 18 \
    --alpha 10.0
```

Note: The LID experiment is memory-heavy. On a Tesla V100, `--train-n 100` can cause OOM; start with the default `--train-n 20` and scale up carefully.
## Related: Extended Analysis (Ablation, Clustering & Synergy)

`VSmague/NLP`: Extended experiments by Valentin Smague covering ablation studies, feature clustering, and cross-language synergy analysis built on top of the v-scores from this repo.
That repo covers four additional directions:
| Analysis | Script / Notebook | What it does |
|---|---|---|
| Feature ablation | `ablation.py`, `SNLP_ablation_clean.ipynb` | Ablates top language-specific SAE features and measures the effect on model behavior; produces per-language specificity plots |
| Language clustering | `compute_clusters.py`, `compute_matrix.py` | Clusters languages by their v-score feature overlap using MDS and similarity matrices |
| Cross-language synergy | `cross_language_synergy.py` | Measures how much top features for one language also activate on other languages (feature sharing / synergy) |
| Visualization | `visualisation.py`, `reprod.py` | Reproduces v-score bar charts (Figure 1 style) and generates additional plots |
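At its simplest, the synergy question is "how much do two languages' top-feature sets overlap?". A hypothetical sketch of such an overlap measure (the actual metric in `cross_language_synergy.py` may differ):

```python
def top_k_overlap(top_a, top_b, k):
    """Fraction of language A's top-k feature indices that also appear
    in language B's top-k (symmetric, since both sets have size k)."""
    return len(set(top_a[:k]) & set(top_b[:k])) / k

# Two of language A's top-4 features also rank in language B's top-4:
top_k_overlap([1, 2, 3, 4], [3, 4, 5, 6], k=4)  # -> 0.5
```

Computing this for every language pair yields the similarity matrices that the clustering scripts feed into MDS.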
Key outputs stored in the repo:

- `v_scores.png`: reproduced v-score figure
- `ablation_fr.png`, `ablation_specificity.png`: ablation results for French
- `clustering_best.png`, `clustering_comparison.png`, `clustering_mds.png`: language clustering visualizations
- `plots/`, `plots_interaction/`, `plots_synergy/`: full plot collections
- `sae_features/`: saved SAE feature data
- `figures_section5/`: figures for section 5 of the report