---
language:
- en
license: apache-2.0
library_name: llm2ner
base_model: phi-4
tags:
- ner
- span-detection
- llm
- pytorch
pipeline_tag: token-classification
model_name: ToMMeR-phi-4_L3_R64
source: https://github.com/VictorMorand/llm2ner
paper: https://arxiv.org/abs/2510.19410
---

# ToMMeR-phi-4_L3_R64

[Paper](https://arxiv.org/abs/2510.19410) | [Models](https://huggingface.co/llm2ner) | [Code](https://github.com/VictorMorand/llm2ner)

ToMMeR is a lightweight probing model that extracts emergent mention-detection capabilities from the early-layer representations of any LLM backbone, achieving high zero-shot recall across a wide set of 13 NER benchmarks.

## Model Details

This model plugs into layer 3 of `phi-4`, with a computational overhead no greater than a single additional attention head.

| Property | Value |
|-----------|-------|
| Base LLM | `phi-4` |
| Layer | 3 |
| #Params | 660.5K |

# Usage

## Installation

To use ToMMeR, first install its codebase:

```bash
pip install git+https://github.com/VictorMorand/llm2ner.git
```

## Raw inference

By default, ToMMeR outputs span probabilities, but we also provide built-in options for decoding entities. (A sketch for decoding the raw score matrix manually is given at the end of this card.)

- Inputs:
    - tokens (batch, seq): tokens to process,
    - model: the LLM to extract representations from.
- Outputs: a (batch, seq, seq) span-score matrix (masked outside valid spans).

```python
from xpm_torch.huggingface import TorchHFHub
from llm2ner import ToMMeR, utils

tommer: ToMMeR = TorchHFHub.from_pretrained("llm2ner/ToMMeR-phi-4_L3_R64")

# Load the backbone LLM, optionally cutting the unused layers to save GPU memory.
llm = utils.load_llm(tommer.llm_name, cut_to_layer=tommer.layer)
tommer.to(llm.device)

#### Raw inference
text = ["Large language models are awesome"]
print(f"Input text: {text[0]}")

# Tokenize to shape (1, seq_len)
tokens = llm.tokenizer(text, return_tensors="pt")["input_ids"].to(llm.device)

# Output raw span scores
output = tommer.forward(tokens, llm)  # (batch_size, seq_len, seq_len)
print(f"Raw output shape: {output.shape}")

# Use the given decoding strategy to infer entities
entities = tommer.infer_entities(tokens=tokens, model=llm, threshold=0.5, decoding_strategy="greedy")
str_entities = [llm.tokenizer.decode(tokens[0, b:e + 1]) for b, e in entities[0]]
print(f"Predicted entities: {str_entities}")

>>> Input text: Large language models are awesome
>>> Raw output shape: torch.Size([1, 6, 6])
>>> Predicted entities: ['Large language models']
```

## Fancy Outputs

We also provide inference and plotting utilities in `llm2ner.plotting`.

```python
from xpm_torch.huggingface import TorchHFHub
from llm2ner import ToMMeR, utils, plotting

tommer: ToMMeR = TorchHFHub.from_pretrained("llm2ner/ToMMeR-phi-4_L3_R64")

# Load the backbone LLM, optionally cutting the unused layers to save GPU memory.
llm = utils.load_llm(tommer.llm_name, cut_to_layer=tommer.layer)
tommer.to(llm.device)

text = "Large language models are awesome. While trained on language modeling, they exhibit emergent Zero Shot abilities that make them suitable for a wide range of tasks, including Named Entity Recognition (NER)."

# Fancy interactive output
outputs = plotting.demo_inference(
    text,
    tommer,
    llm,
    decoding_strategy="threshold",  # or "greedy" for a flat segmentation
    threshold=0.5,  # default 50%
    show_attn=True,
)
```
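
## Decoding the raw scores manually

If you prefer to work directly with the raw score matrix instead of the built-in decoders, the sketch below continues from the "Raw inference" example above (it reuses `output`, `tokens`, and `llm`). It is a minimal illustration, assuming `output[0, i, j]` holds the probability that the span from token `i` to token `j` is a mention and that invalid spans are masked out; verify the exact axis ordering and masking convention against the `llm2ner` source before relying on it.

```python
# Minimal sketch: threshold the raw span-probability matrix.
# Assumption: output[0, i, j] = P(tokens i..j form a mention); check llm2ner.
probs = output[0]  # (seq_len, seq_len)
idx = (probs > 0.5).nonzero(as_tuple=False)  # span indices above threshold
spans = [(int(i), int(j)) for i, j in idx if j >= i]  # keep well-formed spans
for b, e in spans:
    mention = llm.tokenizer.decode(tokens[0, b:e + 1])
    print(f"{mention!r}  p={float(probs[b, e]):.2f}")
```

Unlike the built-in `greedy` strategy, this keeps every span above the threshold, so overlapping mentions are not resolved.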