| --- |
| language: en |
| library_name: transformers |
| pipeline_tag: token-classification |
| tags: |
| - ner |
| - token-classification |
| - cybersecurity |
| - threat-intelligence |
| - secureBert |
| license: mit |
| metrics: |
| - accuracy |
| base_model: |
| - answerdotai/ModernBERT-large |
| --- |
| |
| # Model Overview |
|
|
| **SecureModernBERT-NER** represents a new generation of cybersecurity-focused language models — combining the **state-of-the-art architecture of ModernBERT** with one of the **largest and most diverse CTI-labelled NER corpora ever built**. |
|
|
| Unlike conventional NER systems, SecureModernBERT-NER recognises **22 fine-grained, security-specific entity types**, covering the full spectrum of cyber-threat intelligence — from `THREAT-ACTOR` and `MALWARE` to `CVE`, `IPV4`, `DOMAIN`, and `REGISTRY-KEYS`. |
|
|
| Trained on more than **half a million manually curated spans** sourced from real-world threat reports, vulnerability advisories, and incident analyses, it achieves an exceptional balance of **accuracy, generalisation, and contextual depth**. |
|
|
| This model is designed to **parse complex security narratives with human-level precision**, extracting both contextual metadata (e.g., `ORG`, `PRODUCT`, `PLATFORM`) and highly technical indicators (e.g., `MD5`/`SHA256` hashes, `URL`, and `IPV4`/`IPV6` network addresses) — all within a single unified framework. |
|
|
| SecureModernBERT-NER sets a new standard for **automated CTI entity recognition**, enabling the next wave of **threat-intelligence automation, enrichment, and analytics**. |
|
|
| ## Quick Start |
|
|
| ```python |
| from transformers import pipeline |
| |
| model_id = "attack-vector/SecureModernBERT-NER" |
| |
| pipe = pipeline( |
| task="token-classification", |
| model=model_id, |
| tokenizer=model_id, |
| aggregation_strategy="first", |
| ) |
| |
| text = "TrickBot connects to hxxp://185.222.202.55 to exfiltrate data from Windows hosts." |
| predictions = pipe(text) |
| for pred in predictions: |
| print(pred) |
| ``` |
|
|
| Sample output: |
|
|
| ``` |
| {'entity_group': 'MALWARE', 'score': np.float32(0.9615546), 'word': 'TrickBot', 'start': 0, 'end': 8} |
| {'entity_group': 'URL', 'score': np.float32(0.9905957), 'word': ' hxxp://185.222.202.55', 'start': 20, 'end': 42} |
| {'entity_group': 'PLATFORM', 'score': np.float32(0.92317337), 'word': ' Windows', 'start': 66, 'end': 74} |
| ``` |
|
|
| ## Intended Use & Limitations |
|
|
| - **Use cases:** automated tagging of CTI reports, IOC extraction pipelines, knowledge-base enrichment, security-focused RAG systems. |
| - **Languages:** English (model was trained and evaluated on English sources only). |
| - **Input format:** free-form prose or long-form CTI articles; the maximum sequence length during training was 128 tokens, so longer documents should be split or windowed before inference. |
| - **Limitations:** noisy or ambiguous extractions may occur, especially with rare entity types (`IPV6`, `EMAIL`) and obfuscated strings. The model does not normalise entities (e.g., deobfuscating `hxxp`) nor validate indicator authenticity. Always pair with downstream validation and human review. |
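Because the model emits obfuscated indicators verbatim rather than normalising them, callers typically "refang" extracted strings downstream. The two substitutions below are an illustrative sketch, not an exhaustive refanging routine:

```python
import re

def refang(indicator: str) -> str:
    """Undo two common defanging conventions (hxxp scheme, bracketed dot)."""
    out = re.sub(r"hxxp(s?)://", r"http\1://", indicator, flags=re.IGNORECASE)
    out = out.replace("[.]", ".")
    return out

print(refang("hxxp://185.222.202.55"))   # http://185.222.202.55
print(refang("malicious-domain[.]com"))  # malicious-domain.com
```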
|
|
| ## Training Data |
|
|
| - **Size:** 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format. |
| - **Label distribution (spans):** `ORG` (approx. 198k), `PRODUCT` (approx. 79k), `MALWARE` (approx. 67k), `PLATFORM` (approx. 57k), `THREAT-ACTOR` (approx. 49k), `SERVICE` (approx. 46k), `CVE` (approx. 41k), `LOC` (approx. 38k), `SECTOR` (approx. 34k), `TOOL` (approx. 29k), plus indicator types such as `URL`, `IPV4`, `SHA256`, `MD5`, and `REGISTRY-KEYS`. |
| - **Pre-processing:** JSONL articles were tokenised and converted to BIO tags; spans in conflict were resolved manually and via automated heuristics before upload. |
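The exact preprocessing pipeline is not published, but the span-to-BIO conversion it describes can be sketched with naive whitespace tokenisation (a simplification; the real pipeline uses the model tokenizer):

```python
def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans to per-token BIO tags."""
    tokens, tags = [], []
    offset = 0
    for token in text.split():
        start = text.index(token, offset)
        end = start + len(token)
        offset = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # First token of the span gets B-, continuation tokens get I-
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(token)
        tags.append(tag)
    return list(zip(tokens, tags))

print(spans_to_bio("TrickBot infected Windows hosts",
                   [(0, 8, "MALWARE"), (18, 25, "PLATFORM")]))
# [('TrickBot', 'B-MALWARE'), ('infected', 'O'), ('Windows', 'B-PLATFORM'), ('hosts', 'O')]
```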
|
|
| ## Label Mapping |
|
|
| | Label | Description | Example mention | |
| |-------|-------------|-----------------| |
| | URL | Web address or obfuscated link used in campaigns. | `hxxp://185.222.202.55` | |
| | ORG | Organisations such as companies, CERTs, or research groups. | `Microsoft Threat Intelligence` | |
| | SERVICE | Online or cloud services referenced in attacks. | `Google Ads` | |
| | SECTOR | Industry sectors or verticals targeted. | `critical infrastructure` | |
| | FILEPATH | File system paths observed in malware samples. | `C:\Windows\System32\svchost.exe` | |
| | DOMAIN | Fully qualified domains or subdomains. | `malicious-domain[.]com` | |
| | PLATFORM | Operating systems or computing platforms. | `Windows Server` | |
| | THREAT-ACTOR | Named adversary groups or aliases. | `LockBit` | |
| | PRODUCT | Commercial or open-source software products. | `VMware ESXi` | |
| | MALWARE | Malware families, strains, or toolkits. | `TrickBot` | |
| | LOC | Countries, cities, or regions. | `United States` | |
| | CVE | CVE identifiers for vulnerabilities. | `CVE-2023-23397` | |
| | TOOL | Legitimate or dual-use tools leveraged in incidents. | `Cobalt Strike` | |
| | IPV4 | IPv4 addresses. | `185.222.202.55` | |
| | MITRE-TACTIC | MITRE ATT&CK tactic categories. | `Credential Access` | |
| | MD5 | MD5 cryptographic hashes. | `d41d8cd98f00b204e9800998ecf8427e` | |
| | CAMPAIGN | Named operations or campaigns. | `Operation Cronos` | |
| | SHA1 | SHA-1 hashes. | `da39a3ee5e6b4b0d3255bfef95601890afd80709` | |
| | SHA256 | SHA-256 hashes. | `9e107d9d372bb6826bd81d3542a419d6...` | |
| | EMAIL | Email addresses. | `alerts@example.com` | |
| | IPV6 | IPv6 addresses. | `2001:0db8:85a3:0000:0000:8a2e:0370:7334` | |
| | REGISTRY-KEYS | Windows registry keys or paths. | `HKLM\Software\Microsoft\Windows\CurrentVersion\Run` | |
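Several of the labels above have rigid surface formats, which makes lightweight post-extraction sanity checks possible. A hedged sketch (patterns are simplified — e.g. the IPv4 regex does not range-check octets — and are not part of the model):

```python
import re

PATTERNS = {
    "CVE": re.compile(r"CVE-\d{4}-\d{4,}"),
    "IPV4": re.compile(r"(?:\d{1,3}\.){3}\d{1,3}"),
    "MD5": re.compile(r"[0-9a-fA-F]{32}"),
    "SHA1": re.compile(r"[0-9a-fA-F]{40}"),
    "SHA256": re.compile(r"[0-9a-fA-F]{64}"),
}

def matches_label(label, value):
    """True/False for labels with a known format; None for free-form labels."""
    pattern = PATTERNS.get(label)
    return bool(pattern.fullmatch(value)) if pattern else None

print(matches_label("CVE", "CVE-2023-23397"))                    # True
print(matches_label("MD5", "d41d8cd98f00b204e9800998ecf8427e"))  # True
print(matches_label("ORG", "Microsoft"))                         # None (free-form)
```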
|
|
| ## Training Procedure |
|
|
| - **Base model:** [`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large). |
| - **Hardware:** single NVIDIA L40S instance (8 vCPU / 62 GB RAM / 48 GB VRAM). |
| - **Optimisation setup:** mixed precision `fp16`, optimiser `adamw_torch`, cosine learning-rate scheduler, gradient accumulation `1`. |
| - **Key hyperparameters:** learning rate `5e-5`, batch size `128`, epochs `5`, maximum sequence length `128`. |
|
|
| | Parameter | Value | |
| |-----------|-------| |
| | Mixed precision | `fp16` | |
| | Batch size | `128` | |
| | Learning rate | `5e-5` | |
| | Optimiser | `adamw_torch` | |
| | Scheduler | `cosine` | |
| | Epochs | `5` | |
| | Gradient accumulation | `1` | |
| | Max sequence length | `128` | |
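The `cosine` scheduler decays the learning rate from its peak (`5e-5`) towards zero over training. A sketch of the decay curve (warmup is omitted, as the warmup configuration is not reported; the step count is hypothetical):

```python
import math

def cosine_lr(step, total_steps, peak_lr=5e-5):
    """Cosine decay from peak_lr at step 0 to ~0 at total_steps."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical total step count, for illustration only
print(cosine_lr(0, total))           # peak: 5e-05
print(cosine_lr(total // 2, total))  # midpoint: half of peak
print(cosine_lr(total, total))       # end: ~0
```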
|
|
| ## Evaluation |
|
|
| AutoTrain reports the following micro-averaged metrics on its validation split (seqeval entity scoring): |
|
|
| | Metric | Score | |
| |------------|--------| |
| | Precision | 0.8468 | |
| | Recall | 0.8484 | |
| | F1 | 0.8476 | |
| | Accuracy | 0.9589 | |
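As a quick consistency check, the reported F1 is the harmonic mean of the micro precision and recall:

```python
# Reproduce the reported micro F1 from the precision/recall in the table.
precision, recall = 0.8468, 0.8484
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8476
```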
|
|
| An independent re-evaluation against a consolidated CTI set (same taxonomy as this model) produced the label-level accuracy breakdown below. These scores are macro-averaged across labels and therefore are not numerically comparable to the micro metrics above, but they provide insight into class balance and span quality. |
|
|
| | Label | Used | Accuracy | |
| |-------|------|----------| |
| | CAMPAIGN | 1,817 | 0.7980 | |
| | CVE | 28,293 | 0.9995 | |
| | DOMAIN | 12,182 | 0.8878 | |
| | EMAIL | 731 | 0.8495 | |
| | FILEPATH | 13,889 | 0.7957 | |
| | IPV4 | 1,164 | 0.9631 | |
| | IPV6 | 563 | 0.7425 | |
| | LOC | 7,915 | 0.9557 | |
| | MALWARE | 10,405 | 0.9087 | |
| | MD5 | 389 | 0.9100 | |
| | MITRE-TACTIC | 2,181 | 0.7093 | |
| | ORG | 36,324 | 0.9301 | |
| | PLATFORM | 8,036 | 0.8977 | |
| | PRODUCT | 18,720 | 0.8432 | |
| | REGISTRY-KEYS | 1,589 | 0.8490 | |
| | SECTOR | 6,453 | 0.8309 | |
| | SERVICE | 8,533 | 0.8179 | |
| | SHA1 | 222 | 0.9189 | |
| | SHA256 | 2,146 | 0.9874 | |
| | THREAT-ACTOR | 9,532 | 0.9418 | |
| | TOOL | 4,874 | 0.7895 | |
| | URL | 7,470 | 0.9801 | |
|
|
| - **Macro accuracy:** 0.8776 |
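The macro accuracy is the unweighted mean of the 22 per-label accuracies above, so small classes like `SHA1` count as much as large ones like `ORG`:

```python
# Per-label accuracies copied from the table above, in table order.
per_label = [
    0.7980, 0.9995, 0.8878, 0.8495, 0.7957, 0.9631, 0.7425, 0.9557,
    0.9087, 0.9100, 0.7093, 0.9301, 0.8977, 0.8432, 0.8490, 0.8309,
    0.8179, 0.9189, 0.9874, 0.9418, 0.7895, 0.9801,
]
macro = sum(per_label) / len(per_label)
print(round(macro, 4))  # 0.8776
```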
|
|
| Because micro vs macro averaging and dataset composition differ, expect numerical gaps between the two evaluations even though both describe the same checkpoint. |
|
|
| The AutoTrain metrics in the first table were computed with the `seqeval` micro-average at the entity level; the per-label breakdown above is averaged across labels without span-count weighting. |
|
|
| ## External Benchmarks |
|
|
| The following tables report detailed results on a shared CTI validation set. **Do not compare the per-label values across models directly:** each checkpoint uses a different taxonomy or remapping strategy, so accuracy percentages can be misleading when labels are aligned or collapsed differently. Use the per-model tables to understand performance within a single schema, and interpret macro-accuracy scores with caution. |
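One reason the comparison is fragile: aligning taxonomies requires collapsing fine-grained labels into coarse ones, and the mapping choice changes the scores. A hypothetical remapping sketch (the coarse names on the right mirror the benchmark tables below, but this particular mapping is illustrative, not the one used in the evaluation):

```python
# Illustrative fine-to-coarse label mapping; labels without a coarse
# equivalent fall back to "O" and drop out of entity scoring entirely.
REMAP = {
    "URL": "Indicator", "IPV4": "Indicator", "IPV6": "Indicator",
    "MD5": "Indicator", "SHA1": "Indicator", "SHA256": "Indicator",
    "DOMAIN": "Indicator", "EMAIL": "Indicator",
    "MALWARE": "Malware",
    "THREAT-ACTOR": "Threat Group",
    "ORG": "Organization",
    "PLATFORM": "System", "PRODUCT": "System",
    "CVE": "Vulnerability",
}

def coarsen(label):
    return REMAP.get(label, "O")

print(coarsen("SHA256"))        # Indicator
print(coarsen("MITRE-TACTIC"))  # O (no coarse equivalent)
```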
|
|
|
|
| ### CyberPeace-Institute/SecureBERT-NER |
|
|
| | Label | Used | Accuracy | |
| |-------|------|----------| |
| | ACT | 3,945 | 0.1706 | |
| | APT | 9,518 | 0.5331 | |
| | DOM | 10,694 | 0.0196 | |
| | EMAIL | 731 | 0.0000 | |
| | FILE | 31,864 | 0.0747 | |
| | IP | 1,251 | 0.0088 | |
| | LOC | 7,895 | 0.8711 | |
| | MAL | 10,341 | 0.6076 | |
| | MD5 | 354 | 0.8672 | |
| | O | 16,275 | 0.4700 | |
| | OS | 7,974 | 0.6598 | |
| | SECTEAM | 36,083 | 0.3509 | |
| | SHA1 | 191 | 0.0209 | |
| | SHA2 | 1,647 | 0.9709 | |
| | TOOL | 4,816 | 0.4043 | |
| | URL | 6,997 | 0.0795 | |
| | VULID | 27,586 | 0.3849 | |
|
|
| - **Macro accuracy:** 0.3820 |
|
|
| ### PranavaKailash/CyNER-2.0-DeBERTa-v3-base |
|
|
| | Label | Used | Accuracy | |
| |-------|------|----------| |
| | Indicator | 35,936 | 0.7878 | |
| | Location | 7,895 | 0.0113 | |
| | Malware | 12,125 | 0.7800 | |
| | O | 2,896 | 0.7652 | |
| | Organization | 42,537 | 0.6556 | |
| | System | 35,063 | 0.7259 | |
| | TOOL | 4,820 | 0.0000 | |
| | Threat Group | 9,522 | 0.0000 | |
| | Vulnerability | 27,673 | 0.1876 | |
|
|
| - **Macro accuracy:** 0.4348 |
|
|
| ### cisco-ai/SecureBERT2.0-NER |
|
|
| | Label | Used | Accuracy | |
| |-------|------|----------| |
| | Indicator | 35,789 | 0.8854 | |
| | Malware | 16,926 | 0.6204 | |
| | O | 10,786 | 0.6813 | |
| | Organization | 51,993 | 0.5579 | |
| | System | 34,955 | 0.6600 | |
| | Vulnerability | 27,525 | 0.2552 | |
|
|
| - **Macro accuracy:** 0.6100 |
|
|
|
|
| ## Responsible Use |
|
|
| - Confirm entity detections before acting on indicators (e.g., automated blocking). |
| - Combine with enrichment and scoring systems to filter false positives. |
| - Monitor for drift if applying to new domains (e.g., non-English sources, informal channels). |
| - Respect licensing and confidentiality of any proprietary CTI sources used for inference. |
|
|
|
|
| ## Support & Connect |
|
|
| * ❤️ **Like the repo** if you found it useful |
| * ☕ **Support me:** Say thanks by buying me a coffee! [https://buymeacoffee.com/juanmcristobal](https://buymeacoffee.com/juanmcristobal) |
| * 💼 **Open to work:** [https://www.linkedin.com/in/jmcristobal/](https://www.linkedin.com/in/jmcristobal/) |
|
|
| If you use SecureModernBERT-NER in a project, feel free to share it in the Discussions/Issues — I love seeing real-world use cases. |
|
|
| ## Citation |
|
|
| If you find this model useful, please cite the repository and the base model: |
|
|
| ``` |
| @software{securemodernbert_ner_2025, |
| author = {Juan Manuel Cristóbal Moreno}, |
| title = {SecureModernBERT-NER: Cyber Threat Intelligence Named Entity Recogniser}, |
| year = {2025}, |
| publisher = {Hugging Face}, |
| url = {https://huggingface.co/attack-vector/SecureModernBERT-NER} |
| } |
| ``` |
|
|
| ## Contact |
|
|
| Questions or feedback? Open an issue on the Hugging Face model repository or reach out at [`@juanmcristobal`](https://huggingface.co/juanmcristobal). |