scthornton committed on
Commit 22ed853 · verified · 1 Parent(s): 6fc3921

Upgrade model card: badges, quick start, training details, collection table, citations

Files changed (1): README.md (+131, −37)
README.md CHANGED
@@ -1,60 +1,154 @@
  ---
  library_name: peft
  license: bigcode-openrail-m
- base_model: bigcode/starcoder2-15b-instruct-v0.1
  tags:
- - base_model:adapter:bigcode/starcoder2-15b-instruct-v0.1
- - lora
- - transformers
- pipeline_tag: text-generation
  model-index:
- - name: starcoder2-15b-securecode
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # starcoder2-15b-securecode

- This model is a fine-tuned version of [bigcode/starcoder2-15b-instruct-v0.1](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) on the None dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 0.0002
- - train_batch_size: 1
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 16
- - total_train_batch_size: 16
- - optimizer: Use paged_adamw_8bit with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 100
- - num_epochs: 3

- ### Training results

- ### Framework versions

- - PEFT 0.18.1
- - Transformers 5.1.0
- - Pytorch 2.7.1+cu128
- - Datasets 2.21.0
- - Tokenizers 0.22.2
---
library_name: peft
pipeline_tag: text-generation
license: bigcode-openrail-m
language:
- code
base_model:
- bigcode/starcoder2-15b-instruct-v0.1
tags:
- securecode
- security
- owasp
- code-generation
- secure-coding
- lora
- qlora
- vulnerability-detection
- cybersecurity
datasets:
- scthornton/securecode
model-index:
- name: starcoder2-15b-securecode
  results: []
---

# StarCoder2 15B SecureCode

[![Parameters](https://img.shields.io/badge/parameters-15B-blue.svg)](#model-details) [![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)](https://huggingface.co/datasets/scthornton/securecode) [![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10-red.svg)](#security-coverage) [![Method](https://img.shields.io/badge/method-QLoRA-purple.svg)](#training-details) [![License](https://img.shields.io/badge/license-BigCode_OpenRAIL--M-orange.svg)](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement)

**The open-source flagship of the SecureCode collection: a security-aware code generation model fine-tuned on 2,185 real-world vulnerability examples covering the OWASP Top 10 2021 and the OWASP LLM Top 10 2025.**

[Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper](https://huggingface.co/papers/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai) | [Blog Post](https://huggingface.co/blog/scthornton/securecode-models)

---

## What This Model Does

StarCoder2 15B SecureCode generates security-aware code: it was fine-tuned to recognize vulnerability patterns and produce secure implementations. Every training example includes:

- **Real-world incident grounding** — tied to documented CVEs and breach reports
- **Vulnerable + secure implementations** — side-by-side comparisons
- **Attack demonstrations** — concrete exploit code
- **Defense-in-depth guidance** — SIEM rules, logging, monitoring, infrastructure hardening

---

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [bigcode/starcoder2-15b-instruct-v0.1](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) |
| **Parameters** | 15B |
| **Architecture** | StarCoder2 (decoder-only transformer) |
| **Method** | QLoRA (4-bit quantization + LoRA) |
| **LoRA Rank** | 16 |
| **LoRA Alpha** | 32 |
| **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
| **Training Time** | ~1h 40min |
| **Hardware** | 2x NVIDIA A100 40GB (GCP) |
| **Framework** | PEFT 0.18.1, Transformers 5.1.0, PyTorch 2.7.1 |
---

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Load the 4-bit quantized base model, then attach the LoRA adapter
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base_model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    device_map="auto",
    quantization_config=bnb_config,
)
model = PeftModel.from_pretrained(base_model, "scthornton/starcoder2-15b-securecode")
tokenizer = AutoTokenizer.from_pretrained("scthornton/starcoder2-15b-securecode")

# Generate secure code (do_sample=True so temperature takes effect)
prompt = "Write a secure JWT authentication handler in Python with proper token validation"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---

## Training Details

| Hyperparameter | Value |
|----------------|-------|
| Learning Rate | 2e-4 |
| Batch Size | 1 |
| Gradient Accumulation | 16 |
| Epochs | 3 |
| Scheduler | Cosine |
| Warmup Steps | 100 |
| Optimizer | paged_adamw_8bit |
| Max Sequence Length | 2048 |
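The cosine schedule with 100 warmup steps can be sketched in plain Python. The total step count here is an estimate derived from the tables, not a number stated on the card: 2,185 examples at an effective batch of 16 (batch size 1 × gradient accumulation 16) is about 137 optimizer steps per epoch, so roughly 410 over 3 epochs.

```python
import math

PEAK_LR, WARMUP, TOTAL_STEPS = 2e-4, 100, 410  # TOTAL_STEPS estimated as above

def lr_at(step: int, total_steps: int = TOTAL_STEPS,
          warmup: int = WARMUP, peak: float = PEAK_LR) -> float:
    """Linear warmup to `peak`, then cosine decay to zero: the shape of the
    warmup + cosine schedule named in the table."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The learning rate ramps from 0 to 2e-4 over the first 100 steps, then decays smoothly back to 0 by the end of training.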
### Dataset Breakdown

| Component | Examples | Coverage |
|-----------|----------|----------|
| Web Security (OWASP Top 10:2021) | 1,378 | 12 languages, 9 frameworks |
| AI/ML Security (OWASP LLM Top 10:2025) | 750 | Prompt injection, RAG poisoning, model theft |
| Framework-Specific Additions | 219 | Django, Flask, Express, Spring Boot, etc. |
| **Total** | **2,185** | **Complete OWASP coverage** |

---

## SecureCode Model Collection

| Model | Parameters | Base | Training Time | Link |
|-------|------------|------|---------------|------|
| Llama 3.2 3B | 3B | Meta Llama 3.2 | 1h 5min | [scthornton/llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
| Qwen Coder 7B | 7B | Qwen 2.5 Coder | 1h 24min | [scthornton/qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode) |
| CodeGemma 7B | 7B | Google CodeGemma | 1h 27min | [scthornton/codegemma-7b-securecode](https://huggingface.co/scthornton/codegemma-7b-securecode) |
| DeepSeek Coder 6.7B | 6.7B | DeepSeek Coder | 1h 15min | [scthornton/deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
| CodeLlama 13B | 13B | Meta CodeLlama | 1h 32min | [scthornton/codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode) |
| Qwen Coder 14B | 14B | Qwen 2.5 Coder | 1h 19min | [scthornton/qwen2.5-coder-14b-securecode](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
| **StarCoder2 15B** | **15B** | **BigCode StarCoder2** | **1h 40min** | **This model** |
| Granite 20B | 20B | IBM Granite Code | 1h 19min | [scthornton/granite-20b-code-securecode](https://huggingface.co/scthornton/granite-20b-code-securecode) |

---

## Citation

```bibtex
@misc{thornton2025securecode,
  title={SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models},
  author={Thornton, Scott},
  year={2025},
  publisher={perfecXion.ai},
  url={https://perfecxion.ai/articles/securecode-v2-dataset-paper.html},
  note={Model: https://huggingface.co/scthornton/starcoder2-15b-securecode}
}
```

---

## Links

- **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples)
- **Paper**: [SecureCode v2.0](https://huggingface.co/papers/2512.18542)
- **Model Collection**: [SecureCode Models](https://huggingface.co/collections/scthornton/securecode) (8 models)
- **Blog Post**: [Training Security-Aware Code Models](https://huggingface.co/blog/scthornton/securecode-models)
- **Publisher**: [perfecXion.ai](https://perfecxion.ai)

---

## License

BigCode OpenRAIL-M