scthornton committed on
Commit 22ed853 · verified · 1 Parent(s): 6fc3921

Upgrade model card: badges, quick start, training details, collection table, citations

Files changed (1): README.md (+131, −37)
README.md CHANGED
@@ -1,60 +1,154 @@
  ---
  library_name: peft
  license: bigcode-openrail-m
- base_model: bigcode/starcoder2-15b-instruct-v0.1
  tags:
- - base_model:adapter:bigcode/starcoder2-15b-instruct-v0.1
- - lora
- - transformers
- pipeline_tag: text-generation
  model-index:
- - name: starcoder2-15b-securecode
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # starcoder2-15b-securecode

- This model is a fine-tuned version of [bigcode/starcoder2-15b-instruct-v0.1](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) on the None dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 0.0002
- - train_batch_size: 1
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 16
- - total_train_batch_size: 16
- - optimizer: Use paged_adamw_8bit with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 100
- - num_epochs: 3

- ### Training results

- ### Framework versions

- - PEFT 0.18.1
- - Transformers 5.1.0
- - Pytorch 2.7.1+cu128
- - Datasets 2.21.0
- - Tokenizers 0.22.2
---
library_name: peft
pipeline_tag: text-generation
license: bigcode-openrail-m
language:
- code
base_model:
- bigcode/starcoder2-15b-instruct-v0.1
tags:
- securecode
- security
- owasp
- code-generation
- secure-coding
- lora
- qlora
- vulnerability-detection
- cybersecurity
datasets:
- scthornton/securecode
model-index:
- name: starcoder2-15b-securecode
  results: []
---

# StarCoder2 15B SecureCode

[![Parameters](https://img.shields.io/badge/parameters-15B-blue.svg)](#model-details) [![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)](https://huggingface.co/datasets/scthornton/securecode) [![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10-red.svg)](#security-coverage) [![Method](https://img.shields.io/badge/method-QLoRA-purple.svg)](#training-details) [![License](https://img.shields.io/badge/license-BigCode_OpenRAIL--M-orange.svg)](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement)

**The open-source flagship of the SecureCode collection: a security-aware code generation model fine-tuned on 2,185 real-world vulnerability examples covering the OWASP Top 10 2021 and the OWASP LLM Top 10 2025.**

[Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper](https://huggingface.co/papers/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai) | [Blog Post](https://huggingface.co/blog/scthornton/securecode-models)

---

## What This Model Does

StarCoder2 15B SecureCode generates security-aware code: it was fine-tuned to recognize vulnerability patterns and produce secure implementations. Every training example includes:

- **Real-world incident grounding** — tied to documented CVEs and breach reports
- **Vulnerable + secure implementations** — side-by-side comparisons
- **Attack demonstrations** — concrete exploit code
- **Defense-in-depth guidance** — SIEM rules, logging, monitoring, infrastructure hardening

---

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [bigcode/starcoder2-15b-instruct-v0.1](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) |
| **Parameters** | 15B |
| **Architecture** | StarCoder2 (decoder-only transformer) |
| **Method** | QLoRA (4-bit quantization + LoRA) |
| **LoRA Rank** | 16 |
| **LoRA Alpha** | 32 |
| **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
| **Training Time** | ~1h 40min |
| **Hardware** | 2x NVIDIA A100 40GB (GCP) |
| **Framework** | PEFT 0.18.1, Transformers 5.1.0, PyTorch 2.7.1 |
---

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Load the 4-bit quantized base model, then attach the LoRA adapter
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base_model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    device_map="auto",
    quantization_config=bnb_config,
)
model = PeftModel.from_pretrained(base_model, "scthornton/starcoder2-15b-securecode")
tokenizer = AutoTokenizer.from_pretrained("scthornton/starcoder2-15b-securecode")

# Generate secure code (do_sample=True so temperature takes effect)
prompt = "Write a secure JWT authentication handler in Python with proper token validation"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---

## Training Details

| Hyperparameter | Value |
|----------------|-------|
| Learning Rate | 2e-4 |
| Batch Size | 1 |
| Gradient Accumulation | 16 |
| Epochs | 3 |
| Scheduler | Cosine |
| Warmup Steps | 100 |
| Optimizer | paged_adamw_8bit |
| Max Sequence Length | 2048 |
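The cosine schedule with 100 warmup steps can be sketched in plain Python. The total step count here is an estimate derived from the tables, not a number stated on the card: 2,185 examples at an effective batch of 16 (batch size 1 × gradient accumulation 16) is about 137 optimizer steps per epoch, so roughly 410 over 3 epochs.

```python
import math

PEAK_LR, WARMUP, TOTAL_STEPS = 2e-4, 100, 410  # TOTAL_STEPS estimated as above

def lr_at(step: int, total_steps: int = TOTAL_STEPS,
          warmup: int = WARMUP, peak: float = PEAK_LR) -> float:
    """Linear warmup to `peak`, then cosine decay to zero: the shape of the
    warmup + cosine schedule named in the table."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The learning rate ramps from 0 to 2e-4 over the first 100 steps, then decays smoothly back to 0 by the end of training.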
### Dataset Breakdown

| Component | Examples | Coverage |
|-----------|----------|----------|
| Web Security (OWASP Top 10:2021) | 1,378 | 12 languages, 9 frameworks |
| AI/ML Security (OWASP LLM Top 10:2025) | 750 | Prompt injection, RAG poisoning, model theft |
| Framework-Specific Additions | 219 | Django, Flask, Express, Spring Boot, etc. |
| **Total** | **2,185** | **Complete OWASP coverage** |

---

## SecureCode Model Collection

| Model | Parameters | Base | Training Time | Link |
|-------|------------|------|---------------|------|
| Llama 3.2 3B | 3B | Meta Llama 3.2 | 1h 5min | [scthornton/llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
| Qwen Coder 7B | 7B | Qwen 2.5 Coder | 1h 24min | [scthornton/qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode) |
| CodeGemma 7B | 7B | Google CodeGemma | 1h 27min | [scthornton/codegemma-7b-securecode](https://huggingface.co/scthornton/codegemma-7b-securecode) |
| DeepSeek Coder 6.7B | 6.7B | DeepSeek Coder | 1h 15min | [scthornton/deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
| CodeLlama 13B | 13B | Meta CodeLlama | 1h 32min | [scthornton/codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode) |
| Qwen Coder 14B | 14B | Qwen 2.5 Coder | 1h 19min | [scthornton/qwen2.5-coder-14b-securecode](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
| **StarCoder2 15B** | **15B** | **BigCode StarCoder2** | **1h 40min** | **This model** |
| Granite 20B | 20B | IBM Granite Code | 1h 19min | [scthornton/granite-20b-code-securecode](https://huggingface.co/scthornton/granite-20b-code-securecode) |

---

## Citation

```bibtex
@misc{thornton2025securecode,
  title={SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models},
  author={Thornton, Scott},
  year={2025},
  publisher={perfecXion.ai},
  url={https://perfecxion.ai/articles/securecode-v2-dataset-paper.html},
  note={Model: https://huggingface.co/scthornton/starcoder2-15b-securecode}
}
```

---

## Links

- **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples)
- **Paper**: [SecureCode v2.0](https://huggingface.co/papers/2512.18542)
- **Model Collection**: [SecureCode Models](https://huggingface.co/collections/scthornton/securecode) (8 models)
- **Blog Post**: [Training Security-Aware Code Models](https://huggingface.co/blog/scthornton/securecode-models)
- **Publisher**: [perfecXion.ai](https://perfecxion.ai)

---

## License

BigCode OpenRAIL-M