Add metadata and improve model card for industrial code intelligence

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +35 -138
README.md CHANGED
@@ -1,17 +1,33 @@
  # InCoder-32B: Code Foundation Model for Industrial Scenarios

  <div align="center">

  [![HuggingFace](https://img.shields.io/badge/🤗-Model%20Hub-yellow)](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder)
  [![GitHub](https://img.shields.io/badge/GitHub-Industrial--Coder-blue)](https://github.com/CSJianYang/Industrial-Coder)
- [![arXiv](https://img.shields.io/badge/arXiv-2603.16790-red)](https://arxiv.org/abs/2603.16790)
  [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)

  </div>

  ## Model Summary

- **InCoder-32B** (Industrial-Coder-32B) is the first 32B-parameter code foundation model purpose-built for industrial code intelligence. While general-purpose code LLMs excel at mainstream software tasks, they struggle with the unique demands of industrial programming — hardware semantics, specialized language constructs, strict resource constraints, and domain-specific correctness verification. InCoder-32B unifies code intelligence across five industrial domains:

  | Domain | Languages & Frameworks |
  |---|---|
@@ -21,7 +37,7 @@
  | 🔨 **Compiler Optimization** | x86-64 ASM, C/C++, LLVM-IR |
  | 📐 **3D Modeling / CAD** | CadQuery, OpenCascade, Python |

- InCoder-32B achieves competitive general-purpose performance while establishing the strongest open-source baselines across all evaluated industrial domains.

  ---
 
@@ -37,26 +53,17 @@ InCoder-32B achieves competitive general-purpose performance while establishing
  | HumanEval+ | **89.6%** |
  | MBPP+ | **78.3%** |
  | BigCodeBench (Full) | **49.8%** |
- | τ²-bench (Retail) | **85.1%** |
- | τ²-bench (Telecom) | **86.8%** |

  ### Industrial Code Benchmarks

  | Benchmark | Domain | InCoder-32B | Best Competing Open-Weight |
  |---|---|---|---|
  | VeriScope Score | Chip Design | **80.7** | 83.2 (GLM-5) |
- | VeriRepair Fix | Chip Design | **80.0%** | 90.0% (GLM-5) |
- | RealBench Syn@1 (Module) | Chip Design | **74.8%** | 50.1% (Kimi-K2-Instruct) |
- | ArchXBench (n) | Chip Design | **51.0** | 50.0 (Claude-Sonnet-4.6) |
  | CAD-Coder Compile | 3D Modeling | **82.0%** | 48.0% (Kimi-K2-Thinking) |
- | CAD-Coder IoU | 3D Modeling | **53.5** | 20.0 (Kimi-K2-Thinking) |
- | EmbedCGen Main | Code Optimization | **35.2%** | 90.2% (GLM-5) |
  | KernelBench L1 | GPU Optimization | **22.2%** | 16.2% (GLM-5) |
  | KernelBench L2 | GPU Optimization | **36.0%** | 28.0% |
- | KernelBench L3 | GPU Optimization | **14.0%** | 8.0% (MiniMax-M2.5) |
- | TritonBench G-call | GPU Optimization | **18.5%** | 28.8% (Claude-Sonnet-4.6) |

- > InCoder-32B leads all open-weight baselines on CAD-Coder and KernelBench (all three levels), and even surpasses the proprietary Claude-Sonnet-4.6 on CAD-Coder IoU and KernelBench L1/L2/L3.

  ---
62
 
@@ -69,98 +76,28 @@ InCoder-32B adopts a standard decoder-only Transformer architecture with the fol
  | Parameters | ~32B |
  | Layers | 64 |
  | Hidden Size | 5,120 |
- | Intermediate Size | 27,648 |
- | Attention Heads | 40 |
- | KV Heads (GQA) | 8 |
- | Head Dimension | 128 |
- | Vocabulary Size | 76,800 |
  | Max Context Length | 131,072 (128K) |
- | Activation | SiLU |
  | Positional Encoding | RoPE (θ = 500,000) |
  | Precision | BFloat16 |
- | Tie Embeddings | No |

  ---
 
- ## Training Pipeline

  InCoder-32B is trained through a three-stage **Code-Flow** pipeline:

  ### Stage 1 — Pre-training & Annealing
-
- Industrial code corpora (Verilog, CUDA, firmware C, CadQuery scripts) are severely underrepresented in existing datasets like The Stack v2. We construct a dedicated data pipeline using:
-
- - **Three-step domain recall**: rule-based filtering (file extensions, keywords like `endmodule`, `__global__`), FastText classifier, and semantic encoder retrieval
- - **OCR extraction** from technical books and hardware reference manuals
- - **Multi-level deduplication**: exact hash, MinHash LSH, repository-level fork consolidation, cross-source dedup
- - **Domain-specific validation**: AST comparison, re-compilation, synthesis checks
- - **Data refinement**: normalized formatting, cross-file dependency resolution, code-text alignment annotations
-
- Training details:
- - **Hardware**: 4,096 GPUs
- - **Objectives**: Autoregressive LM + Fill-in-the-Middle (FIM)
- - **Learning rate**: 3 × 10⁻⁴ (constant)
- - **Batch size**: 2,048 (global)
- - **Total tokens**: 15T
- - **Curriculum**: function-level → file-level → multi-file/project-level

  ### Stage 2 — Mid-Training (Context Extension)
-
- The context window is extended progressively from 8K to 128K tokens across two sub-stages, combined with domain-specific data synthesis:
-
- **Stage 2.1 — 8K → 32K:**
- - Targets file-level tasks: completing RTL modules, infilling kernel functions, generating testbenches
- - Data mix: reasoning QA (40%), agent trajectories (20%), commits (15%), industrial artifacts (15%), FIM (10%)
-
- **Stage 2.2 — 32K → 128K:**
- - Unlocks long-context capabilities: extended debugging sessions, cross-module projects
- - Graduated warm-up: long sequences start at 10% of the batch and linearly increase to 50%
- - Data mix shifts toward long-context: agent trajectories (30%), FIM (25%), reasoning QA (25%)
-
- **Synthetic Industrial QA Pipeline:**
- 1. *Scenario specification* — identified with practicing hardware engineers
- 2. *Seed code generation* — realistic RTL patterns, CUDA memory access idioms, interrupt-driven firmware
- 3. *QA synthesis with automated verification* — code execution validation, static analysis, logical consistency checks

  ### Stage 3 — Post-Training
-
- 2.5M supervised fine-tuning samples are constructed directly from real industrial coding tasks with execution-grounded verification across four environments:
-
- | Environment | Toolchain |
- |---|---|
- | Chip Design | Icarus Verilog, Verilator, Yosys |
- | GPU Optimization | NVIDIA A100, nvcc, Triton compiler |
- | 3D Modeling | CadQuery, OpenCascade |
- | Embedded Systems | arm-none-eabi-gcc, Renode simulator (STM32F407) |
-
- SFT samples are organized into three categories:
- - **Direct solution** — requirement-to-implementation
- - **Defect repair** — failure-feedback-fix loop with closed-loop repair trajectories
- - **Performance & structural optimization** — improving correct code for efficiency, readability, or architecture
-
- ---
-
- ## Benchmarks
-
- ### Industrial Benchmarks (New)
-
- This release introduces several new industrial benchmarks:
-
- - **VeriScope** — 568 Verilog generation problems across 5 difficulty levels (combinational logic → dual-core out-of-order RISC-V SoC with cache coherence). Graded via RTL simulation.
- - **VeriRepair** — ~22K training / 300 test Verilog bug-fix samples with 4 major error categories and 20 error types.
- - **EmbedCGen** — 500 bare-metal embedded C generation problems for STM32F407 (ARM Cortex-M4), evaluated via cross-compilation + Renode simulation.
-
- ### General Benchmarks Evaluated
-
- | Category | Benchmarks |
- |---|---|
- | Code Generation | EvalPlus (HumanEval, MBPP), BigCodeBench, FullStackBench |
- | Code Reasoning | CRUXEval, LiveCodeBench |
- | Code Efficiency | Mercury |
- | Text-to-SQL | Spider, BIRD |
- | Agentic Coding | Terminal-Bench v1/v2, SWE-bench Verified |
- | Tool Use | Mind2Web, BFCL v3, τ²-bench |
- | Industrial | VeriScope, VeriRepair, RealBench, ArchXBench, CAD-Coder, EmbedCGen, SuperCoder, TritonBench, KernelBench |

  ---
 
@@ -220,42 +157,16 @@ outputs = model.generate(**inputs, max_new_tokens=256)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

- ### Chat / Instruction Format
-
- ```python
- messages = [
-     {
-         "role": "user",
-         "content": "Optimize this CUDA matrix multiplication kernel for an NVIDIA A100 using shared memory tiling with TILE_SIZE=32."
-     }
- ]
-
- text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(text, return_tensors="pt").to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.1, do_sample=True)
- print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
- ```

  ---
 
- ## Limitations & Known Failure Modes
-
- Based on analysis of 1,882 failure cases across 9 industrial benchmarks:
-
- - **Compilation & syntax errors**: Dominant in Verilog tasks — 71% of RealBench failures involve malformed literals, incorrect port declarations, or bit-width mismatches.
- - **Incomplete API knowledge**: 47% of EmbedCGen failures are linker errors from undefined or incorrectly typed HAL/CMSIS functions; 33% of TritonBench failures are NameErrors from incorrect Triton API usage.
- - **Format compliance**: 46% of VeriScope failures are unparseable structured outputs where the required format is ignored entirely.
- - **Functional correctness under precise semantics**: 79% of VeriRepair failures produce compilable but functionally incorrect code; most CAD-Coder failures stem from systematic Euler angle convention misinterpretation.
- - **Optimization gap**: 33% of KernelBench failures produce functionally correct but insufficiently fast GPU kernels; 83% of SuperCoder failures result in the model copying input assembly without modification.
-
- ---

- ## Ablation Findings

- - **Repository transition data** outperforms static snapshots for planning tasks
- - **Mid-training reasoning trajectories** improve robustness under distribution shift
- - **Thinking paths** unlock emergent capabilities absent in standard instruction tuning
- - **Scaling industrial SFT data** is a reliable performance driver across all 9 industrial benchmarks (83M → 167M → 250M tokens show consistent improvement)

  ---
 
@@ -270,18 +181,4 @@ Based on analysis of 1,882 failure cases across 9 industrial benchmarks:
  journal={arXiv preprint arXiv:2603.16790},
  year={2026}
  }
- ```
-
- ---
-
- ## Model Card Authors
-
- Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui, Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv
-
- Affiliations: Beihang University, IQuest Research, Shanghai Jiao Tong University, ELLIS, University of Manchester, Shanghai Artificial Intelligence Laboratory, Sichuan University, Gaoling School of Artificial Intelligence (Renmin University of China), Langboat
-
- ---
-
- ## License
-
- This model is released under the [Apache 2.0 License](LICENSE).
 
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - code
+ - industrial-code
+ - verilog
+ - cuda
+ - triton
+ - chip-design
+ - cad
+ ---
+
  # InCoder-32B: Code Foundation Model for Industrial Scenarios

  <div align="center">

  [![HuggingFace](https://img.shields.io/badge/🤗-Model%20Hub-yellow)](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder)
  [![GitHub](https://img.shields.io/badge/GitHub-Industrial--Coder-blue)](https://github.com/CSJianYang/Industrial-Coder)
+ [![arXiv](https://img.shields.io/badge/arXiv-2603.16790-red)](https://huggingface.co/papers/2603.16790)
  [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)

  </div>

  ## Model Summary

+ **InCoder-32B** (Industrial-Coder-32B) is the first 32B-parameter code foundation model purpose-built for industrial code intelligence. While general-purpose code LLMs excel at mainstream software tasks, they often struggle with the unique demands of industrial programming — hardware semantics, specialized language constructs, strict resource constraints, and domain-specific correctness verification.
+
+ Presented in the paper [InCoder-32B: Code Foundation Model for Industrial Scenarios](https://huggingface.co/papers/2603.16790), InCoder-32B unifies code intelligence across five industrial domains:

  | Domain | Languages & Frameworks |
  |---|---|
 
  | 🔨 **Compiler Optimization** | x86-64 ASM, C/C++, LLVM-IR |
  | 📐 **3D Modeling / CAD** | CadQuery, OpenCascade, Python |

+ InCoder-32B achieves highly competitive performance on general tasks while establishing the strongest open-source baselines across all evaluated industrial domains.

  ---

  | HumanEval+ | **89.6%** |
  | MBPP+ | **78.3%** |
  | BigCodeBench (Full) | **49.8%** |

  ### Industrial Code Benchmarks

  | Benchmark | Domain | InCoder-32B | Best Competing Open-Weight |
  |---|---|---|---|
  | VeriScope Score | Chip Design | **80.7** | 83.2 (GLM-5) |
  | CAD-Coder Compile | 3D Modeling | **82.0%** | 48.0% (Kimi-K2-Thinking) |
  | KernelBench L1 | GPU Optimization | **22.2%** | 16.2% (GLM-5) |
  | KernelBench L2 | GPU Optimization | **36.0%** | 28.0% |

+ > InCoder-32B leads all open-weight baselines on CAD-Coder and KernelBench (all three levels), and even surpasses proprietary models like Claude-Sonnet-4.6 on CAD-Coder IoU and KernelBench L1/L2/L3.

  ---

  | Parameters | ~32B |
  | Layers | 64 |
  | Hidden Size | 5,120 |
  | Max Context Length | 131,072 (128K) |
  | Positional Encoding | RoPE (θ = 500,000) |
  | Precision | BFloat16 |

  ---

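As a rough intuition for the table's RoPE base of θ = 500,000: a larger base stretches the rotary wavelengths so positional signals stay distinguishable at 128K tokens. A minimal sketch, assuming the 128-dimensional attention heads reported in the paper:

```python
import math

def rope_wavelengths(theta: float, head_dim: int = 128):
    """Per-dimension-pair rotation wavelengths (in tokens) for RoPE.

    Pair i rotates at frequency theta ** (-2*i / head_dim), so its
    wavelength in tokens is 2*pi * theta ** (2*i / head_dim).
    """
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

waves = rope_wavelengths(500_000)
print(f"shortest wavelength: {waves[0]:.2f} tokens")   # 2*pi ~ 6.28
print(f"longest wavelength:  {waves[-1]:.0f} tokens")  # far beyond the 128K window
```

The longest wavelength comfortably exceeds the 131,072-token context, which is the usual rationale for raising θ when extending context.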
+ ## Training Pipeline: Code-Flow

  InCoder-32B is trained through a three-stage **Code-Flow** pipeline:

  ### Stage 1 — Pre-training & Annealing
+ - **Industrial recall**: a dedicated data pipeline using rule-based filtering, FastText classifiers, and semantic retrieval for Verilog, CUDA, firmware C, and CadQuery.
+ - **Refinement**: OCR extraction from technical manuals, multi-level deduplication, and repository-level fork consolidation.
+ - **Training**: 15T total tokens with autoregressive LM + Fill-in-the-Middle (FIM) objectives.
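The rule-based first pass of domain recall can be pictured as a simple extension-plus-keyword check; the sketch below is illustrative only (the paper mentions signals such as `endmodule` and `__global__`, but its full rule set is richer):

```python
import re

# Hypothetical first-pass recall rules: a file is recalled for a domain
# only if both its extension and a domain keyword match.
DOMAIN_RULES = {
    "verilog": ({".v", ".sv"}, re.compile(r"\bendmodule\b")),
    "cuda": ({".cu", ".cuh"}, re.compile(r"__global__")),
    "cadquery": ({".py"}, re.compile(r"\bimport\s+cadquery\b")),
}

def recall_domain(path: str, source: str):
    """Return the matched industrial domain for a file, or None."""
    ext = "." + path.rsplit(".", 1)[-1] if "." in path else ""
    for domain, (exts, keyword) in DOMAIN_RULES.items():
        if ext in exts and keyword.search(source):
            return domain
    return None

print(recall_domain("alu.v", "module alu; endmodule"))        # verilog
print(recall_domain("gemm.cu", "__global__ void gemm() {}"))  # cuda
```

Files passing this cheap filter would then go to the FastText classifier and semantic retrieval stages described above.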
  ### Stage 2 — Mid-Training (Context Extension)
+ The context window is extended progressively from 8K to 128K tokens:
+ - **8K → 32K**: targets file-level tasks like completing RTL modules or infilling kernel functions.
+ - **32K → 128K**: unlocks long-context capabilities for extended debugging sessions and cross-module projects.
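The earlier revision of this card described Stage 2.2's graduated warm-up, with long sequences starting at 10% of the batch and ramping linearly to 50%. A minimal sketch of that schedule:

```python
def long_seq_fraction(step: int, total_steps: int,
                      start: float = 0.10, end: float = 0.50) -> float:
    """Fraction of max-length sequences in the batch at a given step
    (linear ramp from `start` to `end`, clamped to [0, 1] progress)."""
    t = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return start + (end - start) * t

print(long_seq_fraction(0, 1000))    # 0.1
print(long_seq_fraction(999, 1000))  # 0.5
```

The exact interpolation used in training is not specified beyond "linear"; this is the simplest consistent reading.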
  ### Stage 3 — Post-Training
+ 2.5M supervised fine-tuning (SFT) samples are constructed from real industrial tasks with execution-grounded verification using toolchains such as Icarus Verilog, `nvcc`, and the Renode STM32 simulator.
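The execution-grounded idea is that a candidate sample survives only if it compiles and simulates cleanly. The paper's actual harness is not released; below is a minimal sketch using Icarus Verilog's standard CLI (`iverilog -o …` then `vvp`), with an injectable runner so the loop itself is testable without the toolchain:

```python
import pathlib
import subprocess
import tempfile

def iverilog_commands(workdir: str):
    """Compile-then-simulate command pair for Icarus Verilog."""
    w = pathlib.Path(workdir)
    return (
        ["iverilog", "-o", str(w / "sim"), str(w / "design.v"), str(w / "tb.v")],
        ["vvp", str(w / "sim")],
    )

def verify(design: str, testbench: str, run=subprocess.run) -> bool:
    """Execution-grounded check: keep a sample only if both the compile
    step and the simulation step exit with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        pathlib.Path(tmp, "design.v").write_text(design)
        pathlib.Path(tmp, "tb.v").write_text(testbench)
        for cmd in iverilog_commands(tmp):
            if run(cmd, capture_output=True).returncode != 0:
                return False
    return True
```

A real harness would additionally diff simulation output against a golden reference rather than trusting exit codes alone.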
  ---

  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

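Since the model is trained with a FIM objective, infilling prompts are also possible. The sentinel strings below are placeholders, not confirmed for this model — check `tokenizer.special_tokens_map` for the actual FIM tokens before use:

```python
# PLACEHOLDER sentinels: verify against tokenizer.special_tokens_map.
PREFIX, SUFFIX, MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def fim_prompt(prefix: str, suffix: str) -> str:
    """Build a prefix-suffix-middle (PSM) infilling prompt."""
    return f"{PREFIX}{prefix}{SUFFIX}{suffix}{MIDDLE}"

prompt = fim_prompt(
    "module counter(input clk, output reg [7:0] q);\n",
    "\nendmodule\n",
)
# pass `prompt` through the tokenizer/model.generate call shown above
```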
  ---

+ ## Limitations & Disclaimers

+ Based on failure analysis, the model may struggle with:
+ - **API knowledge**: linker errors from undefined HAL/CMSIS functions in embedded C.
+ - **Functional semantics**: producing compilable but functionally incorrect RTL in complex logic scenarios.
+ - **Optimization**: functionally correct but sub-optimal GPU kernel performance.

+ Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware) requires expert review before deployment.
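The sandboxing advice can be made concrete with a process-level timeout. This is a minimal precaution only, not a real security boundary (proper isolation needs containers or VMs):

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout_s: float = 5.0):
    """Run generated Python in a child process with a hard timeout.
    Returns (exit_code, stdout); (None, "") on timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout_s)
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return None, ""
    finally:
        os.unlink(path)  # always clean up the temp script

rc, out = run_sandboxed("print(2 + 2)")
print(rc, out.strip())  # 0 4
```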

  ---

  journal={arXiv preprint arXiv:2603.16790},
  year={2026}
  }
+ ```