	---
	license: llama3.1
	datasets:
	- OpenCoder-LLM/opc-sft-stage1
	- OpenCoder-LLM/opc-sft-stage2
	language:
	- en
	base_model:
	- meta-llama/Llama-3.1-8B-Instruct
	model-index:
	- name: Control-LLM-Llama3.1-8B-OpenCoder8
	  results:
	  - task:
	      type: code-evaluation
	    dataset:
	      type: mixed
	      name: Code Evaluation Dataset
	    metrics:
	    - name: pass_at_1,n=1 (code_instruct)
	      type: pass_at_1
	      value: 0.770508826583593
	      stderr: 0.013547264970313243
	      verified: false
	    - name: pass_at_1,n=1 (humaneval_greedy_instruct)
	      type: pass_at_1
	      value: 0.823170731707317
	      stderr: 0.029883277857485988
	      verified: false
	    - name: pass_at_1,n=1 (humaneval_plus_greedy_instruct)
	      type: pass_at_1
	      value: 0.7621951219512195
	      stderr: 0.033346454086653404
	      verified: false
	    - name: pass_at_1,n=1 (mbpp_plus_0shot_instruct)
	      type: pass_at_1
	      value: 0.7751322751322751
	      stderr: 0.02150209607822914
	      verified: false
	    - name: pass_at_1,n=1 (mbpp_sanitized_0shot_instruct)
	      type: pass_at_1
	      value: 0.7354085603112841
	      stderr: 0.027569713464529938
	      verified: false
	  - task:
	      type: original-capability
	    dataset:
	      type: meta/Llama-3.1-8B-Instruct-evals
	      name: Llama-3.1-8B-Instruct-evals Dataset
	      dataset_path: "meta-llama/llama-3.1-8_b-instruct-evals"
	      dataset_name: "Llama-3.1-8B-Instruct-evals__arc_challenge__details"
	    metrics:
	    - name: exact_match,strict-match (original_capability_instruct)
	      type: exact_match
	      value: 0.5599378769819771
	      stderr: 0.0028491774433443513
	      verified: false
	    - name: exact_match,strict-match (meta_arc_0shot_instruct)
	      type: exact_match
	      value: 0.8094420600858369
	      stderr: 0.011511446994122106
	      verified: false
	    - name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
	      type: exact_match
	      value: 0.32589285714285715
	      stderr: 0.02216910313464341
	      verified: false
	    - name: exact_match,strict-match (meta_mmlu_0shot_instruct)
	      type: exact_match
	      value: 0.681241988320752
	      stderr: 0.003932622311434926
	      verified: false
	    - name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
	      type: exact_match
	      value: 0.4029255319148936
	      stderr: 0.004471732136513382
	      verified: false
	pipeline_tag: text-generation
	library_name: transformers
	---

	# Control-LLM-Llama3.1-8B-OpenCoder8
	This model is Llama-3.1-8B-Instruct fine-tuned for coding tasks on the OpenCoder SFT datasets (`opc-sft-stage1` and `opc-sft-stage2`), as described in the paper [Control LLM: Controlled Evolution for Intelligence Retention in LLM](https://huggingface.co/papers/2501.10979).

	Code: https://github.com/linkedin/ControlLLM.

	## Linked open-source code: training, evaluation, and benchmarks
	The training, evaluation, and benchmark code for this model is in the GitHub repository [Control-LLM](https://github.com/linkedin/ControlLLM).

	## Evaluation Results
	Here is an overview of the evaluation results and findings:

	### Hybrid Expansion on OpenCoder
	The following diagram illustrates how hybrid expansion works.

	![Catastrophic Forgetting](plots/control_llm_structure_analysis.png)

	### Benchmark Results Table
	The table below summarizes evaluation results across coding tasks and original capabilities.

	| Model              | MB+  | MS   | HE+  | HE   | C-Avg | ARC  | GP   | MLU  | MLUP | O-Avg | Overall |
	|--------------------|------|------|------|------|-------|------|------|------|------|-------|---------|
	| Llama3.1-8B-Ins    | 70.4 | 67.7 | 66.5 | 70.7 | 69.1  | 83.4 | 29.9 | 72.4 | 46.7 | 60.5  | 64.8    |
	| OpenCoder-8B-Ins   | 81.2 | 76.3 | 78.0 | 82.3 | 79.5  | 8.2  | 25.4 | 37.4 | 11.3 | 24.6  | 52.1    |
	| Full Param Tune    | 75.1 | 69.6 | 71.3 | 76.8 | 73.3  | 24.4 | 21.9 | 43.0 | 19.2 | 31.5  | 52.4    |
	| Partial Param Tune | 75.7 | 71.6 | 74.4 | 79.3 | 75.0  | 70.2 | 28.1 | 60.7 | 32.4 | 48.3  | 61.7    |
	| Stack Expansion    | 77.2 | 72.8 | 73.2 | 78.7 | 75.6  | 80.0 | 26.3 | 66.6 | 38.2 | 54.2  | 64.9    |
	| ControlLLM-Hybrid  | 77.5 | 73.5 | 76.2 | 82.3 | 77.1  | 80.9 | 32.6 | 68.1 | 40.3 | 56.0  | 66.6    |

	---

	### Explanation:
	- MB+: MBPP Plus
	- MS: MBPP Sanitized
	- HE+: HumanEval Plus
	- HE: HumanEval
	- C-Avg: Coding average, weighted by benchmark size, across MB+, MS, HE+, and HE
	- ARC: ARC benchmark
	- GP: GPQA benchmark
	- MLU: MMLU (Massive Multitask Language Understanding)
	- MLUP: MMLU Pro
	- O-Avg: Original-capability average, weighted by benchmark size, across ARC, GPQA, MMLU, and MMLU Pro
	- Overall: Combined average across all tasks
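
The size-weighted averages defined above can be sketched in a few lines of Python. This is an illustration, not the evaluation code from the repository: the per-benchmark scores are copied from this card's metadata, while the benchmark sizes and the helper name `size_weighted_average` are assumptions introduced here (the card does not state the sizes). With these assumed sizes, the sketch reproduces the aggregate `code_instruct` and `original_capability_instruct` values reported in the metadata.

```python
# Per-benchmark scores copied from this card's metadata.
coding_scores = {
    "mbpp_plus": 0.7751322751322751,
    "mbpp_sanitized": 0.7354085603112841,
    "humaneval_plus": 0.7621951219512195,
    "humaneval": 0.823170731707317,
}
original_scores = {
    "arc": 0.8094420600858369,
    "gpqa": 0.32589285714285715,
    "mmlu": 0.681241988320752,
    "mmlu_pro": 0.4029255319148936,
}

# ASSUMPTION: number of evaluated examples per benchmark.
# These counts are not given in the card; they are illustrative values
# chosen to be consistent with the reported scores.
coding_sizes = {
    "mbpp_plus": 378,
    "mbpp_sanitized": 257,
    "humaneval_plus": 164,
    "humaneval": 164,
}
original_sizes = {"arc": 1165, "gpqa": 448, "mmlu": 14042, "mmlu_pro": 12032}


def size_weighted_average(scores, sizes):
    """Average the scores, weighting each benchmark by its example count."""
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total


c_avg = size_weighted_average(coding_scores, coding_sizes)
o_avg = size_weighted_average(original_scores, original_sizes)

print(f"C-Avg: {100 * c_avg:.1f}")  # prints "C-Avg: 77.1"
print(f"O-Avg: {100 * o_avg:.1f}")  # prints "O-Avg: 56.0"
```

Weighting by size means large benchmarks such as MMLU dominate O-Avg, so a drop on a small benchmark like GPQA moves the aggregate far less than the same drop on MMLU.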