---
license: llama2
model-index:
- name: ETRI_CodeLLaMA_7B_CPP
  results:
  - task:
      type: text-generation
    dataset:
      type: HumanEval-X
      name: humanevalsynthesize-cpp
    metrics:
    - name: pass@1
      type: pass@1
      value: 34.3%
      verified: false
---
## ETRI_CodeLLaMA_7B_CPP

We used LoRA to further pre-train Meta's CodeLLaMA-7B-hf model on high-quality C++ code tokens.

We then fine-tuned it on CodeM's C++ instruction data.
| |
## Model Details

ETRI_CodeLLaMA_7B_CPP is a C++-specialized model. It was trained using LoRA and achieves a pass@1 of 34.3% on HumanEval-X (C++).
| |
## Dataset Details

We further pre-trained CodeLLaMA-7B on 543 GB of C++ code collected online, and then fine-tuned it on CodeM's C++ instruction data. Training used a single A100-80GB GPU.
| |
## Requirements

```
pip install torch transformers accelerate
```
| |
## How to reproduce HumanEval-X results

We use the bigcode-evaluation-harness repo to evaluate our trained model. First, clone it and install its dependencies:

```
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
pip install -e bigcode-evaluation-harness
```

Then, run main.py as follows.
| |
```
accelerate launch bigcode-evaluation-harness/main.py \
    --model DDIDU/ETRI_CodeLLaMA_7B_CPP \
    --max_length_generation 512 \
    --prompt continue \
    --tasks humanevalsynthesize-cpp \
    --temperature 0.2 \
    --n_samples 100 \
    --precision bf16 \
    --do_sample True \
    --batch_size 10 \
    --allow_code_execution \
    --save_generations
```
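Because the command samples 100 completions per task, pass@1 is computed with the standard unbiased estimator from the HumanEval paper rather than by greedy decoding; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., HumanEval).

    n: total samples generated per task
    c: number of those samples that pass the tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 100 samples per task of which 34 pass,
# pass@1 = 1 - C(66,1)/C(100,1) = 0.34
print(pass_at_k(100, 34, 1))  # 0.34
```

The harness averages this quantity over all tasks to produce the reported score.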
| |
## Model use

```
from transformers import AutoTokenizer
import transformers
import torch

model = "DDIDU/ETRI_CodeLLaMA_7B_CPP"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

sequences = pipeline(
    '#include <iostream>\n#include <vector>\n\nusing namespace std;\n\nvoid quickSort(int *data, int start, int end) {',
    do_sample=True,
    top_k=10,
    temperature=0.1,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```
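A completion model like this keeps generating past the end of the prompted function, so the output is usually truncated where the opened braces balance. A minimal sketch of such post-processing (the helper below is ours for illustration, not part of transformers or the evaluation harness, and it ignores braces inside string literals and comments):

```python
def truncate_at_balanced_brace(prompt: str, completion: str) -> str:
    """Cut a generated C++ completion at the point where the braces
    opened in the prompt (plus any opened in the completion) all close.

    Illustrative helper only; does not handle braces appearing inside
    string literals or comments.
    """
    depth = prompt.count("{") - prompt.count("}")
    for i, ch in enumerate(completion):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return completion[: i + 1]  # keep the closing brace
    return completion  # braces never balanced; keep everything

prompt = "void f(int x) {"
completion = " if (x > 0) { x--; } }\nint main() { return 0; }"
print(truncate_at_balanced_brace(prompt, completion))
```

Applied to the quickSort prompt above, this would drop any trailing `main` function or extra code the model appends after the sort routine is complete.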