code-eval

What

This is a repo I use to run HumanEval on code models; adjust as needed. Some scripts were adapted from the WizardCoder repo (process_eval.py). The evaluation code is duplicated across several files, mostly to handle edge cases around model tokenization and loading (it may eventually be cleaned up).
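
For orientation, the eval scripts all follow roughly the same loop. Below is a minimal sketch of that loop, assuming a Hugging Face causal LM and the human-eval helpers; the model id, sampling settings, and output path are placeholders, not the exact code from any of the scripts in this repo:

```python
# Simplified sketch of the shared structure of the eval_*.py scripts.
# Model id, generation settings, and output path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

model_id = "WizardLM/WizardCoder-15B-V1.0"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

samples = []
for task_id, problem in read_problems().items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs, max_new_tokens=512, do_sample=True, temperature=0.2
    )
    # Keep only the newly generated tokens, not the echoed prompt
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("results/wizard/eval.jsonl", samples)
```

Note that pass@10 requires at least 10 samples per task, whereas this sketch draws a single sample per task.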

Results

The table is sorted by pass@1 score; how pass@k is computed is noted below the table.

| model | size | pass@1 | pass@10 | screenshot |
| --- | --- | --- | --- | --- |
| sahil2801/replit-code-instruct-glaive | 3B | 63.5% | 67% | instruct-glaive |
| WizardCoder-15B-V1.0 | 15B | 57% | 68.9% | wizardcoder |
| bigcode/starcoder | 15B | 34.6% | 48.7% | starcoder |
| openchat/opencoderplus | 15B | 27.3% | 43.9% | opencoder |
| teknium/Replit-v1-CodeInstruct-3B | 3B | 25.8% | 42.6% | replit-codeinstruct-v1 |
| teknium/Replit-v2-CodeInstruct-3B | 3B | 21.5% | 31% | replit-codeinstruct-v2 |
| replit-code-v1-3b | 3B | 15.1% | 27.4% | replit-code-v1 |
| xgen-7b-8k-base | 7B | 14.6% | 20.7% | xgen-7b-8k-base |
| mpt-7b | 7B | 11.7% | 14% | mpt-7b |
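
pass@1 and pass@10 come from the HumanEval harness: generate n samples per problem, count how many pass the unit tests, and apply the unbiased pass@k estimator from the HumanEval paper (this is the quantity evaluate_functional_correctness reports). For reference:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per task, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 3 passing samples out of 10 gives pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))
```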

Setup

Create a Python environment

python -m venv env && source env/bin/activate

Install dependencies

pip install -r requirements.txt

Run the eval script

# replace script file name for various models:
# eval_wizard.py
# eval_opencode.py
# eval_mpt.py
# eval_starcoder.py
# eval_replit.py
# eval_replit_glaive.py
# eval_replit_instruct.py

python eval_wizard.py

Process the jsonl file to extract code samples from model completions

Note: the Replit base model, the Replit instruct models, and StarCoder do not go through this process

# replace args for various models:
# --path results/wizard --out_path results/wizard/processed.jsonl
# --path results/opencode --out_path results/opencode/processed.jsonl

python process_eval.py --path results/wizard --out_path results/wizard/processed.jsonl --add_prompt
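
The instruct-tuned models tend to wrap their answers in chat text and markdown fences, so the completions have to be reduced to plain code before they can be executed. The snippet below illustrates that kind of extraction with a simple regex; it is a hypothetical stand-in, not the actual logic in process_eval.py:

```python
import re

# Hypothetical helper illustrating the kind of extraction process_eval.py performs.
FENCE_RE = re.compile(r"```(?:python)?\s*\n(.*?)```", re.DOTALL)

def extract_code(completion: str) -> str:
    """Return the first fenced code block if one exists, otherwise the raw completion."""
    match = FENCE_RE.search(completion)
    return match.group(1) if match else completion

print(extract_code("Here is the solution:\n```python\ndef add(a, b):\n    return a + b\n```"))
```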

Then get the results

# replace args for various models:
# results/wizard/processed.jsonl
# results/starcoder/eval.jsonl
# results/mpt/eval.jsonl
# results/opencode/processed.jsonl
# results/replit_instruct/eval.jsonl
# results/replit_glaive/eval.jsonl
# results/replit/eval.jsonl

evaluate_functional_correctness results/wizard/processed.jsonl
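
evaluate_functional_correctness is the CLI installed by the human-eval package; it runs the unit tests against each completion and prints the pass@k scores. Its input must be a jsonl file with one JSON object per line containing task_id and completion fields. A quick sanity check (the path is just an example):

```python
import json

# Each line of the file passed to evaluate_functional_correctness should be a
# JSON object with "task_id" and "completion" fields, e.g.
# {"task_id": "HumanEval/0", "completion": "    ...function body..."}
with open("results/wizard/processed.jsonl") as f:  # path is just an example
    for line in f:
        sample = json.loads(line)
        assert "task_id" in sample and "completion" in sample
```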
