This is a repo I use to run HumanEval on code models; adjust as needed. Some scripts were adapted from the WizardCoder repo (process_eval.py). The evaluation code is duplicated across several files, mostly to handle edge cases around model tokenization and loading (it might eventually get cleaned up).

The table below is sorted by pass@1 score.
| model | size | pass@1 | pass@10 | screenshot |
|---|---|---|---|---|
| sahil2801/replit-code-instruct-glaive | 3B | 63.5% | 67% | |
| WizardCoder-15B-V1.0 | 15B | 57% | 68.9% | |
| bigcode/starcoder | 15B | 34.6% | 48.7% | |
| openchat/opencoderplus | 15B | 27.3% | 43.9% | |
| teknium/Replit-v1-CodeInstruct-3B | 3B | 25.8% | 42.6% | |
| teknium/Replit-v2-CodeInstruct-3B | 3B | 21.5% | 31% | |
| replit-code-v1-3b | 3B | 15.1% | 27.4% | |
| xgen-7b-8k-base | 7B | 14.6% | 20.7% | |
| mpt-7b | 7B | 11.7% | 14% | |
Create python environment

```sh
python -m venv env && source env/bin/activate
```

Install dependencies

```sh
pip install -r requirements.txt
```

Run the eval script

```sh
# replace script file name for various models:
# eval_wizard.py
# eval_opencode.py
# eval_mpt.py
# eval_starcoder.py
# eval_replit.py
# eval_replit_glaive.py
# eval_replit_instruct.py
python eval_wizard.py
```
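Each eval_*.py script follows roughly the same shape: load the model, sample completions for every HumanEval task, and write them to a jsonl file under results/. The sketch below is only illustrative; the model name, sampling settings, and output path are placeholder assumptions, not the exact code in any of the scripts.

```python
# Illustrative sketch of the common loop in the eval_*.py scripts (not verbatim).
# Assumes the OpenAI human-eval package and transformers are installed.
from human_eval.data import read_problems, write_jsonl
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "WizardLM/WizardCoder-15B-V1.0"  # placeholder; each script hard-codes its own model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)

problems = read_problems()
samples_per_task = 10  # enough samples to estimate pass@1 and pass@10
samples = []

for task_id, problem in problems.items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    for _ in range(samples_per_task):
        out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
        # keep only the generated tokens, dropping the prompt
        completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        samples.append({"task_id": task_id, "completion": completion})

write_jsonl("results/wizard/eval.jsonl", samples)
```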
Process the jsonl file to extract code samples from model completions

Note: the replit base model, the replit instruct model, and starcoder do not go through this step.

```sh
# replace args for various models:
# --path results/wizard --out_path results/wizard/eval.jsonl
# --path results/opencode --out_path results/opencode/eval.jsonl
python process_eval.py --path results/wizard --out_path results/wizard/processed.jsonl --add_prompt
```
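process_eval.py (adapted from the WizardCoder repo) essentially pulls the code out of each chat-style completion and, with --add_prompt, pairs it back with the original prompt. The helper below is only a rough illustration of that extraction idea; the function name and regex are assumptions, not the script's actual code.

````python
import re

# Hypothetical illustration of the extraction step: instruct-style completions
# usually wrap the answer in a markdown code fence, so grab the first fenced block.
FENCE = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_code(completion: str) -> str:
    match = FENCE.search(completion)
    return match.group(1) if match else completion  # fall back to the raw completion text
````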
Then get the results

```sh
# replace args for various models:
# results/wizard/processed.jsonl
# results/starcoder/eval.jsonl
# results/mpt/eval.jsonl
# results/opencode/processed.jsonl
# results/replit_instruct/eval.jsonl
# results/replit_glaive/eval.jsonl
# results/replit/eval.jsonl
evaluate_functional_correctness results/wizard/processed.jsonl
```
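evaluate_functional_correctness is the entry point from OpenAI's human-eval package: it runs each completion against the task's unit tests and reports pass@k using the unbiased estimator from the Codex paper. A minimal re-implementation of that estimator, just for reference:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n total samples per task, c of them passing the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples, 3 correct -> pass@1 = 0.3, pass@10 = 1.0
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 10))
```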