
Pseudocode-to-code

This directory contains the code for the pseudocode-to-code experiments. It includes a synthetic pseudocode-to-code dataset with 1000 paired examples and 20000 unlabeled examples (other versions can be generated with the given code), as well as a processed version of the SPoC dataset, which pairs crowdsourced pseudocode with code from programming competitions on codeforces.com.

The synthetic pseudocode-to-code dataset supplies the model with all but the declaration types, which must be inferred from context. The denoiser helps with global type inference and instantiation decisions, simplifying the task for the base predictor.

We consider full-program pseudocode-to-code translation instead of the line-by-line translation used in previous work. We use the compiler only during evaluation, so compiler messages are not used as side information. We also use greedy decoding; beam search could be used instead to improve the results, at the expense of added computation.
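The trade-off between greedy decoding and beam search can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_model`, its tokens, and its probabilities are hypothetical and are not part of this repository or of fairseq. The sketch only shows why keeping more than one hypothesis can find a higher-probability sequence.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"
Prefix = Tuple[str, ...]
NextTokenFn = Callable[[Prefix], Dict[str, float]]


def toy_model(prefix: Prefix) -> Dict[str, float]:
    """Hypothetical next-token distribution, used only for illustration."""
    if prefix == ():
        return {"a": 0.6, "b": 0.4}
    if prefix == ("a",):
        return {"x": 0.5, "y": 0.3, "z": 0.2}
    if prefix == ("b",):
        return {"s": 0.9, EOS: 0.1}
    return {EOS: 1.0}


def greedy_decode(next_probs: NextTokenFn, max_len: int = 10) -> Tuple[List[str], float]:
    """Pick the single most probable token at every step."""
    prefix: Prefix = ()
    log_prob = 0.0
    for _ in range(max_len):
        probs = next_probs(prefix)
        token = max(probs, key=probs.get)
        log_prob += math.log(probs[token])
        prefix += (token,)
        if token == EOS:
            break
    return list(prefix), log_prob


def beam_decode(next_probs: NextTokenFn, beam_size: int = 2,
                max_len: int = 10) -> Tuple[List[str], float]:
    """Keep the beam_size best partial sequences at every step."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    finished: List[Tuple[Prefix, float]] = []
    for _ in range(max_len):
        candidates = [
            (prefix + (token,), lp + math.log(p))
            for prefix, lp in beams
            for token, p in next_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, lp in candidates[:beam_size]:
            # Sequences that emitted EOS are set aside; the rest keep growing.
            (finished if prefix[-1] == EOS else beams).append((prefix, lp))
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[1])
    return list(best[0]), best[1]


if __name__ == "__main__":
    print("greedy:", greedy_decode(toy_model))
    print("beam:  ", beam_decode(toy_model, beam_size=2))
```

In this toy setting, greedy decoding commits to the locally best first token, while a beam of size 2 also keeps the alternative and ends with a more probable full sequence. The beam version evaluates roughly `beam_size` times as many continuations per step, which is the added computation mentioned above.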

All the code-based data is in C++.

Setup

To get the SPoC unlabeled data, run the script download_spoc_unlabeled_data.sh. It should add three new directories to data/spoc-data/train/split/, each with a name beginning with preprocess_denoise_.
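One way to confirm that the download worked is to list the new directories. The following is a minimal sketch, not part of the repository: the helper name `find_denoise_dirs` is hypothetical, and it assumes the script is run from this directory so that the relative path resolves.

```python
from pathlib import Path
from typing import List


def find_denoise_dirs(split_dir: str = "data/spoc-data/train/split") -> List[Path]:
    """Return the directories under split_dir whose names start with preprocess_denoise_."""
    root = Path(split_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.glob("preprocess_denoise_*") if p.is_dir())


if __name__ == "__main__":
    found = find_denoise_dirs()
    print(f"found {len(found)} preprocess_denoise_* directories")
    for path in found:
        print(" ", path)
```

If fewer than three directories are reported, the download most likely did not complete.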

Please install fairseq by going into the fairseq directory and running pip install -e . in your virtualenv.

If you see "file exists" errors for dictionary files, this is OK: the preprocessing step that creates the dictionaries is skipped when they already exist.

Run

The synthetic experiment can be run using the command bash run_all_synthetic.sh. The SPoC experiment can be run using the command python run_spoc.py.

Data

The synthetic data is in data/synthetic_1000a/, which includes train, val, and test splits. Each split contains four data files:

{split}.src is the pseudocode.
{split}.tgt is the code.
{split}.inp is the input file for the test case.
{split}.out is the expected output of the test case, as produced by the gold code in {split}.tgt.
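The four files of a split can be loaded together with a short helper. This is a sketch under stated assumptions rather than the repository's loader: the function name `load_split` is hypothetical, and it assumes each file stores one example per line with the four files aligned line by line. Check the actual files before relying on that layout.

```python
from pathlib import Path
from typing import Dict, List

SUFFIXES = ("src", "tgt", "inp", "out")


def load_split(split_dir: str, split: str) -> Dict[str, List[str]]:
    """Read {split}.src/.tgt/.inp/.out from split_dir.

    Assumption (not confirmed by the README): one example per line,
    with the four files aligned line by line.
    """
    root = Path(split_dir)
    data = {
        suffix: (root / f"{split}.{suffix}").read_text(encoding="utf-8").splitlines()
        for suffix in SUFFIXES
    }
    lengths = {suffix: len(lines) for suffix, lines in data.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"files are not line-aligned: {lengths}")
    return data
```

Under the same assumption, `load_split(some_dir, "train")["src"][i]` would be the pseudocode whose gold code is `["tgt"][i]`, where `some_dir` is whichever directory holds the train files.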

Running the synthetic experiment with different dataset settings will cause the code to generate new data.

The SPoC training data contains the SPoC training examples with fewer than 1000 characters. The SPoC unlabeled data contains code corrupted according to the corruption method in this paper.
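A length filter of this kind can be sketched as follows. The README does not say whether the 1000-character limit is measured on the pseudocode, the code, or both, so this sketch assumes it applies to the code side; the function name `filter_short_examples` and the pair representation are hypothetical as well.

```python
from typing import Iterable, List, Tuple

MAX_CHARS = 1000  # limit stated in the README


def filter_short_examples(pairs: Iterable[Tuple[str, str]],
                          max_chars: int = MAX_CHARS) -> List[Tuple[str, str]]:
    """Keep (pseudocode, code) pairs whose code has fewer than max_chars characters.

    Assumption: the limit is measured on the code, not the pseudocode.
    """
    return [(src, tgt) for src, tgt in pairs if len(tgt) < max_chars]
```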

