Machine learning for detecting the programming language of a source code snippet given as a string of text. This code uses the Kaggle dataset Github Code Snippets, which contains 97,000,000 code samples, to train, test, and deploy classification models with accuracy over 80%.
Our system supports 20 programming languages: Bash, C, C++, CSV, DOTFILE, Go, HTML, JSON, Java, JavaScript, Jupyter, Markdown, PowerShell, Python, Ruby, Rust, Shell, TSV, Text, and Yaml.
With our fork of the Ace online text editor, you can run and develop code with real-time, predictive syntax highlighting as you type. After training a model, start the editor with
python3 app.py
Requirements:
- the Kaggle dataset (Github Code Snippets)
- Python >3.0
- scikit-learn
- matplotlib for visuals
- Flask for running the GUI locally
To build, train, and test a Naive Bayes model, use
python3 model.py --algorithm bayes --out models/bayes.sav
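As a rough illustration of what training a Naive Bayes language classifier can look like with scikit-learn, here is a minimal sketch. It is not the contents of model.py: the toy SAMPLES list, the train_bayes helper, the character n-gram features, and the use of pickle for saving are all assumptions made for this example.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the Kaggle data: (snippet, language) pairs invented here.
SAMPLES = [
    ("import numpy as np\nprint(np.zeros(3))", "Python"),
    ("def add(a, b):\n    return a + b", "Python"),
    ("#include <stdio.h>\nint main(void) { return 0; }", "C"),
    ("int sum(int a, int b) { return a + b; }", "C"),
    ("console.log('hi');\nconst x = () => 1;", "JavaScript"),
    ("function add(a, b) { return a + b; }", "JavaScript"),
]


def train_bayes(samples):
    """Fit a character n-gram Naive Bayes classifier on (snippet, label) pairs."""
    texts = [text for text, _ in samples]
    labels = [label for _, label in samples]
    model = make_pipeline(
        # Assumption: character 1- to 3-grams as features; the real feature
        # extraction used by this project may differ.
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),
        MultinomialNB(),
    )
    model.fit(texts, labels)
    return model


model = train_bayes(SAMPLES)

# Assumption: the trained model is serialized with pickle. Here it is kept in
# memory; writing `blob` to a file would persist it.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

prediction = restored.predict(["import os\nprint(os.getcwd())"])[0]
print(prediction)
```

The vectorizer and classifier are bundled in one pipeline so that the saved object accepts raw strings directly, which avoids having to store and reload the feature extractor separately.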
To perform runtime diagnostics on multiple ML classifiers, run
python3 runtime.py
and visualize the results with
python3 plot_runtime.py
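The kind of comparison described above can be sketched as follows. This is an illustrative timing harness only, not the contents of runtime.py: the three candidate classifiers, the toy data, and the time_classifiers helper are assumptions chosen for the example, and the project may compare a different set of models.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data invented for illustration.
TEXTS = [
    "import numpy as np\nprint(np.zeros(3))",
    "def add(a, b):\n    return a + b",
    "#include <stdio.h>\nint main(void) { return 0; }",
    "int sum(int a, int b) { return a + b; }",
    "console.log('hi');\nconst x = () => 1;",
    "function add(a, b) { return a + b; }",
]
LABELS = ["Python", "Python", "C", "C", "JavaScript", "JavaScript"]

# Assumption: these three classifiers are examples, not the project's list.
CANDIDATES = {
    "bayes": MultinomialNB(),
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}


def time_classifiers(texts, labels, candidates):
    """Return {name: (fit_seconds, predict_seconds)} for each classifier."""
    timings = {}
    for name, estimator in candidates.items():
        pipeline = make_pipeline(
            CountVectorizer(analyzer="char", ngram_range=(1, 2)),
            estimator,
        )
        start = time.perf_counter()
        pipeline.fit(texts, labels)
        fit_seconds = time.perf_counter() - start

        start = time.perf_counter()
        pipeline.predict(texts)
        predict_seconds = time.perf_counter() - start

        timings[name] = (fit_seconds, predict_seconds)
    return timings


timings = time_classifiers(TEXTS, LABELS, CANDIDATES)
for name, (fit_s, predict_s) in timings.items():
    print(f"{name:10s} fit={fit_s:.4f}s predict={predict_s:.4f}s")
```

On a dataset this small the numbers are dominated by overhead; meaningful comparisons need a realistic sample of the training data.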
To run a trained model on a custom code snippet and visualize its output probabilities, use
python3 plot_prediction.py --mpath models/bayes.sav --out plots/prediction.png --snippet "import numpy as np\n"
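To show what the per-language probabilities behind such a plot might look like, here is a small sketch that ranks languages by predicted probability for one snippet. It is not the contents of plot_prediction.py: the rank_languages helper and the toy model trained inline are assumptions for this example, and plotting is left out so the code runs without a display.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration; a real run would load a saved model.
TEXTS = [
    "import numpy as np\nprint(np.zeros(3))",
    "def add(a, b):\n    return a + b",
    "#include <stdio.h>\nint main(void) { return 0; }",
    "int sum(int a, int b) { return a + b; }",
    "console.log('hi');\nconst x = () => 1;",
    "function add(a, b) { return a + b; }",
]
LABELS = ["Python", "Python", "C", "C", "JavaScript", "JavaScript"]


def rank_languages(model, snippet):
    """Return (language, probability) pairs, most likely language first."""
    probabilities = model.predict_proba([snippet])[0]
    pairs = zip(model.classes_, probabilities)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
).fit(TEXTS, LABELS)

ranking = rank_languages(model, "import numpy as np\n")
for language, probability in ranking:
    print(f"{language:12s} {probability:.3f}")
```

The ranked pairs could be passed to a matplotlib bar chart to produce an image similar in spirit to plots/prediction.png.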
See docs/ for a research paper and presentation on this project.
Andrew Dircks ([email protected]) & Sam Kantor ([email protected])