andrewdircks/lang_classifier

Programming Language Classifier

Machine learning for detecting the language of source code from a string of text. This project uses the Kaggle dataset Github Code Snippets, which contains 97,000,000 code samples, to train, test, and deploy classification models with over 80% accuracy.

Our system supports 20 programming languages: Bash, C, C++, CSV, DOTFILE, Go, HTML, JSON, Java, JavaScript, Jupyter, Markdown, PowerShell, Python, Ruby, Rust, Shell, TSV, Text, and YAML.

Auto syntax-highlight

With our fork of the Ace online text editor, you can write and run code with real-time, predictive syntax highlighting as you type. After training a model, run:

python3 app.py

Dependencies

Running

To build, train, and test a Naive Bayes model, use

python3 model.py --algorithm bayes --out models/bayes.sav
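As a rough illustration of what a Naive Bayes language classifier does, here is a minimal standard-library sketch. It is not the code in model.py, which is not shown here. The tokenizer, the `NaiveBayes` class, the four toy training snippets, and the use of `pickle` for serialization are all assumptions made for this example.

```python
import math
import pickle
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: identifier-like words plus single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(snippet):
    return TOKEN_RE.findall(snippet)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, snippets, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for snippet, label in zip(snippets, labels):
            self.token_counts[label].update(tokenize(snippet))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def log_scores(self, snippet):
        """Return log prior + summed log likelihoods for each language."""
        total = sum(self.class_counts.values())
        # Tokens never seen in training carry no evidence, so they are skipped.
        tokens = [t for t in tokenize(snippet) if t in self.vocab]
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, snippet):
        scores = self.log_scores(snippet)
        return max(scores, key=scores.get)


# Toy training data, invented for this sketch (the real dataset is far larger).
train = [
    ("import numpy as np\ndef f(x):\n    return x", "Python"),
    ("def main():\n    print('hi')", "Python"),
    ("#include <stdio.h>\nint main() { return 0; }", "C"),
    ("int x = 0; printf(\"%d\", x);", "C"),
]
model = NaiveBayes().fit([s for s, _ in train], [l for _, l in train])

# Round-trip through pickle in memory; writing the bytes to a .sav file
# would work the same way with pickle.dump.
restored = pickle.loads(pickle.dumps(model))
print(restored.predict("import numpy as np\n"))
print(restored.predict("#include <stdlib.h>"))
```

Working in log space avoids numeric underflow when many small token probabilities are multiplied together.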

To perform runtime diagnostics on multiple ML classifiers, run

python3 runtime.py

and visualize the results with

python3 plot_runtime.py
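The kind of measurement a runtime comparison relies on can be sketched with `time.perf_counter`. This is an assumption-laden illustration, not the contents of runtime.py: `time_call` and the two stand-in "classifiers" are hypothetical and exist only to give the harness something to time.

```python
import time


def time_call(fn, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical stand-ins for real models: each maps snippets to labels.
def keyword_classifier(snippets):
    return ["Python" if "import " in s else "C" for s in snippets]


def length_classifier(snippets):
    return ["Python" if len(s) < 40 else "C" for s in snippets]


snippets = ["import numpy as np\n", "#include <stdio.h>\nint main() { return 0; }"] * 1000

results = {
    "keyword": time_call(keyword_classifier, snippets),
    "length": time_call(length_classifier, snippets),
}

# Print the timings from fastest to slowest.
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:<10}{seconds * 1000:.3f} ms")
```

Taking the minimum over several repeats reduces noise from other processes; a mean or median is a reasonable alternative when variance itself is of interest.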

To run a trained model on a custom code snippet and visualize the model's output probabilities, use

python3 plot_prediction.py --mpath models/bayes.sav --out plots/prediction.png --snippet 'import numpy as np\n'
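To show what "output probabilities" can look like, the sketch below converts per-language log scores into normalized probabilities and renders them as a text bar chart. It does not reproduce plot_prediction.py, which writes an image file. The function names and the example scores are hypothetical.

```python
import math


def normalize_log_scores(log_scores):
    """Convert log scores to probabilities that sum to 1 (a softmax)."""
    # Subtracting the maximum first keeps math.exp from underflowing to 0.
    peak = max(log_scores.values())
    exps = {label: math.exp(score - peak) for label, score in log_scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def text_bars(probs, width=30):
    """Render probabilities as one text bar per label, most likely first."""
    lines = []
    for label, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(p * width)
        lines.append(f"{label:<12}{bar:<{width}} {p:.3f}")
    return "\n".join(lines)


# Hypothetical log scores for a single snippet, invented for illustration.
example_scores = {"Python": -12.0, "Ruby": -14.5, "Bash": -18.0}
probs = normalize_log_scores(example_scores)
print(text_bars(probs))
```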

Documentation

See docs/ for a research paper and presentation on this project.

Authors

Andrew Dircks & Sam Kantor
