Magic Keys is a simple predictive text application that works much like a cell phone's keyboard, suggesting the next word to be entered when writing an email or replying to a message. It is written in R and uses the Shiny package to build the interactive web application. This project was developed to complete the Capstone Project of the Data Science Specialization at Coursera. It is inspired by other works such as Word Psychic and Next word prediction. For more details about the n-gram model employed, have a look at this short presentation.
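To give a feel for what an n-gram model does, here is a minimal bigram sketch in base R. It is not the model used by Magic Keys (which relies on the packages listed below); the toy corpus and the names `build_bigrams` and `suggest_next` are invented for illustration only.

```r
# Toy corpus, invented for illustration (not the app's training data).
corpus <- c("thank you for the email",
            "thank you for your message",
            "see you at the meeting")

# Count every pair of adjacent words (bigrams) in a vector of sentences.
build_bigrams <- function(sentences) {
  words  <- strsplit(tolower(sentences), "\\s+")
  first  <- unlist(lapply(words, head, -1))  # every word except the last
  second <- unlist(lapply(words, tail, -1))  # every word except the first
  counts <- as.data.frame(table(first, second), stringsAsFactors = FALSE)
  counts <- counts[counts$Freq > 0, ]        # drop pairs that never occur
  counts[order(-counts$Freq), ]              # most frequent pairs first
}

# Suggest up to k words that most often follow `word`.
suggest_next <- function(counts, word, k = 3) {
  hits <- counts[counts$first == tolower(word), ]
  head(hits$second, k)
}

counts <- build_bigrams(corpus)
suggest_next(counts, "you")  # suggests "for", then "at"
```

A real model would typically use longer n-grams and some fallback for unseen contexts; this sketch only shows the lookup idea.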
Install the dependencies for R:
- utils: to unzip files.
suppressWarnings(install.packages("utils"))
- qdapRegex: regular expression removal, extraction, and replacement tools to clean the training text.
suppressWarnings(install.packages("qdapRegex"))
- tm: basic framework for text mining applications within R.
suppressWarnings(install.packages("tm"))
- slam: to compute frequencies from tm Term-Document Matrices.
suppressWarnings(install.packages("slam"))
- textreg: to convert a tm corpus into a character vector.
suppressWarnings(install.packages("textreg"))
- parallel: for parallel computation.
suppressWarnings(install.packages("parallel"))
- RWeka: to tokenize words from text.
suppressWarnings(install.packages("RWeka"))
- stringr: to split columns from a matrix as part of the process of making n-grams.
suppressWarnings(install.packages("stringr"))
- digest: to apply cryptographic hash functions to the benchmark text.
suppressWarnings(install.packages("digest"))
- data.table: for faster data manipulation.
suppressWarnings(install.packages("data.table"))
- shiny: to build web apps that run on RStudio servers.
suppressWarnings(install.packages("shiny"))
- DT: to display R data frames as tables on HTML pages.
suppressWarnings(install.packages("DT"))
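The one-by-one installs above can also be done in a single pass. The following is a minimal sketch, assuming all listed packages are available from your configured CRAN mirror; `required`, `missing_packages` and `install_missing` are hypothetical names introduced here, not part of the project.

```r
# Packages from the dependency list. utils and parallel ship with R itself,
# so they are normally already installed.
required <- c("utils", "qdapRegex", "tm", "slam", "textreg", "parallel",
              "RWeka", "stringr", "digest", "data.table", "shiny", "DT")

# Hypothetical helper: which of `pkgs` are not in the installed library?
missing_packages <- function(pkgs,
                             installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

# Hypothetical helper: install only what is missing and return those names.
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}
```

Calling `install_missing(required)` from an R session would then install only the packages that are absent, leaving existing installations untouched.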
About RWeka and Mac OS: there seems to be a small problem between RWeka and Java on Mac OS. To solve it, try this:
- On your terminal:
sudo R CMD javareconf
- On R:
install.packages("rJava",type='source')
- On your terminal:
sudo ln -f -s $(/usr/libexec/java_home)/jre/lib/server/libjvm.dylib /usr/local/li
There is still much work to be done on the n-gram model. First of all, the corpus should be augmented with texts from areas beyond news. Second, state-of-the-art models are nowadays based on deep learning, so such models are worth exploring.