iMessage Spam Detection with CoreML
<h4 id="background">Background</h4>
<p>A couple of years ago, iMessage spam started becoming annoying enough to be reported on by a variety of <a href="http://www.businessinsider.com/imessage-iphone-message-spam-how-to-stop-unwanted-messages-2016-11">major</a> <a href="https://www.wired.com/2014/08/apples-imessage-is-being-taken-over-by-spammers/">news</a> <a href="http://www.macworld.co.uk/how-to/iphone/how-stop-imessage-spam-block-report-imessage-spam-3623845/">sites</a>. Apple responded to the situation by allowing users to report messages that didn’t come from their contacts for review and potential suspension of the sender’s account.</p>
<p>Detecting spam is an age-old problem that has somewhat recently been taken on with great success <a href="https://gmail.googleblog.com/2015/07/the-mail-you-want-not-spam-you-dont.html">by machine learning</a>. And running these models on a phone became orders of magnitude easier with the release of <a href="https://developer.apple.com/machine-learning/">CoreML</a> at WWDC this year. In this post, we develop a simple iMessage App to detect whether a message is spam or not.</p>
<h4 id="about-coreml">About CoreML</h4>
<p>Here’s a quick visual of the CoreML stack:</p>
<p align="center"> <img src="/assets/img/posts/imessage-spam-detection/coreml-stack.png" style="width:60%" /> </p>
<p>Python models are converted using the <strong>coremltools</strong> package (not pictured) into Apple’s new <strong>.mlmodel</strong> format, which can then be used on iOS devices with all the GPU/CPU threading/compute enhancements provided by <strong>Accelerate</strong> (linear algebra library), <strong>BNNS</strong> (basic neural network subroutines), and <strong>MPS</strong> (GPU interface).</p>
<p>Let’s take a look at the <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/">data we’ll be using</a>. Each message is labeled as either “spam” or “ham” (everything else). 
As expected, the messages are short and use many non-standard words, so we’ll need a model that generalizes well. This dataset is also ham-dominated - only around 13% of the data is spam - so our model needs to cope with unbalanced data. A <strong>multinomial naive Bayes classifier</strong> is the standard in spam detection, but a survey of the <a href="http://cs229.stanford.edu/proj2013/ShiraniMehr-SMSSpamDetectionUsingMachineLearningApproach.pdf">literature</a> indicates that <strong>SVMs</strong> and <strong>random forests</strong> are picking up steam. We’re going to try all three of these approaches on top of both the <strong>bag-of-words</strong> and <strong>tf-idf</strong> vectorization procedures and choose the best of the 6 combinations to include in our app.</p>
<p>To incorporate an ML model into an iOS app, one needs to:</p>
<ol>
<li>Train the model in one of the <a href="https://developer.apple.com/documentation/coreml/converting_trained_models_to_core_ml">CoreML-supported Python frameworks</a></li>
<li>Convert it into a .mlmodel file through the coremltools Python 2.7 package</li>
<li>Drop the .mlmodel file into one’s app and use the provided methods to input data and generate predictions.</li>
</ol>
<p>As an aside, it is currently not possible to perform additional training after an .mlmodel has been generated. 
However, it is possible to build a neural network using nothing other than coremltools - take a look at the neural network builder file under coremltools if curious.</p>
<h4 id="choosing-a-model">Choosing a model</h4>
<p>All code for this post is available from <a href="https://github.com/gkswamy98/imessage-spam-detection/tree/master">here</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Each line of the dataset is: label, a tab, then the message.
sms_data = []
with open('SMSSpamCollection.txt', 'r') as raw_data:
    for line in raw_data:
        split_line = line.rstrip('\n').split("\t")
        sms_data.append(split_line)
</code></pre></div></div>
<p>Then, divide it up into messages, labels, training, and test:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.model_selection import train_test_split

sms_data = np.array(sms_data)
X = sms_data[:, 1]  # message text
y = sms_data[:, 0]  # "ham" or "spam"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=22)
</code></pre></div></div>
<p>Build the 6 pipelines:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# SGDClassifier's default hinge loss gives a linear SVM.
pipeline_1 = Pipeline([('vect', CountVectorizer()),('clf', MultinomialNB())])
pipeline_2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB())])
pipeline_3 = Pipeline([('vect', CountVectorizer()),('clf', SGDClassifier())])
pipeline_4 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', SGDClassifier())])
pipeline_5 = Pipeline([('vect', CountVectorizer()),('clf', RandomForestClassifier())])
pipeline_6 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', RandomForestClassifier())])
pipelines = [pipeline_1, pipeline_2, pipeline_3, pipeline_4, pipeline_5, pipeline_6]
</code></pre></div></div>
<p>Now the fun part - perform the classification and check <a href="http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html">precision/recall</a> (we only have 2 classes and 
we want both a low false positive rate and a low false negative rate):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.metrics import classification_report

for pipeline in pipelines:
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
</code></pre></div></div>
<p>In my testing, the SVM seems to perform the best with an average precision of 99%, a result that is supported by <a href="http://ats.cs.ut.ee/u/kt/hw/spam/spam.pdf">work in the field</a>. Using tf-idf doesn’t seem to have a large influence on the classification result, but as doing so is best practice, we’re going to include it as a step in our pipeline.</p>
<h4 id="creating-a-model-file">Creating a model file</h4>
<p>At the time of writing, coremltools is a Python 2.7 package, so make sure to do the following step in the <strong>appropriate Python version</strong>. To run 2 versions of Python side by side on a Mac, use the following commands:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew install pyenv
pyenv install 2.7.12
pyenv global 2.7.12
pyenv rehash
</code></pre></div></div>
<p>And then run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install scipy
pip install scikit-learn
pip install coremltools
</code></pre></div></div>
<p>To create the .mlmodel file, run the following lines. Note that as of the writing of this post, CoreML does not support tf-idf or count vectorizers, so we’ll have to calculate the tf-idf representation in the app. 
For that, we need an ordered list of words that we also generate below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import coremltools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Fit tf-idf on the full dataset and save the vocabulary in feature order.
vectorizer = TfidfVectorizer()
vectorized = vectorizer.fit_transform(X)
words = open('words_ordered.txt', 'w')
for feature in vectorizer.get_feature_names():
    words.write(feature.encode('utf-8') + '\n')
words.close()

# Train the linear SVM and convert it to a CoreML model.
model = LinearSVC()
model.fit(vectorized, y)
coreml_model = coremltools.converters.sklearn.convert(model, "message", 'label')
coreml_model.save('MessageClassifier.mlmodel')
</code></pre></div></div>
<p>You can download all of the above commands as one file from <a href="https://github.com/gkswamy98/imessage-spam-detection/blob/master/spam_detection.py">here</a> and the generated model from <a href="https://github.com/gkswamy98/imessage-spam-detection/blob/master/MessageClassifier.mlmodel">here</a>.</p>
<h4 id="creating-the-imessage-app">Creating the iMessage App</h4>
<p>Create a new iMessage App project in Xcode 9 and drop in the original text file as well as the model and the word list we just generated. Your directory structure should look something like this:</p>
<p align="center"> <img src="/assets/img/posts/imessage-spam-detection/file-structure.png" style="width:30%" /> </p>
<p><br /> Open up the Main Storyboard and change the text of the label from “Hello World” to “Copy a Message”:</p>
<p align="center"> <img src="/assets/img/posts/imessage-spam-detection/copy-message.png" style="width:80%" /> </p>
<p><br /></p>
<p>Next, open up the assistant editor and add an IBOutlet for the label by control-dragging to the file that opens up. Add a button and do the same, but create an action instead.</p>
<p align="center"> <img src="/assets/img/posts/imessage-spam-detection/label-outlet.png" style="width:80%" /> </p>
<p><br /></p>
<p>Open up MessagesViewController, import CoreML, and paste in the following helper method. 
It calculates the tf-idf representation of the user’s text using the SMS dataset.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func tfidf(sms: String) -&gt; MLMultiArray{
    let wordsFile = Bundle.main.path(forResource: "words_ordered", ofType: "txt")
    let smsFile = Bundle.main.path(forResource: "SMSSpamCollection", ofType: "txt")
    do {
        let wordsFileText = try String(contentsOfFile: wordsFile!, encoding: String.Encoding.utf8)
        var wordsData = wordsFileText.components(separatedBy: .newlines)
        wordsData.removeLast() // Trailing newline.
        let smsFileText = try String(contentsOfFile: smsFile!, encoding: String.Encoding.utf8)
        var smsData = smsFileText.components(separatedBy: .newlines)
        smsData.removeLast() // Trailing newline.
        let wordsInMessage = sms.split(separator: " ")
        // One entry per vocabulary word, in the order written by the Python script.
        var vectorized = try MLMultiArray(shape: [NSNumber(integerLiteral: wordsData.count)], dataType: MLMultiArrayDataType.double)
        for i in 0..&lt;wordsData.count{
            let word = wordsData[i]
            if sms.contains(word){
                // Term frequency: share of the message's words equal to this word.
                var wordCount = 0
                for substr in wordsInMessage{
                    if substr.elementsEqual(word){
                        wordCount += 1
                    }
                }
                let tf = Double(wordCount) / Double(wordsInMessage.count)
                // Inverse document frequency over the bundled SMS dataset.
                var docCount = 0
                for sms in smsData{
                    if sms.contains(word) {
                        docCount += 1
                    }
                }
                let idf = log(Double(smsData.count) / Double(docCount))
                vectorized[i] = NSNumber(value: tf * idf)
            } else {
                vectorized[i] = 0.0
            }
        }
        return vectorized
    } catch {
        // Reading a file or allocating the array failed.
        return MLMultiArray()
    }
}
</code></pre></div></div>
<p>Add the following lines to the button-bound function you created:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let copied = UIPasteboard.general.string
if let text = copied {
    let vec = tfidf(sms: text)
    do {
        let prediction = try MessageClassifier().prediction(message: vec).label
        label.text = prediction
    } catch {
        label.text = "No Prediction"
    }
}
</code></pre></div></div>
<p>Finally, change the CoreML code generation language to Swift under Project 
&gt; Build Settings &gt; All:</p>
<p align="center"> <img src="/assets/img/posts/imessage-spam-detection/coreml-lang.png" style="width:80%" /> </p>
<p><br /></p>
<p>… And you’re all done! Congrats! If everything worked right, it should look a little something <a href="https://www.youtube.com/watch?v=i5ZVG8Hph6Q">like this</a> when built.</p>
Tue, 27 Jun 2017 00:00:00 +0000 https://gokul.dev//blog/imessage-spam-detection/
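<p>One thing worth checking if predictions look off: the Swift <code>tfidf</code> helper and scikit-learn's <code>TfidfVectorizer</code> do not use the same weighting. As far as the scikit-learn documentation describes its defaults, the vectorizer lowercases the text, keeps tokens of two or more word characters, uses raw counts for tf, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and then L2-normalizes each vector. The pure-Python sketch below reproduces that recipe on a made-up three-message corpus so the numbers can be compared against what the app computes. The function names (<code>tokenize</code>, <code>fit_idf</code>, <code>transform</code>) and the toy corpus are illustrative assumptions and are not part of the original script.</p>

```python
import math
import re
from collections import Counter

# Assumption: mirrors scikit-learn's documented default token_pattern
# (tokens of 2+ word characters) together with lowercase=True.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN.findall(text.lower())


def fit_idf(corpus):
    """Return (alphabetically sorted vocabulary, smoothed idf per term)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    vocab = sorted(doc_freq)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1; float() keeps Python 2 from truncating.
    idf = {t: math.log(float(1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf


def transform(message, vocab, idf):
    """Return the L2-normalized tf-idf vector of one message, in vocab order."""
    counts = Counter(tokenize(message))
    raw = [counts[t] * idf[t] for t in vocab]  # raw count times idf
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


# Hypothetical toy corpus, for illustration only.
corpus = ["Free entry win a prize now", "Are we still on for lunch", "Win cash now"]
vocab, idf = fit_idf(corpus)
vec = transform("win a free prize", vocab, idf)
```

<p>Here <code>vec</code> has one entry per vocabulary word, in the same alphabetical order that <code>words_ordered.txt</code> uses. If the app's vector for the same message differs noticeably from this one, porting the smoothed idf and the normalization step into the Swift helper may be worth trying, though I haven't measured how much it changes the predictions.</p>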