<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title></title>
    <description></description>
    <link>https://gokul.dev/</link>
    <atom:link href="https://gokul.dev/feed.xml" rel="self" type="application/rss+xml" />
    
      <item>
        <title>iMessage Spam Detection with CoreML</title>
        <description>&lt;h4 id=&quot;background&quot;&gt;Background&lt;/h4&gt;
&lt;p&gt;A couple of years ago, iMessage spam became annoying enough to be covered by a variety of &lt;a href=&quot;http://www.businessinsider.com/imessage-iphone-message-spam-how-to-stop-unwanted-messages-2016-11&quot;&gt;major&lt;/a&gt; &lt;a href=&quot;https://www.wired.com/2014/08/apples-imessage-is-being-taken-over-by-spammers/&quot;&gt;news&lt;/a&gt; &lt;a href=&quot;http://www.macworld.co.uk/how-to/iphone/how-stop-imessage-spam-block-report-imessage-spam-3623845/&quot;&gt;sites&lt;/a&gt;. Apple responded by letting users report messages that didn’t come from their contacts, for review and potential suspension of the sender’s account.&lt;/p&gt;

&lt;p&gt;Detecting spam is an age-old problem that has recently been tackled with great success &lt;a href=&quot;https://gmail.googleblog.com/2015/07/the-mail-you-want-not-spam-you-dont.html&quot;&gt;by machine learning&lt;/a&gt;. Running these models on Apple devices became orders of magnitude easier with the release of &lt;a href=&quot;https://developer.apple.com/machine-learning/&quot;&gt;CoreML&lt;/a&gt; at WWDC 2017. In this post, we develop a simple iMessage app that detects whether a message is spam.&lt;/p&gt;

&lt;h4 id=&quot;about-coreml&quot;&gt;About CoreML&lt;/h4&gt;
&lt;p&gt;Here’s a quick visual of the CoreML stack:&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;/assets/img/posts/imessage-spam-detection/coreml-stack.png&quot; style=&quot;width:60%&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Python models are converted using the &lt;strong&gt;coremltools&lt;/strong&gt; package (not pictured) into Apple’s new &lt;strong&gt;.mlmodel&lt;/strong&gt; format which can then be used on iOS devices with all the GPU/CPU threading/compute enhancements provided by &lt;strong&gt;Accelerate&lt;/strong&gt; (linear algebra library), &lt;strong&gt;BNNS&lt;/strong&gt; (basic neural network subroutines), and &lt;strong&gt;MPS&lt;/strong&gt; (GPU interface).&lt;/p&gt;

&lt;p&gt;Let’s take a look at the &lt;a href=&quot;http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/&quot;&gt;data we’ll be using&lt;/a&gt;. Each message is classified as either “spam” or “ham” (everything else). As expected, the messages are short and use many non-standard words, so we’ll need a model that generalizes easily. The dataset is also ham-dominated: only around 13% of the data is spam, so our model needs to cope well with unbalanced data. A &lt;strong&gt;multinomial naive Bayes classifier&lt;/strong&gt; is the standard in spam detection, but a survey of the &lt;a href=&quot;http://cs229.stanford.edu/proj2013/ShiraniMehr-SMSSpamDetectionUsingMachineLearningApproach.pdf&quot;&gt;literature&lt;/a&gt; indicates that &lt;strong&gt;SVMs&lt;/strong&gt; and &lt;strong&gt;random forests&lt;/strong&gt; are picking up steam. We’re going to try all three of these approaches on top of both the &lt;strong&gt;bag-of-words&lt;/strong&gt; and &lt;strong&gt;tf-idf&lt;/strong&gt; vectorization procedures and choose the best of the 6 to include in our app.&lt;/p&gt;
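
&lt;p&gt;To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. The function name, the toy vocabulary, and the toy message are made up for illustration and are not part of the post’s code; scikit-learn’s CountVectorizer tokenizes more carefully than a simple split on whitespace.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from collections import Counter

def bag_of_words(message, vocabulary):
    # Count each lowercased, whitespace-separated token in the message.
    counts = Counter(message.lower().split())
    # One count per vocabulary word, in vocabulary order.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and message, for illustration only.
vocabulary = [&quot;call&quot;, &quot;free&quot;, &quot;now&quot;, &quot;prize&quot;]
print(bag_of_words(&quot;Free prize call now to claim your free gift&quot;, vocabulary))
# prints [1, 2, 1, 1]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Words outside the vocabulary are simply dropped, which is why the order and contents of the vocabulary have to match between training and prediction.&lt;/p&gt;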

&lt;p&gt;To incorporate an ML model into an iOS app, one needs to:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Train the model in one of the &lt;a href=&quot;https://developer.apple.com/documentation/coreml/converting_trained_models_to_core_ml&quot;&gt;CoreML-supported python frameworks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Convert it into a .mlmodel file through the coremltools python 2.7 package&lt;/li&gt;
  &lt;li&gt;Drop the .mlmodel file into one’s app and use the provided methods to input data and generate predictions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As an aside, it is currently not possible to perform additional training after an .mlmodel has been generated. However, it is possible to build a neural network using nothing other than coremltools - take a look at the neural network builder file under coremltools if curious.&lt;/p&gt;
&lt;h4 id=&quot;choosing-a-model&quot;&gt;Choosing a model&lt;/h4&gt;
&lt;p&gt;All code for this post is available from &lt;a href=&quot;https://github.com/gkswamy98/imessage-spam-detection/tree/master&quot;&gt;here&lt;/a&gt;. First, read in the tab-separated dataset:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Each line of the file has the form &quot;label\tmessage&quot;.
sms_data = []
with open(&apos;SMSSpamCollection.txt&apos;, &apos;r&apos;) as raw_data:
    for line in raw_data:
        # Drop the trailing newline, then split into [label, message].
        split_line = line.rstrip(&quot;\n&quot;).split(&quot;\t&quot;)
        sms_data.append(split_line)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then, divide it up into messages, labels, training, and test:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np
from sklearn.model_selection import train_test_split

sms_data = np.array(sms_data)
X = sms_data[:, 1]  # message text
y = sms_data[:, 0]  # &quot;ham&quot; or &quot;spam&quot; label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=22)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Build the 6 pipelines:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

pipeline_1 = Pipeline([(&apos;vect&apos;, CountVectorizer()),(&apos;clf&apos;, MultinomialNB())])
pipeline_2 = Pipeline([(&apos;vect&apos;, CountVectorizer()),(&apos;tfidf&apos;, TfidfTransformer()),(&apos;clf&apos;, MultinomialNB())])
pipeline_3 = Pipeline([(&apos;vect&apos;, CountVectorizer()),(&apos;clf&apos;, SGDClassifier())])
pipeline_4 = Pipeline([(&apos;vect&apos;, CountVectorizer()),(&apos;tfidf&apos;, TfidfTransformer()),(&apos;clf&apos;, SGDClassifier())])
pipeline_5 = Pipeline([(&apos;vect&apos;, CountVectorizer()),(&apos;clf&apos;, RandomForestClassifier())])
pipeline_6 = Pipeline([(&apos;vect&apos;, CountVectorizer()),(&apos;tfidf&apos;, TfidfTransformer()),(&apos;clf&apos;, RandomForestClassifier())])
pipelines = [pipeline_1, pipeline_2, pipeline_3, pipeline_4, pipeline_5, pipeline_6]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now the fun part - perform the classification and check &lt;a href=&quot;http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html&quot;&gt;precision/recall&lt;/a&gt; (we only have 2 classes and we want both a low false positive rate and a low false negative rate):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from sklearn.metrics import classification_report

for pipeline in pipelines:
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    # Labels are sorted alphabetically, so &quot;ham&quot; comes before &quot;spam&quot;.
    print(classification_report(y_test, y_pred, target_names=[&quot;ham&quot;, &quot;spam&quot;]))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In my testing, the SVM (SGDClassifier with its default hinge loss is a linear SVM) performs best, with an average precision of 99%, a result that is supported by &lt;a href=&quot;http://ats.cs.ut.ee/u/kt/hw/spam/spam.pdf&quot;&gt;work in the field&lt;/a&gt;. Using tf-idf doesn’t seem to have a large influence on the classification result, but since it is common practice we’re going to include it as a step in our pipeline.&lt;/p&gt;
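
&lt;p&gt;For reference, the precision and recall figures that classification_report prints can be computed by hand. The sketch below uses a hypothetical helper and made-up labels (not results from the SMS dataset), and treats “spam” as the positive class.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def precision_recall(y_true, y_pred, positive=&quot;spam&quot;):
    pairs = list(zip(y_true, y_pred))
    # True positives: spam correctly flagged as spam.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: ham wrongly flagged as spam.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: spam that slipped through as ham.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels, made up for illustration only.
y_true = [&quot;ham&quot;, &quot;ham&quot;, &quot;spam&quot;, &quot;spam&quot;, &quot;ham&quot;, &quot;spam&quot;]
y_pred = [&quot;ham&quot;, &quot;spam&quot;, &quot;spam&quot;, &quot;ham&quot;, &quot;ham&quot;, &quot;spam&quot;]
print(precision_recall(y_true, y_pred))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here 2 of the 3 messages flagged as spam really are spam, and 2 of the 3 actual spam messages are caught, so both precision and recall come out to 2/3.&lt;/p&gt;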

&lt;h4 id=&quot;creating-a-model-file&quot;&gt;Creating a model file&lt;/h4&gt;
&lt;p&gt;coremltools is a python 2.7 package, so make sure to run the following step with the &lt;strong&gt;appropriate python version&lt;/strong&gt;. To install python 2.7 alongside your existing python on a Mac and switch to it, use pyenv:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;brew install pyenv
pyenv install 2.7.12
pyenv global 2.7.12
pyenv rehash
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then run:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip install scipy
pip install scikit-learn
pip install coremltools
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To create the .mlmodel file, run the following lines. Note that as of the writing of this post, CoreML does not support tf-idf or count vectorizers so we’ll have to calculate the tf-idf representation in the app. For that, we need an ordered list of words that we also generate below.&lt;/p&gt;

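&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import coremltools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer()
vectorized = vectorizer.fit_transform(X)
# Save the vocabulary in feature order so the app can rebuild the input vector.
words = open(&apos;words_ordered.txt&apos;, &apos;w&apos;)
for feature in vectorizer.get_feature_names():
    words.write(feature.encode(&apos;utf-8&apos;) + &apos;\n&apos;)  # python 2: encode unicode to bytes
words.close()
# Train on the full dataset, then convert with input &quot;message&quot; and output &quot;label&quot;.
model = LinearSVC()
model.fit(vectorized, y)
coreml_model = coremltools.converters.sklearn.convert(model, &quot;message&quot;, &apos;label&apos;)
coreml_model.save(&apos;MessageClassifier.mlmodel&apos;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;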

&lt;p&gt;You can download all of the above commands as one file from &lt;a href=&quot;https://github.com/gkswamy98/imessage-spam-detection/blob/master/spam_detection.py&quot;&gt;here&lt;/a&gt; and the generated model from &lt;a href=&quot;https://github.com/gkswamy98/imessage-spam-detection/blob/master/MessageClassifier.mlmodel&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;creating-the-imessage-app&quot;&gt;Creating the iMessage App&lt;/h4&gt;
&lt;p&gt;Create a new iMessage App project in Xcode 9 and drop in the original text file as well as the model and the ordered word list we just generated. Your directory structure should look something like this:&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
    &lt;img src=&quot;/assets/img/posts/imessage-spam-detection/file-structure.png&quot; style=&quot;width:30%&quot; /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;
Open up the Main Storyboard and change the text of the label from “Hello World” to “Copy a Message”:&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
    &lt;img src=&quot;/assets/img/posts/imessage-spam-detection/copy-message.png&quot; style=&quot;width:80%&quot; /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Next, open up the assistant editor and add an IBOutlet for the label by control-dragging to the file that opens up. Add a button and do the same but create an action instead.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
    &lt;img src=&quot;/assets/img/posts/imessage-spam-detection/label-outlet.png&quot; style=&quot;width:80%&quot; /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

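&lt;p&gt;Open up MessagesViewController, import CoreML, and paste in the following helper method. It calculates a tf-idf representation of the user’s text, using the ordered word list for the vector positions and the SMS dataset for the document frequencies. Note that this is a simplified tf-idf: scikit-learn’s TfidfVectorizer by default also lowercases the text, smooths the idf term, and L2-normalizes each vector, so the values here will not match the training features exactly.&lt;/p&gt;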

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;func tfidf(sms: String) -&amp;gt; MLMultiArray{
        let wordsFile = Bundle.main.path(forResource: &quot;words_ordered&quot;, ofType: &quot;txt&quot;)
        let smsFile = Bundle.main.path(forResource: &quot;SMSSpamCollection&quot;, ofType: &quot;txt&quot;)
        do {
            let wordsFileText = try String(contentsOfFile: wordsFile!, encoding: String.Encoding.utf8)
            var wordsData = wordsFileText.components(separatedBy: .newlines)
            wordsData.removeLast() // Trailing newline.
            let smsFileText = try String(contentsOfFile: smsFile!, encoding: String.Encoding.utf8)
            var smsData = smsFileText.components(separatedBy: .newlines)
            smsData.removeLast() // Trailing newline.
            let wordsInMessage = sms.split(separator: &quot; &quot;)
            // One entry per vocabulary word, in the same order as words_ordered.txt.
            let vectorized = try MLMultiArray(shape: [NSNumber(integerLiteral: wordsData.count)],
                                              dataType: MLMultiArrayDataType.double)
            for i in 0..&amp;lt;wordsData.count{
                let word = wordsData[i]
                if sms.contains(word){
                    // Term frequency: share of the message&apos;s words equal to this word.
                    var wordCount = 0
                    for substr in wordsInMessage{
                        if substr.elementsEqual(word){
                            wordCount += 1
                        }
                    }
                    let tf = Double(wordCount) / Double(wordsInMessage.count)
                    // Document frequency: number of dataset lines containing this word.
                    var docCount = 0
                    for document in smsData{
                        if document.contains(word) {
                            docCount += 1
                        }
                    }
                    // Avoid dividing by zero when no dataset line contains the word.
                    let idf = docCount &amp;gt; 0 ? log(Double(smsData.count) / Double(docCount)) : 0.0
                    vectorized[i] = NSNumber(value: tf * idf)
                } else {
                    vectorized[i] = 0.0
                }
            }
            return vectorized
        } catch {
            return MLMultiArray()
        }
    }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
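
&lt;p&gt;If you want to sanity-check the helper’s output off-device, the same formula can be sketched in a few lines of Python: tf is the word’s count divided by the number of words in the message, and idf is the log of the number of documents divided by the number of documents containing the word. The function name and the toy data below are assumptions for illustration; this mirrors the helper above rather than scikit-learn’s TfidfVectorizer.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import math

def simple_tfidf(message, vocabulary, corpus):
    tokens = message.split(&quot; &quot;)
    vector = []
    for word in vocabulary:
        # Exact token matches in the message.
        count = tokens.count(word)
        # Documents that contain the word as a substring.
        doc_count = sum(1 for doc in corpus if word in doc)
        if count == 0 or doc_count == 0:
            vector.append(0.0)
            continue
        tf = count / float(len(tokens))
        idf = math.log(len(corpus) / float(doc_count))
        vector.append(tf * idf)
    return vector

# Hypothetical vocabulary and corpus, for illustration only.
vocabulary = [&quot;free&quot;, &quot;hello&quot;, &quot;prize&quot;]
corpus = [&quot;free prize inside&quot;, &quot;hello there&quot;, &quot;free entry today&quot;, &quot;hello again friend&quot;]
print(simple_tfidf(&quot;free prize&quot;, vocabulary, corpus))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this toy corpus, “prize” appears in fewer documents than “free”, so it receives the larger weight, while “hello” is absent from the message and gets 0.&lt;/p&gt;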

&lt;p&gt;Add the following lines to the button-bound function you created:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;let copied = UIPasteboard.general.string
        if let text = copied {
            let vec = tfidf(sms: text)
            do {
                let prediction = try MessageClassifier().prediction(message: vec).label
                label.text = prediction
            } catch {
                label.text = &quot;No Prediction&quot;
            }
        }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, change the CoreML code generation language to Swift under Project &amp;gt; Build Settings &amp;gt; All:&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
    &lt;img src=&quot;/assets/img/posts/imessage-spam-detection/coreml-lang.png&quot; style=&quot;width:80%&quot; /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;… And you’re all done! Congrats! If everything worked right, it should look a little something &lt;a href=&quot;https://www.youtube.com/watch?v=i5ZVG8Hph6Q&quot;&gt;like this&lt;/a&gt; when built.&lt;/p&gt;
</description>
        <pubDate>Tue, 27 Jun 2017 00:00:00 +0000</pubDate>
        <link>https://gokul.dev//blog/imessage-spam-detection/</link>
        <guid isPermaLink="true">https://gokul.dev//blog/imessage-spam-detection/</guid>
      </item>
    
  </channel>
</rss>
