Training a Wasserstein GAN on the Free Google Colab TPU
Nianlong Gu · 2019-11-08

The Jupyter notebook is available in my GitHub repo. Click HERE to try it live on Google Colab!

  1. Why use a TPU?
    A TPU is much faster than a GPU. A single TPU contains 8 cores, each with 8 GB of memory. During training, each batch of data is dispatched equally to all 8 cores, so the equivalent memory size is 8 x 8 = 64 GB. This makes it possible to train fairly large models.

  2. How do we run a GAN on a TPU?
    By using TPUEstimator.
    Unlike a classic classification task, training a GAN alternates between updating the generator and the discriminator. This makes it impossible to simply use Keras on a TPU. TPUEstimator lets you configure the network and the training optimizer behavior more flexibly.
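As a quick sketch of the arithmetic in point 1 (the batch size of 512 is the one used later in this post; a global batch must be divisible by the number of cores):

```python
# 8 TPU cores, 8 GB of memory each
num_cores = 8
per_core_memory_gb = 8
total_memory_gb = num_cores * per_core_memory_gb
print(total_memory_gb)            # 64

# a global batch is split evenly across the cores
global_batch_size = 512
per_core_batch_size = global_batch_size // num_cores
print(per_core_batch_size)        # 64
```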

Let's start coding.

# here we force google colab to use tensorflow 1.x, the configuration will be slightly different for tf 2.0
%tensorflow_version 1.x
import tensorflow as tf
# tf.enable_eager_execution()
import numpy as np
# we use mnist dataset as an example
import keras.datasets.mnist as mnist
import math
import os
import matplotlib.pyplot as plt
import imageio

from google.colab import auth
auth.authenticate_user()

The command auth.authenticate_user() is needed to access Google Cloud Storage, where we save/restore the model and load the training/testing data.

Step 0: Some helper functions for visualizing the results

def add_padding(x, padding_size=(2, 2, 2, 2), padding_value=1):
    # x is a 4-d ndarray with values in [0, 1];
    # padding_size is (top, left, bottom, right)
    background = padding_value * np.ones(
        [x.shape[0],
         x.shape[1] + padding_size[0] + padding_size[2],
         x.shape[2] + padding_size[1] + padding_size[3],
         x.shape[3]]).astype(np.float32)
    background[:, padding_size[0]:-padding_size[2],
               padding_size[1]:-padding_size[3], :] = x
    return background

# to convert a bulk of images into grid of images
def make_grid(images,  ncol= None):
	# ncol is the number of columns of the image grid; if ncol is None,
	# arrange the grid as close to a square as possible.
	# This function assumes float images; if their range exceeds [0, 1],
	# they are assumed to lie in [-1, 1] and are rescaled to [0, 1].
	if np.max(images) - np.min(images) > 1:
		images = np.clip(images, -1, 1)
		images = images / 2 + 0.5
		
	image_num = images.shape[0]
	num_h = None
	num_w = None
	im_h = images.shape[1]
	im_w = images.shape[2]   
	im_c = images.shape[3]
	if ncol is None:
		num_w = int( np.ceil(np.sqrt(image_num )))
		num_h = int( np.ceil( image_num/ num_w ))
	else:
		num_w = int(ncol)
		num_h = int( np.ceil(  image_num/num_w ))

	# create a white panel, which is a [height, width, channel] ndarray
	pannel = np.ones((num_h * im_h, num_w * im_w, im_c)).astype(np.float32)

	for i in range(image_num):
		start_h = (i // num_w) * im_h
		start_w = (i % num_w) * im_w
		pannel[start_h: start_h + im_h, start_w: start_w + im_w, :] = images[i, :, :, :]
	return pannel
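The near-square layout arithmetic inside make_grid can be checked on its own; grid_dims below is a helper name introduced here just for illustration:

```python
import numpy as np

def grid_dims(image_num, ncol=None):
    # mirrors the row/column computation in make_grid above
    if ncol is None:
        num_w = int(np.ceil(np.sqrt(image_num)))
    else:
        num_w = int(ncol)
    num_h = int(np.ceil(image_num / num_w))
    return num_h, num_w

print(grid_dims(64))     # (8, 8)
print(grid_dims(10))     # (3, 4): last row is partially empty
print(grid_dims(10, 5))  # (2, 5)
```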

Step 1: Preparation of the training dataset

When using a TPU in a practical scenario, it is recommended to use tf.data.TFRecordDataset. The reason not to use tf.data.Dataset.from_tensor_slices is that it stores the training data directly in the computation graph, which consumes a lot of memory, especially for a large training dataset.

A typical workflow is:

  1. generate the TFRecord files locally;
  2. upload them to a Google Cloud Storage bucket;
  3. load the data with TFRecordDataset during training.

Get the original image data and write it into a TFRecord file. Each image corresponds to a single record in the TFRecord file.

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train= X_train[:,:,:,np.newaxis]
X_test = X_test[ :,:,:,np.newaxis ]
X_train = (X_train/255).astype(np.float32)
X_test = (X_test/255).astype(np.float32)

print(X_train.shape)
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
(60000, 28, 28, 1)

The image shape will be used later. Now we write the MNIST images into TFRecord files.

def make_tfrecord( file_name, images ):
    # define a TFRecordWriter
    writer = tf.python_io.TFRecordWriter(file_name, options= tf.io.TFRecordOptions(tf.io.TFRecordCompressionType.GZIP) )
    for img in images:
        if isinstance( img, str ):  # This case img is the path to the image file
            img =imageio.imread(img)
            img = (img/255.0).astype(np.float32)
        features = { 
            "image": tf.train.Feature( float_list = tf.train.FloatList( value = img.reshape(-1) ) ),
            "image_shape": tf.train.Feature( int64_list = tf.train.Int64List( value = img.shape )  )}
        # tf_serialized_example contains the serialized information about the value and shape of an image
        tf_serialized_example = tf.train.Example( features = tf.train.Features( feature = features ) ).SerializeToString()
        writer.write( tf_serialized_example )
    writer.close()
make_tfrecord( 'mnist_train.tfrecord', X_train  )
make_tfrecord( 'mnist_eval.tfrecord', X_test[:5000] )

Copy the generated TFRecord files to the GCS bucket.

! gsutil cp mnist_train.tfrecord mnist_eval.tfrecord gs://gan-tpu-tutorial/data
Copying file://mnist_train.tfrecord [Content-Type=application/octet-stream]...
Copying file://mnist_eval.tfrecord [Content-Type=application/octet-stream]...
- [2 files][ 16.9 MiB/ 16.9 MiB]                                                
Operation completed over 2 objects/16.9 MiB.                                     

One can also load multiple TFRecords by

ds=tf.data.TFRecordDataset(["record1.tfrecord","record2.tfrecord","record3.tfrecord"], compression_type='GZIP')

Step 2: Configure the TPUEstimator

A TPUEstimator mainly consists of two parts: the train/eval/predict input functions and the model function.

Prepare the input functions for the TPUEstimator

def parse_tfrecord_func( serialized_record ):
    parse_dic = { "image": tf.FixedLenFeature(shape=(28,28,1), dtype = tf.float32 ),
                  "image_shape": tf.FixedLenFeature( shape=(3,), dtype = tf.int64 ) 
                }
    parsed_record = tf.parse_single_example( serialized_record, parse_dic )
    ## note that parse_single_example must be applied before batch();
    ## parse_example can only be applied after batch()
    return {"image": parsed_record["image"]}
 
def train_input_fn( batch_size ):
    
    dataset_x_train = tf.data.TFRecordDataset([ "gs://gan-tpu-tutorial/data/mnist_train.tfrecord" ], compression_type='GZIP')
    
    dataset_x_train = dataset_x_train.shuffle(60000).repeat()
    pattern = np.array([0,0,0,0,1]).repeat(batch_size).astype(np.float32)
    dataset_g_w = tf.data.Dataset.from_tensor_slices( { "g_w": pattern  } ).repeat()
    dataset_output = tf.data.Dataset.from_tensor_slices( ( np.zeros( ( batch_size ) ).astype(np.float32) ) ).repeat()
    
    ds = tf.data.Dataset.zip(( dataset_x_train, dataset_g_w, dataset_output))
    def merge_func(a,b,c):
        a = parse_tfrecord_func(a)
        a.update(b)
        return a, c
    ds = ds.map(merge_func)
    return ds.batch( batch_size, drop_remainder = True ).prefetch(buffer_size =1)

def eval_input_fn( batch_size ):
    dataset_x_eval = tf.data.TFRecordDataset([ "gs://gan-tpu-tutorial/data/mnist_eval.tfrecord" ], compression_type='GZIP')
    dataset_x_eval = dataset_x_eval.shuffle(10000).repeat()
    dataset_output = tf.data.Dataset.from_tensor_slices( ( np.zeros( ( batch_size ) ).astype(np.float32) ) ).repeat()
    
    ds = tf.data.Dataset.zip(( dataset_x_eval, dataset_output))
    def merge_func(a,b):
        a = parse_tfrecord_func(a)
        return a, b
    ds = ds.map(merge_func)
    return ds.batch(batch_size, drop_remainder = True).prefetch(buffer_size =1)

def predict_input_fn( z ):
    dataset_z_input = tf.data.Dataset.from_tensor_slices( ( z.astype(np.float32), np.zeros(( z.shape[0],) ).astype(np.float32)  ) )
    return dataset_z_input.batch(64, drop_remainder= False)

  1. dataset_x_train is loaded from the TFRecordDataset; parse_tfrecord_func converts the serialized TFRecord examples into image tensors.
  2. g_w controls when to update the generator’s parameters. The pattern “0 0 0 0 1” means the generator is trained once every n_critic = 5 batches, as stated in WGAN-GP. This setting is due to the special properties of the TPU mechanism.
  3. dataset_output provides dummy values (0) for the label parameter of the estimator.train() function.
  4. Each batch of ds contains the following contents:
     features: {"image": image_tensors, "g_w": g_w }
     labels: dataset_output (dummy value 0)

  5. predict_input_fn doesn’t use a TFRecord: during testing of a GAN we want to feed in some random latent vectors and get the output, so Dataset.from_tensor_slices is more flexible.
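The effect of the “0 0 0 0 1” pattern can be reproduced with plain NumPy (a minimal sketch with a tiny batch size; no tf.data involved):

```python
import numpy as np

batch_size = 4   # small value just for illustration
n_critic = 5

# np.repeat repeats each element batch_size times, so the stream is
# 4 batches' worth of zeros followed by 1 batch's worth of ones
pattern = np.array([0, 0, 0, 0, 1]).repeat(batch_size).astype(np.float32)

# after batching, every example in a batch shares the same g_w value,
# and exactly one batch out of every n_critic has g_w = 1
batches = pattern.reshape(n_critic, batch_size)
print(batches.sum(axis=1))   # [0. 0. 0. 0. 4.]
```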

Define the model_fn part of the TPUEstimator

Define the generator and discriminator
def generator( z, scope="generator", trainable= True ):
    with tf.variable_scope( scope, reuse= tf.AUTO_REUSE ):
        net = tf.layers.BatchNormalization()( tf.layers.Dense( 7*7*128, activation= tf.nn.relu )(z) )
        net = tf.reshape( net, [ tf.shape(net)[0], 7, 7, 128 ] )
        net = tf.layers.BatchNormalization()( tf.layers.Conv2DTranspose( 64, 5, (2,2), "same", activation= tf.nn.relu )(net) )
        net = tf.layers.BatchNormalization()( tf.layers.Conv2DTranspose( 32, 5, (2,2), "same", activation= tf.nn.relu )(net) )
        net = tf.layers.Conv2D(1, 5, (1,1), "same", activation= tf.nn.sigmoid  )(net)
        return net

def discriminator( x, scope="discriminator", trainable = True ):
    with tf.variable_scope( scope, reuse= tf.AUTO_REUSE ):
        net = tf.layers.Conv2D( 32, 5, (2,2), "same", activation= tf.nn.leaky_relu )(x) 
        net = tf.layers.BatchNormalization()( tf.layers.Conv2D( 64, 5, (2,2), "same", activation= tf.nn.leaky_relu )(net) )
        net = tf.layers.BatchNormalization()( tf.layers.Conv2D( 128, 5, (2,2), "same", activation= tf.nn.leaky_relu )(net) )
        net = tf.layers.Flatten()(net)
        net = tf.layers.BatchNormalization()( tf.layers.Dense( 128, activation= tf.nn.leaky_relu  )(net) )
        net = tf.layers.Dense( 1 )(net)
        return net
Define some metric functions for evaluation
def metric_fn(loss_gen, loss_dis, W_dis ):
    """Function to return metrics for evaluation.
    The input parameters can be arbitrary.
    """
    return {"loss_gen": tf.metrics.mean(loss_gen), 
            "loss_dis": tf.metrics.mean(loss_dis),
            "wasserstein_distance": tf.metrics.mean( W_dis ),
            }
Define the model_fn
def model_fn(features, labels, mode, params):
    
    lr = params["learning_rate"]
    z_dim = params["z_dim"]
    
    if mode == tf.estimator.ModeKeys.TRAIN or mode == tf.estimator.ModeKeys.EVAL:
        """ Part I. create the model networks"""
        x = features["image"]
        is_train = mode == tf.estimator.ModeKeys.TRAIN
        random_z = tf.random.normal( [tf.shape(x)[0], z_dim ]  )
        gen_x = generator( random_z, trainable= is_train )
        dis_x = discriminator( x, trainable= is_train )
        dis_gen_x = discriminator( gen_x, trainable= is_train )

        # This is used to compute the gradient penalty
        epsilon = tf.random.uniform( [ tf.shape(x)[0],1,1,1 ], minval=0, maxval= 1 )
        interp_x = epsilon * x + (1-epsilon) * gen_x
        dis_interp_x = discriminator( interp_x, trainable= is_train )
        gradient_x = tf.gradients( dis_interp_x, [ interp_x ]  )[0]
        gradient_penalty = tf.square( tf.sqrt( tf.reduce_sum( tf.square(gradient_x ),[1,2,3] ) ) - 1  )
        LAMBDA = 10
    
        """Part II. define the loss and relative parameters for mode == TRAIN/EVAL/PREDICT"""
        ## compute loss
        loss_dis = dis_gen_x  - dis_x + LAMBDA * gradient_penalty 
        loss_gen = - dis_gen_x
        W_dis = dis_x - dis_gen_x

        ## operations for the training mode, define the optimizer, and reconfig it using tpu.CrossShardOptimizer
        if mode == tf.estimator.ModeKeys.TRAIN:
            g_w = features["g_w"]
            loss_dis = tf.reduce_mean( loss_dis   )  
            ## when g_w = 0, loss_gen's gradient is 0, so the generator is not trained on the current batch
            loss_gen = tf.reduce_mean( loss_gen * g_w)
            W_dis = tf.reduce_mean(W_dis)

            # Define the optimizer
            d_optimizer = tf.train.AdamOptimizer(learning_rate=lr, beta1=0, beta2= 0.99 )
            g_optimizer = tf.train.AdamOptimizer(learning_rate=lr, beta1=0, beta2= 0.99 )
            # convert to TPU optimizer version
            d_optimizer = tf.tpu.CrossShardOptimizer(d_optimizer)
            g_optimizer = tf.tpu.CrossShardOptimizer(g_optimizer)

            with tf.control_dependencies( tf.get_collection( tf.GraphKeys.UPDATE_OPS )):
                d_op = d_optimizer.minimize( loss = loss_dis, var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,\
                                           scope="discriminator")  )
                g_op = g_optimizer.minimize(loss = loss_gen,  var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,\
                                           scope="generator"),global_step= tf.train.get_global_step() )
                # This group command will group the discriminator optimization and generator optimization together
                # all the optimizations in the group will be run during each batch
                # g_w can control whether to update the parameters of generator or not, which plays the role of n_critic in WGAN
                train_op = tf.group( [ d_op , g_op] )
                spec= tf.estimator.tpu.TPUEstimatorSpec(mode=mode, loss= W_dis ,train_op= train_op  )
    
        ## for EVAL mode, the eval_metrics parameter takes a tuple or list of two elements: the first is a callable function,
        ## the second is a list of its arguments. The return value of the callable will be shown in the evaluation results
        elif mode == tf.estimator.ModeKeys.EVAL:
            spec = tf.estimator.tpu.TPUEstimatorSpec(mode=mode, loss= tf.reduce_mean(W_dis), eval_metrics=(metric_fn, [loss_gen, loss_dis, W_dis ] ) )
    
    elif mode == tf.estimator.ModeKeys.PREDICT:
        """ construct the model (only the generator part) """
        input_z = features
        gen_x = generator( input_z, trainable= False )

        """Define the predictions"""
        predictions = { "generated_images":  gen_x   }
        spec= tf.estimator.tpu.TPUEstimatorSpec( mode = mode, predictions = predictions )

    return spec
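The gradient-penalty term above can be sanity-checked in NumPy on a toy linear critic f(x) = w . x, whose gradient with respect to the input is w everywhere; this is an illustrative sketch, not the TF graph code:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy linear critic f(x) = w . x; its input gradient is w everywhere
w = np.array([0.6, 0.8])          # ||w|| = 1, so the penalty should vanish
x_real = rng.normal(size=(16, 2))
x_fake = rng.normal(size=(16, 2))

# random interpolation between real and fake samples, as in model_fn
epsilon = rng.uniform(size=(16, 1))
x_interp = epsilon * x_real + (1 - epsilon) * x_fake

grad = np.tile(w, (16, 1))        # analytic gradient of f at every x_interp
grad_norm = np.sqrt(np.sum(grad ** 2, axis=1))
penalty = (grad_norm - 1.0) ** 2  # WGAN-GP penalizes deviation from norm 1
print(penalty.max())              # ~0: a unit-norm gradient incurs no penalty
```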

Create the TPUEstimator instance, and run train / evaluate / predict

iterations_per_loop is the number of batches fed to the TPU before control returns to the host CPU.

model_dir="gs://gan-tpu-tutorial/model"
tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
iterations_per_loop = 200

run_config = tf.estimator.tpu.RunConfig(
        model_dir=model_dir,
        cluster=tf.distribute.cluster_resolver.TPUClusterResolver(tpu_address),
        session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True),
        tpu_config=tf.estimator.tpu.TPUConfig(iterations_per_loop),
        )

model = tf.estimator.tpu.TPUEstimator(
                               model_fn=model_fn,
                               params = {"learning_rate": 1e-3, "z_dim": 100 },
                               config = run_config,
                               use_tpu= True,
                               train_batch_size=512  ,
                               eval_batch_size=512 ,
                               predict_batch_size= 64,
                              ) 

Training

What is the relationship between max_steps and epochs?

  • max_epoch = max_steps * batch_size / total_number_of_training_samples
model.train( input_fn = lambda params: train_input_fn( params["batch_size"] ), max_steps= 10000 )
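Plugging in the values used here (max_steps = 10000, batch size 512, 60000 MNIST training samples):

```python
max_steps = 10000
batch_size = 512
num_train_samples = 60000   # size of the MNIST training set

max_epochs = max_steps * batch_size / num_train_samples
print(round(max_epochs, 1))   # 85.3
```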

Evaluate

eval_result = model.evaluate(input_fn=lambda params: eval_input_fn( params["batch_size"]), steps = 10)
eval_result

Predict

random_z = np.random.normal( size=(1024, 100)  ).astype(np.float32)
pred_results = model.predict( input_fn=lambda params: predict_input_fn(random_z) )
images = np.array([ result["generated_images"] for result in pred_results  ])
print("generated images")
plt.figure(figsize = (5,5)) 
plt.gray()
plt.imshow( make_grid(add_padding(images[np.random.choice( images.shape[0], 64, replace= False )  ])).squeeze() )
plt.show()

This is one example of generated images:

[Figure: a grid of generated MNIST digits]


EM Algorithm and Gaussian Mixture Model for Clustering
Nianlong Gu · 2019-07-10

In the last post on the EM algorithm, we introduced the derivation of the EM algorithm and used it to solve the MLE of the heads probabilities of two coins. In this post, we apply the EM algorithm to a more practical and useful problem, the Gaussian Mixture Model (GMM), and discuss using GMM for clustering.

First, let’s recall the EM algorithm:

Suppose that we have the observations \(\{\mathbf{x}^{(i)}\}, i=1,\dots,n\). Each \(\mathbf{x}^{(i)}\) is related to a hidden variable \(\mathbf{z}\) which is unknown to us. The task is to find the MLE of \(\theta\):

$$ \theta_\text{MLE} = \arg \max_{\theta} \sum_{i=1}^{n}\log \sum_{\mathbf{z}} p_\theta(\mathbf{z}, \mathbf{x}^{(i)}) $$
The EM algorithm works as follows:

  1. Randomly initialize \(\theta\), set the \(\mathbf{z}\) prior \(p(\mathbf{z})\)
  2. Repeat:
    At the \(l^\text{th}\) iteration:
    • E step:
      set \(Q_{l}^{(i)}(\mathbf{z})=p_{\theta_{l-1}}(\mathbf{z}\vert \mathbf{x}^{(i)})\) for \(i=1,\dots,n\)
    • M step:
      update \(\theta_{l}=\arg \max_{\theta} \sum_{i=1}^{n}\sum_{\mathbf{z}}Q_{l}^{(i)}(\mathbf{z})\log \frac{p_{\theta}(\mathbf{z}, \mathbf{x}^{(i)})}{Q_{l}^{(i)}(\mathbf{z})}\)
    • Update the prior \(p(\mathbf{z})\) (optional)

      Until \(\theta\) converges.

Based on the experience of solving the coin-tossing problem with EM, we can further refine the EM algorithm:

  1. In the E step, according to Bayes Theorem, we have \(Q_{l}^{(i)}(\mathbf{z})=p_{\theta_{l-1}}(\mathbf{z}\vert \mathbf{x}^{(i)})=\frac{p_{\theta_{l-1}}(\mathbf{x}^{(i)}\vert \mathbf{z})p(\mathbf{z})}{p_{\theta_{l-1}}(\mathbf{x}^{(i)})}=\frac{ p_{\theta_{l-1}}(\mathbf{x}^{(i)}\vert \mathbf{z})p(\mathbf{z}) }{ \sum_{\mathbf{z}}p_{\theta_{l-1}}(\mathbf{x}^{(i)}\vert \mathbf{z})p(\mathbf{z}) }\).
    Here \(p(\mathbf{z})\) is the prior of the latent variable \(\mathbf{z}\). Either we initialize it as fixed distribution, or we dynamically update it over each iteration. If the number of the value of the variable \(\mathbf{z}\) is finite and traversable, we can directly compute the sum \(\sum_{\mathbf{z}}p_{\theta_{l-1}}(\mathbf{x}^{(i)}\vert \mathbf{z})p(\mathbf{z})\), and then compute the posterior \(Q_{l}^{(i)}(\mathbf{z})\). For example, in two coin tossing problem, \(\mathbf{z}\) can only be coin A (1) or coin B (0).
  2. In the M step, the objective function
    $$\begin{align} L(\theta, Q_{l}^{(i)} ) &= \sum_{i=1}^{n}\sum_{\mathbf{z}}Q_{l}^{(i)}(\mathbf{z})\log \frac{p_{\theta}(\mathbf{z},\mathbf{x}^{(i)})}{Q_{l}^{(i)}(\mathbf{z})}\\ &= \sum_{i=1}^{n}\sum_{\mathbf{z}}Q_{l}^{(i)}(\mathbf{z})\log \frac{p_{\theta}(\mathbf{x}^{(i)}\vert \mathbf{z})p(\mathbf{z})}{Q_{l}^{(i)}(\mathbf{z})}\\ &= \sum_{i=1}^{n}\sum_{\mathbf{z}}Q_{l}^{(i)}(\mathbf{z})\log p_{\theta}(\mathbf{x}^{(i)}\vert \mathbf{z}) + \sum_{i=1}^{n}\sum_{\mathbf{z}}Q_{l}^{(i)}(\mathbf{z})\log \frac{p(\mathbf{z})}{Q_{l}^{(i)}(\mathbf{z})} \end{align} $$
    Since \(Q_{l}^{(i)}\) is computed in the E step, in the M step it is treated as independent of \(\theta\). Moreover, the prior \(p(\mathbf{z})\) is also assumed to be independent of \(\theta\). Therefore, the term \(\sum_{i=1}^{n}\sum_{\mathbf{z}}Q_{l}^{(i)}(\mathbf{z})\log \frac{p(\mathbf{z})}{Q_{l}^{(i)}(\mathbf{z})}\) is irrelevant to \(\theta\), and the equivalent update rule of \(\theta\) can be written as:
    $$\theta_{l}= \arg \max_{\theta} \sum_{i=1}^{n}\sum_{\mathbf{z}}Q_{l}^{(i)}(\mathbf{z})\log p_{\theta}(\mathbf{x}^{(i)}\vert \mathbf{z}) $$
  3. Update the prior \(p(\mathbf{z})\). The initialized prior may not be the real prior, so we update it during the iterations. The new prior is the distribution \(p(\mathbf{z})\) that maximizes the M-step objective \(L(\theta, Q_{l}^{(i)})\). In the expression of \(L(\theta, Q_{l}^{(i)})\), only the term \(\sum_{i=1}^{n}\sum_{\mathbf{z}}Q_{l}^{(i)}(\mathbf{z})\log {p(\mathbf{z})}\) depends on \(p(\mathbf{z})\). Therefore, the update of \(p(\mathbf{z})\) solves:
    $$ \min_{p(\mathbf{z})} -\sum_{i=1}^{n}\sum_{\mathbf{z}} Q_l^{(i)}(\mathbf{z}) \log{p(\mathbf{z}) }\\ \text{s.t.}\ \sum_{\mathbf{z}} p(\mathbf{z}) = 1 $$
    The Lagrangian function
    $$L(p(\mathbf{z}), \lambda)= -\sum_{i=1}^{n}\sum_{\mathbf{z}} Q_l^{(i)}(\mathbf{z}) \log{p(\mathbf{z})} + \lambda( \sum_{\mathbf{z}} p(\mathbf{z}) - 1 ) $$
    By solving \(\frac{\partial{L(p(\mathbf{z}),\lambda)}}{\partial{p(\mathbf{z})}}=-\sum_{i=1}^{n}\frac{Q_{l}^{(i)}(\mathbf{z})}{p(\mathbf{z})}+\lambda=0\), we have the primal optimum \(p^{\star}(\mathbf{z})=\frac{\sum_{i=1}^{n}Q_l^{(i)}(\mathbf{z})}{\lambda}\). Substituting it to the Lagrangian function, we have the dual function:
    $$d(\lambda)=-\sum_{i=1}^{n}\sum_{\mathbf{z}}Q_l^{(i)}(\mathbf{z})\log\frac{\sum_{k=1}^{n}Q_{l}^{(k)}(\mathbf{z})}{\lambda}+\lambda(\sum_{\mathbf{z}}\frac{\sum_{k=1}^{n}Q_{l}^{(k)}(\mathbf{z}) }{\lambda} -1)$$
    Therefore, the dual problem is:
    $$ \lambda^\star=\arg\max_{\lambda} d(\lambda) $$
    $$\frac{\partial{d(\lambda)}}{\partial{\lambda}}=\sum_{i=1}^{n}\sum_{\mathbf{z}}\frac{Q_l^{(i)}(\mathbf{z}) }{\lambda}-1=0$$
    So we have the dual optimum \(\lambda^\star=\sum_{i=1}^{n}\sum_{\mathbf{z}}Q_{l}^{(i)}(\mathbf{z}) =\sum_{i=1}^{n}1=n\) Therefore, we get the update rule of \(p(\mathbf{z})\):
    $$ p^\text{new}(\mathbf{z}) = \frac{1}{n}\sum_{i=1}^{n}Q_{l}^{(i)}(\mathbf{z}) $$
    This derivation uses the Lagrangian duality principle; for details, please see my post An Introduction to Support Vector Machines (SVM): Convex Optimization and Lagrangian Duality Principle.
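This update rule can be verified numerically: since each \(Q_l^{(i)}\) is itself a distribution over \(\mathbf{z}\), the averaged prior automatically sums to 1. A minimal NumPy sketch with random responsibilities:

```python
import numpy as np

rng = np.random.default_rng(0)

n, M = 100, 4                          # n samples, M values of the latent z
Q = rng.uniform(size=(n, M))
Q = Q / Q.sum(axis=1, keepdims=True)   # each row is a posterior over z

p_new = Q.mean(axis=0)                 # the update rule derived above
print(p_new.sum())                     # 1.0 (up to float rounding)
```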

Gaussian Mixture Model (GMM)

As indicated by its name, the GMM is a mixture (actually a linear combination) of multiple Gaussian distributions. The probability density function of a GMM is (\(\mathbf{x}\in R^p\)):

$$ \begin{align} p(\mathbf{x}; \phi, \mu, \Sigma) &= \sum_{j=1}^{M}\phi_{j} N(\mathbf{x};\mu_j, \Sigma_j)\\ &= \sum_{j=1}^{M}\phi_{j} \frac{1}{(2\pi)^{\frac{p}{2}}\vert\Sigma_j\vert^{\frac{1}{2}} } \exp\{-\frac{1}{2}{(\mathbf{x}-\mu_j)^T \Sigma_j^{-1} (\mathbf{x}-\mu_j)}\} \end{align} $$

where \(M\) is the number of Gaussian models. \(\phi_j\) is the weight factor of the Gaussian model \(N(\mu_j,\Sigma_j)\). Moreover, we have the constraint: \(\sum_{j=1}^{M} \phi_j =1\).
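As a quick numerical check of this density in the one-dimensional case (\(p=1\), so each \(\Sigma_j\) reduces to a scalar variance), we can verify that the mixture integrates to 1; the weights and parameters below are arbitrary illustrative values:

```python
import numpy as np

# 1-D specialization of the GMM density above
phi   = np.array([0.3, 0.7])    # mixture weights, must sum to 1
mu    = np.array([-2.0, 3.0])
sigma = np.array([1.0, 0.5])    # standard deviations

def gmm_pdf(x):
    comps = phi / (np.sqrt(2 * np.pi) * sigma) \
            * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
    return comps.sum(axis=1)

# Riemann sum over a range that covers both components' mass
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
integral = gmm_pdf(x).sum() * dx
print(round(integral, 4))       # 1.0: a valid probability density
```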

GMM is well suited to fitting a dataset that contains multiple clusters, where each cluster has a circular or elliptical shape. For example, the 2-D dataset plotted in the demo section below can be modeled by a GMM. Now the question is: given such a dataset, how do we find the MLE of the parameters (\(\phi,\mu,\Sigma\)) of the Gaussian Mixture Model?

The answer is: using EM algorithm!

EM algorithm on GMM parameters estimation

Before we move forward, we need to figure out what the prior \(p(\mathbf{z})\) is for the GMM. Suppose that there are \(M\) Gaussian models in the GMM, our latent variable \(\mathbf{z}\) only has \(M\) different values: \(\{\mathbf{z}^{(j)}=j| j=1,\dots,M\}\). The prior \(p(\mathbf{z}^{(j)})=p(\mathbf{z}=j)\) represents the likelihood that the data belongs to cluster (Gaussian model) \(j\), without any information about the data \(\mathbf{x}\). According to the marginal likelihood we have:

$$ p(\mathbf{x}; \phi, \mu, \Sigma) =\sum_{j=1}^{M} p(\mathbf{z}^{(j)}) p(\mathbf{x}\vert \mathbf{z}^{(j)}; \mu, \Sigma)\\ \sum_{j=1}^{M}p(\mathbf{z}^{(j)})=1 $$

If we compare these two equations with the expression of the GMM, we will find that \(p(\mathbf{z}^{(j)})\) plays the role of \(\phi_j\). In other words, we can treat \(\phi_j\) as the prior and \(p(\mathbf{x}\vert \mathbf{z}^{(j)}; \mu, \Sigma)= N(\mathbf{x};\mu_j, \Sigma_j)\)

Moreover, \(\mathbf{x}^{(i)}\in R^p\). The EM algorithm works as follows:

  1. Initialization:
    We normalize the raw data if necessary, randomly initialize \(\phi, \mu, \Sigma\).
  2. Repeat:
    To avoid an overcomplicated expression, we omit the subscript for the iteration index \(l\)
    • E step:
      \(\begin{align} Q^{(i)}(\mathbf{z}^{(j)})&=\frac{ p( \mathbf{x}^{(i)}\vert \mathbf{z}^{(j)} ; \mu,\Sigma ) p(\mathbf{z}^{(j)})}{ \sum_{k=1}^{M}p( \mathbf{x}^{(i)}\vert \mathbf{z}^{(k)} ; \mu,\Sigma ) p(\mathbf{z}^{(k)}) } \\ &=\frac{\phi_{j} \frac{1}{(2\pi)^{\frac{p}{2}}\vert \Sigma_j \vert^{\frac{1}{2}} }\exp\{-\frac{1}{2}(\mathbf{x}^{(i)}-\mu_j)^T\Sigma_j^{-1}(\mathbf{x}^{(i)}-\mu_j) \} }{ \sum_{k=1}^{M} \phi_{k} \frac{1}{(2\pi)^{\frac{p}{2}}\vert \Sigma_k \vert^{\frac{1}{2}} }\exp\{-\frac{1}{2}(\mathbf{x}^{(i)}-\mu_k)^T\Sigma_k^{-1}(\mathbf{x}^{(i)}-\mu_k) \} } \end{align}\)
      For short we denote \(q_{i,j}=Q^{(i)}(\mathbf{z}^{(j)})\)
    • M step:
      The objective function is:
      \(L= \sum_{i=1}^{n}\sum_{j=1}^{M} q_{i,j} \log \phi_{j} \frac{1}{(2\pi)^{\frac{p}{2}}\vert\Sigma_j\vert^{\frac{1}{2}} } \exp\{-\frac{1}{2}{(\mathbf{x}^{(i)}-\mu_j)^T \Sigma_j^{-1} (\mathbf{x}^{(i)}-\mu_j)}\}\)
      Update \(\mu\)
      We compute the partial derivative:
      $$\begin{align} \frac{\partial{L}}{\partial{\mu_k}}&=\frac{\partial}{\partial{\mu_k}}\Big[\sum_{i=1}^{n}-\frac{1}{2}q_{i,k}(\mathbf{x}^{(i)}-\mu_k)^T\Sigma_k^{-1}(\mathbf{x}^{(i)}-\mu_k)\Big]\\ &= -\sum_{i=1}^{n}q_{i,k}\Sigma_k^{-1}(\mu_k - \mathbf{x}^{(i)})\\ &= -\Sigma_k^{-1}( \sum_{i=1}^{n}q_{i,k}\mu_k -\sum_{i=1}^{n}q_{i,k}\mathbf{x}^{(i)} )\\ &=0 \end{align}$$
      Since \(\Sigma_k>0\), the solution to the above equation is:
      $$\mu_k^\text{new} = \frac{ \sum_{i=1}^{n}q_{i,k}\mathbf{x}^{(i)} }{ \sum_{i=1}^{n}q_{i,k} },\ k=1,\dots,M$$
      Update \(\Sigma\)
      let \(\Lambda_k=\Sigma_k^{-1}\), then the objective function is \(L= \sum_{i=1}^{n}\sum_{j=1}^{M} q_{i,j} \log \phi_{j} \frac{\vert\Lambda_j\vert^{\frac{1}{2}}}{(2\pi)^{\frac{p}{2}} } \exp\{-\frac{1}{2}{(\mathbf{x}^{(i)}-\mu_j)^T \Lambda_j (\mathbf{x}^{(i)}-\mu_j)}\}\)
      we compute the optimal \(\Lambda^\star_k\) by solving the equation:
      $$\begin{align}\frac{\partial{L}}{\partial{\Lambda_k}}&=\frac{\partial}{\partial{\Lambda_k}}\Big\{ \sum_{i=1}^{n} q_{i,k}\Big[\frac{1}{2}\log \vert\Lambda_k\vert - \frac{1}{2}(\mathbf{x}^{(i)}-\mu_k)^T\Lambda_k(\mathbf{x}^{(i)}-\mu_k)\Big] \Big\} \\ &=\frac{\partial}{\partial{\Lambda_k}}\Big\{ \sum_{i=1}^{n} q_{i,k}\Big[\frac{1}{2}\log \vert\Lambda_k\vert - \frac{1}{2}\text{tr}\left((\mathbf{x}^{(i)}-\mu_k)^T\Lambda_k(\mathbf{x}^{(i)}-\mu_k)\right) \Big] \Big\} \\ &= \frac{1}{2}\sum_{i=1}^{n} q_{i,k} \Lambda_k^{-1} - \frac{1}{2} \frac{\partial}{\partial{\Lambda_k}} \text{tr}\left(\Lambda_k\sum_{i=1}^{n}q_{i,k}(\mathbf{x}^{(i)}-\mu_k)(\mathbf{x}^{(i)}-\mu_k)^T\right) \\ &= \frac{1}{2}\sum_{i=1}^{n} q_{i,k} \Sigma_k - \frac{1}{2} \sum_{i=1}^{n}q_{i,k}(\mathbf{x}^{(i)}-\mu_k)(\mathbf{x}^{(i)}-\mu_k)^T \\ &=0\end{align}$$
      So we have the update rule of \(\Sigma_k\):
      $$ \Sigma_k^\text{new} = \frac{ \sum_{i=1}^{n}q_{i,k}(\mathbf{x}^{(i)}-\mu_k)(\mathbf{x}^{(i)}-\mu_k)^T }{\sum_{i=1}^{n} q_{i,k} } $$
    • update the prior \(\phi_j=\frac{1}{n}\sum_{i=1}^{n}Q^{(i)}(\mathbf{z}^{(j)})\)

    Until all the parameters converge.
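As a sanity check of these M-step formulas: if every responsibility is \(q_{i,k}=1\) (all points fully assigned to one cluster), \(\mu_k^\text{new}\) reduces to the sample mean and \(\Sigma_k^\text{new}\) to the biased sample covariance. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
q = np.ones(500)                 # all responsibility on a single cluster

# weighted-mean and weighted-covariance updates from the M step above
mu_new = (q[:, None] * X).sum(axis=0) / q.sum()
diff = X - mu_new
sigma_new = (q[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / q.sum()

print(np.allclose(mu_new, X.mean(axis=0)))            # True
print(np.allclose(sigma_new, np.cov(X.T, bias=True))) # True
```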

GMM for Clustering

Suppose that we have used the EM algorithm to estimate the model parameters; what does the posterior \(p_\theta(\mathbf{z}^{(j)}\vert \mathbf{x})\) represent? It represents the likelihood that the data \(\mathbf{x}\) belongs to the Gaussian model (cluster) \(j\). Therefore, we can use the posterior expression given in the E step above to compute \(p_\theta(\mathbf{z}^{(j)}\vert \mathbf{x}),\ j=1,\dots,M\), and choose the cluster index with the largest posterior: \(c_x=\arg \max_{j} p_\theta(\mathbf{z}^{(j)}\vert \mathbf{x})\)

Demo

We implement EM for GMM in Python and test it on a 2-D dataset.

import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
Using TensorFlow backend.
def load_data(  num_samples, prior_z_list , mu_list , sigma_list ):
	X=[]
	choice_of_gaussian_model = np.random.choice(len( prior_z_list), num_samples, p=prior_z_list  )
	for sample_ind in range(num_samples):
		gaussian_ind = choice_of_gaussian_model[sample_ind]
		x= np.random.multivariate_normal( mu_list[gaussian_ind], sigma_list[gaussian_ind] )
		X.append(x)

	X= np.asarray(X)
	return X

def EM(X, num_clusters, epsilon = 1e-2, update_prior = True, max_iter = 100000 ):
	x_dim = X.shape[1]
	num_samples = X.shape[0]
	## initialization
	mu = np.random.uniform( size=( num_clusters, x_dim ) )
	## initializing sigma as identity matrix can guarantee it's positive definite
	sigma = []
	for _ in range(num_clusters):
		sigma.append( np.eye(x_dim) )
	sigma = np.asarray(sigma)
	phi = np.ones(num_clusters)/ num_clusters

	count = 0

	while True:
		## E step
		# Q is the posterior, with the dimension num_samples x num_clusters
		Q=np.zeros( [num_samples, num_clusters])
		sigma_det =[ (np.linalg.det(sigma[j]))**0.5 for j in range(num_clusters)  ]
		sigma_inverse = [ np.linalg.inv(sigma[j]) for j in range(num_clusters)  ]
		for i in range(num_samples):
			for j in range(num_clusters):
				# the constant factor (2*pi)**(p/2) is omitted here: it cancels
				# out when Q is normalized row-wise below
				Q[i,j] = phi[j]/sigma_det[j] * np.exp( -0.5 * np.matmul( np.matmul((X[i]-mu[j]).T, sigma_inverse[j]), X[i]-mu[j]))
		Q = Q/np.sum(Q, axis=1, keepdims=True)

		## M step
		# update mu
		mu_new = np.ones([num_clusters, x_dim])
		for j in range(num_clusters):
			mu_new[j] = np.sum (Q[:,j:j+1]*X ,axis=0 )/np.sum(Q[:,j],axis=0)
		# update sigma
		sigma_new = np.zeros_like(sigma)
		for j in range(num_clusters):
			for i in range(num_samples):
				sigma_new[j] += Q[i,j] * np.matmul( (X[i]-mu[j])[:,np.newaxis], (X[i]-mu[j])[:,np.newaxis].T )
			sigma_new[j] = sigma_new[j]/np.sum(Q[:,j])
		# update phi
		if update_prior:
			phi_new = np.mean( Q, axis=0 )
		else:
			phi_new = phi

		delta_change = np.mean(np.abs(phi-phi_new)) + np.mean( np.abs( mu- mu_new ) )+np.mean( np.abs( sigma- sigma_new ) )
		print("parameter changes: ",delta_change)
		
		if delta_change < epsilon:
			break
		
		count +=1
		if count >= max_iter:
			break

		phi=phi_new
		mu= mu_new
		sigma = sigma_new

	## a function used for performing clustering
	def cluster( X ):
		Q=np.zeros( [X.shape[0], num_clusters])
		sigma_det =[ (np.linalg.det(sigma[j]))**0.5 for j in range(num_clusters)  ]
		sigma_inverse = [ np.linalg.inv(sigma[j]) for j in range(num_clusters)  ]
		for i in range(X.shape[0]):
			for j in range(num_clusters):
				Q[i,j]= phi[j]/(  sigma_det[j] ) * np.exp( -0.5 * np.matmul( np.matmul((X[i]-mu[j]).T, sigma_inverse[j]), X[i]-mu[j]))   
		Q=np.array(Q)
		Q=Q/(np.sum(Q,axis=1,keepdims=True))	
		cluster_info = np.argmax( Q, axis=1)
		return cluster_info


	return {"mu":mu, "sigma":sigma, "phi":phi, "cluster": cluster}

GMM on 2-D data points with convex cluster shapes

First, let's generate some data points.

real_phi =  [0.2,0.6,0.1,0.1]
real_mu = [ [0,0],[2,8],[10,10],[9,1] ]
real_sigma = [ [[1,0.5],[0.5,1]], [[2,-0.6],[-0.6,1]], [[1,0],[0,1]],[[1,0.3],[0.3,0.5]] ]
X=load_data(10000, real_phi, real_mu, real_sigma )
for i in range(len(real_phi)):
    print("real phi: ", real_phi[i], " real mu: ", real_mu[i], " real sigma: ", real_sigma[i])
real phi:  0.2  real mu:  [0, 0]  real sigma:  [[1, 0.5], [0.5, 1]]
real phi:  0.6  real mu:  [2, 8]  real sigma:  [[2, -0.6], [-0.6, 1]]
real phi:  0.1  real mu:  [10, 10]  real sigma:  [[1, 0], [0, 1]]
real phi:  0.1  real mu:  [9, 1]  real sigma:  [[1, 0.3], [0.3, 0.5]]

Let’s plot the data and have a look at it.

plt.scatter( X[:,0], X[:,1] )
plt.show()

[Figure: scatter plot of the generated data points]

Then we apply the EM algorithm to obtain the MLE of the GMM parameters and the cluster function:

params=EM(X, num_clusters=4, epsilon= 1E-4)
mu= params["mu"]
sigma = params["sigma"]
phi=params["phi"]
cluster = params["cluster"]
parameter changes:  28.449669073154364
parameter changes:  17.400927300989974
parameter changes:  0.9644888523985635
parameter changes:  1.0995072448163998
parameter changes:  1.3509364912075696
parameter changes:  1.2308294431017273
parameter changes:  1.3794412438676897
parameter changes:  1.4081227407466508
parameter changes:  1.0857571446279906
parameter changes:  0.7155881044307679
parameter changes:  0.411613512938475
parameter changes:  0.12457364032905578
parameter changes:  0.04685136953006225
parameter changes:  0.0540454165259536
parameter changes:  0.06456840164792643
parameter changes:  0.07771391163679765
parameter changes:  0.09436688134288668
parameter changes:  0.11582159431045104
parameter changes:  0.14421201360388664
parameter changes:  0.1834323022021212
parameter changes:  0.24801453948582258
parameter changes:  0.3558084755399498
parameter changes:  0.5349701481676721
parameter changes:  0.7677886989164794
parameter changes:  0.7666771213539978
parameter changes:  0.5043555266074152
parameter changes:  0.11678542980595268
parameter changes:  0.001048169134691374
parameter changes:  1.550958923947094e-06
esti_mu= (mu*100).astype(np.int32)/100.  
esti_sigma= (sigma*100).astype(np.int32)/100. 
esti_phi= (phi*100).astype(np.int32)/100. 
for i in range(len(esti_phi)):
    print("esti phi:", esti_phi[i], "esti mu:", esti_mu[i].tolist(), "esti sigma:", esti_sigma[i].tolist())
esti phi: 0.09 esti mu: [8.99, 0.99] esti sigma: [[1.07, 0.31], [0.31, 0.51]]
esti phi: 0.19 esti mu: [0.01, 0.01] esti sigma: [[1.0, 0.48], [0.48, 1.01]]
esti phi: 0.1 esti mu: [10.02, 10.02] esti sigma: [[0.92, -0.01], [-0.01, 1.03]]
esti phi: 0.6 esti mu: [2.01, 7.98] esti sigma: [[2.0, -0.61], [-0.61, 1.02]]

If we compare the estimated parameters with the real parameters, we can see the estimation error is within 0.05, and the correspondence between phi, mu and sigma is also correct. Therefore the EM algorithm does work!

We can perform clustering using the trained cluster model and plot the clustering results

cluster_X = cluster(X)
cluster_index = np.unique(cluster_X)
for ind in cluster_index:
	plt.scatter( X[cluster_X==ind][:,0], X[cluster_X==ind][:,1], color = np.random.uniform(size=3) )
plt.legend(cluster_index)
plt.show()

[Figure: scatter plot of the data colored by GMM cluster assignment]

Well, the clustering results are pretty accurate and reasonable! So we can use GMM for unsupervised clustering!

Discussion: As shown in the figure above, each cluster is actually a convex set.

A convex set $S$ means that for any two points $\mathbf{x}_1\in S, \mathbf{x}_2\in S$, the linear interpolation $\mathbf{x}_\text{int}= \lambda \mathbf{x}_1 + (1-\lambda)\mathbf{x}_2,\ 0\leq\lambda\leq 1$, also belongs to $S$.

This is quite reasonable, since a Gaussian distribution naturally has a convex shape. However, what will the performance of GMM clustering be on a non-convex dataset?

GMM on 2d data points with non-convex shapes

First of all, let's prepare the data:

def load_non_convex_data(num_samples=10000, prior_z_list=[0.5,0.5], mu_list=[[np.pi/2, 3], [np.pi*1, -3]], sigma_list=[[[np.pi,0],[0,2]],[[np.pi,0],[0,2]]]):
    X=[]
    choice_of_model = np.random.choice(len( prior_z_list), num_samples, p=prior_z_list  )
    for ind in choice_of_model:
        while True:
            x= np.random.multivariate_normal( mu_list[ind], sigma_list[ind] )
            if ind==0:
                if x[1]>1.5*np.sin(x[0])+0.5:
                    break
            else:
                if x[1]<1.5*np.sin(x[0])-0.5:
                    break
        X.append(x)
    X= np.array(X)
    return X            
X= load_non_convex_data()
plt.scatter(X[:,0],X[:,1] )
plt.show()

[Figure: scatter plot of the non-convex dataset]

Use the EM algorithm to estimate the parameters of the GMM model.

params=EM(X, num_clusters=2, epsilon= 1E-2)
mu= params["mu"]
sigma = params["sigma"]
phi=params["phi"]
cluster = params["cluster"]
parameter changes:  7.344997536220525
parameter changes:  2.769657568563131
parameter changes:  0.6826557990296913
parameter changes:  0.8559206668196735
parameter changes:  0.9985169905722497
parameter changes:  0.6972809861725238
parameter changes:  0.16143972260766515
parameter changes:  0.014376638549487432
parameter changes:  0.002146320352925

Let’s see the clustering results:

cluster_X = cluster(X)
cluster_index = np.unique(cluster_X)
for ind in cluster_index:
	plt.scatter( X[cluster_X==ind][:,0], X[cluster_X==ind][:,1], color = np.random.uniform(size=3) )
plt.legend(cluster_index)
plt.show()

[Figure: GMM clustering result on the non-convex dataset]

From this figure we can see that the real clusters are actually non-convex, since there is a sine-shaped gap between the two real clusters. However, GMM clustering always produces convex clusters: for example, both the blue point set and the red point set are convex. This is determined by the fact that the Gaussian distribution has a convex shape.

Conclusion

Now we see both the power and the shortcoming of GMM clustering. In the GMM clustering results, each cluster's region usually has a convex shape. This limits the power of GMM clustering, especially on manifold data clustering. In the future we will discuss how to cluster such non-convex datasets.

Moreover, this GMM model is not very practical, since for some sparse datasets, when updating \(\Sigma_j\) in the M step, the covariance matrix \(\frac{ \sum_{i=1}^{n}q_{i,k}(\mathbf{x}^{(i)}-\mu_k)(\mathbf{x}^{(i)}-\mu_k)^T }{\sum_{i=1}^{n} q_{i,k} }\) may not be positive definite (i.e., it may be singular). In that case we cannot directly compute the inverse of \(\Sigma_j\). More work is needed to deal with such cases.
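One common remedy, shown here as a sketch rather than as part of the implementation above, is to add a small ridge term \(\epsilon I\) to each covariance matrix before inverting it:

```python
import numpy as np

def regularize_covariance(sigma, eps=1e-6):
    """Add eps * I so the covariance matrix is safely invertible."""
    return sigma + eps * np.eye(sigma.shape[0])

# A rank-1 (singular) covariance matrix that cannot be inverted directly:
sigma = np.array([[1.0, 1.0],
                  [1.0, 1.0]])
sigma_reg = regularize_covariance(sigma)
inv = np.linalg.inv(sigma_reg)  # now well-defined
```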

Reference

  1. Andrew Ng’s course on Machine Learning at Stanford University

An Introduction to Expectation-Maximization (EM) Algorithm (2019-07-07)
https://nianlonggu.github.io/posts/2019/07/blog-post-9

The Expectation-Maximization (EM) algorithm is a special case of MLE where the observations (data samples \(\mathbf{x}\)) are inherently related to some hidden variables (\(\mathbf{z}\)). First of all, we need to review the basics of MLE.

Maximum Likelihood Estimation (MLE)

Let \(\{\mathbf{x}^{(i)}\},\ i=1,\dots,n\) be a set of independent and identically distributed observations, and \(\mathbf{\theta}\) be the parameters of the data distribution which are unknown for us. The maximum likelihood estimation of the parameters \(\theta\) is the parameters which can maximize the joint distribution \(p_\theta(\mathbf{x}^{(1)},\dots,\mathbf{x}^{(n)})= \prod_{i=1}^{n}p_\theta(\mathbf{x}^{(i)})\)

$$ \hat{\theta}_\text{MLE} = \arg\ \max_{\theta}{\prod_{i=1}^{n}p_\theta(\mathbf{x}^{(i)})} $$

More commonly, we choose to maximize the joint log-likelihood:

$$ \hat{\theta}_\text{MLE} = \arg\ \max_{\theta}{\sum_{i=1}^{n}\log p_\theta(\mathbf{x}^{(i)})} $$

We use an example to illustrate how it works (adapted from the Zhihu article "EM算法详解" ["A Detailed Explanation of the EM Algorithm"]).

Suppose that we have a coin A, whose probability of heads is \(\theta_A\). We denote one observation \(\mathbf{x}^{(i)}=\{ x_{i,1},x_{i,2},x_{i,3},x_{i,4},x_{i,5} \}\) as tossing the coin A 5 times and recording heads (1) or tails (0) for each toss. For example, \(\mathbf{x}^{(i)}\) can be 01001, 01110, 10010, … etc. The likelihood of the observation \(\mathbf{x}^{(i)}\) is:

$$ p({\mathbf{x}^{(i)}}) = \prod_{j=1}^{5}\theta_A^{x_{i,j}}(1-\theta_A)^{1-x_{i,j}} = \theta_A^{\sum_{j=1}^{5}x_{i,j}}(1-\theta_A)^{\sum_{j=1}^{5}(1-x_{i,j})} $$

Therefore, the log likelihood of the joint distribution of \(n\) observations is:

$$ l(\theta_A) = \sum_{i=1}^{n} \sum_{j=1}^{5}\Big( x_{i,j}\log{\theta_A} +(1-x_{i,j})\log{(1-\theta_A)} \Big) $$

The MLE of \(\theta_A\) is

$$ \hat{\theta}_{A,\text{MLE}} = \arg \max_{\theta_A} l(\theta_A) $$

To get \(\hat{\theta}_{A,\text{MLE}}\) we can solve the equation \(\frac{\partial{l(\theta_A)}}{\partial{\theta_A}}=0\).

$$ \begin{align} \frac{\partial{l(\theta_A)}}{\partial{\theta_A}} &= \sum_{i=1}^{n}\sum_{j=1}^5 \Big(\frac{x_{i,j}}{\theta_A} + \frac{x_{i,j}-1}{1-\theta_A} \Big)\\ &= \frac{\sum_{i=1}^{n}\sum_{j=1}^5 x_{i,j}}{\theta_A} + \frac{\sum_{i=1}^{n}\sum_{j=1}^5 x_{i,j}-5n}{1-\theta_A} \\ &= 0 \end{align} $$

Therefore, we have

$$ \hat{\theta}_{A, \text{MLE}} = \frac{\sum_{i=1}^{n}\sum_{j=1}^5 x_{i,j}}{5n} $$

This is actually equivalent to computing the average value of all tossing results. For example, if we have 10 observations as below:

\(\mathbf{x}^{(1)}\) 01011 \(\mathbf{x}^{(6)}\) 01110
\(\mathbf{x}^{(2)}\) 01111 \(\mathbf{x}^{(7)}\) 01110
\(\mathbf{x}^{(3)}\) 11011 \(\mathbf{x}^{(8)}\) 11011
\(\mathbf{x}^{(4)}\) 00011 \(\mathbf{x}^{(9)}\) 00100
\(\mathbf{x}^{(5)}\) 01010 \(\mathbf{x}^{(10)}\) 01001

The sum of all tosses is 28, and the total number of tosses is 50, so the MLE of \(\theta_A\) is \(\frac{28}{50}=\frac{14}{25}\).
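The arithmetic above can be verified with a few lines of NumPy (the observation table encoded as bit strings):

```python
import numpy as np

# The ten observations from the table above, encoded as bit strings.
obs = ["01011", "01111", "11011", "00011", "01010",
       "01110", "01110", "11011", "00100", "01001"]
X = np.array([[int(c) for c in s] for s in obs])

theta_mle = X.sum() / X.size  # total number of heads / total number of tosses
print(theta_mle)  # 28 / 50 = 0.56
```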

MLE with hidden variables

Now things become more complicated. Suppose we have two coins, A and B, whose probabilities of heads are \(\theta_A\) and \(\theta_B\) respectively. We want to find the MLE of \(\theta_A, \theta_B\) using \(n\) observations \(\{\mathbf{x}^{(i)}\},\ i=1,\dots,n\). Each observation has the same form as above. The challenging part is that for each observation \(\mathbf{x}^{(i)}\), we don't know which coin it comes from. For example, with \(n=10\) and the observation set from the table above, how can we find the MLE of \(\theta_A\) and \(\theta_B\)?

This is a simple example where our observations are closely related to some hidden (unknown) variables; in other words, the information in the data is incomplete. The Expectation-Maximization algorithm can be used to solve such problems.

Expectation-Maximization (EM) Algorithm

Before introducing the EM algorithm, we need to know an important inequality: Jensen's inequality.

Jensen's Inequality

If a function \(f(\mathbf{X})\) is strictly convex (its Hessian matrix \(H\) is positive definite), where \(\mathbf{X}\) is a random variable, we have

$$ E_{X}[f(\mathbf{X})] \geq f(E_X[\mathbf{X}]) $$

The equality holds if and only if \(\mathbf{X} = E_X [\mathbf{X}]\) with probability 1 (i.e., \(\mathbf{X}\) is a constant). Note that if \(f(\mathbf{X})\) is strictly concave, the direction of the inequality is reversed.

We can use an example to illustrate Jensen's inequality more intuitively (not a proof). This example is adapted from Andrew Ng's lecture notes on EM.

[Figure: illustration of Jensen's inequality for a convex function]

As shown in this figure, the random variable \(\mathbf{X}\) has only two possible values, \(a\) and \(b\), each with probability 0.5. Therefore, \(f(E[\mathbf{X}])= f(\frac{a+b}{2})\) and \(E[f(\mathbf{X})]=\frac{f(a)+f(b)}{2}\). By the convexity of \(f(\mathbf{X})\), we have \(E[f(\mathbf{X})]\geq f(E[\mathbf{X}])\), and the equality holds if and only if \(a=b\), which means \(E[\mathbf{X}]=\mathbf{X}=a\).
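As a quick numerical sanity check (an illustration, not a proof): for the concave function \(\log\), Jensen's inequality predicts \(E[\log X] \leq \log E[X]\), and this holds exactly for the empirical distribution of any sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=100_000)  # samples of a positive random variable

lhs = np.mean(np.log(x))  # E[log X]
rhs = np.log(np.mean(x))  # log E[X]
print(lhs <= rhs)  # True, since log is concave
```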

Now we have a powerful tool, and we will use it to deduce the EM algorithm.

EM algorithm
Recall the MLE problem:

$$ \theta_\text{MLE} = \arg \max_{\theta} \sum_{i=1}^{n} \log{p_\theta(\mathbf{x}^{(i)})} $$

If \(\mathbf{x}\) is related to a latent variable \(\mathbf{z}\), we write \(p_\theta\) as the marginal likelihood of the joint distribution:

$$ p_\theta(\mathbf{x}^{(i)}) = \sum_{z} p_\theta(\mathbf{x}^{(i)}, \mathbf{z}) $$

Now our log-likelihood function \(l(\theta)\) becomes:

$$ l(\theta) = \sum_{i=1}^{n}\log \sum_{z} p_\theta(\mathbf{x}^{(i)}, \mathbf{z}) $$

The log of a sum is hard to maximize directly. To solve this maximization problem we introduce a distribution \(Q^{(i)}(\mathbf{z})\) over \(\mathbf{z}\), and rewrite the log-likelihood function as:

$$ \begin{align} l(\theta) &= \sum_{i=1}^{n} \log \sum_{z} Q^{(i)}(\mathbf{z}) \frac{p_\theta(\mathbf{x}^{(i)},\mathbf{z})}{Q^{(i)}(\mathbf{z})}\\ &= \sum_{i=1}^{n} \log E_{\mathbf{z}\sim Q^{(i)}} \left[ \frac{p_\theta(\mathbf{x}^{(i)},\mathbf{z})}{Q^{(i)}(\mathbf{z})}\right] \end{align} $$

We know that the function \(f(x)=\log(x)\) is strictly concave, so according to Jensen's inequality, we have the following inequality:

$$ l(\theta) \geq L(\theta, Q^{(i)}) =\sum_{i=1}^{n} E_{\mathbf{z} \sim Q^{(i)}}\left[ \log{\frac{p_\theta(\mathbf{x}^{(i)},\mathbf{z})} {Q^{(i)}(\mathbf{z})}} \right] $$

The equality holds if and only if \(\frac{p_\theta(\mathbf{x}^{(i)},\mathbf{z})} {Q^{(i)}(\mathbf{z})}\) is a constant (with respect to the variable \(\mathbf{z}\)). To achieve this we can set \(Q^{(i)}(\mathbf{z})= \frac{p_\theta(\mathbf{x}^{(i)},\mathbf{z})}{\sum_{z}p_\theta(\mathbf{x}^{(i)},\mathbf{z})}= \frac{ p_\theta(\mathbf{x}^{(i)},\mathbf{z}) }{p_\theta(\mathbf{x}^{(i)})}= p_\theta(\mathbf{z}\vert \mathbf{x}^{(i)})\). This means \(Q^{(i)}(\mathbf{z})\) is the posterior of \(\mathbf{z}\) given \(\mathbf{x}^{(i)}\).

People may ask: why do we try to find a proper \(Q^{(i)}(\mathbf{z})\) to make the equality hold?

Our initial goal is to find the MLE of \(\theta\) w.r.t. \(l(\theta)\). However, the original expression of \(l(\theta)\) is hard to optimize directly, so we take advantage of Jensen's inequality: by selecting a proper distribution \(Q^{(i)}(\mathbf{z})\), we make \(l(\theta)=L(\theta, Q^{(i)})\). Then we can instead maximize \(L(\theta, Q^{(i)})\) w.r.t. \(\theta\) to find the MLE in an iterative way.
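The tightness of the bound when \(Q^{(i)}\) is the posterior can be checked numerically on a toy discrete model (the joint probability table below is made up purely for illustration):

```python
import numpy as np

# Toy joint p(x, z) for one fixed observation x, with two latent states z:
p_joint = np.array([0.3, 0.1])  # [p(x, z=0), p(x, z=1)]
p_x = p_joint.sum()             # marginal likelihood p(x)

Q = p_joint / p_x               # posterior p(z | x)
elbo = np.sum(Q * np.log(p_joint / Q))  # E_Q[ log p(x,z)/Q(z) ]

print(np.isclose(elbo, np.log(p_x)))  # True: the lower bound is tight
```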

Now, we can summarize the EM algorithm:

  1. Heuristically (or randomly) initialize parameters \(\theta\)
  2. Repeat:
    • Expectation (E) step:
      Based on current parameter \(\theta\) and \(\mathbf{x}^{(i)}\), set \(Q^{(i)}(\mathbf{z})=p_\theta(\mathbf{z}\vert \mathbf{x}^{(i)})\).
    • Maximization (M) step:
      Update \(\theta \leftarrow \arg \max_{\theta} \sum_{i=1}^{n}\sum_z Q^{(i)}(\mathbf{z}) \log \frac{p_\theta(\mathbf{x}^{(i)}, \mathbf{z})}{Q^{(i)}(\mathbf{z})}\)

    Until: \(\theta\) converges

In Andrew Ng's lecture notes, it is proven that this procedure guarantees that \(l(\theta)\) steadily increases. Suppose that after \(l-1\) iterations we have the log-likelihood \(l(\theta_{l-1})\). At the \(l^\text{th}\) iteration, after the E step, we have \(L(\theta_{l-1}, Q^{(i)}_l)= l(\theta_{l-1})\); after the M step, the updated \(\theta_{l}\) is selected such that \(L(\theta_l, Q^{(i)}_l)\geq L(\theta_{l-1}, Q^{(i)}_l)\). Then at the \((l+1)^\text{th}\) iteration, by selecting \(Q^{(i)}_{l+1}\) as the posterior of \(\mathbf{z}\), we have \(l(\theta_l)=L(\theta_l, Q^{(i)}_{l+1})\). Therefore, we have

$$ l(\theta_l) = L(\theta_l, Q^{(i)}_{l+1}) \geq L(\theta_l, Q^{(i)}_{l}) \geq L(\theta_{l-1}, Q^{(i)}_{l}) = l(\theta_{l-1}) $$

So we have \(l(\theta_{l})\geq l(\theta_{l-1})\). This guarantees that the overall log-likelihood can only keep increasing or stay unchanged, but not decrease.

This deduction shows that the EM algorithm is heading in the right direction, although this direction may not be the ideal one. If \(l(\theta)\) is globally concave, the EM algorithm always converges to the global optimum. If \(l(\theta)\) is not globally concave, the property \(l(\theta_l)\geq l(\theta_{l-1})\) guarantees that the EM algorithm converges to some point (assuming \(l(\theta)\) is bounded above), but this point may not be the global optimum.

Moreover, the EM algorithm is sensitive to initialization. Different initializations may result in quite different convergence points, as shown in the figure below.

[Figure: EM converging to different local optima from initializations A and B]

As shown in this figure, if the initialization is at point \(A\), then EM will converge to point \(C_A\), while EM will converge to point \(C_B\) if the initialization is at \(B\). Obviously \(C_B\) is the global optimum and \(C_A\) is not.

So how can we make the EM algorithm less sensitive to initialization and more likely to find the global optimum? One simple, straightforward but effective way is to randomly initialize the parameters, run the EM algorithm multiple times, and choose the parameters with the largest converged log-likelihood (objective function).
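The restart strategy can be sketched as follows; `em_fit` here is a hypothetical stand-in for any EM routine that returns its converged parameters and log-likelihood:

```python
import numpy as np

def em_fit(X, seed):
    """Hypothetical placeholder for one EM run from a random initialization."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform()  # random initialization
    # ... E and M steps would iterate here until convergence ...
    log_lik = -(theta - 0.5) ** 2  # dummy converged objective, for illustration
    return theta, log_lik

def em_with_restarts(X, num_restarts=10):
    # Run EM several times and keep the run with the largest log-likelihood.
    runs = [em_fit(X, seed) for seed in range(num_restarts)]
    return max(runs, key=lambda run: run[1])

best_theta, best_ll = em_with_restarts(X=None)
```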

Applying the EM algorithm to practical problems

Tossing two coins with different heads probability

Let's recall the question raised in the first section:

Suppose we have two coins, A and B, whose probabilities of heads are \(\theta_A\) and \(\theta_B\) respectively. We want to find the MLE of \(\theta_A, \theta_B\) using \(n\) observations \(\{\mathbf{x}^{(i)}\in \{0,1\}^d\},\ i=1,\dots,n\). Each observation has \(d\) dimensions, i.e., \(d\) tosses per observation. The challenging part is that for each observation \(\mathbf{x}^{(i)}\), we don't know which coin it comes from. In this case, how do we find the MLE of \(\theta_A\) and \(\theta_B\)?

In this case, \(\mathbf{x}\) is related to a hidden variable \(z\), which can take only two values: \(z=A\) for coin A and \(z=B\) for coin B. We now apply the EM algorithm to this case.

  1. Randomly initialize \(\theta_{A,0}\), \(\theta_{B,0}\), and the prior distribution of \(z\) is \(P(z=A)\) , \(P(z=B)\).

    Note that the choice of the prior distribution of \(z\) strongly influences the final learned parameters. If the chosen prior is quite different from the real prior, the estimated parameters will be inaccurate. To solve this problem, we can update the prior by setting the prior of the current iteration to the posterior of the previous iteration, averaged over all observations. This is commonly used when data comes as a sequence.

  2. Repeat:
    at the \(l^\text{th}\) iteration:
    • E step:
      We need to compute the posterior:
      $$Q^{(i)}_l(z) = P(z|\mathbf{x}^{(i)}; \theta_{l-1}) = \frac{P(\mathbf{x}^{(i)}\vert z; \theta_{l-1}) P(z) }{ P(\mathbf{x}^{(i)}) }$$
      Here \(\theta_{l-1}\) represents the parameter set \(\{\theta_{A,l-1}, \theta_{B, l-1}\}\). Furthermore, we know that \(P(z=A|\mathbf{x}^{(i)}; \theta_{l-1})+P(z=B|\mathbf{x}^{(i)}; \theta_{l-1})=1\). Moreover, we have the prior \(P(z=A)=0.5\) and \(P(z=B)=0.5\). Therefore, we have
      $$\begin{align} & Q^{(i)}_{A,l}= Q^{(i)}_ l(z=A)\\=&\frac{P(\mathbf{x}^{(i)}\vert z=A; \theta_{l-1})P(z=A) }{ P(\mathbf{x}^{(i)}\vert z=A; \theta_{l-1})P(z=A) +P(\mathbf{x}^{(i)}\vert z=B; \theta_{l-1}) P(z=B) }\\ =&\frac{P(\mathbf{x}^{(i)}\vert z=A; \theta_{l-1})}{ P(\mathbf{x}^{(i)}\vert z=A; \theta_{l-1}) +P(\mathbf{x}^{(i)}\vert z=B; \theta_{l-1}) } \end{align}$$
      and
      $$\begin{align} &Q^{(i)}_{B,l}=Q^{(i)}_ l(z=B)\\=&\frac{P(\mathbf{x}^{(i)}\vert z=B; \theta_{l-1}) }{ P(\mathbf{x}^{(i)}\vert z=A; \theta_{l-1}) +P(\mathbf{x}^{(i)}\vert z=B; \theta_{l-1}) }\end{align}$$
    • M step:
      The objective function is
      $$\begin{align}&L(\theta_{l-1}, Q^{(i)}_l)\\ =& \sum_{i=1}^{n} \sum_z Q^{(i)}_l(z) \log \frac{P(\mathbf{x}^{(i)},z;\theta_{l-1})}{Q^{(i)}_ l(z)}\\ =& \sum_{i=1}^{n}\Big( Q_l^{(i)}(z=A)\log \frac{ P(\mathbf{x}^{(i)}|z=A; \theta_{l-1} )P(z=A) }{Q_l^{(i)}(z=A) } +\\ &\ \ \ Q_l^{(i)}(z=B)\log \frac{ P(\mathbf{x}^{(i)}|z=B; \theta_{l-1} )P(z=B) }{Q_l^{(i)}(z=B) }\Big)\\ =& \sum_{i=1}^{n} \sum_{j=1}^{d}\Big[Q_{A,l}^{(i)}\Big(x_{i,j}\log \theta_{A,l-1} +(1-x_{i,j})\log (1-\theta_{A,l-1}) \Big)+\\ &\ \ \ Q_{B,l}^{(i)}\Big(x_{i,j}\log \theta_{B,l-1} +(1-x_{i,j})\log (1-\theta_{B,l-1}) \Big)\Big]+C \end{align}$$
      Where \(C\) is a term that is not related to \(\theta_A\) or \(\theta_B\). To update \(\theta\), we compute the partial derivatives and set them to 0:
      $$\begin{align} \frac{\partial{L(\theta_{l-1}, Q_l^{(i)})}}{\partial{ \theta_{A,l-1} }}= \frac{\sum_{i=1}^{n}\sum_{j=1}^{d}Q_{A,l}^{(i)}x_{i,j}}{\theta_{A,l-1}} + \frac{\sum_{i=1}^{n}\sum_{j=1}^{d}Q_{A,l}^{(i)}(x_{i,j}-1)}{1-\theta_{A,l-1}} =0 \\ \frac{\partial{L(\theta_{l-1}, Q_l^{(i)})}}{\partial{ \theta_{B,l-1} }}= \frac{\sum_{i=1}^{n}\sum_{j=1}^{d}Q_{B,l}^{(i)}x_{i,j}}{\theta_{B,l-1}} + \frac{\sum_{i=1}^{n}\sum_{j=1}^{d}Q_{B,l}^{(i)}(x_{i,j}-1)}{1-\theta_{B,l-1}} =0 \end{align}$$
      By solving these two equations, we get the update rule:
      $$ \theta_{A,l} = \frac{ \sum_{i}^{n} \sum_{j=1}^d Q_{A,l}^{(i)} x_{i,j} }{ \sum_{i}^{n}Q_{A,l}^{(i)}d}\\ \theta_{B,l} = \frac{ \sum_{i}^{n} \sum_{j=1}^d Q_{B,l}^{(i)} x_{i,j} }{ \sum_{i}^{n}Q_{B,l}^{(i)}d} $$
    • Update the prior \(P(z)\): \(P(z=A)=\frac{1}{n}\sum_{i=1}^{n}Q_{A,l}^{(i)}\), \(P(z=B)=\frac{1}{n}\sum_{i=1}^{n}Q_{B,l}^{(i)}\)

    Until \(\theta_A,\theta_B\) converges.

Implementation and Analysis

To test the effectiveness of the EM algorithm, I wrote a small demo for the coin tossing problem:

import numpy as np

## Define a tossing function, to generate our observations
## theta is the heads probability; num is the number of tosses in a single observation
def tossing( theta, num ):
	return (np.random.uniform(size=num)<theta).astype(np.int32)

## load_data generates a set of observations
## prior_coin_A is the prior of the hidden variable z;
## theta_A, theta_B are the heads probabilities of coins A and B respectively.
## this method returns a dataset X, without any explicit information about prior_coin_A, theta_A, theta_B
def load_data(  num_samples, prior_coin_A = 0.8 , theta_A=0.2, theta_B = 0.7, num_tossing_per_sample = 5 ):
	X=[]
	for _ in range(num_samples):
		random_v = np.random.uniform()
		if random_v < prior_coin_A:
			##generate a tossing observation using coin A
			X.append( tossing( theta_A, num_tossing_per_sample) )
		else:
			##generate a tossing observation using coin B
			X.append( tossing( theta_B, num_tossing_per_sample ) )

	X = np.asarray(X)
	return X

## The task of EM is to find the MLE of theta_A, theta_B using only the obtained observations X
def EM( X, epsilon = 1e-8, update_prior = True , is_return_prior_list = False):
	## initialization
	prior_coin_A = 0.5
	prior_coin_B = 1- prior_coin_A
	theta_A = np.random.uniform()
	theta_B = np.random.uniform()
	prior_coin_A_list=[prior_coin_A]
	prev_theta_A = theta_A
	prev_theta_B = theta_B
	count = 0
	while True:
		## E step:		
		P_X_with_z_eq_A = theta_A**( np.sum(X, axis=1) )* (1-theta_A)**(np.sum( 1-X, axis=1 ))
		P_X_with_z_eq_B = theta_B**( np.sum(X, axis=1) )* (1-theta_B)**(np.sum( 1-X, axis=1 ))
		Q_A = P_X_with_z_eq_A*prior_coin_A/(P_X_with_z_eq_A*prior_coin_A+P_X_with_z_eq_B*prior_coin_B)
		Q_B = P_X_with_z_eq_B*prior_coin_B/(P_X_with_z_eq_A*prior_coin_A+P_X_with_z_eq_B*prior_coin_B)
		## M step:
		theta_A =  np.sum( Q_A * np.sum(X,axis=1))/np.sum( X.shape[1]*Q_A)
		theta_B =  np.sum( Q_B * np.sum(X,axis=1))/np.sum(X.shape[1]*Q_B)
		if abs(theta_A- prev_theta_A) + abs(theta_B- prev_theta_B) < epsilon:
			break
		prev_theta_A = theta_A
		prev_theta_B = theta_B
		## update prior
		if update_prior:
			prior_coin_A= np.mean(Q_A)
			prior_coin_B = np.mean(Q_B)
		prior_coin_A_list.append(prior_coin_A)
	if is_return_prior_list:
		return theta_A, theta_B, {"prior_coin_A_list":np.array(prior_coin_A_list),"prior_coin_B_list":1-np.array(prior_coin_A_list)}
	else:
		return theta_A, theta_B

First, let's load the coin tossing data. The true prior distribution of $z$ is $P(z=A)=0.7$ and $P(z=B)=0.3$ (matching the true_prior_coin_A value below). For coin A, the true heads probability is 0.2; for coin B, it is 0.7. Each observation contains 10 tossing results.

true_prior_coin_A = 0.7
true_theta_A = 0.2
true_theta_B = 0.7
X = load_data(1000, prior_coin_A = true_prior_coin_A , theta_A=true_theta_A , theta_B = true_theta_B, num_tossing_per_sample = 10)

We can have a look at the loaded data (the first 10 observations)

print(X[:10])
[[0 1 1 1 1 1 1 1 1 0]
 [1 1 1 1 1 1 1 1 0 1]
 [0 1 1 0 1 1 1 1 1 1]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 0 1 0 0 0 0]
 [0 0 0 1 1 0 0 1 1 1]
 [0 0 0 0 0 0 0 0 1 0]
 [0 1 0 0 1 0 0 1 1 1]
 [1 1 0 1 0 1 0 0 0 1]
 [0 0 0 0 0 0 0 1 0 1]]

The influence of dynamically updating the prior (or not)

EM algorithm with dynamically updated prior distribution of $z$

estimated_theta_A,estimated_theta_B, params = EM(X, update_prior=True, is_return_prior_list=True)
print("Estimated theta_A: %.4f, Estimated theta_B: %.4f"%( estimated_theta_A, estimated_theta_B))
Estimated theta_A: 0.1923, Estimated theta_B: 0.7053

Wow, the estimated theta_A is almost equal to the true theta_A (0.2), and the same holds for the estimated theta_B. Note that the EM output may sometimes be "Estimated theta_A: 0.7, Estimated theta_B: 0.2". This is fine, because EM doesn't know that estimated theta_A literally corresponds to coin A; it only knows that there are two coins, one with heads probability 0.7 and another one with 0.2.

EM algorithm with fixed prior distribution of $z$: $P(z=A)=0.5$ and $P(z=B)=0.5$.

estimated_theta_A,estimated_theta_B = EM(X, update_prior=False)
print("Estimated theta_A: %.4f, Estimated theta_B: %.4f"%( estimated_theta_A, estimated_theta_B))
Estimated theta_A: 0.6621, Estimated theta_B: 0.1762

From this result it is obvious that if we use a fixed prior distribution of $z$ that is quite different from the true prior, the final estimates of the model parameters will be less accurate.

In fact, if we dynamically update the prior and check how the prior distribution changes, we will see that it gradually approaches the true prior. This can be shown by plotting the prior_coin_A_list variable:

import matplotlib.pyplot as plt
plt.plot( params["prior_coin_A_list"] )
plt.plot( np.ones_like( params["prior_coin_A_list"])*true_prior_coin_A )
plt.legend(["dynamically updated prior","true prior"])
plt.ylabel("$p(z=A)$")
plt.xlabel("iteration")
plt.show()

[Figure: dynamically updated prior vs. true prior over iterations]

We can see that the prior gradually approaches the true prior, as expected. However, we also notice that some gap always remains. This might be explainable analytically; I will try to think about this in the future.

Conclusion

  1. The EM algorithm does work on this example.
  2. To better estimate the parameters, it’s advisable to dynamically update the prior distribution of the hidden variables.

An Introduction to Support Vector Machines (SVM): A Python Implementation (2019-07-04)
https://nianlonggu.github.io/posts/2019/07/blog-post-8

The Jupyter notebook is available at my github: SupportVectorMachine/SVM-Tutorial

First of all, we need to implement the SVM solver based on the SMO algorithm.

import numpy as np
import matplotlib.pyplot as plt
import os
from keras.datasets import mnist
Using TensorFlow backend.

We define some auxiliary functions. The following two functions compute the kernel function and the kernel matrix.

""" kernel part for SVM """
def kernel_func(x1,x2, kernel_type=None):
	if kernel_type is None:
		return np.dot( x1,x2)
	elif kernel_type["name"]=="GAUSSIAN":
		sigma = kernel_type["params"][0]
		return np.exp(- np.dot( x1-x2, x1-x2 )/(2*sigma**2)  )

def get_kernel_matrix( x1, x2, kernel_type=None ):
	num_samples_x1 = x1.shape[0]
	num_samples_x2 = x2.shape[0]
	kernel_matrix = np.zeros([num_samples_x1, num_samples_x2])
	for nrow in range(num_samples_x1 ):
		for ncol in range(num_samples_x2  ):
			kernel_matrix[nrow][ncol] = kernel_func(x1[nrow] , x2[ncol], kernel_type = kernel_type)
	return kernel_matrix
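As a side note, the double Python loop above is slow for large datasets; the Gaussian kernel matrix can also be computed with NumPy broadcasting (a sketch, assuming the same sigma parameterization as kernel_func):

```python
import numpy as np

def gaussian_kernel_matrix(x1, x2, sigma):
    """Vectorized Gaussian kernel: K[i, j] = exp(-||x1_i - x2_j||^2 / (2 sigma^2))."""
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    sq1 = np.sum(x1 ** 2, axis=1)[:, None]
    sq2 = np.sum(x2 ** 2, axis=1)[None, :]
    sq_dists = sq1 + sq2 - 2.0 * (x1 @ x2.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(5, 3))
x2 = rng.normal(size=(4, 3))
K = gaussian_kernel_matrix(x1, x2, sigma=1.0)
```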

Then we implement the SVM solver itself, which is encapsulated in a class.

"""
Description: A SVM solver
Input:  training dataset (x,y), together with other hype-parameters
Return: a trained SVM model (solver) which is able to perform classification for a give x
"""
class SVM_Solver:
	def __init__(self, kernel_type=None , C=10):

		self.support_ind = None
		self.support_x = None
		self.support_y = None
		self.support_lamb = None
		self.kernel_type= kernel_type
		self.C = C
		self.count = 0
		self.objective_func = -np.Inf
		self.lamb = None
		self.param_b = None

	## Predict labels for x with the trained SVM
	def predict(self,x, decision_mode = "hard"):
		def decision_func(z):
			if decision_mode == "soft":
				if z<-1:
					return -1
				elif z>1:
					return 1
				else:
					return z
			elif decision_mode == "hard":
				if z<0:
					return -1
				else:
					return 1
		K = get_kernel_matrix(self.support_x, x, kernel_type = self.kernel_type )
		pred_y = []
		for ind in range(x.shape[0]):
			z= np.dot( self.support_lamb* self.support_y, K[:,ind] ) +  self.param_b  
			pred_y.append(decision_func(z))	
		return np.array(pred_y)
		
		"""Training the SVM model, which uses x, y and validation set x_val, y_val
        max_iter is the maximum iteration to train;
        epsilon is use to determine when the training is terminated -- the change of objective
        function is less than epsilon
        """
	def train( self, x, y, x_val, y_val, max_iter= 1E6, epsilon= 1E-4 ):
	
		num_samples = x.shape[0]
		"""Solve the dual problem using SMO"""
		## Initialization
		K=get_kernel_matrix(x,x, kernel_type = self.kernel_type )	
		C = self.C
		if self.lamb is None:
			self.lamb = np.zeros(num_samples)
		if self.param_b is None:
			self.param_b = np.random.normal()
		## Start looping:
		## looping parameters:

		local_count =0
		##Here is the part of the SMO algorithm
		while True:
			## randomly select a pair (a,b) to optimize
			[a,b] = np.random.choice( num_samples, 2, replace= False )
			if K[a,a] + K[b,b] - 2*K[a,b] ==0:
				continue	

			lamb_a_old = self.lamb[a]
			lamb_b_old = self.lamb[b]	

			Ea =  np.dot(self.lamb * y, K[:,a]) + self.param_b - y[a]
			Eb =  np.dot(self.lamb * y, K[:,b]) + self.param_b - y[b]	

			lamb_a_new_unclip = lamb_a_old  + y[a] *(Eb-Ea)/( K[a,a] + K[b,b] - 2*K[a,b] )
			xi = - lamb_a_old  * y[a] - lamb_b_old * y[b]	

			if y[a] != y[b]:
				L = max( xi * y[b], 0 )
				H = min( C+xi*y[b], C )
			else:
				L = max( 0, -C-xi*y[b])
				H = min( C, -xi*y[b] )	

			if lamb_a_new_unclip < L:
				lamb_a_new = L
			elif lamb_a_new_unclip > H:
				lamb_a_new = H
			else:
				lamb_a_new = lamb_a_new_unclip	

			lamb_b_new = lamb_b_old + ( lamb_a_old - lamb_a_new )*y[a] * y[b]
			if lamb_a_new >0 and lamb_a_new <C:
				self.param_b =  self.param_b - Ea + ( lamb_a_old- lamb_a_new)*y[a]*K[a,a] + (lamb_b_old - lamb_b_new)*y[b] * K[b,a]
			elif lamb_b_new >0 and lamb_b_new <C:
				self.param_b = self.param_b - Eb + ( lamb_a_old- lamb_a_new)*y[a]*K[a,b] + (lamb_b_old - lamb_b_new)*y[b] * K[b,b]	

			self.lamb[a] = lamb_a_new
			self.lamb[b] = lamb_b_new	

			self.count +=1
			local_count +=1

			"""Every 10000 iterations record the current progree of the training,
            and determine whether to stop the training.
            """
			if local_count >= max_iter or self.count % 10000 ==0:
				## get the support set
				self.support_ind =  self.lamb > 0
				self.support_x = x[self.support_ind]
				self.support_y = y[self.support_ind]
				self.support_lamb = self.lamb[self.support_ind]	
	
				## Evaluate the performance (accuracy) on training set and validation set
				pred_y=self.predict(x)
				train_acc =  np.sum( pred_y == y)/ y.shape[0]
				pred_y=self.predict(x_val)
				val_acc =  np.sum( pred_y == y_val  )/ y_val.shape[0]

				support_K = K[ self.support_ind,: ][:, self.support_ind]
				new_objective_func = np.sum( self.support_lamb ) - 0.5 * np.dot( np.matmul( ( self.support_lamb *self.support_y ).T, support_K ).T , self.support_lamb* self.support_y  ) 

				## support ratio represents the percentage of the points which are support vectors
				support_ratio = np.sum( self.support_ind )/ self.support_ind.shape[0] 

				print("Iteration: %d, \tTrain accuracy: %.2f%%, \tVal accuracy: %.2f%%, \tDelta Objective Function: %f, \tSupport Ratio: %.2f%%"%(self.count, train_acc*100, val_acc*100, new_objective_func - self.objective_func, support_ratio *100 ))
				
				## If the change of dual objective function is less than epsilon, then stop training
				if abs( new_objective_func - self.objective_func ) <= epsilon:
					break
				else:
					self.objective_func = new_objective_func
				
				if local_count >= max_iter:
					break

Define some auxiliary functions: one to compute the distance matrix (used to estimate the sigma of the Gaussian kernel), one to generate folders, and one to plot the results.

def distance_matrix( x,y, metric = "Euclidean" ):
	def distance( a,b ):
		if metric == "Euclidean":
			return np.linalg.norm(a-b)
	n_row = x.shape[0]
	n_col = y.shape[0]
	dis_matrix = np.zeros([n_row, n_col] )
	for r in range( n_row ):
		for c in range(n_col ):
			dis_matrix[r][c] = distance( x[r], y[c])
	return dis_matrix
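The double loop above is simple but slow for large datasets. Below is a vectorized alternative (a sketch using NumPy broadcasting, assuming the Euclidean metric) that computes the same matrix:

```python
import numpy as np

def distance_matrix_vectorized(x, y):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    sq_x = np.sum(x ** 2, axis=1)[:, None]   # shape [n_row, 1]
    sq_y = np.sum(y ** 2, axis=1)[None, :]   # shape [1, n_col]
    sq_dist = sq_x + sq_y - 2.0 * np.matmul(x, y.T)
    # clip tiny negative values caused by floating-point error before the sqrt
    return np.sqrt(np.maximum(sq_dist, 0.0))
```

The result matches the looped version up to floating-point error, while avoiding the Python-level double loop.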

def generate_folder(path):
	if not os.path.exists(path):
		os.makedirs(path)
	return path

# plot_results is used to plot the results on the training dataset, e.g., what the separating hyperplane looks
# like, how the support vectors are distributed, and whether the points are correctly classified
def plot_results( x,y, support_ind, pred_y, title = "", img_save_path = None , show_img = True ):

	fig, ax = plt.subplots()


	x_low_dim = x[:,:2]

	x_support =  x[support_ind]
	y_support = y[support_ind]
	pred_y_support = pred_y[support_ind]
	x_support_low_dim = x_low_dim[support_ind]

	for ind in range(x.shape[0]):
		if y[ind] == 1:
			mshape = "^"
		else:
			mshape = "o"
		if pred_y[ind] == 1:
			color = "r"
		else:
			color = "b"

		plt.plot(x_low_dim[ind,0], x_low_dim[ind,1], mshape, c= color, markerfacecolor='none', markeredgewidth=0.4, markersize =4)

	for ind in range(x_support.shape[0]):
		if y_support[ind] == 1:
			mshape = "^"
		else:
			mshape = "o"
		if pred_y_support[ind] == 1:
			color = "r"
		else:
			color = "b"

		plt.plot(x_support_low_dim[ind,0], x_support_low_dim[ind,1], mshape, c= color, markersize =4)

	for ind in range(x.shape[0]):
		if y[ind]!= pred_y[ind]:
			plt.plot(x_low_dim[ind,0], x_low_dim[ind,1], "o", c= "g", markersize =9, markerfacecolor='none')

	plt.xlabel("x")
	plt.ylabel("y")
	plt.xlim([min(x_low_dim[:,0])-0.5, max(x_low_dim[:,0])+0.5 ])
	plt.ylim([min(x_low_dim[:,1])-0.5, max(x_low_dim[:,1])+0.5 ])

	plt.title(title)
	if img_save_path is not None:
		plt.savefig( img_save_path )
	if show_img:
		plt.show()

	plt.close()

Test the kernel SVM on linearly non-separable data

First we load the data

def load_data(num_samples = 1000):
	x1 = []
	x2 = []
	for _ in range(num_samples):
		while True:
			r_x = np.random.multivariate_normal( [0,1], [[20,0],[0,1]], 1 )
			if r_x[0,1]>np.sin( r_x[0,0] )+0.5:
				x1.append( r_x )
				break
		while True:
			r_x = np.random.multivariate_normal( [0,-1], [[20,0],[0,1]], 1 )
			if r_x[0,1]<np.sin( r_x[0,0] ):
				x2.append( r_x )
				break

	x1 = np.concatenate( x1, axis =0 )
	x2 = np.concatenate( x2, axis =0)
	y1 = np.ones([num_samples]) *-1
	y2 = np.ones([num_samples]) *1
	x = np.concatenate([x1,x2], axis =0)
	y = np.concatenate([y1,y2], axis =0)

	return x, y

What does the loaded data look like? Let’s plot it.

x,y = load_data(500)
x_val, y_val = load_data(100)
## x,y are used for training, and x_val, y_val are used for validation
x_pos = x[y==1]
x_neg = x[y==-1]
plt.plot( x_pos[:,0], x_pos[:,1], "^", markerfacecolor='none' )
plt.plot( x_neg[:,0], x_neg[:,1], "o", markerfacecolor='none' )
plt.show()

[Figure: scatter plot of the two classes, which are clearly not linearly separable]

From the figure, we can see that the points of the two classes are clearly not linearly separable, so we need a kernel SVM; here we use the Gaussian kernel. Note that the Gaussian kernel has a parameter, sigma, which represents the standard deviation.

The SVM results are very sensitive to the selection of sigma.

  1. If sigma is too small, the model is going to overfit. The training/validation accuracy may be high, but the percentage of support vectors will be extremely high – almost every point becomes a support vector!
  2. If sigma is too large, the model is going to underfit. The percentage of support vectors “might” be lower. However, the training/validation accuracy is low, which means the separating hyperplane is not accurately learned.

In this experiment, I used an empirical way to estimate the sigma:

Given the training dataset $X$, we use the function distance_matrix($X$,$X$) to compute the matrix of pairwise distances between the elements of $X$. We then take the average of these pairwise distances and scale it by a factor, such as 0.5, to obtain the value of $\sigma$.

estimated_sigma = np.mean( distance_matrix( x,x ) ) * 0.5
print(estimated_sigma)
svm= SVM_Solver( kernel_type = {"name":"GAUSSIAN", "params":[estimated_sigma] } )
2.800322186496282

Then, we can train our SVM model!

svm.train(x,y, x_val, y_val, max_iter = 200000)
Iteration: 10000, 	Train accuracy: 99.10%, 	Val accuracy: 98.00%, 	Delta Objective Function: inf, 	Support Ratio: 19.40%
Iteration: 20000, 	Train accuracy: 99.80%, 	Val accuracy: 99.00%, 	Delta Objective Function: 51.337039, 	Support Ratio: 16.10%
Iteration: 30000, 	Train accuracy: 99.90%, 	Val accuracy: 99.50%, 	Delta Objective Function: 26.450561, 	Support Ratio: 14.00%
Iteration: 40000, 	Train accuracy: 99.40%, 	Val accuracy: 99.50%, 	Delta Objective Function: 10.510557, 	Support Ratio: 14.30%
Iteration: 50000, 	Train accuracy: 99.00%, 	Val accuracy: 98.00%, 	Delta Objective Function: 8.315844, 	Support Ratio: 13.70%
Iteration: 60000, 	Train accuracy: 99.00%, 	Val accuracy: 99.00%, 	Delta Objective Function: 6.098615, 	Support Ratio: 12.70%
Iteration: 70000, 	Train accuracy: 99.70%, 	Val accuracy: 99.00%, 	Delta Objective Function: 6.163077, 	Support Ratio: 12.20%
Iteration: 80000, 	Train accuracy: 100.00%, 	Val accuracy: 99.50%, 	Delta Objective Function: 4.421311, 	Support Ratio: 12.40%
Iteration: 90000, 	Train accuracy: 99.80%, 	Val accuracy: 99.00%, 	Delta Objective Function: 4.129242, 	Support Ratio: 11.40%
Iteration: 100000, 	Train accuracy: 99.80%, 	Val accuracy: 99.00%, 	Delta Objective Function: 1.332299, 	Support Ratio: 10.50%
Iteration: 110000, 	Train accuracy: 100.00%, 	Val accuracy: 100.00%, 	Delta Objective Function: 1.195627, 	Support Ratio: 10.20%
Iteration: 120000, 	Train accuracy: 99.90%, 	Val accuracy: 100.00%, 	Delta Objective Function: 2.248688, 	Support Ratio: 10.50%
Iteration: 130000, 	Train accuracy: 100.00%, 	Val accuracy: 100.00%, 	Delta Objective Function: 2.209730, 	Support Ratio: 9.80%
Iteration: 140000, 	Train accuracy: 100.00%, 	Val accuracy: 100.00%, 	Delta Objective Function: 0.971279, 	Support Ratio: 9.60%
Iteration: 150000, 	Train accuracy: 100.00%, 	Val accuracy: 99.50%, 	Delta Objective Function: 1.393107, 	Support Ratio: 9.50%
Iteration: 160000, 	Train accuracy: 99.90%, 	Val accuracy: 99.50%, 	Delta Objective Function: 0.346211, 	Support Ratio: 8.90%
Iteration: 170000, 	Train accuracy: 99.80%, 	Val accuracy: 99.50%, 	Delta Objective Function: 0.258947, 	Support Ratio: 8.70%
Iteration: 180000, 	Train accuracy: 100.00%, 	Val accuracy: 100.00%, 	Delta Objective Function: 0.207311, 	Support Ratio: 8.60%
Iteration: 190000, 	Train accuracy: 99.90%, 	Val accuracy: 99.00%, 	Delta Objective Function: 0.738704, 	Support Ratio: 8.30%
Iteration: 200000, 	Train accuracy: 100.00%, 	Val accuracy: 100.00%, 	Delta Objective Function: 0.676768, 	Support Ratio: 8.30%

Ok, let’s have a look of the results by plotting it!

plot_results( x,y, svm.support_ind, pred_y= svm.predict(x), title = "",show_img = True )

[Figure: training data with the learned classification; support vectors are drawn with solid markers, misclassified points are circled]

Several conclusions can be drawn:

  1. The training dataset is correctly separated, which implies an accurate separating hyperplane;
  2. The support vectors (points with solid color) are only located around the margin area, which indicates a good choice of the kernel parameter. This SVM is neither over-fitted nor under-fitted.

We can further evaluate the accuracy on test dataset:

x_test, y_test = load_data( 500 )

pred_y=svm.predict(x_test)
test_acc =  np.sum( pred_y == y_test  )/ y_test.shape[0]
print("Test Accuracy: %.2f%%"%(test_acc *100))
Test Accuracy: 99.50%

Test the kernel SVM on MNIST for classification

To further prove the effectiveness of the SVM model, we test it on a slightly more complex problem: distinguishing between digit “4” and digit “9” using SVM.

First of all, we need to load and prepare the data

## load data
(mnist_x, mnist_y), _ = mnist.load_data()

## extract the digit "4" images (positive 1) and digit "9" images (negative 1)
x_pos= mnist_x[mnist_y == 4]
y_pos= np.ones( x_pos.shape[0] )
x_neg= mnist_x[mnist_y == 9]
y_neg= np.ones( x_neg.shape[0] ) *(-1)

## Put both positive/negative samples together to get the train/val/test dataset
x = np.concatenate( [ x_pos, x_neg ], axis =0 )
x = np.reshape(x, [x.shape[0],-1] )/255   ## normalization
y = np.concatenate( [ y_pos, y_neg], axis =0 )

## randomly shuffle
random_indx = np.random.permutation( np.arange( x.shape[0] ) )
x = x[random_indx]
y = y[random_indx]

## get x,y x_val, y_val, x_test, y_test
x_val = x[:500]
y_val = y[:500]
x_test = x[500:1000]
y_test = y[500:1000]
x = x[1000:2000]
y = y[1000:2000]

Train the SVM model

## Estimate the value of sigma from the pairwise distances of the training data
sigma_mnist = np.mean( distance_matrix( x,x ) )*0.5

svm_mnist = SVM_Solver( kernel_type = {"name":"GAUSSIAN", "params":[sigma_mnist]} )
svm_mnist.train( x,y, x_val, y_val )
Iteration: 10000, 	Train accuracy: 98.70%, 	Val accuracy: 96.40%, 	Delta Objective Function: inf, 	Support Ratio: 34.50%
Iteration: 20000, 	Train accuracy: 98.60%, 	Val accuracy: 96.00%, 	Delta Objective Function: 92.534743, 	Support Ratio: 29.50%
Iteration: 30000, 	Train accuracy: 98.60%, 	Val accuracy: 95.80%, 	Delta Objective Function: 17.440160, 	Support Ratio: 27.20%
Iteration: 40000, 	Train accuracy: 98.80%, 	Val accuracy: 95.60%, 	Delta Objective Function: 5.375707, 	Support Ratio: 25.50%
Iteration: 50000, 	Train accuracy: 98.90%, 	Val accuracy: 95.40%, 	Delta Objective Function: 2.259709, 	Support Ratio: 24.40%
Iteration: 60000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 1.122565, 	Support Ratio: 23.70%
Iteration: 70000, 	Train accuracy: 98.70%, 	Val accuracy: 94.20%, 	Delta Objective Function: 0.762315, 	Support Ratio: 23.90%
Iteration: 80000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.425399, 	Support Ratio: 23.60%
Iteration: 90000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.291108, 	Support Ratio: 23.30%
Iteration: 100000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.258185, 	Support Ratio: 22.80%
Iteration: 110000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.179889, 	Support Ratio: 22.80%
Iteration: 120000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.140847, 	Support Ratio: 22.30%
Iteration: 130000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.071885, 	Support Ratio: 22.70%
Iteration: 140000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.037175, 	Support Ratio: 22.50%
Iteration: 150000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.056052, 	Support Ratio: 22.30%
Iteration: 160000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.032109, 	Support Ratio: 22.20%
Iteration: 170000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.021068, 	Support Ratio: 22.30%
Iteration: 180000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.020332, 	Support Ratio: 22.30%
Iteration: 190000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.019551, 	Support Ratio: 22.30%
Iteration: 200000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.009766, 	Support Ratio: 22.20%
Iteration: 210000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.010782, 	Support Ratio: 22.40%
Iteration: 220000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.011163, 	Support Ratio: 22.30%
Iteration: 230000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.009348, 	Support Ratio: 22.20%
Iteration: 240000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.006827, 	Support Ratio: 22.40%
Iteration: 250000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.004162, 	Support Ratio: 22.20%
Iteration: 260000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.003338, 	Support Ratio: 22.40%
Iteration: 270000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.002217, 	Support Ratio: 22.40%
Iteration: 280000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.001653, 	Support Ratio: 22.10%
Iteration: 290000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.001823, 	Support Ratio: 22.00%
Iteration: 300000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.001461, 	Support Ratio: 22.20%
Iteration: 310000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.001170, 	Support Ratio: 22.10%
Iteration: 320000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000930, 	Support Ratio: 22.00%
Iteration: 330000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000714, 	Support Ratio: 22.10%
Iteration: 340000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000511, 	Support Ratio: 22.10%
Iteration: 350000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000345, 	Support Ratio: 21.90%
Iteration: 360000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000353, 	Support Ratio: 22.00%
Iteration: 370000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000227, 	Support Ratio: 21.90%
Iteration: 380000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000179, 	Support Ratio: 22.10%
Iteration: 390000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000226, 	Support Ratio: 21.90%
Iteration: 400000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000133, 	Support Ratio: 22.00%
Iteration: 410000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000229, 	Support Ratio: 21.90%
Iteration: 420000, 	Train accuracy: 98.80%, 	Val accuracy: 95.40%, 	Delta Objective Function: 0.000100, 	Support Ratio: 21.90%

The Delta Objective Function is now less than epsilon = 1E-4, so the training is finished. Now we can evaluate the classification results on the test dataset.

pred_y=svm_mnist.predict(x_test)
test_acc =  np.sum( pred_y == y_test  )/ y_test.shape[0]
print("Test Accuracy: %.2f%%"%(test_acc *100))
Test Accuracy: 95.60%

Moreover, we can also look at support_x to see what the support vectors look like:

support_x_pos = svm_mnist.support_x[ svm_mnist.support_y==1 ]
support_x_neg = svm_mnist.support_x[ svm_mnist.support_y==-1 ]

fig=plt.figure(figsize=(10, 2), dpi= 80, facecolor='w', edgecolor='k')
plt.gray()
for i in range( 10 ):
    plt.subplot(2,10,i+1)
    plt.imshow( np.reshape(support_x_pos[i,:],[28,28] ))
    plt.title("digit 4")
    plt.axis('off')
    
for i in range( 10 ):
    plt.subplot(2,10,i+11)
    plt.imshow( np.reshape(support_x_neg[i,:],[28,28] ))
    plt.title("digit 9")
    plt.axis('off')

plt.subplots_adjust(wspace=1,  hspace=1)
plt.show()

[Figure: sample support vectors – digit 4 (top row) and digit 9 (bottom row)]

From the results we can see that some support vectors are rather ambiguous. E.g., the 9th digit “4” also looks like a “9”, and the second digit “9” also looks like a “4”. The SVM model is sensitive to such ambiguous samples and tends to use them as support vectors to determine the separating hyperplane.

Conclusion

In this long series, we derived the mathematical principles of SVM and several ways to solve the optimization problem. We also showed the implementation and its performance on some small but interesting examples. Hopefully this is helpful.

Reference

  1. Andrew Ng’s course on Machine Learning at Stanford University
  2. Prof. Mathar Rudolf’s course on Fundamental of Big Data Analytics
  3. Why does the Gaussian kernel in SVM map the original features to an infinite-dimensional space? (Zhihu)
  4. Machine learning algorithms in practice: the SMO algorithm in SVM (Zhihu)

An Introduction to Support Vector Machines (SVM): Sequential Minimal Optimization (SMO)
2019-06-28T00:00:00-07:00
https://nianlonggu.github.io/posts/2019/06/blog-post-7

Recall the Kernel SVM dual problem:

Dual Problem

$$ \max_{\lambda, \mu} L(\lambda)= \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j K_{i,j} $$$$ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

We have introduced the gradient descent algorithm for solving the dual problem. However, computing the full gradient has a high time and memory complexity, which becomes a challenge when the training dataset is large. In this post, I introduce an efficient and lightweight algorithm for solving the dual problem: Sequential Minimal Optimization (SMO).

Sequential Minimal Optimization (SMO)

The algorithm of SMO is:

Initialization: let \(\{\lambda_i\}, i=1,\dots,n\) be a set which satisfies the dual constraint.
Repeat:

  • (1) heuristically select two \(\lambda_a, \lambda_b\), and keep all the other \(\lambda_i (i\neq a,b)\) fixed;
  • (2) optimize \(L(\lambda)\) with respect to \(\lambda_a, \lambda_b\);

Until: the KKT conditions are satisfied with a certain accuracy.

First question about the initialization: how do we find a set \(\{\lambda_i\}\) which satisfies the dual constraints?
The answer is to simply set \(\lambda_i=0\) for \(i=1,\dots,n\).

Suppose that we have finished the initialization, and pick up a pair \(\lambda_a, \lambda_b\) to optimize while keeping \(\lambda_i (i\neq a,b)\) fixed, then we have

$$ \begin{align} L(\lambda) =& \lambda_a + \lambda_b -\frac{1}{2} \lambda_a^2 K_{a,a} - \frac{1}{2} \lambda_b^2 K_{b,b} - \lambda_a \lambda_b y_a y_b K_{a,b} \\ & - \sum_{i\neq a,b} \lambda_a \lambda_i y_a y_i K_{a,i} - \sum_{i \neq a,b} \lambda_b \lambda_i y_b y_i K_{b,i} + Const \end{align} $$

Moreover, according to the dual constraints, we have

$$ \lambda_a y_a + \lambda_b y_b = -\sum_{i\neq a,b} \lambda_i y_i = - \xi\\ \lambda_b y_b = -\lambda_a y_a -\xi\\ \lambda_b = -\lambda_a y_a y_b -\xi y_b $$

So we have

$$ \begin{align} L(\lambda) =& \lambda_a -\lambda_a y_a y_b - \xi y_b - \frac{1}{2}\lambda_a^2 K_{a,a} -\frac{1}{2}(\lambda_a y_a + \xi)^2 K_{b,b} \\ & + \lambda_a y_a ( \lambda_a y_a + \xi ) K_{a,b} - \sum_{i\neq a,b} \lambda_a y_a \lambda_i y_i K_{a,i}\\ & + \sum_{i\neq a,b}(\lambda_a y_a + \xi)\lambda_i y_i K_{b,i} + Const \end{align} $$

\(L(\lambda)\) is concave with respect to \(\lambda_a\), since \(\frac{\partial^2{L}}{\partial{\lambda_a^2}}= -( K_{a,a} + K_{b,b} - 2K_{a,b} )=-(e_a - e_b)^T \mathbf{K} (e_a - e_b) \leq 0\) due to the fact that the kernel matrix \(\mathbf{K}\) is nonnegative definite (see last post An Introduction to Support Vector Machines (SVM): kernel functions ). Therefore, we can find the optimal value of \(\lambda_a\) which maximizes \(L(\lambda)\) by computing the gradient and set it to 0.
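This nonnegativity can be checked numerically. The sketch below builds a random nonnegative-definite matrix \(\mathbf{K} = A^T A\) and verifies, for every pair \((a,b)\), that \(K_{a,a} + K_{b,b} - 2K_{a,b} = (e_a - e_b)^T \mathbf{K} (e_a - e_b) \geq 0\):

```python
import numpy as np

def curvature(K, a, b):
    # K_aa + K_bb - 2 K_ab, i.e. minus the second derivative of L w.r.t. lambda_a
    return K[a, a] + K[b, b] - 2.0 * K[a, b]

# build a random nonnegative-definite matrix K = A^T A and check every pair
rng = np.random.RandomState(0)
A = rng.randn(10, 6)
K = A.T @ A
n = K.shape[0]
for a in range(n):
    for b in range(n):
        e = np.zeros(n)
        e[a] += 1.0
        e[b] -= 1.0
        # the identity (e_a - e_b)^T K (e_a - e_b) = K_aa + K_bb - 2 K_ab
        assert np.isclose(curvature(K, a, b), e @ K @ e)
        assert curvature(K, a, b) >= -1e-9   # nonnegative up to rounding
```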

$$ \begin{align} \frac{\partial{L(\lambda)}}{\partial{\lambda_a}} =& 1 - y_a y_b -\lambda_a K_{a,a} - (\lambda_a y_a +\xi)y_a K_{b,b} + 2\lambda_a K_{a,b} \\ &+ y_a \xi K_{a,b} - \sum_{i\neq a,b} y_a \lambda_i y_i K_{a,i} + \sum_{i \neq a,b}y_a \lambda_i y_i K_{b,i}\\ =& 0 \end{align} $$

By solving this equation, we will get the solution for \(\lambda_a^\star\):

$$ \lambda_a^{\text{new}} = \frac{ 1-y_a y_b - \xi y_a K_{b,b} + y_a \xi K_{a,b} - \sum_{i \neq a,b} y_a \lambda_i y_i K_{a,i} +\sum_{i\neq a,b}y_a \lambda_i y_i K_{b,i} }{ K_{a,a} + K_{b,b} -2K_{a,b} } $$

The numerator is too complicated to compute directly, since it contains many terms. Next, we show that \(\lambda_a^\text{new}, \lambda_b^\text{new}\) can in fact be computed from the old values \(\lambda_a^\text{old}, \lambda_b^\text{old}\).

Before updating the values of \(\lambda_a, \lambda_b\), we first use the old values of \(\lambda\) to perform classification on the data points \(\mathbf{x}_ a, \mathbf{x}_ b\):

$$ \begin{align} \hat{y}_a &= \sum_{i\neq a,b}\lambda_i y_i K_{i,a} + \lambda_a^\text{old} y_a K_{a,a} + \lambda_b^\text{old} y_b K_{b,a} + b^\text{old}\\ \hat{y}_b &= \sum_{i\neq a,b}\lambda_i y_i K_{i,b} + \lambda_a^\text{old} y_a K_{a,b} + \lambda_b^\text{old} y_b K_{b,b} + b^\text{old}\\ \end{align} $$

Based on the expressions of \(\hat{y}_a, \hat{y}_b\), we can derive the following equation:

$$ \begin{align} &y_a[ (\hat{y}_b - y_b) - (\hat{y}_a - y_a) ]\\ = & \sum_{i\neq a,b}y_a \lambda_i y_i K_{i,b} + \lambda_a^\text{old} K_{a,b} + \lambda_b^\text{old} y_a y_b K_{b,b} + y_a b^\text{old} - y_a y_b \\ \ & - \sum_{i \neq a,b}y_a \lambda_i y_i K_{i,a} - \lambda_a^\text{old}K_{a,a} - \lambda_b^\text{old} y_a y_b K_{b,a} - y_a b^\text{old} +1\\ =& \sum_{i\neq a,b} y_a \lambda_i y_i K_{i,b} + \lambda_a^\text{old} K_{a,b} - \xi y_a K_{b,b} - \lambda_a^\text{old} K_{b,b}- y_a y_b \\ \ & - \sum_{i \neq a,b}y_a \lambda_i y_i K_{i,a} - \lambda_a^\text{old}K_{a,a} + \lambda_a^\text{old} K_{a,b} + \xi y_a K_{a,b} +1 \\ =& 1- y_a y_b - \xi y_a K_{b,b} + \xi y_a K_{a,b} - \sum_{i \neq a,b}y_a \lambda_i y_i K_{i,a} +\sum_{i\neq a,b} y_a \lambda_i y_i K_{i,b}\\ \ & -\lambda_a^\text{old}( K_{a,a} + K_{b,b} - 2K_{a,b} )\\ = & \lambda_a^{\text{new}}(K_{a,a} + K_{b,b} -2K_{a,b})-\lambda_a^\text{old}( K_{a,a} + K_{b,b} - 2K_{a,b} ) \end{align} $$

We denote prediction error \(E_i= \hat{y}_i - y_i\), then we have the expression of \(\lambda_a^\text{new}\):

$$ \lambda_a^\text{new} = \lambda_a^\text{old} + \frac{y_a(E_b - E_a)}{K_{a,a} +K_{b,b} - 2K_{a,b} } $$

Discussion: what if \(K_{a,a} +K_{b,b} - 2K_{a,b}=0\)? In this case \(L(\lambda)\) is a first-degree (affine) function of \(\lambda_a\), which is still concave, but the expression for \(\lambda_a^\text{new}\) is no longer meaningful. We then simply select another pair \((\lambda_a, \lambda_b)\) and repeat the computation above.

Note that the expression of \(\lambda_a^\text{new}\) above is not yet clipped, so for clarity we name it \(\lambda_a^\text{new, unclipped}\). Computing \(\lambda_a^\text{new, unclipped}\) alone is inadequate: we need to further clip it to the meaningful domain determined by the dual constraints, under which each \(\lambda_i\) has a box constraint. So we have:

$$ 0\leq \lambda_a \leq C\\ 0\leq \lambda_b \leq C\\ \lambda_b = -\lambda_a y_a y_b - \xi y_b $$

We know that \(y_i \in \{-1, +1\}\). Based on whether \(y_a = y_b\) or not, we can have the relationship between \(\lambda_a\) and \(\lambda_b\) with box constraints, shown in the figure below.

Relationship between \(\lambda_a\) and \(\lambda_b\) with box constraints.

According to the figure, we can get the lower bound \(L\) and higher bound \(H\) for a meaningful solution of a new \(\lambda_a\):

  1. if \(y_a \neq y_b\):
    $$ L = \max(\xi y_b, 0)$$$$ H = \min(C+\xi y_b, C ) $$
  2. if \(y_a = y_b\):
    $$ L = \max(0, -C-\xi y_b)$$$$ H = \min(C, -\xi y_b) $$
    Based on \(L\) and \(H\), we can get the clipped new \(\lambda_a\):
$$ \lambda_a^\text{new, clipped} = \begin{cases} L, &\ \text{if}\ \lambda_a^\text{new, unclipped} < L \\ H, &\ \text{if}\ \lambda_a^\text{new, unclipped} > H \\ \lambda_a^\text{new, unclipped}, &\ \text{otherwise} \end{cases} $$

This \(\lambda_a^\text{new, clipped}\) is the final meaningful new value of \(\lambda_a\). For simplicity, in the following we use \(\lambda_a^\text{new}\) to refer \(\lambda_a^\text{new, clipped}\).
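In code, this clipping is a single call to `np.clip` (a sketch; `lamb_a_unclipped`, `L` and `H` stand for the quantities defined above):

```python
import numpy as np

def clip_lambda(lamb_a_unclipped, L, H):
    # project the unconstrained maximizer back onto the feasible segment [L, H]
    return float(np.clip(lamb_a_unclipped, L, H))
```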

After getting \(\lambda_a^\text{new}\), we need to compute \(\lambda_b^\text{new}\):

$$ \lambda_b^\text{new} = -\lambda_a^\text{new} y_a y_b - \xi y_b $$

Now, we need to decide whether to update the value of \(b^\star\). If \(0<\lambda_a^\text{new}<C\), then \(\mathbf{x}_ a\) is the support vector which is exactly located at the margin. Therefore, we can update \(b^\text{new}\) as:

$$ \begin{align} b^\text{new} &= y_a -\sum_{i\neq a,b} \lambda_i y_i K_{i,a} - \lambda_a^\text{new} y_a K_{a,a} - \lambda_b^\text{new} y_b K_{b,a}\\ &= b^\text{old} - ( \sum_{i}\lambda_i y_i K_{i,a} + b^\text{old} - y_a ) \\ &\ \ \ + (\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,a} +(\lambda_b^\text{old}-\lambda_b^\text{new}) y_b K_{b,a} \\ &= b^\text{old} - E_a + (\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,a} +(\lambda_b^\text{old}-\lambda_b^\text{new}) y_b K_{b,a} \end{align} $$

Otherwise, if \(0<\lambda_b^\text{new}<C\), we can update \(b^\text{new}\) as:

$$ b^\text{new} = b^\text{old} - E_b + ( \lambda_a^\text{old} - \lambda_a^\text{new} )y_a K_{a,b} +( \lambda_b^\text{old} - \lambda_b^\text{new} ) y_b K_{b,b} $$

Note that if neither \(0<\lambda_a^\text{new}<C\) nor \(0<\lambda_b^\text{new}<C\), here we choose not to update \(b\).

Now, we have finished one single iteration in SMO.

Before we summarize the algorithm of SMO, there are some updates that can improve the computation efficiency.

  1. Computation of \(\xi\): In the deduction above, \(\xi\) is used in computing \(L,\ H\) and \(\lambda_b^\text{new}\). If we computed \(\xi\) as \(\xi = \sum_{i\neq a,b}\lambda_i y_i\), it would be time-consuming. Instead, we can use the equation
    $$ \xi = -\lambda_a^\text{old} y_a - \lambda_b^\text{old} y_b $$
    By substituting the expression of \(\xi\) into the expression of \(\lambda_b^\text{new}\), we have:
$$ \lambda_b^\text{new} = \lambda_b^\text{old} + ( \lambda_a^\text{old} - \lambda_a^\text{new}) y_a y_b $$

Sequential Minimal Optimization Algorithm

According to the deduction above, we can have the pseudo algorithm of the SMO.

Initialization: \(\lambda_i=0\) for \(i=1,\dots,n\), \(b=0\), and pre-calculation of the Kernel matrix \(\mathbf{K}\)
Repeat:
     heuristically (or randomly) select a pair \(\lambda_a^\text{old}\leftarrow \lambda_a,\ \lambda_b^\text{old}\leftarrow \lambda_b\);

    if \(K_{a,a}+K_{b,b}-2K_{a,b}==0\):
        continue

    \(E_a = \sum_{i} \lambda_i y_i K_{i,a}+ b^\text{old} - y_a\)
    \(E_b = \sum_{i}\lambda_i y_i K_{i,b}+ b^\text{old} - y_b\)
    \(\lambda_a^\text{new, unclipped} = \lambda_a^\text{old} + \frac{ y_a (E_b - E_a)}{ K_{a,a} + K_{b,b} -2K_{a,b} }\)
    \(\xi = -\lambda_a^\text{old} y_a - \lambda_b^\text{old} y_b\)

    if \(y_a \neq y_b\):
        \(L= \max( \xi y_b,0 ),\ H=\min(C+\xi y_b,C)\)
    else:
        \(L= \max( 0, -C-\xi y_b ),\ H=\min(C, -\xi y_b)\)

    if \(\lambda_a^\text{new, unclipped} < L\):
        \(\lambda_a^\text{new} = L\)
    else if \(\lambda_a^\text{new, unclipped} > H\):
        \(\lambda_a^\text{new} = H\)
    else:
        \(\lambda_a^\text{new} = \lambda_a^\text{new, unclipped}\)

    \(\lambda_b^\text{new}=\lambda_b^\text{old}+(\lambda_a^\text{old}-\lambda_a^\text{new})y_a y_b\)
    \(\lambda_a\leftarrow \lambda_a^\text{new},\ \lambda_b\leftarrow \lambda_b^\text{new}\)

    if \(0<\lambda_a^\text{new}<C\):
        \(b^\text{new}=b^\text{old}-E_a +(\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,a}+(\lambda_b^\text{old}-\lambda_b^\text{new})y_b K_{b,a}\)
    else if \(0<\lambda_b^\text{new}<C\):
        \(b^\text{new}=b^\text{old}-E_b +(\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,b}+(\lambda_b^\text{old}-\lambda_b^\text{new})y_b K_{b,b}\)

Until: Maximum iteration reached, or the dual objective function \(L(\lambda)\) is not further maximized with a certain accuracy.
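The pseudo code above maps almost line by line onto Python. Below is a minimal sketch of a single SMO update step (the names `lamb`, `K`, `y`, `b`, `C` are assumed to hold the dual variables, kernel matrix, labels, bias, and box bound; pair selection and the outer loop are omitted):

```python
import numpy as np

def smo_step(lamb, K, y, b, C, a_idx, b_idx):
    """One SMO update on the pair (a_idx, b_idx); returns updated (lamb, b)."""
    a, c = a_idx, b_idx                         # c avoids shadowing the bias b
    eta = K[a, a] + K[c, c] - 2.0 * K[a, c]
    if eta == 0:                                # degenerate pair: skip it
        return lamb, b
    # prediction errors E_a, E_b with the current multipliers
    E_a = np.sum(lamb * y * K[:, a]) + b - y[a]
    E_c = np.sum(lamb * y * K[:, c]) + b - y[c]
    la_old, lc_old = lamb[a], lamb[c]
    la_new = la_old + y[a] * (E_c - E_a) / eta  # unclipped update
    xi = -la_old * y[a] - lc_old * y[c]
    if y[a] != y[c]:
        L, H = max(xi * y[c], 0.0), min(C + xi * y[c], C)
    else:
        L, H = max(0.0, -C - xi * y[c]), min(C, -xi * y[c])
    la_new = min(max(la_new, L), H)                      # clip to [L, H]
    lc_new = lc_old + (la_old - la_new) * y[a] * y[c]    # keep sum(lamb*y) = 0
    lamb[a], lamb[c] = la_new, lc_new
    # update the bias only when one multiplier lies strictly inside (0, C)
    if 0 < la_new < C:
        b = b - E_a + (la_old - la_new) * y[a] * K[a, a] + (lc_old - lc_new) * y[c] * K[c, a]
    elif 0 < lc_new < C:
        b = b - E_c + (la_old - la_new) * y[a] * K[a, c] + (lc_old - lc_new) * y[c] * K[c, c]
    return lamb, b
```

On a toy two-point problem with a linear kernel, one such step already produces a feasible pair of multipliers satisfying \(\sum_i \lambda_i y_i = 0\).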

Cool, isn’t it? Now we are able to solve the dual problem using the SMO algorithm!

Ref:

  1. Machine learning algorithms in practice: the SMO algorithm in SVM (Zhihu)

An Introduction to Support Vector Machines (SVM): kernel functions
2019-06-27T00:00:00-07:00
https://nianlonggu.github.io/posts/2019/06/blog-post-6

Recall the Slack SVM dual problem:

Dual Problem

$$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$$$ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

Suppose that we have solved the dual problem and get the dual optimum. Let \(S_w=\{ i \vert 0<\lambda_i^\star \leq C \}\) represent the support set related with \(\mathbf{w}\); \(S_b=\{ i \vert 0<\lambda_i^\star < C \}\) represent the support set related with \(b\). Meanwhile, we define \(S_b^+ =\{ i \vert i\in S_b \ \text{and}\ y_i = +1 \}\) and \(S_b^-=\{ i \vert i\in S_b\ \text{and}\ y_i = -1 \}\). Then we can compute the primal optimum:

$$ \mathbf{w}^\star = \sum_{i\in S_w}\lambda_i^\star y_i \mathbf{x}_i$$$$ b^\star= y_j - {\mathbf{w}^\star}^T\mathbf{x}_j = y_j - \sum_{i\in S_w}\lambda_i^\star y_i \mathbf{x}_i^T \mathbf{x}_j \ , \ j\in S_b$$$$ $$

Given a new point \(\mathbf{x}\), we can perform classification by computing:

$$ \begin{align} \hat{y} &= {\mathbf{w}^\star}^T \mathbf{x} + b^\star\\ &=\sum_{i\in S_w} \lambda^\star_i y_i \mathbf{x}_i^T \mathbf{x} + b^\star\\ \end{align} $$

According to the formulas above, we notice that in the dual problem, in the computation of \(\mathbf{w}^\star\), and in the classification of new points, the inner product \(\mathbf{x}_ i^T\mathbf{x}_ j\) always appears as a whole.

SVM with kernel functions

Mapping points to a higher dimensional space

In some cases, if the points are not linearly separable in the current space, they may become linearly separable when we map them into a higher-dimensional space.

Mapping points from 2d to 3d to make them linearly separable.

We define \(\phi(\mathbf{x}): R^p \rightarrow R^d\ ,\ d>p\) as a mapping function which maps low-dimensional data to high-dimensional data. We can first map our data \(\mathbf{x}_ i \rightarrow \phi(\mathbf{x}_ i)\), then solve the dual problem:

$$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$$$ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

We notice that in the dual problem, in computing \(\mathbf{w}^\star\), and in performing classification, \(\phi(\mathbf{x}_ i)^T\phi(\mathbf{x}_ j)\) always appears as a whole. Therefore, we can avoid computing the exact form of \(\phi(\mathbf{x})\) and instead directly work with the function giving the inner product of two mapped points, \(K: R^p \times R^p \rightarrow R\):

$$ K_{i,j}=K(\mathbf{x}_i, \mathbf{x}_j)=<\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)> $$

We call \(K(\mathbf{x}_i, \mathbf{x}_j)\) the kernel function.

What is a valid kernel function?

A kernel function \(K(\mathbf{x}_ i, \mathbf{x}_ j)\) is valid if there exists a mapping function \(\phi\), such that it holds \(K_{i,j} = <\phi(\mathbf{x}_ i), \phi(\mathbf{x}_ j)>\) for any \(\mathbf{x}_ i, \mathbf{x}_ j\in R^p\).

Moreover, there is an equivalent conclusion on the validness of a kernel function.

A kernel function \(K(\mathbf{x}_ i, \mathbf{x}_ j)\) is valid if for any \(n\) samples \(\{ \mathbf{x}_ i \vert \mathbf{x}_ i \in R^p \}, i=1,\dots, n\), the kernel matrix \(\mathbf{K}=\begin{bmatrix}K_{1,1}, \dots, K_{1,n}\\\dots \\ K_{n,1},\dots, K_{n,n} \end{bmatrix}\) is non-negative definite.

Examples of Kernel functions

  1. Polynomial kernel function

    \[K(\mathbf{x}, \mathbf{y}) = ( \mathbf{x}^T\mathbf{y} +c )^d\]

    It can be proven that this function is equivalent to first mapping the points to a higher-dimensional space and then computing the inner product.

  2. Gaussian Kernel

    \[K(\mathbf{x}, \mathbf{y}) = \exp\{ -\frac{ \|\mathbf{x}-\mathbf{y}\|^2 }{2{\epsilon}^2} \}\]

    Applying the Gaussian kernel is equivalent to first mapping points to an infinite-dimensional space and then computing the inner product. This can be understood via the Taylor expansion of the exponential function. For a detailed explanation, see “Why does the Gaussian kernel in SVM map the original features to an infinite-dimensional space?” (Zhihu)
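As a sanity check, both example kernels can be implemented in a few lines, and their kernel matrices on random data can be verified to be non-negative definite (a sketch; the parameter values c, d, sigma are chosen arbitrarily for illustration):

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    # (x^T y + c)^d
    return (np.dot(x, y) + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

rng = np.random.RandomState(0)
X = rng.randn(8, 3)
for kernel in (polynomial_kernel, gaussian_kernel):
    K = np.array([[kernel(a, b) for b in X] for a in X])
    eigvals = np.linalg.eigvalsh(K)
    # non-negative definite: all eigenvalues >= 0 (up to floating-point error)
    assert eigvals.min() > -1e-8
```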

Dual problem with kernel function

With the definition of the kernel function, we can rewrite the dual problem and classification task as following.

Dual Problem

$$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) $$$$ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

Suppose that we have solved the dual problem and obtained the dual optimum. Let \(S_w=\{ i \vert 0<\lambda_i^\star \leq C \}\) represent the support set related to \(\mathbf{w}\), and \(S_b=\{ i \vert 0<\lambda_i^\star < C \}\) the support set related to \(b\). Meanwhile, we define \(S_b^+ =\{ i \vert i\in S_b \ \text{and}\ y_i = +1 \}\) and \(S_b^-=\{ i \vert i\in S_b\ \text{and}\ y_i = -1 \}\). Then we can compute the primal optimum:

$$ \mathbf{w}^\star = \sum_{i\in S_w}\lambda_i^\star y_i \phi(\mathbf{x}_i)\\ b^\star= y_j - {\mathbf{w}^\star}^T\phi(\mathbf{x}_j) = y_j - \sum_{i\in S_w}\lambda_i^\star y_i K(\mathbf{x}_i, \mathbf{x}_j) \ , \ j\in S_b\\ $$

Given a new point \(\mathbf{x}\), we can perform classification by computing:

$$ \begin{align} \hat{y} &= {\mathbf{w}^\star}^T \phi(\mathbf{x}) + b^\star\\ &=\sum_{i\in S_w} \lambda^\star_i y_i K(\mathbf{x}_i, \mathbf{x}) + b^\star\\ \end{align} $$

Note that \(\mathbf{w}^\star\) is never actually computed: every quantity we need is expressed through the kernel function!
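The kernelized classification rule can be sketched directly; the support vectors and \(\lambda^\star\) values below are made up purely for illustration:

```python
import numpy as np

def svm_predict(x_new, X_sv, y_sv, lam_sv, b, kernel):
    # y_hat = sign( sum_i lam_i y_i K(x_i, x) + b ); w* never appears explicitly
    score = sum(l * y * kernel(x_i, x_new)
                for l, y, x_i in zip(lam_sv, y_sv, X_sv))
    return np.sign(score + b)

linear = lambda u, v: float(u @ v)      # any valid kernel can be plugged in here
X_sv = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_sv = np.array([1.0, -1.0])
lam_sv = np.array([0.5, 0.5])           # hypothetical dual optimum
label = svm_predict(np.array([2.0, 2.0]), X_sv, y_sv, lam_sv, 0.0, linear)
```

Only the support vectors and the kernel evaluations are needed at prediction time.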

Solve the dual problem using Gradient Descent Algorithm

We can solve the dual problem using the gradient descent algorithm as introduced in the post An Introduction to Support Vector Machines (SVM): Dual problem solution using GD. Simply select a kernel function, such as polynomial or Gaussian, compute the kernel matrix \(\mathbf{K}\) for the training dataset, then compute the gradient and perform gradient descent updates to get the dual optimum \(\lambda^\star\). After getting \(\lambda^\star\), we can compute the primal optimum \(b^\star\) and classify new points using the equations above.

In the next post, I will introduce how to solve the dual problem using Sequential Minimal Optimization (SMO).


]]>
Nianlong Gu[email protected]
An Introduction to Support Vector Machines (SVM): SVM with slack variables2019-06-07T00:00:00-07:002019-06-07T00:00:00-07:00https://nianlonggu.github.io/posts/2019/06/blog-post-5Recall of the SVM primal problem:

$$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2\\ \begin{align} \ \ \ s.t.\ \ & y_i(\mathbf{w}^T\mathbf{x}_i+b)\geq 1,\ i=1,\dots,n \end{align} $$

This is the primal problem of the SVM in the case where points of two classes are linearly separable. Such a primal problem has two drawbacks:

  • The separating plane is sensitive to (easily influenced by) outliers.
  • Not suitable for the case where points of two classes are not linearly separable.
  1. The separating plane is sensitive to (easily influenced by) outliers.
    Hyperplane Influenced by Outliers
    Figure Hyperplane Influenced by Outliers shows how a single outlier greatly influences the final position of the hyperplane. This is because the constraints \(y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1\) in the primal problem ensure that the minimum geodesic distance between the points and the separating hyperplane is \(\frac{1}{\|\mathbf{w}\|}\). When there is an outlier, in order to satisfy the constraints, the model will choose a smaller \(\|\mathbf{w}\|\) and also greatly change the rotation/position of the separating hyperplane. However, the separating hyperplane in Figure (b) is not a good choice: compared with (a), the points in (b) have a much smaller average geodesic distance to the separating hyperplane, so the SVM is more likely to make wrong decisions when classifying new points.
  2. Not suitable for the case where points of two classes are not linearly separable.
    If the points are not linearly separable, then the SVM primal problem has no optimal solution, since no \(\mathbf{w}\) and \(b\) satisfy the constraints \(y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1\).

SVM with Slack Variables

To solve the problems above, we introduce slack variables into the original SVM primal problem. This means that we allow certain (outlier) points to lie within the margin or even cross the separating hyperplane, but such violations are penalized. The primal problem of the “Slack-SVM” is then:

Primal Problem

$$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n}\xi _i \\ \begin{align} \ \ \ \ s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_i+b) &\geq 1-\xi_i ,\ &i=1,\dots,n \\ \xi_i &\geq 0,\ &i=1,\dots,n \end{align} $$

Here \(\xi_i\) is the slack variable, and the positive \(C\) is the weight for the penalty term. Suppose that for some point \(\mathbf{x}_i\), it holds \(y_i(\mathbf{w}^T\mathbf{x}_i+b) = 1-\xi_i\):

  • if \(\xi_i=0\), then \(\mathbf{x}_i\) is exactly at the marginal hyperplane (the margin for short).
  • if \(0<\xi_i\leq 1\), then \(\mathbf{x}_i\) is located within the margin, but the label of \(\mathbf{x}_i\) is correctly classified.
  • if \(\xi_i > 1\), then \(\mathbf{x}_i\) is located on the other side of the separating hyperplane, which means a misclassification.
    Different $$\xi$$ and Point Locations
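For a given \((\mathbf{w}, b)\), the smallest feasible slack for each point is \(\xi_i = \max(0,\ 1 - y_i(\mathbf{w}^T\mathbf{x}_ i + b))\); here is a tiny illustration of the three cases with made-up points:

```python
import numpy as np

def slacks(X, y, w, b):
    # xi_i = max(0, 1 - y_i (w.x_i + b)):
    #   xi = 0      -> on or outside the margin
    #   0 < xi <= 1 -> inside the margin but correctly classified
    #   xi > 1      -> on the wrong side of the separating hyperplane
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, 1.0])
xi = slacks(X, y, w=np.array([1.0, 0.0]), b=0.0)   # one point of each kind
```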

It is possible to use the gradient descent algorithm to solve this primal problem. However, due to the slack variables, the constraints are much more complex than in the case without slack variables, and it is more difficult to define the loss function used for gradient descent. In contrast, the Lagrangian dual problem of this primal problem remains compact and solvable, and can be easily extended to kernel SVM. Therefore, in what follows we mainly discuss the derivation of the Lagrangian dual problem of the Slack-SVM primal problem.

Lagrangian Function

$$L( \mathbf{w}, b, \mathbf{\xi}, \lambda, \mu )= \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\lambda_i ( 1-\xi_i - y_i(\mathbf{w}^T\mathbf{x}_ i+b ) ) - \sum_{i=1}^{n}\mu_i \xi_i $$

Lagrangian Dual function

$$ g(\lambda, \mu) = \inf_{\mathbf{w}, b, \xi} L(\mathbf{w}, b, \xi, \lambda, \mu) $$

To get the dual function, we can compute the derivative and set them to 0.

$$\frac{\partial{L}}{\partial{\mathbf{w}}} = \mathbf{w} - \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_i = 0$$ $$\frac{\partial{L}}{\partial{b}} = -\sum_{i=1}^{n}\lambda_i y_i = 0$$ $$\frac{\partial{L}}{\partial{\xi_i}} = C - \lambda_i - \mu_i = 0$$

From these 3 equations we have

$$\mathbf{w}^\star = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i$$ $$\sum_{i=1}^{n} \lambda_i y_i = 0$$ $$\mu_i = C-\lambda_i $$

Substituting them into the Lagrangian function, we get the Lagrangian dual function:

$$ g(\lambda, \mu) = \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + C\sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\lambda_i -\sum_{i=1}^{n}\lambda_i \xi_i $$$$-\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j - \sum_{i=1}^{n}\lambda_i y_i b - C\sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\lambda_i \xi_i$$$$ = \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j $$

Therefore, the Lagrangian dual problem is:

$$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$$$ \begin{align} s.t.\ & \lambda_i \geq 0\\ &\mu_i \geq 0 \\ &\mu_i = C-\lambda_i \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

We can use \(\lambda_i\) to represent \(\mu_i\), and finally get the dual problem:

Dual Problem

$$ \max_{\lambda} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$$$ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

Compared with the dual problem for the SVM without slack variables, the only difference is that here the constraints of \(\lambda\) are \(0\leq \lambda_i \leq C\), instead of \(\lambda_i \geq 0\).

Actually, in the primal problem of the SVM without slack variables, we can think of it as having a hidden \(C=\infty\): the penalty on slack variables is infinitely large, so all points must satisfy \(y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1\).

Solution of the Dual Problem

  1. Gradient Descent Algorithm. The objective function for gradient descent is:

    $$ \min_{\lambda} L(\lambda) = -\sum_{i=1}^{n} \lambda_i + \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_ i^T \mathbf{x}_ j + \frac{c}{2}(\sum_{i=1}^{n}\lambda_i y_i)^2 $$$$ s.t.\ 0\leq \lambda_i \leq C \ , \ i=1,\dots,n $$
    Compared with the post An Introduction to Support Vector Machines (SVM): Dual problem solution using Gradient Descent, the objective function is the same. The only difference is that here the constraints are \(0\leq \lambda_i \leq C\). To enforce these constraints, we can clip the value of \(\lambda_i\) into the range \([0,C]\) after each gradient descent update. For the detailed form of the gradient, please have a look at that post.

  2. Sequential Minimal Optimization (SMO), which will be discussed in the following posts.
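A minimal sketch of this projected gradient descent on toy data (the learning rate, penalty weight \(c\), iteration count, and data are chosen arbitrarily for illustration):

```python
import numpy as np

np.random.seed(0)
# two overlapping Gaussian blobs as toy training data
X = np.vstack([np.random.randn(20, 2) + 1.5, np.random.randn(20, 2) - 1.5])
y = np.hstack([np.ones(20), -np.ones(20)])

n, C, c_pen, lr = len(y), 1.0, 1.0, 0.001
K = X @ X.T                          # linear kernel matrix, computed once
lam = np.zeros(n)

for _ in range(5000):
    grad = -np.ones(n) + (K @ (lam * y)) * y + c_pen * (lam @ y) * y
    lam = np.clip(lam - lr * grad, 0.0, C)   # project back into the box [0, C]

w = (lam * y) @ X                    # primal optimum recovered from the duals
```

The only change relative to the slack-free case is the upper clip at \(C\); `np.clip` enforces the box constraint after every update.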

Discussion on the Karush-Kuhn-Tucker (KKT) conditions. The KKT conditions are now slightly different, since the dual function now has two sets of variables: \(\lambda\) and \(\mu\). For the primal optimum \(\mathbf{w}^\star, b^\star, \xi^\star\) and the dual optimum \(\lambda^\star, \mu^\star\), it holds:

  1. primal constraints
    $$ y_i({\mathbf{w}^\star}^T\mathbf{x}_ i +b^\star) \geq 1-\xi^\star_i $$$$ \xi^\star_i \geq 0 $$
  2. stationarity: compute the infimum of \(L\) w.r.t \(\mathbf{w}\), \(b\) and \(\xi\)
    $$ \nabla_{\mathbf{w},b,\xi}L( \mathbf{w}^\star, b^\star, \xi^\star, \lambda^\star, \mu^\star)=0 $$
  3. dual constraints
    $$ \lambda_i^\star \geq 0$$$$ \mu_i^\star \geq 0$$$$ \sum_{i=1}^{n}\lambda_i^\star y_i =0$$$$ \mu_i^\star = C - \lambda_i^\star $$
  4. Complementary Slackness
    $$ \lambda_i^\star ( 1-\xi_i^\star - y_i({\mathbf{w}^\star}^T\mathbf{x}_ i +b^\star ) ) =0 $$$$ \mu_i^\star \xi_i^\star = 0 $$

The complementary slackness is interesting. Suppose that we have already found the primal optimum and the dual optimum. We can analyze the location of the point \(\mathbf{x}_ i\) based on the value of \(\lambda_i^\star\):

  1. \(\lambda_i^\star=0\)
    then \(\mu_i^\star = C\), so \(\xi^\star_i=0\), and \(y_i({\mathbf{w}^\star}^T\mathbf{x}_ i+b^\star)\geq 1\). This means the distance from point \(\mathbf{x}_ i\) to the separating hyperplane is greater than or equal to \(\frac{1}{\|\mathbf{w}^\star\|}\). The point \(\mathbf{x}_ i\) is not a support vector.
  2. \(0<\lambda_i^\star<C\)
    then \(0<\mu_i^\star<C\), so \(\xi_i^\star =0\), and \(y_i({\mathbf{w}^\star}^T\mathbf{x}_ i+b^\star)=1\). This means that \(\mathbf{x}_ i\) is exactly located at the margin hyperplane: the distance to the separating hyperplane is exactly \(\frac{1}{\|\mathbf{w}^\star\|}\). The point \(\mathbf{x}_ i\) is a support vector which is used to compute \(\mathbf{w}^\star\) and \(b^\star\).
  3. \(\lambda_i^\star = C\)
    then \(\mu_i^\star =0\), so \(\xi^\star_i\geq 0\), and \(y_i({\mathbf{w}^\star}^T\mathbf{x}_ i+b^\star)=1-\xi^\star_i\). This means that \(\mathbf{x}_ i\) is within the margin, or even located on the other side of the separating hyperplane (a misclassification). The point \(\mathbf{x}_ i\) is also a support vector which is used to compute \(\mathbf{w}^\star\), but not used to compute \(b^\star\).

Suppose that we have solved the dual problem and obtained the dual optimum. Let \(S_w=\{ i \vert 0<\lambda_i^\star \leq C \}\) represent the support set related to \(\mathbf{w}\), and \(S_b=\{ i \vert 0<\lambda_i^\star < C \}\) the support set related to \(b\). Meanwhile, we define \(S_b^+ =\{ i \vert i\in S_b \ \text{and}\ y_i = +1 \}\) and \(S_b^-=\{ i \vert i\in S_b\ \text{and}\ y_i = -1 \}\). Then we can compute the primal optimum:

$$ \mathbf{w}^\star = \sum_{i\in S_w}\lambda_i^\star y_i \mathbf{x}_i $$

Multiple ways can be used to compute \(b^\star\):

$$ b^\star= y_i - {\mathbf{w}^\star}^T\mathbf{x}_i \ , \ i\in S_b$$$$ b^\star= \frac{1}{\vert S_b \vert}\sum_{i\in S_b}({y_i - {\mathbf{w}^\star}^T\mathbf{x}_i})$$$$ b^\star = -\frac{1}{2}{\mathbf{w}^\star}^T(\mathbf{x}_i+\mathbf{x}_j)\ , \ i\in S_b^+, j \in S_b^- $$
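Assuming the dual optimum \(\lambda^\star\) is already available (the values below are made up so the arithmetic is easy to check), the recovery of \(\mathbf{w}^\star\) and \(b^\star\) can be sketched as:

```python
import numpy as np

def primal_from_dual(lam, X, y, C, tol=1e-8):
    # S_w: 0 < lam_i <= C contribute to w*;  S_b: 0 < lam_i < C lie on the margin
    S_w = lam > tol
    S_b = S_w & (lam < C - tol)
    w = (lam[S_w] * y[S_w]) @ X[S_w]
    b = np.mean(y[S_b] - X[S_b] @ w)   # average over margin points for stability
    return w, b

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
lam = np.array([0.25, 0.0, 0.25])      # hypothetical dual optimum, C = 1
w, b = primal_from_dual(lam, X, y, C=1.0)
```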

Experiment Results

We compare the separating hyperplane results between the SVM with slack variables (Slack-SVM for short) and the original SVM without slack variables (Original-SVM for short). Both models are trained by solving the Lagrangian dual problem with the gradient descent algorithm introduced in the previous post.

For further discussion, we recall the primal/dual problem of the Original-SVM and the primal/dual problem of the Slack-SVM:

  1. Original-SVM
    Primal Problem
    $$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2$$$$ \begin{align} s.t.\ \ & y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1,\ i=1,\dots,n \end{align} $$
    Dual Problem
    $$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$$$ \begin{align} s.t.\ & \lambda_i \geq 0 \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$
  2. Slack-SVM
    Primal Problem
    $$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n}\xi _i $$$$ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b) &\geq 1-\xi_i ,\ &i=1,\dots,n \\ \xi_i &\geq 0,\ &i=1,\dots,n \end{align} $$
    Dual Problem
    $$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$$$ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

Experiment 1.
Comparison of performance in the case where there are outliers but the points are still linearly separable. The Slack-SVM penalty term weight is \(C=0.5\).

Slack SVM vs Original SVM on separable outliers
This result fits well with the analysis in the Figure Hyperplane Influenced by Outliers! The original SVM tries hard to find a separating hyperplane regardless of the obvious outlier point. It takes \(2\times 10^6\) iterations (20 times longer than Slack SVM) to finally find the separating hyperplane with a tight margin. On the contrary, the Slack SVM simply chooses to ignore the outlier point. The separating hyperplane is almost identical to the case without the outlier point.

Experiment 2.
Analyzing the influence of different Slack-SVM penalty term weight \(C\).

Slack SVM over different penalty weight C

As we increase the value of \(C\), the geodesic margin becomes wider, the outlier point moves geodesically closer to the margin hyperplane, and more points become support vectors.

To explain this we need to refer to the form of the Slack-SVM primal problem. When we increase \(C\), the penalty term \(C\sum_{i=1}^{n}\xi_i\) is weighted more heavily, so the model tends to reduce the values of \(\xi_i\). How can it reduce \(\xi_i\)?

The answer is to reduce \(\|\mathbf{w}\|\). This may sound a little bit bizarre, but we can tell that from the figure Slack SVM over different penalty weight C.

For different values of \(C\), the location and rotation of the separating hyperplane remain similar, so the distances from the points to the separating hyperplane are similar. We know that for a point \(\mathbf{x}_ i\) which is within the margin or located on the other side of the separating hyperplane, its geodesic distance to the separating hyperplane is \(\frac{\vert 1-\xi_i \vert}{\|\mathbf{w}\|}\). For the outlier points which cross the separating hyperplane, like the solid blue circle in the top right corner, the geodesic distance is \(\frac{\xi_i -1 }{\|\mathbf{w}\|}\).

For a large \(C\), we need to reduce the large \(\xi_i\) of that outlier point while its geodesic distance remains unchanged, so the only option is to reduce \(\|\mathbf{w}\|\). As a result, the geodesic margin \(\frac{1}{\|\mathbf{w}\|}\) is increased. Therefore, the larger \(C\) is, the wider the margin area is.

Original SVM for linearly non-separable cases
We also notice that for \(C=100\) and \(C=10000\), the separating results are almost the same. This leads to another question: what if we set \(C=\infty\) and solve the dual problem of the Slack SVM?

If we set \(C=\infty\), the primal/dual problem of the Slack SVM is exactly the same as the primal/dual problem of the original SVM. This is the short proof:

  1. for dual problem, it obviously holds.
  2. for primal problem:
    $$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n}\xi _i $$$$ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b) &\geq 1-\xi_i ,\ &i=1,\dots,n \\ \xi_i &\geq 0,\ &i=1,\dots,n \end{align} $$
    When \(C\rightarrow \infty\), for the objective function to remain finite it must hold that \(\xi_i \equiv 0\). Therefore, \(C \sum_{i=1}^{n}\xi _ i=0\), and the Slack-SVM’s primal problem becomes:
    $$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 $$$$ \begin{align} s.t.\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b) &\geq 1,\ &i=1,\dots,n \\ \end{align} $$
    This is exactly the Original SVM’s primal problem.

Therefore, the above question is equivalent to ask: What if we apply the Original SVM to the linearly non-separable case?

The answer is that the separating results will be almost the same as the case \(C=10000\) in the figure Slack SVM over different penalty weight C. Why is the geodesic margin not further enlarged?

We showed that the original SVM is equivalent to setting \(C=\infty\) in the Slack-SVM. However, from the perspective of the dual problem, the effective value of \(C\) is actually determined by the upper bound of the trained \(\lambda\). For example, if we set \(C=\infty\) but the actual upper bound of the trained \(\lambda\) is 10000, then the effective \(C\) is 10000. Therefore, when applying the Original SVM to a linearly non-separable case, the final separating result is identical to the \(C=10000\) case.

Original SVM on linearly non-separable case
Here I also check the maximum of \(\lambda\) after training for different values of \(C\):

| \(C\) | 10 | 100 | 10000 | \(\infty\) |
| --- | --- | --- | --- | --- |
| \(\max{\lambda}\) | 10 | 62.2 | 62.2 | 62.2 |

We can see that when \(C\) reaches 100, the maximum of \(\lambda\) saturates at around 60. Therefore, further increasing \(C\) does not influence the separating results. Note that as we continue training, \(\max{\lambda}\) may rise further, but it can hardly reach the value of \(C\) if \(C\) is very large.


]]>
Nianlong Gu[email protected]
An Introduction to Support Vector Machines (SVM): Dual problem solution using Gradient Descent2019-05-27T00:00:00-07:002019-05-27T00:00:00-07:00https://nianlonggu.github.io/posts/2019/05/blog-post-4Recall of the SVM primal problem and dual problem:
Primal Problem

$$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2\\ \begin{align} s.t.\ \ & y_i(\mathbf{w}^T\mathbf{x}_i+b)\geq 1,\ i=1,\dots,n \end{align} $$

Dual Problem

$$ \max_{\lambda}\ \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j\\ \begin{align} s.t.\ \ & \lambda_i \geq 0,\ i=1,\dots,n\\ & \sum_{i=1}^{n}\lambda_i y_i = 0 \end{align} $$

In the last post we introduced how to apply Lagrangian duality to SVM and how to recover the primal optimum once we have the dual optimum. In this post we mainly discuss how to solve the dual problem and obtain the dual optimum.

Gradient Descent Algorithm for Dual Problem

To apply GD to SVM, we need to reformulate the objective function of the dual problem. Our new objective function will be:

$$ \min_{\lambda}L(\lambda)=-\sum_{i=1}^{n}\lambda_i + \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + \frac{c}{2}(\sum_{i=1}^{n}\lambda_i y_i)^2 $$ $$\text{s.t.}\ \lambda_i\geq 0 $$

where \(c>0\) is the weighting factor for the constraint \(\sum_{i=1}^{n}\lambda_i y_i = 0\). The constraint \(\lambda_i\geq 0\) can be satisfied by clipping \(\lambda\) into the region \([0,\infty)\) after each gradient descent update.

Discussion: why not put the constraints \(\lambda_i\geq 0\) into the loss function as well, by introducing an extra hinge loss term? The final loss function would then be: \(\min_{\lambda}L(\lambda)=-\sum_{i=1}^{n}\lambda_i + \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + \frac{c}{2}(\sum_{i=1}^{n}\lambda_i y_i)^2 + d \sum_{i=1}^{n}\text{max}\{-\lambda_i,0\}\)

This is reasonable in theory but not feasible in practice: it introduces an extra hyperparameter \(d\), and we would be lost in endlessly fine-tuning and balancing the hyperparameters \(c\) and \(d\). Test results also show that enforcing the constraint \(\lambda_i\geq 0\) by clipping is efficient, and this method easily supports more general cases of SVM with penalty terms, which will be discussed later.

Based on the loss function, we can compute the gradient:

$$ \begin{align} \frac{\partial{L}}{\partial{\lambda_i}} &= -1 + y_i \sum_{j=1}^{n}\lambda_j y_j \mathbf{x}_i^T\mathbf{x}_j + {c}\sum_{j=1}^{n}\lambda_j y_j y_i \end{align} $$

We define a function \(K(\mathbf{x}_i, \mathbf{x}_j)= \mathbf{x}_i^T\mathbf{x}_j\). To maintain consistency with future posts, we call this function the kernel function. Given a training dataset \(\{\mathbf{x}_i\}, i=1,\dots,n\), we can get a kernel matrix:

$$ \mathbf{K} = \begin{bmatrix}K_{1,1}\dots K_{1,n}\\ \dots \\ K_{n,1}\dots {K_{n,n}} \end{bmatrix} $$

where \(K_{i,j}=K(\mathbf{x}_i, \mathbf{x}_j)\).
Then the gradient \(\frac{\partial{L}}{\partial{\lambda_i}}\) can be expressed by the kernel matrix:

$$ \frac{\partial{L}}{\partial{\lambda_i}} = -1 + y_i \mathbf{e}_i^T \mathbf{K} ( \lambda \circ \mathbf{y} ) + c y_i \lambda ^T \mathbf{y} $$

where \(\mathbf{e}_i=[0,\dots,0,1,0,\dots,0]\), with the \(i^{th}\) element being 1 and all other elements being 0. The sign \(\lambda \circ \mathbf{y}\) represents the element-wise multiplication of the two vectors \(\lambda\) and \(\mathbf{y}\).

We can also write the expression of the gradient of \(L\) with respect to the whole vector \(\lambda\):

$$ \frac{\partial{L}}{\partial{\lambda}} = -\mathbf{1}_n + (\mathbf{K}(\lambda \circ \mathbf{y}))\circ \mathbf{y} + c(\lambda^T\mathbf{y})\mathbf{y} $$

In practice, when we implement the gradient descent algorithm, we don’t need to compute \(\mathbf{K}\) in each iteration, since \(\mathbf{K}\) does not rely on \(\lambda\). Instead, we can simply compute \(\mathbf{K}\) before applying gradient descent and store it in the memory, and call it each time when computing the gradient.
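The precomputed kernel matrix and the vectorized gradient above can be sketched as follows, on a tiny symmetric dataset whose dual optimum is \(\lambda_i^\star = 0.5\) for all \(i\) (variable names are mine, not from the original code):

```python
import numpy as np

def dual_gradient(lam, K, y, c=1.0):
    # dL/dlam = -1 + y ∘ (K (lam ∘ y)) + c (lam·y) y, computed in one shot
    return -np.ones_like(lam) + (K @ (lam * y)) * y + c * (lam @ y) * y

X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                # computed once before the loop, reused every step

lam = np.zeros(4)
for _ in range(500):
    lam = np.maximum(lam - 0.05 * dual_gradient(lam, K, y), 0.0)  # clip to [0, inf)

w = (lam * y) @ X          # primal optimum w* = sum_i lam_i y_i x_i
```

Note that `K` never changes inside the loop; only the two matrix-vector products per iteration depend on the current \(\lambda\).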

Another implicit advantage of such a kernel matrix expression is that it extends to a broader family of SVMs – SVM with kernels – where the kernel function \(K(\mathbf{x}_ i, \mathbf{x}_ j)\) has a more sophisticated definition than a plain dot product. Even in that case, the expression of the gradient remains the same: we simply pre-calculate the kernel matrix \(\mathbf{K}\) based on the new kernel function and then apply the gradient descent algorithm to find the optimal solution. We will discuss kernel SVM in future posts.

Implementation and Experiments

I implement the Gradient Descent algorithm to compute the dual optimum and use it to solve the original SVM optimization problem. The code is available in my github SupportVectorMachine/gd-dual-svm.py. The change of the hyperplane over iterations is shown in figure Hyperplane Over Iteration

Hyperplane Over Iteration

In the above figure, the points with solid color are the support vectors. As the training goes on, more and more points are excluded from the support vector set; finally only 3 support vectors remain. The final separating hyperplane is clearly the optimal separating hyperplane with maximized margin.

Other Solutions?

One important feature of the Gradient Descent Algorithm is that in each iteration there is a matrix vector multiplication \(\mathbf{K}(\lambda \circ \mathbf{y})\), with a time complexity \(O(n^2)\). This might be computationally challenging if \(n\) is large.

Apart from the gradient descent method, there is another method called Sequential Minimal Optimization (SMO), which is a more efficient and specialized solution. We will discuss that in the following posts. Before we go further, I would like to introduce the SVM in more general cases.


]]>
Nianlong Gu[email protected]
An Introduction to Support Vector Machines (SVM): Convex Optimization and Lagrangian Duality Principle2019-05-25T00:00:00-07:002019-05-25T00:00:00-07:00https://nianlonggu.github.io/posts/2019/05/blog-post-3In the last post we covered how to use the gradient descent algorithm to train an SVM. Although GD can solve the SVM optimization problem, it has some shortcomings:

  • The gradient procedure is time-consuming and the solution may be suboptimal.
  • GD method cannot explicitly identify support vectors (points) which determine the hyperplane.

To overcome these shortcomings, we can take advantage of Lagrangian duality. First we convert the original SVM optimization problem into a primal (convex) optimization problem, then derive its Lagrangian dual problem. Luckily, we can solve the dual problem based on the KKT conditions using more efficient methods.

First of all, we briefly introduce Lagrangian duality and the Karush-Kuhn-Tucker (KKT) conditions.

Lagrangian Duality Principle

Primal Problem
A primal convex optimization problem has the following expression:

$$\min_{\mathbf{x}} f_0(\mathbf{x})$$ $$s.t. \ \ f_i(\mathbf{x}) \leq 0, \ i=1,\dots,n $$ $$\ \ \ \ \ \ \ h_j(\mathbf{x}) = 0, \ j=1,\dots,p$$

where \(f_i(\mathbf{x}) _{(i=0,1,\dots,n)}\) are convex, and \(h_j(\mathbf{x}) _{(j=1,\dots,p)}\) are linear (or affine).

  • The constraints that \(f_i(\mathbf{x}) _{(i=0,1,\dots,n)}\) are convex define a convex region.
  • The constraints that \(h_j(\mathbf{x}) _{(j=1,\dots,p)}\) are linear confine the region to the intersection of multiple hyperplanes (potentially reducing the dimensionality).

We can get the Lagrangian function:

$$ L(\mathbf{x}, \mathbf{\lambda}, \mathbf{\mu}) = f_0(\mathbf{x}) + \sum_{i=1}^{n}\lambda_{i}f_i(\mathbf{x}) + \sum_{j=1}^{p}\mu_jh_j(\mathbf{x}) $$

Since \(f_i(\mathbf{x})\) are convex, and \(h_j(\mathbf{x})\) are linear, \(L(\mathbf{x}, \mathbf{\lambda}, \mathbf{\mu})\) is also convex w.r.t \(\mathbf{x}\). Therefore, we can get the infimum of \(L(\mathbf{x}, \mathbf{\lambda}, \mathbf{\mu})\), which is called the Lagrangian dual function:

$$ g(\mathbf{\lambda},\mathbf{\mu})= \inf_\mathbf{x} \ L(\mathbf{x},\mathbf{\lambda},\mathbf{\mu}) $$

The difference between minimum and infimum:

  • \(\min(S)\) means the smallest element in the set \(S\);
  • \(\inf(S)\) means the largest value which is less than or equal to every element of \(S\).
  • In the case where the minimum value is attainable, infimum = minimum, e.g. \(S=\{\text{all natural numbers}\}\), then \(\inf(S) = \min(S) = 0\).
  • In the case where the minimum is not attainable, the infimum may still exist, e.g. \(S=\{f(x)\vert f(x)=1/x, x>0\}\), \(\inf(S)=0\).

Dual Problem. Based on the dual function, we can get the dual optimization problem:

$$\max_{\mathbf{\lambda},\mathbf{\mu}}\ g(\mathbf{\lambda},\mathbf{\mu})$$ s.t. $$\lambda_i \geq 0, i=1,\dots,n$$ $$\small\text{and other constraints introduced by computing the dual function}$$

Strong Duality and Slater’s Condition
Let \(f_0^\star(x)\) and \(g^\star(\mathbf{\lambda},\mathbf{\mu})\) be the primal optimum and the dual optimum respectively. Weak duality means that \(g^\star(\mathbf{\lambda},\mathbf{\mu}) \leq f_0^\star(x)\). The difference \(f_0^\star(x)-g^\star(\mathbf{\lambda},\mathbf{\mu})\) is called the duality gap.

Under certain circumstances, the duality gap can be 0, which means strong duality holds. One sufficient condition for this is Slater’s condition:

  • Apart from the constraints in the primal problem, Slater’s condition requires that there exists a strictly feasible point, i.e. some \(\mathbf{x}\) for which all inequality constraints hold strictly. When the constraints \(f_i(\mathbf{x}) _ {(i=1,\dots,n)}\) are linear (or affine), as they are in SVM, the strictness requirement can be dropped: any feasible point suffices.

If Slater’s condition is satisfied, strong duality holds; furthermore, for the optimal values \(\mathbf{x}^\star\), \(\mathbf{\lambda}^\star\) and \(\mathbf{\mu}^\star\), the Karush-Kuhn-Tucker (KKT) conditions also hold.

Karush-Kuhn-Tucker (KKT) Conditions
KKT conditions contain four conditions:

  1. primal constraints
    $$f_i(\mathbf{x}^\star)\leq 0, \ i=1,\dots,n$$ $$h_j(\mathbf{x}^\star)=0, \ j=1,\dots,p$$
  2. dual constraints
    $$ \lambda_i^\star\geq 0, \ i=1,\dots,n $$
  3. Stationarity: compute the infimum of \(L\) w.r.t \(\mathbf{x}\)
    $$\nabla_{\mathbf{x}} L(\mathbf{x}^\star, \mathbf{\lambda}^\star, \mathbf{\mu}^\star) = 0$$
  4. Complementary Slackness
    $$ \lambda_i^\star f_i(\mathbf{x}^\star) = 0, \ i=1,\dots,n $$

Therefore, if strong duality holds, we can first solve the dual problem and get the optimal \(\mathbf{\lambda}^\star\), \(\mathbf{\mu}^\star\), and then substitute the dual optimum into the KKT conditions (especially the stationarity condition) to get the primal optimum \(\mathbf{x}^\star\). The primal convex optimization problem is then solved.

Apply Lagrangian Duality to SVM

Now we are able to solve the SVM optimization problem using Lagrangian duality. As introduced in the first post An Introduction to Support Vector Machines (SVM): Basics, the SVM optimization problem is:

$$ \min_{\mathbf{w},b}\frac{1}{2}\|\mathbf{w}\|^2$$ s.t. $$\ \ y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1$$

The Lagrangian function is

$$ L(\mathbf{w}, b , \mathbf{\lambda}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{n}\lambda_i(1-y_i(\mathbf{w}^T\mathbf{x}_i+b)) $$

To compute the Lagrangian dual function, we compute the partial derivatives of \(L\) w.r.t \(\mathbf{w}\) and \(b\) and set them to 0 (see the stationarity condition of the KKT conditions):

$$ \frac{\partial{L}}{\partial{\mathbf{w}}} = \mathbf{w} - \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_i = 0\\ \frac{\partial{L}}{\partial{b}} = -\sum_{i=1}^{n}\lambda_i y_i =0 $$

Then we get

$$\mathbf{w}^\star = \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_i$$ $$\sum_{i=1}^{n}\lambda_i y_i = 0$$

Substituting these two constraint equations into \(L(\mathbf{w},b,\mathbf{\lambda})\), we get the Lagrangian dual function:

$$ \begin{align} g(\mathbf{\lambda}) & = \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + \sum_{i=1}^{n}\lambda_i(1-y_i( \sum_{j=1}^{n}\lambda_j y_j \mathbf{x_j}^T\mathbf{x}_i +b ))\\ & = \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j - \sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + \sum_{i=1}^{n}\lambda_i - (\sum_{i=1}^{n}\lambda_i y_i)b \\ &= \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \end{align} $$

Then the dual problem is:

$$ \max_{\lambda} \ \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j\\ \begin{align} s.t. \ \ &\lambda_i \geq 0, \ \ i=1,\dots,n\\ &\sum_{i=1}^{n} \lambda_i y_i = 0 \end{align} $$

We can solve this dual problem using Gradient descent algorithm or Sequential Minimal Optimization (SMO). This will be discussed in the next post.

Once we get the dual optimum \(\lambda^\star\), we can get the primal optimum \(\mathbf{w}^\star=\sum_{i=1}^{n} \lambda_i^\star y_i\mathbf{x}_ i\). But how do we get the optimal \(b^\star\)? To answer this, we need to analyze the KKT conditions for the SVM optimization problem.

KKT conditions for SVM

Since the primal constraints \(1-y_i(\mathbf{w}^T\mathbf{x}_ i+b)\leq 0\) are affine, Slater’s condition holds, strong duality holds, and the KKT conditions are satisfied at the primal optimum and dual optimum of the SVM. Therefore, we have the complementary slackness:

$$ \lambda_i^\star (1-y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star))=0, \ \ i=1,\dots,n $$

This looks interesting. From the dual constraints we know that \(\lambda^\star\geq 0\). Together with the complementary slackness, this tells us that if \(\lambda_i^\star>0\), then it must hold that \(y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star)=1\). This means \(\mathbf{x}_i\) is exactly one of the support vectors (the points at margin distance from the separating hyperplane)!

Therefore, we find a way to identify support vectors using Lagrangian duality:

  • Compute the dual optimum; if \(\lambda_i^\star>0\), then \(\mathbf{x}_ i\) is a support vector.

Let \(S=\{i\vert \lambda^\star_i > 0\}\) represent the support vector set, \(S_+=\{i\vert i\in S\ \text{and}\ y_i=+1\}\) represent the subset whose labels are \(+1\), and \(S_-=\{i\vert i\in S\ \text{and}\ y_i=-1 \}\) represent the subset whose labels are -1. Then the primal optimum will be:

$$ \mathbf{w}^\star = \sum_{i\in S} \lambda_i^\star y_i \mathbf{x}_i\\ $$

Since for every support vector \(\mathbf{x}_i,\ i\in S\), it holds that \(y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star)=1\), and \(y_i \in \{-1,+1\}\), multiplying both sides by \(y_i\) gives \({\mathbf{w}^\star}^T\mathbf{x}_i + b^\star= y_i\). Therefore, the primal optimum of \(b\) is:

$$ b^\star = y_i - {\mathbf{w}^\star}^T\mathbf{x}_i, \ \ i\in S $$

or

$$ b^\star = -\frac{1}{2}({\mathbf{w}^\star}^T\mathbf{x}_i + {\mathbf{w}^\star}^T\mathbf{x}_j ), \ \ i\in S_+,\ j \in S_- $$

In practice, in order to avoid the influence of noise, we may use a more stable way to compute \(b^\star\) by averaging over all support vectors:

$$ b^\star = \frac{1}{\vert S\vert} \sum_{i\in S} \left( y_i - {\mathbf{w}^\star}^T\mathbf{x}_i \right) $$
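Putting these formulas together, here is a minimal numpy sketch of recovering the primal optimum from a dual optimum (the helper name and the numerical tolerance are my own, not from any library):

```python
import numpy as np

def recover_primal(X, y, lam, tol=1e-8):
    # support vectors: indices with lambda_i^* > 0 (up to numerical tolerance)
    S = lam > tol
    # w* = sum_i lambda_i^* y_i x_i (terms with lambda_i^* ~ 0 contribute nothing)
    w = (lam * y) @ X
    # b* = (1/|S|) sum_{i in S} (y_i - w*^T x_i), the noise-robust average
    b = float(np.mean(y[S] - X[S] @ w))
    return w, b

# toy check: x1=(1,1), y=+1 and x2=(-1,-1), y=-1 with dual optimum lambda=(1/4,1/4)
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
w, b = recover_primal(X, y, np.array([0.25, 0.25]))
```

For this symmetric toy case, \(\mathbf{w}^\star=(0.5,0.5)\) and \(b^\star=0\), so both points sit exactly on the margin boundary.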

Use SVM for Classification

Given a new point \(\mathbf{x}\), we can compute the value \({\mathbf{w}^\star}^T\mathbf{x}+b^\star\) and predict the label \(\hat{y}\) using a hard or soft decision, as shown in An Introduction to Support Vector Machines (SVM): Gradient Descent Solution. Substituting the expression for \({\mathbf{w}^\star}\), we have:

$$ {\mathbf{w}^\star}^T\mathbf{x} + b^\star = \sum_{i\in S}\lambda_i^\star y_i \mathbf{x}_i^T\mathbf{x} +b^\star $$

This implies that we only need the support vectors to determine the separating hyperplane and to classify new points. Furthermore, notice that both in the dual problem and in the classification rule, the inputs enter only through inner products \(\mathbf{x}_i^T\mathbf{x}_j\). This property is what enables the kernel SVM, which will be discussed in the following posts.
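To make the inner-product structure concrete, here is a sketch of a decision function that touches the inputs only through \(\mathbf{x}_i^T\mathbf{x}\) (the names are illustrative, not from a library):

```python
import numpy as np

def decision_value(x, X_sv, y_sv, lam_sv, b):
    # f(x) = sum_{i in S} lambda_i^* y_i x_i^T x + b^*
    # the inputs enter only via the inner products X_sv @ x, which is the
    # exact hook that a kernel function k(x_i, x) would later replace
    return float((lam_sv * y_sv) @ (X_sv @ x) + b)

# support vectors of the running toy example (lambda_i^* = 1/4 each, b* = 0)
X_sv = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_sv = np.array([1.0, -1.0])
lam_sv = np.array([0.25, 0.25])
f = decision_value(np.array([1.0, 1.0]), X_sv, y_sv, lam_sv, 0.0)  # on the margin: f = 1
```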

In the next post I will introduce how to solve the dual problem.


]]>
Nianlong Gu[email protected]
An Introduction to Support Vector Machines (SVM): Gradient Descent Solution2019-05-24T00:00:00-07:002019-05-24T00:00:00-07:00https://nianlonggu.github.io/posts/2019/05/blog-post-2

In the last post, we discussed that the SVM optimization problem is:

$$\text{min}\frac{1}{2}\|\mathbf{w}\|^2,\ \ \text{s.t.}\ \ \ y_i(\mathbf{w}^T\mathbf{x}_i+b)\geq 1, \ \ i=1,\dots,n$$

There are multiple ways to solve this optimization problem. One way is to treat it as a standard optimization problem and use the gradient descent algorithm to compute the optimal parameters. Another is to formulate the Lagrangian dual of the primal problem, transforming the original optimization problem into an easier one. Here we mainly discuss the first method.

Gradient Descent Algorithm

To apply GD, we need to design a new objective function that is differentiable (almost everywhere, so that subgradients can be used at the kink). The new objective function is:

$$\text{min}_{\mathbf{w},b}\ L=\frac{\lambda}{2}\|\mathbf{w}\|^2+\frac{1}{n}\sum^{n}_{i=1}{\max\{1-y_i(\mathbf{w}^T\mathbf{x}_i+b) ,0\}}$$

This objective function contains two terms. The first term, also called the regularization term, is used to maximize the margin. The second term penalizes the cases where \(y_i(\mathbf{w}^T\mathbf{x}_i+b)<1\), which represent incorrect or imperfect classification. Since the case \(y_i(\mathbf{w}^T\mathbf{x}_i+b)\geq 1\) needs no penalty, we use the max function \(\max\{1-y_i(\mathbf{w}^T\mathbf{x}_i+b), 0\}\). This is also called the hinge loss:

$$ h(z) = \max\{1-z, 0\} $$
hinge function
It looks like a hinge, doesn't it?
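As a one-line numpy sketch of the hinge function:

```python
import numpy as np

def hinge(z):
    # h(z) = max(1 - z, 0): zero once the margin condition z >= 1 is met
    return np.maximum(1.0 - z, 0.0)

losses = hinge(np.array([2.0, 1.0, 0.0, -1.0]))  # -> [0., 0., 1., 2.]
```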

\(\lambda\) is a weight parameter that controls the weight of the regularization term. If \(\lambda\) is too small, the model (the learned hyperplane) mainly focuses on correctly classifying the training data, but the margin may not be maximized. If \(\lambda\) is too large, the model will have a large margin, but there may be more misclassified points in the training dataset.

Compute the gradient
To apply GD we also need to get the exact expression of the gradient.

$$\frac{\partial{L}}{\partial{\mathbf{w}}}=\lambda\mathbf{w}-\frac{1}{n}\sum_{i=1}^{n}{u(1-y_i(\mathbf{w}^T\mathbf{x}_i+b))y_i\mathbf{x}_i} $$
$$ \frac{\partial{L}}{\partial{b}}=-\frac{1}{n}\sum_{i=1}^{n}{u(1-y_i(\mathbf{w}^T\mathbf{x}_i+b))y_i} $$

where

$$u(z)=\begin{cases} 1, & \text{if } z>0,\\ 0, & \text{otherwise}. \end{cases}$$

The updating rules of the parameter \(\mathbf{w}\) and \(b\) are:

$$\mathbf{w}\leftarrow \mathbf{w} - \alpha\frac{\partial{L}}{\partial{\mathbf{w}}}$$ $$b\leftarrow b - \alpha\frac{\partial{L}}{\partial{b}}$$

where \(\alpha\) is the learning rate.

Note that in practice, in each update loop we may not use the whole training dataset; instead we may use a mini-batch. Suppose the mini-batch size is \(m\); then the expression of the gradient is:

$$\frac{\partial{L}}{\partial{\mathbf{w}}}=\lambda\mathbf{w}-\frac{1}{m}\sum_{i=1}^{m}{u(1-y_i(\mathbf{w}^T\mathbf{x}_i+b))y_i\mathbf{x}_i}$$ $$\frac{\partial{L}}{\partial{b}}=-\frac{1}{m}\sum_{i=1}^{m}{u(1-y_i(\mathbf{w}^T\mathbf{x}_i+b))y_i}$$

In the following we will use this mini-batch style expression.
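A minimal numpy sketch of these mini-batch updates (function and variable names are my own; the author's full script is linked in the next section):

```python
import numpy as np

def subgradients(w, b, Xb, yb, lam):
    # mini-batch (sub)gradients of L = (lam/2)||w||^2 + mean hinge loss
    margins = yb * (Xb @ w + b)
    u = (1.0 - margins > 0).astype(float)     # u(1 - y_i(w^T x_i + b))
    gw = lam * w - (u * yb) @ Xb / len(yb)
    gb = -np.mean(u * yb)
    return gw, gb

# tiny linearly separable example, full-batch updates for simplicity
X = np.array([[2.0, 2.0], [3.0, 2.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha, lam = np.zeros(2), 0.0, 0.1, 1e-4
for _ in range(2000):
    gw, gb = subgradients(w, b, X, y, lam)
    w, b = w - alpha * gw, b - alpha * gb
# after training, every point should sit comfortably on the correct side
```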

Code Implementation

To test the GD algorithm, we use the 2D toy data shown in the figure below.

2d toy data
In this dataset, each \(\mathbf{x}_ i\) is a 2-dimensional vector, and there are 2000 samples in total. We need to use GD to find the optimal separating hyperplane, which is a line in this case. The code is available in my github: SupportVectorMachine/gd-svm.py.

Experiment and Analysis

Visualization of Hyperplane
In this part, we set \(\lambda=10^{-4}\), learning rate \(=0.1\), batch size \(=100\), and maximum iterations \(=100000\). The change of the hyperplane over iterations is shown in the figure Hyperplane Over Iteration

Hyperplane Over Iteration
After 100000 iterations the hyperplane looks accurate and the margin seems to be maximized. If we compare the final result with the SVM illustration figure, we will find that they are very similar, which implies that the gradient descent algorithm does work!
Comparison between experiment results and model illustration
Influence of \(\lambda\) on the final results
We can also test the influence of \(\lambda\) on the final results of the hyperplane, to check if our illustration on \(\lambda\) above is right or not. The results are shown in figure Influence Of Lambda.
Influence Of Lambda
The results are within our expectations. When \(\lambda\) is too large, like 0.1, the margin is very large, but there are actually some points inside the margin area, which means the constraint \(y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1\) is not satisfied for those points. When \(\lambda\) is smaller, the margin becomes smaller, but all points satisfy the constraint.

We also noticed that when \(\lambda\) is extremely small, like 1e-5, the margin does not shrink further. In fact, we tested that even with \(\lambda=0\) we still get the same ideal result, which implies that the regularization term in the loss function is useless in this toy example! This may be because, for such a simple dataset, it is very easy to find the optimal separating hyperplane and support vectors. Once the optimal separating hyperplane is found, the model sticks to it even without the regularization term, since at that point the gradient is 0 and training effectively stops.

Use SVM for classification

Suppose that we have obtained the optimal \(\mathbf{w}^{\star}\) and \(b^{\star}\), given a new input data \(\mathbf{x}\), we can make a decision of the label \(\hat{y}\) in two ways:

Hard Decision
\(\hat{y}=\begin{cases} +1, & \text{if}\ {\mathbf{w}^{\star}}^T\mathbf{x} +b^{\star}\geq 0\\ -1, & \text{if}\ {\mathbf{w}^{\star}}^T\mathbf{x} +b^{\star} < 0\\ \end{cases}\)

Soft Decision
\(\hat{y} = d( {\mathbf{w}^{\star}}^T\mathbf{x} +b^{\star} )\)
where \(d(z) = \begin{cases} 1, & \text{if}\ z \geq 1 \\ z, & \text{if}\ -1 \leq z < 1\\ -1, & \text{if}\ z < -1\\ \end{cases}\)
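In numpy these two decision rules can be sketched as:

```python
import numpy as np

def hard_decision(z):
    # z = w*^T x + b*: sign decision, with +1 on the boundary itself
    return np.where(z >= 0, 1.0, -1.0)

def soft_decision(z):
    # piecewise-linear d(z): saturates at +/-1 outside the margin band
    return np.clip(z, -1.0, 1.0)

labels = hard_decision(np.array([0.3, -0.2]))       # -> [ 1., -1.]
scores = soft_decision(np.array([2.0, 0.5, -3.0]))  # -> [ 1., 0.5, -1.]
```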

So that’s it. Now we are able to use GD to train an SVM model and use it for classification tasks. In the next post we will explore more possibilities for solving the SVM.


]]>
Nianlong Gu[email protected]