The Jupyter notebook is available on my GitHub repo. Click HERE to try it live on Google Colab!
Why use a TPU?
A TPU is much faster than a GPU. A single TPU contains 8 cores, each with 8 GB of memory. During training, each batch of data is dispatched equally across all 8 cores, so the effective memory size is 8 × 8 = 64 GB. This makes it possible to train some fairly large models.
How to run a GAN on a TPU?
Use TPUEstimator.
Unlike a classic classification task, training a GAN alternates between updating the generator and the discriminator. This makes it impossible to simply run Keras on a TPU. TPUEstimator allows you to configure the network and the training/optimizer behavior more flexibly.
# here we force google colab to use tensorflow 1.x, the configuration will be slightly different for tf 2.0
%tensorflow_version 1.x
import tensorflow as tf
# tf.enable_eager_execution()
import numpy as np
# we use mnist dataset as an example
import keras.datasets.mnist as mnist
import math
import os
import matplotlib.pyplot as plt
import imageio
from google.colab import auth
auth.authenticate_user()
The call auth.authenticate_user() is needed to access Google Cloud Storage, where we save/restore the model and load the training/testing data.
def add_padding( x, padding_size=(2,2,2,2), padding_value = 1 ):
# x is a 4-D ndarray with values in [0, 1]
background = padding_value * np.ones( [ x.shape[0], x.shape[1]+ padding_size[0]+padding_size[2], x.shape[2] + padding_size[1]+padding_size[3], x.shape[3] ] ).astype(np.float32)
background[:, padding_size[0]:-padding_size[2], padding_size[1]:-padding_size[3], : ] = x
padded_x = background
return padded_x
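As a quick check (not from the original post), the same padding can be produced with np.pad using a constant fill value:

```python
import numpy as np

# add_padding above is equivalent to constant padding on the two spatial axes
x = np.random.rand(4, 28, 28, 1).astype(np.float32)
padded = np.pad(x, ((0, 0), (2, 2), (2, 2), (0, 0)), constant_values=1.0)
print(padded.shape)  # (4, 32, 32, 1)
```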
# to convert a bulk of images into grid of images
def make_grid(images, ncol= None):
# ncol is the number of columns of the image grid; if ncol is None, the grid is arranged as close to a square as possible
# this function assumes the input images are in RGB color space, normalized float type
if np.max(images)-np.min(images) >1 :
images = np.clip( images, -1,1 )
images = images /2 +0.5
image_num = images.shape[0]
num_h = None
num_w = None
im_h = images.shape[1]
im_w = images.shape[2]
im_c = images.shape[3]
if (ncol==None):
num_w = int( np.ceil(np.sqrt(image_num )))
num_h = int( np.ceil( image_num/ num_w ))
else:
num_w = int(ncol)
num_h = int( np.ceil( image_num/num_w ))
# create a white pannel, which is a [height, width, channel] ndarray
pannel = np.ones(( num_h * im_h, num_w * im_w , im_c )).astype(np.float32)
for i in range( image_num ):
start_h = int(i / num_w) * im_h
start_w = (i % num_w) * im_w
pannel[ start_h: start_h+im_h , start_w : start_w + im_w ,: ]= images[i,:,:,:]
return pannel
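For a batch that exactly fills the grid, the same tiling can be done without a Python loop via reshape/transpose; a sketch assuming image_num == num_h * num_w:

```python
import numpy as np

# Tile a (64, 28, 28, 1) batch into an 8x8 grid with pure array reshaping:
# (rows, cols, h, w, c) -> (rows, h, cols, w, c) -> (rows*h, cols*w, c)
images = np.random.rand(64, 28, 28, 1).astype(np.float32)
num_h, num_w = 8, 8
grid = (images.reshape(num_h, num_w, 28, 28, 1)
              .transpose(0, 2, 1, 3, 4)
              .reshape(num_h * 28, num_w * 28, 1))
print(grid.shape)  # (224, 224, 1)
```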
For using a TPU in a practical scenario, it is recommended to use tf.data.TFRecordDataset. The reason for not using tf.data.Dataset.from_tensor_slices is that it stores the training data directly in the computation graph, which consumes a lot of memory, especially for a large training dataset.
A typical work flow is:
Get the original image data and write it into a TFRecord file, where each image corresponds to a single record.
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train= X_train[:,:,:,np.newaxis]
X_test = X_test[ :,:,:,np.newaxis ]
X_train = (X_train/255).astype(np.float32)
X_test = (X_test/255).astype(np.float32)
print(X_train.shape)
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
(60000, 28, 28, 1)
The image shape will be used later on. Next, we write the MNIST images into TFRecord files.
def make_tfrecord( file_name, images ):
# define a TFRecordWriter
writer = tf.python_io.TFRecordWriter(file_name, options= tf.io.TFRecordOptions(tf.io.TFRecordCompressionType.GZIP) )
for img in images:
if isinstance( img, str ): # This case img is the path to the image file
img =imageio.imread(img)
img = (img/255.0).astype(np.float32)
features = {
"image": tf.train.Feature( float_list = tf.train.FloatList( value = img.reshape(-1) ) ),
"image_shape": tf.train.Feature( int64_list = tf.train.Int64List( value = img.shape ) )}
# tf_serialized_example contains the serialized information about the value and shape of an image
tf_serialized_example = tf.train.Example( features = tf.train.Features( feature = features ) ).SerializeToString()
writer.write( tf_serialized_example )
writer.close()
make_tfrecord( 'mnist_train.tfrecord', X_train )
make_tfrecord( 'mnist_eval.tfrecord', X_test[:5000] )
Copy the generated TFRecord files to the GCS bucket.
! gsutil cp mnist_train.tfrecord mnist_eval.tfrecord gs://gan-tpu-tutorial/data
Copying file://mnist_train.tfrecord [Content-Type=application/octet-stream]...
Copying file://mnist_eval.tfrecord [Content-Type=application/octet-stream]...
- [2 files][ 16.9 MiB/ 16.9 MiB]
Operation completed over 2 objects/16.9 MiB.
One can also load multiple TFRecord files at once:
ds=tf.data.TFRecordDataset(["record1.tfrecord","record2.tfrecord","record3.tfrecord"], compression_type='GZIP')
The TPUEstimator mainly consists of two parts: the train/eval/predict input functions and the model function.
def parse_tfrecord_func( serialized_record ):
parse_dic = { "image": tf.FixedLenFeature(shape=(28,28,1), dtype = tf.float32 ),
"image_shape": tf.FixedLenFeature( shape=(3,), dtype = tf.int64 )
}
parses_record = tf.parse_single_example( serialized_record, parse_dic )
## note that parse_single_example can only be placed before batch()
## parse_example can only be placed after batch()
return {"image": parses_record["image"]}
def train_input_fn( batch_size ):
dataset_x_train = tf.data.TFRecordDataset([ "gs://gan-tpu-tutorial/data/mnist_train.tfrecord" ], compression_type='GZIP')
dataset_x_train = dataset_x_train.shuffle(60000).repeat()
pattern = np.array([0,0,0,0,1]).repeat(batch_size).astype(np.float32)
dataset_g_w = tf.data.Dataset.from_tensor_slices( { "g_w": pattern } ).repeat()
dataset_output = tf.data.Dataset.from_tensor_slices( ( np.zeros( ( batch_size ) ).astype(np.float32) ) ).repeat()
ds = tf.data.Dataset.zip(( dataset_x_train, dataset_g_w, dataset_output))
def merge_func(a,b,c):
a = parse_tfrecord_func(a)
a.update(b)
return a, c
ds = ds.map(merge_func)
return ds.batch( batch_size, drop_remainder = True ).prefetch(buffer_size =1)
def eval_input_fn( batch_size ):
dataset_x_eval = tf.data.TFRecordDataset([ "gs://gan-tpu-tutorial/data/mnist_eval.tfrecord" ], compression_type='GZIP')
dataset_x_eval = dataset_x_eval.shuffle(10000).repeat()
dataset_output = tf.data.Dataset.from_tensor_slices( ( np.zeros( ( batch_size ) ).astype(np.float32) ) ).repeat()
ds = tf.data.Dataset.zip(( dataset_x_eval, dataset_output))
def merge_func(a,b):
a = parse_tfrecord_func(a)
return a, b
ds = ds.map(merge_func)
return ds.batch(batch_size, drop_remainder = True).prefetch(buffer_size =1)
def predict_input_fn( z ):
dataset_z_input = tf.data.Dataset.from_tensor_slices( ( z.astype(np.float32), np.zeros(( z.shape[0],) ).astype(np.float32) ) )
return dataset_z_input.batch(64, drop_remainder= False)
The input functions above return a (features, labels) pair:
features: {"image": image_tensors, "g_w": g_w}
labels: dataset_output (a dummy value 0)
def generator( z, scope="generator", trainable= True ):
with tf.variable_scope( scope, reuse= tf.AUTO_REUSE ):
net = tf.layers.BatchNormalization()( tf.layers.Dense( 7*7*128, activation= tf.nn.relu )(z) )
net = tf.reshape( net, [ tf.shape(net)[0], 7, 7, 128 ] )
net = tf.layers.BatchNormalization()( tf.layers.Conv2DTranspose( 64, 5, (2,2), "same", activation= tf.nn.relu )(net) )
net = tf.layers.BatchNormalization()( tf.layers.Conv2DTranspose( 32, 5, (2,2), "same", activation= tf.nn.relu )(net) )
net = tf.layers.Conv2D(1, 5, (1,1), "same", activation= tf.nn.sigmoid )(net)
return net
def discriminator( x, scope="discriminator", trainable = True ):
with tf.variable_scope( scope, reuse= tf.AUTO_REUSE ):
net = tf.layers.Conv2D( 32, 5, (2,2), "same", activation= tf.nn.leaky_relu )(x)
net = tf.layers.BatchNormalization()( tf.layers.Conv2D( 64, 5, (2,2), "same", activation= tf.nn.leaky_relu )(net) )
net = tf.layers.BatchNormalization()( tf.layers.Conv2D( 128, 5, (2,2), "same", activation= tf.nn.leaky_relu )(net) )
net = tf.layers.Flatten()(net)
net = tf.layers.BatchNormalization()( tf.layers.Dense( 128, activation= tf.nn.leaky_relu )(net) )
net = tf.layers.Dense( 1 )(net)
return net
def metric_fn(loss_gen, loss_dis, W_dis ):
"""Function to return metrics for evaluation.
The input parameters can be arbitrary.
"""
return {"loss_gen": tf.metrics.mean(loss_gen),
"loss_dis": tf.metrics.mean(loss_dis),
"wasserstein_distance": tf.metrics.mean( W_dis ),
}
def model_fn(features, labels, mode, params):
lr = params["learning_rate"]
z_dim = params["z_dim"]
if mode == tf.estimator.ModeKeys.TRAIN or mode == tf.estimator.ModeKeys.EVAL:
""" Part I. create the model networks"""
x = features["image"]
is_train = mode == tf.estimator.ModeKeys.TRAIN
random_z = tf.random.normal( [tf.shape(x)[0], z_dim ] )
gen_x = generator( random_z, trainable= is_train )
dis_x = discriminator( x, trainable= is_train )
dis_gen_x = discriminator( gen_x, trainable= is_train )
# This is used to compute the gradient penalty
epsilon = tf.random.uniform( [ tf.shape(x)[0],1,1,1 ], minval=0, maxval= 1 )
interp_x = epsilon * x + (1-epsilon) * gen_x
dis_interp_x = discriminator( interp_x, trainable= is_train )
gradient_x = tf.gradients( dis_interp_x, [ interp_x ] )[0]
gradient_penalty = tf.square( tf.sqrt( tf.reduce_sum( tf.square(gradient_x ),[1,2,3] ) ) - 1 )
LAMBDA = 10
"""Part II. define the loss and relative parameters for mode == TRAIN/EVAL/PREDICT"""
## compute loss
loss_dis = dis_gen_x - dis_x + LAMBDA * gradient_penalty
loss_gen = - dis_gen_x
W_dis = dis_x - dis_gen_x
## operations for the training mode, define the optimizer, and reconfig it using tpu.CrossShardOptimizer
if mode == tf.estimator.ModeKeys.TRAIN:
g_w = features["g_w"]
loss_dis = tf.reduce_mean( loss_dis )
## when g_w = 0, the gradient of loss_gen is 0, so the generator is not updated in the current batch
loss_gen = tf.reduce_mean( loss_gen * g_w)
W_dis = tf.reduce_mean(W_dis)
# Define the optimizer
d_optimizer = tf.train.AdamOptimizer(learning_rate=lr, beta1=0, beta2= 0.99 )
g_optimizer = tf.train.AdamOptimizer(learning_rate=lr, beta1=0, beta2= 0.99 )
# convert to TPU optimizer version
d_optimizer = tf.tpu.CrossShardOptimizer(d_optimizer)
g_optimizer = tf.tpu.CrossShardOptimizer(g_optimizer)
with tf.control_dependencies( tf.get_collection( tf.GraphKeys.UPDATE_OPS )):
d_op = d_optimizer.minimize( loss = loss_dis, var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,\
scope="discriminator") )
g_op = g_optimizer.minimize(loss = loss_gen, var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,\
scope="generator"),global_step= tf.train.get_global_step() )
# This group command will group the discriminator optimization and generator optimization together
# all the optimizations in the group will be run during each batch
# g_w can control whether to update the parameters of generator or not, which plays the role of n_critic in WGAN
train_op = tf.group( [ d_op , g_op] )
spec= tf.estimator.tpu.TPUEstimatorSpec(mode=mode, loss= W_dis ,train_op= train_op )
## for EVAL mode, the parameters eval_metrics takes a tuple or list of two elements. The first element is a callable function,
## The second element is a list of its arguments. The return value of the callable function will be shown in the evaluation results
elif mode == tf.estimator.ModeKeys.EVAL:
spec = tf.estimator.tpu.TPUEstimatorSpec(mode=mode, loss= tf.reduce_mean(W_dis), eval_metrics=(metric_fn, [loss_gen, loss_dis, W_dis ] ) )
elif mode == tf.estimator.ModeKeys.PREDICT:
""" construct the model (only the generator part) """
input_z = features
gen_x = generator( input_z, trainable= False )
"""Define the predictions"""
predictions = { "generated_images": gen_x }
spec= tf.estimator.tpu.TPUEstimatorSpec( mode = mode, predictions = predictions )
return spec
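To see what the gradient penalty enforces, here is a toy numpy illustration (not part of the estimator code) using a hypothetical linear critic D(x) = w · x, whose input gradient is exactly w:

```python
import numpy as np

# For a linear critic D(x) = w . x, grad_x D = w everywhere, so the
# WGAN-GP term reduces to LAMBDA * (||w|| - 1)^2: it pushes the critic's
# gradient norm toward 1 (the 1-Lipschitz constraint).
LAMBDA = 10
w = np.array([0.6, 0.8])  # ||w|| = 1, i.e. already 1-Lipschitz
penalty = LAMBDA * (np.linalg.norm(w) - 1.0) ** 2
print(penalty)  # ~0 for a critic with unit gradient norm
```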
iterations_per_loop is the number of batches fed to the TPU before control returns to the host CPU.
model_dir="gs://gan-tpu-tutorial/model"
tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
iterations_per_loop = 200
run_config = tf.estimator.tpu.RunConfig(
model_dir=model_dir,
cluster=tf.distribute.cluster_resolver.TPUClusterResolver(tpu_address),
session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True),
tpu_config=tf.estimator.tpu.TPUConfig(iterations_per_loop),
)
model = tf.estimator.tpu.TPUEstimator(
model_fn=model_fn,
params = {"learning_rate": 1e-3, "z_dim": 100 },
config = run_config,
use_tpu= True,
train_batch_size=512 ,
eval_batch_size=512 ,
predict_batch_size= 64,
)
What is the relationship between max_steps and epochs?
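Since train_input_fn repeats the dataset indefinitely, max_steps counts global batches rather than epochs; the number of epochs covered can be back-computed from the batch size and the dataset size:

```python
# Each global step consumes one global batch of train_batch_size examples,
# so with MNIST's 60,000 training images:
max_steps = 10000
train_batch_size = 512
mnist_train_size = 60000
epochs = max_steps * train_batch_size / mnist_train_size
print(epochs)  # ~85.3 epochs of MNIST
```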
model.train( input_fn = lambda params: train_input_fn( params["batch_size"] ), max_steps= 10000 )
eval_result = model.evaluate(input_fn=lambda params: eval_input_fn( params["batch_size"]), steps = 10)
eval_result
random_z = np.random.normal( size=(1024, 100) ).astype(np.float32)
pred_results = model.predict( input_fn=lambda params: predict_input_fn(random_z) )
images = np.array([ result["generated_images"] for result in pred_results ])
print("generated images")
plt.figure(figsize = (5,5))
plt.gray()
plt.imshow( make_grid(add_padding(images[np.random.choice( images.shape[0], 64, replace= False ) ])).squeeze() )
plt.show()
This is one example of generated images:

First, let’s recall the EM algorithm:
Suppose that we have the observations \(\{\mathbf{x}^{(i)}\}, i=1,\dots,n\), where each \(\mathbf{x}^{(i)}\) is related to a hidden variable \(\mathbf{z}\) which is unknown to us. The task is to find the MLE of \(\theta\):
$$ \theta_\text{MLE} = \arg \max_{\theta} \sum_{i=1}^{n}\log \sum_{\mathbf{z}} p_\theta(\mathbf{z}, \mathbf{x}^{(i)}) $$ The EM algorithm works as follows:
- Randomly initialize \(\theta\), set the \(\mathbf{z}\) prior \(p(\mathbf{z})\)
- Repeat:
At the \(l^\text{th}\) iteration:
- E step:
set \(Q_{l}^{(i)}(\mathbf{z})=p_{\theta_{l-1}}(\mathbf{z}\vert \mathbf{x}^{(i)})\) for \(i=1,\dots,n\)
- M step:
update \(\theta_{l}=\arg \max_{\theta} \sum_{i=1}^{n}\sum_{\mathbf{z}}Q_{l}^{(i)}(\mathbf{z})\log \frac{p_{\theta}(\mathbf{z}, \mathbf{x}^{(i)})}{Q_{l}^{(i)}(\mathbf{z})}\)
- Update the prior \(p(\mathbf{z})\) (optional)
Until \(\theta\) converges.
Based on the experience of solving the coin-tossing problem with EM, we can further adapt the EM algorithm:
As indicated by its name, a GMM is a mixture (a convex combination) of multiple Gaussian distributions. The probability density function of a GMM is (\(\mathbf{x}\in R^p\)):
$$ p(\mathbf{x}) = \sum_{j=1}^{M} \phi_j N(\mathbf{x};\mu_j,\Sigma_j) $$
where \(M\) is the number of Gaussian models. \(\phi_j\) is the weight factor of the Gaussian model \(N(\mu_j,\Sigma_j)\). Moreover, we have the constraint: \(\sum_{j=1}^{M} \phi_j =1\).
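The density above can be evaluated in a few lines of numpy; the weights and parameters below are made up purely for illustration:

```python
import numpy as np

# Full multivariate Gaussian density, including the (2*pi)^(p/2) factor
def gaussian_pdf(x, mu, sigma):
    p = len(x)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** p * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

# GMM density: p(x) = sum_j phi_j * N(x; mu_j, Sigma_j)
def gmm_pdf(x, phi, mu, sigma):
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(phi, mu, sigma))

phi = [0.3, 0.7]                              # mixture weights, sum to 1
mu = [np.zeros(2), np.array([3.0, 3.0])]      # component means
sigma = [np.eye(2), 2 * np.eye(2)]            # component covariances
print(gmm_pdf(np.zeros(2), phi, mu, sigma))
```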
GMM is well suited to fitting datasets that contain multiple clusters, where each cluster has a circular or elliptical shape. For example, the data distribution shown in the following figure can be modeled by a GMM.
Now the question is: given a dataset distributed as in the figure above, if we want to model it with a GMM, how do we find the MLE of the parameters (\(\phi,\mu,\Sigma\)) of the Gaussian mixture model?
The answer is: using EM algorithm!
Before we move forward, we need to figure out what the prior \(p(\mathbf{z})\) is for the GMM. Suppose there are \(M\) Gaussian models in the GMM; then the latent variable \(\mathbf{z}\) takes only \(M\) different values: \(\{\mathbf{z}^{(j)}=j \vert j=1,\dots,M\}\). The prior \(p(\mathbf{z}^{(j)})=p(\mathbf{z}=j)\) represents the probability that the data belongs to cluster (Gaussian model) \(j\), without any information about the data \(\mathbf{x}\). By marginalization we have:
$$ p(\mathbf{x}) = \sum_{j=1}^{M} p(\mathbf{z}^{(j)})\, p(\mathbf{x}\vert \mathbf{z}^{(j)}) $$ If we compare this equation with the expression of the GMM, we will find that \(p(\mathbf{z}^{(j)})\) plays the role of \(\phi_j\). In other words, we can treat \(\phi_j\) as the prior and \(p(\mathbf{x}\vert \mathbf{z}^{(j)}; \mu, \Sigma)= N(\mathbf{x};\mu_j, \Sigma_j)\) as the conditional likelihood.
Moreover, \(\mathbf{x}^{(i)}\in R^p\). The EM algorithm works as follows:
At the \(l^\text{th}\) iteration:
- E step: for \(i=1,\dots,n\) and \(j=1,\dots,M\), compute the posterior
$$ q_{i,j} = p_\theta(\mathbf{z}^{(j)}\vert \mathbf{x}^{(i)}) = \frac{\phi_j N(\mathbf{x}^{(i)};\mu_j,\Sigma_j)}{\sum_{k=1}^{M}\phi_k N(\mathbf{x}^{(i)};\mu_k,\Sigma_k)} $$
- M step: update
$$ \mu_j = \frac{\sum_{i=1}^{n} q_{i,j}\,\mathbf{x}^{(i)}}{\sum_{i=1}^{n} q_{i,j}},\quad \Sigma_j = \frac{\sum_{i=1}^{n} q_{i,j}(\mathbf{x}^{(i)}-\mu_j)(\mathbf{x}^{(i)}-\mu_j)^T}{\sum_{i=1}^{n} q_{i,j}},\quad \phi_j = \frac{1}{n}\sum_{i=1}^{n} q_{i,j} $$
Until all the parameters converge.
Suppose that we have used the EM algorithm to estimate the model parameters. What does the posterior \(p_\theta(\mathbf{z}^{(j)}\vert \mathbf{x})\) represent? It is the probability that the data \(\mathbf{x}\) belongs to Gaussian model (cluster) \(j\). Therefore, we can use the posterior expression given in the E step above to compute \(p_\theta(\mathbf{z}^{(j)}\vert \mathbf{x}),\ j=1,\dots,M\), and pick the cluster index with the largest posterior: \(c_x=\arg \max_{j} p_\theta(\mathbf{z}^{(j)}\vert \mathbf{x})\)
We implement EM for the GMM in Python and test it on a 2-D dataset.
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
Using TensorFlow backend.
def load_data( num_samples, prior_z_list , mu_list , sigma_list ):
X=[]
choice_of_gaussian_model = np.random.choice(len( prior_z_list), num_samples, p=prior_z_list )
for sample_ind in range(num_samples):
gaussian_ind = choice_of_gaussian_model[sample_ind]
x= np.random.multivariate_normal( mu_list[gaussian_ind], sigma_list[gaussian_ind] )
X.append(x)
X= np.asarray(X)
return X
def EM(X, num_clusters, epsilon = 1e-2, update_prior = True, max_iter = 100000 ):
x_dim = X.shape[1]
num_samples = X.shape[0]
## initialization
mu = np.random.uniform( size=( num_clusters, x_dim ) )
## initializing sigma as identity matrix can guarantee it's positive definite
sigma = []
for _ in range(num_clusters):
sigma.append( np.eye(x_dim) )
sigma = np.asarray(sigma)
phi = np.ones(num_clusters)/ num_clusters
count = 0
while True:
## E step
# Q is the posterior, with the dimension num_samples x num_clusters
Q=np.zeros( [num_samples, num_clusters])
sigma_det =[ (np.linalg.det(sigma[j]))**0.5 for j in range(num_clusters) ]
sigma_inverse = [ np.linalg.inv(sigma[j]) for j in range(num_clusters) ]
for i in range(num_samples):
for j in range(num_clusters):
Q[i,j]= phi[j]/( sigma_det[j] ) * np.exp( -0.5 * np.matmul( np.matmul((X[i]-mu[j]).T, sigma_inverse[j]), X[i]-mu[j]))
Q=np.array(Q)
Q=Q/(np.sum(Q,axis=1,keepdims=True))
## M step
# update mu
mu_new = np.ones([num_clusters, x_dim])
for j in range(num_clusters):
mu_new[j] = np.sum (Q[:,j:j+1]*X ,axis=0 )/np.sum(Q[:,j],axis=0)
# update sigma
sigma_new = np.zeros_like(sigma)
for j in range(num_clusters):
for i in range(num_samples):
sigma_new[j] += Q[i,j] * np.matmul( (X[i]-mu[j])[:,np.newaxis], (X[i]-mu[j])[:,np.newaxis].T )
sigma_new[j] = sigma_new[j]/np.sum(Q[:,j])
# update phi
if update_prior:
phi_new = np.mean( Q, axis=0 )
else:
phi_new = phi
delta_change = np.mean(np.abs(phi-phi_new)) + np.mean( np.abs( mu- mu_new ) )+np.mean( np.abs( sigma- sigma_new ) )
print("parameter changes: ",delta_change)
if delta_change < epsilon:
break
count +=1
if count >= max_iter:
break
phi=phi_new
mu= mu_new
sigma = sigma_new
## a function used for performing clustering
def cluster( X ):
Q=np.zeros( [X.shape[0], num_clusters])
sigma_det =[ (np.linalg.det(sigma[j]))**0.5 for j in range(num_clusters) ]
sigma_inverse = [ np.linalg.inv(sigma[j]) for j in range(num_clusters) ]
for i in range(X.shape[0]):
for j in range(num_clusters):
Q[i,j]= phi[j]/( sigma_det[j] ) * np.exp( -0.5 * np.matmul( np.matmul((X[i]-mu[j]).T, sigma_inverse[j]), X[i]-mu[j]))
Q=np.array(Q)
Q=Q/(np.sum(Q,axis=1,keepdims=True))
cluster_info = np.argmax( Q, axis=1)
return cluster_info
return {"mu":mu, "sigma":sigma, "phi":phi, "cluster": cluster}
First, let's load some data points:
real_phi = [0.2,0.6,0.1,0.1]
real_mu = [ [0,0],[2,8],[10,10],[9,1] ]
real_sigma = [ [[1,0.5],[0.5,1]], [[2,-0.6],[-0.6,1]], [[1,0],[0,1]],[[1,0.3],[0.3,0.5]] ]
X=load_data(10000, real_phi, real_mu, real_sigma )
for i in range(len(real_phi)):
print("real phi: ", real_phi[i], " real mu: ", real_mu[i], " real sigma: ", real_sigma[i])
real phi: 0.2 real mu: [0, 0] real sigma: [[1, 0.5], [0.5, 1]]
real phi: 0.6 real mu: [2, 8] real sigma: [[2, -0.6], [-0.6, 1]]
real phi: 0.1 real mu: [10, 10] real sigma: [[1, 0], [0, 1]]
real phi: 0.1 real mu: [9, 1] real sigma: [[1, 0.3], [0.3, 0.5]]
Let’s plot the data and have a look at it.
plt.scatter( X[:,0], X[:,1] )
plt.show()

Then we apply the EM algorithm to get the MLE of the GMM parameters and the cluster function:
params=EM(X, num_clusters=4, epsilon= 1E-4)
mu= params["mu"]
sigma = params["sigma"]
phi=params["phi"]
cluster = params["cluster"]
parameter changes: 28.449669073154364
parameter changes: 17.400927300989974
parameter changes: 0.9644888523985635
parameter changes: 1.0995072448163998
parameter changes: 1.3509364912075696
parameter changes: 1.2308294431017273
parameter changes: 1.3794412438676897
parameter changes: 1.4081227407466508
parameter changes: 1.0857571446279906
parameter changes: 0.7155881044307679
parameter changes: 0.411613512938475
parameter changes: 0.12457364032905578
parameter changes: 0.04685136953006225
parameter changes: 0.0540454165259536
parameter changes: 0.06456840164792643
parameter changes: 0.07771391163679765
parameter changes: 0.09436688134288668
parameter changes: 0.11582159431045104
parameter changes: 0.14421201360388664
parameter changes: 0.1834323022021212
parameter changes: 0.24801453948582258
parameter changes: 0.3558084755399498
parameter changes: 0.5349701481676721
parameter changes: 0.7677886989164794
parameter changes: 0.7666771213539978
parameter changes: 0.5043555266074152
parameter changes: 0.11678542980595268
parameter changes: 0.001048169134691374
parameter changes: 1.550958923947094e-06
# truncate the estimates to two decimal places for display
esti_mu= (mu*100).astype(np.int32)/100.
esti_sigma= (sigma*100).astype(np.int32)/100.
esti_phi= (phi*100).astype(np.int32)/100.
for i in range(len(esti_phi)):
print("esti phi:", esti_phi[i], "esti mu:", esti_mu[i].tolist(), "esti sigma:", esti_sigma[i].tolist())
esti phi: 0.09 esti mu: [8.99, 0.99] esti sigma: [[1.07, 0.31], [0.31, 0.51]]
esti phi: 0.19 esti mu: [0.01, 0.01] esti sigma: [[1.0, 0.48], [0.48, 1.01]]
esti phi: 0.1 esti mu: [10.02, 10.02] esti sigma: [[0.92, -0.01], [-0.01, 1.03]]
esti phi: 0.6 esti mu: [2.01, 7.98] esti sigma: [[2.0, -0.61], [-0.61, 1.02]]
If we compare the estimated parameters with the real parameters, we can see the estimation error is within 0.05, and the correspondence between phi, mu and sigma is also correct. Therefore, the EM algorithm does work!
We can perform clustering with the returned cluster function and plot the results:
cluster_X = cluster(X)
cluster_index = np.unique(cluster_X)
for ind in cluster_index:
plt.scatter( X[cluster_X==ind][:,0], X[cluster_X==ind][:,1], color = np.random.uniform(size=3) )
plt.legend(cluster_index)
plt.show()

Well, the clustering results are pretty accurate and reasonable! So we can use GMM for unsupervised clustering!
Discussion: As shown in the figure above, each cluster is actually a convex set.
A convex set \(S\) means that for any two points \(\mathbf{x}_1\in S, \mathbf{x}_2\in S\), the linear interpolation \(\mathbf{x}_\text{int}= \lambda \mathbf{x}_1 + (1-\lambda)\mathbf{x}_2,\ 0\leq\lambda\leq 1\), also belongs to \(S\).
This is quite reasonable, since the Gaussian distribution naturally has a convex shape. However, how will GMM clustering perform on a non-convex dataset?
First of all, let's prepare the data:
def load_non_convex_data(num_samples=10000, prior_z_list=[0.5,0.5], mu_list=[[np.pi/2, 3], [np.pi*1, -3]], sigma_list=[[[np.pi,0],[0,2]],[[np.pi,0],[0,2]]]):
X=[]
choice_of_model = np.random.choice(len( prior_z_list), num_samples, p=prior_z_list )
for ind in choice_of_model:
while True:
x= np.random.multivariate_normal( mu_list[ind], sigma_list[ind] )
if ind==0:
if x[1]>1.5*np.sin(x[0])+0.5:
break
else:
if x[1]<1.5*np.sin(x[0])-0.5:
break
X.append(x)
X= np.array(X)
return X
X= load_non_convex_data()
plt.scatter(X[:,0],X[:,1] )
plt.show()

Use the EM algorithm to estimate the parameters of the GMM:
params=EM(X, num_clusters=2, epsilon= 1E-2)
mu= params["mu"]
sigma = params["sigma"]
phi=params["phi"]
cluster = params["cluster"]
parameter changes: 7.344997536220525
parameter changes: 2.769657568563131
parameter changes: 0.6826557990296913
parameter changes: 0.8559206668196735
parameter changes: 0.9985169905722497
parameter changes: 0.6972809861725238
parameter changes: 0.16143972260766515
parameter changes: 0.014376638549487432
parameter changes: 0.002146320352925
Let’s see the clustering results:
cluster_X = cluster(X)
cluster_index = np.unique(cluster_X)
for ind in cluster_index:
plt.scatter( X[cluster_X==ind][:,0], X[cluster_X==ind][:,1], color = np.random.uniform(size=3) )
plt.legend(cluster_index)
plt.show()

From this figure we can see that the real clusters are actually non-convex, since there is a sine-shaped gap between the two real clusters. However, GMM clustering always produces convex clusters: both the blue point set and the red point set are convex. This is determined by the fact that the Gaussian distribution has a convex shape.
Now we see both the ability and the shortcoming of GMM clustering. In GMM clustering results, each cluster's region usually has a convex shape. This limits the power of GMM clustering, especially on manifold-structured data. In the future we will discuss how to cluster such non-convex datasets.
Moreover, this GMM implementation is not very practical: for some sparse datasets, when updating \(\Sigma_j\) in the M step, the covariance matrix \(\frac{ \sum_{i=1}^{n}q_{i,j}(\mathbf{x}^{(i)}-\mu_j)(\mathbf{x}^{(i)}-\mu_j)^T }{\sum_{i=1}^{n} q_{i,j} }\) may not be positive definite (i.e., it may be singular). In this case we cannot directly compute the inverse of \(\Sigma_j\). More work is needed to handle such cases.
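One common remedy, not implemented in the code above, is to add a small ridge term eps * I to the covariance estimate so it stays invertible; a minimal sketch (the eps value is a hypothetical choice to be tuned to the data scale):

```python
import numpy as np

# Regularize a (possibly singular) covariance estimate so it stays invertible.
def regularize_cov(sigma, eps=1e-6):
    return sigma + eps * np.eye(sigma.shape[0])

singular = np.array([[1.0, 1.0], [1.0, 1.0]])   # rank-1, det = 0
fixed = regularize_cov(singular, eps=1e-3)
print(np.linalg.det(fixed) > 0)  # the regularized matrix is invertible
```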
Let \(\{\mathbf{x}^{(i)}\},\ i=1,\dots,n\) be a set of independent and identically distributed observations, and \(\theta\) be the parameters of the data distribution, which are unknown to us. The maximum likelihood estimate of \(\theta\) is the parameter value that maximizes the joint likelihood \(p_\theta(\mathbf{x}^{(1)},\dots,\mathbf{x}^{(n)})= \prod_{i=1}^{n}p_\theta(\mathbf{x}^{(i)})\)
More commonly, we choose to maximize the joint log-likelihood:
$$ l(\theta) = \sum_{i=1}^{n} \log p_\theta(\mathbf{x}^{(i)}) $$
We use an example to illustrate how it works (adapted from the Zhihu article EM算法详解).
Suppose that we have a coin A whose probability of heads is \(\theta_A\). We denote one observation \(\mathbf{x}^{(i)}=\{ x_{i,1},x_{i,2},x_{i,3},x_{i,4},x_{i,5} \}\) as tossing coin A 5 times and recording heads (1) or tails (0) for each toss. For example, \(\mathbf{x}^{(i)}\) can be 01001, 01110, 10010, etc. The likelihood of the observation \(\mathbf{x}^{(i)}\) is:
$$ p(\mathbf{x}^{(i)};\theta_A) = \prod_{k=1}^{5} \theta_A^{x_{i,k}} (1-\theta_A)^{1-x_{i,k}} $$
Therefore, the log-likelihood of the joint distribution of \(n\) observations is:
$$ l(\theta_A) = \sum_{i=1}^{n}\sum_{k=1}^{5} \left[ x_{i,k}\log \theta_A + (1-x_{i,k})\log(1-\theta_A) \right] $$
The MLE of \(\theta_A\) is
$$ \hat{\theta}_{A,\text{MLE}} = \arg \max_{\theta_A} l(\theta_A) $$
To get \(\hat{\theta}_{A,\text{MLE}}\) we can solve the equation \(\frac{\partial{l(\theta_A)}}{\partial{\theta_A}}=0\).
Therefore, we have
$$ \hat{\theta}_{A,\text{MLE}} = \frac{\sum_{i=1}^{n}\sum_{k=1}^{5} x_{i,k}}{5n} $$
This is actually equivalent to computing the average of all toss results. For example, if we have the 10 observations below:
| \(\mathbf{x}^{(1)}\) | 01011 | \(\mathbf{x}^{(6)}\) | 01110 |
| \(\mathbf{x}^{(2)}\) | 01111 | \(\mathbf{x}^{(7)}\) | 01110 |
| \(\mathbf{x}^{(3)}\) | 11011 | \(\mathbf{x}^{(8)}\) | 11011 |
| \(\mathbf{x}^{(4)}\) | 00011 | \(\mathbf{x}^{(9)}\) | 00100 |
| \(\mathbf{x}^{(5)}\) | 01010 | \(\mathbf{x}^{(10)}\) | 01001 |
The total number of heads is 28 out of 50 tosses, so the MLE of \(\theta_A\) is \(\frac{28}{50}=\frac{14}{25}\)
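The counting above is easy to verify directly:

```python
# Ten observations from the table, each a string of 5 tosses (1 = heads).
obs = ["01011", "01111", "11011", "00011", "01010",
       "01110", "01110", "11011", "00100", "01001"]
heads = sum(s.count("1") for s in obs)
total = sum(len(s) for s in obs)
print(heads, total, heads / total)  # 28 50 0.56
```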
MLE with hidden variables
Now things become more complicated. Suppose we have two coins, A and B, whose probabilities of heads are \(\theta_A\) and \(\theta_B\) respectively. We want to find the MLE of \(\theta_A, \theta_B\) using \(n\) observations \(\{\mathbf{x}^{(i)}\},\ i=1,\dots,n\), each with the same form as above. The challenging part is that for each observation \(\mathbf{x}^{(i)}\), we don't know which coin it comes from. For example, with \(n=10\) and the observation set from the table above, how do we find the MLE of \(\theta_A\) and \(\theta_B\)?
This is a simple example where the observations are closely related to some hidden (unknown) variables; in other words, the information in the data is incomplete. The Expectation-Maximization (EM) algorithm can be used to solve such problems.
Before introducing the EM algorithm, we need to know an important inequality: Jensen's inequality.
Jensen's Inequality
If a function \(f(\mathbf{X})\) is strictly convex (its Hessian matrix \(H\) is positive definite), where \(\mathbf{X}\) is a random variable, we have
$$ E[f(\mathbf{X})] \geq f(E[\mathbf{X}]) $$
The equality holds if and only if \(\mathbf{X}=E[\mathbf{X}]\) with probability 1 (\(\mathbf{X}\) is a constant). Note that if \(f(\mathbf{X})\) is strictly concave, the direction of the inequality is reversed.
We can use an example to illustrate Jensen's inequality more intuitively (this is not a proof). The example is adapted from Andrew Ng's lecture notes on EM.
As shown in this figure, the random variable \(\mathbf{X}\) has only two possible values, \(a\) and \(b\), each with probability 0.5. Therefore, \(f(E[\mathbf{X}])= f(\frac{a+b}{2})\) and \(E[f(\mathbf{X})]=\frac{f(a)+f(b)}{2}\). By the convexity of \(f\), we have \(E[f(\mathbf{X})]\geq f(E[\mathbf{X}])\), and the equality holds if and only if \(a=b\), which means \(\mathbf{X}=E[\mathbf{X}]=a\).
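The two-point case can be checked numerically with the strictly convex function f(x) = x**2:

```python
# Two-point random variable X in {a, b}, probability 0.5 each,
# with the strictly convex f(x) = x**2.
a, b = 1.0, 3.0
E_X = 0.5 * a + 0.5 * b                 # E[X] = 2.0
E_fX = 0.5 * a**2 + 0.5 * b**2          # E[f(X)] = 5.0
print(E_fX >= E_X**2)                   # True: E[f(X)] >= f(E[X])
```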
Now we have a powerful tool, and we will use it to derive the EM algorithm.
EM algorithm
Recall the MLE problem:
$$ \theta_\text{MLE} = \arg \max_{\theta} \sum_{i=1}^{n} \log p_\theta(\mathbf{x}^{(i)}) $$
If \(\mathbf{x}\) is related to a latent variable \(\mathbf{z}\), we write \(p_\theta(\mathbf{x})\) as the marginal likelihood of the joint distribution:
$$ p_\theta(\mathbf{x}) = \sum_{\mathbf{z}} p_\theta(\mathbf{x}, \mathbf{z}) $$
Now our log-likelihood function \(l(\theta)\) becomes:
$$ l(\theta) = \sum_{i=1}^{n} \log \sum_{\mathbf{z}} p_\theta(\mathbf{x}^{(i)}, \mathbf{z}) $$
This expression is hard to maximize directly. To solve the maximization problem we introduce a distribution \(Q^{(i)}(\mathbf{z})\) over \(\mathbf{z}\), and rewrite the log-likelihood function as:
$$ l(\theta) = \sum_{i=1}^{n} \log \sum_{\mathbf{z}} Q^{(i)}(\mathbf{z}) \frac{p_\theta(\mathbf{x}^{(i)}, \mathbf{z})}{Q^{(i)}(\mathbf{z})} $$
We know that the function \(f(x)=\log(x)\) is strictly concave, so according to Jensen's inequality we have:
$$ l(\theta) \geq \sum_{i=1}^{n} \sum_{\mathbf{z}} Q^{(i)}(\mathbf{z}) \log \frac{p_\theta(\mathbf{x}^{(i)}, \mathbf{z})}{Q^{(i)}(\mathbf{z})} \triangleq L(\theta, Q^{(i)}) $$
The equality holds if and only if \(\frac{p_\theta(\mathbf{x}^{(i)},\mathbf{z})} {Q^{(i)}(\mathbf{z})}\) is a constant (with respect to the variable \(\mathbf{z}\)). To achieve this we can set \(Q^{(i)}(\mathbf{z})= \frac{p_\theta(\mathbf{x}^{(i)},\mathbf{z})}{\sum_{\mathbf{z}}p_\theta(\mathbf{x}^{(i)},\mathbf{z})}= \frac{ p_\theta(\mathbf{x}^{(i)},\mathbf{z}) }{p_\theta(\mathbf{x}^{(i)})}= p_\theta(\mathbf{z}\vert \mathbf{x}^{(i)})\). This means \(Q^{(i)}(\mathbf{z})\) is the posterior of \(\mathbf{z}\) given \(\mathbf{x}^{(i)}\).
People may ask: why do we try to find a proper \(Q^{(i)}(\mathbf{z})\) to make the equality hold?
Our initial goal is to find the MLE of \(\theta\) w.r.t. \(l(\theta)\). However, the original expression of \(l(\theta)\) is not explicit, so we take advantage of Jensen's inequality: by selecting a proper distribution \(Q^{(i)}(\mathbf{z})\), we make \(l(\theta)=L(\theta, Q^{(i)})\). Then we can instead maximize \(L(\theta, Q^{(i)})\) w.r.t. \(\theta\) to find the MLE in an iterative way.
Now, we can summarize the EM algorithm:
- Randomly initialize \(\theta\)
- Repeat, at the \(l^\text{th}\) iteration:
- E step: set \(Q_{l}^{(i)}(\mathbf{z})=p_{\theta_{l-1}}(\mathbf{z}\vert \mathbf{x}^{(i)})\) for \(i=1,\dots,n\)
- M step: update \(\theta_{l}=\arg \max_{\theta} L(\theta, Q_{l}^{(i)})\)
Until: \(\theta\) converges
In Andrew Ng's lecture notes, it is proven that the EM algorithm guarantees that \(l(\theta)\) is steadily increased. Suppose that after \(l-1\) iterations we have the log-likelihood \(l(\theta_{l-1})\). At the \(l^\text{th}\) iteration, after the E step, we have \(L(\theta_{l-1}, Q^{(i)}_l)= l(\theta_{l-1})\); after the M step, the updated \(\theta_{l}\) is selected such that \(L(\theta_l, Q^{(i)}_l)\geq L(\theta_{l-1}, Q^{(i)}_l)\). Then at the \((l+1)^\text{th}\) iteration, by selecting \(Q^{(i)}_{l+1}\) as the posterior of \(\mathbf{z}\), we have \(l(\theta_l)=L(\theta_l, Q^{(i)}_{l+1})\). Therefore, we have
$$ l(\theta_{l}) = L(\theta_l, Q^{(i)}_{l+1}) \geq L(\theta_l, Q^{(i)}_{l}) \geq L(\theta_{l-1}, Q^{(i)}_{l}) = l(\theta_{l-1}) $$
So \(l(\theta_{l})\geq l(\theta_{l-1})\). This guarantees that the overall log-likelihood can only increase or stay unchanged, never decrease.
This deduction shows that the EM algorithm heads in the right direction, but this direction may not be the ideal one. It is pretty obvious that if \(l(\theta)\) is globally concave, the EM algorithm always converges to the global optimum. If \(l(\theta)\) is not globally concave, the property \(l(\theta_l)\geq l(\theta_{l-1})\) guarantees that the EM algorithm converges to some point (assuming \(l(\theta)\) is not a delta function), but that point may not be the global optimum.
Moreover, the EM algorithm is sensitive to the initialization. Different initializations may result in quite different convergence points, as shown in the figure below.
As shown in this figure, if the initialization is at point \(A\), then EM will converge to point \(C_A\), while EM will converge to point \(C_B\) if the initialization is \(B\). Obviously \(C_B\) is the global optimum and \(C_A\) is not.
So how can we make the EM algorithm less sensitive to initialization and more likely to find the global optimum? One simple, straightforward but effective way is to randomly initialize the parameters, run the EM algorithm multiple times, and choose the parameters with the largest converged log-likelihood (objective function).
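A sketch of this restart strategy (the `run_em` callback, its `(params, log_likelihood)` return signature, and the toy objective below are all hypothetical, purely to illustrate the selection logic):

```python
import numpy as np

def best_of_restarts(run_em, X, n_restarts=10, seed=0):
    """Run EM from several random initializations and keep the run with the
    largest converged log-likelihood. run_em(X, rng) is assumed to return a
    (params, log_likelihood) pair."""
    rng = np.random.default_rng(seed)
    best_params, best_ll = None, -np.inf
    for _ in range(n_restarts):
        params, ll = run_em(X, rng)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll

# toy stand-in for an EM run: the "converged" log-likelihood depends on the init
def toy_run(X, rng):
    t = rng.uniform()
    return t, -(t - 0.7) ** 2

params, ll = best_of_restarts(toy_run, X=None, n_restarts=50)
```

With many restarts, the chosen run is the one whose random initialization landed closest to the best basin of attraction.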
Tossing two coins with different heads probability
Let’s recall the question raised in the first section:
Suppose we have two coins: A and B. The probabilities of heads for coins A and B are \(\theta_A\) and \(\theta_B\) respectively. We want to find the MLE of \(\theta_A, \theta_B\) using \(n\) observations \(\{\mathbf{x}^{(i)}\in \{0,1\}^d\},\ i=1,\dots,n\). Each observation has \(d\) dimensions, i.e., \(d\) tosses per observation. The challenging part is that for each observation \(\mathbf{x}^{(i)}\), we don’t know which coin it comes from. In this case, how do we find the MLE of \(\theta_A\) and \(\theta_B\)?
In this case, \(\mathbf{x}\) is related to a hidden variable \(z\). \(z\) can only take 2 values: \(z=A\) for coin \(A\) and \(z=B\) for coin \(B\). We now apply the EM algorithm to this case.
Note that the choice of the prior distribution of \(z\) strongly influences the final learned parameters. If the chosen prior is quite different from the real prior, the estimated parameters will be inaccurate. To solve this problem, we can update the prior by setting the prior of the current iteration to the posterior of the previous iteration, averaged over all observations. This is commonly used when data comes as a sequence.
Repeat:
- E step: for each observation, compute \(Q^{(i)}(A)=\frac{P(\mathbf{x}^{(i)}\vert z=A)P(z=A)}{P(\mathbf{x}^{(i)}\vert z=A)P(z=A)+P(\mathbf{x}^{(i)}\vert z=B)P(z=B)}\) and \(Q^{(i)}(B)=1-Q^{(i)}(A)\);
- M step: update \(\theta_A=\frac{\sum_i Q^{(i)}(A)\sum_{j=1}^{d} x_j^{(i)}}{d\sum_i Q^{(i)}(A)}\), and similarly for \(\theta_B\);

Until \(\theta_A,\theta_B\) converge.
To test the effectiveness of the EM algorithm, I wrote a small demo for the coin tossing problem:
import numpy as np

## Define a tossing function, to generate our observations
## theta is the heads probability; num is the number of tosses for a single observation
def tossing( theta, num ):
    return (np.random.uniform(size=num)<theta).astype(np.int32)

## load_data is used to generate a set of observations
## prior_coin_A is the prior of the hidden variable z;
## theta_A, theta_B are the heads probabilities of coin A and B respectively.
## this method returns a dataset X, without any explicit information about prior_coin_A, theta_A, theta_B
def load_data( num_samples, prior_coin_A = 0.8 , theta_A=0.2, theta_B = 0.7, num_tossing_per_sample = 5 ):
    X=[]
    for _ in range(num_samples):
        random_v = np.random.uniform()
        if random_v < prior_coin_A:
            ## generate a tossing observation using coin A
            X.append( tossing( theta_A, num_tossing_per_sample) )
        else:
            ## generate a tossing observation using coin B
            X.append( tossing( theta_B, num_tossing_per_sample ) )
    X = np.asarray(X)
    return X

## The task of EM is to find the MLE of theta_A, theta_B using only the obtained observations X
def EM( X, epsilon = 1e-8, update_prior = True , is_return_prior_list = False):
    ## initialization
    prior_coin_A = 0.5
    prior_coin_B = 1- prior_coin_A
    theta_A = np.random.uniform()
    theta_B = np.random.uniform()
    prior_coin_A_list=[prior_coin_A]
    prev_theta_A = theta_A
    prev_theta_B = theta_B
    while True:
        ## E step: compute the posterior Q(z) of each observation
        P_X_with_z_eq_A = theta_A**( np.sum(X, axis=1) )* (1-theta_A)**(np.sum( 1-X, axis=1 ))
        P_X_with_z_eq_B = theta_B**( np.sum(X, axis=1) )* (1-theta_B)**(np.sum( 1-X, axis=1 ))
        Q_A = P_X_with_z_eq_A*prior_coin_A/(P_X_with_z_eq_A*prior_coin_A+P_X_with_z_eq_B*prior_coin_B)
        Q_B = P_X_with_z_eq_B*prior_coin_B/(P_X_with_z_eq_A*prior_coin_A+P_X_with_z_eq_B*prior_coin_B)
        ## M step: update theta_A, theta_B
        theta_A = np.sum( Q_A * np.sum(X,axis=1))/np.sum( X.shape[1]*Q_A)
        theta_B = np.sum( Q_B * np.sum(X,axis=1))/np.sum( X.shape[1]*Q_B)
        if abs(theta_A- prev_theta_A) + abs(theta_B- prev_theta_B) < epsilon:
            break
        prev_theta_A = theta_A
        prev_theta_B = theta_B
        ## update the prior with the averaged posterior
        if update_prior:
            prior_coin_A = np.mean(Q_A)
            prior_coin_B = np.mean(Q_B)
        prior_coin_A_list.append(prior_coin_A)
    if is_return_prior_list:
        return theta_A, theta_B, {"prior_coin_A_list":np.array(prior_coin_A_list),"prior_coin_B_list":1-np.array(prior_coin_A_list)}
    else:
        return theta_A, theta_B
First, let’s load the coin tossing data. The true prior distribution of $z$ is $P(z=A)=0.7$ and $P(z=B)=0.3$. For coin A, the true heads probability is 0.2; for coin B, the true heads probability is 0.7. Each observation contains 10 tossing results.
true_prior_coin_A = 0.7
true_theta_A = 0.2
true_theta_B = 0.7
X = load_data(1000, prior_coin_A = true_prior_coin_A , theta_A=true_theta_A , theta_B = true_theta_B, num_tossing_per_sample = 10)
Let’s have a look at the loaded data (the first 10 observations):
print(X[:10])
[[0 1 1 1 1 1 1 1 1 0]
[1 1 1 1 1 1 1 1 0 1]
[0 1 1 0 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 1 0 1 0 0 0 0]
[0 0 0 1 1 0 0 1 1 1]
[0 0 0 0 0 0 0 0 1 0]
[0 1 0 0 1 0 0 1 1 1]
[1 1 0 1 0 1 0 0 0 1]
[0 0 0 0 0 0 0 1 0 1]]
EM algorithm with dynamically updated prior distribution of $z$
estimated_theta_A,estimated_theta_B, params = EM(X, update_prior=True, is_return_prior_list=True)
print("Estimated theta_A: %.4f, Estimated theta_B: %.4f"%( estimated_theta_A, estimated_theta_B))
Estimated theta_A: 0.1923, Estimated theta_B: 0.7053
Wow, the estimated theta_A is almost equal to the true theta_A (0.2), and the same holds for the estimated theta_B. Note that the EM output may sometimes be “Estimated theta_A: 0.7, Estimated theta_B: 0.2”. This is OK because EM doesn’t know that the estimated theta_A literally corresponds to coin A. It only knows that there are two coins, one with heads probability 0.7 and another one with 0.2.
EM algorithm with fixed prior distribution of $z$: $P(z=A)=0.5$ and $P(z=B)=0.5$.
estimated_theta_A,estimated_theta_B = EM(X, update_prior=False)
print("Estimated theta_A: %.4f, Estimated theta_B: %.4f"%( estimated_theta_A, estimated_theta_B))
Estimated theta_A: 0.6621, Estimated theta_B: 0.1762
From this result it’s obvious that if we use a fixed prior distribution of $z$ which is quite different from the true prior, the final estimate of the model parameters will be less accurate.
In fact, if we dynamically update the prior and check how the prior distribution changes, we will see that it gradually approaches the true prior. This can be shown by plotting the prior_coin_A_list variable:
import matplotlib.pyplot as plt
plt.plot( params["prior_coin_A_list"] )
plt.plot( np.ones_like( params["prior_coin_A_list"])*true_prior_coin_A )
plt.legend(["dynamically updated prior","true prior"])
plt.ylabel("$p(z=A)$")
plt.xlabel("iteration")
plt.show()

We can see that the prior gradually approaches the true prior, as expected. However, we also notice that there always remains some gap. This might be explained analytically; I will try to think about this in the future.
First of all, we need to implement the SVM solver based on the SMO algorithm.
import numpy as np
import matplotlib.pyplot as plt
import os
from keras.datasets import mnist
Using TensorFlow backend.
We define some auxiliary functions. The following two functions are related to the kernel function and the kernel matrix.
""" kernel part for SVM """
def kernel_func(x1,x2, kernel_type=None):
if kernel_type is None:
return np.dot( x1,x2)
elif kernel_type["name"]=="GAUSSIAN":
sigma = kernel_type["params"][0]
return np.exp(- np.dot( x1-x2, x1-x2 )/(2*sigma**2) )
def get_kernel_matrix( x1, x2, kernel_type=None ):
num_samples_x1 = x1.shape[0]
num_samples_x2 = x2.shape[0]
kernel_matrix = np.zeros([num_samples_x1, num_samples_x2])
for nrow in range(num_samples_x1 ):
for ncol in range(num_samples_x2 ):
kernel_matrix[nrow][ncol] = kernel_func(x1[nrow] , x2[ncol], kernel_type = kernel_type)
return kernel_matrix
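The double loop above is easy to read but slow on large datasets. For the Gaussian kernel, the whole matrix can also be computed in a vectorized way (a sketch; it should agree with the loop-based `get_kernel_matrix` up to floating-point error):

```python
import numpy as np

def gaussian_kernel_matrix(x1, x2, sigma):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    sq_dist = (np.sum(x1**2, axis=1)[:, None]
               + np.sum(x2**2, axis=1)[None, :]
               - 2.0 * x1 @ x2.T)
    # clamp tiny negative values caused by floating-point cancellation
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))
```

This replaces the O(n^2) Python-level loop with a handful of BLAS-backed array operations.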
Then we implement the SVM solver itself, which is encapsulated in a class.
"""
Description: A SVM solver
Input: training dataset (x,y), together with other hype-parameters
Return: a trained SVM model (solver) which is able to perform classification for a give x
"""
class SVM_Solver:
    def __init__(self, kernel_type=None , C=10):
        self.support_ind = None
        self.support_x = None
        self.support_y = None
        self.support_lamb = None
        self.kernel_type= kernel_type
        self.C = C
        self.count = 0
        self.objective_func = -np.inf
        self.lamb = None
        self.param_b = None

    ## This is the trained SVM predictor
    def predict(self,x, decision_mode = "hard"):
        def decision_func(z):
            if decision_mode == "soft":
                if z<-1:
                    return -1
                elif z>1:
                    return 1
                else:
                    return z
            elif decision_mode == "hard":
                if z<0:
                    return -1
                else:
                    return 1
        K = get_kernel_matrix(self.support_x, x, kernel_type = self.kernel_type )
        pred_y = []
        for ind in range(x.shape[0]):
            z= np.dot( self.support_lamb* self.support_y, K[:,ind] ) + self.param_b
            pred_y.append(decision_func(z))
        return np.array(pred_y)

    """Train the SVM model using x, y and the validation set x_val, y_val.
    max_iter is the maximum number of iterations;
    epsilon is used to determine when the training terminates -- when the change of the
    dual objective function is less than epsilon.
    """
    def train( self, x, y, x_val, y_val, max_iter= 1E6, epsilon= 1E-4 ):
        num_samples = x.shape[0]
        """Solve the dual problem using SMO"""
        ## Initialization
        K=get_kernel_matrix(x,x, kernel_type = self.kernel_type )
        C = self.C
        if self.lamb is None:
            self.lamb = np.zeros(num_samples)
        if self.param_b is None:
            self.param_b = np.random.normal()
        ## looping parameters:
        local_count =0
        ## Here is the main part of the SMO algorithm
        while True:
            ## randomly select a pair (a,b) to optimize
            [a,b] = np.random.choice( num_samples, 2, replace= False )
            if K[a,a] + K[b,b] - 2*K[a,b] ==0:
                continue
            lamb_a_old = self.lamb[a]
            lamb_b_old = self.lamb[b]
            Ea = np.dot(self.lamb * y, K[:,a]) + self.param_b - y[a]
            Eb = np.dot(self.lamb * y, K[:,b]) + self.param_b - y[b]
            lamb_a_new_unclip = lamb_a_old + y[a] *(Eb-Ea)/( K[a,a] + K[b,b] - 2*K[a,b] )
            xi = - lamb_a_old * y[a] - lamb_b_old * y[b]
            if y[a] != y[b]:
                L = max( xi * y[b], 0 )
                H = min( C+xi*y[b], C )
            else:
                L = max( 0, -C-xi*y[b])
                H = min( C, -xi*y[b] )
            if lamb_a_new_unclip < L:
                lamb_a_new = L
            elif lamb_a_new_unclip > H:
                lamb_a_new = H
            else:
                lamb_a_new = lamb_a_new_unclip
            lamb_b_new = lamb_b_old + ( lamb_a_old - lamb_a_new )*y[a] * y[b]
            if lamb_a_new >0 and lamb_a_new <C:
                self.param_b = self.param_b - Ea + ( lamb_a_old- lamb_a_new)*y[a]*K[a,a] + (lamb_b_old - lamb_b_new)*y[b] * K[b,a]
            elif lamb_b_new >0 and lamb_b_new <C:
                self.param_b = self.param_b - Eb + ( lamb_a_old- lamb_a_new)*y[a]*K[a,b] + (lamb_b_old - lamb_b_new)*y[b] * K[b,b]
            self.lamb[a] = lamb_a_new
            self.lamb[b] = lamb_b_new
            self.count +=1
            local_count +=1
            """Every 10000 iterations, record the current progress of the training
            and determine whether to stop.
            """
            if local_count >= max_iter or self.count % 10000 ==0:
                ## get the support set
                self.support_ind = self.lamb > 0
                self.support_x = x[self.support_ind]
                self.support_y = y[self.support_ind]
                self.support_lamb = self.lamb[self.support_ind]
                ## Evaluate the performance (accuracy) on the training and validation sets
                pred_y=self.predict(x)
                train_acc = np.sum( pred_y == y)/ y.shape[0]
                pred_y=self.predict(x_val)
                val_acc = np.sum( pred_y == y_val )/ y_val.shape[0]
                support_K = K[ self.support_ind,: ][:, self.support_ind]
                new_objective_func = np.sum( self.support_lamb ) - 0.5 * np.dot( np.matmul( ( self.support_lamb *self.support_y ).T, support_K ).T , self.support_lamb* self.support_y )
                ## support ratio represents the percentage of the points which are support vectors
                support_ratio = np.sum( self.support_ind )/ self.support_ind.shape[0]
                print("Iteration: %d, \tTrain accuracy: %.2f%%, \tVal accuracy: %.2f%%, \tDelta Objective Function: %f, \tSupport Ratio: %.2f%%"%(self.count, train_acc*100, val_acc*100, new_objective_func - self.objective_func, support_ratio *100 ))
                ## If the change of the dual objective function is less than epsilon, stop training
                if abs( new_objective_func - self.objective_func ) <= epsilon:
                    break
                else:
                    self.objective_func = new_objective_func
                if local_count >= max_iter:
                    break
We define some auxiliary functions to compute the distance matrix (used to estimate the sigma for the Gaussian kernel), generate folders, and plot the results.
def distance_matrix( x,y, metric = "Euclidean" ):
    def distance( a,b ):
        if metric == "Euclidean":
            return np.linalg.norm(a-b)
    n_row = x.shape[0]
    n_col = y.shape[0]
    dis_matrix = np.zeros([n_row, n_col] )
    for r in range( n_row ):
        for c in range(n_col ):
            dis_matrix[r][c] = distance( x[r], y[c])
    return dis_matrix

def generate_folder(path):
    if not os.path.exists(path):
        os.makedirs(path)
    return path
# plot_results is used to plot the results on the training dataset, e.g., what the separating hyperplane looks
# like, how the support vectors are distributed, and whether the points are correctly classified
def plot_results( x,y, support_ind, pred_y, title = "", img_save_path = None , show_img = True ):
    fig, ax = plt.subplots()
    x_low_dim = x[:,:2]
    x_support = x[support_ind]
    y_support = y[support_ind]
    pred_y_support = pred_y[support_ind]
    x_support_low_dim = x_low_dim[support_ind]
    for ind in range(x.shape[0]):
        if y[ind] == 1:
            mshape = "^"
        else:
            mshape = "o"
        if pred_y[ind] == 1:
            color = "r"
        else:
            color = "b"
        plt.plot(x_low_dim[ind,0], x_low_dim[ind,1], mshape, c= color, markerfacecolor='none', markeredgewidth=0.4, markersize =4)
    for ind in range(x_support.shape[0]):
        if y_support[ind] == 1:
            mshape = "^"
        else:
            mshape = "o"
        if pred_y_support[ind] == 1:
            color = "r"
        else:
            color = "b"
        plt.plot(x_support_low_dim[ind,0], x_support_low_dim[ind,1], mshape, c= color, markersize =4)
    for ind in range(x.shape[0]):
        if y[ind]!= pred_y[ind]:
            plt.plot(x_low_dim[ind,0], x_low_dim[ind,1], "o", c= "g", markersize =9, markerfacecolor='none')
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim([min(x_low_dim[:,0])-0.5, max(x_low_dim[:,0])+0.5 ])
    plt.ylim([min(x_low_dim[:,1])-0.5, max(x_low_dim[:,1])+0.5 ])
    plt.title(title)
    if img_save_path is not None:
        plt.savefig( img_save_path )
    if show_img:
        plt.show()
    plt.close()
First we load the data
def load_data(num_samples = 1000):
    x1 = []
    x2 = []
    for _ in range(num_samples):
        while True:
            r_x = np.random.multivariate_normal( [0,1], [[20,0],[0,1]], 1 )
            if r_x[0,1]>np.sin( r_x[0,0] )+0.5:
                x1.append( r_x )
                break
        while True:
            r_x = np.random.multivariate_normal( [0,-1], [[20,0],[0,1]], 1 )
            if r_x[0,1]<np.sin( r_x[0,0] ):
                x2.append( r_x )
                break
    x1 = np.concatenate( x1, axis =0 )
    x2 = np.concatenate( x2, axis =0)
    y1 = np.ones([num_samples]) *-1
    y2 = np.ones([num_samples]) *1
    x = np.concatenate([x1,x2], axis =0)
    y = np.concatenate([y1,y2], axis =0)
    return x, y
What does this loaded data look like? Let’s load and plot it.
x,y = load_data(500)
x_val, y_val = load_data(100)
## x,y are used for training, and x_val, y_val are used for validation
x_pos = x[y==1]
x_neg = x[y==-1]
plt.plot( x_pos[:,0], x_pos[:,1], "^", markerfacecolor='none' )
plt.plot( x_neg[:,0], x_neg[:,1], "o", markerfacecolor='none' )
plt.show()

From the figure, we can see that the points of the two classes are clearly not linearly separable, so we need a kernel SVM; here we use the Gaussian kernel. Note that the Gaussian kernel has a parameter, sigma, which represents the standard deviation.
The SVM results are very sensitive to the selection of sigma.
In this experiment, I used an empirical way to estimate the sigma:
Given the training dataset $X$, we use the function distance_matrix($X$,$X$) to compute the pairwise distance matrix of the elements in $X$, and then take the average of the distances between point pairs. Moreover, we can use a factor, such as 0.5, to adjust the value of $\sigma$.
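A vectorized sketch of this heuristic (it computes the same quantity as `np.mean(distance_matrix(x, x)) * 0.5`, including the zero diagonal):

```python
import numpy as np

def estimate_sigma(x, factor=0.5):
    # mean pairwise Euclidean distance, scaled by an empirical factor
    sq = (np.sum(x**2, axis=1)[:, None]
          + np.sum(x**2, axis=1)[None, :]
          - 2.0 * x @ x.T)
    # clamp tiny negative values caused by floating-point cancellation
    return factor * np.mean(np.sqrt(np.maximum(sq, 0.0)))
```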
estimated_sigma = np.mean( distance_matrix( x,x ) ) * 0.5
print(estimated_sigma)
2.800322186496282
svm= SVM_Solver( kernel_type = {"name":"GAUSSIAN", "params":[estimated_sigma] } )
Then, we can train our SVM model!
svm.train(x,y, x_val, y_val, max_iter = 200000)
Iteration: 10000, Train accuracy: 99.10%, Val accuracy: 98.00%, Delta Objective Function: inf, Support Ratio: 19.40%
Iteration: 20000, Train accuracy: 99.80%, Val accuracy: 99.00%, Delta Objective Function: 51.337039, Support Ratio: 16.10%
Iteration: 30000, Train accuracy: 99.90%, Val accuracy: 99.50%, Delta Objective Function: 26.450561, Support Ratio: 14.00%
Iteration: 40000, Train accuracy: 99.40%, Val accuracy: 99.50%, Delta Objective Function: 10.510557, Support Ratio: 14.30%
Iteration: 50000, Train accuracy: 99.00%, Val accuracy: 98.00%, Delta Objective Function: 8.315844, Support Ratio: 13.70%
Iteration: 60000, Train accuracy: 99.00%, Val accuracy: 99.00%, Delta Objective Function: 6.098615, Support Ratio: 12.70%
Iteration: 70000, Train accuracy: 99.70%, Val accuracy: 99.00%, Delta Objective Function: 6.163077, Support Ratio: 12.20%
Iteration: 80000, Train accuracy: 100.00%, Val accuracy: 99.50%, Delta Objective Function: 4.421311, Support Ratio: 12.40%
Iteration: 90000, Train accuracy: 99.80%, Val accuracy: 99.00%, Delta Objective Function: 4.129242, Support Ratio: 11.40%
Iteration: 100000, Train accuracy: 99.80%, Val accuracy: 99.00%, Delta Objective Function: 1.332299, Support Ratio: 10.50%
Iteration: 110000, Train accuracy: 100.00%, Val accuracy: 100.00%, Delta Objective Function: 1.195627, Support Ratio: 10.20%
Iteration: 120000, Train accuracy: 99.90%, Val accuracy: 100.00%, Delta Objective Function: 2.248688, Support Ratio: 10.50%
Iteration: 130000, Train accuracy: 100.00%, Val accuracy: 100.00%, Delta Objective Function: 2.209730, Support Ratio: 9.80%
Iteration: 140000, Train accuracy: 100.00%, Val accuracy: 100.00%, Delta Objective Function: 0.971279, Support Ratio: 9.60%
Iteration: 150000, Train accuracy: 100.00%, Val accuracy: 99.50%, Delta Objective Function: 1.393107, Support Ratio: 9.50%
Iteration: 160000, Train accuracy: 99.90%, Val accuracy: 99.50%, Delta Objective Function: 0.346211, Support Ratio: 8.90%
Iteration: 170000, Train accuracy: 99.80%, Val accuracy: 99.50%, Delta Objective Function: 0.258947, Support Ratio: 8.70%
Iteration: 180000, Train accuracy: 100.00%, Val accuracy: 100.00%, Delta Objective Function: 0.207311, Support Ratio: 8.60%
Iteration: 190000, Train accuracy: 99.90%, Val accuracy: 99.00%, Delta Objective Function: 0.738704, Support Ratio: 8.30%
Iteration: 200000, Train accuracy: 100.00%, Val accuracy: 100.00%, Delta Objective Function: 0.676768, Support Ratio: 8.30%
OK, let’s have a look at the results by plotting them!
plot_results( x,y, svm.support_ind, pred_y= svm.predict(x), title = "",show_img = True )

Several conclusions can be drawn:
We can further evaluate the accuracy on test dataset:
x_test, y_test = load_data( 500 )
pred_y=svm.predict(x_test)
test_acc = np.sum( pred_y == y_test )/ y_test.shape[0]
print("Test Accuracy: %.2f%%"%(test_acc *100))
Test Accuracy: 99.50%
To further prove the effectiveness of the SVM model, let’s test it on a slightly more complex problem: distinguishing between digit “4” and digit “9” using SVM.
First of all, we need to load and prepare the data
## load data
(mnist_x, mnist_y), _ = mnist.load_data()
## extract the digit "4" images (positive 1) and digit "9" images (negative 1)
x_pos= mnist_x[mnist_y == 4]
y_pos= np.ones( x_pos.shape[0] )
x_neg= mnist_x[mnist_y == 9]
y_neg= np.ones( x_neg.shape[0] ) *(-1)
## Put both positive/negative samples together to get the train/val/test dataset
x = np.concatenate( [ x_pos, x_neg ], axis =0 )
x = np.reshape(x, [x.shape[0],-1] )/255 ## normalization
y = np.concatenate( [ y_pos, y_neg], axis =0 )
## randomly shuffle
random_indx = np.random.permutation( np.arange( x.shape[0] ) )
x = x[random_indx]
y = y[random_indx]
## get x,y x_val, y_val, x_test, y_test
x_val = x[:500]
y_val = y[:500]
x_test = x[500:1000]
y_test = y[500:1000]
x = x[1000:2000]
y = y[1000:2000]
Train the SVM model
## Estimate the value of sigma
sigma_mnist = np.mean( distance_matrix( x,x ) )*0.5
svm_mnist = SVM_Solver( kernel_type = {"name":"GAUSSIAN", "params":[sigma_mnist]} )
svm_mnist.train( x,y, x_val, y_val )
Iteration: 10000, Train accuracy: 98.70%, Val accuracy: 96.40%, Delta Objective Function: inf, Support Ratio: 34.50%
Iteration: 20000, Train accuracy: 98.60%, Val accuracy: 96.00%, Delta Objective Function: 92.534743, Support Ratio: 29.50%
Iteration: 30000, Train accuracy: 98.60%, Val accuracy: 95.80%, Delta Objective Function: 17.440160, Support Ratio: 27.20%
Iteration: 40000, Train accuracy: 98.80%, Val accuracy: 95.60%, Delta Objective Function: 5.375707, Support Ratio: 25.50%
Iteration: 50000, Train accuracy: 98.90%, Val accuracy: 95.40%, Delta Objective Function: 2.259709, Support Ratio: 24.40%
Iteration: 60000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 1.122565, Support Ratio: 23.70%
Iteration: 70000, Train accuracy: 98.70%, Val accuracy: 94.20%, Delta Objective Function: 0.762315, Support Ratio: 23.90%
Iteration: 80000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.425399, Support Ratio: 23.60%
Iteration: 90000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.291108, Support Ratio: 23.30%
Iteration: 100000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.258185, Support Ratio: 22.80%
Iteration: 110000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.179889, Support Ratio: 22.80%
Iteration: 120000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.140847, Support Ratio: 22.30%
Iteration: 130000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.071885, Support Ratio: 22.70%
Iteration: 140000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.037175, Support Ratio: 22.50%
Iteration: 150000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.056052, Support Ratio: 22.30%
Iteration: 160000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.032109, Support Ratio: 22.20%
Iteration: 170000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.021068, Support Ratio: 22.30%
Iteration: 180000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.020332, Support Ratio: 22.30%
Iteration: 190000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.019551, Support Ratio: 22.30%
Iteration: 200000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.009766, Support Ratio: 22.20%
Iteration: 210000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.010782, Support Ratio: 22.40%
Iteration: 220000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.011163, Support Ratio: 22.30%
Iteration: 230000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.009348, Support Ratio: 22.20%
Iteration: 240000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.006827, Support Ratio: 22.40%
Iteration: 250000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.004162, Support Ratio: 22.20%
Iteration: 260000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.003338, Support Ratio: 22.40%
Iteration: 270000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.002217, Support Ratio: 22.40%
Iteration: 280000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.001653, Support Ratio: 22.10%
Iteration: 290000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.001823, Support Ratio: 22.00%
Iteration: 300000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.001461, Support Ratio: 22.20%
Iteration: 310000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.001170, Support Ratio: 22.10%
Iteration: 320000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000930, Support Ratio: 22.00%
Iteration: 330000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000714, Support Ratio: 22.10%
Iteration: 340000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000511, Support Ratio: 22.10%
Iteration: 350000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000345, Support Ratio: 21.90%
Iteration: 360000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000353, Support Ratio: 22.00%
Iteration: 370000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000227, Support Ratio: 21.90%
Iteration: 380000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000179, Support Ratio: 22.10%
Iteration: 390000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000226, Support Ratio: 21.90%
Iteration: 400000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000133, Support Ratio: 22.00%
Iteration: 410000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000229, Support Ratio: 21.90%
Iteration: 420000, Train accuracy: 98.80%, Val accuracy: 95.40%, Delta Objective Function: 0.000100, Support Ratio: 21.90%
The Delta Objective Function is now less than epsilon = 1E-4, so the training is finished. Now we can test the classification results on the test dataset.
pred_y=svm_mnist.predict(x_test)
test_acc = np.sum( pred_y == y_test )/ y_test.shape[0]
print("Test Accuracy: %.2f%%"%(test_acc *100))
Test Accuracy: 95.60%
Moreover, we can also have a look at support_x to see what the support vectors look like:
support_x_pos = svm_mnist.support_x[ svm_mnist.support_y==1 ]
support_x_neg = svm_mnist.support_x[ svm_mnist.support_y==-1 ]
fig=plt.figure(figsize=(10, 2), dpi= 80, facecolor='w', edgecolor='k')
plt.gray()
for i in range( 10 ):
    plt.subplot(2,10,i+1)
    plt.imshow( np.reshape(support_x_pos[i,:],[28,28] ))
    plt.title("digit 4")
    plt.axis('off')
for i in range( 10 ):
    plt.subplot(2,10,i+11)
    plt.imshow( np.reshape(support_x_neg[i,:],[28,28] ))
    plt.title("digit 9")
    plt.axis('off')
plt.subplots_adjust(wspace=1, hspace=1)
plt.show()

From the results we can see that the support vectors are somewhat ambiguous to distinguish. E.g., the 9th digit 4 also looks like a digit 9, and the second digit 9 also looks like a digit 4. The SVM model is sensitive to such ambiguous samples and tends to use them as support vectors to determine the separating hyperplane.
Conclusion
In this long series, we mathematically showed the principle of SVM and several possible ways to solve the problem. We also showed the implementation and performance on some small but interesting examples. Hopefully this is somewhat helpful.
Dual Problem
We have introduced the gradient descent algorithm to solve the dual problem. However, computing the full gradient has a high time complexity and a large memory footprint, especially when the training dataset is large. In this post, I introduce an efficient, lightweight algorithm to solve the dual problem: Sequential Minimal Optimization (SMO).
The algorithm of SMO is:
Initialization: let \(\{\lambda_i\}, i=1,\dots,n\) be a set which satisfies the dual constraint.
Repeat:
- (1) heuristically select two \(\lambda_a, \lambda_b\), and set all the other \(\lambda_i (i\neq a,b)\) fixed;
- (2) optimize \(L(\lambda)\) with respect to \(\lambda_a, \lambda_b\);

Until: the KKT conditions are satisfied with certain accuracy.
The first question is about the initialization: how do we find a set \(\{\lambda_i\}\) which satisfies the dual constraints?
The answer is to simply set \(\lambda_i=0\) for \(i=1,\dots,n\).
Suppose that we have finished the initialization and pick a pair \(\lambda_a, \lambda_b\) to optimize while keeping \(\lambda_i (i\neq a,b)\) fixed; then we have
Moreover, according to the dual constraints, we have
So we have
\(L(\lambda)\) is concave with respect to \(\lambda_a\), since \(\frac{\partial^2{L}}{\partial{\lambda_a^2}}= -( K_{a,a} + K_{b,b} - 2K_{a,b} )=-(e_a - e_b)^T \mathbf{K} (e_a - e_b) \leq 0\) due to the fact that the kernel matrix \(\mathbf{K}\) is nonnegative definite (see the last post An Introduction to Support Vector Machines (SVM): kernel functions). Therefore, we can find the optimal value of \(\lambda_a\) which maximizes \(L(\lambda)\) by computing the gradient and setting it to 0.
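This non-negative definiteness is easy to sanity-check numerically for the Gaussian kernel (a quick numerical check, not a proof; the sample data below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 4))
# Gaussian kernel matrix with sigma = 1
sq = np.sum(x**2, axis=1)[:, None] + np.sum(x**2, axis=1)[None, :] - 2.0 * x @ x.T
K = np.exp(-np.maximum(sq, 0.0) / 2.0)
# every eigenvalue of a PSD matrix is >= 0 (up to floating-point error)
min_eig = np.linalg.eigvalsh(K).min()
```

A non-negative `min_eig` (up to rounding) confirms that the quadratic form above can never be positive, hence the concavity in \(\lambda_a\).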
By solving this equation, we will get the solution for \(\lambda_a^\star\):
The numerator is too complicated to compute directly since it contains too many terms. Next, we will show that we can actually compute \(\lambda_a^\text{new}, \lambda_b^\text{new}\) from the old \(\lambda_a^\text{old}, \lambda_b^\text{old}\).
Before updating the values of \(\lambda_a, \lambda_b\), we first use the old \(\lambda\) to perform classification on the data \(\mathbf{x}_ a, \mathbf{x}_ b\):
Based on the expressions of \(\hat{y}_a, \hat{y}_b\), we have the following equation:
We denote the prediction error \(E_i= \hat{y}_i - y_i\); then we have the expression of \(\lambda_a^\text{new}\):
Discussion: what if \(K_{a,a} +K_{b,b} - 2K_{a,b}=0\)? In this case \(L(\lambda)\) is a linear (first-degree) function of \(\lambda_a\); it is still concave, but the expression for \(\lambda_a^\text{new}\) is no longer meaningful, so we simply select another pair \((\lambda_a, \lambda_b)\) and repeat the computation above.
Note that the expression of \(\lambda_a^\text{new}\) above is not clipped, so for clarity we call it \(\lambda_a^\text{new, unclipped}\). It is not enough to only compute \(\lambda_a^\text{new, unclipped}\); we need to further clip it to the meaningful domain determined by the dual constraints. According to the dual constraints, each \(\lambda_i\) has a box constraint. So we have:
We know that \(y_i \in \{-1, +1\}\). Based on whether \(y_a = y_b\) or not, we can have the relationship between \(\lambda_a\) and \(\lambda_b\) with box constraints, shown in the figure below.
Relationship between \(\lambda_a\) and \(\lambda_b\) with box constraints.
According to the figure, we can get the lower bound \(L\) and upper bound \(H\) for a meaningful new value of \(\lambda_a\):
This \(\lambda_a^\text{new, clipped}\) is the final meaningful new value of \(\lambda_a\). For simplicity, in the following we use \(\lambda_a^\text{new}\) to refer to \(\lambda_a^\text{new, clipped}\).
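The clipping logic can be sketched as a small helper function (the name `clip_lambda_a` is hypothetical; it mirrors the case analysis of \(L\) and \(H\) above, with \(\xi=-\lambda_a^\text{old} y_a-\lambda_b^\text{old} y_b\)):

```python
def clip_lambda_a(lamb_a_unclipped, lamb_a_old, lamb_b_old, y_a, y_b, C):
    # box bounds L, H derived from the dual constraints
    xi = -lamb_a_old * y_a - lamb_b_old * y_b
    if y_a != y_b:
        L, H = max(xi * y_b, 0.0), min(C + xi * y_b, C)
    else:
        L, H = max(0.0, -C - xi * y_b), min(C, -xi * y_b)
    # clip the unconstrained maximizer into [L, H]
    return min(max(lamb_a_unclipped, L), H)
```

For example, with \(y_a=1, y_b=-1\), \(\lambda_a^\text{old}=1\), \(\lambda_b^\text{old}=0.5\), \(C=10\), the feasible interval is \([0.5, 10]\), so an unclipped value of 12 gets clipped to 10 and a value of 0.2 gets clipped to 0.5.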
After getting \(\lambda_a^\text{new}\), we need to compute \(\lambda_b^\text{new}\):
Now, we need to decide whether to update the value of \(b^\star\). If \(0<\lambda_a^\text{new}<C\), then \(\mathbf{x}_ a\) is a support vector located exactly on the margin. Therefore, we can update \(b^\text{new}\) as:
Otherwise, if \(0<\lambda_b^\text{new}<C\), we can update \(b^\text{new}\) as:
Note that if neither \(0<\lambda_a^\text{new}<C\) nor \(0<\lambda_b^\text{new}<C\) holds, we choose not to update \(b\).
Now, we have finished one single iteration in SMO.
Before we summarize the SMO algorithm, note that there are some implementation details (such as pre-computing the kernel matrix) that can improve the computational efficiency.
According to the deduction above, we have the pseudo-algorithm of SMO.
Initialization: \(\lambda_i=0\) for \(i=1,\dots,n\), \(b=0\), and pre-calculation of the Kernel matrix \(\mathbf{K}\)
Repeat:
heuristically (or randomly) select a pair \(\lambda_a^\text{old}\leftarrow \lambda_a,\ \lambda_b^\text{old}\leftarrow \lambda_b\);
if \(K_{a,a}+K_{b,b}-2K_{a,b}==0\):
continue
\(E_a = \sum_{i} \lambda_i y_i K_{i,a}+ b^\text{old} - y_a\)
\(E_b = \sum_{i}\lambda_i y_i K_{i,b}+ b^\text{old} - y_b\)
\(\lambda_a^\text{new, unclipped} = \lambda_a^\text{old} + \frac{ y_a (E_b - E_a)}{ K_{a,a} + K_{b,b} -2K_{a,b} }\)
\(\xi = -\lambda_a^\text{old} y_a - \lambda_b^\text{old} y_b\)
if \(y_a \neq y_b\):
\(L= \max( \xi y_b,0 ),\ H=\min(C+\xi y_b,C)\)
else:
\(L= \max( 0, -C-\xi y_b ),\ H=\min(C, -\xi y_b)\)
if \(\lambda_a^\text{new, unclipped} < L\):
\(\lambda_a^\text{new} = L\)
else if \(\lambda_a^\text{new, unclipped} > H\):
\(\lambda_a^\text{new} = H\)
else:
\(\lambda_a^\text{new} = \lambda_a^\text{new, unclipped}\)
\(\lambda_b^\text{new}=\lambda_b^\text{old}+(\lambda_a^\text{old}-\lambda_a^\text{new})y_a y_b\)
\(\lambda_a\leftarrow \lambda_a^\text{new},\ \lambda_b\leftarrow \lambda_b^\text{new}\)
if \(0<\lambda_a^\text{new}<C\):
\(b^\text{new}=b^\text{old}-E_a +(\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,a}+(\lambda_b^\text{old}-\lambda_b^\text{new})y_b K_{b,a}\)
else if \(0<\lambda_b^\text{new}<C\):
\(b^\text{new}=b^\text{old}-E_b +(\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,b}+(\lambda_b^\text{old}-\lambda_b^\text{new})y_b K_{b,b}\)
Until: Maximum iteration reached, or the dual objective function \(L(\lambda)\) is not further maximized with a certain accuracy.
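The loop above can be sketched in plain NumPy. This is a minimal illustration of the pseudo-algorithm only (random pair selection, no selection heuristics; the function and variable names are my own), assuming a precomputed kernel matrix `K` and labels `y` in \(\{-1,+1\}\):

```python
import numpy as np

def smo(K, y, C, max_iter=1000, seed=0):
    """Simplified SMO on a precomputed kernel matrix K (n x n), labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    lmbd = np.zeros(n)
    b = 0.0
    for _ in range(max_iter):
        a, j = rng.choice(n, size=2, replace=False)  # random pair (lambda_a, lambda_b)
        eta = K[a, a] + K[j, j] - 2 * K[a, j]
        if eta == 0:
            continue
        # prediction errors E_a, E_b of the current model on the chosen pair
        E_a = np.sum(lmbd * y * K[:, a]) + b - y[a]
        E_j = np.sum(lmbd * y * K[:, j]) + b - y[j]
        la_old, lj_old = lmbd[a], lmbd[j]
        la_new = la_old + y[a] * (E_j - E_a) / eta          # unclipped update
        xi = -la_old * y[a] - lj_old * y[j]
        if y[a] != y[j]:
            L, H = max(xi * y[j], 0.0), min(C + xi * y[j], C)
        else:
            L, H = max(0.0, -C - xi * y[j]), min(C, -xi * y[j])
        la_new = min(max(la_new, L), H)                     # clip into the box [L, H]
        lj_new = lj_old + (la_old - la_new) * y[a] * y[j]
        lmbd[a], lmbd[j] = la_new, lj_new
        # update b only when a clipped multiplier lies strictly inside (0, C)
        if 0 < la_new < C:
            b += -E_a + (la_old - la_new) * y[a] * K[a, a] \
                      + (lj_old - lj_new) * y[j] * K[j, a]
        elif 0 < lj_new < C:
            b += -E_j + (la_old - la_new) * y[a] * K[a, j] \
                      + (lj_old - lj_new) * y[j] * K[j, j]
    return lmbd, b
```

Note that the update \(\lambda_b^\text{new}=\lambda_b^\text{old}+(\lambda_a^\text{old}-\lambda_a^\text{new})y_a y_b\) preserves \(\sum_i \lambda_i y_i = 0\) exactly, so the equality constraint never needs to be enforced separately.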
Cool, isn’t it? Now we are able to solve the dual problem using the SMO algorithm!
Dual Problem
Suppose that we have solved the dual problem and get the dual optimum. Let \(S_w=\{ i \vert 0<\lambda_i^\star \leq C \}\) represent the support set related with \(\mathbf{w}\); \(S_b=\{ i \vert 0<\lambda_i^\star < C \}\) represent the support set related with \(b\). Meanwhile, we define \(S_b^+ =\{ i \vert i\in S_b \ \text{and}\ y_i = +1 \}\) and \(S_b^-=\{ i \vert i\in S_b\ \text{and}\ y_i = -1 \}\). Then we can compute the primal optimum:
Given a new point \(\mathbf{x}\), we can perform classification by computing:
According to the formulas above, we notice that in the dual problem, in the computation of \(\mathbf{w}^\star\) and in the classification of new points, \(\mathbf{x}_ i^T\mathbf{x}_ j\) always appears as a whole.
Mapping points to a higher dimensional space
In some cases, if the points are not linearly separable in the current space, they may become linearly separable if we map them into a higher-dimensional space.
We define \(\phi(\mathbf{x}): R^p \rightarrow R^d\ ,\ d>p\) as a mapping function which maps low-dimensional data to high-dimensional data. We can first map our data \(\mathbf{x}_ i \rightarrow \phi(\mathbf{x}_ i)\), then solve the dual problem:
We notice that in the dual problem, when computing \(\mathbf{w}^\star\) and performing classification, \(\phi(\mathbf{x}_ i)^T\phi(\mathbf{x}_ j)\) always appears as a whole. Therefore, we can avoid computing the exact form of \(\phi(\mathbf{x})\), and instead directly define a function for the inner product of two mapped points \(K: R^p \times R^p \rightarrow R\):
We call \(K(\mathbf{x}_i, \mathbf{x}_j)\) the kernel function.
What is a valid kernel function?
A kernel function \(K(\mathbf{x}_ i, \mathbf{x}_ j)\) is valid if there exists a mapping function \(\phi\), such that \(K_{i,j} = \langle\phi(\mathbf{x}_ i), \phi(\mathbf{x}_ j)\rangle\) holds for any \(\mathbf{x}_ i, \mathbf{x}_ j\in R^p\).
Moreover, there is an equivalent conclusion on the validity of a kernel function.
A kernel function \(K(\mathbf{x}_ i, \mathbf{x}_ j)\) is valid if for any \(n\) samples \(\{ \mathbf{x}_ i \vert \mathbf{x}_ i \in R^p \}, i=1,\dots, n\), the kernel matrix \(\mathbf{K}=\begin{bmatrix}K_{1,1}, \dots, K_{1,n}\\\dots \\ K_{n,1},\dots, K_{n,n} \end{bmatrix}\) is positive semi-definite (non-negative definite).
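This criterion can be checked numerically. Below is a small helper of my own (not from the posts) that tests whether a given kernel matrix is symmetric positive semi-definite via its eigenvalues:

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-8):
    # valid kernel matrices are symmetric with no negative eigenvalues (PSD)
    return bool(np.allclose(K, K.T) and np.min(np.linalg.eigvalsh(K)) >= -tol)
```

For example, any Gram matrix \(\mathbf{X}\mathbf{X}^T\) of the linear kernel passes this check, while \(\begin{bmatrix}0,1\\1,0\end{bmatrix}\) fails, since its eigenvalues are \(\pm 1\).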
Examples of Kernel functions
Polynomial kernel function
\[K(\mathbf{x}, \mathbf{y}) = ( \mathbf{x}^T\mathbf{y} +c )^d\]It can be proven that this function is equivalent to first mapping points to a higher-dimensional space and then computing the inner product.
Gaussian Kernel
\[K(\mathbf{x}, \mathbf{y}) = \exp\{ -\frac{ \|\mathbf{x}-\mathbf{y}\|^2 }{2{\epsilon}^2} \}\]Applying the Gaussian kernel is equivalent to first mapping points to an infinite-dimensional space and then computing the inner product. This can be understood via the Taylor expansion of the exponential function. For a detailed explanation, see the discussion “Why does the Gaussian kernel map the original dimensions to infinitely many dimensions?” (SVM中,高斯核为什么会把原始维度映射到无穷多维?)
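Both kernels can be evaluated for a whole dataset at once. The sketch below (function names and the vectorized squared-distance expansion \(\|\mathbf{x}-\mathbf{y}\|^2 = \|\mathbf{x}\|^2+\|\mathbf{y}\|^2-2\mathbf{x}^T\mathbf{y}\) are my own choices) computes the full kernel matrix between two sets of points:

```python
import numpy as np

def polynomial_kernel(X, Y, c=1.0, d=2):
    # K(x, y) = (x^T y + c)^d for every pair of rows in X and Y
    return (X @ Y.T + c) ** d

def gaussian_kernel(X, Y, eps=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 eps^2)), vectorized via the expansion
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x^T y
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * eps**2))
```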
With the definition of the kernel function, we can rewrite the dual problem and the classification task as follows.
Dual Problem
Suppose that we have solved the dual problem and get the dual optimum. Let \(S_w=\{ i \vert 0<\lambda_i^\star \leq C \}\) represent the support set related with \(\mathbf{w}\); \(S_b=\{ i \vert 0<\lambda_i^\star < C \}\) represent the support set related with \(b\). Meanwhile, we define \(S_b^+ =\{ i \vert i\in S_b \ \text{and}\ y_i = +1 \}\) and \(S_b^-=\{ i \vert i\in S_b\ \text{and}\ y_i = -1 \}\). Then we can compute the primal optimum:
Given a new point \(\mathbf{x}\), we can perform classification by computing:
See, in fact \(\mathbf{w}^\star\) is never really computed, since we are only interested in the kernel function!
We can solve the dual problem using the gradient descent algorithm introduced in the post An Introduction to Support Vector Machines (SVM): Dual problem solution using GD. Simply select a kernel function, such as polynomial or Gaussian, compute the kernel matrix \(\mathbf{K}\) for the training dataset, compute the gradient and then perform back propagation to get the dual optimum \(\lambda^\star\). After getting \(\lambda^\star\), we can compute the primal optimum \(b^\star\) and perform classification on new points using the equations above.
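As a small illustration of the classification step, the prediction can be written without ever forming \(\mathbf{w}^\star\) explicitly (a sketch; the function name and signature are my own):

```python
import numpy as np

def kernel_decision(x_new, X_train, y_train, lmbd, b, kernel):
    # f(x) = sum_i lambda_i y_i K(x_i, x) + b -- w* is never formed explicitly
    k = np.array([kernel(x_i, x_new) for x_i in X_train])
    return np.sign(np.sum(lmbd * y_train * k) + b)
```

In practice only the support vectors (those with \(\lambda_i > 0\)) contribute to the sum, so the loop can be restricted to them.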
In the next post, I will introduce how to solve the dual problem using Sequential Minimal Optimization (SMO).
This is the primal problem of the SVM in the case where points of two classes are linearly separable. Such a primal problem has two drawbacks:
To solve the problems above, we need to introduce a slack variable to the original SVM primal problem. This means that we allow certain (outlier) points to be within the margin or even cross the separating hyperplane, but such cases would be penalized. Now the primal problem of the “Slack-SVM” will be:
Primal Problem
Here \(\xi_i\) is the slack variable, and the positive \(C\) is the weight for the penalty term. Suppose that for some point \(\mathbf{x}_i\), it holds \(y_i(\mathbf{w}^T\mathbf{x}_i+b) = 1-\xi_i\):
It is possible to use the gradient descent algorithm to solve the primal problem. However, due to the slack variables, the constraints are much more complex than in the case without slack variables, and it is more difficult to define the loss function used for gradient descent. In contrast, the Lagrangian dual problem of this primal problem still remains compact and solvable, and can be easily extended to kernel SVM. Therefore, in the following we mainly discuss the deduction of the Lagrangian dual problem of the Slack-SVM primal problem.
Lagrangian Function
Lagrangian Dual function
To get the dual function, we can compute the derivative and set them to 0.
From these 3 equations we have
Substituting them into the Lagrangian function, we get the Lagrangian dual function:
Therefore, the Lagrangian dual problem is:
We can use \(\lambda_i\) to represent \(\mu_i\), and finally get the dual problem:
Dual Problem
Compared with the dual problem for the SVM without slack variables, the only difference is that here the constraints of \(\lambda\) are \(0\leq \lambda_i \leq C\), instead of \(\lambda_i \geq 0\).
Actually, we can think of the primal problem of the SVM without slack variables as having a hidden \(C=\infty\), which means that the penalty on slack variables is infinitely large, so all points need to satisfy \(y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1\).
Solution of the Dual Problem
Gradient Descent Algorithm The objective function for gradient descent is:
Sequential Minimal Optimization (SMO), which will be discussed in the following posts.
Discussion on the Karush-Kuhn-Tucker (KKT) conditions The KKT conditions are now slightly different, since now in the dual function there are actually two variables: \(\lambda\) and \(\mu\). For the primal optimum \(\mathbf{w}^\star, b^\star, \xi^\star\) and the dual optimum \(\lambda^\star, \mu^\star\), it holds:
The complementary slackness is interesting. Suppose that we have already found the primal optimum and the dual optimum. We can analyze the location of the point \(\mathbf{x}_ i\) based on the value of \(\lambda_i\):
Suppose that we have solved the dual problem and get the dual optimum. Let \(S_w=\{ i \vert 0<\lambda_i^\star \leq C \}\) represent the support set related with \(\mathbf{w}\); \(S_b=\{ i \vert 0<\lambda_i^\star < C \}\) represent the support set related with \(b\). Meanwhile, we define \(S_b^+ =\{ i \vert i\in S_b \ \text{and}\ y_i = +1 \}\) and \(S_b^-=\{ i \vert i\in S_b\ \text{and}\ y_i = -1 \}\). Then we can compute the primal optimum:
Multiple ways can be used to compute \(b^\star\):
We compare the separating hyperplane results between the SVM with slack variables (Slack-SVM for short) and the original SVM without slack variables (Original-SVM for short). The SVM models are trained by solving the Lagrangian dual problem using gradient descent algorithm introduced in the last post.
For further discussion, we recall the primal/dual problem of the Original-SVM and the primal/dual problem of the Slack-SVM:
Experiment 1.
Comparison of performance in the case where there are outliers but the points are still linearly separable. The Slack-SVM penalty term weight \(C=0.5\)
Experiment 2.
Analyzing the influence of different Slack-SVM penalty term weight \(C\).
As we increase the value of \(C\), the geodesic margin becomes wider. The outlier point is closer to the margin hyperplane geodesically. More points become support vectors.
To explain this we need to refer to the form of the Slack-SVM primal problem. When we increase \(C\), the penalty term \(C\sum_{i=1}^{n}\xi_i\) is more heavily weighted, so the model tends to reduce the values of \(\xi_i\). So how can \(\xi_i\) be reduced?
The answer is to reduce \(\|\mathbf{w}\|\). This may sound a little bit bizarre, but we can tell that from the figure Slack SVM over different penalty weight C.
For different values of \(C\), the location and rotation of the separating hyperplane remain similar, so the distance from the points to the separating hyperplane is similar. We know that for a point \(\mathbf{x}_ i\) which is within the margin or is located on the other side of the separating hyperplane, its geodesic distance to the separating hyperplane is \(\frac{\vert 1-\xi_i \vert}{\|\mathbf{w}\|}\). For the outlier points which cross the separating hyperplane, like the solid blue circle in the top right corner, the geodesic distance is \(\frac{\xi_i -1 }{\|\mathbf{w}\|}\).
For large \(C\), we need to reduce the large \(\xi_i\) of that outlier point while its geodesic distance remains unchanged, so the only possible solution is to reduce \(\|\mathbf{w}\|\). As a result, the geodesic margin \(\frac{1}{\|\mathbf{w}\|}\) will be increased. Therefore, the larger \(C\) is, the wider the margin area is.
Original SVM for linearly non-separable cases
We also notice that for \(C=100\) and \(C=10000\), the separating results are almost the same. This leads to another question: what if we set \(C=\infty\) and solve the dual problem of the Slack SVM?
If we set \(C=\infty\), the primal/dual problem of the Slack SVM is exactly the same as the primal/dual problem of the original SVM. This is the short proof:
Therefore, the above question is equivalent to ask: What if we apply the Original SVM to the linearly non-separable case?
The answer is that the separating results will be almost the same as the case \(C=10000\) in the figure Slack SVM over different penalty weight C. Why is the geodesic margin not further enlarged?
We showed that the Original-SVM is equivalent to setting \(C=\infty\) in the Slack-SVM. However, from the perspective of the dual problem, the effective value of \(C\) is actually determined by the upper bound of \(\lambda\). For example, if we set \(C=\infty\) but the actual upper bound of the trained \(\lambda\) is 10000, then the effective \(C\) is 10000. Therefore, when applying the Original-SVM to the linearly non-separable case, the final separating result is identical to the \(C=10000\) case.
| \(C\) | 10 | 100 | 10000 | \(\infty\) |
| --- | --- | --- | --- | --- |
| \(\max{\lambda}\) | 10 | 62.2 | 62.2 | 62.2 |
We can see that when \(C\) reaches 100, the maximum of \(\lambda\) usually reaches around 60. Therefore, keeping increasing \(C\) does not influence the separating results further. Note that as we continue training, the \(\max{\lambda}\) may further rise, but it can hardly reach the value of \(C\) if \(C\) is very large.
Dual Problem
In the last post we introduced how to apply Lagrangian duality to the SVM and how to get the primal optimum once we get the dual optimum. In this post we mainly discuss how to solve the dual problem and get the dual optimum.
To apply GD to SVM, we need to reformulate the objective function of the dual problem. Our new objective function will be:
where \(c>0\) is the weighting factor for the constraint \(\sum_{i=1}^{n}\lambda_i y_i = 0\). For the constraint \(\lambda_i\geq 0\), we can satisfy this constraint by clipping \(\lambda\) into the region \([0,\infty)\) after each back propagation during gradient descent.
Discussion: why not also put the constraints \(\lambda_i\geq 0\) also into the loss function by introducing an extra hinge loss term? Then the final loss function will be: \(\min_{\lambda}L(\lambda)=-\sum_{i=1}^{n}\lambda_i + \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + \frac{c}{2}(\sum_{i=1}^{n}\lambda_i y_i)^2 + d \sum_{i=1}^{n}\text{max}\{-\lambda_i,0\}\)
This is reasonable in theory but not so feasible in practice. It introduces one extra hyperparameter \(d\), and we would be lost in endlessly fine-tuning and balancing the hyperparameters \(c\) and \(d\). Test results also show that enforcing the constraint \(\lambda_i\geq 0\) by clipping is efficient, and this method also easily supports more general cases of SVM with penalty terms. This will be discussed later.
Based on the loss function, We can compute the gradient:
We define a function \(K(\mathbf{x}_i, \mathbf{x}_j)= \mathbf{x}_i^T\mathbf{x}_j\). To maintain consistency with future posts, we call this function the kernel function. Given a training dataset \(\{\mathbf{x}_i\}, i=1,\dots,n\), we can get a kernel matrix:
where \(K_{i,j}=K(\mathbf{x}_i, \mathbf{x}_j)\).
Then the gradient \(\frac{\partial{L}}{\partial{\lambda_i}}\) can be expressed by the kernel matrix:
where \(\mathbf{e}_i=[0,\dots,0,1,0,\dots,0]\), with the \(i^{th}\) element being 1 and all other elements being 0. The sign \(\lambda \circ \mathbf{y}\) represents the element-wise multiplication of the two vectors \(\lambda\) and \(\mathbf{y}\).
We can also write the expression of the gradient of \(L\) with respect to the whole vector \(\lambda\):
In practice, when we implement the gradient descent algorithm, we don’t need to compute \(\mathbf{K}\) in each iteration, since \(\mathbf{K}\) does not rely on \(\lambda\). Instead, we can simply compute \(\mathbf{K}\) before applying gradient descent and store it in the memory, and call it each time when computing the gradient.
Another implicit advantage of using such a kernel matrix expression is that such a definition can be extended into a broader definition of SVM – SVM with kernels, where we can give a more sophisticated definition to the kernel function \(K(\mathbf{x}_ i, \mathbf{x}_ j)\), instead of just vector dot product. But even in that case, the expression of the gradient still remains the same. We just simply pre-calculate the kernel matrix \(\mathbf{K}\) based on the new definition of kernel function, and then apply gradient descent algorithm to find the optimal solution. We will discuss kernel SVM in the future posts.
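Putting these pieces together, the dual gradient descent solver described above can be sketched as follows. This is a minimal illustration of my own (distinct from the full script referenced below): it takes the precomputed kernel matrix \(\mathbf{K}\) as input, performs descent on the penalized objective, and enforces \(\lambda_i \geq 0\) (and optionally \(\lambda_i \leq C\) for the Slack-SVM generalization) by clipping after each step:

```python
import numpy as np

def dual_gd(K, y, c=1.0, lr=0.01, iters=5000, C=np.inf):
    """Projected gradient descent on the dual objective
    L(lambda) = -sum(lambda) + 0.5*(lambda*y)^T K (lambda*y) + c/2*(lambda . y)^2."""
    lmbd = np.zeros(len(y))
    for _ in range(iters):
        v = lmbd * y                               # element-wise lambda o y
        grad = -1.0 + y * (K @ v) + c * np.dot(lmbd, y) * y
        lmbd = np.clip(lmbd - lr * grad, 0.0, C)   # gradient step + clipping
    return lmbd
```

Note that \(\mathbf{K}\) is computed once outside the loop; each iteration only costs the matrix-vector product \(\mathbf{K}(\lambda\circ\mathbf{y})\).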
Implementation and Experiments
I implemented the gradient descent algorithm to compute the dual optimum and used it to solve the original SVM optimization problem. The code is available in my github SupportVectorMachine/gd-dual-svm.py. The change of the hyperplane over iterations is shown in figure Hyperplane Over Iteration
In the above figure, the points with solid color are the support vectors. As the training goes on, more and more points are excluded from the support vector set; finally there are only 3 support vectors. The final separating hyperplane is obviously the optimal separating hyperplane with maximized margin.
One important feature of the Gradient Descent Algorithm is that in each iteration there is a matrix vector multiplication \(\mathbf{K}(\lambda \circ \mathbf{y})\), with a time complexity \(O(n^2)\). This might be computationally challenging if \(n\) is large.
Apart from the gradient descent method, there is another method called Sequential Minimal Optimization (SMO), which is a more efficient and specialized solution. We will discuss that in the following posts. Before we go further, I would like to introduce the SVM in more general cases.
To overcome these shortcomings, we can take advantage of the Lagrangian duality. First we convert original SVM optimization problem into a primal (convex) optimization problem, then we can get the Lagrangian dual problem. Luckily we can solve the dual problem based on KKT condition using more efficient methods.
First of all, we need to briefly introduce Lagrangian duality and Karush-Kuhn-Tucker (KKT) condition.
Primal Problem
A primal convex optimization problem has the following expression:
where \(f_i(\mathbf{x}) _{(i=0,1,\dots,n)}\) are convex, and \(h_j(\mathbf{x}) _{(j=1,\dots,p)}\) are linear (or affine).
We can get the Lagrangian function:
Since \(f_i(\mathbf{x})\) are convex, and \(h_j(\mathbf{x})\) are linear, \(L(\mathbf{x}, \mathbf{\lambda}, \mathbf{\mu})\) is also convex w.r.t \(\mathbf{x}\). Therefore, we can get the infimum of \(L(\mathbf{x}, \mathbf{\lambda}, \mathbf{\mu})\), which is called the Lagrangian dual function:
The difference between minimum and infimum:
- \(\min(S)\) means the smallest element in set \(S\);
- \(\inf(S)\) means the largest value which is less than or equal to any element in \(S\) (the greatest lower bound).
- In the case where the minimum value is reachable, infimum = minimum, e.g. \(S=\{\text{all natural numbers}\}\), then \(\inf(S) = \min(S) = 0\).
- In the case where the minimum is not reachable, the infimum may still exist, e.g. \(S=\{f(x)\vert f(x)=1/x, x>0\}\), \(\inf(S)=0\).
Dual Problem Based on the dual function we can get the dual optimization problem:
Strong Duality and Slater’s Condition
Let \(f_0^\star(x)\) and \(g^\star(\mathbf{\lambda},\mathbf{\mu})\) be the primal optimum and dual optimum respectively.
Weak duality means that
\(g^\star(\mathbf{\lambda},\mathbf{\mu}) \leq f_0^\star(x)\)
The difference \(f_0^\star(x)-g^\star(\mathbf{\lambda},\mathbf{\mu})\) is called duality gap.
Under certain circumstances, the duality gap can be 0, which means the strong duality holds. This condition is called Slater’s condition:
If Slater’s condition is satisfied, strong duality holds, and furthermore, for the optimal values \(\mathbf{x}^\star\), \(\mathbf{\lambda}^\star\) and \(\mathbf{\mu}^\star\), the Karush-Kuhn-Tucker (KKT) conditions also hold.
Karush-Kuhn-Tucker (KKT) Conditions
KKT conditions contain four conditions:
Therefore, if strong duality holds, we can first solve the dual problem and get the optimal \(\mathbf{\lambda}^\star\), \(\mathbf{\mu}^\star\). Then we can substitute the dual optimum into the KKT conditions (especially KKT condition 2) to get the primal optimum \(\mathbf{x}^\star\). Then the primal convex optimization problem can be solved.
Now we are able to solve the SVM optimization problem using Lagrangian duality. As introduced in the first post An Introduction to Support Vector Machines (SVM): Basics, the SVM optimization problem is:
The Lagrangian dual function is
To compute the Lagrangian dual function, we can compute the partial derivative of \(L\) w.r.t \(\mathbf{w},b\) and set them to 0 (see KKT condition 2)
Then we get
Substituting these two constraint equations into \(L(\mathbf{w},b,\mathbf{\lambda})\), we get the Lagrangian dual function:
Then the dual problem is:
We can solve this dual problem using Gradient descent algorithm or Sequential Minimal Optimization (SMO). This will be discussed in the next post.
Once we get the dual optimum \(\lambda^\star\), we can get the primal optimum \(\mathbf{w}^\star=\sum_{i=1}^{n} \lambda_i^\star y_i\mathbf{x}_ i\). But wait, how do we get the optimal \(b^\star\)? To further understand this, we need to analyze the KKT conditions for the SVM optimization problem.
Since the primal constraints \(1-y_i(\mathbf{w}^T\mathbf{x}_ i+b)\leq 0\) are obviously linear, Slater’s condition holds, strong duality holds, and the KKT conditions are satisfied for the primal optimum and dual optimum of the SVM. Therefore, we have the complementary slackness:
This looks interesting. From the dual constraints we know that \(\lambda^\star\geq 0\). Together with this complementary slackness, we know that if \(\lambda_i>0\), then it must hold that \(y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star)=1\). This means \(\mathbf{x}_i\) is exactly one of the support vectors (the points at exactly the margin distance from the separating hyperplane)!
Therefore, we find a way to identify support vectors using Lagrangian duality:
Let \(S=\{i\vert \lambda^\star_i > 0\}\) represent the support vector set, \(S_+=\{i\vert i\in S\ \text{and}\ y_i=+1\}\) represent the subset whose labels are \(+1\), and \(S_-=\{i\vert i\in S\ \text{and}\ y_i=-1 \}\) represent the subset whose labels are -1. Then the primal optimum will be:
Since we know that for support vectors \(\mathbf{x}_i,\ i\in S\), it holds that \(y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star)=1\), and \(y_i \in \{-1,+1\}\), we get \({\mathbf{w}^\star}^T\mathbf{x}_i + b^\star= y_i\). Therefore, the primal optimum of \(b\) is:
or
In practice, in order to avoid influence of noise, we may use a more stable way to compute \(b^\star\):
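One common stabilization, sketched below, is to average \(y_i-{\mathbf{w}^\star}^T\mathbf{x}_ i\) over all support vectors instead of relying on a single one. This is my own illustrative choice; the exact averaging formula used here may differ:

```python
import numpy as np

def compute_b(X, y, lmbd, tol=1e-6):
    # w* = sum_i lambda_i y_i x_i; then average y_i - w*.x_i over support vectors
    w = (lmbd * y) @ X
    S = lmbd > tol                # support vector set S = {i | lambda_i > 0}
    return float(np.mean(y[S] - X[S] @ w))
```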
Given a new point \(\mathbf{x}\), we can compute the value \({\mathbf{w}^\star}^T\mathbf{x}+b^\star\), and predict the label \(\hat{y}\) using hard decision or soft decision as shown in An Introduction to Support Vector Machines (SVM): Gradient Descent Solution. Substituting the expression of \({\mathbf{w}^\star}\), we have:
This implies that we only need the support vectors to determine the separating hyperplane and classify new points. Furthermore, we notice that either in the dual problem or in the classification, \(\mathbf{x}_i^T\mathbf{x}_j\) always appears as a whole. This feature can be used for Kernel SVM, which will be discussed in the following posts.
In the next post I will introduce how to solve the dual problem.
To solve this optimization problem, there are multiple ways. One way is to treat this problem as a standard optimization problem and use the gradient descent algorithm to compute the optimal parameters. Another way is to formulate the Lagrangian dual problem of the primal problem, transforming the original optimization problem into an easier one. Here we mainly discuss the first method.
To apply GD, we need to design a new objective function which is differentiable. The new objective function is:
This objective function contains two terms. The first term is used to maximize the margin. This term is also called regularization term. The second term is a penalty term used to penalize the case where \(y_i(\mathbf{w}^T\mathbf{x}_i+b)<1\), which represents incorrect/imperfect classification. Note that for the case \(y_i(\mathbf{w}^T\mathbf{x}_i+b)\geq 1\) we don’t need to penalize it, so we use a max function \(\max\{1-y_i(\mathbf{w}^T\mathbf{x}_i+b) ,0\}\). This is also called hinge loss.
\(\lambda\) is a weight parameter used to control the weight of the regularization term. If \(\lambda\) is too small, the model (the learned hyperplane) will mainly focus on correctly classifying the training data, but the margin may not be maximized. If \(\lambda\) is too large, the model will have a large margin, while there may exist more misclassified points in the training dataset.
Compute the gradient
To apply GD we also need to get the exact expression of the gradient.
where
The updating rules of the parameter \(\mathbf{w}\) and \(b\) are:
where \(\alpha\) is the learning rate.
Note that in practice, in each update loop we may not use the whole training dataset; instead we may use a mini-batch. Suppose that the mini-batch size is \(m\), then the expression of the gradient is:
In the following we will use this mini-batch style expression.
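The mini-batch update can be sketched as follows. This is a minimal illustration of my own, assuming the objective \(\lambda\|\mathbf{w}\|^2 + \frac{1}{m}\sum_i\max\{1-y_i(\mathbf{w}^T\mathbf{x}_i+b),0\}\) (the exact scaling of the regularization term here may differ); `lam` stands for \(\lambda\) to avoid clashing with Python's keyword:

```python
import numpy as np

def train_primal_svm(X, y, lam=1e-4, lr=0.1, batch=100, iters=2000, seed=0):
    """Mini-batch GD on: lam*||w||^2 + (1/m) * sum_i max(0, 1 - y_i(w.x_i + b))."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(iters):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        Xb, yb = X[idx], y[idx]
        viol = yb * (Xb @ w + b) < 1         # points violating the margin
        grad_w = 2 * lam * w - (yb[viol] @ Xb[viol]) / len(idx)
        grad_b = -np.sum(yb[viol]) / len(idx)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Only margin violators contribute to the hinge-loss gradient, which is exactly the max-function behavior described above.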
To test the GD algorithm, we use toy data shown in figure 2d toy data
Visualization of Hyperplane
In this part, we set \(\lambda=1e-4, \text{learning_rate}=0.1, \text{batch_size}=100, \text{maximum_iteration}=100000\). The change of the hyperplane over iterations is shown in figure Hyperplane Over Iteration
We can also test the influence of \(\lambda\) on the final results of the hyperplane, to check if our illustration on \(\lambda\) above is right or not. The results are shown in figure Influence Of Lambda.
We also noticed that when \(\lambda\) is extremely small, like 1e-5, the margin doesn’t shrink further. Actually, we tested that even with \(\lambda=0\) we still get the same ideal results, which implies that the regularization term in the loss function is useless in this toy example! This may be due to the fact that for such a simple dataset, it is very easy to find the optimal separating hyperplane and support vectors. Once the optimal separating hyperplane is found, the model sticks to it even if there is no regularization term in the loss function, since in this case the gradient is 0 and the training effectively stops.
Suppose that we have obtained the optimal \(\mathbf{w}^{\star}\) and \(b^{\star}\). Given a new input data point \(\mathbf{x}\), we can decide the label \(\hat{y}\) in two ways:
Hard Decision
\(\hat{y}=\begin{cases}
+1, & \text{if}\ {\mathbf{w}^{\star}}^T\mathbf{x} +b^{\star}\geq 0\\
-1, & \text{if}\ {\mathbf{w}^{\star}}^T\mathbf{x} +b^{\star} < 0\\
\end{cases}\)
Soft Decision
\(\hat{y} = d( {\mathbf{w}^{\star}}^T\mathbf{x} +b^{\star} )\)
where
\(d(z) = \begin{cases}
1, & \text{if}\ z \geq 1 \\
z, & \text{if}\ -1 \leq z < 1\\
-1, & \text{if}\ z < -1\\
\end{cases}\)
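The two decision rules above translate directly into code (a small sketch; the function names are my own):

```python
import numpy as np

def hard_decision(x, w, b):
    # +1 on or above the hyperplane, -1 below
    return 1 if w @ x + b >= 0 else -1

def soft_decision(x, w, b):
    # clip the raw score w.x + b into [-1, 1]
    return float(np.clip(w @ x + b, -1.0, 1.0))
```

The soft decision keeps the magnitude of the score inside the margin, which can be read as a confidence value.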
So that’s it. Now we are able to use GD to train an SVM model and use it for classification tasks. In the next post we will explore more possibilities of the solutions on SVM.