My notes on Andrew Ng's "Machine Learning Specialization" (MLS) and more
- Tools
- What is Machine Learning?
- Supervised Learning: Regression and Classification
- 3.1 Linear Regression
- 3.11 Cost Function
- 3.12 Gradient Descent
- 3.13 Multiple Features
- 3.14 Vectorization
- 3.15 The Normal Equation
- 3.16 Feature Scaling
- 3.17 Checking Gradient Convergence
- 3.18 Choosing Learning Rate
- 3.19 Feature Engineering
- 3.2 Logistic Regression
- 3.21 The Sigmoid Function
- 3.22 Decision Boundary
- 3.23 Cost Function for Logistic Regression
- 3.24 Gradient Descent for Logistic Regression
- 3.25 Overfitting
- 3.26 Regularization
- Advanced Learning Algorithms
- 4.1 Neural Networks
- 4.11 Recognizing Images
- 4.12 Neural Network Model
- 4.13 TensorFlow Implementation
- 4.14 Training Using TensorFlow
- 4.15 Activation Functions
- 4.16 Multiclass Classification
- 4.17 Multi-Label Classification
- 4.18 Advanced Optimization
- 4.19 Backpropagation
- 4.2 Advice for Applying ML
- 4.21 Debugging
- 4.22 Evaluation
- 4.23 Bias and Variance
- 4.24 Learning Curves
- 4.25 Iterative ML Development Loop
- 4.26 Data Engineering
- 4.27 Full Cycle of an ML Project
- 4.28 Fairness, Bias, and Ethics
- 4.29 Skewed Datasets
- 4.3 Decision Trees
- 4.31 Decision Tree Model
- 4.32 Learning Process
- 4.33 One-Hot Encoding
- 4.34 Regression Trees
- 4.35 Tree Ensembles
- 4.36 XGBoost
- 4.37 When to Use Decision Trees
- Beyond Supervised Learning
- 5.1 Unsupervised Learning
- 5.11 Clustering
- 5.12 Optimization Objective
- 5.13 Anomaly Detection
- 5.14 Anomaly Detection vs. Supervised Learning
- 5.2 Recommender Systems
- 5.21 Making Recommendations
- 5.22 Collaborative Filtering
- 5.23 Binary Labels
- 5.24 Mean Normalization
- 5.25 Content-Based Filtering
- 5.26 Principal Component Analysis
- 5.3 Reinforcement Learning
- 5.31 Return
- 5.32 Policy
- 5.33 State-Action Value Function
- 5.34 Random Environment
- 5.35 Continuous State
- 5.36 Deep-Q Reinforcement Learning
- 5.37 Mini-Batch and Soft Updates
- 5.38 State of RL
- Credits
- Language: Python
- Platform: Jupyter Notebook
- Libraries
- NumPy, scientific computing + lin algebra in Python
- Matplotlib, plotting data
- TensorFlow, machine learning package
- keras (integrated into TF 2.0), creates a simple, layer-centric interface to TF
- SciKit Learn, open source machine learning
- sklearn.linear_model.SGDRegressor: gradient descent regression model (performs best with normalized features)
- sklearn.preprocessing.StandardScaler: performs z-score normalization
- sklearn.linear_model: contains LogisticRegression + LinearRegression
- XGBoost, decision tree library
```python
from xgboost import XGBClassifier  # for classification
from xgboost import XGBRegressor   # for regression
```
- Pandas, data analysis and manipulation
Arthur Samuel, a pioneer in CS + AI in his time, defined machine learning as "...[the] field of study that gives computers the ability to learn without being explicitly programmed" (1959). He helped evolve AI by writing the first checkers-playing game that learned from thousands of games against itself.
Machine Learning Algorithms:
- Supervised Learning (*Used in most real-world applications/rapid advancements)
- Unsupervised Learning
- Recommender Systems
- Reinforcement Learning
(DEF) Supervised Learning: Learning via input (x) to output (y) mappings.
Key characteristic: Learns from being given examples and the "right answers"
Given input(s), the algorithm will make a "guess" and output based on what it had previously learned.
Some examples Andrew provided:
| Input (X) | Output (Y) | Application |
|---|---|---|
| email | spam? (0/1) | spam filtering |
| audio | text transcripts | speech recognition |
| ad, user info | click? (0/1) | online advertising |
| image, radar info | position of other cars | self-driving car |
Major Types of Supervised Learning
(DEF) Regression: Predict a number from infinitely many possible outputs
- Ex: Housing Market Prices (Sq. Feet to Market Value)
- Graphing technique utilizes a best-fitting line (linear, logarithmic, etc.)
- Models: Linear Regression
(DEF) Classification: Predict categories from a small number of possible outputs
- Ex: Breast Cancer Detection (Benign vs. Not-Benign)
- Terminology: Classes/Categories are often used interchangeably with the output
- Graphing technique utilizes a boundary line depending on input(s), separating one class/category from another
(DEF) Unsupervised Learning: Learning and structuring from data that only comes with inputs (x), but not output labels (y).
Key characteristic: Finds something interesting (patterns, structures, clusters, etc.) in unlabeled data -- we don't tell it what's right and wrong
Major Types of Unsupervised Learning:
(DEF) Clustering: Groups similar data points together
- Ex: Google News (matching headlines into a recommendation list)
(DEF) Anomaly Detection: Finds unusual data points
(DEF) Dimensionality Reduction: Compress data using fewer numbers
Andrew provided his thoughts on AGI in Part 2 of MLS in "Advanced Learning Algorithms". I wanted to memorialize them here and take inspiration for my growing worldview on AI and, more specifically, artificial general intelligence (AGI).
"...I think that the path to get there is not clear and could be very difficult. I don't know whether it would take us mere decades and whether we'll see breakthroughs within our lifetimes, or if it may take centuries or even longer to get there." - Andrew Ng
AI includes two very different things: artificial narrow intelligence (ANI) and artificial general intelligence (AGI)
ANI has made tremendous progress in recent decades, with examples like smart speakers (Siri, Alexa), self-driving cars, etc.: specific applications of intelligence
AGI would be doing anything a human can, but obviously, we are not there yet.
Can we mimic the human brain? No... we have (almost) no idea how the brain works. Every few years, new breakthroughs fundamentally change the way we perceive the brain.
But can we see some AGI breakthroughs in our lifetimes? There are glimmers of hope.
The "one learning algorithm" hypothesis: many parts of the brain can adjust to whatever data they are given and learn to function accordingly.
Andrew provides an example of "rewiring" the brain: if a part of it (e.g., the somatosensory cortex, the part that handles touch receptors) is instead fed data from a different sensor (e.g., the eyes, with images), that part of the brain learns to function differently (the somatosensory cortex learns to see).
Another is "seeing with your tongue", wherein a grid with varying voltages is applied to a tongue. By mapping a grayscale image to the voltage grid, one can "see" with their tongue even if they are blind
What if we can translate these to a computer? Multimodal developments are already happening... but Andrew stresses to avoid overhyping. No one knows what will happen, but with hard work and time, AGI can happen eventually.
(DEF) Linear Regression: Fits a best-fitting, straight (linear) line to your data
(DEF) Univariate Linear Regression: Fancy name for linear regression with one variable (single x)
(DEF) Training Set: Data used to train the model
- Notation:
  - $x$: features, the "input" variable
  - $y$: targets, the "output" variable
  - $m$: number of training examples
  - $f$: model, an equation obtained from training wherein we plug in $x$ to get $\hat{y}$
    - (EQUATION) $f_{w,b}(x) = wx + b$
      - Can also drop the subscript $w,b$ → $f(x) = wx + b$
      - For the $i^{th}$ example: $\hat{y}^{(i)} = wx^{(i)} + b$
    - $w$: parameter weight
    - $b$: parameter bias
  - $\hat{y}$: prediction for $y$
  - $(x, y)$: single training example (pair)
  - $(x^{(i)}, y^{(i)})$: $i^{th}$ training example, referring to the $i^{th}$ row (1st, 2nd, 3rd...)
    - NOTE: $x^{(i)}$ is not exponentiation; the superscript denotes the row
    - $(x^{(1)}, y^{(1)})$ refers to the 1st training example at row 1 of the training set

Process:
- Training Set → Learning Algorithm → Model $f$
- $x$ → $f$ → $\hat{y}$
  - Ex: size → Model $f$ → estimated price
Question: How do we measure how well the model fits the data?
Answer: Use a cost function
Squared Error Cost Function
- The most commonly used cost function for most regression-related models
-
(EQUATION)$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$ - NOTE: Andrew says that the reason we divide by 2 here is to make future calculations "neater"
The squared error cost function is not the only cost function that exists -- there are more.
The goal of regression is to minimize the cost function
- When we use random values of $w$, we can plot $w$ (x-axis) against $J(w)$ (y-axis), excluding $b$ for now to keep the example simpler. With this 2D plot, we can find the minimum $J(w)$ and use that $w$ in our model $f$.
- With both $w, b$, we get a 3D plot that looks like part of a hyperbolic paraboloid (or "hammock", "soup bowl", "curved dinner plate"). This plot has $w$ and $b$ as parameters/inputs on the bottom axes and $J(w,b)$ on the vertical axis.
  - This can also be accompanied by a contour (topographic) plot with $w$ on the x-axis and $b$ on the y-axis. The center of the contour plot (where the lines are "growing" from) is where $J(w,b)$ is minimized.
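As a quick sketch of how $J(w,b)$ could be computed with NumPy (hypothetical toy data, not from the course):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) for univariate linear regression."""
    m = x.shape[0]
    errors = (w * x + b) - y              # f_{w,b}(x^(i)) - y^(i) for all i at once
    return np.sum(errors ** 2) / (2 * m)  # the 1/(2m) factor from the definition

# Hypothetical data lying exactly on y = 2x + 1, so J(2, 1) should be 0
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([3.0, 5.0, 7.0])
print(compute_cost(x_train, y_train, 2.0, 1.0))  # 0.0
```

Evaluating this function over a grid of $w$ and $b$ values is exactly how the bowl-shaped surface and contour plots above are drawn.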
Now, how can we more easily find the minimum of $J(w,b)$? Gradient descent, one of the most important building blocks in machine learning, helps minimize any function.
Outline:
- Start with some $w, b$ (a common approach is to first set $w = 0$, $b = 0$)
- Keep changing $w, b$ to reduce $J(w,b)$, until we settle at or near a minimum
- NOTE: There may be more than one minimum in general, but the squared error cost function never has multiple local minima (only one global minimum)
Correct Implementation - Simultaneous Update
- $tmp\_w = w - \alpha\frac{d}{dw}J(w,b)$
- $tmp\_b = b - \alpha\frac{d}{db}J(w,b)$
- $w = tmp\_w$
- $b = tmp\_b$
- Repeat until convergence
- $\alpha$: learning rate (usually a small positive number between 0 and 1)
  - Highly important. If $\alpha$ is chosen poorly, gradient descent may not work at all.
  - Large $\alpha$: more aggressive descent
    - If too large, it is possible to miss/overshoot the minimum entirely and fail to converge → diverge
  - Small $\alpha$: less aggressive descent
    - If too small, the # of updates required grows significantly, but gradient descent will still work
- $\frac{d}{dw}J(w,b)$ and $\frac{d}{db}J(w,b)$: partial derivatives of the cost function (the gradient; this is where the "divide by two" from earlier comes in handy)
  - $\frac{d}{dw}J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)})-y^{(i)})x^{(i)}$
  - $\frac{d}{db}J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)})-y^{(i)})$
  - Positive $\frac{d}{dw}J(w,b)$ or $\frac{d}{db}J(w,b)$: $w$ or $b$ decreases slightly
  - Negative $\frac{d}{dw}J(w,b)$ or $\frac{d}{db}J(w,b)$: $w$ or $b$ increases slightly
If you are already at a local minimum, then further gradient descent steps will do nothing.
- Near a local minimum, the derivative becomes smaller → update steps become smaller
- Thus, gradient descent can reach the minimum without having to decrease the learning rate $\alpha$
"Batch" Gradient Descent
- "Batch": Each step of gradient descent uses all the training examples
- There are other gradient descent algorithms that look at just subsets of the training data at each step (e.g., stochastic or mini-batch gradient descent)
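The full batch loop can be sketched end to end in NumPy (hypothetical toy data; every iteration uses all $m$ examples, and $w, b$ are updated simultaneously):

```python
import numpy as np

def gradient_descent(x, y, alpha, iterations):
    """Batch gradient descent for univariate linear regression:
    each step computes the gradients over ALL m training examples."""
    m = x.shape[0]
    w, b = 0.0, 0.0                          # common starting point
    for _ in range(iterations):
        err = (w * x + b) - y                # f_{w,b}(x^(i)) - y^(i)
        dj_dw = np.sum(err * x) / m          # d/dw J(w, b)
        dj_db = np.sum(err) / m              # d/db J(w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
    return w, b

# Hypothetical data on the line y = 2x + 1; descent should recover w ~ 2, b ~ 1
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
w, b = gradient_descent(x_train, y_train, alpha=0.05, iterations=5000)
```

The tuple assignment `w, b = ...` is what makes the update simultaneous: both gradients are computed from the old $w, b$ before either parameter changes.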
Rather than use just one feature, we can use multiple features $x_1, x_2, ..., x_n$ as inputs to the model.
More Notation:
- $x_j$: $j^{th}$ feature
- $n$: number of features
- $\vec{x}^{(i)}$: features of the $i^{th}$ training example
  - For example, if we had three features, $\vec{x}^{(1)}$ may equal $[1416 \ 3 \ 2]$ (a row vector)
- $x_{j}^{(i)}$: value of feature $j$ in the $i^{th}$ training example
  - Using the same values as the previous example, $x_{1}^{(1)} = 1416$ (1-indexed)
Previously, a univariate linear regression model equation would be $f_{w,b}(x) = wx + b$.
A multiple linear regression equation would be (EQUATION) $f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b = w_1x_1 + w_2x_2 + ... + w_nx_n + b$
- NOTE: This is not the same as multivariate regression (that is a different thing)
- $\vec{w} = [w_1 \ w_2 \ w_3 ... w_n]$ (parameters of the model)
- $\vec{x} = [x_1 \ x_2 \ x_3 ... x_n]$ (feature vector)
- $b$: a scalar number
Using the dot product form, $f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$, we can implement the model with vectorization.
Behind the scenes:
- Without vectorization (e.g., a for loop): runs iteratively
- With vectorization: runs in parallel, which is MUCH more efficient (even more so with specialized hardware like GPUs with thousands of cores, e.g., CUDA)
- Efficient → scales to large datasets
To implement this in Python, we can use NumPy with arrays:
```python
import numpy as np

w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])
f = np.dot(w, x) + b  # vectorized: w · x + b in one call
```
This will run significantly faster than manually writing out each multiplication or looping over the features.
The cost function can be re-written to take the vector $\vec{w}$: $J(\vec{w}, b)$.
This can also be applied to the gradient descent algorithm (multiple linear reg. version)
- Repeat until convergence:
  - $w_j = w_j - \alpha\frac{d}{dw_j}J(\vec{w},b)$ for $j = 1..n$
  - $b = b - \alpha\frac{d}{db}J(\vec{w},b)$
In Python:
```python
import numpy as np

w = np.array([0.5, 1.3, 3.4])  # showing n = 3 for concreteness
d = np.array([0.3, 0.2, 0.4])  # partial derivatives of J with respect to each w_j
a = 0.1                        # alpha, the learning rate
w = w - a * d                  # updates every w_j in one vectorized step
```
- $\vec{w} = (w_1 \ w_2 ... w_n)$
- $\vec{d} = (d_1 \ d_2 ... d_n)$ (partial derivatives of the cost function with respect to each $w_j$; $b$ is updated separately as a scalar)
The normal equation only works for linear regression: it solves for $w, b$ directly, without iterations.
Disadvantages
- Does not generalize to other learning algorithms
- Slow when the number of features is large (> 10,000)
What you need to know
- The normal equation may be used in machine learning libraries that implement linear regression
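For intuition, the closed-form solution can be sketched with NumPy on hypothetical data (using `np.linalg.lstsq`, which solves the same least-squares problem more stably than inverting $X^TX$ by hand):

```python
import numpy as np

# Design matrix with a leading column of ones so the bias b is learned as theta[0]
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])   # exactly y = 2x + 1

# Normal equation: theta = (X^T X)^{-1} X^T y, solved here via least squares
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
b, w = theta[0], theta[1]       # b ~ 1.0, w ~ 2.0: no gradient descent needed
```

Note how one linear-algebra call replaces the entire iterative loop, which is exactly why it stops being practical once the matrix gets very large.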
Rescaling features so that they take on comparable ranges of values (for the examples below, assume a feature with range $300 \le x_1 \le 2000$).
In other words, it is a technique to make gradient descent run much faster.
Ways to feature scale:
- Divide the entire feature range by the maximum:
  - Given the range $300 \le x_1 \le 2000$, $x_{1,scaled} = \frac{x_1}{2000}$
  - Thus, the scaled range would be $0.15 \le x_1 \le 1$
  - The same will be done to $x_2$
- Mean Normalization:
  - Find the average of each feature, e.g., $\mu_1$ for $x_1$
  - Given the range $300 \le x_1 \le 2000$ with $\mu_1 = 600$: $x_{1,scaled} = \frac{x_1-\mu_1}{max-min} = \frac{x_1-600}{2000-300}$
  - Thus, the scaled range would be $-0.18 \le x_1 \le 0.82$
  - The same will be done to $x_2$
- Z-Score Normalization:
  - Find the standard deviation $\sigma$ and the average of each feature, e.g., $\sigma_1$ and $\mu_1$ for $x_1$
  - Given the range $300 \le x_1 \le 2000$ with $\mu_1 = 600, \sigma_1 = 450$: $x_{1,scaled} = \frac{x_1-\mu_1}{\sigma_1} = \frac{x_1-600}{450}$
  - Thus, the scaled range would be $-0.67 \le x_1 \le 3.1$
  - The same will be done to $x_2$
Tips for feature scaling:
- Aim for a range of about $-1 \le x_j \le 1$ for each feature $x_j$
  - Scalar multiples of this are fine as well, e.g., $-3 \le x_j \le 3$, as long as the bounds stay roughly between 0.1 and 10 in magnitude
  - Something like $-1000 \le x_1 \le 1000$ is too large → rescale
  - Something like $-0.001 \le x_1 \le 0.001$ is too small → rescale
- When in doubt, just rescale the feature
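A minimal z-score normalization sketch in NumPy, using a hypothetical feature matrix (this is the same transformation `sklearn.preprocessing.StandardScaler` performs):

```python
import numpy as np

def zscore_normalize(X):
    """Z-score normalize each column (feature) of X to mean 0, std 1."""
    mu = np.mean(X, axis=0)        # per-feature mean
    sigma = np.std(X, axis=0)      # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

# Hypothetical features on very different scales (e.g., sq. feet vs. bedrooms)
X = np.array([[300.0, 1.0],
              [600.0, 3.0],
              [2000.0, 5.0]])
X_norm, mu, sigma = zscore_normalize(X)
# Each column of X_norm now has mean 0 and standard deviation 1
```

Keeping `mu` and `sigma` around matters: predictions on new data must apply the same scaling that was fit on the training set.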
Objective: Verify that gradient descent is actually converging.
We can check the learning curve by plotting the # of iterations (x-axis) against $J(\vec{w},b)$ (y-axis):
- $J(\vec{w},b)$ should decrease after every iteration
- If the cost function does not decrease, then the chosen $\alpha$ was not viable
- The # of iterations needed will vary from application to application
Automatic Convergence Test:
- Let $\epsilon$ be $10^{-3}$
- If $J(\vec{w},b)$ decreases by $\le \epsilon$ in one iteration, declare convergence
- However, finding the right $\epsilon$ can be hard. Therefore, using the graph can be a more reliable method for determining convergence (look for $J(\vec{w},b)$ flattening out)
If the graph of # of iterations (x-axis) vs. $J(\vec{w},b)$ (y-axis) shows the cost increasing (or bouncing up and down), possible causes include:
- A bug in the code
- Learning rate $\alpha$ is too large (too small will just make the # of iterations larger)
How to tell if it is a bug or a learning rate problem: set the learning rate $\alpha$ to a very small number; if $J(\vec{w},b)$ still does not decrease on every iteration, there is a bug in the code.
Values of $\alpha$ to try: start small (e.g., 0.001) and increase roughly 3x at a time (0.003, 0.01, 0.03, ...), then pick the largest value that still steadily decreases the cost.
(DEF) Feature Engineering: Using intuition to design new features by transforming or combining original features.
Why? The choice of features can have a huge impact on a learning algorithm's performance. Thus, choosing the right features is critical to making an algorithm work well.
Example: Predicting the price of a house
- Assume we have an equation $f(x) = w_1x_1 + w_2x_2 + b$, where $x_1$ is the frontage width and $x_2$ is the depth.
- While this is fine, we can refine it by adding a feature $x_3 =$ area $=$ frontage $\times$ depth. The resulting equation could look like $f(x) = w_1x_1 + w_2x_2 + w_3x_3 + b$
- Note that we did not remove features but added one to the equation
Polynomial Regression
Using the ideas of multiple linear regression + feature engineering, we can develop a new algorithm called polynomial regression.
Important: As the features are being raised to some power, feature scaling becomes increasingly important. We can take a feature $x$ and create new features such as $x^2$ and $x^3$ (or even $\sqrt{x}$).
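A sketch of engineering polynomial features from a single feature $x$ (hypothetical values), which also shows why scaling then matters:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # original feature
X_poly = np.c_[x, x**2, x**3]             # engineered features: x, x^2, x^3

# The engineered columns now span very different ranges, so feature scaling
# (e.g., z-score) becomes important before running gradient descent
ranges = X_poly.max(axis=0) - X_poly.min(axis=0)   # [4, 24, 124]
```

Even with $x$ only reaching 5, the $x^3$ column already spans a range 31x wider than $x$ itself.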
Motivation: Since linear regression is not so good for classification-related problems, a logistic regression algorithm is widely used today. While "logistic regression" contains the word "regression", it is used more for classification.
Example of a classification task: Decide if an animal is a cat or not a cat
Example of a regression task: Estimate the weight of a cat based on its height
Why is linear regression bad for classification? This is because any single outlier may skew the linear regression model to where it may falsely identify a category (misclassification).
NOTE: Classes and categories are often used interchangeably here.
(DEF) Binary Classification: Output $y$ can only be one of two values (e.g., 0 or 1)
(DEF) Negative Class (0): Connotes "false" or "absence"; does not mean "bad"
(DEF) Positive Class (1): Connotes "true" or "presence"; does not mean "good"
(DEF) Decision Boundary: A "line" that separates classes/categories.
Also referred to as a "logistic function", it looks like an "S" on a 2D graph and outputs values between 0 and 1:
(EQUATION) $g(z) = \frac{1}{1+e^{-z}}$, where $0 < g(z) < 1$
- Notice that if $z$ is a big positive number, $g(z)$ approaches 1; if $z$ is a big negative number, $g(z)$ approaches 0.
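The sigmoid is one line in NumPy, and the limiting behavior noted above is easy to check:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); output always lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 (exactly halfway)
print(sigmoid(10.0))   # very close to 1 for large positive z
print(sigmoid(-10.0))  # very close to 0 for large negative z
```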
If we set $z = \vec{w} \cdot \vec{x} + b$ and pass it through the sigmoid, this becomes our logistic regression model:
(EQUATION) $f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}$
In essence, it outputs the "probability" that the class is 1
Example:
- $x$ is "tumor size"
- $y$ is 0 (not malignant) or 1 (malignant)
- $f(\vec{x}) = 0.7$ means that there is a 70% chance that $y$ is 1
Notation used in research: $f_{\vec{w},b}(\vec{x}) = P(y=1 \mid \vec{x}; \vec{w},b)$
- Translation: the probability that $y$ is 1, given input $\vec{x}$ and parameters $\vec{w}, b$
Now, given the probability, how do we decide at which probability would classify the input as 0 or 1?
A common choice is 0.5 as the threshold: Is $f_{\vec{w},b}(\vec{x}) \ge 0.5$?
- Yes: $\hat{y} = 1$
- No: $\hat{y} = 0$
- NOTE: The threshold does not always have to be 0.5; depending on the circumstances, it can be better to choose a lower threshold and risk a false positive rather than miss a true positive.
To find the decision boundary, we take $z = \vec{w} \cdot \vec{x} + b = 0$ and solve.
This can also be applied to non-linear decision boundaries where we apply polynomial regression to logistic regression.
Non-linear Decision Boundary Example:
- Set $z = w_1x_1^2 + w_2x_2^2 + b$, where $w_1, w_2 = 1$ and $b = -1$
- Thus $f(\vec{x}) = g(w_1x_1^2 + w_2x_2^2 + b)$, with the decision boundary at $z = 0$, i.e., $x_1^2 + x_2^2 = 1$
- $x_1^2 + x_2^2 = 1$ is conveniently a circle: inside the circle $\hat{y} = 0$, and outside $\hat{y} = 1$
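The circular boundary from this example can be checked directly (a sketch using the parameters from the example above, not a trained model):

```python
def predict(x1, x2):
    """Decision rule for z = x1^2 + x2^2 - 1: predict 1 when g(z) >= 0.5, i.e. z >= 0."""
    z = x1**2 + x2**2 - 1          # w1 = w2 = 1, b = -1 as in the example
    return 1 if z >= 0 else 0      # thresholding g(z) at 0.5 == thresholding z at 0

print(predict(0.1, 0.1))  # inside the unit circle -> 0
print(predict(2.0, 2.0))  # outside the unit circle -> 1
```

Note that the sigmoid never needs to be evaluated to find the boundary: $g(z) \ge 0.5$ exactly when $z \ge 0$.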
The squared error cost function we have been using for linear regression produces a convex (bowl-shaped) graph. If we apply it to logistic regression, the graph becomes non-convex (wiggly), with local "valleys" that would trap our gradient descent algorithm.
A cost function for logistic regression can be defined as such:
- Recall the squared error cost function: $J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$
- (EQUATION) Logistic Cost Function: $J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)})$
- (DEF) Logistic Loss Function: $L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)})$
  - Equals $-\log(f_{\vec{w},b}(\vec{x}^{(i)}))$ if $y^{(i)} = 1$
  - Equals $-\log(1 - f_{\vec{w},b}(\vec{x}^{(i)}))$ if $y^{(i)} = 0$
  - Loss is lowest when $f_{\vec{w},b}(\vec{x}^{(i)})$ predicts close to the true label $y^{(i)}$
  - If $y^{(i)} = 1$...
    - As $f_{\vec{w},b}(\vec{x}^{(i)})$ → 1, loss → 0 (good!)
    - As $f_{\vec{w},b}(\vec{x}^{(i)})$ → 0, loss → $\infty$ (bad)
  - Conversely, if $y^{(i)} = 0$...
    - As $f_{\vec{w},b}(\vec{x}^{(i)})$ → 0, loss → 0 (good!)
    - As $f_{\vec{w},b}(\vec{x}^{(i)})$ → 1, loss → $\infty$ (bad)
(EQUATION) Simplified Loss Function: $L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) = -y^{(i)}\log(f_{\vec{w},b}(\vec{x}^{(i)})) - (1-y^{(i)})\log(1 - f_{\vec{w},b}(\vec{x}^{(i)}))$
- Is completely equivalent to our previous loss function (one of the two terms always vanishes)
(EQUATION) Logistic Cost Function Using the Simplified Loss Function: $J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(f_{\vec{w},b}(\vec{x}^{(i)})) + (1-y^{(i)})\log(1 - f_{\vec{w},b}(\vec{x}^{(i)}))]$
Of course, this is not the only cost function. This cost function is derived from maximum likelihood estimation.
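The simplified cost function can be sketched in NumPy (hypothetical one-feature data and hand-picked parameters, just to exercise the formula):

```python
import numpy as np

def logistic_cost(X, y, w, b):
    """Logistic cost J(w, b) using the simplified loss function."""
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))           # sigmoid of the linear term
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)  # per-example simplified loss
    return np.mean(loss)                             # average over the m examples

# Hypothetical data: small x values are class 0, large x values are class 1
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
cost = logistic_cost(X, y, np.array([2.0]), -4.0)    # ~0.18 for these parameters
```

Because $y^{(i)}$ is always 0 or 1, exactly one of the two terms in `loss` is active per example, matching the piecewise definition above.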
Just like gradient descent for linear regression, the process remains the same:
- $tmp\_w_j = w_j - \alpha\frac{d}{dw_j}J(\vec{w},b)$
- $tmp\_b = b - \alpha\frac{d}{db}J(\vec{w},b)$
- $w_j = tmp\_w_j$
- $b = tmp\_b$
- Repeat until convergence (simultaneous updates)
The partial derivatives also remain the same:
$\frac{d}{d\vec{w}}J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(x^{(i)})-y^{(i)})x^{(i)}$ $\frac{d}{db}J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(x^{(i)})-y^{(i)})$
However, the only change is in the definition of $f$: it is now the sigmoid of $\vec{w} \cdot \vec{x} + b$ instead of a linear function.
Other techniques that can also be applied:
- Learning curve adjustments
- Feature scaling
- Vectorization
Both linear and logistic regression can work well for many tasks, but sometimes, in an application, the algorithm(s) can run into a problem called overfitting, which can cause it to perform poorly.
What does fitting mean? It refers to how well the best-fit line matches the data. If we use a straight line on a training set where a quadratic curve would fit better, then the line does not fit the training set very well (underfitting - high bias).
Another extreme scenario would be using some curve that may fit the training set perfectly where a simpler curve would suffice (to the point where the line is "wiggly"), and the cost is 0 on all points. If we use this model on an extraneous example, then it could predict the output completely wrong (overfitting - high variance).
(DEF) Underfitting (High Bias): Does not fit the training set that well as if the learning algorithm has some sort of strong preconception
(DEF) Generalization (Just Right): Fits training set pretty well (best fit)
(DEF) Overfitting (High Variance): Fits the training set extremely well, but new examples can result in highly variable predictions
How can we address overfitting?
Some options:
- Collect more training examples (the training algo will fit a curve that is "less wiggly" over time)
- Use fewer features
- If there are too many features and insufficient training data, the model may overfit
- As such, select the most relevant features (feature selection)
- A disadvantage with feature selection is that useful features could be lost, if all features are indeed useful
- (DEF) Regularization: Reduces the size of parameters $w_j$
  - Gently reduces the impact of some features without removing them completely
As stated in its definition, we reduce the size of parameters $w_j$ by penalizing large values of them in the cost function.
If we don't know what term to penalize, we can penalize all of them a bit.
Let's start by regularizing linear regression's cost function → $J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^n w_j^2$
- The symbol $\lambda$ is called the regularization parameter, where $\lambda > 0$
- $\frac{\lambda}{2m} \sum_{j=1}^n w_j^2$ is called the regularization term, which keeps $w_j$ small
- Increasing $\lambda$ tends to decrease the size of parameters $w_j$, while decreasing it lets them grow
- We divide $\lambda$ by $2m$ so the penalty scales the same way as the cost term as the training set grows, making it easier to choose a good value for $\lambda$
- Including parameter $b$ in the regularization makes little difference, so it is excluded
Regularizing Gradient Descent for Linear Regression:
- The process of gradient descent remains the same, but the partial derivatives change to include the new regularization term
  - $\frac{d}{dw_j}J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}w_j$
  - $\frac{d}{db}J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})$ (remains the same since $b$ is not regularized)
Regularizing Logistic Regression:
- Like the regularized cost function for linear regression, we just add a regularization term to logistic regression's cost function → $J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(f_{\vec{w},b}(\vec{x}^{(i)})) + (1-y^{(i)})\log(1 - f_{\vec{w},b}(\vec{x}^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^n w_j^2$
- The intent is still the same: prevent the size of some parameters $w_j$ from becoming too large
- Just like before, the gradient descent algorithm remains the same:
  - $\frac{d}{dw_j}J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}w_j$
  - $\frac{d}{db}J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})$ (remains the same since $b$ is not regularized)
- However, recall that the definition of $f$ changes (it is now the sigmoid)
This section will touch on:
- Neural Networks (Deep Learning)
- Inference (prediction)
- Training
- Practical advice for building ML systems
- Decision Trees
Origins: Ambition to develop algorithms that try to mimic the brain, but in modern NNs, we are shifting away from the idea of mimicking biological neurons
- Also known as deep learning
- Used in various applications today, from speech to images (CV) to text (NLP) and more
Why neural networks?
- As the amount of data grew across a wide range of applications, traditional algorithms like linear regression and logistic regression increasingly failed to scale their performance
- A small neural network trained on that same dataset as traditional AI would see some performance gains -- even more so as we scale up to medium to large-sized neural networks
- The "size" of a neural network depends on the number of artificial "neurons" it has
- NOTE: Weights + Biases are automatically determined during the training process and will eventually be updated iteratively through a process called backpropagation, which minimizes the loss function using optimization w/ algos like gradient descent (will be discussed more later)
Example: Demand Prediction
- (DEF) Activation: $a = g(z) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}$
  - Imagine this like a single neuron in the brain; it accepts an input $x$ and outputs $a$
  - Notice that the function is the same as logistic regression
  - Example: Take $x$ as the price, with $a$ being the probability of being a top seller (0 or 1)
Now, let's expand the number of features to price, shipping cost, marketing, and material to determine the probability of being a top seller.
We can "combine" some of these features into one neuron or several neurons in a layer:
- (DEF) Layer: A grouping of neurons that take as input the same or similar features and, in turn, output a few numbers together
  - Can have multiple neurons or a single one
- We can create the first layer as such:
- Price, Shipping Cost β Neuron (Affordability)
- Marketing β Neuron (Awareness)
- Price, Material β Neuron (Perceived Quality)
- Then, these feed into a second layer, which contains only a singular neuron:
- Affordability, Awareness, Perceived Quality β Neuron (Probability of Being a Top Seller)
- (DEF) Input Layer ($\vec{x}$): Layer 0, containing all the inputs/features we plug into the neural network
  - In the example, "Price", "Shipping Cost", "Marketing", and "Material" would all be in the input layer
- (DEF) Hidden Layer ($\vec{a}$): Intermediary layers between the input and output
  - In the example, this would include the layer containing "Affordability" and "Awareness"
  - There can be more than 1 hidden layer, and the number of neurons in each can be larger or smaller than the number of features
- (DEF) Output Layer ($a$): The final layer, which outputs our prediction
  - In the example, the second layer would be the output layer
- (DEF) "Activations": The outputs from neurons
  - In the example, "Affordability" and "Awareness" would be activations from the first layer
  - "Probability of Being a Top Seller" would be the activation from the final neuron
In reality, all neurons in one layer would be able to access all features from the previous layer
- For example, the neuron "Affordability" would take all inputs of "Price", "Shipping Cost", "Marketing", and "Material"
- We also do NOT define features or neurons in the hidden layer -- the neural network determines what it wants to use in the hidden layer(s), which is what makes it so powerful
The number of neurons in a hidden layer may be ambiguous, but a useful rule of thumb (obtained from StackExchange) is $N_h = \frac{N_s}{\alpha (N_i + N_o)}$, where:
- $N_i$ = number of input neurons
- $N_o$ = number of output neurons
- $N_s$ = number of training examples
- $\alpha$ = arbitrary scaling factor, usually between 2-10
Process:
Input Layer ($\vec{x}$) → Hidden Layer(s) ($\vec{a}$) → Output Layer ($a$) → prediction $\hat{y}$
(DEF) Multilayer Perceptron: Refers to neural networks with multiple hidden layers, often used in literature
The question when architecting a neural network: How many hidden layers and how many units per layer do we need?
One application of a neural network is in computer vision (CV), which takes as an input a picture and outputs what you may want to know, like maybe the identity of a profile picture.
How do you convert pictures to features?
- The picture, if 1000px x 1000px, is actually a 1000 x 1000 matrix of varying pixel intensity values, which range from 0-255
- If we were to "roll" up these values into a single vector $\vec{x}$, it would contain 1 million pixel intensity values
  - How to roll up a matrix? One way is to go left to right across a row, move down one row, then go right to left, continuing until you are done with the entire matrix
A possible process may look like this for facial recognition:
Input Picture
- HL 1 finds certain lines (looking at a small window)
- HL 2 groups these lines into certain facial features (looking at a bigger window)
- HL 3 aggregates these facial features into different faces (looking at an even bigger window)
- The Output Layer tries to determine the match probability of identity
A possible process may look like this for car identification:
Input Picture
- HL 1 finds certain lines (looking at a small window)
- HL 2 groups these lines into certain car features (looking at a bigger window)
- HL 3 aggregates these car features into different cars (looking at an even bigger window)
- The Output Layer tries to determine the match probability of the car
The fundamental building block of most modern neural networks is a layer of neurons
- Every layer takes in a vector of numbers, applies a set of logistic regression units to it, and outputs another vector of numbers (activations). These become the input to subsequent layers, up to the final/output layer, whose prediction we can then threshold.
- The number of layers counts all hidden layers plus the output layer (excluding the input layer) and is indexed from 1 (the input layer is layer 0)
A neural network layer comprises many neurons, each with its own weight
- By convention, the activations per layer are denoted $\vec{a}^{[i]}$, where $i$ is the index of the particular layer.
- $\vec{a}^{[1]}$ means the activations from layer 1, $\vec{a}^{[2]}$ means the activations from layer 2, etc.
- To further differentiate neurons' parameters across layers, we use the superscript $[i]$ again: $\vec{w}_j^{[i]}$ and $b_j^{[i]}$
The superscript notation continues into the individual scalar activations from each neuron: $a_j^{[i]}$.
For hidden layers after the first, the input is the activation vector from the previous layer:
- Ex: $a_1^{[3]} = g(\vec{w}_1^{[3]} \cdot \vec{a}^{[2]} + b_1^{[3]})$
A general equation for each activation of a neuron (the activation function) is:
(EQUATION) $a_j^{[l]} = g(\vec{w}_j^{[l]} \cdot \vec{a}^{[l-1]} + b_j^{[l]})$
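A from-scratch sketch of this equation for a tiny 2 → 3 → 1 network, using made-up (untrained) weights just to show the shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(a_in, W, b):
    """One layer: a_j = g(w_j . a_in + b_j). Each column of W is one neuron's w_j."""
    return sigmoid(a_in @ W + b)

x = np.array([0.5, -1.0])               # hypothetical input features a^[0]
W1 = np.array([[0.1, -0.2, 0.3],        # layer 1: 2 inputs -> 3 neurons
               [0.4, 0.1, -0.5]])
b1 = np.array([0.1, 0.2, 0.3])
W2 = np.array([[0.5], [-0.3], [0.2]])   # layer 2: 3 inputs -> 1 neuron
b2 = np.array([0.1])

a1 = dense(x, W1, b1)                   # activations a^[1], shape (3,)
a2 = dense(a1, W2, b2)                  # output activation a^[2], shape (1,)
yhat = 1 if a2[0] >= 0.5 else 0         # threshold the final probability
```

Chaining `dense` calls like this is exactly forward propagation; a framework like TensorFlow just manages the weight matrices and the loop for you.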
(DEF) Inference: Using a trained model to make predictions or classifications on new, unseen data
A popular example of inference is handwritten digit recognition.
(DEF) Forward Propagation: The process of moving left to right through the network: Input Layer → HL1 → HL2 → HL... → Output Layer
(DEF) Tensor: TensorFlow's data type for arrays of numbers; think of it as a matrix
With TensorFlow + NumPy, we can build these layers:
Ex: x → HL1 (3 Neurons) → Output Layer → Prediction w/ Threshold
import numpy as np
from tensorflow.keras.layers import Dense

x = np.array([[200.0, 17.0]])  # 1 x 2 matrix: one example, two features
layer_1 = Dense(units=3, activation="sigmoid") # Dense is another name for a layer, units = num. of neurons
a1 = layer_1(x) # apply layer_1 to x, where a1 is the activations
### OUTPUT tf.Tensor([[0.2 0.7 0.3]], shape=(1, 3), dtype=float32)
# Translation: 1 x 3 matrix (shape) with type float
# Can use a1.numpy() to convert to a NumPy matrix
layer_2 = Dense(units=1, activation="sigmoid")
a2 = layer_2(a1) # apply layer_2 to the a1 activations from HL1
### OUTPUT tf.Tensor([[0.8]], shape=(1, 1), dtype=float32)
# Translation: 1 x 1 matrix (shape) with type float
# The number 0.8 is the activation/prediction from the output layer
# apply threshold (given 0.5)
if a2 >= 0.5:
    yhat = 1
else:
    yhat = 0
Note about NumPy Arrays:
- In np.array([[...]]), every [...] is a row, separated by commas; inside each [...], each number belongs to a specific column. NOTE: it is double square brackets (one set to enclose everything, another per row)
- For example, np.array([[1, 2]]) is a 1 x 2 row matrix (commonly used for applications like describing an example's features)
- np.array([[1], [2]]) is a 2 x 1 column matrix
- A 2 x 2 matrix is written np.array([[1, 2], [4, 5]])
Structuring an inference with a Neural Network using TensorFlow:
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

layer_1 = Dense(units=3, activation="sigmoid") # Don't need to explicitly reference
layer_2 = Dense(units=1, activation="sigmoid") # Don't need to explicitly reference
model = Sequential([Dense(units=3, activation="sigmoid"),
                    Dense(units=1, activation="sigmoid")])
# sequentially strings together these two layers to build a NN
# by convention, we skip the layer declarations and just use Dense() directly
model.compile(...)
x = np.array([[200.0, 17.0],
              [120.0, 5.0],
              [425.0, 20.0],
              [212.0, 18.0]]) # 4 x 2 matrix
y = np.array([1, 0, 0, 1]) # targets
model.fit(x, y) # tells TF to sequentially string the layers and train on x, y
model.predict(x_new) # outputs predictions for a new dataset x_new
Given a training set (X, y), model.fit(X, y) trains the model for a set number of epochs.
(DEF) Epochs: The number of full passes through the training data during training (loosely, how many rounds of gradient descent to run)
Example Code:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
# Step 1: Specifies model telling TF how to compute for the inference
model = Sequential([Dense(units=25, activation="sigmoid"),
Dense(units=15, activation="sigmoid"),
Dense(units=1, activation="sigmoid")])
# Step 2: Compiles the model using a specific loss function
from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy()) # example loss function "binary cross entropy"
# Step 3: Trains the model
model.fit(X,Y, epochs=100)
Model Training Step Parallels between Traditional (TRAD) + Neural Networks (TF):
- Define Model: Specify how to compute the output given input $x$ and parameters $w, b$
  - Logistic Regression (TRAD):
    - z = np.dot(w, x) + b
    - f_x = 1 / (1 + np.exp(-z))
  - TF: model = Sequential([Dense(...), Dense(...), Dense(...)])
- Specify Loss and Cost Functions
  - Logistic Loss (TRAD): loss = -y * np.log(f_x) - (1 - y) * np.log(1 - f_x)
  - TF: model.compile(loss=BinaryCrossentropy())
- Train on data to minimize the cost $J(\vec{w}, b)$
  - TRAD: w = w - alpha * dj_dw and b = b - alpha * dj_db
  - TF: model.fit(X, Y, epochs=100) (.fit() uses Backpropagation)
(DEF) Binary Cross-entropy: Also known as logistic loss; "binary" re-emphasizes that the classification is either 0 or 1
NOTE: Binary cross-entropy is, of course, not the only loss function. For regression-related problems, we could use MeanSquaredError(), also imported from tensorflow.keras.losses
So far, we have been using the Sigmoid function ($g(z) = \frac{1}{1+e^{-z}}$) as our activation function.
However, there are more, and their usage depends on the application:
- (DEF/EQUATION) ReLU (Rectified Linear Unit): $g(z) = \max(0, z)$
  - NOTE: $g(z)$ can never be negative here, but $g(z)$ can grow toward $+\infty$ as $z$ does
- (DEF/EQUATION) Linear Activation Function: $g(z) = z$
  - As if $g$ was never there at all
To recap, these three are the most commonly used: Sigmoid, ReLU, and Linear AF
The next question is: how to choose an activation function to use?
For Binary Classification (y = 0/1): Use Sigmoid
For Regression where y can be positive or negative: Use Linear AF
For Regression where y >= 0: Use ReLU
In reality, ReLU is the most common. Why?
- Faster to compute and learn
- $\frac{d}{dw}J(W,B) \approx 0$ when $g(z)$ is flat; the sigmoid is flat on both ends, while ReLU is flat only when $z < 0$, which speeds up gradient descent
- Used almost all the time for hidden layers
Why do we need activation functions?
- If we, for example, used the Linear AF for all our activation functions (incl. HL + Output Layer), this would be no different from linear regression (defeating the purpose of a NN)
- If we used Linear AF on all HL but Sigmoid on Output Layer, then it will be equivalent to logistic regression
- Thus, do not use linear activations in hidden layers (suggested: ReLU)
(DEF) Multiclass Classification: Refers to classification problems where you can have more than just two output labels (not just 0/1 like binary)
- The "handwriting digit" example is a good start - classifying not just 0 or 1, but digits from 0-9
- For predictions, the notation still looks like $P(y=n \mid \vec{x})$ for some $n$, but instead of $n = 0/1$, it can be any class
(DEF) Softmax: A generalization of logistic regression to $N$ possible output classes
- (EQUATION) $z_j = \vec{w}_j \cdot \vec{x} + b_j$ where $j = 1, ..., N$
- (EQUATION) $a_j = \frac{e^{z_j}}{\sum^{N}_{k=1} e^{z_k}} = P(y = j \mid \vec{x})$
  - NOTE: $a_1 + a_2 + ... + a_N = 1$ (100%). Also, with only two outputs, Softmax reduces down to logistic regression but with different parameters
- (EQUATION) $loss(a_1, ..., a_N, y) = -\log(a_j)$ if $y = j$
The Softmax will replace the Output Layer as a Softmax Output Layer with the number of neurons (units) being equal to the number of classifications you wish to have
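The softmax equation above can be sketched directly in NumPy (the logits here are hypothetical; subtracting the max is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) avoids overflow in exp() without changing the output
    ez = np.exp(z - np.max(z))
    return ez / ez.sum()

z = np.array([2.0, 1.0, 0.1])   # hypothetical logits z_1..z_3 from the output layer
a = softmax(z)                  # a_j = P(y = j | x); the entries sum to 1
```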
Example Code (not the most optimal version):
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
# Step 1: Specify the Model
model = Sequential([Dense(units=25, activation="relu"),
Dense(units=15, activation="relu"),
Dense(units=10, activation="softmax")]) # 10 units: one per class (e.g., digits 0-9)
# Step 2: Specify the Loss + Cost
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy()) # Note "SparseCategoricalCrossentropy"
# Step 3: Train model
model.fit(X,Y, epochs=100)
The reason why the above is not the best is because of numerical roundoff errors
A More Numerically Accurate Implementation of Softmax:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
# Step 1: Specify the Model
model = Sequential([Dense(units=25, activation="relu"),
Dense(units=15, activation="relu"),
Dense(units=10, activation="linear")]) # note linear now, NOT softmax
# Step 2: Specify the Loss + Cost
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True)) # notice "from_logits"
# Step 3: Train model
model.fit(X,Y, epochs=100)
# Step 4: Predict
logits = model(X) # no longer outputting a, but rather z (intermediate val)
f_x = tf.nn.softmax(logits)
(DEF) Logits: The intermediate values $z$ fed to the softmax (the raw outputs before normalization)
We can also use the parameter from_logits=True in our logistic regression algorithm to make it more numerically accurate, but it is not totally needed.
While it sounds like multiclass classification (one output with multiple possible labels, e.g., 0-9 instead of 0/1), multi-label classification refers to multiple outputs (labels) per input.
Street Image Example:
- Is there a car? Yes (1)
- Is there a bus? No (0)
- Is there a pedestrian? Yes (1)
- Thus, the target $y$ would be a column matrix/vector of the three labels
There are two neural network architectures that may be used here:
- Train three neural networks, one for each output of car, bus, and pedestrian
- Train one neural network with three outputs to simultaneously detect car, bus, and pedestrian
- The output layer here will have three neurons instead of one (using Sigmoid AF)
Recall the learning rate $\alpha$ used in gradient descent updates like $w = w - \alpha \frac{\partial}{\partial w}J(\vec{w},b)$.
There are some algorithms that can dynamically adjust the learning rate (start large then decrease as we approach the minimum) like the Adam algorithm
(DEF) Adaptive Moment Estimation (Adam): Uses a separate learning rate for each parameter; if a parameter keeps moving in the same direction, its learning rate is increased, and if it keeps oscillating, its learning rate is decreased
Example Code:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
# Step 1: Specify the Model
model = Sequential([Dense(units=25, activation="relu"),
Dense(units=15, activation="relu"),
Dense(units=10, activation="linear")])
# Step 2: Specify the Loss + Cost
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
loss=SparseCategoricalCrossentropy(from_logits=True))
# Step 3: Train model
model.fit(X,Y, epochs=100)
Most practitioners use the Adam algorithm rather than plain gradient descent.
There are also additional layer types other than the dense layer (where each neuron output is a function of all the activation outputs of the previous layer).
Another type of layer often used is Convolutional Layers.
(DEF) Convolutional Layer: Each neuron only looks at part of the previous layer's outputs
- Allows for faster computation
- Need less training data (less prone to overfitting)
- Multiple convolutional layers contribute to a Convolutional Neural Network (CNN)
- Allows for flexibility in window size and opportunity for greater efficiency than NNs with dense layers
(DEF) Backpropagation: A technique that efficiently calculates the derivatives needed by gradient descent as a right-to-left calculation over a computation graph
- Closely related to automatic differentiation ("autodiff")
The process is that after forward propagation (left-to-right) to compute the cost function, backpropagation (right-to-left) is used for the derivative calculation.
In terms of computational complexity, backpropagation computes all the derivatives in roughly $N + P$ steps rather than $N \times P$, where $N$ is the number of nodes and $P$ the number of parameters.
Backpropagation is already built into most deep learning algorithms/machine learning frameworks.
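As a toy illustration of the right-to-left idea (all numbers below are made up), take the computation graph for $J = (wx + b - y)^2$: the forward pass computes $J$ step by step, and the backward pass computes each derivative once and reuses it:

```python
# Toy computation graph for J = (w*x + b - y)^2, with made-up values
w, b, x, y = 2.0, 8.0, -2.0, 2.0

# Forward propagation (left to right): compute the cost step by step
c = w * x        # c = -4.0
a = c + b        # a = 4.0  (the prediction)
d = a - y        # d = 2.0
J = d ** 2       # J = 4.0

# Backpropagation (right to left): each step reuses the derivative to its right
dJ_dd = 2 * d        # dJ/dd = 4.0
dJ_da = dJ_dd * 1    # the step d = a - y has local derivative 1 w.r.t. a
dJ_dc = dJ_da * 1
dJ_db = dJ_da * 1    # dJ/db = 4.0
dJ_dw = dJ_dc * x    # dJ/dw = -8.0
```

Each intermediate derivative (like `dJ_da`) is computed once and shared by everything to its left, which is where the efficiency comes from.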
"I've seen teams sometimes, say, six months to build a machine learning system, that I think a more skilled team could have taken or done in just a couple of weeks. The efficiency of how quickly you can get a machine learning system to work well will depend to a large part of how well you can repeatedly make good decisions about what to do next in the course of a machine learning project." - Andrew Ng
Say we implemented a regularized linear regression model on housing prices, but it makes unacceptably large prediction errors. What can we try next?
- Get more training examples
- Try smaller sets of features
- Try getting additional features
- Try adding polynomial features ($x_1^2, x_1^3, x_1x_2$, etc.)
- Try increasing/decreasing the regularization term $\lambda$
Many of these steps can take lots of time, like getting more training examples. How do we know what to do?
(DEF) (Machine Learning) Diagnostic: A test that you run to gain insight into what is/isn't working with a learning algorithm, to gain guidance into improving its performance
- Can take time to implement, but doing so can be a very good use of your time
How can one evaluate a model's performance? Doing this properly can reveal when a model performs worse than expected and guide debugging.
The 70/30 Technique: splitting the dataset into two subsets
- 70% for training the model ($(x^{(i)}, y^{(i)})$ for $i = 1, ..., m_{train}$, where $m_{train}$ = # of training examples)
- 30% for testing the model ($(x_{test}^{(i)}, y_{test}^{(i)})$ for $i = 1, ..., m_{test}$, where $m_{test}$ = # of test examples)
To determine performance, these subsets can be substituted into the cost function
- If $J_{train}(\vec{w}, b)$ is low but $J_{test}(\vec{w}, b)$ is high, that indicates performance on the training set is good but performance on the general/test set is not (overfitting)
- For classification problems, $J_{train}(\vec{w}, b)$ and $J_{test}(\vec{w}, b)$ can instead be the fraction of each set that has been misclassified
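To make the evaluation concrete, here is a minimal NumPy sketch (data and parameter values are hypothetical) computing $J_{train}$ and $J_{test}$ for a one-feature linear model under a 70/30 split:

```python
import numpy as np

def cost(w, b, x, y):
    """Squared-error cost J(w, b) = (1 / 2m) * sum((w*x + b - y)^2)."""
    m = len(x)
    return np.sum((w * x + b - y) ** 2) / (2 * m)

# Hypothetical 1-feature dataset with a 70/30 split
x = np.arange(10, dtype=float)
y = 3 * x + 1
x_train, x_test = x[:7], x[7:]
y_train, y_test = y[:7], y[7:]

w, b = 3.0, 0.0   # deliberately off by 1 in the intercept
J_train = cost(w, b, x_train, y_train)   # 0.5
J_test = cost(w, b, x_test, y_test)      # 0.5
```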
However, note that once we use the test set to make a model choice (e.g., picking the polynomial degree $d$ with the lowest $J_{test}$), $J_{test}$ becomes an overly optimistic estimate of the generalization error.
In other words, the extra parameter $d$ has effectively been fit to the test set.
Further Refinement 60/20/20: Training/Cross-Validation/Test Sets:
- 60% for training the model ($(x^{(i)}, y^{(i)})$, where $m_{train}$ = # of training examples)
- 20% for cross-validating the model ($(x_{cv}^{(i)}, y_{cv}^{(i)})$, where $m_{cv}$ = # of cross-validation examples)
- 20% for testing the model ($(x_{test}^{(i)}, y_{test}^{(i)})$, where $m_{test}$ = # of test examples)
(DEF) Cross-Validation: Extra dataset to check the validity of different models
- Also commonly called the validation set, development set, and dev set
Model Selection Procedure:
- Train the model using the training dataset
- Test models with varying degrees $d$ using the cross-validation set, choosing the degree with the lowest cost $J_{cv}$
- Estimate the generalization error using the test dataset with the chosen degree: $J_{test}$
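A minimal sketch of this selection procedure, assuming hypothetical quadratic data and using `np.polyfit` as a stand-in for training models of different degrees $d$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
y = x ** 2 + rng.normal(0, 0.05, size=60)   # hypothetical quadratic data + noise

# 60/20/20 split into training / cross-validation / test indices
idx = rng.permutation(60)
tr, cv, te = idx[:36], idx[36:48], idx[48:]

def cost(p, xs, ys):
    return np.mean((np.polyval(p, xs) - ys) ** 2) / 2

# Fit one model per degree d on the training set only,
# then pick the degree with the lowest cross-validation cost
J_cv = {}
for d in range(1, 6):
    p = np.polyfit(x[tr], y[tr], deg=d)
    J_cv[d] = cost(p, x[cv], y[cv])
best_d = min(J_cv, key=J_cv.get)

# Only now touch the test set, to estimate the generalization error
p_best = np.polyfit(x[tr], y[tr], deg=best_d)
J_test = cost(p_best, x[te], y[te])
```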
The above was intended for linear regression but also works for choosing a neural network architecture.
Model Selection Procedure - NN Architecture:
- Train the NN using the training dataset
- Test different NN architectures using the cross-validation set, choosing the one with the lowest cost $J_{cv}$
- Estimate the generalization error of the chosen architecture using the test dataset: $J_{test}$
When making decisions regarding your training algorithm, utilize the training and cv datasets only. Do not touch the test dataset until you've created one model as your final model to ensure that the test set is fair and not an overly optimistic estimate of how well a model may generalize new data.
Recall our concepts of High Bias (Underfit) and High Variance (Overfit). How might they relate to the costs $J_{train}(\vec{w}, b)$ and $J_{cv}(\vec{w}, b)$?
For high bias (underfit):
- $J_{train}$ is high
- $J_{cv}$ is high
- Relatively, $J_{train} \approx J_{cv}$
For high variance (overfit):
- $J_{train}$ is low
- $J_{cv}$ is high
- Relatively, $J_{train} << J_{cv}$
Both high bias + variance:
- $J_{train}$ is high
- Relatively, $J_{train} << J_{cv}$
If "just right":
- $J_{train}$ is low
- $J_{cv}$ is low
Comparing $J_{train}$ and $J_{cv}$ as the polynomial degree $d$ increases: $J_{train}$ keeps decreasing, while $J_{cv}$ first falls and then rises again (a U-shape); the degree at the bottom of the U is "just right".
How do bias and variance work with regularization ($\lambda$)?
- Very large $\lambda$ = High bias (underfit), where $f_{\vec{w}, b}(\vec{x}) \approx b$ (e.g., $\lambda = 10{,}000$)
- Very small $\lambda$ = High variance (overfit) (e.g., $\lambda = 0$)
Thus, how can we choose a good $\lambda$? As with the degree $d$, try a range of values and pick the $\lambda$ with the lowest $J_{cv}$.
Comparing $J_{train}$ and $J_{cv}$ as $\lambda$ increases: $J_{train}$ rises (the model is more constrained), while $J_{cv}$ again traces a U-shape.
But from what perspective do we assume an error is "high"? Compare against a baseline level of performance:
- Large gap between baseline performance (error) and training error: high bias (underfit)
- Large gap between training error and cross-validation error: high variance (overfit)
Speech Recognition Example:
- Training error $J_{train}$: 10.8%
- Cross Validation error $J_{cv}$: 14.8%
- Just looking at these two against a baseline of 0%, we might assume high bias, right?
- However, if we factor in human-level performance (10.6% error), we can't expect the learning algorithm to do much better than that
- Thus, with 10.6% as the baseline, $J_{train}$ is only 0.2% higher, while the gap between $J_{train}$ and $J_{cv}$ remains 4%
- Now, we may consider this a variance problem rather than a bias problem
In establishing a baseline level of performance, the question is: "What is the level of error you can reasonably hope to get to?"
- Human-level performance
- Competing algorithms performance
- Guess based on experience
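One common reading of these two gap comparisons (gap to baseline → bias; gap between training and CV error → variance) can be sketched as a tiny helper; the 0.5 percentage-point threshold here is an arbitrary assumption for illustration:

```python
def diagnose(baseline, j_train, j_cv, gap=0.5):
    """Compare error gaps (all in %); the gap threshold is illustrative only."""
    issues = []
    if j_train - baseline > gap:
        issues.append("high bias")      # can't even match the baseline on training data
    if j_cv - j_train > gap:
        issues.append("high variance")  # doesn't generalize to the CV set
    return issues or ["looks ok"]

# The speech recognition example: baseline 10.6%, J_train 10.8%, J_cv 14.8%
print(diagnose(10.6, 10.8, 14.8))   # ['high variance']
```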
Learning curves are a way to help you understand how your learning algorithm is doing as a function of the training set size $m$
If a learning algorithm has high bias:
- The CV error $J_{cv}$ will decrease then plateau as $m$ increases
- The training error $J_{train}$ will increase then plateau beneath the CV error
- Both plateau well above the human-level performance, no matter how big $m$ is
- If a learning algorithm suffers from high bias, getting more training data will (by itself) not help much
If a learning algorithm has high variance:
- The CV error $J_{cv}$ will be much larger than $J_{train}$, decreasing then plateauing as $m$ increases
- The training error $J_{train}$ will increase then plateau
- The human-level performance sits between the two errors like a horizontal asymptote
- This means more $m$ = good! If a learning algorithm suffers from high variance, getting more training data is likely to help
Back to our example with housing predictions: Say we implemented a regularized linear regression model on housing prices, but it makes unacceptably large prediction errors. What can we try next? Which will be used if either high variance vs. high bias?
- Get more training examples (fixes high variance)
- Try smaller sets of features (fixes high variance)
- Try getting additional features (fixes high bias)
- Try adding polynomial features ($x_1^2, x_1^3, x_1x_2$, etc.) (fixes high bias)
- Try changing the regularization term $\lambda$:
  - Increase (fixes high variance)
  - Decrease (fixes high bias)
If you have high bias, your training set is not the problem - your model is. As such, don't randomly throw away training examples.
Bias and Variance in Neural Networks:
Large neural networks are usually low-bias machines
Simple process for evaluating bias and variance in neural networks:
- (1) Does it do well on the training set?
  - If not ($J_{train}$ is relatively high), expand the neural network (add more hidden layers or hidden units) and retry until it does well on the training set (achieves roughly human-level performance)
- (2) If so, does it do well on the cross-validation set?
  - If not ($J_{cv}$ is high), add more data and go back to step (1)
  - If so, done!
Understandably, the solutions to (1) and (2) may be hard:
- GPUs and AI accelerators are speeding this process up, but as NNs get super large, they will get significantly harder to compute
- Data may be limited due to the nature of the data
A large neural network will usually do as well or better than a smaller one so long as regularization is chosen appropriately (less risk of overfitting than traditional AI). However, they do become more computationally expensive.
To regularize, we add the parameter kernel_regularizer=(...) to a Dense(...) layer, e.g., Dense(units=25, activation="relu", kernel_regularizer=tf.keras.regularizers.L2(0.01)).
The loop: Start → Choose architecture (model, data, etc.) → Train model → Diagnostics (bias, variance, and error analysis) → Repeat
Example Development Process - Building an Email Spam Classifier:
- Choose Architecture
  - Supervised learning
  - $\vec{x}$ = features of email
    - Features: take a list of the top 10,000 words and compute $x_1, x_2, ..., x_{10{,}000}$ per email
  - $y$ = spam (1) or not spam (0)
- Train model
  - Can use either logistic regression or a neural network to predict $y$ given features $x$
- Diagnostics
  - How do you try to reduce the error in your spam classifier?
    - Collect more data (e.g., a spam "honeypot")
    - Develop sophisticated features based on email routing (from the email header)
    - Design algorithms to detect misspellings
- Repeat if needed
Choosing a more promising direction can speed up your project many times over compared to choosing a less promising one.
(DEF) Error Analysis: Manually examining misclassified examples and categorizing them based on common traits
- This can help determine what to focus on and what has a high impact on your model
- In relation to these categories, we may decide to gather more data or features
- These error categories may be overlapping and not mutually exclusive
- If there are many misclassified examples, just examine a random subset (~100+) manually
Some tips/techniques on engineering the data used by your system:
- Adding Data: Rather than adding more data on everything, add more data of the types where error analysis has indicated it might help
  - E.g., with the email spam classifier, go to unlabeled data and find more examples of, say, pharma-related spam if it came up often in error analysis
  - This could boost learning algorithm performance much more than adding general data
(DEF) Data Augmentation: Modifying an existing training example to create a new training example
- Typically used in audio and image classification
- In image/text recognition, it may involve rotations, distortions, etc. (like what you see in captchas)
- Speech recognition may involve adding noise (e.g., background, bad connection/quality, etc.)
- However, it usually does not help to add purely random/meaningless noise to your data
(DEF) Data Synthesis: Synthetically generating realistic data
- Often used in computer vision tasks, less in other applications
- An example is photo-to-text:
  - Real data may involve real-world photos
  - Synthetic data may involve typing in a text editor with different fonts
Approaches to developing AI:
(DEF) Conventional Model-Centric Approach: AI = Code (algorithm/model) + Data, with significant emphasis on the code
(DEF) Data-Centric Approach: AI = Code (algorithm/model) + Data, with more emphasis on data engineering and collection
(DEF) Transfer Learning: Using models and data from a different task
- Take the example of handwritten digit classification (0-9) and an image classification NN with 1,000 classes/output units
- Since the image classification NN has already been trained, we can re-use its parameters $w, b$ (hidden layers only; the output layer is excluded due to the size difference) for digit classification
- We then have two options:
  - Only train the output layer parameters
  - Train all parameters (initialized from the pre-trained values)
(DEF) Supervised Pretraining: Training a model on a very large dataset (possibly not a related topic, but a similar application) with the intention of then fine-tuning it to a specific topic
- Downloading a pre-trained model is one way to get a jumpstart on your desired application and topic area
- Use the same input type (images, audio, text, etc.)
To apply transfer learning:
- Download NN parameters pre-trained on a large dataset with the same input type (e.g., images, audio, text) as your application (or pre-train your own)
- Further train (fine-tune) the network on your own data
Especially helpful if you lack the resources to collect a large dataset for your application, provided the downloaded NN's training data is relatively similar
- Examples of popular applications built on pretraining/transfer learning are GPT-3, BERT, ImageNet-pretrained models, etc.
- Define Project Scope
- Define and Collect Data
- Train and Perform Error Analysis (iterative improvement: loop back to (2) as needed)
- Deploy, Monitor, and Maintain System (may need to go back to (2) or (3) if needed)
Example Deployment Structure:
- Mobile app (sends an API call with input $x$, e.g., an audio clip, to the Inference Server)
- Inference Server (contains the ML model; returns the inference $\hat{y}$, e.g., a text transcript, to the mobile app)
Software engineering may be needed (Machine Learning Operations (MLOps)):
- Ensure reliable and efficient predictions
- Scaling
- Logging
- System monitoring (new data?)
- Model updates
Machine learning algorithms today affect billions of people. As such, it is necessary to approach designing machine learning systems with fairness and ethics in mind.
In the past, there were times when ML systems discriminated against various groups:
- Hiring tool that discriminates against women
- Facial recognition system matching dark-skinned individuals to criminal mugshots
- Biased bank loan approvals
- Toxic effect of reinforcing negative stereotypes
There also have been adverse/negative use cases:
- Deepfakes
- Spreading toxic/incendiary speech through optimizing for engagement
- Generating fake content for commercial and political purposes
- Using ML to build harmful products, commit fraud, etc.
- Spam vs. Anti-Spam; Fraud vs. Anti-Fraud
Possible Guidelines to Maintain Fairness & Ethics
- Get a diverse team to brainstorm things that might go wrong with emphasis on possible harm to vulnerable groups
- Carry out a literature search on standards/guidelines for your industry
- Audit systems against possible harm before deployment
- Develop a mitigation plan (if applicable), and after deployment, monitor for possible harm
- Simple mitigation: rollback to a previous model where it was reasonably fair
- Real-life scenario: All self-driving car companies developed mitigation plans for if a car gets into an accident
When working on a machine learning application (particularly classification), the ratio of positive to negative examples can be very far from 50:50. In these cases, the usual error metrics like accuracy do not work well.
Example - Rare Disease Classification:
- Assume $y=1$ if the disease is present, $y=0$ otherwise
- If our training error $J_{train}$ is 1% (99% correct diagnoses), but only 0.5% of patients have the disease, then a trivial program that always prints "y=0" looks more effective, since it has an "error" of only 0.5% (99.5% correct diagnoses). A simple print statement outperformed a learning algorithm!
- As stated before, if the dataset is skewed, accuracy may not be a reliable error metric. Instead, we can use another pair of metrics: Precision/Recall
To set up precision/recall, we need to separate the data and predictions into 4 categories:
- True Positive (Actual = 1, Predicted = 1)
- False Positive (Actual = 0, Predicted = 1)
- False Negative (Actual = 1, Predicted = 0)
- True Negative (Actual = 0, Predicted = 0)
(DEF) Precision: Proportion of positive predictions that are actually correct; accuracy of positive predictions
- (EQUATION) $P = \frac{\textrm{True Positives}}{\textrm{Total Predicted Positive}}$
- Example: Of all patients where we predicted $y=1$, what fraction actually have the rare disease?
(DEF) Recall: Proportion of actual positive cases that the model correctly identifies; how well the model finds relevant instances in a dataset
- (EQUATION) $R = \frac{\textrm{True Positives}}{\textrm{Total Actual Positive}}$
- Example: Of all patients that actually have the rare disease, what fraction did we correctly detect as having it?
A learning algorithm with both low precision and low recall is not a useful algorithm.
However, getting high precision and high recall may not be possible -- there may be a trade-off. This depends on our classification threshold.
Suppose again the rare disease example, where logistic regression outputs a probability between 0 and 1:
- If the threshold is set at 0.9 (we predict $y=1$ only when very confident): higher precision, lower recall
- If the threshold is set at 0.3 (we want to avoid missing too many cases, so when in doubt, predict $y=1$): lower precision, higher recall
Thresholds are usually up to you and depend on the nature of an application.
Luckily, there is one more useful metric that may help us: The F1 score
(DEF) F1 Score: Combines precision ($P$) and recall ($R$) into a single metric, paying more attention to whichever is lower
- (EQUATION) $F_1 = 2\frac{PR}{P+R}$
- This is the harmonic mean of $P$ and $R$
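A minimal NumPy sketch of all three metrics (the labels and predictions below are hypothetical):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Hypothetical skewed labels: only 2 actual positives among 8 examples
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0])
p, r, f1 = precision_recall_f1(y_true, y_pred)   # P = 0.5, R = 0.5, F1 = 0.5
```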
Decision trees are a very powerful family of learning algorithms, widely used in many applications, and they often win machine learning competitions. Despite this, they receive less attention in academia and are less popular than their counterparts, neural networks.
The "tree" refers to one similar to a binary tree in computer science (not a biological tree in nature).
The structure is hard to illustrate in words, but each node tests a feature with options that determine how to traverse downward (with binary features, the example root may be "ear shape": if "pointy", go left; if "floppy", go right). The child nodes are not the specific options, but new decision features (e.g., face shape, whiskers, etc.).
The shapes around each feature have a meaning and may be separated into two categories: decision nodes (ovals) and leaf nodes (rectangles).
(DEF) Decision Nodes: All nodes other than the leaf nodes. At each decision node, the feature's value decides whether you go left or right down the tree.
- The root node (very top) is also a decision node
(DEF) Leaf Nodes: The nodes at the very outside edges of the tree. These make a prediction.
With decision trees, there are some decisions we need to consider on our own:
- How to choose what feature to split on at each node to maximize purity (or minimize impurity)
- When do you stop splitting? (Splitting too much runs the risk of overfitting)
- When a node is 100% one class
- When splitting a node will result in the tree exceeding a maximum (user-set) depth
- When improvements in purity score are below a threshold
- When the number of examples in a node is below a threshold
(DEF) Entropy: A measure of impurity
- Commonly written $H(p_1)$, where $p_1$ = fraction of positive examples according to the classification criterion (e.g., cat or not cat)
- If $p_1$ is the fraction of positive examples, $p_0$ is the fraction of negative examples: $p_0 = 1 - p_1$
- (EQUATION) $H(p_1) = -p_1\log_2(p_1)-(1-p_1)\log_2(1-p_1)$
  - NOTE: It is possible to get $0\log(0)$ if $p_1 = 1$ or $p_1 = 0$, and it is known that $\log(0) = -\infty$; by convention, this term is taken to be $0$
- $0 \le H(p_1) \le 1$, where $1$ is the highest entropy (impurity)
(DEF) Information Gain: Reduction of entropy
- (EQUATION) $w^{\textrm{side}} = \frac{\textrm{num of examples on a side}}{\textrm{num of total examples}}$
- (EQUATION) Reduction $= H(p_1^{\textrm{root}}) - (w^{\textrm{left}} H(p_1^{\textrm{left}}) + w^{\textrm{right}} H(p_1^{\textrm{right}}))$
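These two formulas can be sketched and checked against a hypothetical split (5/10 positive examples at the root, 4/5 on the left branch, 1/5 on the right):

```python
import numpy as np

def entropy(p1):
    """H(p1), treating 0 * log2(0) as 0 by convention."""
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(p1_root, p1_left, p1_right, n_left, n_right):
    n = n_left + n_right
    w_left, w_right = n_left / n, n_right / n
    return entropy(p1_root) - (w_left * entropy(p1_left) + w_right * entropy(p1_right))

# Hypothetical split: root has 5/10 cats, left child 4/5, right child 1/5
gain = information_gain(5 / 10, 4 / 5, 1 / 5, 5, 5)   # ≈ 0.28
```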
Decision Tree Learning Process:
- Start with all examples at the root node
- Calculate information gain for all possible features and pick the one with highest information gain
- Split the dataset according to the selected feature and create left/right branches of the tree
- Keep repeating the splitting process until the stopping criteria (1 or more) are met (recursive splitting):
- When a node is 100% one class
- When splitting a node will result in the tree exceeding a maximum (user-set) depth
- When improvements in purity score are below a threshold
- When the number of examples in a node is below a threshold
Rather than classifying with just one feature in the decision nodes, we can split that one feature into three or more specific features. This allows for more than two options for a decision.
- E.g., Ear Shape (Pointy/Oval) β Pointy Ears (0/1), Floppy Ears (0/1), Oval Ears (0/1)
- If a categorical feature can take on $k$ values, create $k$ binary features (0/1 valued)
The training table becomes a bit more detailed, too: note that in each row, exactly one of the $k$ columns is hot (1).
One-hot encoding can also be applied to neural networks/logistic regression.
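A minimal sketch of one-hot encoding a hypothetical $k = 3$ category feature:

```python
import numpy as np

# Hypothetical "ear shape" feature with k = 3 possible values
categories = ["pointy", "floppy", "oval"]
values = ["pointy", "oval", "pointy", "floppy"]

# k binary (0/1) columns per example; exactly one column is hot per row
one_hot = np.array([[1 if v == c else 0 for c in categories] for v in values])
```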
Continuous Valued Features:
In many applications, features are continuous rather than binary, such as an animal's weight. How can we represent them in decision trees? Just as before: by splitting, now on a threshold.
The recommended way to find the best split for a continuous feature is to sort the examples by that feature, consider the midpoints between consecutive values as candidate thresholds, and choose the threshold that gives the highest information gain.
Rather than taking decision trees as classification algorithms where they make a binary guess of yes (1) or no (0), we can also generalize decision trees to be regression algorithms that can predict a number.
Rather than focusing on reducing entropy, we will instead try to reduce the weighted variance of the target values
(DEF) Variance: How wide a set of numbers is
- (EQUATION) $V(\textrm{node}) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
  - $x_i$ = the value of one observation
  - $\bar{x}$ = the mean value of all observations
  - $n$ = the number of observations
- (EQUATION) Weighted Variance Reduction = $V(\textrm{root}) - (w^{\textrm{left}} V(\textrm{left}) + w^{\textrm{right}} V(\textrm{right}))$
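As a sketch of the computation above (plain Python; the data values are made up for illustration):

```python
# Weighted variance reduction for a candidate split, using the sample
# variance V = sum((x_i - mean)^2) / (n - 1) defined above.
def variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_reduction(root, left, right):
    w_left = len(left) / len(root)    # fraction of examples going left
    w_right = len(right) / len(root)  # fraction of examples going right
    return variance(root) - (w_left * variance(left) + w_right * variance(right))

# Hypothetical animal weights; the split separates light from heavy examples
root = [7.2, 8.4, 7.6, 10.2, 9.2, 11.0]
left, right = [7.2, 8.4, 7.6], [10.2, 9.2, 11.0]
print(variance_reduction(root, left, right))  # positive: the split helps
```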
A single decision tree can be highly sensitive to small changes in data. One solution to make the algorithm less sensitive is building many decision trees (a tree ensemble).
(DEF) Tree Ensemble: A collection of different decision trees
- Q: How can we split features into different decision trees?
- A: By changing the root feature.
- NOTE: decision nodes can reuse features from different trees in the ensemble
- Each tree makes a binary prediction (0/1) based on the classification (e.g., cat or not cat) as a "vote"; the majority wins. Using an odd number of trees avoids ties in the vote.
To build a tree ensemble, a technique called sampling with replacement is required:
- Imagine putting some colored tokens in a bag, shaking it, and taking one out (also noting what color it was). That is sampling. The "with replacement" part refers to returning the token to the bag for the next iteration.
- Why is "with replacement" important? Without replacement, the new training set will always be identical to the original training set
- This is needed because it will help construct multiple random training sets that are all slightly different from the original training set
There are many tree ensemble algorithms, and they work much better than a single decision tree.
One way of generating a tree ensemble (Bagged Decision Tree):
- Assume a training set of size $m$
- For $b = 1$ to $B$:
  - Use sampling with replacement to create a new training set of size $m$
  - Train a decision tree on the new dataset
- Setting $B$ to a large number never hurts performance, but beyond a certain point ($B \approx 100+$) you get diminishing returns. Many recommend 64 to 128 as the maximum.
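The sampling-with-replacement step can be sketched like this (plain Python; the actual tree training is omitted):

```python
import random

# Build B bootstrap training sets, each of size m, by sampling with
# replacement from the original training set.
def bootstrap(dataset, rng):
    m = len(dataset)
    return [dataset[rng.randrange(m)] for _ in range(m)]

rng = random.Random(0)               # fixed seed for reproducibility
data = list(range(10))               # stand-in for m = 10 training examples
samples = [bootstrap(data, rng) for _ in range(3)]   # B = 3
for s in samples:
    print(s)  # duplicates appear; each set differs slightly from the original
```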
We can improve upon this with the Random Forest Algorithm:
- When choosing a feature to split on at each node, if $n$ features are available, pick a random subset of $k < n$ features and allow the algorithm to choose only from that subset
- A common choice is $k = \sqrt{n}$
There are many different ways to build decision trees and decision tree ensembles, but by far, the most commonly used implementation of decision trees/ensembles is with an open-source algorithm called XGBoost.
(DEF) XGBoost: Stands for Extreme Gradient Boosting; a scalable gradient-boosted decision tree machine learning library
- Fast, efficient implementation
- Good choice of default splitting criteria and criteria for when to stop splitting
- Built-in regularization to prevent overfitting
- Highly competitive algorithm for ML competitions (e.g., Kaggle)
- `from xgboost import XGBClassifier` for classification
- `from xgboost import XGBRegressor` for regression
Boosted Trees Intuition:
- Recall our Bagged Decision Tree algorithm. We tweak the step "use sampling with replacement to create a new training set of size $m$"
- The methodology is similar, but instead of picking from all examples with equal ($\frac{1}{m}$) probability, make it more likely to pick examples misclassified by the previously trained trees
- Boosting looks at what we are not yet doing well and builds the next decision tree to be better on those misclassified examples
Decision Trees and Tree Ensembles:
- Works well on tabular (structured/spreadsheet) data
- Not recommended for unstructured data (images, audio, or text)
- Fast to train
- Small decision trees may be human-interpretable
- Most of the time, use XGBoost
Neural Networks:
- Works well on all types of data, including tabular (structured) and unstructured data
- May be slower than a decision tree
- Works with transfer learning
- When building a system of multiple models working together, it might be easier to string together multiple neural networks
This section will touch on the following topics:
- Unsupervised Learning
- Clustering
- Anomaly Detection
- Recommender Systems
- Reinforcement Learning
Recall that unsupervised learning is learning structure from data that comes only with inputs (x), not output labels (y). This type of learning is particularly helpful in finding patterns in a pool of unlabeled data, with assistance from an algorithm called clustering.
(DEF) Clustering: An algorithm that looks at a number of data points and automatically finds data points that are related/similar to each other
- Applications of clustering: Grouping similar news, market segmentation (identifying various groups), DNA genetic traits data, etc.
K-Means Clustering Algorithm:
- Initialization: Randomly initialize $K$ cluster centroids ($\mu_1, \mu_2, ..., \mu_K$) on an unlabeled training set
  - Choose $K < m$
  - (DEF) Cluster Centroids: Centers of clusters, initially randomly guessed, but they move as more iterations of K-means occur
  - The number of clusters and cluster centroids may be ambiguous
  - Sometimes the initial guesses are not the best. We can run the initialization multiple times (maybe ~50 to 1000 times) and compare the runs using the distortion (cost) function, selecting the one with the lowest distortion
- Assign each point to its closest cluster centroid
- Recompute each centroid by taking the average of the points assigned to it
  - If a cluster has zero points (the mean would be undefined from dividing by 0), we just eliminate that cluster
- Repeat until no points are reassigned (convergence)
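The assignment/update loop above can be sketched in plain Python for 1-D points (the data and initial centroids here are made up):

```python
# Minimal 1-D K-means: repeat (1) assign each point to the nearest centroid,
# (2) move each centroid to the mean of its assigned points.
def kmeans(points, centroids, iters=10):
    assign = []
    for _ in range(iters):
        assign = [min(range(len(centroids)), key=lambda k: (p - centroids[k]) ** 2)
                  for p in points]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:                 # a cluster with zero points is skipped
                centroids[k] = sum(members) / len(members)
    return centroids, assign

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centroids, assign = kmeans(points, centroids=[0.0, 5.0])
print(centroids)  # ~[1.0, 9.0]: one centroid per visible group
```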
K-Means may also be applicable to unlabeled datasets where clusters are not well separated.
In supervised learning, the goal has always been to optimize a cost function with various algorithms like gradient descent. It turns out that clustering is also optimizing a specific cost function, but it is not with gradient descent.
K-Means Optimization Objective:
- $c^{(i)}$ = index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
- $\mu_k$ = cluster centroid $k$
- $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
(DEF) Distortion Function: Another name for cost function in clustering
- (EQUATION) $J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K) = \frac{1}{m} \sum_{i=1}^{m} || x^{(i)} - \mu_{c^{(i)}} || ^2$
- As this function is optimized, it is guaranteed to go down or stay the same on each step of K-means. If it ever goes up, there is likely a bug in the code.
Choosing a value of $K$:
- (DEF) Elbow Method: Plot the cost $J$ as a function of the number of clusters $K$ and find the "elbow" where the decrease in $J$ starts plateauing; that value of $K$ is your answer
  - Andrew does not endorse this method
- Don't choose $K$ just to minimize cost $J$, since $J$ always decreases as $K$ increases
- Often, you want clusters for some later (downstream) purpose; evaluate K-means based on how well it performs for that purpose
  - E.g., for T-shirt sizes, select $K=3$ for sizes S, M, and L
Whereas clustering algorithms group similar events/values, anomaly detection looks at an unlabeled dataset of normal events and thereby learns to detect if there is an unusual or anomalous event.
How can we develop such an algorithm? With a technique called density estimation.
Density Estimation:
- Helps determine the probability $p(x)$ of $x$ being seen in the dataset by identifying regions of high probability (denser regions of $x$) and low probability (sparser regions of $x$)
- $\epsilon$ = probability threshold
  - As we decrease $\epsilon$, the algorithm is less likely to flag an anomaly
- $p(x_{test}) < \epsilon$ → potential anomaly
- $p(x_{test}) \ge \epsilon$ → ok (normal)
Example Applications of Anomaly Detection:
- Fraud detection:
  - $x^{(i)}$ = features of user $i$'s activities (how often logged in, how many pages visited, transactions, etc.)
  - Model $p(x)$ from data
  - Identify unusual users by checking which have $p(x) < \epsilon$
- Manufacturing:
  - $x^{(i)}$ = features of product $i$ (airplane engines, circuits, phones)
Gaussian Distribution
Also known as the normal distribution, it will be useful in density estimation
- Assume $x$ is a number
- The probability of $x$ is determined by a Gaussian with mean $\mu$ (the central point of the curve) and variance $\sigma^2$
- The curve is bell-shaped; $\sigma$ is the standard deviation
- The area under the curve always sums to 1
- (EQUATION) $p(x) = \frac{1}{\sqrt{2\pi} \sigma} e^{\frac{-(x-\mu)^2}{2\sigma^2}}$
- (EQUATION) $\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$
- (EQUATION) $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu)^2$
Anomaly Detection Algorithm:
- Assume a training set {$\vec{x}^{(1)}, \vec{x}^{(2)}, ..., \vec{x}^{(m)}$} where each example $\vec{x}^{(i)}$ has $n$ features
- (EQUATION) $p(\vec{x}) = p(x_1; \mu_1, \sigma_1^2) * p(x_2; \mu_2, \sigma_2^2) * ... * p(x_n; \mu_n, \sigma_n^2) = \Pi_{j=1}^n p(x_j ; \mu_j, \sigma_j^2)$
  - Assumes the features $x_1, x_2, ..., x_n$ are statistically independent, but works fine even if they are dependent
- Choose $n$ features $x_j$ that you think might be indicative of anomalous examples
- Fit parameters $\mu_1, ..., \mu_n, \sigma_1^2, ..., \sigma_n^2$ (use the equations from the Gaussian distribution)
- Given a new example $x$, compute $p(x)$
- Flag an anomaly if $p(x) < \epsilon$
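The whole algorithm fits in a few lines of plain Python (the two features and $\epsilon$ below are made-up illustrations):

```python
import math

# Fit a per-feature Gaussian (mu, sigma^2), then model p(x) as the product
# of the per-feature densities, flagging an anomaly when p(x) < epsilon.
def fit(column):
    m = len(column)
    mu = sum(column) / m
    var = sum((x - mu) ** 2 for x in column) / m
    return mu, var

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p(example, params):
    prob = 1.0
    for x_j, (mu, var) in zip(example, params):
        prob *= gaussian(x_j, mu, var)   # assumes feature independence
    return prob

# Training set of normal examples, two features per example
train = [(1.0, 5.0), (1.1, 5.2), (0.9, 4.8), (1.05, 5.1)]
params = [fit([row[j] for row in train]) for j in range(2)]
epsilon = 0.05
print(p((1.0, 5.0), params) > epsilon)   # True: looks normal
print(p((3.0, 9.0), params) < epsilon)   # True: far from training data, flagged
```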
The Importance of Real-Number Evaluation
- When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm (also called real-number evaluation)
- Assume we have some labeled data of anomalous and non-anomalous examples (y=1/0)
- We will be using the supervised learning idea of labeled data with CV and test sets to evaluate our algorithm (only a small portion of anomalies)
- The training set will have only normal (non-anomalous) examples and will remain unlabeled
- The CV and test sets will include a few anomalous examples but mostly normal examples
- Use the CV set to choose the parameter $\epsilon$
Choosing Features:
- Replace highly non-Gaussian features with more Gaussian ones, as our Gaussian probability model $p(x)$ is more likely to fit well and find anomalies with Gaussian features
  - Some potential transformations to try on $x$: $\log(x), \log(x+c), x^{\frac{1}{2}}, x^{\frac{1}{n}}$
    - $c$ = some constant (add a small number like 0.0001 if $\min(x)$ is 0 for the log)
  - Whatever transformation you apply to the training set, apply it to the CV and test sets as well
- Q: With anomaly detection, there is a threshold $\epsilon$ for $p(x)$ that separates anomalous and non-anomalous examples, but what if $p(x)$ is large and comparable for both?
  - A: Add new features
Above, we discussed using some labeled data in the evaluation of our anomaly detection algorithm. Why not just go for supervised learning?
When to use Anomaly Detection:
- Very small number of positive examples/anomalies ($y=1$) (0-20 is common)
- Large number of negative ($y=0$) examples
- Many different "types" of anomalies; it is hard for any algorithm to learn from the positive examples what the anomalies look like
- Future anomalies may look nothing like any of the previous anomalous examples (unpredictable)
- Ex: fraud detection, manufacturing (finding new, previously unseen defects), monitoring machines in a data center
When to use Supervised Learning:
- Large number of positive and negative examples
- Enough positive examples for the algorithm to get a sense of what positive examples are like
- Future positive examples likely to be similar to ones in the training set (predictable in the future)
- Ex: Email spam classification, manufacturing (find previously seen defects), weather prediction, disease classification
Recommender systems are widely used in industry by companies like Amazon and Netflix but have received far less attention in academia. Their impact is undeniable: they are directly responsible for a large fraction of sales in e-commerce today.
Let's use this table of movie ratings to illustrate recommender systems
- $r(i,j)$ = 1 if user $j$ has rated movie $i$ (0 otherwise)
- $n_u$ = no. of users
- $n_m$ = no. of movies
- $y^{(i,j)}$ = rating given by user $j$ to movie $i$ (if defined)
- $w^{(j)}, b^{(j)}$ = parameters for user $j$
- $x^{(i)}$ = features for movie $i$
  - Individual features may be hard to interpret
  - To find other items related to movie $i$, find the item $k$ with $x^{(k)}$ most similar to $x^{(i)}$, i.e., with the smallest distance $\sum_{l=1}^{n} (x_l^{(k)} - x_l^{(i)})^2$
- For user $j$ and movie $i$, predict the rating using linear regression (as there is a simple relationship): $f_{w,b}(x) = w^{(j)} \cdot x^{(i)} + b^{(j)}$
  - $m^{(j)}$ = no. of movies rated by user $j$
  - Goal: learn $w^{(j)}, b^{(j)}$
  - Cost function with regularization for user $j$'s parameters $w^{(j)}, b^{(j)}$ (recall): $\frac{1}{2m^{(j)}} \sum_{i:r(i,j)=1} (w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)})^2 + \frac{\lambda}{2m^{(j)}} \sum_{k=1}^n (w_k^{(j)})^2$
    - Notice the restriction $i:r(i,j)=1$ under the summation, which limits the sum to movies the user has rated
Cost function with regularization for all users and parameters (the constant $\frac{1}{m^{(j)}}$ terms are conventionally dropped):
(EQUATION) $J(w^{(1)}, b^{(1)}, ..., w^{(n_u)}, b^{(n_u)}) = \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} (w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (w_k^{(j)})^2$
What if we didn't have the features that describe the items of the movies in sufficient detail (like romance and action)? We can use an algorithm called Collaborative Filtering
Using the same example from before, we can predict the features $x^{(i)}$ of each movie from the users' learned parameters and their ratings.
In short, we can learn the input features!
Given $w^{(1)}, b^{(1)}, ..., w^{(n_u)}, b^{(n_u)}$, the cost function to learn a single feature vector $x^{(i)}$:
(EQUATION) $J(x^{(i)}) = \frac{1}{2} \sum_{j:r(i,j)=1} (w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (x_k^{(i)})^2$
Cost function to learn all features $x^{(1)}, ..., x^{(n_m)}$:
(EQUATION) $J(x^{(1)}, ..., x^{(n_m)}) = \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} (w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2$
If we put the cost function for all parameters and the cost function for all features together:
(EQUATION) $J(w, b, x) = \frac{1}{2} \sum_{(i,j):r(i,j)=1} (w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (w_k^{(j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2$
Then, we can perform gradient descent, where the cost function is now a function of $w$, $b$, and $x$, so all three sets of parameters are updated on each step.
Limitations of Collaborative Filtering:
- "Cold Start" problem: How to rank new items that few users have rated? Or show something reasonable to new users who have rated few items?
- Cannot use side information about users or items; it uses only the ratings themselves
- Item ex: Genre, movie stars, studio, ...
- User: Demographics (age, gender, location), expressed preferences, ...
Many important applications of recommender systems involve binary labels, not just a rating from 0-5.
- Example with item recommendation:
- Did user $j$ purchase an item? (0/1/?)
- Did user $j$ like an item? (0/1/?)
- Did user $j$ spend at least 30 secs. with an item? (0/1/?)
- Did user $j$ click on an item? (0/1/?)
- Meaning of ratings
- 1: engaged after being shown
- 0: did not engage after being shown
- ?: item not yet shown to the user
Now, instead of regression, we use binary classification (logistic regression).
Previously: predict the rating $y^{(i,j)}$ directly as $w^{(j)} \cdot x^{(i)} + b^{(j)}$
For binary labels, we use the logistic function: predict that the probability of $y^{(i,j)} = 1$ is given by $g(w^{(j)} \cdot x^{(i)} + b^{(j)})$, where $g(z) = \frac{1}{1+e^{-z}}$
From back in supervised learning, we saw that normalizing can help the algorithm run faster and more efficiently.
Using the movie recommendation example from earlier, the normalization process computes the average rating of each movie (row) and collects these averages in a vector $\mu$.
- Instead of ratings 0-5, we subtract the vector $\mu$ from each row element-wise (some entries, e.g., those originally 0, can become negative; this is handled in the prediction)
- For user $j$ on movie $i$, predict $w^{(j)} \cdot x^{(i)} + b^{(j)} + \mu_i$
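A small sketch in plain Python (using `None` for unrated entries; the ratings are made up):

```python
# Mean normalization: per-movie average over *rated* entries only, then
# subtract that average from each rated entry.
def mean_normalize(Y):
    mu = []
    for row in Y:
        rated = [r for r in row if r is not None]
        mu.append(sum(rated) / len(rated))
    Y_norm = [[(r - m) if r is not None else None for r in row]
              for row, m in zip(Y, mu)]
    return Y_norm, mu

Y = [[5, 5, 0, None],     # movie 0: rated by users 0-2, unrated by user 3
     [0, None, 4, None]]  # movie 1
Y_norm, mu = mean_normalize(Y)
print(mu)  # [3.33..., 2.0]
# A brand-new user (w = 0, b = 0) is then predicted the movie average mu_i, not 0.
```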
Collaborative filtering may recommend items to you based on ratings of users who gave similar ratings as you. Content-based filtering may recommend items to you based on the features of the user and the item to find a good match.
- We will continue to use $r(i,j) = 1$ to indicate that user $j$ has rated item $i$ and $y^{(i,j)}$ as the rating given by user $j$, but with more features
- We denote the feature vectors by $x_u^{(j)}$ for user $j$ and $x_m^{(i)}$ for movie $i$; these vectors could differ in size
Predict the rating of user $j$ on movie $i$ as:
- (EQUATION) $v_u^{(j)} \cdot v_m^{(i)}$
  - $v_u^{(j)}$ computed from $x_u^{(j)}$
  - $v_m^{(i)}$ computed from $x_m^{(i)}$
  - Why "computed"? Since we are taking the dot product, both $v$ vectors have to be the same size, whereas their $x$ counterparts could differ in size
  - This is like the linear regression algorithm but without the $b$ term
How do we compute the vectors $v_u^{(j)}$ and $v_m^{(i)}$? With two neural networks: a user network and a movie (item) network.
- The number of units in each hidden layer does not matter, but the two networks' output layers must have the same size/dimension
- Then, we train both networks with a cost function:
  - (EQUATION) $J = \sum_{(i,j):r(i,j)=1} (v_u^{(j)} \cdot v_m^{(i)} - y^{(i,j)}) ^2$ + NN regularization term
  - This trains all the parameters of the user and movie networks
How can you efficiently generate a recommendation from a large set of items? Two steps: Retrieval & Ranking
- (DEF) Retrieval: Generate a large list of plausible item candidates, then combine the retrieved items into one list, removing duplicates and items already watched/purchased
  - E.g., for each of the last 10 movies watched by the user, find the 10 most similar movies
  - Retrieving more items results in better performance but slower recommendations
  - To analyze/optimize the trade-off, carry out offline experiments to see if retrieving additional items results in more relevant recommendations (i.e., higher $p(y^{(i,j)} = 1)$ for the items displayed to the user)
- (DEF) Ranking: Take the retrieved list and rank it using the learned model (NN), then display the ranked items to the user
(DEF) Principal Component Analysis (PCA): An unsupervised learning algorithm that is commonly used for visualization, specifically in applications with lots of features (hard to plot, say, 50 features)
- The goal is to reduce the number of features to 2-3 to graph and visualize
- To get there, PCA finds a new axis and coordinates from many features (e.g., car length and height → size). The axis is not in another dimension but "overlaid" on the graph and coordinates of the original features, confusingly called the "z-axis"
- Notation also changes: on the axes, we go from the original features $x_1, x_2, ..., x_n$ to maybe $z_1, z_2$ after PCA
When we choose an axis, we project the coordinates onto this new axis ("z-axis")
- How do we choose an axis? We draw a "line" along which the variance is large (points spread out), so that we capture important info from the original data
- If the variance is small (points squished together), then that is an indicator of a bad axis selection
- When we achieve the max variance possible, then the axis is called the principal component
We can have 2-3 of these principal component axes. How so? By drawing the additional axes at a perpendicular (90 deg) angle to the 1st principal component axis.
NOTE: PCA is not linear regression. Linear regression attempts to minimize distance in the vertical direction (ground truth y-axis). In contrast, PCA treats the axes equally and tries to minimize distance according to a "z-axis" to maximize variance.
Now, let's go the other way. How can we find the (approximate) original values of $x_1, x_2$ from the projection $z$?
(DEF) Reconstruction: Approximates the original values of $x_1, x_2$ from $z$
- We multiply $z$ by the unit-length vector of the principal component axis, expressed in $x_1, x_2$ coordinates
PCA Algorithm:
- Optional Pre-processing: perform feature scaling
- "Fit" the data to obtain 2 (or 3) new axes (principal components) (we can use sklearn's
fitto do this automatically. Includes mean normalization) - Optionally examine how much variance is explained by each principal component (using sklearn's
explained_variance_ratio) - Transform (project) the data onto new axes
Reinforcement Learning (RL) is not widely applied in commercial applications today but is one of the pillars of machine learning. RL is not classified under supervised learning or unsupervised learning but is its own category.
- Examples of applications used today: controlling robots, factory optimization, financial (stock) trading, playing games (incl. video games)
The task is to find a function (policy) that maps a state $s$ to an action $a$ so as to maximize the expected return.
Example Scenario with an Autonomous Helicopter:
- Positive reward: Helicopter flying well (+1)
- Negative reward: Helicopter flying poorly (-1000)
Why not use supervised learning? For example, when controlling a robot, it is very difficult to obtain a dataset of states $x$ labeled with the ideal action $y$ to take. Instead, we specify a reward function and let the algorithm figure out good actions on its own.
Return refers to cost/benefit: When given a scenario of either walking 1 minute to get $5 vs. walking 30 minutes to get $10, wouldn't the $5 be more convenient?
(EQUATION) Return $R = R_1 + \gamma R_2 + \gamma^2 R_3 + ...$ (until the terminal state)
(DEF) Discount Factor ($\gamma$): A number slightly less than 1 (e.g., 0.9, 0.99) that weights rewards received sooner more heavily than rewards received later
- Notice that there is no discount on the first term, meaning full reward if achieved on the first step (as in, it is currently on a reward)
- The number of steps includes its current state
- The purpose of the discount factor is to punish additional steps/actions
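A tiny sketch of the return computation (plain Python; the rewards and $\gamma$ are made up):

```python
# Return = R1 + gamma*R2 + gamma^2*R3 + ... ; the first reward is undiscounted.
def discounted_return(rewards, gamma):
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# A reward of 100 reached in 3 steps vs 40 reached in 1 step, gamma = 0.5:
print(discounted_return([0, 0, 100], 0.5))  # 25.0
print(discounted_return([0, 40], 0.5))      # 20.0
# The discount punishes the extra steps, yet the farther reward still wins here.
```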
In essence, there are many ways to get a reward of some sort, whether small or large. What specific set of instructions should we follow? In RL, we can develop a policy whose job is to take a state $s$ as input and map it to an action $a$.
(DEF) Policy ($\pi$): A function $\pi(s) = a$ mapping each state $s$ to the action $a$ to take in that state
(DEF) Markov Decision Process (MDP): Future actions depend only on the current state, not on how we got there
- Agent ($\pi$) → Action $a$ → Environment/World → State $s$, Reward $R$ → back to Agent
(DEF) State-Action Value Function ($Q(s,a)$): The return if you start in state $s$, take action $a$ once, and then behave optimally afterwards
- Depending on the number of actions $a$, there are many possible values of $Q(s,a)$
- The best possible return from state $s$ is $\max_a Q(s,a)$, so our policy $\pi(s)$ picks the action $a$ that achieves that maximum
- Also known as the Optimal $Q$-function, written $Q^*(s,a)$ in some literature
Now, how do we compute these values of $Q(s,a)$? With the Bellman Equation.
(EQUATION) Bellman Equation: $Q(s,a) = R(s) + \gamma \max_{a'} Q(s', a')$
- $R(s)$: reward of the current state
- $s$: current state
- $a$: current action
- $s'$: state you get to after taking action $a$
- $a'$: action that you take in state $s'$
- $\gamma$: discount factor
- Split into two parts:
  - (DEF) Immediate Reward: $R(s)$ (the $R_1$ term), the reward for starting out in some state; we do not apply the discount $\gamma$ to it
  - The second term, $\gamma \max_{a'} Q(s',a')$, represents the future reward
  - The sequence may look like $Q(s,a) = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + ...$
Difference between Return and State-Action Value: Return is simply the best (max) state-action value at each state
In some applications, when you take an action, the outcome is not always completely reliable (in other words, random; stochastic) -- there may be an unaccounted factor.
In stochastic environments, we don't look for the maximum return because the number can be random, but the average value of the sum of discounted rewards.
(DEF/EQUATION) Expected Return = $E[R_1 + \gamma R_2 + \gamma^2 R_3 + ...]$
- $E[\cdot]$ is shorthand for the expected (average) value of the sum of discounted rewards
Compared to discrete states (finite sets of numbers), continuous states may involve infinitesimally small numbers in a set range. Continuous states may also be vectors that include not just the position states but also velocity, rotation, etc.
Discrete State Example: 1, 2, 3, 4, 5, 6 in range 0-6
Continuous State Example: a vehicle's state might be the vector $s = [x, y, \theta, \dot{x}, \dot{y}, \dot{\theta}]$ (positions, orientation, and their velocities), where each component can take any value within its range
The key idea is to train a neural network to compute/approximate the state-action value function $Q(s,a)$ (this is the Deep Q-Network, DQN, approach).
The number of hidden layers and their units may vary, but the output layer will always have one unit. The input $\vec{x}$ is the state $s$ and the action $a$ combined (e.g., the state vector concatenated with a one-hot encoding of the action).
In essence, in a state $s$, use the network to compute $Q(s,a)$ for every possible action $a$, then pick the action that maximizes $Q(s,a)$.
What will be our training set? Since we don't have a policy yet, we can take various actions (sometimes randomly), observe the rewards, and record many tuples $(s, a, R(s), s')$.
Full Deep-Q Learning Process:
- Initialize the neural network randomly as a guess of $Q(s,a)$
- Repeat:
  - Take actions, obtaining tuples $(s, a, R(s), s')$
  - Store the (some number, e.g., 10,000) most recent $(s, a, R(s), s')$ tuples (the Replay Buffer)
  - Train the neural network:
    - Create a training set of (some number, e.g., 10,000) examples using $x = (s^{(n)}, a^{(n)})$ and $y^{(n)} = R(s^{(n)}) + \gamma \max_{a'} Q(s'^{(n)}, a')$
    - Train $Q_{new}$ such that $Q_{new}(s,a) \approx y$
    - Set $Q = Q_{new}$
Enhancement to Improve Efficiency
- Rather than carrying out inference separately 4 times for $Q(s, \textrm{nothing}), Q(s, \textrm{left}), Q(s, \textrm{main}), Q(s, \textrm{right})$, we could instead train the neural network to output all four of these values simultaneously
- The output layer will now have four units, which still map to $Q(s, \textrm{nothing}), Q(s, \textrm{left}), Q(s, \textrm{main}), Q(s, \textrm{right})$, and we still pick the action $a$ that maximizes $Q(s,a)$. This requires only one inference pass.
How do you choose actions while still learning?:
- Option 1:
  - Pick the action $a$ that maximizes $Q(s,a)$
- Option 2 ($\epsilon$-greedy policy, e.g., $\epsilon = 0.05$):
  - With probability 0.95, pick the action $a$ that maximizes $Q(s,a)$ - Greedy, "Exploitation"
  - With probability 0.05, pick an action $a$ randomly - "Exploration"
The $\epsilon$-greedy approach helps the algorithm overcome a poor initial estimate of $Q(s,a)$: without random exploration, it might never try actions it wrongly believes are bad.
- One modification is to start $\epsilon$ high (more exploration), then gradually decrease it (more exploitation)
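A sketch of the $\epsilon$-greedy choice (plain Python; the $Q$ values are made up):

```python
import random

# With probability epsilon explore (random action); otherwise exploit
# (the action with the highest current Q(s, a) estimate).
def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                  # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation

rng = random.Random(0)
q = [0.1, 0.9, 0.3, 0.2]        # hypothetical Q(s, a) for 4 actions
picks = [epsilon_greedy(q, 0.05, rng) for _ in range(1000)]
print(picks.count(1) / 1000)    # close to 0.95 + 0.05/4
```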
Mini-batch and Soft Updates are refinements we can make to the RL algorithm (and mini-batches are also applicable to supervised learning applications as well)
(DEF) Mini-Batch: Pick some subset of $m' \ll m$ examples to use on each update step instead of the full training set
- In terms of supervised learning, rather than performing gradient descent on, say, 100,000,000 examples each step, we only perform it on a "mini-batch" (say, 1000) to improve performance. The next iteration may take the next unique 1,000 examples of the 100,000,000
- On average, not as reliable and much more noisy, but significantly less computationally expensive (we typically use Adam in supervised learning instead as a result)
- The modification here with regard to Deep-Q applies to "Create a training set", where instead of 10,000, we may choose 1,000
(DEF) Soft Updates: Updates a parameter to be the sum of a small fraction of the new with the large majority being the old. The fraction is expected to add up to 1.
- Ex: $W = 0.01 W_{new} + 0.99W$, $B = 0.01 B_{new} + 0.99B$
- Without soft updates, the parameter updates would be $W = 1 W_{new} + 0W$, $B = 1 B_{new} + 0B$
- The purpose is to make a gradual change to $Q$ (the neural network params $W, B$), causing the RL algorithm to converge more reliably (similar in spirit to the learning rate $\alpha$)
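A sketch of one soft update step (plain Python, with a made-up fraction $\tau = 0.01$):

```python
# Soft update: new parameters are a small fraction of the freshly trained
# values plus a large fraction of the old ones (fractions sum to 1).
def soft_update(old, new, tau=0.01):
    return [tau * n + (1 - tau) * o for o, n in zip(old, new)]

W = [1.0, -2.0]                  # current network parameters
W_new = [3.0, 0.0]               # parameters after the latest training step
W = soft_update(W, W_new)        # nudge W only slightly toward W_new
print(W)  # ~[1.02, -1.98]
```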
Limitations of Reinforcement Learning:
- Much easier to get to work in a simulation than a real robot
- Far fewer applications than supervised and unsupervised learning
However, there is an exciting research direction with potential for future applications.
Huge thanks to Andrew Ng and the Machine Learning Specialization team!