These are my working notes in ML, both foundational and recent. I write them as a way to absorb good ideas and think out loud — they're opinionated, often incomplete, and updated over time. They're also interlinked; feel free to wander.
Evolve diverse populations of challenges and agents without an archive or pre-defined behavioral dimensions.
How LLM agents can be used to pursue mathematical research at scale, by Terence Tao et al.
How to make LLMs more causally consistent.
How to explore RL search space better than intrinsic motivation.
Learn agentic architecture without manual design.
Learn policy in open-ended loop with self-play and LLMs.
How can agents improve themselves? Self-modify, validate whether the change works, and maintain a library of previous agents as stepping stones.
How to measure notions of interestingness for learning agents without hand-coded formulas
How to use LLM agents to drive scientific workflow
Create and maintain rich diversity of high performing solutions, often outperforming direct performance optimization.
How can LLMs be scaffolded together to generate new breakthroughs?
How do humans approach exploration? Turns out it's empowerment, with entropy early on.
Train a model with recursion without the large cost of BPTT
Small transformers lack multi-step reasoning ability. Use latent recurrence in hidden states instead of explicit chain-of-thought tokens.
Flat latent reasoning has limited depth. Add hierarchical structure to latent recurrence for deeper reasoning in transformers.
Reduce inference cost from O(n^3) to O(n^2) in autoregressive generation
Multi-head attention KV cache is too large for efficient inference. Share KV heads across query head groups to reduce memory.
Scaling all transformer parameters is too expensive. Route each token to a subset of expert layers to scale capacity without proportional compute.
Make the size of intermediate representations much smaller
Absolute position embeddings don't generalize to unseen sequence lengths. Encode relative position through rotation of query and key vectors.
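A minimal NumPy sketch of the rotation idea (the `rope` helper is illustrative; the base 10000 and the pairing of even/odd feature dimensions follow the usual convention):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each (even, odd) feature pair by an angle proportional to the
    # token position; lower pairs rotate faster than higher ones.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
# Scores depend only on the relative offset: positions (3, 5) and (10, 12)
# both differ by 2, so the rotated dot products agree.
s1 = rope(q, 3) @ rope(k, 5)
s2 = rope(q, 10) @ rope(k, 12)
```

Because a rotation by angle a followed by the transpose of a rotation by angle b composes to a rotation by b - a, the query-key dot product sees only the position difference.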
How to find the most problematic data points automatically
Fine-tuning LLMs on narrow tasks can unexpectedly produce misaligned behavior on unrelated tasks.
Deal with class imbalance and calibration in a principled way
Learn from a preference dataset without a complicated RL setup
How to infer ratings from a dataset of outcomes of binary comparisons
Avoid learning an explicit value function in RL alignment setup
Need a computationally tractable distance measure between empirical distributions without requiring density estimation.
Layer normalization is expensive due to mean centering. Normalize using only root mean square for equivalent effect with less cost.
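A minimal sketch of the RMSNorm computation (the `gain` parameter and epsilon value are illustrative):

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Normalize by the root mean square only: no mean subtraction, unlike
    # LayerNorm, which both centers and rescales.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.array([3.0, -4.0])       # mean(x^2) = 12.5
y = rms_norm(x, gain=np.ones(2))
```

The output has unit RMS but keeps the input's (uncentered) direction, which is the cost saving: no mean statistic to compute or subtract.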
Raw action values mix state quality with action quality. Subtract the state value baseline to isolate the advantage of each action.
Q-learning with tabular methods doesn't scale to large state spaces. Approximate the Q-function with a deep neural network.
The RL task is partially observable or needs memory of past states (non-Markov)
Many probability distributions are consistent with known constraints. Choose the one with maximum entropy as the least biased estimate.
How do you train interdependent neural networks without them destabilizing each other?
Agent cannot observe the full environment state. Maintain a belief state or use memory to act under incomplete information.
Dot products between high-dimensional vectors grow larger with dimension, causing training instability. Scale them by 1/√d to keep their variance constant.
Sigmoid/tanh activations saturate and cause vanishing gradients. Use max(0,x) for sparse, non-saturating activation.
How to quantify the information a random variable carries about an unknown parameter? Use the curvature of the log-likelihood.
Input distribution changes between training and test time while the conditional label distribution stays the same.
Training and deployment data come from different distributions, degrading model performance.
Greedy decoding misses high-probability sequences. Maintain top-k partial hypotheses at each step for better approximate search.
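A toy sketch of the top-k bookkeeping (the per-step distributions here are context-free for brevity; a real decoder scores tokens conditioned on each prefix):

```python
import math

def beam_search(step_logprobs, beam_width=2):
    # step_logprobs: one {token: logprob} dict per position.
    beams = [((), 0.0)]                       # (sequence, cumulative logprob)
    for dist in step_logprobs:
        candidates = [(seq + (tok,), lp + tok_lp)
                      for seq, lp in beams
                      for tok, tok_lp in dist.items()]
        # Keep only the top-k partial hypotheses by cumulative log-prob.
        beams = sorted(candidates, key=lambda b: -b[1])[:beam_width]
    return beams

steps = [{"a": math.log(0.6), "b": math.log(0.4)},
         {"a": math.log(0.3), "b": math.log(0.7)}]
best_seq, best_lp = beam_search(steps)[0]
```

Greedy decoding would pick "a" then "b" here too, but beam search also keeps "b..." alive in case a later step redeems it.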
How to quantify how alike two data points or distributions are? Use distance metrics, kernels, or divergences depending on the setting.
Single models have high variance and limited accuracy. Combine multiple models to reduce error through averaging or boosting.
Learn robust (non-correlational) and data-efficient RL agents with full analytical gradients from a learned environment model.
CNNs lose spatial hierarchy and part-whole relationships through pooling. Use capsules with routing-by-agreement to preserve equivariance.
Need very strong prediction intervals or high accuracy with limited data
The reward function is unknown but expert demonstrations are available. Infer the reward that explains observed expert behavior.
Specifying a reward function for complex tasks like language generation is intractable. Learn a reward model from human preferences and optimize with RL.
Cross-entropy loss weighs all polynomial terms equally. Reweight the polynomial expansion of the loss for task-specific improvement.
Training data has highly skewed class distribution, biasing the model toward majority classes.
Model is uncertain or likely wrong on some inputs. Learn when to defer predictions to a human expert.
Models produce point predictions without knowing how confident they are. Estimate prediction uncertainty for safer decision-making.
Complex models overfit while simple models underfit. Use Bayesian model evidence to automatically trade off fit and complexity.
Find the optimal low-rank approximation of any matrix (square or rectangular)
Predict user preferences without content features. Leverage patterns in user-item interaction matrices to recommend items.
Pointwise scoring doesn't optimize for ranking quality. Learn to directly optimize document ordering for search relevance.
TRPO's constrained optimization is complex to implement. Use a clipped surrogate objective to approximate trust region behavior simply.
Fixed-vocabulary tokenizers can't handle unseen words. Iteratively merge the most frequent character pairs to build a subword vocabulary.
Fixed vocabulary tokenizers can't handle open vocabularies. Use a unigram language model to learn a subword vocabulary from raw text.
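The frequency-merge loop behind byte-pair encoding can be sketched in a few lines (corpus and merge count are illustrative):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # words: {word: count}; each word starts as a tuple of characters.
    vocab = Counter({tuple(w): c for w, c in words.items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for word, count in vocab.items():    # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] += count
        vocab = merged
    return merges

merges = bpe_merges({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
```

After two merges the vocabulary contains "lo" and then "low" as single units; rare suffixes like "est" stay decomposed.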
Manual labeling is expensive. Use existing knowledge bases to automatically generate noisy training labels.
In POMDPs the agent receives observations, not states. Maintain and update a belief distribution over hidden states.
How to automatically evaluate machine translation quality? Compare n-gram overlap between generated and reference translations.
RNNs are slow for text classification due to sequential processing. Apply convolutions over word embeddings to capture local n-gram features in parallel.
Neural networks overfit by co-adapting neurons. Randomly drop units during training to regularize and approximate ensemble averaging.
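A sketch of inverted dropout, the variant used in practice (drop probability and sizes are illustrative):

```python
import numpy as np

def dropout(x, p_drop, training, rng):
    # Inverted dropout: scale kept units by 1/(1 - p) at train time so the
    # expected activation matches eval time, when dropout is a no-op.
    if not training:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, p_drop=0.5, training=True, rng=rng)
```

Each unit is either zeroed or doubled, so the expected value of every activation is unchanged and no rescaling is needed at inference.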
How to build an associative memory that stores and retrieves patterns? Use a recurrent network with symmetric weights that minimizes an energy function.
Self-attention is permutation invariant and has no notion of token order. Inject position information to preserve sequence structure.
Feedforward networks can't handle variable-length sequences or temporal dependencies. Use recurrent connections to maintain hidden state over time.
RNNs process sequences serially and struggle with long-range dependencies. Use self-attention to process all positions in parallel.
Fixed-size representations bottleneck sequence-to-sequence models. Dynamically attend to relevant parts of the input at each decoding step.
Self-attention costs O(n²) in sequence length. Use sparse or linear approximations to handle longer sequences.
Pure model-free RL is sample-inefficient. Interleave real experience with simulated experience from a learned model.
Monte Carlo advantage has low bias but high variance; TD has low variance but high bias. Exponentially weight multi-step TD errors to control the tradeoff.
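The exponential weighting (generalized advantage estimation) in a short sketch; lam = 1 recovers Monte Carlo advantages, lam = 0 recovers one-step TD errors:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    # values carries one extra bootstrap entry for the post-terminal state.
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running   # discounted sum of deltas
        advantages[t] = running
    return advantages

# With gamma = lam = 1 this equals full-return advantage: G_t - V(s_t).
adv = gae([1.0, 1.0], [0.5, 0.5, 0.0], gamma=1.0, lam=1.0)
```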
Vanilla policy gradient steps in parameter space distort the policy unevenly. Use the Fisher information to take equal-size steps in distribution space.
Uniform random updates in model-based RL waste computation on low-priority states. Prioritize updates by expected value change.
Policy gradient updates can be too large and collapse performance. Constrain the KL divergence between old and new policies for monotonic improvement.
How to recursively define the expected long-term return from a state? Express value as immediate reward plus discounted future value.
How to compute optimal policies when the full MDP model is known? Iteratively apply Bellman updates to converge to optimal value functions.
Storing all past rewards to compute action value averages is memory-inefficient. Use incremental running averages instead.
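The incremental form in two lines, Q_{n+1} = Q_n + (x - Q_n)/(n + 1):

```python
def update_mean(mean, count, new_value):
    # Running sample mean: no need to store past values.
    count += 1
    mean += (new_value - mean) / count
    return mean, count

mean, n = 0.0, 0
for r in [2.0, 4.0, 6.0]:
    mean, n = update_mean(mean, n, r)
```

Replacing 1/count with a constant step size gives an exponentially weighted average, useful for nonstationary rewards.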
How to mathematically formalize sequential decision-making? Define states, actions, transitions, and rewards with the Markov property.
How to balance exploring unknown options vs exploiting the best known option to maximize cumulative reward?
Off-policy learning with function approximation can diverge (the deadly triad). Use importance sampling corrections or gradient methods for stability.
Tabular value functions don't scale to large state spaces. Use on-policy function approximation to generalize values across similar states.
REINFORCE has high variance from using full returns. Use a learned value function (critic) as baseline to reduce policy gradient variance.
How to optimize a policy when environment dynamics are unknown? Use sampled returns to estimate the policy gradient via Monte Carlo rollouts.
How to formalize sequential decision-making under uncertainty? Define agents, environments, states, actions, and rewards.
Monte Carlo methods require waiting until episode end to update. Bootstrap from current value estimates to learn online from incomplete episodes.
Value-based methods struggle with continuous actions or stochastic policies. Directly differentiate expected return w.r.t. policy parameters.
How do you estimate expectations under a target distribution when you only have samples from a different distribution?
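Importance sampling in a sketch: reweight each sample by the density ratio p/q (distributions here are illustrative; both are unit-variance normals, so normalizing constants cancel):

```python
import numpy as np

rng = np.random.default_rng(0)
# Target p = N(0, 1); samples come from the proposal q = N(1, 1).
x = rng.normal(1.0, 1.0, size=500_000)
log_w = -0.5 * x**2 + 0.5 * (x - 1.0) ** 2   # log p(x) - log q(x)
w = np.exp(log_w)
# E_p[x^2] = 1 for a standard normal; estimate it from q-samples only.
estimate = np.mean(w * x**2)
```

The estimator is unbiased, but its variance explodes when q has thin tails relative to p, which is why the proposal choice matters.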
Model dynamics are unknown and bootstrapping introduces bias. Use complete episode returns for unbiased value estimates.
Fully connected layers ignore spatial structure and have too many parameters for grid data. Use local weight-sharing filters that exploit translation invariance.
Language models only use left context, missing bidirectional understanding. Mask random tokens and train to predict them using full context.
Standard learning trains from scratch for each new task. Learn to learn so new tasks can be solved with very few examples.
Standard neural networks can't operate on graph-structured data. Generalize convolutions to graphs by aggregating neighbor features.
Task-specific fine-tuning from random initialization needs too much data. Find an initialization that can be quickly adapted to new tasks in few gradient steps.
TF-IDF doesn't account for document length or term frequency saturation. Use probabilistic term weighting with length normalization.
Can't deploy a new ranking policy just to evaluate it. Use logged interaction data with importance weighting to estimate performance offline.
Offline metrics may not reflect real user satisfaction. Use interleaving or A/B tests to evaluate ranking quality from live user behavior.
Directly optimizing IR metrics like NDCG is non-differentiable. Define implicit gradients (lambdas) that approximate the desired metric-driven update.
Pairwise ranking losses don't consider the full list structure. Define a distribution over permutations and optimize list-level likelihood.
Pointwise scoring ignores relative document ordering. Learn pairwise preferences using a cross-entropy loss over document pairs.
Traditional IR metrics assume independent document relevance. Model user browsing as a cascade where each document's value depends on those above it.
Nyquist sampling requires too many measurements for sparse signals. Recover sparse signals from far fewer random measurements.
Learned representations entangle multiple factors of variation. Separate independent generative factors into distinct latent dimensions.
How to quantify prediction error to guide optimization? Choose objective functions that align with the task and have good gradient properties.
Learning the environment model is hard or unnecessary. Learn value functions or policies directly from interaction without modeling dynamics.
Q-learning with neural networks is unstable due to correlated samples and moving targets. Use experience replay and target networks for stability.
Linear transformations alone can't learn nonlinear decision boundaries. Apply nonlinear functions element-wise to enable universal approximation.
Neural networks output confident but unreliable probabilities. Adjust predicted probabilities to match true outcome frequencies.
Fully connected networks ignore spatial structure and have too many parameters for images. Use local receptive fields with shared weights for spatial hierarchy.
Standard CNNs are only translation equivariant. Generalize convolutions to be equivariant to rotations, reflections, and other symmetry groups.
How to detect interest points in images? Find locations where intensity changes significantly in all directions.
How to partition data into k coherent groups without labels? Iteratively assign points to nearest centroid and update centroids.
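The two alternating steps in a compact sketch (initialization by sampling data points; a fixed iteration budget stands in for a convergence check):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid ...
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # ... update step: each centroid moves to the mean of its points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = kmeans(X, k=2)
```

Each step can only decrease the within-cluster squared distance, so the loop converges, though only to a local optimum that depends on initialization.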
Analyzing frequency content of discrete signals requires transforming from time domain. Decompose signal into sum of complex sinusoids.
Detecting geometric shapes in noisy images with edge detection alone is fragile. Vote in parameter space to robustly find lines and circles.
Exhaustive game tree search is intractable for large state spaces. Use random simulations to selectively expand promising branches.
How to evaluate expected long-term reward in a stochastic process without actions? Define value functions over Markov chains with rewards.
Standard MDPs assume fixed time steps. Extend MDPs to handle actions with variable durations.
Computing exact gradients over full datasets is too expensive. Use random mini-batch samples to get unbiased gradient estimates.
Model-free RL is sample-inefficient. Learn a model of environment dynamics and plan or generate synthetic experience from it.
Analytical computation of expectations is intractable. Approximate expectations by averaging random samples.
Exact posterior inference is intractable for complex models. Approximate the posterior with a simpler distribution by minimizing KL divergence.
Explicit density models are restricted by tractability requirements. Implicit models like GANs generate samples without computing likelihoods.
Generating realistic samples without tractable density estimation. Train a generator and discriminator adversarially to learn the data distribution.
KL divergence goes to infinity when the distributions have no support overlap, resulting in no gradient
How to measure how one probability distribution differs from another? Compute the expected excess surprise from using the wrong distribution.
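The discrete definition directly, KL(p || q) = sum_i p_i log(p_i / q_i):

```python
import math

def kl_divergence(p, q):
    # Expected excess surprise from coding samples of p with a code for q.
    # Terms with p_i = 0 contribute nothing (0 * log 0 = 0 by convention).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
kl = kl_divergence(p, q)
```

Note the asymmetry: kl_divergence(p, q) and kl_divergence(q, p) generally differ, which is why KL is a divergence rather than a distance.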
Most generative models can't compute exact likelihoods. Use invertible transformations to get both exact density evaluation and efficient sampling.
Words combine into sentences with complex meaning. Build structured representations that capture compositional semantics.
Multiple expressions in text refer to the same entity. Identify which mentions correspond to the same real-world referent.
GANs suffer from mode collapse, training instability, and vanishing generator gradients.
Deeper networks are more expressive but harder to train due to vanishing/exploding gradients and optimization challenges.
How to estimate model parameters from observed data? Find parameters that maximize the probability of the observations.
A single global learning rate is suboptimal for all parameters. Adapt learning rates per-parameter based on gradient history.
How to model complex joint distributions tractably? Factor into a product of conditionals and model each sequentially.
Standard backprop doesn't handle recurrent connections over time. Unroll the recurrence and backpropagate through the full sequence.
How to learn an undirected probabilistic generative model? Use stochastic binary units with symmetric connections to model the data distribution.
Deep networks face vanishing/exploding gradients, saddle points, and ill-conditioned loss landscapes.
Standard GANs have no control over what they generate. Condition generator and discriminator on class labels or other inputs.
MCMC-based learning in energy models requires expensive sampling to convergence. Approximate the gradient using only a few Gibbs sampling steps.
The variance of the score-function estimator (for non-differentiable optimization) is too high to be usable
Standard GANs don't learn interpretable latent representations. Maximize mutual information between latent codes and outputs for disentangled generation.
Standard RNNs forget over long sequences due to vanishing gradients. Use gated memory cells to selectively remember and forget over long horizons.
Observed data alone may not capture underlying structure. Introduce hidden variables to explain data through simpler latent factors.
Internal covariate shift slows convergence and requires careful tuning. Normalize activations to stabilize and accelerate training.
The forward pass involves sampling from distributions whose parameters you're optimizing, and the loss function is differentiable
Generating high-quality images requires modeling complex pixel dependencies. Use autoregressive RNNs to generate images one pixel at a time.
Need gradients through non-differentiable stochastic operations. Use the log-derivative trick to estimate gradients from samples.
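The trick in one sketch: grad_theta E[f(x)] = E[f(x) * grad_theta log p(x)], estimated here for f(x) = x^2 under x ~ N(theta, 1), whose true gradient is 2*theta (the choice of f and theta is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=200_000)
# Score function of N(theta, 1): d/dtheta log p(x) = (x - theta).
grad_est = np.mean((x ** 2) * (x - theta))
# True gradient: d/dtheta E[x^2] = d/dtheta (theta^2 + 1) = 2 * theta.
```

No gradient ever flows through the sampling step itself, which is what makes this usable for discrete or otherwise non-differentiable f. The price is variance, hence the baselines and control variates used in practice.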
Autoencoders don't provide a proper generative model with meaningful latent space. Optimize a variational lower bound for principled generation.
Poor initialization causes exploding or vanishing activations. Initialize weights to preserve signal variance across layers.
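A sketch of variance-preserving (Glorot/LeCun-style) initialization, assuming unit-variance inputs and a linear layer; layer widths are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 2048, 2048
# Weight variance 1 / fan_in: each output is a sum of fan_in terms of
# variance 1/fan_in, so pre-activations keep roughly unit variance.
W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))
x = rng.normal(0.0, 1.0, size=fan_in)
h = x @ W
```

Stacking many such layers keeps signal magnitude roughly constant in the forward pass; ReLU networks use 2 / fan_in (He initialization) to compensate for the halved variance after rectification.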
Discriminative models can't generate new data or capture the full data distribution. Generative models enable sampling, density estimation, and unsupervised learning.
Many phenomena cluster around a mean with known spread. The Gaussian is the maximum entropy distribution for known mean and variance.
How to learn compact representations without labels? Train a network to reconstruct its input through a bottleneck layer.
How to model complex distributions without explicit normalization? Assign low energy to likely configurations and learn the energy function.
Proving convexity/concavity properties and establishing bounds, such as KL non-negativity and convexity of loss functions.
Exact computation of the MLE is not possible
LSTMs are effective but have many parameters. Simplify the gating mechanism while retaining the ability to capture long-range dependencies.
How to understand the implicit smoothing behavior of a regression model? Express predictions as a kernel-weighted sum of training targets.
Batch normalization depends on batch statistics and fails with small batches or recurrent nets. Normalize across features within each example instead.
Point estimates don't capture parameter uncertainty. Use Bayes' theorem to maintain a full posterior distribution over parameters.
How to model binary classification probabilities? Apply a sigmoid to a linear function and optimize with cross-entropy loss.
Ordinary linear regression gives point estimates with no uncertainty. Place priors on weights to get a full posterior and predictive distribution.
Simple models underfit (high bias) and complex models overfit (high variance). Understanding this tradeoff guides model selection.
Learn non-parametric approximations to unknown functions AND identify where these approximations are unreliable.
Linear models can't capture nonlinear relationships. Transform inputs through nonlinear basis functions to enable nonlinear modeling.
Evaluating on training data overestimates performance. Hold out different data subsets in rotation to get reliable generalization estimates.
MLE overfits with limited data by ignoring prior knowledge. Incorporate a prior and find the mode of the posterior instead.
Ordinary least squares overfits with limited data or many features. Add a penalty on weight magnitudes to constrain model complexity.
Full-batch gradient descent is too slow for large datasets. Update parameters using gradients from random mini-batches.
Many decision boundaries separate the training data. Find the maximum-margin hyperplane for best generalization.
How to optimize a function subject to equality constraints? Introduce multipliers to convert into an unconstrained saddle-point problem.
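A one-line worked example: maximize f(x, y) = xy subject to x + y = 1.

```latex
\mathcal{L}(x, y, \lambda) = xy + \lambda(1 - x - y), \qquad
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \quad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0
\;\Rightarrow\; x = y = \tfrac{1}{2}, \quad f = \tfrac{1}{4}.
```

Setting the gradient of the Lagrangian to zero in all variables (including lambda, which recovers the constraint) turns the constrained problem into an unconstrained stationary-point search.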
A single model struggles with heterogeneous data. Route inputs to specialized expert sub-models via a learned gating function.
High-dimensional data is hard to visualize and process. Project onto directions of maximum variance to reduce dimensionality.
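PCA via the SVD of the centered data matrix, in a short sketch (the synthetic data lies near the direction (1, 3), so that should be the first principal axis):

```python
import numpy as np

def pca(X, n_components):
    # Center, then take the top right-singular vectors as principal axes.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components

rng = np.random.default_rng(0)
t = rng.normal(size=(500, 1))
X = np.hstack([t, 3.0 * t]) + 0.01 * rng.normal(size=(500, 2))
Z, components = pca(X, n_components=1)
```

Using the SVD avoids forming the covariance matrix explicitly, which is both more numerically stable and cheaper when features outnumber samples.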
Discriminative models can't model the data-generating process. Learn the joint distribution for generation, missing data handling, and outlier detection.
A single Gaussian can't model multi-modal data. Use a weighted mixture of Gaussians to represent complex distributions.
Operating in high-dimensional feature spaces is computationally expensive. Use the kernel trick to compute inner products implicitly.
How to efficiently compute gradients in multi-layer networks? Apply the chain rule layer-by-layer from output to input.
How to measure the difference between predicted and true distributions? Compute the expected log-loss under the true distribution.
How to directly assign inputs to classes without modeling probabilities? Learn functions that map inputs to class-specific scores.
How to apply regression to classification? Fit class labels as continuous targets, though it has known limitations for multi-class.
How to learn a linear decision boundary from data? Iteratively adjust weights on misclassified examples.
How to choose between models of different complexity? Compare marginal likelihoods which automatically penalize unnecessary complexity.
How to make optimal decisions under uncertainty? Combine probability estimates with loss functions to minimize expected risk.