These are my working notes in ML, both foundational and recent. I write them as a way to absorb good ideas and think out loud — they're opinionated, often incomplete, and updated over time. They're also interlinked; feel free to wander.

Topics

Minimal Criterion Coevolution

Evolve diverse populations of challenges and agents without an archive or pre-defined behavioral dimensions.

Mathematical Exploration and Discovery at Scale (with AlphaEvolve)

How LLM agents can be used to pursue mathematical research at scale, by Terence Tao et al.

Darwin Gödel Machine (DGM)

How can agents improve themselves? Self-modify, keep changes only if they validate empirically, and maintain a library of previous agents as stepping stones.

Towards an AI co-scientist

How to use LLM agents to drive scientific workflows.

MAP-Elites

Create and maintain rich diversity of high performing solutions, often outperforming direct performance optimization.

AlphaEvolve

How can LLMs be scaffolded together to generate new breakthroughs?

Intrinsically-Motivated Humans and Agents in Open-World Exploration

How do humans approach exploration? Turns out it's empowerment, with entropy-driven exploration early on.

Tiny Reasoning Model (TRM)

Small transformers lack multi-step reasoning ability. Use latent recurrence in hidden states instead of explicit chain-of-thought tokens.

Autoregressive Generation and KV Caching in Transformers

Reduce total attention cost in autoregressive generation from O(n^3) to O(n^2) by caching each layer's keys and values across decoding steps.

Grouped Query Attention (GQA)

Multi-head attention KV cache is too large for efficient inference. Share KV heads across query head groups to reduce memory.

Mixture of Experts in Transformers (MoE)

Scaling all transformer parameters is too expensive. Route each token to a subset of expert layers to scale capacity without proportional compute.

Rotary Position Embeddings (RoPE)

Absolute position embeddings don't generalize to unseen sequence lengths. Encode relative position through rotation of query and key vectors.
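
A minimal numpy toy (not the full implementation) illustrating RoPE's key property: after rotating queries and keys by position-dependent angles, the attention score depends only on the relative offset between positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate feature pairs of x by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per pair
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
# The score depends only on the relative offset m - n (here 2 in both cases):
s1 = rope(q, 5) @ rope(k, 3)
s2 = rope(q, 9) @ rope(k, 7)
assert np.isclose(s1, s2)
```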

Confident Learning - Principled Data Cleaning

How to automatically find the most problematic data points.

Emergent Misalignment in LLMs

Fine-tuning LLMs on narrow tasks can unexpectedly produce misaligned behavior on unrelated tasks.

Focal Loss

Handle class imbalance and calibration in a principled way by down-weighting easy, well-classified examples.

Maximum Mean Discrepancy (MMD)

Need a computationally tractable distance measure between empirical distributions without requiring density estimation.

RMSNorm

Layer normalization is expensive due to mean centering. Normalize using only root mean square for equivalent effect with less cost.
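
A minimal numpy sketch of the idea: skip mean subtraction and divide by the root mean square alone.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-8):
    # No mean centering: divide by the root mean square alone.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.array([3.0, -4.0])            # RMS = sqrt((9 + 16) / 2)
out = rms_norm(x, gain=np.ones(2))
assert np.allclose(np.mean(out * out), 1.0)  # unit mean square after normalization
```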

Advantage Functions

Raw action values mix state quality with action quality. Subtract the state value baseline to isolate the advantage of each action.

Deep Q-Learning

Q-learning with tabular methods doesn't scale to large state spaces. Approximate the Q-function with a deep neural network.

Eligibility Trace

One-step TD assigns credit only to the most recent state. Keep exponentially decaying traces of visited states so a single TD error updates all recent predecessors.

Maximum Entropy Principle

Many probability distributions are consistent with known constraints. Choose the one with maximum entropy as the least biased estimate.

Multi-Network Training with Moving Average Target

How do you train interdependent neural networks without them destabilizing each other?

Partial Observability

Agent cannot observe the full environment state. Maintain a belief state or use memory to act under incomplete information.

High-Dimensional Dot Product Normalization

Dot products between high-dimensional vectors grow with dimension, destabilizing training. Scale by 1/√d to keep their variance constant.
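
A quick numerical check of why the scaling is needed: for i.i.d. unit-variance entries, the raw dot product has variance d, while dividing by √d keeps it near 1 regardless of dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (16, 256):
    q = rng.normal(size=(10000, d))
    k = rng.normal(size=(10000, d))
    raw = (q * k).sum(axis=1)       # variance grows like d
    scaled = raw / np.sqrt(d)       # variance stays near 1
    assert abs(np.var(scaled) - 1.0) < 0.1
```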

ReLU

Sigmoid/tanh activations saturate and cause vanishing gradients. Use max(0,x) for sparse, non-saturating activation.

Fisher Information

How to quantify the information a random variable carries about an unknown parameter? Use the curvature of the log-likelihood.

Covariate Shift

Input distribution changes between training and test time while the conditional label distribution stays the same.

Distribution Shift

Training and deployment data come from different distributions, degrading model performance.

Beam Search

Greedy decoding misses high-probability sequences. Maintain top-k partial hypotheses at each step for better approximate search.

Similarity Measures

How to quantify how alike two data points or distributions are? Use distance metrics, kernels, or divergences depending on the setting.

Ensemble Methods

Single models have high variance and limited accuracy. Combine multiple models to reduce error through averaging or boosting.

Dreamer

Learn robust (non-correlational), data-efficient RL agents by backpropagating full analytic gradients through a learned world model.

Capsule Networks (CapsNet)

CNNs lose spatial hierarchy and part-whole relationships through pooling. Use capsules with routing-by-agreement to preserve equivariance.

Conformal Prediction

Need prediction sets with guaranteed coverage and no distributional assumptions. Calibrate nonconformity scores on held-out data for valid finite-sample intervals.

Inverse Reinforcement Learning

The reward function is unknown but expert demonstrations are available. Infer the reward that explains observed expert behavior.

RLHF - Reinforcement Learning with Human Feedback

Specifying a reward function for complex tasks like language generation is intractable. Learn a reward model from human preferences and optimize with RL.

Polyloss

Cross-entropy implicitly fixes the coefficients of its polynomial expansion. Reweight the leading polynomial terms for task-specific improvement.

Class Imbalance

Training data has highly skewed class distribution, biasing the model toward majority classes.

Learning to Defer

Model is uncertain or likely wrong on some inputs. Learn when to defer predictions to a human expert.

Uncertainty in Machine Learning

Models produce point predictions without knowing how confident they are. Estimate prediction uncertainty for safer decision-making.

Model Complexity and Occams Razor

Complex models overfit while simple models underfit. Use Bayesian model evidence to automatically trade off fit and complexity.

Singular Value Decomposition (SVD)

Find the optimal low-rank approximation of any matrix (square or rectangular)

Collaborative Filtering

Predict user preferences without content features. Leverage patterns in user-item interaction matrices to recommend items.

Learning to Rank

Pointwise scoring doesn't optimize for ranking quality. Learn to directly optimize document ordering for search relevance.

PPO - Proximal Policy Optimization

TRPO's constrained optimization is complex to implement. Use a clipped surrogate objective to approximate trust region behavior simply.

Byte Pair Encoding

Fixed-vocabulary tokenizers can't handle unseen words. Iteratively merge the most frequent character pairs to build a subword vocabulary.
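
One merge step of the idea, as a minimal sketch (the toy corpus and helper names are illustrative): count symbol pairs weighted by word frequency, then merge the most frequent pair everywhere.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus maps a word (tuple of symbols) to its count."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

corpus = {("l","o","w"): 5, ("l","o","w","e","r"): 2, ("l","o","g"): 1, ("n","e","w"): 3}
pair = most_frequent_pair(corpus)   # ("l", "o") occurs 8 times, the most
corpus = merge(corpus, pair)
assert pair == ("l", "o")
assert ("lo", "w") in corpus
```

Repeating this merge loop until a target vocabulary size is reached yields the subword vocabulary.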

SentencePiece - Unigram LM Encoding

Fixed vocabulary tokenizers can't handle open vocabularies. Use a unigram language model to learn a subword vocabulary from raw text.

Distant Supervision

Manual labeling is expensive. Use existing knowledge bases to automatically generate noisy training labels.

State Update Functions in Partially Observable MDP

In POMDPs the agent receives observations, not states. Maintain and update a belief distribution over hidden states.

BLEU

How to automatically evaluate machine translation quality? Compare n-gram overlap between generated and reference translations.

CNNs for NLP

RNNs are slow for text classification due to sequential processing. Apply convolutions over word embeddings to capture local n-gram features in parallel.

Dropout

Neural networks overfit by co-adapting neurons. Randomly drop units during training to regularize and approximate ensemble averaging.

Hopfield Networks

How to build an associative memory that stores and retrieves patterns? Use a recurrent network with symmetric weights that minimizes an energy function.

Positional Encoding

Self-attention is permutation invariant and has no notion of token order. Inject position information to preserve sequence structure.

Recurrent Neural Networks (RNN)

Feedforward networks can't handle variable-length sequences or temporal dependencies. Use recurrent connections to maintain hidden state over time.

Transformers

RNNs process sequences serially and struggle with long-range dependencies. Use self-attention to process all positions in parallel.

Attention

Fixed-size representations bottleneck sequence-to-sequence models. Dynamically attend to relevant parts of the input at each decoding step.

Dyna-Q - Planning and Learning

Pure model-free RL is sample-inefficient. Interleave real experience with simulated experience from a learned model.

Generalized Advantage Estimate

Monte Carlo advantage has low bias but high variance; TD has low variance but high bias. Exponentially weight multi-step TD errors to control the tradeoff.

Natural Policy Gradient

Vanilla policy gradient steps in parameter space distort the policy unevenly. Use the Fisher information to take equal-size steps in distribution space.

Prioritized Sweeping

Uniform random updates in model-based RL waste computation on low-priority states. Prioritize updates by expected value change.

TRPO - Trust-Region Policy Optimization

Policy gradient updates can be too large and collapse performance. Constrain the KL divergence between old and new policies for monotonic improvement.

Bellman Equation and Value Functions

How to recursively define the expected long-term return from a state? Express value as immediate reward plus discounted future value.

Dynamic Programming (RL)

How to compute optimal policies when the full MDP model is known? Iteratively apply Bellman updates to converge to optimal value functions.

Incremental Implementation of Estimating Action Values

Storing all past rewards to compute action value averages is memory-inefficient. Use incremental running averages instead.
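
The update rule in one line, as a sketch: Q_n = Q_{n-1} + (R_n - Q_{n-1}) / n, which reproduces the full average with O(1) memory.

```python
def update(q, n, reward):
    """Incremental running mean: Q_n = Q_{n-1} + (R_n - Q_{n-1}) / n."""
    return q + (reward - q) / n

rewards = [4.0, 6.0, 8.0]
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = update(q, n, r)
assert q == sum(rewards) / len(rewards)   # identical to averaging all rewards
```

Replacing 1/n with a constant step size gives an exponentially weighted average for non-stationary problems.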

Markov Decision Processes

How to mathematically formalize sequential decision-making? Define states, actions, transitions, and rewards with the Markov property.

Multi-Armed Bandits

How to balance exploring unknown options vs exploiting the best known option to maximize cumulative reward?

Off-policy learning with approximation

Off-policy learning with function approximation can diverge (the deadly triad). Use importance sampling corrections or gradient methods for stability.

On-policy learning with approximation

Tabular value functions don't scale to large state spaces. Use function approximation to generalize values across similar states on-policy.

PGT Actor-Critic

REINFORCE has high variance from using full returns. Use a learned value function (critic) as baseline to reduce policy gradient variance.

REINFORCE - Monte Carlo Policy Gradient

How to optimize a policy when environment dynamics are unknown? Use sampled returns to estimate the policy gradient via Monte Carlo rollouts.

Reinforcement Learning Problem Setup

How to formalize sequential decision-making under uncertainty? Define agents, environments, states, actions, and rewards.

Temporal Difference Learning

Monte Carlo methods require waiting until episode end to update. Bootstrap from current value estimates to learn online from incomplete episodes.

Policy Gradient

Value-based methods struggle with continuous actions or stochastic policies. Directly differentiate expected return w.r.t. policy parameters.

Importance Sampling

How do you estimate expectations under a target distribution when you only have samples from a different distribution?

Monte-Carlo RL Methods

Model dynamics are unknown and bootstrapping introduces bias. Use complete episode returns for unbiased value estimates.

Convolution

Fully connected layers ignore spatial structure and have too many parameters for grid data. Use local weight-sharing filters that exploit translation invariance.

BERT

Language models only use left context, missing bidirectional understanding. Mask random tokens and train to predict them using full context.

Meta Learning

Standard learning trains from scratch for each new task. Learn to learn so new tasks can be solved with very few examples.

Graph Convolutional Networks (GCN)

Standard neural networks can't operate on graph-structured data. Generalize convolutions to graphs by aggregating neighbor features.

MAML - Model-Agnostic Meta-Learning

Task-specific fine-tuning from random initialization needs too much data. Find an initialization that can be quickly adapted to new tasks in few gradient steps.

BM25

TF-IDF doesn't account for document length or term frequency saturation. Use probabilistic term weighting with length normalization.

Counterfactual Evaluation and LTR

Can't deploy a new ranking policy just to evaluate it. Use logged interaction data with importance weighting to estimate performance offline.

Online Evaluation and LTR

Offline metrics may not reflect real user satisfaction. Use interleaving or A/B tests to evaluate ranking quality from live user behavior.

LambdaRank

Directly optimizing IR metrics like NDCG is non-differentiable. Define implicit gradients (lambdas) that approximate the desired metric-driven update.

ListNet and ListMLE

Pairwise ranking losses don't consider the full list structure. Define a distribution over permutations and optimize list-level likelihood.

RankNet

Pointwise scoring ignores relative document ordering. Learn pairwise preferences using a cross-entropy loss over document pairs.

Expected Reciprocal Rank

Traditional IR metrics assume independent document relevance. Model user browsing as a cascade where each document's value depends on those above it.

Compressed Sensing

Nyquist sampling requires too many measurements for sparse signals. Recover sparse signals from far fewer random measurements.

Disentangled Representations

Learned representations entangle multiple factors of variation. Separate independent generative factors into distinct latent dimensions.

Loss Functions

How to quantify prediction error to guide optimization? Choose objective functions that align with the task and have good gradient properties.

Model Free Reinforcement Learning

Learning the environment model is hard or unnecessary. Learn value functions or policies directly from interaction without modeling dynamics.

Deep-Q-Network (DQN)

Q-learning with neural networks is unstable due to correlated samples and moving targets. Use experience replay and target networks for stability.

Activation Functions

Linear transformations alone can't learn nonlinear decision boundaries. Apply nonlinear functions element-wise to enable universal approximation.

Calibration

Neural networks output confident but unreliable probabilities. Adjust predicted probabilities to match true outcome frequencies.

Convolutional Neural Networks (CNN)

Fully connected networks ignore spatial structure and have too many parameters for images. Use local receptive fields with shared weights for spatial hierarchy.

Group Equivariant Convolutional Neural Networks

Standard CNNs are only translation equivariant. Generalize convolutions to be equivariant to rotations, reflections, and other symmetry groups.

Harris Corner Detection

How to detect interest points in images? Find locations where intensity changes significantly in all directions.

K-Means

How to partition data into k coherent groups without labels? Iteratively assign points to nearest centroid and update centroids.
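
A minimal numpy sketch of the two alternating steps (assign to nearest centroid, recompute centroid means), on a toy dataset with two obvious clusters:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

X = np.array([[0.0, 0], [0, 1], [10, 10], [10, 11]])
centroids, labels = kmeans(X, k=2)
assert labels[0] == labels[1] and labels[2] == labels[3]
assert labels[0] != labels[2]
```

A production version would also handle empty clusters and stop early on convergence.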

Discrete Fourier Transform

Analyzing frequency content of discrete signals requires transforming from time domain. Decompose signal into sum of complex sinusoids.

Hough Transform

Detecting geometric shapes in noisy images with edge detection alone is fragile. Vote in parameter space to robustly find lines and circles.

Monte-Carlo Tree Search

Exhaustive game tree search is intractable for large state spaces. Use random simulations to selectively expand promising branches.

Markov Reward Processes

How to evaluate expected long-term reward in a stochastic process without actions? Define value functions over Markov chains with rewards.

Semi-Markov Decision Processes

Standard MDPs assume fixed time steps. Extend MDPs to handle actions with variable durations.

Stochastic Gradients

Computing exact gradients over full datasets is too expensive. Use random mini-batch samples to get unbiased gradient estimates.

Model Based Reinforcement Learning

Model-free RL is sample-inefficient. Learn a model of environment dynamics and plan or generate synthetic experience from it.

Monte-Carlo Estimation

Analytical computation of expectations is intractable. Approximate expectations by averaging random samples.
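
The classic toy example of the idea: estimate π by averaging random samples, since the fraction of uniform points landing inside the unit circle converges to π/4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
pts = rng.uniform(-1, 1, size=(n, 2))
inside = (pts ** 2).sum(axis=1) <= 1.0   # indicator: point in the unit circle
pi_hat = 4 * inside.mean()               # Monte-Carlo estimate of pi
assert abs(pi_hat - np.pi) < 0.01
```

The error shrinks as O(1/√n) regardless of dimension, which is what makes the method so broadly useful.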

Variational Inference

Exact posterior inference is intractable for complex models. Approximate the posterior with a simpler distribution by minimizing KL divergence.

Why implicit density models

Explicit density models are restricted by tractability requirements. Implicit models like GANs generate samples without computing likelihoods.

Generative Adversarial Networks

Generating realistic samples without tractable density estimation. Train a generator and discriminator adversarially to learn the data distribution.

Jensen–Shannon Divergence

KL divergence is asymmetric and blows up when the distributions' supports don't overlap, yielding no useful gradient. Average each distribution's KL to their mixture for a bounded, symmetric measure.

KL Divergence

How to measure how one probability distribution differs from another? Compute the expected excess surprise from using the wrong distribution.
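
A minimal sketch for discrete distributions, showing the definition and two basic properties (non-negativity and asymmetry):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), for discrete p, q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
assert kl(p, p) == 0.0                      # zero iff the distributions match
assert kl(p, q) > 0                         # non-negative (Gibbs' inequality)
assert not np.isclose(kl(p, q), kl(q, p))   # and not symmetric
```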

Normalizing Flows

Most generative models can't compute exact likelihoods. Use invertible transformations to get both exact density evaluation and efficient sampling.

Compositional semantics and sentence representations

Words combine into sentences with complex meaning. Build structured representations that capture compositional semantics.

Coreference Resolution

Multiple expressions in text refer to the same entity. Identify which mentions correspond to the same real-world referent.

Challenges of GAN

GANs suffer from mode collapse, training instability, and vanishing generator gradients.

Depth and Trainability

Deeper networks are more expressive but harder to train due to vanishing/exploding gradients and optimization challenges.

Maximum Likelihood Estimation

How to estimate model parameters from observed data? Find parameters that maximize the probability of the observations.

Adaptive Learning Rate Optimizers

A single global learning rate is suboptimal for all parameters. Adapt learning rates per-parameter based on gradient history.

Autoregressive Models

How to model complex joint distributions tractably? Factor into a product of conditionals and model each sequentially.

Backpropagation Through Time (BPTT)

Standard backprop doesn't handle recurrent connections over time. Unroll the recurrence and backpropagate through the full sequence.

Boltzmann Machines

How to learn an undirected probabilistic generative model? Use stochastic binary units with symmetric connections to model the data distribution.

Challenges of optimizing deep models

Deep networks face vanishing/exploding gradients, saddle points, and ill-conditioned loss landscapes.

Conditional GAN

Standard GANs have no control over what they generate. Condition generator and discriminator on class labels or other inputs.

Contrastive Divergence

MCMC-based learning in energy models requires expensive sampling to convergence. Approximate the gradient using only a few Gibbs sampling steps.

Control Variates

The score-function estimator's variance is too high to be usable. Subtract a correlated baseline with known expectation to reduce variance without adding bias.

InfoGAN

Standard GANs don't learn interpretable latent representations. Maximize mutual information between latent codes and outputs for disentangled generation.

LSTM

Standard RNNs forget over long sequences due to vanishing gradients. Use gated memory cells to selectively remember and forget over long horizons.

Latent Variable Models

Observed data alone may not capture underlying structure. Introduce hidden variables to explain data through simpler latent factors.

Normalization

Internal covariate shift slows convergence and requires careful tuning. Normalize activations to stabilize and accelerate training.

Pathwise Gradient Estimator

Sampling inside the forward pass blocks gradients even when the loss is differentiable. Reparameterize samples as deterministic functions of parameters and independent noise to backpropagate through them.

PixelRNN

Generating high-quality images requires modeling complex pixel dependencies. Use autoregressive RNNs to generate images one pixel at a time.

REINFORCE - Score Function Estimator

Need gradients through non-differentiable stochastic operations. Use the log-derivative trick to estimate gradients from samples.

Variational Autoencoders

Autoencoders don't provide a proper generative model with meaningful latent space. Optimize a variational lower bound for principled generation.

Weight Initialization in Deep Neural Networks

Poor initialization causes exploding or vanishing activations. Initialize weights to preserve signal variance across layers.

Why Generative Models

Discriminative models can't generate new data or capture the full data distribution. Generative models enable sampling, density estimation, and unsupervised learning.

Gaussian Distribution

Many phenomena cluster around a mean with known spread. The Gaussian is the maximum entropy distribution for known mean and variance.

Autoencoders

How to learn compact representations without labels? Train a network to reconstruct its input through a bottleneck layer.

Energy based models

How to model complex distributions without explicit normalization? Assign low energy to likely configurations and learn the energy function.

Jensen's Inequality

Proving convexity/concavity properties and establishing bounds such as KL non-negativity and convexity of loss functions.

Expectation Maximization

Exact maximum likelihood is intractable when the model has latent variables. Alternate between inferring the latent posterior (E-step) and maximizing the expected complete-data likelihood (M-step).

GRU

LSTMs are effective but have many parameters. Simplify the gating mechanism while retaining the ability to capture long-range dependencies.

Equivalent Kernel

How to understand the implicit smoothing behavior of a regression model? Express predictions as a kernel-weighted sum of training targets.

Layer Normalization

Batch normalization depends on batch statistics and fails with small batches or recurrent nets. Normalize across features within each example instead.

Bayesian Estimation

Point estimates don't capture parameter uncertainty. Use Bayes' theorem to maintain a full posterior distribution over parameters.

Logistic Regression

How to model binary classification probabilities? Apply a sigmoid to a linear function and optimize with cross-entropy loss.

Bayesian Linear Regression

Ordinary linear regression gives point estimates with no uncertainty. Place priors on weights to get a full posterior and predictive distribution.

Bias vs Variance in Machine Learning

Simple models underfit (high bias) and complex models overfit (high variance). Understanding this tradeoff guides model selection.

Gaussian Processes

Learn non-parametric approximations to unknown functions AND identify where these approximations are unreliable.

Basis Functions

Linear models can't capture nonlinear relationships. Transform inputs through nonlinear basis functions to enable nonlinear modeling.

Cross Validation

Evaluating on training data overestimates performance. Hold out different data subsets in rotation to get reliable generalization estimates.

Maximum A Posteriori (MAP)

MLE overfits with limited data by ignoring prior knowledge. Incorporate a prior and find the mode of the posterior instead.

Regularized Least Squares

Ordinary least squares overfits with limited data or many features. Add a penalty on weight magnitudes to constrain model complexity.

Stochastic Gradient Descent

Full-batch gradient descent is too slow for large datasets. Update parameters using gradients from random mini-batches.

Support Vector Machines (SVM)

Many decision boundaries separate the training data. Find the maximum-margin hyperplane for best generalization.

Lagrange Multipliers

How to optimize a function subject to equality constraints? Introduce multipliers to convert into an unconstrained saddle-point problem.

Mixture of Experts

A single model struggles with heterogeneous data. Route inputs to specialized expert sub-models via a learned gating function.

Principal Component Analysis (PCA)

High-dimensional data is hard to visualize and process. Project onto directions of maximum variance to reduce dimensionality.

Probabilistic Generative Models

Discriminative models can't model the data-generating process. Learn the joint distribution for generation, missing data handling, and outlier detection.

Gaussian Mixture Model

A single Gaussian can't model multi-modal data. Use a weighted mixture of Gaussians to represent complex distributions.

Kernel Methods

Operating in high-dimensional feature spaces is computationally expensive. Use the kernel trick to compute inner products implicitly.

Backpropagation

How to efficiently compute gradients in multi-layer networks? Apply the chain rule layer-by-layer from output to input.

Cross entropy

How to measure the difference between predicted and true distributions? Compute the expected log-loss under the true distribution.

Discriminant Functions

How to directly assign inputs to classes without modeling probabilities? Learn functions that map inputs to class-specific scores.

Least squares for classification

How to apply regression to classification? Fit class labels as continuous targets, though it has known limitations for multi-class.

Perceptron

How to learn a linear decision boundary from data? Iteratively adjust weights on misclassified examples.

Bayesian Model Selection with Model Evidence

How to choose between models of different complexity? Compare marginal likelihoods which automatically penalize unnecessary complexity.

Decision Theory

How to make optimal decisions under uncertainty? Combine probability estimates with loss functions to minimize expected risk.