These are my working notes in ML, both foundational and recent. I write them as a way to absorb good ideas and think out loud — they're opinionated, often incomplete, and updated over time. They're also interlinked; feel free to wander.
Evolve diverse populations of challenges and agents without an archive or pre-defined behavioral dimensions.
How LLM agents can be used to pursue mathematical research at scale, by Terence Tao et al.
How to make LLMs more causally consistent.
How to explore RL search space better than intrinsic motivation.
Learn agentic architecture without manual design.
Learn policy in open-ended loop with self-play and LLMs.
How can agents improve themselves? Self-modify, validate whether the change works, and maintain a library of previous agents as stepping stones.
How to measure notions of interestingness for learning agents without hand-coded formulas
How to use LLM agents to drive scientific workflow
Create and maintain rich diversity of high performing solutions, often outperforming direct performance optimization.
How can LLMs be scaffolded together to generate new breakthroughs?
How do humans approach exploration? Turns out it's empowerment, with entropy early on.
Train a model with recursion without the large cost of BPTT
Small transformers lack multi-step reasoning ability. Use latent recurrence in hidden states instead of explicit chain-of-thought tokens.
Flat latent reasoning has limited depth. Add hierarchical structure to latent recurrence for deeper reasoning in transformers.
Reduce inference cost from O(n^3) to O(n^2) in autoregressive generation
Multi-head attention KV cache is too large for efficient inference. Share KV heads across query head groups to reduce memory.
Scaling all transformer parameters is too expensive. Route each token to a subset of expert layers to scale capacity without proportional compute.
Make the size of intermediate representations much smaller
Absolute position embeddings don't generalize to unseen sequence lengths. Encode relative position through rotation of query and key vectors.
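A minimal NumPy sketch of the rotation idea (the `rope` helper is illustrative; the base 10000 and the pairing of even/odd feature dimensions follow the usual convention):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each (even, odd) feature pair by an angle proportional to the
    # token position; lower pairs rotate faster than higher ones.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
# Scores depend only on the relative offset: positions (3, 5) and (10, 12)
# both differ by 2, so the rotated dot products agree.
s1 = rope(q, 3) @ rope(k, 5)
s2 = rope(q, 10) @ rope(k, 12)
```

Because a rotation by angle a followed by the transpose of a rotation by angle b composes to a rotation by b - a, the query-key dot product sees only the position difference.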
How to find the most problematic data points automatically
Fine-tuning LLMs on narrow tasks can unexpectedly produce misaligned behavior on unrelated tasks.
Deal with class imbalance and calibration in a principled way
Learn from a preference dataset without a complicated RL setup
How to infer ratings from a dataset of outcomes of binary comparisons
Avoid learning an explicit value function in RL alignment setup
Need a computationally tractable distance measure between empirical distributions without requiring density estimation.
Layer normalization is expensive due to mean centering. Normalize using only root mean square for equivalent effect with less cost.
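A minimal sketch of the RMSNorm computation (the `gain` parameter and epsilon value are illustrative):

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Normalize by the root mean square only: no mean subtraction, unlike
    # LayerNorm, which both centers and rescales.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.array([3.0, -4.0])       # mean(x^2) = 12.5
y = rms_norm(x, gain=np.ones(2))
```

The output has unit RMS but keeps the input's (uncentered) direction, which is the cost saving: no mean statistic to compute or subtract.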
Raw action values mix state quality with action quality. Subtract the state value baseline to isolate the advantage of each action.
Q-learning with tabular methods doesn't scale to large state spaces. Approximate the Q-function with a deep neural network.
The RL task is partially observable or needs memory of past states (non-Markov)
Many probability distributions are consistent with known constraints. Choose the one with maximum entropy as the least biased estimate.
How do you train interdependent neural networks without them destabilizing each other?
Agent cannot observe the full environment state. Maintain a belief state or use memory to act under incomplete information.
Dot products between high-dimensional vectors grow larger with dimension, causing training instability. Scale them by 1/√d to keep their variance constant.
Sigmoid/tanh activations saturate and cause vanishing gradients. Use max(0,x) for sparse, non-saturating activation.
How to quantify the information a random variable carries about an unknown parameter? Use the curvature of the log-likelihood.
Input distribution changes between training and test time while the conditional label distribution stays the same.
Training and deployment data come from different distributions, degrading model performance.
Greedy decoding misses high-probability sequences. Maintain top-k partial hypotheses at each step for better approximate search.
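A toy sketch of the top-k bookkeeping (the per-step distributions here are context-free for brevity; a real decoder scores tokens conditioned on each prefix):

```python
import math

def beam_search(step_logprobs, beam_width=2):
    # step_logprobs: one {token: logprob} dict per position.
    beams = [((), 0.0)]                       # (sequence, cumulative logprob)
    for dist in step_logprobs:
        candidates = [(seq + (tok,), lp + tok_lp)
                      for seq, lp in beams
                      for tok, tok_lp in dist.items()]
        # Keep only the top-k partial hypotheses by cumulative log-prob.
        beams = sorted(candidates, key=lambda b: -b[1])[:beam_width]
    return beams

steps = [{"a": math.log(0.6), "b": math.log(0.4)},
         {"a": math.log(0.3), "b": math.log(0.7)}]
best_seq, best_lp = beam_search(steps)[0]
```

Greedy decoding would pick "a" then "b" here too, but beam search also keeps "b..." alive in case a later step redeems it.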
How to quantify how alike two data points or distributions are? Use distance metrics, kernels, or divergences depending on the setting.
Single models have high variance and limited accuracy. Combine multiple models to reduce error through averaging or boosting.
Learn robust (non-correlational) and data-efficient RL agents with full analytical gradients from a learned environment model.
CNNs lose spatial hierarchy and part-whole relationships through pooling. Use capsules with routing-by-agreement to preserve equivariance.
Need very strong prediction intervals or high accuracy with limited data
The reward function is unknown but expert demonstrations are available. Infer the reward that explains observed expert behavior.
Specifying a reward function for complex tasks like language generation is intractable. Learn a reward model from human preferences and optimize with RL.
Cross-entropy loss weighs all polynomial terms equally. Reweight the polynomial expansion of the loss for task-specific improvement.
Training data has highly skewed class distribution, biasing the model toward majority classes.
Model is uncertain or likely wrong on some inputs. Learn when to defer predictions to a human expert.
Models produce point predictions without knowing how confident they are. Estimate prediction uncertainty for safer decision-making.
Complex models overfit while simple models underfit. Use Bayesian model evidence to automatically trade off fit and complexity.
Find the optimal low-rank approximation of any matrix (square or rectangular)
Predict user preferences without content features. Leverage patterns in user-item interaction matrices to recommend items.
Pointwise scoring doesn't optimize for ranking quality. Learn to directly optimize document ordering for search relevance.
TRPO's constrained optimization is complex to implement. Use a clipped surrogate objective to approximate trust region behavior simply.
Fixed-vocabulary tokenizers can't handle unseen words. Iteratively merge the most frequent character pairs to build a subword vocabulary.
Fixed vocabulary tokenizers can't handle open vocabularies. Use a unigram language model to learn a subword vocabulary from raw text.
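The frequency-merge loop behind byte-pair encoding can be sketched in a few lines (corpus and merge count are illustrative):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # words: {word: count}; each word starts as a tuple of characters.
    vocab = Counter({tuple(w): c for w, c in words.items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for word, count in vocab.items():    # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] += count
        vocab = merged
    return merges

merges = bpe_merges({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
```

After two merges the vocabulary contains "lo" and then "low" as single units; rare suffixes like "est" stay decomposed.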
Manual labeling is expensive. Use existing knowledge bases to automatically generate noisy training labels.
In POMDPs the agent receives observations, not states. Maintain and update a belief distribution over hidden states.
How to automatically evaluate machine translation quality? Compare n-gram overlap between generated and reference translations.
RNNs are slow for text classification due to sequential processing. Apply convolutions over word embeddings to capture local n-gram features in parallel.
Neural networks overfit by co-adapting neurons. Randomly drop units during training to regularize and approximate ensemble averaging.
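A sketch of inverted dropout, the variant used in practice (drop probability and sizes are illustrative):

```python
import numpy as np

def dropout(x, p_drop, training, rng):
    # Inverted dropout: scale kept units by 1/(1 - p) at train time so the
    # expected activation matches eval time, when dropout is a no-op.
    if not training:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, p_drop=0.5, training=True, rng=rng)
```

Each unit is either zeroed or doubled, so the expected value of every activation is unchanged and no rescaling is needed at inference.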
How to build an associative memory that stores and retrieves patterns? Use a recurrent network with symmetric weights that minimizes an energy function.
Self-attention is permutation invariant and has no notion of token order. Inject position information to preserve sequence structure.
Feedforward networks can't handle variable-length sequences or temporal dependencies. Use recurrent connections to maintain hidden state over time.
RNNs process sequences serially and struggle with long-range dependencies. Use self-attention to process all positions in parallel.
Fixed-size representations bottleneck sequence-to-sequence models. Dynamically attend to relevant parts of the input at each decoding step.
Self-attention costs O(n²) in sequence length. Use sparse or linear approximations to handle longer sequences.
Pure model-free RL is sample-inefficient. Interleave real experience with simulated experience from a learned model.
Monte Carlo advantage has low bias but high variance; TD has low variance but high bias. Exponentially weight multi-step TD errors to control the tradeoff.
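The exponential weighting (generalized advantage estimation) in a short sketch; lam = 1 recovers Monte Carlo advantages, lam = 0 recovers one-step TD errors:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    # values carries one extra bootstrap entry for the post-terminal state.
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running   # discounted sum of deltas
        advantages[t] = running
    return advantages

# With gamma = lam = 1 this equals full-return advantage: G_t - V(s_t).
adv = gae([1.0, 1.0], [0.5, 0.5, 0.0], gamma=1.0, lam=1.0)
```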
Vanilla policy gradient steps in parameter space distort the policy unevenly. Use the Fisher information to take equal-size steps in distribution space.
Uniform random updates in model-based RL waste computation on low-priority states. Prioritize updates by expected value change.
Policy gradient updates can be too large and collapse performance. Constrain the KL divergence between old and new policies for monotonic improvement.
How to recursively define the expected long-term return from a state? Express value as immediate reward plus discounted future value.
How to compute optimal policies when the full MDP model is known? Iteratively apply Bellman updates to converge to optimal value functions.
Storing all past rewards to compute action value averages is memory-inefficient. Use incremental running averages instead.
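The incremental form in two lines, Q_{n+1} = Q_n + (x - Q_n)/(n + 1):

```python
def update_mean(mean, count, new_value):
    # Running sample mean: no need to store past values.
    count += 1
    mean += (new_value - mean) / count
    return mean, count

mean, n = 0.0, 0
for r in [2.0, 4.0, 6.0]:
    mean, n = update_mean(mean, n, r)
```

Replacing 1/count with a constant step size gives an exponentially weighted average, useful for nonstationary rewards.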
How to mathematically formalize sequential decision-making? Define states, actions, transitions, and rewards with the Markov property.
How to balance exploring unknown options vs exploiting the best known option to maximize cumulative reward?
Off-policy learning with function approximation can diverge (the deadly triad). Use importance sampling corrections or gradient methods for stability.
Tabular value functions don't scale to large state spaces. Use on-policy function approximation to generalize values across similar states.
REINFORCE has high variance from using full returns. Use a learned value function (critic) as baseline to reduce policy gradient variance.
How to optimize a policy when environment dynamics are unknown? Use sampled returns to estimate the policy gradient via Monte Carlo rollouts.
How to formalize sequential decision-making under uncertainty? Define agents, environments, states, actions, and rewards.
Monte Carlo methods require waiting until episode end to update. Bootstrap from current value estimates to learn online from incomplete episodes.
Value-based methods struggle with continuous actions or stochastic policies. Directly differentiate expected return w.r.t. policy parameters.
How do you estimate expectations under a target distribution when you only have samples from a different distribution?
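Importance sampling in a sketch: reweight each sample by the density ratio p/q (distributions here are illustrative; both are unit-variance normals, so normalizing constants cancel):

```python
import numpy as np

rng = np.random.default_rng(0)
# Target p = N(0, 1); samples come from the proposal q = N(1, 1).
x = rng.normal(1.0, 1.0, size=500_000)
log_w = -0.5 * x**2 + 0.5 * (x - 1.0) ** 2   # log p(x) - log q(x)
w = np.exp(log_w)
# E_p[x^2] = 1 for a standard normal; estimate it from q-samples only.
estimate = np.mean(w * x**2)
```

The estimator is unbiased, but its variance explodes when q has thin tails relative to p, which is why the proposal choice matters.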
Model dynamics are unknown and bootstrapping introduces bias. Use complete episode returns for unbiased value estimates.
Fully connected layers ignore spatial structure and have too many parameters for grid data. Use local weight-sharing filters that exploit translation invariance.
Language models only use left context, missing bidirectional understanding. Mask random tokens and train to predict them using full context.
Standard learning trains from scratch for each new task. Learn to learn so new tasks can be solved with very few examples.
Standard neural networks can't operate on graph-structured data. Generalize convolutions to graphs by aggregating neighbor features.
Task-specific fine-tuning from random initialization needs too much data. Find an initialization that can be quickly adapted to new tasks in few gradient steps.
TF-IDF doesn't account for document length or term frequency saturation. Use probabilistic term weighting with length normalization.
Can't deploy a new ranking policy just to evaluate it. Use logged interaction data with importance weighting to estimate performance offline.
Offline metrics may not reflect real user satisfaction. Use interleaving or A/B tests to evaluate ranking quality from live user behavior.
Directly optimizing IR metrics like NDCG is non-differentiable. Define implicit gradients (lambdas) that approximate the desired metric-driven update.
Pairwise ranking losses don't consider the full list structure. Define a distribution over permutations and optimize list-level likelihood.
Pointwise scoring ignores relative document ordering. Learn pairwise preferences using a cross-entropy loss over document pairs.
Traditional IR metrics assume independent document relevance. Model user browsing as a cascade where each document's value depends on those above it.
Nyquist sampling requires too many measurements for sparse signals. Recover sparse signals from far fewer random measurements.
Learned representations entangle multiple factors of variation. Separate independent generative factors into distinct latent dimensions.
How to quantify prediction error to guide optimization? Choose objective functions that align with the task and have good gradient properties.
Learning the environment model is hard or unnecessary. Learn value functions or policies directly from interaction without modeling dynamics.
Q-learning with neural networks is unstable due to correlated samples and moving targets. Use experience replay and target networks for stability.
Linear transformations alone can't learn nonlinear decision boundaries. Apply nonlinear functions element-wise to enable universal approximation.
Neural networks output confident but unreliable probabilities. Adjust predicted probabilities to match true outcome frequencies.
Fully connected networks ignore spatial structure and have too many parameters for images. Use local receptive fields with shared weights for spatial hierarchy.
Standard CNNs are only translation equivariant. Generalize convolutions to be equivariant to rotations, reflections, and other symmetry groups.
How to detect interest points in images? Find locations where intensity changes significantly in all directions.
How to partition data into k coherent groups without labels? Iteratively assign points to nearest centroid and update centroids.
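The two alternating steps in a compact sketch (initialization by sampling data points; a fixed iteration budget stands in for a convergence check):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid ...
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # ... update step: each centroid moves to the mean of its points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = kmeans(X, k=2)
```

Each step can only decrease the within-cluster squared distance, so the loop converges, though only to a local optimum that depends on initialization.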
Analyzing frequency content of discrete signals requires transforming from time domain. Decompose signal into sum of complex sinusoids.
Detecting geometric shapes in noisy images with edge detection alone is fragile. Vote in parameter space to robustly find lines and circles.
Exhaustive game tree search is intractable for large state spaces. Use random simulations to selectively expand promising branches.
How to evaluate expected long-term reward in a stochastic process without actions? Define value functions over Markov chains with rewards.
Standard MDPs assume fixed time steps. Extend MDPs to handle actions with variable durations.
Computing exact gradients over full datasets is too expensive. Use random mini-batch samples to get unbiased gradient estimates.
Model-free RL is sample-inefficient. Learn a model of environment dynamics and plan or generate synthetic experience from it.
Analytical computation of expectations is intractable. Approximate expectations by averaging random samples.
Exact posterior inference is intractable for complex models. Approximate the posterior with a simpler distribution by minimizing KL divergence.
Explicit density models are restricted by tractability requirements. Implicit models like GANs generate samples without computing likelihoods.
Generating realistic samples without tractable density estimation. Train a generator and discriminator adversarially to learn the data distribution.
KL divergence goes to infinity when the distributions have no support overlap, resulting in no gradient
How to measure how one probability distribution differs from another? Compute the expected excess surprise from using the wrong distribution.
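The discrete definition directly, KL(p || q) = sum_i p_i log(p_i / q_i):

```python
import math

def kl_divergence(p, q):
    # Expected excess surprise from coding samples of p with a code for q.
    # Terms with p_i = 0 contribute nothing (0 * log 0 = 0 by convention).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
kl = kl_divergence(p, q)
```

Note the asymmetry: kl_divergence(p, q) and kl_divergence(q, p) generally differ, which is why KL is a divergence rather than a distance.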
Most generative models can't compute exact likelihoods. Use invertible transformations to get both exact density evaluation and efficient sampling.
Words combine into sentences with complex meaning. Build structured representations that capture compositional semantics.
Multiple expressions in text refer to the same entity. Identify which mentions correspond to the same real-world referent.
GANs suffer from mode collapse, training instability, and vanishing generator gradients.
Deeper networks are more expressive but harder to train due to vanishing/exploding gradients and optimization challenges.
How to estimate model parameters from observed data? Find parameters that maximize the probability of the observations.
A single global learning rate is suboptimal for all parameters. Adapt learning rates per-parameter based on gradient history.
How to model complex joint distributions tractably? Factor into a product of conditionals and model each sequentially.
Standard backprop doesn't handle recurrent connections over time. Unroll the recurrence and backpropagate through the full sequence.
How to learn an undirected probabilistic generative model? Use stochastic binary units with symmetric connections to model the data distribution.
Deep networks face vanishing/exploding gradients, saddle points, and ill-conditioned loss landscapes.
Standard GANs have no control over what they generate. Condition generator and discriminator on class labels or other inputs.
MCMC-based learning in energy models requires expensive sampling to convergence. Approximate the gradient using only a few Gibbs sampling steps.
The variance of the score-function estimator (for non-differentiable optimization) is too high to be usable
Standard GANs don't learn interpretable latent representations. Maximize mutual information between latent codes and outputs for disentangled generation.
Standard RNNs forget over long sequences due to vanishing gradients. Use gated memory cells to selectively remember and forget over long horizons.
Observed data alone may not capture underlying structure. Introduce hidden variables to explain data through simpler latent factors.
Internal covariate shift slows convergence and requires careful tuning. Normalize activations to stabilize and accelerate training.
The forward pass involves sampling from distributions whose parameters you're optimizing, and the loss function is differentiable
Generating high-quality images requires modeling complex pixel dependencies. Use autoregressive RNNs to generate images one pixel at a time.
Need gradients through non-differentiable stochastic operations. Use the log-derivative trick to estimate gradients from samples.
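The trick in one sketch: grad_theta E[f(x)] = E[f(x) * grad_theta log p(x)], estimated here for f(x) = x^2 under x ~ N(theta, 1), whose true gradient is 2*theta (the choice of f and theta is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=200_000)
# Score function of N(theta, 1): d/dtheta log p(x) = (x - theta).
grad_est = np.mean((x ** 2) * (x - theta))
# True gradient: d/dtheta E[x^2] = d/dtheta (theta^2 + 1) = 2 * theta.
```

No gradient ever flows through the sampling step itself, which is what makes this usable for discrete or otherwise non-differentiable f. The price is variance, hence the baselines and control variates used in practice.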
Autoencoders don't provide a proper generative model with meaningful latent space. Optimize a variational lower bound for principled generation.
Poor initialization causes exploding or vanishing activations. Initialize weights to preserve signal variance across layers.
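A sketch of variance-preserving (Glorot/LeCun-style) initialization, assuming unit-variance inputs and a linear layer; layer widths are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 2048, 2048
# Weight variance 1 / fan_in: each output is a sum of fan_in terms of
# variance 1/fan_in, so pre-activations keep roughly unit variance.
W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))
x = rng.normal(0.0, 1.0, size=fan_in)
h = x @ W
```

Stacking many such layers keeps signal magnitude roughly constant in the forward pass; ReLU networks use 2 / fan_in (He initialization) to compensate for the halved variance after rectification.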
Discriminative models can't generate new data or capture the full data distribution. Generative models enable sampling, density estimation, and unsupervised learning.
Many phenomena cluster around a mean with known spread. The Gaussian is the maximum entropy distribution for known mean and variance.
How to learn compact representations without labels? Train a network to reconstruct its input through a bottleneck layer.
How to model complex distributions without explicit normalization? Assign low energy to likely configurations and learn the energy function.
Proving convexity/concavity properties and establishing bounds, such as KL non-negativity and convexity of loss functions.
Exact computation of the MLE is not possible
LSTMs are effective but have many parameters. Simplify the gating mechanism while retaining the ability to capture long-range dependencies.
How to understand the implicit smoothing behavior of a regression model? Express predictions as a kernel-weighted sum of training targets.
Batch normalization depends on batch statistics and fails with small batches or recurrent nets. Normalize across features within each example instead.
Point estimates don't capture parameter uncertainty. Use Bayes' theorem to maintain a full posterior distribution over parameters.
How to model binary classification probabilities? Apply a sigmoid to a linear function and optimize with cross-entropy loss.
Ordinary linear regression gives point estimates with no uncertainty. Place priors on weights to get a full posterior and predictive distribution.
Simple models underfit (high bias) and complex models overfit (high variance). Understanding this tradeoff guides model selection.
Learn non-parametric approximations to unknown functions AND identify where these approximations are unreliable.
Linear models can't capture nonlinear relationships. Transform inputs through nonlinear basis functions to enable nonlinear modeling.
Evaluating on training data overestimates performance. Hold out different data subsets in rotation to get reliable generalization estimates.
MLE overfits with limited data by ignoring prior knowledge. Incorporate a prior and find the mode of the posterior instead.
Ordinary least squares overfits with limited data or many features. Add a penalty on weight magnitudes to constrain model complexity.
Full-batch gradient descent is too slow for large datasets. Update parameters using gradients from random mini-batches.
Many decision boundaries separate the training data. Find the maximum-margin hyperplane for best generalization.
How to optimize a function subject to equality constraints? Introduce multipliers to convert into an unconstrained saddle-point problem.
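A one-line worked example: maximize f(x, y) = xy subject to x + y = 1.

```latex
\mathcal{L}(x, y, \lambda) = xy + \lambda(1 - x - y), \qquad
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \quad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0
\;\Rightarrow\; x = y = \tfrac{1}{2}, \quad f = \tfrac{1}{4}.
```

Setting the gradient of the Lagrangian to zero in all variables (including lambda, which recovers the constraint) turns the constrained problem into an unconstrained stationary-point search.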
A single model struggles with heterogeneous data. Route inputs to specialized expert sub-models via a learned gating function.
High-dimensional data is hard to visualize and process. Project onto directions of maximum variance to reduce dimensionality.
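PCA via the SVD of the centered data matrix, in a short sketch (the synthetic data lies near the direction (1, 3), so that should be the first principal axis):

```python
import numpy as np

def pca(X, n_components):
    # Center, then take the top right-singular vectors as principal axes.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components

rng = np.random.default_rng(0)
t = rng.normal(size=(500, 1))
X = np.hstack([t, 3.0 * t]) + 0.01 * rng.normal(size=(500, 2))
Z, components = pca(X, n_components=1)
```

Using the SVD avoids forming the covariance matrix explicitly, which is both more numerically stable and cheaper when features outnumber samples.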
Discriminative models can't model the data-generating process. Learn the joint distribution for generation, missing data handling, and outlier detection.
A single Gaussian can't model multi-modal data. Use a weighted mixture of Gaussians to represent complex distributions.
Operating in high-dimensional feature spaces is computationally expensive. Use the kernel trick to compute inner products implicitly.
How to efficiently compute gradients in multi-layer networks? Apply the chain rule layer-by-layer from output to input.
How to measure the difference between predicted and true distributions? Compute the expected log-loss under the true distribution.
How to directly assign inputs to classes without modeling probabilities? Learn functions that map inputs to class-specific scores.
How to apply regression to classification? Fit class labels as continuous targets, though it has known limitations for multi-class.
How to learn a linear decision boundary from data? Iteratively adjust weights on misclassified examples.
How to choose between models of different complexity? Compare marginal likelihoods which automatically penalize unnecessary complexity.
How to make optimal decisions under uncertainty? Combine probability estimates with loss functions to minimize expected risk.