A comprehensive reference for all technical terms used throughout the book.
- Bold terms: Primary definitions
- Italics: Related concepts
- [Chapter X]: Where concept is introduced
- → : See also
**Activation Function**
Definition: A non-linear function applied to neuron outputs, giving neural networks the ability to learn non-linear patterns.
Common Types:
- ReLU (Rectified Linear Unit): f(x) = max(0, x)
- Sigmoid: f(x) = 1 / (1 + e^(-x))
- Tanh: f(x) = tanh(x)
- GELU: f(x) = x * Φ(x), where Φ is the Gaussian CDF
Why It Matters: Without activation functions, neural networks would be limited to linear transformations, unable to learn complex patterns.
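Code Sketch: a minimal PyTorch rendering of the four functions above (PyTorch is an assumption here, not the book's required framework):
```python
import torch

def relu(x):    return torch.clamp(x, min=0.0)                # max(0, x)
def sigmoid(x): return 1 / (1 + torch.exp(-x))                # 1 / (1 + e^(-x))
def gelu(x):    return x * 0.5 * (1 + torch.erf(x / 2**0.5))  # x * Φ(x), exact Gaussian CDF

x = torch.linspace(-3, 3, 7)
print(relu(x), sigmoid(x), torch.tanh(x), gelu(x), sep="\n")
```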
[Chapter 1] → Neuron, Forward Propagation
**Adam**
Definition: Adaptive Moment Estimation optimizer that combines momentum and adaptive learning rates.
Formula:
m_t = β₁ * m_{t-1} + (1-β₁) * g_t
v_t = β₂ * v_{t-1} + (1-β₂) * g_t²
m̂_t = m_t / (1 - β₁^t),  v̂_t = v_t / (1 - β₂^t)   (bias correction)
θ_t = θ_{t-1} - α * m̂_t / (√v̂_t + ε)
Key Features:
- Maintains moving averages of gradients
- Adapts learning rate per parameter
- Generally works well out-of-the-box
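Code Sketch: one Adam update for a single tensor, following the formulas above (illustrative only; in practice you would use `torch.optim.Adam`):
```python
import torch

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter tensor."""
    m = b1 * m + (1 - b1) * g          # moving average of gradients
    v = b2 * v + (1 - b2) * g**2       # moving average of squared gradients
    m_hat = m / (1 - b1**t)            # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)
    return theta, m, v

theta, m, v = torch.ones(3), torch.zeros(3), torch.zeros(3)
g = torch.tensor([0.1, -0.2, 0.3])     # pretend gradient
theta, m, v = adam_step(theta, g, m, v, t=1)
```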
[Chapter 2] → Optimizer, Gradient Descent, Learning Rate
**Attention**
Definition: A mechanism that allows models to focus on relevant parts of input sequences by computing weighted combinations based on learned importance scores.
Core Computation:
Attention(Q, K, V) = softmax(QK^T / √d_k) * V
Components:
- Query (Q): What we're looking for
- Key (K): What we're matching against
- Value (V): What we retrieve
Types:
- Self-attention: Attention within same sequence
- Cross-attention: Attention between two sequences
- Multi-head attention: Multiple attention operations in parallel
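Code Sketch: the core computation in a few lines of PyTorch (shapes and the optional mask are illustrative):
```python
import math
import torch

def attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V"""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if mask is not None:                       # e.g. causal mask for decoders
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

x = torch.randn(1, 5, 16)                      # (batch, seq_len, d_k)
out = attention(x, x, x)                       # self-attention: Q = K = V
```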
[Chapter 5] → Query, Key, Value, Transformer
**Autoregressive**
Definition: A property where the model generates outputs sequentially, with each token depending on all previous tokens.
Example: In language generation, when predicting "the cat sat on the ___", the next word depends on all previous words.
Contrast with:
- Non-autoregressive: Generates all outputs in parallel
- Masked LM: Can attend to both past and future
[Chapter 11] → Causal Language Model, Generation
**Backpropagation**
Definition: An algorithm for computing gradients of the loss with respect to model parameters using the chain rule of calculus.
Process:
- Forward pass: Compute predictions
- Compute loss
- Backward pass: Compute gradients
- Update parameters
Why Critical: Enables training deep neural networks by efficiently computing how to adjust parameters to reduce loss.
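Code Sketch: the four steps map directly onto a standard PyTorch training step (a generic sketch, not the book's training loop):
```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

pred = model(x)                                  # 1. forward pass: compute predictions
loss = torch.nn.functional.mse_loss(pred, y)     # 2. compute loss
optimizer.zero_grad()
loss.backward()                                  # 3. backward pass: chain-rule gradients
optimizer.step()                                 # 4. update parameters
```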
[Chapter 2] → Gradient Descent, Chain Rule, Training
**Batch Normalization**
Definition: A technique that normalizes layer inputs to have zero mean and unit variance, stabilizing and accelerating training.
Formula:
x_norm = (x - μ_batch) / √(σ_batch² + ε)
x_out = γ * x_norm + β
Benefits:
- Reduces internal covariate shift
- Allows higher learning rates
- Acts as regularization
TRM Usage: Often replaced by Layer Normalization in transformers.
[Chapter 6] → Layer Normalization, Training Stability
**Beam Search**
Definition: A search algorithm that maintains top-k most probable sequences at each decoding step, balancing between greedy and exhaustive search.
Parameters:
- Beam width (k): Number of candidates kept
- Length penalty: Prevents bias toward shorter sequences
Trade-offs:
- Larger beam → Better quality, slower
- Beam=1 → Greedy decoding
[Chapter 12] → Generation, Sampling, Inference
**BERT**
Definition: A transformer model trained with masked language modeling, able to attend to both past and future context.
Key Features:
- Bidirectional context
- Pre-trained on masked LM
- Fine-tuned for downstream tasks
Contrast with GPT: BERT is encoder-only (bidirectional), GPT is decoder-only (autoregressive).
[Chapter 11] → Masked Language Modeling, Transformer
**BPE (Byte-Pair Encoding)**
Definition: A tokenization algorithm that iteratively merges most frequent byte pairs to build vocabulary.
Process:
- Start with character-level tokens
- Find most frequent pair
- Merge into single token
- Repeat until desired vocab size
Advantages:
- Handles unknown words
- Balances vocab size and sequence length
- Language-agnostic
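Code Sketch: a toy merge loop over a two-word corpus (illustrative; real tokenizers add byte-level handling and special tokens):
```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """corpus: list of words as tuples of symbols, e.g. ('l', 'o', 'w')."""
    vocab = Counter(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():           # count adjacent symbol pairs
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)            # most frequent pair
        merges.append((a, b))
        new_vocab = Counter()
        for word, freq in vocab.items():            # merge it everywhere
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    merged.append(a + b); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe([("l","o","w"), ("l","o","w","e","r")], 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```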
[Chapter 9] → Tokenization, Vocabulary
**Causal Language Modeling**
Definition: Training objective where model predicts next token given previous tokens, with causal masking preventing attention to future tokens.
Loss: Cross-entropy between predicted and actual next token.
Architecture: Decoder-only transformer (like GPT).
Applications:
- Text generation
- Code completion
- Conversational AI
[Chapter 11] → Autoregressive, Masked Attention, GPT
**Chain Rule**
Definition: Calculus rule for computing derivatives of composite functions.
Formula:
d/dx[f(g(x))] = f'(g(x)) * g'(x)
In Neural Networks: Enables backpropagation by computing gradients layer-by-layer.
[Chapter 2] → Backpropagation, Gradient
**Context Length**
Definition: Maximum number of tokens a model can process in a single forward pass.
Limitations:
- Determined by positional encoding
- Memory complexity: O(n²) for standard attention
- Computational cost increases quadratically
Extensions:
- Sliding window attention
- Sparse attention patterns
- Alternative position encodings (RoPE, ALiBi)
[Chapter 13] → Attention, Positional Encoding, Context Window
**Cross-Entropy Loss**
Definition: Loss function measuring difference between predicted probability distribution and true distribution.
Formula (classification):
L = -∑ y_i * log(ŷ_i)
Formula (language modeling):
L = -log P(w_t | w_1, ..., w_{t-1})
Properties:
- Higher penalty for confident wrong predictions
- Commonly used for classification and LMs
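Code Sketch: cross-entropy for one prediction in PyTorch; `F.cross_entropy` applies log-softmax internally:
```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 10)              # scores over a 10-token vocabulary
target = torch.tensor([3])               # index of the true class / next token

loss = F.cross_entropy(logits, target)   # -log P(target)
manual = -F.log_softmax(logits, dim=-1)[0, 3]
assert torch.allclose(loss, manual)
```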
[Chapter 2, 11] → Loss Function, Training, Perplexity
**Decoder**
Definition: Transformer component that generates output sequence autoregressively, attending to encoder outputs (if present) and previous decoder outputs.
Components:
- Masked self-attention
- Cross-attention (if encoder-decoder)
- Feed-forward network
Architecture Types:
- Decoder-only (GPT): Causal LM
- Encoder-decoder (T5): Seq2seq tasks
[Chapter 6] → Transformer, Encoder, Attention
**Distillation**
Definition: Training technique where a smaller "student" model learns to mimic a larger "teacher" model.
Loss Components:
- Hard loss: Cross-entropy with true labels
- Soft loss: KL divergence with teacher predictions
- Feature loss: Match intermediate representations
Benefits for TRMs:
- Compress large models into tiny ones
- Transfer knowledge without full retraining
- Improve small model performance
[Chapter 15] → Compression, Teacher-Student, Training
**Dropout**
Definition: Regularization technique that randomly sets neurons to zero during training with probability p.
Purpose:
- Prevents overfitting
- Creates ensemble effect
- Encourages robust features
Usage in TRMs:
- Applied to attention weights
- Applied to feed-forward layers
- Typically p=0.1 to 0.3
[Chapter 10] → Regularization, Training, Overfitting
**Embedding**
Definition: Dense vector representation of discrete tokens, mapping vocabulary items to continuous space.
Properties:
- Learned during training
- Dimension typically 128-1024
- Captures semantic relationships
Types:
- Token embeddings: Word/subword vectors
- Position embeddings: Encode sequence position
- Task embeddings: Multi-task conditioning
[Chapter 3] → Word2Vec, Token, Representation Learning
**Encoder**
Definition: Transformer component that processes input sequence bidirectionally to create contextualized representations.
Components:
- Self-attention
- Feed-forward network
- Layer normalization
- Residual connections
Use Cases:
- BERT-style models
- Encoder-decoder architectures
- Classification tasks
[Chapter 6] → Transformer, Decoder, Self-Attention
**Feed-Forward Network (FFN)**
Definition: Multi-layer perceptron applied position-wise in transformer blocks.
Architecture:
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Typical Dimensions:
- Input/output: d_model
- Hidden: 4 * d_model (standard)
- TRM: 2 * d_model (more efficient)
Purpose:
- Add non-linearity
- Increase model capacity
- Process each position independently
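Code Sketch: a position-wise FFN module matching the formula above (PyTorch, illustrative):
```python
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = max(0, xW₁ + b₁)W₂ + b₂, applied independently at each position."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),  # expand: 4*d_model standard, 2*d_model in a TRM
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),  # project back to d_model
        )
    def forward(self, x):
        return self.net(x)
```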
[Chapter 6] → Transformer, MLP, Activation Function
**Fine-Tuning**
Definition: Adapting a pre-trained model to a specific task by continuing training on task-specific data.
Types:
- Full fine-tuning: Update all parameters
- Parameter-efficient: Update small subset (LoRA, adapters)
- Prompt tuning: Only update soft prompts
For TRMs: Essential for domain adaptation while maintaining efficiency.
[Chapter 16] → Transfer Learning, LoRA, Adapters
**Gradient**
Definition: Vector of partial derivatives indicating direction and magnitude of steepest increase in loss.
Formula:
∇L = [∂L/∂θ₁, ∂L/∂θ₂, ..., ∂L/∂θₙ]
Uses:
- Parameter updates: θ_new = θ_old - α * ∇L
- Its negative direction indicates how to reduce loss
Challenges:
- Vanishing gradients: Become too small
- Exploding gradients: Become too large
[Chapter 2] → Backpropagation, Optimizer, Training
**Gradient Clipping**
Definition: Technique to prevent exploding gradients by capping gradient magnitude.
Methods:
- By value: Clip each gradient component
- By norm: Scale entire gradient vector
Formula (by norm):
if ||g|| > threshold:
g = g * (threshold / ||g||)
Essential for: Training small models, RNNs, transformers.
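Code Sketch: clip-by-norm via PyTorch's built-in utility (the threshold of 1.0 is an arbitrary example value):
```python
import torch

model = torch.nn.Linear(10, 10)
loss = model(torch.randn(2, 10)).sum()
loss.backward()

# Rescale all gradients so their combined L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```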
[Chapter 10] → Training Stability, Gradient, Optimization
**GPT**
Definition: Decoder-only transformer trained autoregressively for causal language modeling.
Architecture:
- Masked self-attention
- No encoder
- Autoregressive generation
Variants: GPT-2, GPT-3, GPT-4
Relevance to TRMs: TRMs often use GPT-style architecture with parameter sharing.
[Chapter 6, 7] → Causal LM, Transformer, Decoder
**Hidden State**
Definition: Internal representation of input at intermediate layers of a neural network.
Properties:
- Captures learned features
- Dimension typically 256-4096
- Can be interpreted or visualized
Uses:
- Feed into next layer
- Task-specific prediction heads
- Feature extraction
[Chapter 1, 4] → RNN, LSTM, Representation
**Hyperparameter**
Definition: Configuration value set before training (not learned from data).
Examples:
- Learning rate
- Batch size
- Number of layers
- Hidden dimension
- Dropout rate
Tuning: Essential for model performance. See Hyperparameter Guide.
[Chapter 2, 10] → Training, Optimization
**Inference**
Definition: Using a trained model to make predictions on new data.
Phases:
- Load model weights
- Process input
- Forward pass
- Generate output
Optimization:
- Quantization
- Pruning
- KV-cache
- Batch processing
[Chapter 12, 23] → Generation, Deployment, Serving
**Key**
Definition: In the attention mechanism, the representation used to match against queries.
Computation: K = XW_K
Role: Determines what information is relevant to each query.
[Chapter 5] → Attention, Query, Value
**Knowledge Distillation**
→ See Distillation
**KV-Cache**
Definition: Optimization technique that stores previously computed key and value vectors during autoregressive generation.
Why Needed:
- Autoregressive generation recomputes attention for all previous tokens
- Caching avoids redundant computation
Memory Trade-off:
- Speeds up generation significantly
- Increases memory usage: O(n * d_model)
[Chapter 12] → Generation, Inference, Optimization
**Layer Normalization**
Definition: Normalization technique that normalizes across the feature dimension for each example.
Formula:
x_norm = (x - μ_layer) / √(σ_layer² + ε)
x_out = γ * x_norm + β
Advantages over Batch Norm:
- Works with batch size = 1
- No train/test discrepancy
- Better for sequences
Standard in: Transformers, TRMs
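Code Sketch: the formula above computed by hand and checked against `torch.nn.LayerNorm` (γ = 1, β = 0 at initialization):
```python
import torch

x = torch.randn(2, 5, 16)                          # (batch, seq, features)
mu = x.mean(dim=-1, keepdim=True)                  # stats per position, per example
var = x.var(dim=-1, keepdim=True, unbiased=False)
x_norm = (x - mu) / torch.sqrt(var + 1e-5)

ln = torch.nn.LayerNorm(16)                        # learnable γ (weight) and β (bias)
assert torch.allclose(ln(x), x_norm, atol=1e-5)
```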
[Chapter 6] → Normalization, Training Stability
**Learning Rate**
Definition: Hyperparameter controlling size of parameter updates during training.
Formula: θ_new = θ_old - lr * gradient
Typical Values: 1e-5 to 1e-2
Scheduling:
- Constant
- Warmup then decay
- Cosine annealing
- Cyclical
Critical for: Training stability and convergence.
[Chapter 2, 10] → Optimizer, Training, Hyperparameter
**LoRA (Low-Rank Adaptation)**
Definition: Parameter-efficient fine-tuning method that adds trainable low-rank matrices to frozen pre-trained weights.
Formula:
W' = W_frozen + AB
where A ∈ ℝ^(d×r), B ∈ ℝ^(r×k), r ≪ d, k
Benefits:
- Drastically fewer trainable parameters
- Multiple task adaptations without duplicating base model
- Minimal performance loss
Ideal for: Fine-tuning TRMs to specific domains.
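Code Sketch: a minimal LoRA wrapper around a frozen linear layer (the class name, init scale, and `alpha` convention are illustrative assumptions):
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W_frozen^T + (x A B) * (alpha / r)"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))  # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```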
[Chapter 16] → Fine-Tuning, Adapters, Efficiency
**Loss Function**
Definition: Function measuring difference between model predictions and true values, guiding training.
Common Types:
- MSE: Regression tasks
- Cross-entropy: Classification, LM
- Contrastive: Embedding learning
Role: Defines what model optimizes during training.
[Chapter 2, 11] → Training, Cross-Entropy, Optimization
**LSTM (Long Short-Term Memory)**
Definition: RNN variant with gating mechanisms to better capture long-term dependencies.
Gates:
- Forget gate: What to discard
- Input gate: What to add
- Output gate: What to output
Advantages: Mitigates vanishing gradient problem in RNNs.
In TRMs: Largely replaced by attention, but concepts remain relevant.
[Chapter 4] → RNN, Sequence Modeling, Gates
**Masked Language Modeling**
Definition: Training objective where random tokens are masked and model predicts them using bidirectional context.
Example:
- Input: "The [MASK] sat on the mat"
- Target: Predict "cat"
Used in: BERT, RoBERTa
Contrast with: Causal LM (GPT-style)
[Chapter 11] → Training Objective, BERT, Pre-training
**Multi-Head Attention**
Definition: Running multiple attention operations in parallel, each with different learned projections.
Formula:
MultiHead(Q,K,V) = Concat(head₁, ..., headₕ)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Benefits:
- Attend to different representation subspaces
- Capture diverse relationships
- Increase model capacity
Typical: 8-16 heads in standard transformers, 4-8 in TRMs.
[Chapter 5] → Attention, Query, Key, Value
**Normalization**
Definition: Technique to standardize layer inputs/outputs to improve training stability.
Types:
- Batch Normalization: Across batch
- Layer Normalization: Across features
- RMS Normalization: Simplified layer norm
Why Important: Prevents internal covariate shift, enables higher learning rates.
[Chapter 6] → Layer Normalization, Training
**Nucleus Sampling (Top-p)**
Definition: Sampling technique that selects from smallest token set whose cumulative probability exceeds threshold p.
Algorithm:
- Sort tokens by probability
- Find smallest set with P(set) ≥ p
- Sample from this set
Advantages:
- Adapts to probability distribution
- More diverse than top-k
- Avoids low-probability tokens
Typical p: 0.9 to 0.95
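Code Sketch: the algorithm above for a single decoding step (illustrative; batched implementations differ):
```python
import torch

def top_p_sample(logits, p=0.9):
    """Sample from the smallest token set with cumulative probability >= p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p))) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()   # renormalize
    return sorted_idx[torch.multinomial(kept, 1)].item()

token = top_p_sample(torch.randn(100), p=0.9)
```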
[Chapter 12] → Generation, Sampling, Inference
**Optimizer**
Definition: Algorithm for updating model parameters based on gradients to minimize loss.
Common Types:
- SGD: Stochastic Gradient Descent
- Adam: Adaptive Moment Estimation
- AdamW: Adam with decoupled weight decay
Components:
- Gradient computation
- Update rule
- Learning rate
- Momentum (optional)
[Chapter 2] → Adam, Gradient Descent, Training
**Overfitting**
Definition: When model learns training data too well, including noise, hurting generalization.
Symptoms:
- Low training loss, high validation loss
- Perfect training accuracy, poor test accuracy
Solutions:
- More data
- Regularization (dropout, weight decay)
- Early stopping
- Data augmentation
[Chapter 10] → Regularization, Generalization, Training
**Parameter**
Definition: Learnable weights in neural network, updated during training.
Types:
- Weight matrices: Linear transformations
- Biases: Additive offsets
- Embeddings: Token representations
Count: Critical metric for model size. TRMs aim for < 50M parameters.
[Chapter 1] → Training, Weight, Bias
**Parameter Sharing**
Definition: Using same parameters multiple times in model architecture.
In TRMs:
- Recursive layers reuse weights
- Input/output embeddings tied
- Reduces total parameter count
Trade-offs:
- Fewer parameters
- May limit expressiveness
- Requires more recursion depth
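Code Sketch: a hypothetical weight-tied stack; `RecursiveStack` and the fixed depth are illustrative assumptions, not the book's TRM implementation:
```python
import torch.nn as nn

class RecursiveStack(nn.Module):
    """One block's parameters reused at every 'layer'."""
    def __init__(self, block: nn.Module, depth: int):
        super().__init__()
        self.block = block        # a single set of parameters...
        self.depth = depth        # ...applied `depth` times

    def forward(self, x):
        for _ in range(self.depth):
            x = self.block(x)     # same weights at every recursion step
        return x
```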
[Chapter 8] → Recursive, TRM, Efficiency
**Perplexity**
Definition: Evaluation metric for language models measuring average branching factor (uncertainty) per token.
Formula:
PPL = exp(-1/N ∑ log P(w_i | context))
Interpretation:
- Lower is better
- PPL of 20 ≈ model as uncertain as a uniform choice among 20 tokens
- Common range: 10-100 depending on task
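Code Sketch: perplexity as the exponential of mean cross-entropy (random tensors stand in for real model outputs):
```python
import torch
import torch.nn.functional as F

logits = torch.randn(50, 1000)            # per-token predictions over a 1000-token vocab
targets = torch.randint(0, 1000, (50,))   # actual next tokens

nll = F.cross_entropy(logits, targets)    # mean -log P(w_i | context)
ppl = torch.exp(nll)                      # perplexity = exp(cross-entropy)
print(ppl.item())
```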
[Chapter 11, 24] → Evaluation, Language Modeling, Cross-Entropy
**Positional Encoding**
Definition: Adding position information to embeddings so model can distinguish token order.
Types:
- Sinusoidal: Fixed, based on sine/cosine
- Learned: Trainable embeddings
- Relative: RoPE, ALiBi
Necessity: Transformers have no inherent notion of sequence order.
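Code Sketch: the classic fixed sinusoidal encoding (as in "Attention Is All You Need"); learned and relative variants differ:
```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """Fixed sine/cosine positional encodings, added to token embeddings."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)              # (positions, d_model)
```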
[Chapter 6, 13] → Transformer, Embedding, Context Length
**Pre-training**
Definition: Training model on large general corpus before fine-tuning on specific task.
Objectives:
- Causal LM (GPT)
- Masked LM (BERT)
- Denoising (T5)
Benefits:
- Learns general language understanding
- Improves downstream performance
- Enables transfer learning
[Chapter 10, 11] → Fine-Tuning, Transfer Learning
**Pruning**
Definition: Removing unimportant parameters to reduce model size.
Types:
- Unstructured: Remove individual weights
- Structured: Remove entire neurons/heads
- Magnitude-based: Remove smallest weights
Process:
- Train full model
- Identify unimportant parameters
- Remove them
- Fine-tune remaining
[Chapter 17] → Compression, Quantization, Efficiency
**Quantization**
Definition: Reducing numerical precision of model parameters (e.g., float32 → int8).
Types:
- Post-Training Quantization (PTQ): After training
- Quantization-Aware Training (QAT): During training
Benefits:
- 2-4x smaller models
- 2-4x faster inference
- Lower memory usage
Trade-off: Slight accuracy loss (typically <2%)
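Code Sketch: symmetric int8 post-training quantization of a single tensor (a simplification; real PTQ also calibrates activations):
```python
import torch

def quantize_int8(w):
    """Map float weights in [-max, max] onto int8 values in [-127, 127]."""
    scale = w.abs().max() / 127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(256, 256)                 # float32 weights
q, scale = quantize_int8(w)
w_restored = q.float() * scale            # dequantize for comparison
print((w - w_restored).abs().max())       # small rounding error
```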
[Chapter 17] → Compression, Efficiency, Deployment
**Query**
Definition: In the attention mechanism, the representation that searches for relevant information.
Computation: Q = XW_Q
Role: Determines what information to retrieve from keys/values.
[Chapter 5] → Attention, Key, Value
**Recurrent Neural Network (RNN)**
Definition: Neural network that processes sequences by maintaining hidden state and applying same operation at each time step.
Formula:
h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
Limitations:
- Vanishing/exploding gradients
- Sequential processing (no parallelization)
- Limited long-term memory
Evolution: LSTM → GRU → Transformer → TRM
[Chapter 4] → LSTM, Sequence Modeling
**Regularization**
Definition: Techniques to prevent overfitting and improve generalization.
Methods:
- Dropout: Random neuron deactivation
- Weight decay: L2 penalty on parameters
- Data augmentation: Expand training data
- Early stopping: Stop when validation performance stops improving
[Chapter 10] → Dropout, Overfitting, Training
**Residual Connection**
Definition: Adding a layer's input directly to its output, enabling gradient flow through deep networks.
Formula: output = F(x) + x
Benefits:
- Eases gradient flow
- Enables deeper networks
- Provides identity mapping
Standard in: Transformers, TRMs, ResNets
[Chapter 6] → Transformer, Training Stability
**RoPE (Rotary Position Embedding)**
Definition: Relative position encoding that applies rotary transformations to queries and keys.
Advantages:
- Encodes relative positions naturally
- Enables context length extrapolation
- More efficient than learned embeddings
Used in: Modern efficient transformers, many TRMs
[Chapter 13] → Positional Encoding, Context Length, Attention
**Sampling**
Definition: Generating tokens from probability distribution rather than always picking highest probability.
Methods:
- Greedy: Always pick argmax
- Temperature: Scale logits before softmax
- Top-k: Sample from k most likely
- Top-p (nucleus): Sample from top cumulative p
Trade-offs:
- Higher randomness → More diverse but less coherent
- Lower randomness → More coherent but repetitive
[Chapter 12] → Generation, Temperature, Inference
**Self-Attention**
Definition: Attention mechanism where queries, keys, and values all come from the same sequence.
Purpose: Model dependencies within sequence.
Formula: Attention(X, X, X)
Variants:
- Masked: Causal (no future tokens)
- Bidirectional: Full context
[Chapter 5, 6] → Attention, Transformer
**Sequence-to-Sequence (Seq2Seq)**
Definition: Model architecture that maps input sequence to output sequence, potentially of different length.
Components:
- Encoder: Process input
- Decoder: Generate output
Applications:
- Translation
- Summarization
- Question answering
[Chapter 6] → Encoder, Decoder, Transformer
**Softmax**
Definition: Function that converts logits to probability distribution.
Formula:
softmax(x_i) = exp(x_i) / ∑_j exp(x_j)
Properties:
- Output sums to 1
- Differentiable
- Amplifies differences
Uses: Attention weights, output probabilities
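Code Sketch: a numerically stable implementation (subtracting max(x) leaves the result unchanged):
```python
import torch

def softmax(x):
    z = torch.exp(x - x.max())   # shift for numerical stability
    return z / z.sum()

p = softmax(torch.tensor([2.0, 1.0, 0.1]))
print(p, p.sum())                # probabilities summing to 1
```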
[Chapter 1, 5] → Activation Function, Attention
**Temperature**
Definition: Parameter that controls randomness in sampling by scaling logits before softmax.
Formula: softmax(logits / T)
Effects:
- T = 1: No change
- T < 1: More confident (sharper distribution)
- T > 1: More random (flatter distribution)
Typical values: 0.7 (focused) to 1.2 (creative)
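Code Sketch: the effect of T on the same logits:
```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5])
for T in (0.7, 1.0, 1.2):
    print(T, torch.softmax(logits / T, dim=-1))  # lower T → sharper, higher T → flatter
```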
[Chapter 12] → Sampling, Generation, Softmax
**Tokenization**
Definition: Process of splitting text into discrete units (tokens) that model can process.
Levels:
- Character: Individual characters
- Subword: BPE, WordPiece
- Word: Whole words
Trade-offs:
- Smaller units → Longer sequences, better handling of unknown words
- Larger units → Shorter sequences, larger vocab
[Chapter 3, 9] → BPE, Vocabulary, Embedding
**Transformer**
Definition: Neural architecture based on self-attention, processing sequences in parallel rather than sequentially.
Components:
- Multi-head self-attention
- Position-wise FFN
- Positional encoding
- Layer normalization
- Residual connections
Variants:
- Encoder-only (BERT)
- Decoder-only (GPT)
- Encoder-decoder (T5)
Why Revolutionary: Parallelizable, scalable, state-of-the-art performance.
[Chapter 6] → Attention, Encoder, Decoder
**Transfer Learning**
Definition: Leveraging knowledge learned on one task to improve performance on a related task.
Process:
- Pre-train on large general dataset
- Fine-tune on specific task
- Optionally freeze some layers
Benefits:
- Less task-specific data needed
- Better performance
- Faster training
[Chapter 10, 16] → Pre-training, Fine-Tuning
**TRM (Tiny Recursive Model)**
Definition: Small language model using recursive parameter sharing to maximize efficiency.
Key Features:
- < 50M parameters
- Recursive layers
- Parameter sharing
- Optimized for edge deployment
Design Principles:
- Weight sharing
- Efficient attention
- Compact vocabulary
- Optimized representations
[Chapter 7-12] → Recursive, Parameter Sharing, Efficiency
**Value**
Definition: In the attention mechanism, the representation that gets retrieved and combined.
Computation: V = XW_V
Role: Provides actual information returned by attention.
[Chapter 5] → Attention, Query, Key
**Vanishing Gradient**
Definition: Problem where gradients become extremely small during backpropagation, preventing learning in early layers.
Causes:
- Deep networks
- Activation functions (sigmoid, tanh)
- Long sequences (RNNs)
Solutions:
- ReLU activation
- Residual connections
- Layer normalization
- Gradient clipping
- LSTMs/GRUs
[Chapter 2, 4] → Backpropagation, LSTM, Training
**Vocabulary**
Definition: Set of all tokens model can process.
Size Trade-offs:
- Large vocab (50k+): Better coverage, bigger embeddings
- Small vocab (10k-30k): Fewer parameters, longer sequences
For TRMs: Often 10k-20k tokens to minimize embedding parameters.
[Chapter 3, 9] → Tokenization, BPE, Embedding
**Weight**
Definition: Learnable parameter in linear transformations of neural networks.
Representation: Typically matrices or tensors
Initialization: Critical for training success (Xavier, He, etc.)
Updates: Via gradient descent during training
[Chapter 1, 2] → Parameter, Training, Gradient
**Weight Decay**
Definition: Regularization technique adding penalty term to loss proportional to magnitude of weights.
Formula: L_total = L_task + λ * ||θ||²
Effect: Encourages smaller weights, reducing overfitting.
Typical λ: 1e-4 to 1e-2
[Chapter 10] → Regularization, L2, Training
**Word2Vec**
Definition: Neural network architecture for learning word embeddings by predicting context from words (Skip-gram) or words from context (CBOW).
Key Idea: Words in similar contexts have similar meanings.
Output: Dense vector representations of words
Influence: Foundation for modern embeddings
[Chapter 3] → Embedding, Representation Learning
**Acronyms**
- BERT: Bidirectional Encoder Representations from Transformers
- BPE: Byte-Pair Encoding
- CBOW: Continuous Bag of Words
- FFN: Feed-Forward Network
- GPT: Generative Pre-trained Transformer
- LM: Language Model
- LSTM: Long Short-Term Memory
- MLM: Masked Language Modeling
- MLP: Multi-Layer Perceptron
- NLP: Natural Language Processing
- PTQ: Post-Training Quantization
- QAT: Quantization-Aware Training
- ReLU: Rectified Linear Unit
- RNN: Recurrent Neural Network
- RoPE: Rotary Position Embedding
- SGD: Stochastic Gradient Descent
- TRM: Tiny Recursive Model
Related documents:
- Main Book: TRM-BOOK-COMPLETE.md
- Getting Started: GETTING-STARTED.md
- FAQ: FAQ.md
- References: REFERENCES.md
Last updated: October 9, 2025