Glossary: Building Tiny Recursive Models

A comprehensive reference for all technical terms used throughout the book.


How to Use This Glossary

  • Bold terms: Primary definitions
  • Italics: Related concepts
  • [Chapter X]: Where concept is introduced
  • → : See also

A

Activation Function

Definition: A non-linear function applied to neuron outputs to introduce non-linearity into neural networks.

Common Types:

  • ReLU (Rectified Linear Unit): f(x) = max(0, x)
  • Sigmoid: f(x) = 1 / (1 + e^(-x))
  • Tanh: f(x) = tanh(x)
  • GELU: f(x) = x * Φ(x) where Φ is Gaussian CDF

Why It Matters: Without activation functions, neural networks would be limited to linear transformations, unable to learn complex patterns.
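
A minimal NumPy sketch of the activations listed above; the GELU here uses the common tanh approximation rather than the exact Gaussian CDF:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of x * Φ(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), sigmoid(x), gelu(x))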

[Chapter 1] → Neuron, Forward Propagation


Adam Optimizer

Definition: Adaptive Moment Estimation optimizer that combines momentum and adaptive learning rates.

Formula:

m_t = β₁ * m_{t-1} + (1-β₁) * g_t
v_t = β₂ * v_{t-1} + (1-β₂) * g_t²
θ_t = θ_{t-1} - α * m_t / (√v_t + ε)

Key Features:

  • Maintains moving averages of gradients
  • Adapts learning rate per parameter
  • Generally works well out-of-the-box
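
A minimal NumPy sketch of one Adam update step following the formulas above; the bias correction of m and v, which the compact formulas omit, is included here:

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias correction (step t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v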

[Chapter 2] → Optimizer, Gradient Descent, Learning Rate


Attention Mechanism

Definition: A mechanism that allows models to focus on relevant parts of input sequences by computing weighted combinations based on learned importance scores.

Core Computation:

Attention(Q, K, V) = softmax(QK^T / √d_k) * V

Components:

  • Query (Q): What we're looking for
  • Key (K): What we're matching against
  • Value (V): What we retrieve

Types:

  • Self-attention: Attention within same sequence
  • Cross-attention: Attention between two sequences
  • Multi-head attention: Multiple attention operations in parallel
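
A minimal NumPy sketch of scaled dot-product self-attention for a single head, with no masking or batching:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq, seq) importance scores
    weights = softmax(scores, axis=-1)
    return weights @ V                        # weighted combination of values

seq_len, d_k = 4, 8
X = np.random.randn(seq_len, d_k)
out = attention(X, X, X)                      # self-attention: Q = K = V come from X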

[Chapter 5] → Query, Key, Value, Transformer


Autoregressive

Definition: A property where the model generates outputs sequentially, with each token depending on all previous tokens.

Example: In language generation, predicting "the cat sat on the ___", the next word depends on all previous words.

Contrast with:

  • Non-autoregressive: Generates all outputs in parallel
  • Masked LM: Can attend to both past and future

[Chapter 11] → Causal Language Model, Generation


B

Backpropagation

Definition: An algorithm for computing gradients of loss with respect to model parameters using the chain rule of calculus.

Process:

  1. Forward pass: Compute predictions
  2. Compute loss
  3. Backward pass: Compute gradients
  4. Update parameters

Why Critical: Enables training deep neural networks by efficiently computing how to adjust parameters to reduce loss.

[Chapter 2] → Gradient Descent, Chain Rule, Training


Batch Normalization

Definition: A technique that normalizes layer inputs to have zero mean and unit variance, stabilizing and accelerating training.

Formula:

x_norm = (x - μ_batch) / √(σ_batch² + ε)
x_out = γ * x_norm + β

Benefits:

  • Reduces internal covariate shift
  • Allows higher learning rates
  • Acts as regularization

TRM Usage: In TRMs, as in most transformers, batch normalization is typically replaced by layer normalization.

[Chapter 6] → Layer Normalization, Training Stability


Beam Search

Definition: A search algorithm that maintains top-k most probable sequences at each decoding step, balancing between greedy and exhaustive search.

Parameters:

  • Beam width (k): Number of candidates kept
  • Length penalty: Counteracts the bias toward shorter sequences

Trade-offs:

  • Larger beam → Better quality, slower
  • Beam=1 → Greedy decoding

[Chapter 12] → Generation, Sampling, Inference


BERT (Bidirectional Encoder Representations from Transformers)

Definition: A transformer model trained with masked language modeling, able to attend to both past and future context.

Key Features:

  • Bidirectional context
  • Pre-trained on masked LM
  • Fine-tuned for downstream tasks

Contrast with GPT: BERT is encoder-only (bidirectional), GPT is decoder-only (autoregressive).

[Chapter 11] → Masked Language Modeling, Transformer


Byte-Pair Encoding (BPE)

Definition: A tokenization algorithm that iteratively merges most frequent byte pairs to build vocabulary.

Process:

  1. Start with character-level tokens
  2. Find most frequent pair
  3. Merge into single token
  4. Repeat until desired vocab size

Advantages:

  • Handles unknown words
  • Balances vocab size and sequence length
  • Language-agnostic
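
A minimal sketch of the merge loop (steps 2-4 above) on a toy corpus; real tokenizers work on bytes and keep merges within word boundaries, so this is illustrative only:

from collections import Counter

def bpe_merges(words, num_merges):
    vocab = Counter(tuple(w) for w in words)      # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # step 2: most frequent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():          # step 3: merge it everywhere
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["low", "lower", "lowest", "newest"], num_merges=3))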

[Chapter 9] → Tokenization, Vocabulary


C

Causal Language Modeling

Definition: Training objective where model predicts next token given previous tokens, with causal masking preventing attention to future tokens.

Loss: Cross-entropy between predicted and actual next token.

Architecture: Decoder-only transformer (like GPT).

Applications:

  • Text generation
  • Code completion
  • Conversational AI

[Chapter 11] → Autoregressive, Masked Attention, GPT


Chain Rule

Definition: Calculus rule for computing derivatives of composite functions.

Formula:

d/dx[f(g(x))] = f'(g(x)) * g'(x)

In Neural Networks: Enables backpropagation by computing gradients layer-by-layer.

[Chapter 2] → Backpropagation, Gradient


Context Length

Definition: Maximum number of tokens a model can process in a single forward pass.

Limitations:

  • Determined by positional encoding
  • Memory complexity: O(n²) for standard attention
  • Computational cost increases quadratically

Extensions:

  • Sliding window attention
  • Sparse attention patterns
  • Alternative position encodings (RoPE, ALiBi)

[Chapter 13] → Attention, Positional Encoding, Context Window


Cross-Entropy Loss

Definition: Loss function measuring difference between predicted probability distribution and true distribution.

Formula (classification):

L = -∑ y_i * log(ŷ_i)

Formula (language modeling):

L = -log P(w_t | w_1, ..., w_{t-1})

Properties:

  • Higher penalty for confident wrong predictions
  • Commonly used for classification and LMs
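
A minimal NumPy sketch for a single classification example with a one-hot target, showing the penalty for confident wrong predictions:

import numpy as np

def cross_entropy(probs, target_index):
    # -log of the probability assigned to the true class
    return -np.log(probs[target_index])

probs = np.array([0.7, 0.2, 0.1])    # model's predicted distribution
print(cross_entropy(probs, 0))       # confident and correct -> small loss (~0.357)
print(cross_entropy(probs, 2))       # confident and wrong   -> large loss (~2.303)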

[Chapter 2, 11] → Loss Function, Training, Perplexity


D

Decoder

Definition: Transformer component that generates output sequence autoregressively, attending to encoder outputs (if present) and previous decoder outputs.

Components:

  • Masked self-attention
  • Cross-attention (if encoder-decoder)
  • Feed-forward network

Architecture Types:

  • Decoder-only (GPT): Causal LM
  • Encoder-decoder (T5): Seq2seq tasks

[Chapter 6] → Transformer, Encoder, Attention


Distillation (Knowledge Distillation)

Definition: Training technique where a smaller "student" model learns to mimic a larger "teacher" model.

Loss Components:

  1. Hard loss: Cross-entropy with true labels
  2. Soft loss: KL divergence with teacher predictions
  3. Feature loss: Match intermediate representations

Benefits for TRMs:

  • Compress large models into tiny ones
  • Transfer knowledge without full retraining
  • Improve small model performance

[Chapter 15] → Compression, Teacher-Student, Training


Dropout

Definition: Regularization technique that randomly sets neurons to zero during training with probability p.

Purpose:

  • Prevents overfitting
  • Creates ensemble effect
  • Encourages robust features

Usage in TRMs:

  • Applied to attention weights
  • Applied to feed-forward layers
  • Typically p=0.1 to 0.3

[Chapter 10] → Regularization, Training, Overfitting


E

Embedding

Definition: Dense vector representation of discrete tokens, mapping vocabulary items to continuous space.

Properties:

  • Learned during training
  • Dimension typically 128-1024
  • Captures semantic relationships

Types:

  • Token embeddings: Word/subword vectors
  • Position embeddings: Encode sequence position
  • Task embeddings: Multi-task conditioning

[Chapter 3] → Word2Vec, Token, Representation Learning


Encoder

Definition: Transformer component that processes input sequence bidirectionally to create contextualized representations.

Components:

  • Self-attention
  • Feed-forward network
  • Layer normalization
  • Residual connections

Use Cases:

  • BERT-style models
  • Encoder-decoder architectures
  • Classification tasks

[Chapter 6] → Transformer, Decoder, Self-Attention


F

Feed-Forward Network (FFN)

Definition: Multi-layer perceptron applied position-wise in transformer blocks.

Architecture:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

Typical Dimensions:

  • Input/output: d_model
  • Hidden: 4 * d_model (standard)
  • TRM: 2 * d_model (more efficient)

Purpose:

  • Add non-linearity
  • Increase model capacity
  • Process each position independently

[Chapter 6] → Transformer, MLP, Activation Function


Fine-Tuning

Definition: Adapting a pre-trained model to a specific task by continuing training on task-specific data.

Types:

  • Full fine-tuning: Update all parameters
  • Parameter-efficient: Update small subset (LoRA, adapters)
  • Prompt tuning: Only update soft prompts

For TRMs: Essential for domain adaptation while maintaining efficiency.

[Chapter 16] → Transfer Learning, LoRA, Adapters


G

Gradient

Definition: Vector of partial derivatives indicating direction and magnitude of steepest increase in loss.

Formula:

∇L = [∂L/∂θ₁, ∂L/∂θ₂, ..., ∂L/∂θₙ]

Uses:

  • Parameter updates: θ_new = θ_old - α * ∇L
  • Indicates how to reduce loss

Challenges:

  • Vanishing gradients: Become too small
  • Exploding gradients: Become too large

[Chapter 2] → Backpropagation, Optimizer, Training


Gradient Clipping

Definition: Technique to prevent exploding gradients by capping gradient magnitude.

Methods:

  • By value: Clip each gradient component
  • By norm: Scale entire gradient vector

Formula (by norm):

if ||g|| > threshold:
    g = g * (threshold / ||g||)

Essential for: Training small models, RNNs, transformers.
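
In PyTorch, clipping by norm is typically a single call between the backward pass and the optimizer step; a minimal sketch with a toy model:

import torch

model = torch.nn.Linear(8, 1)                  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(4, 8), torch.randn(4, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

loss.backward()                                # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if ||g|| > 1.0
optimizer.step()
optimizer.zero_grad()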

[Chapter 10] → Training Stability, Gradient, Optimization


GPT (Generative Pre-trained Transformer)

Definition: Decoder-only transformer trained autoregressively for causal language modeling.

Architecture:

  • Masked self-attention
  • No encoder
  • Autoregressive generation

Variants: GPT-2, GPT-3, GPT-4

Relevance to TRMs: TRMs often use GPT-style architecture with parameter sharing.

[Chapter 6, 7] → Causal LM, Transformer, Decoder


H

Hidden State

Definition: Internal representation of input at intermediate layers of a neural network.

Properties:

  • Captures learned features
  • Dimension typically 256-4096
  • Can be interpreted or visualized

Uses:

  • Feed into next layer
  • Task-specific prediction heads
  • Feature extraction

[Chapter 1, 4] → RNN, LSTM, Representation


Hyperparameter

Definition: Configuration value set before training (not learned from data).

Examples:

  • Learning rate
  • Batch size
  • Number of layers
  • Hidden dimension
  • Dropout rate

Tuning: Essential for model performance. See Hyperparameter Guide.

[Chapter 2, 10] → Training, Optimization


I

Inference

Definition: Using a trained model to make predictions on new data.

Phases:

  1. Load model weights
  2. Process input
  3. Forward pass
  4. Generate output

Optimization:

  • Quantization
  • Pruning
  • KV-cache
  • Batch processing

[Chapter 12, 23] → Generation, Deployment, Serving


K

Key

Definition: In attention mechanism, the representation used to match against queries.

Computation: K = XW_K

Role: Determines what information is relevant to each query.

[Chapter 5] → Attention, Query, Value


Knowledge Distillation

→ See Distillation


KV-Cache

Definition: Optimization technique that stores previously computed key and value vectors during autoregressive generation.

Why Needed:

  • Without caching, each generation step recomputes the keys and values for all previous tokens
  • Caching them avoids this redundant computation

Memory Trade-off:

  • Speeds up generation significantly
  • Increases memory usage: O(n * d_model)

[Chapter 12] → Generation, Inference, Optimization


L

Layer Normalization

Definition: Normalization technique that normalizes across feature dimension for each example.

Formula:

x_norm = (x - μ_layer) / √(σ_layer² + ε)
x_out = γ * x_norm + β

Advantages over Batch Norm:

  • Works with batch size = 1
  • No train/test discrepancy
  • Better for sequences

Standard in: Transformers, TRMs
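
A minimal NumPy sketch matching the formula above; γ and β are learned scale and shift parameters:

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each example across its feature dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma * x_norm + beta

d_model = 8
x = np.random.randn(2, d_model)               # (batch, features)
out = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))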

[Chapter 6] → Normalization, Training Stability


Learning Rate

Definition: Hyperparameter controlling size of parameter updates during training.

Formula: θ_new = θ_old - lr * gradient

Typical Values: 1e-5 to 1e-2

Scheduling:

  • Constant
  • Warmup then decay
  • Cosine annealing
  • Cyclical

Critical for: Training stability and convergence.

[Chapter 2, 10] → Optimizer, Training, Hyperparameter


LoRA (Low-Rank Adaptation)

Definition: Parameter-efficient fine-tuning method that adds trainable low-rank matrices to frozen pre-trained weights.

Formula:

W' = W_frozen + AB

where A ∈ ℝ^(d×r), B ∈ ℝ^(r×k), and r ≪ d, k

Benefits:

  • Drastically fewer trainable parameters
  • Multiple task adaptations without duplicating base model
  • Minimal performance loss

Ideal for: Fine-tuning TRMs to specific domains.
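
A minimal PyTorch sketch of wrapping a frozen linear layer with a low-rank update; the class name and the α scaling are illustrative, not a specific library's API:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))
        self.scale = alpha / r

    def forward(self, x):
        # frozen path plus low-rank update: (W + AB) x, up to the scaling factor
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(256, 256), r=8)   # only A and B are trainable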

[Chapter 16] → Fine-Tuning, Adapters, Efficiency


Loss Function

Definition: Function measuring difference between model predictions and true values, guiding training.

Common Types:

  • MSE: Regression tasks
  • Cross-entropy: Classification, LM
  • Contrastive: Embedding learning

Role: Defines what model optimizes during training.

[Chapter 2, 11] → Training, Cross-Entropy, Optimization


LSTM (Long Short-Term Memory)

Definition: RNN variant with gating mechanisms to better capture long-term dependencies.

Gates:

  • Forget gate: What to discard
  • Input gate: What to add
  • Output gate: What to output

Advantages: Mitigates vanishing gradient problem in RNNs.

In TRMs: Largely replaced by attention, but concepts remain relevant.

[Chapter 4] → RNN, Sequence Modeling, Gates


M

Masked Language Modeling (MLM)

Definition: Training objective where random tokens are masked and model predicts them using bidirectional context.

Example:

  • Input: "The [MASK] sat on the mat"
  • Target: Predict "cat"

Used in: BERT, RoBERTa

Contrast with: Causal LM (GPT-style)

[Chapter 11] → Training Objective, BERT, Pre-training


Multi-Head Attention

Definition: Running multiple attention operations in parallel, each with different learned projections.

Formula:

MultiHead(Q,K,V) = Concat(head₁, ..., headₕ)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Benefits:

  • Attend to different representation subspaces
  • Capture diverse relationships
  • Increase model capacity

Typical: 8-16 heads in standard transformers, 4-8 in TRMs.

[Chapter 5] → Attention, Query, Key, Value


N

Normalization

Definition: Technique to standardize layer inputs/outputs to improve training stability.

Types:

  • Batch Normalization: Across batch
  • Layer Normalization: Across features
  • RMS Normalization: Simplified layer norm

Why Important: Prevents internal covariate shift, enables higher learning rates.

[Chapter 6] → Layer Normalization, Training


Nucleus Sampling (Top-p)

Definition: Sampling technique that selects from smallest token set whose cumulative probability exceeds threshold p.

Algorithm:

  1. Sort tokens by probability
  2. Find smallest set with P(set) ≥ p
  3. Sample from this set

Advantages:

  • Adapts to probability distribution
  • More diverse than top-k
  • Avoids low-probability tokens

Typical p: 0.9 to 0.95
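
A minimal NumPy sketch of the algorithm above, operating on a probability vector over the vocabulary:

import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                              # sort tokens by probability
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1     # smallest set with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()        # renormalize within the set
    return rng.choice(nucleus, p=nucleus_probs)

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(nucleus_sample(probs, p=0.9))                              # samples from tokens {0, 1, 2}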

[Chapter 12] → Generation, Sampling, Inference


O

Optimizer

Definition: Algorithm for updating model parameters based on gradients to minimize loss.

Common Types:

  • SGD: Stochastic Gradient Descent
  • Adam: Adaptive Moment Estimation
  • AdamW: Adam with decoupled weight decay

Components:

  • Gradient computation
  • Update rule
  • Learning rate
  • Momentum (optional)

[Chapter 2] → Adam, Gradient Descent, Training


Overfitting

Definition: When model learns training data too well, including noise, hurting generalization.

Symptoms:

  • Low training loss, high validation loss
  • Perfect training accuracy, poor test accuracy

Solutions:

  • More data
  • Regularization (dropout, weight decay)
  • Early stopping
  • Data augmentation

[Chapter 10] → Regularization, Generalization, Training


P

Parameter

Definition: Learnable weights in neural network, updated during training.

Types:

  • Weight matrices: Linear transformations
  • Biases: Additive offsets
  • Embeddings: Token representations

Count: Critical metric for model size. TRMs aim for < 50M parameters.

[Chapter 1] → Training, Weight, Bias


Parameter Sharing

Definition: Using same parameters multiple times in model architecture.

In TRMs:

  • Recursive layers reuse weights
  • Input/output embeddings tied
  • Reduces total parameter count

Trade-offs:

  • Fewer parameters
  • May limit expressiveness
  • Requires more recursion depth
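
A minimal PyTorch sketch of weight reuse via recursion; the block contents and the fixed depth are illustrative of the general idea, not the book's exact TRM architecture:

import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    def __init__(self, d_model=128, depth=4):
        super().__init__()
        self.block = nn.Sequential(              # one set of weights...
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.LayerNorm(d_model),
        )
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):              # ...applied repeatedly
            x = x + self.block(x)                # residual connection aids gradient flow
        return x

model = RecursiveBlock()
print(sum(p.numel() for p in model.parameters()))   # parameter count is independent of depth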

[Chapter 8] → Recursive, TRM, Efficiency


Perplexity

Definition: Evaluation metric for language models measuring average branching factor (uncertainty) per token.

Formula:

PPL = exp(-1/N ∑ log P(w_i | context))

Interpretation:

  • Lower is better
  • PPL of 20 ≈ model is, on average, as uncertain as a uniform choice among 20 tokens
  • Common range: 10-100 depending on task
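
A minimal NumPy sketch computing perplexity from the probabilities the model assigned to the correct next tokens:

import numpy as np

def perplexity(token_probs):
    # token_probs[i] = P(w_i | context), the probability given to the true token
    return np.exp(-np.mean(np.log(token_probs)))

print(perplexity([0.05, 0.05, 0.05]))   # uniformly uncertain among 20 tokens -> PPL = 20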

[Chapter 11, 24] → Evaluation, Language Modeling, Cross-Entropy


Positional Encoding

Definition: Adding position information to embeddings so model can distinguish token order.

Types:

  • Sinusoidal: Fixed, based on sine/cosine
  • Learned: Trainable embeddings
  • Relative: RoPE, ALiBi

Necessity: Transformers have no inherent notion of sequence order.

[Chapter 6, 13] → Transformer, Embedding, Context Length


Pre-training

Definition: Training model on large general corpus before fine-tuning on specific task.

Objectives:

  • Causal LM (GPT)
  • Masked LM (BERT)
  • Denoising (T5)

Benefits:

  • Learns general language understanding
  • Improves downstream performance
  • Enables transfer learning

[Chapter 10, 11] → Fine-Tuning, Transfer Learning


Pruning

Definition: Removing unimportant parameters to reduce model size.

Types:

  • Unstructured: Remove individual weights
  • Structured: Remove entire neurons/heads
  • Magnitude-based: Remove smallest weights

Process:

  1. Train full model
  2. Identify unimportant parameters
  3. Remove them
  4. Fine-tune remaining

[Chapter 17] → Compression, Quantization, Efficiency


Q

Quantization

Definition: Reducing numerical precision of model parameters (e.g., float32 → int8).

Types:

  • Post-Training Quantization (PTQ): After training
  • Quantization-Aware Training (QAT): During training

Benefits:

  • 2-4x smaller models
  • 2-4x faster inference
  • Lower memory usage

Trade-off: Slight accuracy loss (typically <2%)

[Chapter 17] → Compression, Efficiency, Deployment


Query

Definition: In attention mechanism, the representation that searches for relevant information.

Computation: Q = XW_Q

Role: Determines what information to retrieve from keys/values.

[Chapter 5] → Attention, Key, Value


R

Recurrent Neural Network (RNN)

Definition: Neural network that processes sequences by maintaining hidden state and applying same operation at each time step.

Formula:

h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)

Limitations:

  • Vanishing/exploding gradients
  • Sequential processing (no parallelization)
  • Limited long-term memory

Evolution: LSTM → GRU → Transformer → TRM

[Chapter 4] → LSTM, Sequence Modeling


Regularization

Definition: Techniques to prevent overfitting and improve generalization.

Methods:

  • Dropout: Random neuron deactivation
  • Weight decay: L2 penalty on parameters
  • Data augmentation: Expand training data
  • Early stopping: Stop when validation performance stops improving

[Chapter 10] → Dropout, Overfitting, Training


Residual Connection (Skip Connection)

Definition: Adding input of layer directly to its output, enabling gradient flow through deep networks.

Formula: output = F(x) + x

Benefits:

  • Eases gradient flow
  • Enables deeper networks
  • Provides identity mapping

Standard in: Transformers, TRMs, ResNets

[Chapter 6] → Transformer, Training Stability


RoPE (Rotary Position Embedding)

Definition: Relative position encoding that applies rotary transformations to queries and keys.

Advantages:

  • Encodes relative positions naturally
  • Enables context length extrapolation
  • More efficient than learned embeddings

Used in: Modern efficient transformers, many TRMs

[Chapter 13] → Positional Encoding, Context Length, Attention


S

Sampling

Definition: Generating tokens from probability distribution rather than always picking highest probability.

Methods:

  • Greedy: Always pick argmax
  • Temperature: Scale logits before softmax
  • Top-k: Sample from k most likely
  • Top-p (nucleus): Sample from top cumulative p

Trade-offs:

  • Higher randomness → More diverse but less coherent
  • Lower randomness → More coherent but repetitive

[Chapter 12] → Generation, Temperature, Inference


Self-Attention

Definition: Attention mechanism where queries, keys, and values all come from same sequence.

Purpose: Model dependencies within sequence.

Formula: Attention(X, X, X)

Variants:

  • Masked: Causal (no future tokens)
  • Bidirectional: Full context

[Chapter 5, 6] → Attention, Transformer


Sequence-to-Sequence (Seq2Seq)

Definition: Model architecture that maps input sequence to output sequence, potentially of different length.

Components:

  • Encoder: Process input
  • Decoder: Generate output

Applications:

  • Translation
  • Summarization
  • Question answering

[Chapter 6] → Encoder, Decoder, Transformer


Softmax

Definition: Function that converts logits to probability distribution.

Formula:

softmax(x_i) = exp(x_i) / ∑_j exp(x_j)

Properties:

  • Output sums to 1
  • Differentiable
  • Amplifies differences

Uses: Attention weights, output probabilities
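
A minimal NumPy sketch; subtracting the maximum logit first is a standard numerical-stability trick and does not change the result:

import numpy as np

def softmax(x):
    x = x - np.max(x)              # stability: exp() of large logits would overflow
    e = np.exp(x)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.66, 0.24, 0.10], sums to 1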

[Chapter 1, 5] → Activation Function, Attention


T

Temperature

Definition: Parameter that controls randomness in sampling by scaling logits before softmax.

Formula: softmax(logits / T)

Effects:

  • T = 1: No change
  • T < 1: More confident (sharper distribution)
  • T > 1: More random (flatter distribution)

Typical values: 0.7 (focused) to 1.2 (creative)
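
A minimal NumPy sketch showing how temperature reshapes the same logits:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits / 0.7))   # T < 1: sharper, more confident
print(softmax(logits / 1.0))   # T = 1: unchanged
print(softmax(logits / 1.2))   # T > 1: flatter, more random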

[Chapter 12] → Sampling, Generation, Softmax


Tokenization

Definition: Process of splitting text into discrete units (tokens) that model can process.

Levels:

  • Character: Individual characters
  • Subword: BPE, WordPiece
  • Word: Whole words

Trade-offs:

  • Smaller units → Longer sequences, better unknown words
  • Larger units → Shorter sequences, larger vocab

[Chapter 3, 9] → BPE, Vocabulary, Embedding


Transformer

Definition: Neural architecture based on self-attention, processing sequences in parallel rather than sequentially.

Components:

  • Multi-head self-attention
  • Position-wise FFN
  • Positional encoding
  • Layer normalization
  • Residual connections

Variants:

  • Encoder-only (BERT)
  • Decoder-only (GPT)
  • Encoder-decoder (T5)

Why Revolutionary: Parallelizable, scalable, state-of-the-art performance.

[Chapter 6] → Attention, Encoder, Decoder


Transfer Learning

Definition: Leveraging knowledge learned on one task to improve performance on related task.

Process:

  1. Pre-train on large general dataset
  2. Fine-tune on specific task
  3. Optionally freeze some layers

Benefits:

  • Less task-specific data needed
  • Better performance
  • Faster training

[Chapter 10, 16] → Pre-training, Fine-Tuning


TRM (Tiny Recursive Model)

Definition: Small language model using recursive parameter sharing to maximize efficiency.

Key Features:

  • < 50M parameters
  • Recursive layers
  • Parameter sharing
  • Optimized for edge deployment

Design Principles:

  • Weight sharing
  • Efficient attention
  • Compact vocabulary
  • Optimized representations

[Chapter 7-12] → Recursive, Parameter Sharing, Efficiency


V

Value

Definition: In attention mechanism, the representation that gets retrieved and combined.

Computation: V = XW_V

Role: Provides actual information returned by attention.

[Chapter 5] → Attention, Query, Key


Vanishing Gradient

Definition: Problem where gradients become extremely small during backpropagation, preventing learning in early layers.

Causes:

  • Deep networks
  • Activation functions (sigmoid, tanh)
  • Long sequences (RNNs)

Solutions:

  • ReLU activation
  • Residual connections
  • Layer normalization
  • Gradient clipping
  • LSTMs/GRUs

[Chapter 2, 4] → Backpropagation, LSTM, Training


Vocabulary

Definition: Set of all tokens model can process.

Size Trade-offs:

  • Large vocab (50k+): Better coverage, bigger embeddings
  • Small vocab (10k-30k): Fewer parameters, longer sequences

For TRMs: Often 10k-20k tokens to minimize embedding parameters.

[Chapter 3, 9] → Tokenization, BPE, Embedding


W

Weight

Definition: Learnable parameter in linear transformations of neural networks.

Representation: Typically matrices or tensors

Initialization: Critical for training success (Xavier, He, etc.)

Updates: Via gradient descent during training

[Chapter 1, 2] → Parameter, Training, Gradient


Weight Decay

Definition: Regularization technique adding penalty term to loss proportional to magnitude of weights.

Formula: L_total = L_task + λ * ||θ||²

Effect: Encourages smaller weights, reducing overfitting.

Typical λ: 1e-4 to 1e-2

[Chapter 10] → Regularization, L2, Training


Word2Vec

Definition: Neural network architecture for learning word embeddings by predicting context from words (Skip-gram) or words from context (CBOW).

Key Idea: Words in similar contexts have similar meanings.

Output: Dense vector representations of words

Influence: Foundation for modern embeddings

[Chapter 3] → Embedding, Representation Learning


Acronyms Quick Reference

  • BPE: Byte-Pair Encoding
  • BERT: Bidirectional Encoder Representations from Transformers
  • CBOW: Continuous Bag of Words
  • FFN: Feed-Forward Network
  • GPT: Generative Pre-trained Transformer
  • LSTM: Long Short-Term Memory
  • LM: Language Model
  • MLM: Masked Language Modeling
  • MLP: Multi-Layer Perceptron
  • NLP: Natural Language Processing
  • PTQ: Post-Training Quantization
  • QAT: Quantization-Aware Training
  • ReLU: Rectified Linear Unit
  • RNN: Recurrent Neural Network
  • RoPE: Rotary Position Embedding
  • SGD: Stochastic Gradient Descent
  • TRM: Tiny Recursive Model

Related Resources


Last updated: October 9, 2025