A comprehensive reference for all technical terms used throughout the book.
- Bold terms: Primary definitions
- Italics: Related concepts
- [Chapter X]: Where concept is introduced
- → : See also
**Activation Function**
Definition: A non-linear function applied to neuron outputs, giving neural networks the ability to learn non-linear patterns.
Common Types:
- ReLU (Rectified Linear Unit): f(x) = max(0, x)
- Sigmoid: f(x) = 1 / (1 + e^(-x))
- Tanh: f(x) = tanh(x)
- GELU: f(x) = x * Φ(x), where Φ is the Gaussian CDF
Why It Matters: Without activation functions, neural networks would be limited to linear transformations, unable to learn complex patterns.
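Code Sketch: a minimal PyTorch rendering of the four functions above (PyTorch is an assumption here, not the book's required framework):
```python
import torch

def relu(x):    return torch.clamp(x, min=0.0)                # max(0, x)
def sigmoid(x): return 1 / (1 + torch.exp(-x))                # 1 / (1 + e^(-x))
def gelu(x):    return x * 0.5 * (1 + torch.erf(x / 2**0.5))  # x * Φ(x), exact Gaussian CDF

x = torch.linspace(-3, 3, 7)
print(relu(x), sigmoid(x), torch.tanh(x), gelu(x), sep="\n")
```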
[Chapter 1] → Neuron, Forward Propagation
**Adam**
Definition: Adaptive Moment Estimation optimizer that combines momentum and adaptive learning rates.
Formula:
m_t = β₁ * m_{t-1} + (1-β₁) * g_t
v_t = β₂ * v_{t-1} + (1-β₂) * g_t²
m̂_t = m_t / (1 - β₁^t),  v̂_t = v_t / (1 - β₂^t)   (bias correction)
θ_t = θ_{t-1} - α * m̂_t / (√v̂_t + ε)
Key Features:
- Maintains moving averages of gradients
- Adapts learning rate per parameter
- Generally works well out-of-the-box
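Code Sketch: one Adam update for a single tensor, following the formulas above (illustrative only; in practice you would use `torch.optim.Adam`):
```python
import torch

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter tensor."""
    m = b1 * m + (1 - b1) * g          # moving average of gradients
    v = b2 * v + (1 - b2) * g**2       # moving average of squared gradients
    m_hat = m / (1 - b1**t)            # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)
    return theta, m, v

theta, m, v = torch.ones(3), torch.zeros(3), torch.zeros(3)
g = torch.tensor([0.1, -0.2, 0.3])     # pretend gradient
theta, m, v = adam_step(theta, g, m, v, t=1)
```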
[Chapter 2] → Optimizer, Gradient Descent, Learning Rate
**Attention**
Definition: A mechanism that allows models to focus on relevant parts of input sequences by computing weighted combinations based on learned importance scores.
Core Computation:
Attention(Q, K, V) = softmax(QK^T / √d_k) * V
Components:
- Query (Q): What we're looking for
- Key (K): What we're matching against
- Value (V): What we retrieve
Types:
- Self-attention: Attention within same sequence
- Cross-attention: Attention between two sequences
- Multi-head attention: Multiple attention operations in parallel
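Code Sketch: the core computation in a few lines of PyTorch (shapes and the optional mask are illustrative):
```python
import math
import torch

def attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V"""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if mask is not None:                       # e.g. causal mask for decoders
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

x = torch.randn(1, 5, 16)                      # (batch, seq_len, d_k)
out = attention(x, x, x)                       # self-attention: Q = K = V
```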
[Chapter 5] → Query, Key, Value, Transformer
**Autoregressive**
Definition: A property where the model generates outputs sequentially, with each token depending on all previous tokens.
Example: In language generation, when predicting "the cat sat on the ___", the next word depends on all previous words.
Contrast with:
- Non-autoregressive: Generates all outputs in parallel
- Masked LM: Can attend to both past and future
[Chapter 11] → Causal Language Model, Generation
**Backpropagation**
Definition: An algorithm for computing gradients of the loss with respect to model parameters using the chain rule of calculus.
Process:
- Forward pass: Compute predictions
- Compute loss
- Backward pass: Compute gradients
- Update parameters
Why Critical: Enables training deep neural networks by efficiently computing how to adjust parameters to reduce loss.
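Code Sketch: the four steps map directly onto a standard PyTorch training step (a generic sketch, not the book's training loop):
```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

pred = model(x)                                  # 1. forward pass: compute predictions
loss = torch.nn.functional.mse_loss(pred, y)     # 2. compute loss
optimizer.zero_grad()
loss.backward()                                  # 3. backward pass: chain-rule gradients
optimizer.step()                                 # 4. update parameters
```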
[Chapter 2] → Gradient Descent, Chain Rule, Training
**Batch Normalization**
Definition: A technique that normalizes layer inputs to have zero mean and unit variance, stabilizing and accelerating training.
Formula:
x_norm = (x - μ_batch) / √(σ_batch² + ε)
x_out = γ * x_norm + β
Benefits:
- Reduces internal covariate shift
- Allows higher learning rates
- Acts as regularization
TRM Usage: Often replaced by Layer Normalization in transformers.
[Chapter 6] → Layer Normalization, Training Stability
**Beam Search**
Definition: A search algorithm that maintains top-k most probable sequences at each decoding step, balancing between greedy and exhaustive search.
Parameters:
- Beam width (k): Number of candidates kept
- Length penalty: Prevents bias toward shorter sequences
Trade-offs:
- Larger beam → Better quality, slower
- Beam=1 → Greedy decoding
[Chapter 12] → Generation, Sampling, Inference
**BERT**
Definition: A transformer model trained with masked language modeling, able to attend to both past and future context.
Key Features:
- Bidirectional context
- Pre-trained on masked LM
- Fine-tuned for downstream tasks
Contrast with GPT: BERT is encoder-only (bidirectional), GPT is decoder-only (autoregressive).
[Chapter 11] → Masked Language Modeling, Transformer
**BPE (Byte-Pair Encoding)**
Definition: A tokenization algorithm that iteratively merges most frequent byte pairs to build vocabulary.
Process:
- Start with character-level tokens
- Find most frequent pair
- Merge into single token
- Repeat until desired vocab size
Advantages:
- Handles unknown words
- Balances vocab size and sequence length
- Language-agnostic
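Code Sketch: a toy merge loop over a two-word corpus (illustrative; real tokenizers add byte-level handling and special tokens):
```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """corpus: list of words as tuples of symbols, e.g. ('l', 'o', 'w')."""
    vocab = Counter(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():           # count adjacent symbol pairs
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)            # most frequent pair
        merges.append((a, b))
        new_vocab = Counter()
        for word, freq in vocab.items():            # merge it everywhere
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    merged.append(a + b); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe([("l","o","w"), ("l","o","w","e","r")], 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```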
[Chapter 9] → Tokenization, Vocabulary
**Causal Language Modeling**
Definition: Training objective where model predicts next token given previous tokens, with causal masking preventing attention to future tokens.
Loss: Cross-entropy between predicted and actual next token.
Architecture: Decoder-only transformer (like GPT).
Applications:
- Text generation
- Code completion
- Conversational AI
[Chapter 11] → Autoregressive, Masked Attention, GPT
**Chain Rule**
Definition: Calculus rule for computing derivatives of composite functions.
Formula:
d/dx[f(g(x))] = f'(g(x)) * g'(x)
In Neural Networks: Enables backpropagation by computing gradients layer-by-layer.
[Chapter 2] → Backpropagation, Gradient
**Context Length**
Definition: Maximum number of tokens a model can process in a single forward pass.
Limitations:
- Determined by positional encoding
- Memory complexity: O(n²) for standard attention
- Computational cost increases quadratically
Extensions:
- Sliding window attention
- Sparse attention patterns
- Alternative position encodings (RoPE, ALiBi)
[Chapter 13] → Attention, Positional Encoding, Context Window
**Cross-Entropy Loss**
Definition: Loss function measuring difference between predicted probability distribution and true distribution.
Formula (classification):
L = -∑ y_i * log(ŷ_i)
Formula (language modeling):
L = -log P(w_t | w_1, ..., w_{t-1})
Properties:
- Higher penalty for confident wrong predictions
- Commonly used for classification and LMs
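Code Sketch: cross-entropy for one prediction in PyTorch; `F.cross_entropy` applies log-softmax internally:
```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 10)              # scores over a 10-token vocabulary
target = torch.tensor([3])               # index of the true class / next token

loss = F.cross_entropy(logits, target)   # -log P(target)
manual = -F.log_softmax(logits, dim=-1)[0, 3]
assert torch.allclose(loss, manual)
```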
[Chapter 2, 11] → Loss Function, Training, Perplexity
**Decoder**
Definition: Transformer component that generates output sequence autoregressively, attending to encoder outputs (if present) and previous decoder outputs.
Components:
- Masked self-attention
- Cross-attention (if encoder-decoder)
- Feed-forward network
Architecture Types:
- Decoder-only (GPT): Causal LM
- Encoder-decoder (T5): Seq2seq tasks
[Chapter 6] → Transformer, Encoder, Attention
**Distillation**
Definition: Training technique where a smaller "student" model learns to mimic a larger "teacher" model.
Loss Components:
- Hard loss: Cross-entropy with true labels
- Soft loss: KL divergence with teacher predictions
- Feature loss: Match intermediate representations
Benefits for TRMs:
- Compress large models into tiny ones
- Transfer knowledge without full retraining
- Improve small model performance
[Chapter 15] → Compression, Teacher-Student, Training
**Dropout**
Definition: Regularization technique that randomly sets neurons to zero during training with probability p.
Purpose:
- Prevents overfitting
- Creates ensemble effect
- Encourages robust features
Usage in TRMs:
- Applied to attention weights
- Applied to feed-forward layers
- Typically p=0.1 to 0.3
[Chapter 10] → Regularization, Training, Overfitting
**Embedding**
Definition: Dense vector representation of discrete tokens, mapping vocabulary items to continuous space.
Properties:
- Learned during training
- Dimension typically 128-1024
- Captures semantic relationships
Types:
- Token embeddings: Word/subword vectors
- Position embeddings: Encode sequence position
- Task embeddings: Multi-task conditioning
[Chapter 3] → Word2Vec, Token, Representation Learning
**Encoder**
Definition: Transformer component that processes input sequence bidirectionally to create contextualized representations.
Components:
- Self-attention
- Feed-forward network
- Layer normalization
- Residual connections
Use Cases:
- BERT-style models
- Encoder-decoder architectures
- Classification tasks
[Chapter 6] → Transformer, Decoder, Self-Attention
**Feed-Forward Network (FFN)**
Definition: Multi-layer perceptron applied position-wise in transformer blocks.
Architecture:
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Typical Dimensions:
- Input/output: d_model
- Hidden: 4 * d_model (standard)
- TRM: 2 * d_model (more efficient)
Purpose:
- Add non-linearity
- Increase model capacity
- Process each position independently
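Code Sketch: a position-wise FFN module matching the formula above (PyTorch, illustrative):
```python
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = max(0, xW₁ + b₁)W₂ + b₂, applied independently at each position."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),  # expand: 4*d_model standard, 2*d_model in a TRM
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),  # project back to d_model
        )
    def forward(self, x):
        return self.net(x)
```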
[Chapter 6] → Transformer, MLP, Activation Function
**Fine-Tuning**
Definition: Adapting a pre-trained model to a specific task by continuing training on task-specific data.
Types:
- Full fine-tuning: Update all parameters
- Parameter-efficient: Update small subset (LoRA, adapters)
- Prompt tuning: Only update soft prompts
For TRMs: Essential for domain adaptation while maintaining efficiency.
[Chapter 16] → Transfer Learning, LoRA, Adapters
**Gradient**
Definition: Vector of partial derivatives indicating direction and magnitude of steepest increase in loss.
Formula:
∇L = [∂L/∂θ₁, ∂L/∂θ₂, ..., ∂L/∂θₙ]
Uses:
- Parameter updates: θ_new = θ_old - α * ∇L
- Its negative direction indicates how to reduce loss
Challenges:
- Vanishing gradients: Become too small
- Exploding gradients: Become too large
[Chapter 2] → Backpropagation, Optimizer, Training
**Gradient Clipping**
Definition: Technique to prevent exploding gradients by capping gradient magnitude.
Methods:
- By value: Clip each gradient component
- By norm: Scale entire gradient vector
Formula (by norm):
if ||g|| > threshold:
g = g * (threshold / ||g||)
Essential for: Training small models, RNNs, transformers.
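Code Sketch: clip-by-norm via PyTorch's built-in utility (the threshold of 1.0 is an arbitrary example value):
```python
import torch

model = torch.nn.Linear(10, 10)
loss = model(torch.randn(2, 10)).sum()
loss.backward()

# Rescale all gradients so their combined L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```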
[Chapter 10] → Training Stability, Gradient, Optimization
**GPT**
Definition: Decoder-only transformer trained autoregressively for causal language modeling.
Architecture:
- Masked self-attention
- No encoder
- Autoregressive generation
Variants: GPT-2, GPT-3, GPT-4
Relevance to TRMs: TRMs often use GPT-style architecture with parameter sharing.
[Chapter 6, 7] → Causal LM, Transformer, Decoder
**Hidden State**
Definition: Internal representation of input at intermediate layers of a neural network.
Properties:
- Captures learned features
- Dimension typically 256-4096
- Can be interpreted or visualized
Uses:
- Feed into next layer
- Task-specific prediction heads
- Feature extraction
[Chapter 1, 4] → RNN, LSTM, Representation
**Hyperparameter**
Definition: Configuration value set before training (not learned from data).
Examples:
- Learning rate
- Batch size
- Number of layers
- Hidden dimension
- Dropout rate
Tuning: Essential for model performance. See Hyperparameter Guide.
[Chapter 2, 10] → Training, Optimization
**Inference**
Definition: Using a trained model to make predictions on new data.
Phases:
- Load model weights
- Process input
- Forward pass
- Generate output
Optimization:
- Quantization
- Pruning
- KV-cache
- Batch processing
[Chapter 12, 23] → Generation, Deployment, Serving
**Key**
Definition: In the attention mechanism, the representation used to match against queries.
Computation: K = XW_K
Role: Determines what information is relevant to each query.
[Chapter 5] → Attention, Query, Value
**Knowledge Distillation**
→ See Distillation
**KV-Cache**
Definition: Optimization technique that stores previously computed key and value vectors during autoregressive generation.
Why Needed:
- Autoregressive generation recomputes attention for all previous tokens
- Caching avoids redundant computation
Memory Trade-off:
- Speeds up generation significantly
- Increases memory usage: O(n * d_model)
[Chapter 12] → Generation, Inference, Optimization
**Layer Normalization**
Definition: Normalization technique that normalizes across the feature dimension for each example.
Formula:
x_norm = (x - μ_layer) / √(σ_layer² + ε)
x_out = γ * x_norm + β
Advantages over Batch Norm:
- Works with batch size = 1
- No train/test discrepancy
- Better for sequences
Standard in: Transformers, TRMs
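Code Sketch: the formula above computed by hand and checked against `torch.nn.LayerNorm` (γ = 1, β = 0 at initialization):
```python
import torch

x = torch.randn(2, 5, 16)                          # (batch, seq, features)
mu = x.mean(dim=-1, keepdim=True)                  # stats per position, per example
var = x.var(dim=-1, keepdim=True, unbiased=False)
x_norm = (x - mu) / torch.sqrt(var + 1e-5)

ln = torch.nn.LayerNorm(16)                        # learnable γ (weight) and β (bias)
assert torch.allclose(ln(x), x_norm, atol=1e-5)
```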
[Chapter 6] → Normalization, Training Stability
**Learning Rate**
Definition: Hyperparameter controlling size of parameter updates during training.
Formula: θ_new = θ_old - lr * gradient
Typical Values: 1e-5 to 1e-2
Scheduling:
- Constant
- Warmup then decay
- Cosine annealing
- Cyclical
Critical for: Training stability and convergence.
[Chapter 2, 10] → Optimizer, Training, Hyperparameter
**LoRA (Low-Rank Adaptation)**
Definition: Parameter-efficient fine-tuning method that adds trainable low-rank matrices to frozen pre-trained weights.
Formula:
W' = W_frozen + AB
where A ∈ ℝ^(d×r), B ∈ ℝ^(r×k), r ≪ d, k
Benefits:
- Drastically fewer trainable parameters
- Multiple task adaptations without duplicating base model
- Minimal performance loss
Ideal for: Fine-tuning TRMs to specific domains.
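Code Sketch: a minimal LoRA wrapper around a frozen linear layer (the class name, init scale, and `alpha` convention are illustrative assumptions):
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W_frozen^T + (x A B) * (alpha / r)"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))  # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```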
[Chapter 16] → Fine-Tuning, Adapters, Efficiency
**Loss Function**
Definition: Function measuring difference between model predictions and true values, guiding training.
Common Types:
- MSE: Regression tasks
- Cross-entropy: Classification, LM
- Contrastive: Embedding learning
Role: Defines what model optimizes during training.
[Chapter 2, 11] → Training, Cross-Entropy, Optimization
**LSTM (Long Short-Term Memory)**
Definition: RNN variant with gating mechanisms to better capture long-term dependencies.
Gates:
- Forget gate: What to discard
- Input gate: What to add
- Output gate: What to output
Advantages: Mitigates vanishing gradient problem in RNNs.
In TRMs: Largely replaced by attention, but concepts remain relevant.
[Chapter 4] → RNN, Sequence Modeling, Gates
**Masked Language Modeling**
Definition: Training objective where random tokens are masked and model predicts them using bidirectional context.
Example:
- Input: "The [MASK] sat on the mat"
- Target: Predict "cat"
Used in: BERT, RoBERTa
Contrast with: Causal LM (GPT-style)
[Chapter 11] → Training Objective, BERT, Pre-training
**Multi-Head Attention**
Definition: Running multiple attention operations in parallel, each with different learned projections.
Formula:
MultiHead(Q,K,V) = Concat(head₁, ..., headₕ)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Benefits:
- Attend to different representation subspaces
- Capture diverse relationships
- Increase model capacity
Typical: 8-16 heads in standard transformers, 4-8 in TRMs.
[Chapter 5] → Attention, Query, Key, Value
**Normalization**
Definition: Technique to standardize layer inputs/outputs to improve training stability.
Types:
- Batch Normalization: Across batch
- Layer Normalization: Across features
- RMS Normalization: Simplified layer norm
Why Important: Prevents internal covariate shift, enables higher learning rates.
[Chapter 6] → Layer Normalization, Training
**Nucleus Sampling (Top-p)**
Definition: Sampling technique that selects from smallest token set whose cumulative probability exceeds threshold p.
Algorithm:
- Sort tokens by probability
- Find smallest set with P(set) ≥ p
- Sample from this set
Advantages:
- Adapts to probability distribution
- More diverse than top-k
- Avoids low-probability tokens
Typical p: 0.9 to 0.95
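Code Sketch: the algorithm above for a single decoding step (illustrative; batched implementations differ):
```python
import torch

def top_p_sample(logits, p=0.9):
    """Sample from the smallest token set with cumulative probability >= p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p))) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()   # renormalize
    return sorted_idx[torch.multinomial(kept, 1)].item()

token = top_p_sample(torch.randn(100), p=0.9)
```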
[Chapter 12] → Generation, Sampling, Inference
**Optimizer**
Definition: Algorithm for updating model parameters based on gradients to minimize loss.
Common Types:
- SGD: Stochastic Gradient Descent
- Adam: Adaptive Moment Estimation
- AdamW: Adam with decoupled weight decay
Components:
- Gradient computation
- Update rule
- Learning rate
- Momentum (optional)
[Chapter 2] → Adam, Gradient Descent, Training
**Overfitting**
Definition: When model learns training data too well, including noise, hurting generalization.
Symptoms:
- Low training loss, high validation loss
- Perfect training accuracy, poor test accuracy
Solutions:
- More data
- Regularization (dropout, weight decay)
- Early stopping
- Data augmentation
[Chapter 10] → Regularization, Generalization, Training
**Parameter**
Definition: Learnable weights in neural network, updated during training.
Types:
- Weight matrices: Linear transformations
- Biases: Additive offsets
- Embeddings: Token representations
Count: Critical metric for model size. TRMs aim for < 50M parameters.
[Chapter 1] → Training, Weight, Bias
**Parameter Sharing**
Definition: Using same parameters multiple times in model architecture.
In TRMs:
- Recursive layers reuse weights
- Input/output embeddings tied
- Reduces total parameter count
Trade-offs:
- Fewer parameters
- May limit expressiveness
- Requires more recursion depth
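Code Sketch: a hypothetical weight-tied stack; `RecursiveStack` and the fixed depth are illustrative assumptions, not the book's TRM implementation:
```python
import torch.nn as nn

class RecursiveStack(nn.Module):
    """One block's parameters reused at every 'layer'."""
    def __init__(self, block: nn.Module, depth: int):
        super().__init__()
        self.block = block        # a single set of parameters...
        self.depth = depth        # ...applied `depth` times

    def forward(self, x):
        for _ in range(self.depth):
            x = self.block(x)     # same weights at every recursion step
        return x
```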
[Chapter 8] → Recursive, TRM, Efficiency
**Perplexity**
Definition: Evaluation metric for language models measuring average branching factor (uncertainty) per token.
Formula:
PPL = exp(-1/N ∑ log P(w_i | context))
Interpretation:
- Lower is better
- PPL of 20 ≈ model as uncertain as a uniform choice among 20 tokens
- Common range: 10-100 depending on task
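Code Sketch: perplexity as the exponential of mean cross-entropy (random tensors stand in for real model outputs):
```python
import torch
import torch.nn.functional as F

logits = torch.randn(50, 1000)            # per-token predictions over a 1000-token vocab
targets = torch.randint(0, 1000, (50,))   # actual next tokens

nll = F.cross_entropy(logits, targets)    # mean -log P(w_i | context)
ppl = torch.exp(nll)                      # perplexity = exp(cross-entropy)
print(ppl.item())
```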
[Chapter 11, 24] → Evaluation, Language Modeling, Cross-Entropy
**Positional Encoding**
Definition: Adding position information to embeddings so model can distinguish token order.
Types:
- Sinusoidal: Fixed, based on sine/cosine
- Learned: Trainable embeddings
- Relative: RoPE, ALiBi
Necessity: Transformers have no inherent notion of sequence order.
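Code Sketch: the classic fixed sinusoidal encoding (as in "Attention Is All You Need"); learned and relative variants differ:
```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """Fixed sine/cosine positional encodings, added to token embeddings."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)              # (positions, d_model)
```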
[Chapter 6, 13] → Transformer, Embedding, Context Length
**Pre-training**
Definition: Training model on large general corpus before fine-tuning on specific task.
Objectives:
- Causal LM (GPT)
- Masked LM (BERT)
- Denoising (T5)
Benefits:
- Learns general language understanding
- Improves downstream performance
- Enables transfer learning
[Chapter 10, 11] → Fine-Tuning, Transfer Learning
**Pruning**
Definition: Removing unimportant parameters to reduce model size.
Types:
- Unstructured: Remove individual weights
- Structured: Remove entire neurons/heads
- Magnitude-based: Remove smallest weights
Process:
- Train full model
- Identify unimportant parameters
- Remove them
- Fine-tune remaining
[Chapter 17] → Compression, Quantization, Efficiency
**Quantization**
Definition: Reducing numerical precision of model parameters (e.g., float32 → int8).
Types:
- Post-Training Quantization (PTQ): After training
- Quantization-Aware Training (QAT): During training
Benefits:
- 2-4x smaller models
- 2-4x faster inference
- Lower memory usage
Trade-off: Slight accuracy loss (typically <2%)
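Code Sketch: symmetric int8 post-training quantization of a single tensor (a simplification; real PTQ also calibrates activations):
```python
import torch

def quantize_int8(w):
    """Map float weights in [-max, max] onto int8 values in [-127, 127]."""
    scale = w.abs().max() / 127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(256, 256)                 # float32 weights
q, scale = quantize_int8(w)
w_restored = q.float() * scale            # dequantize for comparison
print((w - w_restored).abs().max())       # small rounding error
```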
[Chapter 17] → Compression, Efficiency, Deployment
**Query**
Definition: In the attention mechanism, the representation that searches for relevant information.
Computation: Q = XW_Q
Role: Determines what information to retrieve from keys/values.
[Chapter 5] → Attention, Key, Value
**Recurrent Neural Network (RNN)**
Definition: Neural network that processes sequences by maintaining hidden state and applying same operation at each time step.
Formula:
h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
Limitations:
- Vanishing/exploding gradients
- Sequential processing (no parallelization)
- Limited long-term memory
Evolution: LSTM → GRU → Transformer → TRM
[Chapter 4] → LSTM, Sequence Modeling
**Regularization**
Definition: Techniques to prevent overfitting and improve generalization.
Methods:
- Dropout: Random neuron deactivation
- Weight decay: L2 penalty on parameters
- Data augmentation: Expand training data
- Early stopping: Stop when validation performance stops improving
[Chapter 10] → Dropout, Overfitting, Training
**Residual Connection**
Definition: Adding a layer's input directly to its output, enabling gradient flow through deep networks.
Formula: output = F(x) + x
Benefits:
- Eases gradient flow
- Enables deeper networks
- Provides identity mapping
Standard in: Transformers, TRMs, ResNets
[Chapter 6] → Transformer, Training Stability
**RoPE (Rotary Position Embedding)**
Definition: Relative position encoding that applies rotary transformations to queries and keys.
Advantages:
- Encodes relative positions naturally
- Enables context length extrapolation
- More efficient than learned embeddings
Used in: Modern efficient transformers, many TRMs
[Chapter 13] → Positional Encoding, Context Length, Attention
**Sampling**
Definition: Generating tokens from probability distribution rather than always picking highest probability.
Methods:
- Greedy: Always pick argmax
- Temperature: Scale logits before softmax
- Top-k: Sample from k most likely
- Top-p (nucleus): Sample from top cumulative p
Trade-offs:
- Higher randomness → More diverse but less coherent
- Lower randomness → More coherent but repetitive
[Chapter 12] → Generation, Temperature, Inference
**Self-Attention**
Definition: Attention mechanism where queries, keys, and values all come from the same sequence.
Purpose: Model dependencies within sequence.
Formula: Attention(X, X, X)
Variants:
- Masked: Causal (no future tokens)
- Bidirectional: Full context
[Chapter 5, 6] → Attention, Transformer
**Sequence-to-Sequence (Seq2Seq)**
Definition: Model architecture that maps input sequence to output sequence, potentially of different length.
Components:
- Encoder: Process input
- Decoder: Generate output
Applications:
- Translation
- Summarization
- Question answering
[Chapter 6] → Encoder, Decoder, Transformer
**Softmax**
Definition: Function that converts logits to probability distribution.
Formula:
softmax(x_i) = exp(x_i) / ∑_j exp(x_j)
Properties:
- Output sums to 1
- Differentiable
- Amplifies differences
Uses: Attention weights, output probabilities
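Code Sketch: a numerically stable implementation (subtracting max(x) leaves the result unchanged):
```python
import torch

def softmax(x):
    z = torch.exp(x - x.max())   # shift for numerical stability
    return z / z.sum()

p = softmax(torch.tensor([2.0, 1.0, 0.1]))
print(p, p.sum())                # probabilities summing to 1
```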
[Chapter 1, 5] → Activation Function, Attention
**Temperature**
Definition: Parameter that controls randomness in sampling by scaling logits before softmax.
Formula: softmax(logits / T)
Effects:
- T = 1: No change
- T < 1: More confident (sharper distribution)
- T > 1: More random (flatter distribution)
Typical values: 0.7 (focused) to 1.2 (creative)
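Code Sketch: the effect of T on the same logits:
```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5])
for T in (0.7, 1.0, 1.2):
    print(T, torch.softmax(logits / T, dim=-1))  # lower T → sharper, higher T → flatter
```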
[Chapter 12] → Sampling, Generation, Softmax
**Tokenization**
Definition: Process of splitting text into discrete units (tokens) that model can process.
Levels:
- Character: Individual characters
- Subword: BPE, WordPiece
- Word: Whole words
Trade-offs:
- Smaller units → Longer sequences, better handling of unknown words
- Larger units → Shorter sequences, larger vocab
[Chapter 3, 9] → BPE, Vocabulary, Embedding
**Transformer**
Definition: Neural architecture based on self-attention, processing sequences in parallel rather than sequentially.
Components:
- Multi-head self-attention
- Position-wise FFN
- Positional encoding
- Layer normalization
- Residual connections
Variants:
- Encoder-only (BERT)
- Decoder-only (GPT)
- Encoder-decoder (T5)
Why Revolutionary: Parallelizable, scalable, state-of-the-art performance.
[Chapter 6] → Attention, Encoder, Decoder
**Transfer Learning**
Definition: Leveraging knowledge learned on one task to improve performance on a related task.
Process:
- Pre-train on large general dataset
- Fine-tune on specific task
- Optionally freeze some layers
Benefits:
- Less task-specific data needed
- Better performance
- Faster training
[Chapter 10, 16] → Pre-training, Fine-Tuning
**TRM (Tiny Recursive Model)**
Definition: Small language model using recursive parameter sharing to maximize efficiency.
Key Features:
- < 50M parameters
- Recursive layers
- Parameter sharing
- Optimized for edge deployment
Design Principles:
- Weight sharing
- Efficient attention
- Compact vocabulary
- Optimized representations
[Chapter 7-12] → Recursive, Parameter Sharing, Efficiency
**Value**
Definition: In the attention mechanism, the representation that gets retrieved and combined.
Computation: V = XW_V
Role: Provides actual information returned by attention.
[Chapter 5] → Attention, Query, Key
**Vanishing Gradient**
Definition: Problem where gradients become extremely small during backpropagation, preventing learning in early layers.
Causes:
- Deep networks
- Activation functions (sigmoid, tanh)
- Long sequences (RNNs)
Solutions:
- ReLU activation
- Residual connections
- Layer normalization
- Gradient clipping
- LSTMs/GRUs
[Chapter 2, 4] → Backpropagation, LSTM, Training
**Vocabulary**
Definition: Set of all tokens model can process.
Size Trade-offs:
- Large vocab (50k+): Better coverage, bigger embeddings
- Small vocab (10k-30k): Fewer parameters, longer sequences
For TRMs: Often 10k-20k tokens to minimize embedding parameters.
[Chapter 3, 9] → Tokenization, BPE, Embedding
**Weight**
Definition: Learnable parameter in linear transformations of neural networks.
Representation: Typically matrices or tensors
Initialization: Critical for training success (Xavier, He, etc.)
Updates: Via gradient descent during training
[Chapter 1, 2] → Parameter, Training, Gradient
**Weight Decay**
Definition: Regularization technique adding penalty term to loss proportional to magnitude of weights.
Formula: L_total = L_task + λ * ||θ||²
Effect: Encourages smaller weights, reducing overfitting.
Typical λ: 1e-4 to 1e-2
[Chapter 10] → Regularization, L2, Training
**Word2Vec**
Definition: Neural network architecture for learning word embeddings by predicting context from words (Skip-gram) or words from context (CBOW).
Key Idea: Words in similar contexts have similar meanings.
Output: Dense vector representations of words
Influence: Foundation for modern embeddings
[Chapter 3] → Embedding, Representation Learning
**Acronyms**
- BERT: Bidirectional Encoder Representations from Transformers
- BPE: Byte-Pair Encoding
- CBOW: Continuous Bag of Words
- FFN: Feed-Forward Network
- GPT: Generative Pre-trained Transformer
- LM: Language Model
- LSTM: Long Short-Term Memory
- MLM: Masked Language Modeling
- MLP: Multi-Layer Perceptron
- NLP: Natural Language Processing
- PTQ: Post-Training Quantization
- QAT: Quantization-Aware Training
- ReLU: Rectified Linear Unit
- RNN: Recurrent Neural Network
- RoPE: Rotary Position Embedding
- SGD: Stochastic Gradient Descent
- TRM: Tiny Recursive Model
Related documents:
- Main Book: TRM-BOOK-COMPLETE.md
- Getting Started: GETTING-STARTED.md
- FAQ: FAQ.md
- References: REFERENCES.md
Last updated: October 9, 2025