Eran Raviv https://eranraviv.com – Modern statistics and econometrics with applications to financial markets

Correlation and Correlation Structure (11) – Estimation using Random Matrix Theory
https://eranraviv.com/correlation-correlation-structure-11-estimation-using-random-matrix-theory/
Fri, 23 Jan 2026

In the classical regime, when we have plenty of observations relative to what we need to estimate, we can rely on the sample covariance matrix as a faithful representation of the underlying covariance structure. However, in the high-dimensional settings common to modern data science – where the number of attributes/features p is comparable to the number of observations n – the sample covariance matrix is a bad estimator. It is not merely noisy; it is misleading. The eigenvalues of such matrices undergo a predictable yet deceptive dispersion, inflating into “signals” that have no basis in reality. Machinery from Random Matrix Theory offers one way to correct for the biases that high-dimensional data creates.

Suppose we observe n independent vectors \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n from a p-dimensional distribution with covariance matrix \boldsymbol{\Sigma}. The obvious estimate is the sample covariance matrix:

    \[\mathbf{S} = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top.\]

The Problem That Won’t Go Away

When p is fixed and n \to \infty, we’re in the classical regime where \mathbf{S} \to \boldsymbol{\Sigma} and all is well. But modern datasets, say in genomics or finance, routinely present us with p \approx n or even p > n. Here, \mathbf{S} becomes a terrible estimate of \boldsymbol{\Sigma}.

How terrible? Consider the eigenvalues. And remind yourself that

Each eigenvalue = variance along its eigenvector direction

Even when the true covariance is the identity matrix \boldsymbol{\Sigma} = \mathbf{I} (pure noise, all eigenvalues equal to 1 in actuality), the eigenvalues of \mathbf{S} don’t cluster near 1. Instead, they spread along an interval roughly [(1-\sqrt{p/n})^2, (1+\sqrt{p/n})^2]. The largest eigenvalue can be inflated by a factor of 4 when p = n. This isn’t sampling variability we can ignore – it’s a systematic, serious distortion.
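To see this dispersion in action, here is a quick R sketch (the variable names are mine, not from the code later in the post):

```r
# Pure noise: the true covariance is the identity, yet sample eigenvalues
# spread over roughly [(1 - sqrt(p/n))^2, (1 + sqrt(p/n))^2].
set.seed(1)
n <- 400
p <- 200                                   # aspect ratio c = p/n = 0.5
X <- matrix(rnorm(n * p), n, p)            # Sigma = I by construction
ev <- eigen(cov(X), symmetric = TRUE, only.values = TRUE)$values

c_ratio <- p / n
mp_lower <- (1 - sqrt(c_ratio))^2          # about 0.086
mp_upper <- (1 + sqrt(c_ratio))^2          # about 2.914

range(ev)                                  # close to the MP bounds, far from 1
```

Despite every population eigenvalue being exactly 1, the sample eigenvalues stretch over almost the whole Marčenko-Pastur interval.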

Enter Random Matrix Theory

In 1967, mathematicians Marčenko and Pastur proved a remarkable theorem: as n, p \to \infty with p/n \to c \in (0,1], the empirical distribution of the eigenvalues of \mathbf{S} converges to a specific limiting density, now called the Marčenko-Pastur law:

    \[f_{MP}(\lambda) = \frac{1}{2\pi c \lambda} \sqrt{(\lambda - \lambda_{-})(\lambda_{+} - \lambda)}, \quad \lambda \in [\lambda_{-}, \lambda_{+}]\]

where \lambda_{\pm} = (1 \pm \sqrt{c})^2 are the edges of the support of the density.

This is beautiful mathematics, but what can we do with it? The key insight came from physicists (Laloux et al., 1999) studying financial correlation matrices. They noticed that most eigenvalues of real covariance matrices pile up in a “bulk” matching the Marčenko-Pastur prediction, while a few large eigenvalues stick out above. The bulk represents noise; the outliers represent signal.

A Simple Cleaning Procedure

This suggests an obvious strategy:

  • Identify the bulk: Estimate the noise level \sigma^2 and compute the Marčenko-Pastur bounds \lambda_{\pm} = \sigma^2(1 \pm \sqrt{c})^2
  • Classify eigenvalues:
    • Signal: l_i > \lambda_+ (or l_i < \lambda_- if c < 1)
    • Noise: l_i \in [\lambda_-, \lambda_+]
  • Shrink the noise: Replace noise eigenvalues with their average \bar{l} = \frac{1}{|\text{noise}|}\sum_{i \in \text{noise}} l_i
  • Reconstruct: \tilde{\mathbf{S}} = \mathbf{U}\tilde{\mathbf{L}}\mathbf{U}^\top where \tilde{\mathbf{L}} contains the cleaned eigenvalues
    This is simple. But we’re not done yet – the signal eigenvalues themselves are inflated by noise.

    The Baik-Silverstein Correction

    In 2006, mathematicians Baik and Silverstein analyzed the “spiked covariance model” where \boldsymbol{\Sigma} has a few large eigenvalues (the “spikes”) and the rest equal to 1. They derived the precise relationship between sample and population eigenvalues. Specifically, for a signal eigenvalue \lambda > 1 + \sqrt{c}, the corresponding sample eigenvalue l satisfies (asymptotically):

        \[l \approx \lambda + \frac{c \lambda}{\lambda - 1}.\]

    This can be inverted: given l, we estimate

        \[\hat{\lambda} = \frac{1}{2}\left((l - c + 1) + \sqrt{(l - c + 1)^2 - 4l}\right)\]

    And so our \hat{\lambda} can now serve as a proper estimate for the true eigenvalue in the reconstruction \tilde{\mathbf{S}} = \mathbf{U}\tilde{\mathbf{L}}\mathbf{U}^\top.
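As a sketch, here is that inversion in R (the function name invert_spike is mine; the optional rescaling by a noise level \sigma^2 other than 1 anticipates the code further below):

```r
# Sketch: invert the spiked-model relation l = lambda + c*lambda/(lambda - 1)
# to recover the population spike from a sample eigenvalue.
invert_spike <- function(l, c_ratio, sigma_sq = 1) {
  l <- l / sigma_sq                        # move to the unit-noise scale
  disc <- (l - c_ratio + 1)^2 - 4 * l
  if (disc < 0) return(NA)                 # eigenvalue inside the bulk: no spike
  sigma_sq * ((l - c_ratio + 1) + sqrt(disc)) / 2
}

# Round trip: forward map, then invert.
lambda <- 3
c_ratio <- 0.5
l <- lambda + c_ratio * lambda / (lambda - 1)   # forward map gives 3.75
invert_spike(l, c_ratio)                        # recovers 3
```

A sample eigenvalue sitting inside the bulk yields a negative discriminant, which we read as “no detectable spike”.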

    Here is how it looks:
    RMT example
    What is plotted is the empirical distribution of sample eigenvalues (blue histogram) from a data matrix of n = 200 observations on p = 100 variables, compared to the MP density (in black). The population covariance contains three (real) signal eigenvalues, indicated by the green dash-dot lines. c = 0.5 in this example. Red dashed lines mark the MP bounds. Magenta dotted lines show the expected locations of the sample signal eigenvalues under the spiked-model formula above.

    A valuable addition to your covariance estimation arsenal.

    Code

    set.seed(123)  # For reproducibility
    
    n <- 200  # Number of observations (rows)
    p <- 100  # Number of variables (columns)
    
    # Define eigenvalues
    signal_eigenvalues <- c(0.1, 3, 4)
    noise_eigenvalue <- 0.5
    n_signal <- length(signal_eigenvalues)
    
    # Create the full eigenvalue vector
    eigenvalues <- c(signal_eigenvalues, rep(noise_eigenvalue, p - n_signal))
    
    # Generate random orthogonal matrix for eigenvectors
    # Use QR decomposition of a random matrix
    random_matrix <- matrix(rnorm(p * p), p, p)
    Q <- qr.Q(qr(random_matrix))  # Orthogonal matrix (eigenvectors)
    
    # QR decomposition background:
    # Any matrix can be factored as M = QR where
    #   Q is orthogonal (columns are orthonormal)
    #   R is upper triangular
    # qr.Q() extracts just the Q matrix.
    # Bottom line: a standard simulation trick for generating random covariance
    # matrices with controlled eigenvalues – you specify the eigenvalues
    # (diagonal matrix) and use a random orthogonal matrix for the directions
    # (eigenvectors).
    
    # Construct the population covariance matrix
    Sigma <- Q %*% diag(eigenvalues) %*% t(Q)
    
    Sigma[1:10,1:10]
    # Generate data from multivariate normal with this covariance
    # Mean vector is zero
    mu <- rep(0, p)
    
    # Generate the data matrix using the Cholesky decomposition
    # X = Z * L where L is Cholesky factor of Sigma
    L <- chol(Sigma)  # Upper triangular Cholesky factor
    Z <- matrix(rnorm(n * p), n, p)  # Standard normal random variables
    X <- Z %*% L  # n x p data matrix
    
    dim(X)
    
    # Verify the structure
    cat("Data matrix dimensions:", dim(X), "\n")
    cat("Population eigenvalues (first 10):", eigenvalues[1:10], "\n\n")
    
    # Compute sample covariance matrix
    S <- cov(X)
    
    # Check sample eigenvalues
    sample_eigenvalues <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
    cat("Sample eigenvalues (first 10):", round(sample_eigenvalues[1:10], 3), "\n")
    cat("Sample eigenvalues (last 3):", round(tail(sample_eigenvalues, 3), 3), "\n\n")
    barplot(sample_eigenvalues)
    
    # Compute aspect ratio and Marchenko-Pastur bounds
    c_ratio <- p / n
    lambda_plus <- noise_eigenvalue * (1 + sqrt(c_ratio))^2
    lambda_minus <- noise_eigenvalue * (1 - sqrt(c_ratio))^2
    
    cat("Aspect ratio c = p/n:", round(c_ratio, 3), "\n")
    cat("Marchenko-Pastur bounds for noise (sigma^2 = 0.5):", 
        round(lambda_minus, 3), "to", round(lambda_plus, 3), "\n\n")
    
    # Expected locations of signal eigenvalues
    phi <- function(lambda, c, sigma_sq = 0.5) {
      lambda_scaled <- lambda / sigma_sq
      result <- sigma_sq * (lambda_scaled + c * lambda_scaled / (lambda_scaled - 1))
      return(result)
    }
    
    # this gives us the locations of the sample eigenvalues that we should expect for our signal, based on the theory.
    cat("Expected sample eigenvalue locations:\n")
    for(i in 1:n_signal) {
      expected <- phi(signal_eigenvalues[i], c_ratio)
      cat(sprintf("  Population λ = %.1f → Expected sample eigenvalue ≈ %.3f\n", 
                  signal_eigenvalues[i], expected))
    }
    
    mp_density <- function(x, c, sigma_sq) {
      lambda_plus <- sigma_sq * (1 + sqrt(c))^2
      lambda_minus <- sigma_sq * (1 - sqrt(c))^2
      density <- ifelse(x >= lambda_minus & x <= lambda_plus,
                        sqrt((lambda_plus - x) * (x - lambda_minus)) / (2 * pi * sigma_sq * c * x), 0)
      return(density)
    }
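The code above stops just short of the cleaning itself. Here is a hedged, self-contained sketch of the full pipeline (my own continuation; the settings mirror the simulation above, but with two spikes, 4 and 3, on a noise level of 0.5, for brevity):

```r
# Sketch: the full RMT cleaning pipeline, end to end.
set.seed(123)
n <- 200; p <- 100
c_ratio <- p / n
sigma_sq <- 0.5
pop_eig <- c(4, 3, rep(sigma_sq, p - 2))

Q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))
Sigma <- Q %*% diag(pop_eig) %*% t(Q)
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

eS <- eigen(cov(X), symmetric = TRUE)
l  <- eS$values

# 1) Marchenko-Pastur bounds for the noise bulk
lam_plus  <- sigma_sq * (1 + sqrt(c_ratio))^2
lam_minus <- sigma_sq * (1 - sqrt(c_ratio))^2

# 2)-3) classify eigenvalues, shrink the bulk to its average
is_noise <- l >= lam_minus & l <= lam_plus
l_clean  <- l
l_clean[is_noise] <- mean(l[is_noise])

# 4) de-bias the spikes with the inverted Baik-Silverstein relation
for (i in which(!is_noise)) {
  ls   <- l[i] / sigma_sq                  # move to the unit-noise scale
  disc <- (ls - c_ratio + 1)^2 - 4 * ls
  if (disc > 0) l_clean[i] <- sigma_sq * ((ls - c_ratio + 1) + sqrt(disc)) / 2
}

# 5) reconstruct with the original eigenvectors
S_clean <- eS$vectors %*% diag(l_clean) %*% t(eS$vectors)
round(head(l_clean, 3), 2)   # leading entries should land near the true 4 and 3
```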

    References

    • Laloux, L., Cizeau, P., Bouchaud, J.-P., & Potters, M. (1999). Noise dressing of financial correlation matrices. Physical Review Letters, 83, 1467.
    • Baik, J., & Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97, 1382-1408.
    • Marčenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4), 457-483.
Most popular posts – 2025
https://eranraviv.com/popular-posts-2025/
Wed, 31 Dec 2025

    Today is the last day of 2025. Depending on where you’re reading this, the party might have already begun. I begin this end-of-year post by wishing you a safe and fun time tonight.

    This blog is just a personal hobby. When I’m extra busy, as I was this year, the blog is a front-line casualty. That is why 2025 saw a weaker posting stream. As I do almost every year, I checked the analytics stats to see what generated the most interest. I am pleased with around 13K (active) visits this year, and an average of 37 seconds spent per visit (counted only while the browser tab is in the foreground and active). Given our collective TikTok-wired brains, I consider those 37 seconds a compliment.

    The two most popular posts this year were also, coincidentally, the two pieces I most enjoyed creating. Both dive into the all-important world of word embeddings:

    The first is Understanding Word Embeddings (1) – Algebra and the second is Understanding Word Embeddings (2) – Geometry.

    On the left (scroll down) you can find the most popular posts from previous years.

    To my readership. Thank you for reading, sharing, for your emails, corrections, comments and good questions. Happy, healthy, and productive 2026!


    pic credit: Morvanic Lee

Correlation and correlation structure (10) – Inverse Covariance
https://eranraviv.com/correlation-correlation-structure-10-inverse-covariance/
Mon, 29 Sep 2025

    The covariance matrix is central to many statistical methods. It tells us how variables move together, and its diagonal entries – variances – are very much our go-to measure of uncertainty. But the real action lives in its inverse. We call the inverse covariance matrix either the precision matrix or the concentration matrix. Where did these terms come from? I doubt this question has kept you up at night, but I still think you’ll find the answer interesting.

    Why is the Inverse Covariance Called Precision?

    Variance is just a noisy soloist; if you want to know who really controls the music – who depends on whom – you listen to the precision \Omega. While a variable may look wiggly and wild on its own, you can often tell where it lands quite precisely, conditional on the other variables in the system. The inverse of the covariance matrix encodes the conditional dependence between any two variables after controlling for the rest. The mathematical details appear in an earlier post, which the curious reader should consult.

    The following code and figure illustrate the precision terminology. Consider this little experiment:

        \[ X_2, X_3 \sim \mathcal{N}(0,1) \text{, independent and ordinary.} \]

        \[ X_1 = 2X_2 + 3X_3 + \text{small noise}.\]

    Now, X_1 has a large (marginal) variance; look, it’s all over the place:
    X1 variance
    But but but… given the other two variables you can determine X_1 quite accurately (because it doesn’t carry much noise of its own); hence the term precision. The precision matrix captures exactly this phenomenon. Its diagonal entries are not about marginal uncertainty, but conditional uncertainty: how much variability remains when the values of the other variables are given. The inverse of the precision entry \Omega_{11} is the residual variance of X_1 after regressing it on the other two variables. The math behind it is found in an earlier post; for now it suffices to write:

        \[\text{For each } i=1,\dots,p: \quad  X_i = \sum_{j \neq i} \beta_{ij} X_j + \varepsilon_i,  \quad \text{with } \mathrm{Var}(\varepsilon_i) = \sigma_i^2.\]

        \[\quad \Omega_{ii} = \tfrac{1}{\sigma_i^2},  \qquad \Omega_{ij} = -\tfrac{\beta_{ij}}{\sigma_i^2}.\]

    So after accounting for the other two variables, you are left with

        \[\text{small noise} \;\Rightarrow\; \frac{1}{\text{small noise}} \;\Rightarrow\; \text{high precision}\]

    which in this case looks as follows:
    X1 precise given X2,X3

    This small illustration also reveals a useful computational insight. Instead of directly inverting the covariance matrix (expensive for high dimensions), you can also run parallel regressions of each variable on all others, which may scale better on distributed systems.
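A sketch of that regression route (names and toy data are mine, mirroring the experiment above; dividing the residual sum of squares by n-1 matches the divisor cov() uses, which makes the identity exact):

```r
# Sketch: precision-matrix entries via per-variable regressions,
# with no matrix inversion of Sigma needed for the entries themselves.
set.seed(123)
n <- 5000
X2 <- rnorm(n); X3 <- rnorm(n)
X1 <- 2 * X2 + 3 * X3 + rnorm(n, 0, 0.5)
X  <- cbind(X1, X2, X3)

Omega <- solve(cov(X))                     # direct inversion, kept as reference

fit1 <- lm(X1 ~ X2 + X3)                   # regress variable 1 on the others
s2_1 <- sum(residuals(fit1)^2) / (n - 1)   # residual variance, divisor n-1

omega_11 <- 1 / s2_1                       # reproduces Omega[1, 1]
omega_12 <- -coef(fit1)[["X2"]] / s2_1     # reproduces Omega[1, 2]

c(Omega[1, 1], omega_11)
c(Omega[1, 2], omega_12)
```

Each variable’s regression can run independently, which is what makes this route attractive on distributed systems.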

    Why is the Inverse Covariance Called Concentration?

    Now, what motivates the concentration terminology? What is concentrated? Let’s unwrap it. Let’s begin by first looking at the density of a single normally distributed random variable:

        \[f(x) \propto \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right).\]

    So if x = \mu we have e^{-(x-\mu)^2}= e^0=1, and otherwise we have e^{-(x-\mu)^2}= e^{(\text{negative number})}. This negative number will then be divided by the variance – or, in our context, multiplied by the precision (which is the reciprocal of the variance for a single variable). A higher precision value makes for a negativier (😀) exponent. In turn, it reduces the density the further we drift from the mean (think faster mass-drop in the tails), so we get a sharper, more peaked density whose values are tightly concentrated around the mean. A numeric sanity check. Below are two cases with mean zero, one with variance 1 (so precision \tau=1), and the other with variance 1/4 (\tau=4). We look at two values, one at the mean (x=0) and one farther away (x=1), and check the density at those values for the two cases (p_{1}(0), p_{1}(1), p_{4}(0), and p_{4}(1)):

        \[ X\sim\mathcal N(0,\sigma^2),\quad \tau=\frac{1}{\sigma^2},\quad p_\tau(x)=\frac{\sqrt{\tau}}{\sqrt{2\pi}}\exp\!\left(-\tfrac12\tau x^{2}\right) \]

        \[ p_{1}(0)=\frac{1}{\sqrt{2\pi}}\approx 0.39,\qquad p_{4}(0)=\frac{2}{\sqrt{2\pi}}=\sqrt{\frac{2}{\pi}}\approx 0.79 \]

        \[ p_{1}(1)=\frac{1}{\sqrt{2\pi}}e^{-1/2}\approx 0.24,\qquad p_{4}(1)=\frac{2}{\sqrt{2\pi}}e^{-2}\approx 0.10 \]

        \[ \tau\uparrow\;\Rightarrow\;p(0)\uparrow,\;p(1)\downarrow \]

    In words: higher precision leads to lower density away from the mean and, therefore, higher density around the mean (because the density has to integrate to one, and the mass must go somewhere).
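The same sanity check in two lines of R (dnorm() is parameterized by the standard deviation, so \tau = 4 means sd = 1/2):

```r
# Density at the mean and one unit away, for precision tau = 1 and tau = 4.
tau <- c(1, 4)
sds <- 1 / sqrt(tau)

round(dnorm(0, sd = sds), 2)   # 0.40 0.80 -> more mass at the mean as tau grows
round(dnorm(1, sd = sds), 2)   # 0.24 0.11 -> less mass away from the mean
```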

    Moving to the multivariate case. Say that also X_1 is normally distributed, then the joint multivariate Gaussian distribution of our 3 variables is proportional to:

        \[f(\mathbf{x}) \propto \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \mathbf{\Omega} (\mathbf{x}-\boldsymbol{\mu})\right)\]

        \[\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}, \quad \mathbf{\Omega} = \begin{bmatrix} \Omega_{11} & \Omega_{12} & \Omega_{13} \\ \Omega_{21} & \Omega_{22} & \Omega_{23} \\ \Omega_{31} & \Omega_{32} & \Omega_{33} \end{bmatrix}\]

    In the same fashion, \Omega directly sets the shape and orientation of the contours of the multivariate density. If there is no correlation and little precision (think a diagonal \Omega with small entries), what would you expect to see? A wide, diffused, spread-out cloud, indicating little concentration. By way of contrast, a full \Omega weights the directions differently; it determines how much probability mass gets concentrated in each direction through the space.

    Another way to see this is to remember that in the multivariate Gaussian density, \mathbf{\Omega} appears in the numerator, so its inverse, the covariance \boldsymbol{\Sigma}, would sit in the denominator. Larger covariance entries mean more spread and, as a result, lower density values at individual points, and thus a more diffused multivariate distribution overall.
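A compact way to quantify “more/less concentrated” is the height of the density peak, which for a p-dimensional Gaussian equals (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2}. A sketch for the two 3 \times 3 scenarios used below (the helper peak_height is mine):

```r
# Peak height of a p-dimensional Gaussian: (2*pi)^(-p/2) / sqrt(det(Sigma)).
# A larger determinant (more spread) means a flatter peak.
peak_height <- function(Sigma) {
  p <- nrow(Sigma)
  (2 * pi)^(-p / 2) / sqrt(det(Sigma))
}

Sigma_round <- diag(3); Sigma_round[1, 2] <- Sigma_round[2, 1] <- 0.1
Sigma_ridge <- diag(3); Sigma_ridge[1, 2] <- Sigma_ridge[2, 1] <- 0.9

peak_height(Sigma_round)   # lower peak: mass spread out
peak_height(Sigma_ridge)   # higher peak: mass piled along the ridge
```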

    The following two simple 3 \times 3 scenarios illustrate the concentration principle explained above. In the code below you can see that while I plot only the first 2 variables, there are actually 3 variables; the third one is independent, so a high covariance between the first two would remain high even after we account for the third (I mention this so that you don’t get confused that we now work with the covariance, rather than with the inverse). Here are the two scenarios:

    \textbf{Round hill:} \boldsymbol{\Sigma}_1 with correlation \rho = 0.1 creates spherical contours.

    \textbf{Elongated ridge:} \boldsymbol{\Sigma}_2 with correlation \rho = 0.9 creates elliptical contours stretched along the correlation direction.

    Rotate the below interactive plots to get a clearer sense of what we mean by more/less concentration. Don’t forget to check the density scale.

    Round hill: diffused, less concentrated

    Elongated ridge: steep/peaky, more concentrated

    Hopefully, this explanation makes the terminology for the inverse covariance clearer.

    Code

    For Precision

    set.seed(123)
    n <- 100
    # Generate correlated data where X1 has high variance but is predictable from X2, X3
    # X1 = 2*X2 + 3*X3 + small noise
    X2 <- rnorm(n, 0, 1)
    X3 <- rnorm(n, 0, 1)
    X1 <- 2*X2 + 3*X3 + rnorm(n, 0, 0.5)  # Small conditional variance!
    
    # Combine into matrix
    X <- cbind(X1, X2, X3)
    
    # Compute covariance and precision
    Sigma <- cov(X)
    Omega <- solve(Sigma)
    
    # Display results
    cat("MARGINAL variances (diagonal of Sigma):\n")
    print(diag(Sigma))
            X1         X2         X3 
    11.3096683  0.8332328  0.9350631 
    
    cat("\nPRECISION values (diagonal of Omega):\n")
    print(diag(Omega))
           X1        X2        X3 
     4.511182 18.066305 41.995608 
    
    cat("\nCONDITIONAL variances (1/diagonal of Omega):\n")
    print(1/diag(Omega))
            X1         X2         X3 
    0.22167138 0.05535166 0.02381201 
    
    # Verification - residual variance from regression:
    fit <- lm(X1 ~ X2 + X3)
    cat("Var(X1|X2,X3) =", var(fit$residuals), "\n")
    Var(X1|X2,X3) = 0.2216714 
    cat("1/Omega[1,1] =", 1/Omega[1,1], "\n")
    1/Omega[1,1] = 0.2216714

    For Concentration

    library(MASS)
    library(rgl)
    library(htmltools)
    set.seed(123)
    n <- 5000
    mu <- c(0,0,0)
    
    # Case 1: Low correlation (round hill)
    Sigma1 <- matrix(c(1, 0.1, 0,
                       0.1, 1, 0,
                       0, 0, 1), 3, 3)
    X1 <- mvrnorm(n, mu=mu, Sigma=Sigma1)
    
    # Case 2: High correlation (elongated ridge)
    Sigma2 <- matrix(c(1, 0.9, 0,
                       0.9, 1, 0,
                       0,   0, 1), 3, 3)
    X2 <- mvrnorm(n, mu=mu, Sigma=Sigma2)
    
    # Density for marginal (X1,X2)
    kd1 <- kde2d(X1[,1], X1[,2], n=150, lims=c(range(X1[,1]), range(X1[,2])))
    kd2 <- kde2d(X2[,1], X2[,2], n=150, lims=c(range(X2[,1]), range(X2[,2])))
    
    # Plot 1: Low correlation → "round mountain"
    open3d(useNULL=TRUE)
    persp3d(kd1$x, kd1$y, kd1$z,
            col = terrain.colors(100)[cut(kd1$z, 100)],
            aspect = c(1,1,0.4),
            xlab="X1", ylab="X2", zlab="Density",
            smooth=TRUE, alpha=0.9)
    title3d("Low correlation (round, diffuse)", line=2)
    widget1 <- rglwidget(width=450, height=450)
    # Plot 2: High correlation → "ridge mountain"
    open3d(useNULL=TRUE)
    persp3d(kd2$x, kd2$y, kd2$z,
            col = terrain.colors(100)[cut(kd2$z, 100)],
            aspect = c(1,1,0.4),
            xlab="X1", ylab="X2", zlab="Density",
            smooth=TRUE, alpha=0.9)
    title3d("High correlation (elongated, concentrated)", line=2)
    widget2 <- rglwidget(width=450, height=450)

Dot Product in the Attention Mechanism
https://eranraviv.com/dot-product-attention-mechanism/
Mon, 28 Jul 2025

    The dot product of two embedding vectors \mathbf{x} and \mathbf{y} with dimension d is defined as

        \[\mathbf{x} \cdot \mathbf{y} = x_1y_1 + x_2y_2 + \dots + x_dy_d.\]

    Hardly the first thing that jumps to mind when thinking about a “similarity score”. Indeed, the result of a dot product is a single number (a scalar), with no predefined range (e.g. not between zero and one), so it’s hard to judge whether a particular score is high or low on its own. Still, the Transformer family of deep learning models relies heavily on the dot product in the attention mechanism, to weigh the importance of different parts of the input sentence. This post explains why the dot product, which seems like an odd pick for a “similarity score”, actually makes good sense.

    Dot Product and Vector Similarity

    I assume here you already understand the geometry of word embeddings.

    Consider the following: two vectors pointing the same way (0° angle), so we want a maximum similarity score; two vectors pointing in opposite ways (180° angle), so we want a minimum (least similar) score, and two perpendicular vectors (90° angle), so we want a zero similarity score, meaning no relationship.

    If we just used the angle itself, the numbers would be backward: 180° (least similar) is a higher number than 0° (most similar). That’s not at all intuitive for a “similarity” scale. The solution: feed the angle to the cosine function. This gives us exactly the order we want:

    \theta (degrees)    \cos(\theta)
    180                 -1.000
    150                 -0.866
    120                 -0.500
    90                   0.000
    60                   0.500
    30                   0.866
    0                    1.000

    The formula to find the cosine of the angle between two vectors \mathbf {x} and \mathbf {y} of size d is:

        \[\cos(\theta )={\mathbf {x} \cdot \mathbf {y} \over \|\mathbf {x} \|\|\mathbf {y} \|}={\frac {\sum \limits _{i=1}^{d}{x_{i}y_{i}}}{{\sqrt {\sum \limits _{i=1}^{d}{x_{i}^{2}}}}{\sqrt {\sum \limits _{i=1}^{d}{y_{i}^{2}}}}}}.\]

    Manipulating the above just a bit, to get our dot-product friend on the left-hand side, gives us:

        \[\mathbf {x} \cdot \mathbf {y} = \|\mathbf {x} \| ( \|\mathbf {y} \| \cos(\theta ))\]

    Now what is \|\mathbf{y}\|\cos(\theta)? It’s how closely the two vectors point in the same direction, multiplied by the magnitude of \mathbf{y}: \|\mathbf{y}\| = \sqrt{y_1^2 + y_2^2 + \cdots + y_d^2}. Put differently: how much of \mathbf{y} lies in the direction of \mathbf{x}; in other words, if I project \mathbf{y} onto \mathbf{x}, how much is “captured”? (Answer: exactly \|\mathbf {y} \| \cos(\theta ).) Then we multiply by \|\mathbf {x} \|, so as to account for the magnitude of \mathbf{x} as well.
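A tiny numeric illustration (the vectors are mine, chosen so the angle is 45°):

```r
# dot(x, y) = |x| * |y| * cos(theta): check on a concrete pair.
x <- c(3, 0)
y <- c(2, 2)

dot_xy    <- sum(x * y)                    # 6
norm_x    <- sqrt(sum(x^2))                # 3
norm_y    <- sqrt(sum(y^2))                # 2 * sqrt(2)
cos_theta <- dot_xy / (norm_x * norm_y)    # cos(45 deg), about 0.707

projection <- norm_y * cos_theta   # how much of y lies along x: exactly 2
norm_x * projection                # recovers the dot product: 6
```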

    Can a Vector Be More “Similar” to Others Than to Itself?

    If you google self-attention matrix images you will see that the diagonal is quite high, for example:
    self-attention matrix
    That is because, as we mentioned, if the angle is zero the cosine is 1, and the score boils down to multiplying the magnitudes of the two embedding vectors.

    That said, I got to thinking about this topic because of the observation that while a vector should theoretically be most similar to itself, that’s not always the case in some of the images I saw in papers or generated myself (the diagonal does not always hold the highest scores, as you can see in the figure above as well). So it is possible for different vectors to receive higher similarity scores than a vector does with itself – even in self-attention matrices. This is because the embedding coordinates are not always normalized to a consistent length. So even if two vectors point in almost the same direction (a very small angle), the sheer “size” of one of them can still make their similarity score surprisingly high. This is why I explained it that way above.

    By the way, that’s also the reason for the normalizing factor \sqrt{d_k}. As d grows, the variability of the dot product grows; the numbers can become very large because of the magnitude of the vectors themselves (larger d –> more elements to square and sum). It’s not good to feed large numbers to the softmax, because the exponent e^{(\cdot)} is going to explode, so before we exponentiate we “stabilize” the dot product by dividing it by \sqrt{d_k}. Why \sqrt{d_k}? No real reason. You can choose another term if you wish. The story is that \sqrt{d_k} is the standard deviation of the dot product if you assume the entries of the embedding vectors are standard normally distributed (they are not, but this Gaussian assumption provides a good excuse for using \sqrt{d_k}).
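A quick simulation of that story (a sketch under the stated Gaussian assumption; d = 64 is an arbitrary choice of mine):

```r
# For entries i.i.d. N(0,1), the dot product of two d-vectors has mean 0
# and standard deviation sqrt(d); dividing by sqrt(d) stabilizes it.
set.seed(42)
d <- 64
dots <- replicate(20000, sum(rnorm(d) * rnorm(d)))

sd(dots)             # close to sqrt(64) = 8
sd(dots / sqrt(d))   # close to 1 after the scaling
```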

Understanding Word Embeddings (2) – Geometry
https://eranraviv.com/understanding-word-embeddings-2-geometry/
Mon, 02 Jun 2025

    I have noticed that when I use the term “coordinates” to talk about vectors, it doesn’t always click for everyone in the room. The previous post covered the algebra of word embeddings; now we explain why you should think of the word-embedding vector simply as coordinates in space. We skip the trivial 1D and 2D cases since they are straightforward. Four dimensions is too complicated for me to gif around with, so 3D will have to suffice for our illustrations.

    Geometric interpretation of a word embedding

    Take a look at this basic matrix:

        \[ \begin{array}{c|ccc} & \text{dog} & \text{cat} & \text{fish} \\ \hline \text{word}_1 & 1 & 0 & 0 \\ \text{word}_2 & 0 & 1 & 0 \\ \text{word}_3 & 0 & 0 & 1 \end{array} \]

    The one-hot word-vectors are visualized in the 3D space as follows:

    To better understand this coordinate system, we can rotate these vectors. This changes the words’ coordinates but preserves their relationships – specifically, the angles between them remain at 90°.

    Observe that the word fish remained where it was; the word cat now sits at [-0.707, 0.707, 0] and the word dog sits at [0.707, 0.707, 0], but the relationship between the words has not changed (it’s a 3D image; the angles are still 90° apart). This illustrates a specific example of what is called a “basis transformation” (the term “basis” is explained in the previous post).
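That rotation can be checked directly in R (a sketch; the 45° rotation in the dog-cat plane is my choice so as to reproduce the coordinates quoted above):

```r
# Rotate the one-hot basis by 45 degrees in the dog-cat plane;
# coordinates change, but all pairwise angles remain 90 degrees.
theta <- pi / 4
R <- rbind(c(cos(theta), -sin(theta), 0),
           c(sin(theta),  cos(theta), 0),
           c(0,           0,          1))

E     <- diag(3)          # rows: dog, cat, fish (one-hot vectors)
E_rot <- E %*% t(R)       # rotated coordinates

round(E_rot, 3)                 # dog -> (0.707, 0.707, 0), cat -> (-0.707, 0.707, 0)
round(E_rot %*% t(E_rot), 10)   # still the identity: all angles stay at 90 degrees
```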

    Basis transformation in our context of word embeddings means that we change our rudimentary one-hot representation, where words are represented in the standard basis, to the embedding representation where words are represented in a semantic basis.

    Semantic basis 🤨 ? Yes, let me explain. Semantics is the branch of linguistics and logic concerned with meaning. But “meaning” is a latent and ill-defined concept. There are different ways to describe what one means: both “inflation” and “rising-prices” map to almost identical meaning. Related to that, a frequent misconception in the field of NLP topic modeling is the belief that topics are actual, well-defined things, which they are not. In fact, we might consider ourselves fortunate if a topic can even indirectly be inferred from the words assigned to the same cluster (e.g., “topic 3”). For instance, if words like deflation, stagflation, and inflation appear together in the same cluster, we could interpret that cluster as a “price stability” topic – even if, as it happens, the cluster also includes many other, unrelated, words. So when we refer to semantics, we’re basically talking about the underlying, unobserved/abstract/latent and ill-defined term: meaning.

    What makes a space semantic is how words connect to each other. That’s different from the example above, where each word just stands on its own – ‘cat’ doesn’t have any special relationship to ‘dog’ or ‘fish’; all three words stand alone.

    Now that we understand the term “semantic”, let’s see what is gained by shifting from our clear and precise one-hot encoding space, to a semantic space.

    Word Coordinates in Semantic Space

    Semantic space is better than a formal/symbolic/non-semantic space largely due to these two advantages:

    1. Dimension reduction (we save storage- and computational costs).
    2. We can relate words to each other. It’s beneficial if words like revenues and earnings aren’t entirely independent algebraically, because in reality they are not independent in what they mean to us (e.g. both words point to potentially higher profits).

    Unpacking in progress:

    1. Dimension reduction: In the one-hot word-representation each word is completely distinct (all 90°, i.e. independent). Words share no common components and are completely dissimilar (similarity=0) in that representation. The upside is that we capture all words in our vocabulary and each word has a clear, specific location. But that space is massive: each vector is of size equal to the number of unique words (the vocabulary size V). When we embed word-vectors we cast each word-vector into a lower-dimensional space. Instead of a coordinate system with V dimensions, we use a lower-dimension coordinate system, say 768. Now each word will not be where it should be exactly. Why not? Because we don’t have V entries to place that word in space, we only have 768, so each word will be placed somewhere inside our 768-dimensional coordinate system. By compressing all those V words into just 768 dimensions, we produce denser representations instead of the extremely sparse one-hot vectors. We inevitably lose the independence that one-hot encodings provided, but this also presents an opportunity to compress the V words in a way that places related words closer together in the compressed 768-dimensional space.
    2. We can relate words to each other: For the following dense word-vectors representation (being silent for now about how to find these vectors)

          \[\begin{pmatrix} \text{dog:} & 0.8 & 0.2 & 0.3 \\ \text{cat:} & 0.7 & 0.3 & 0.4 \\ \text{fish:} & 0.1 & 0.9 & 0.2 \end{pmatrix}\]

      The plot below shows ‘cat’ and ‘dog’ are spatially closer, indicating their greater semantic similarity compared to their similarity with ‘fish’.

      I won’t get into the details of how we find these dense word-vectors. The short version is that we use some transformer model to create those vectors for us. The transformers family of models has the great power to, well.. transform an initial and imprecise set of coordinates (word-vectors), a guess if you will, into a much more reasonable one. Reasonable in that words eventually end up placed near other words that relate to them (think back to our ‘revenues’ and ‘earnings’ example).
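    To make the closeness claim concrete, here is a minimal sketch using the three illustrative vectors from the matrix above. The `cosine` helper is defined here for the example, not taken from any library; a similarity of 0 corresponds to the orthogonal one-hot case.

```python
import numpy as np

# Illustrative dense word-vectors from the matrix above
vectors = {
    "dog":  np.array([0.8, 0.2, 0.3]),
    "cat":  np.array([0.7, 0.3, 0.4]),
    "fish": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: 1 = same direction, 0 = orthogonal (like one-hot vectors)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(vectors["dog"], vectors["cat"]), 2))   # ~0.98: close in space
print(round(cosine(vectors["dog"], vectors["fish"]), 2))  # ~0.39: far apart
```

    The dense representation has traded away exact orthogonality, and in return ‘dog’ and ‘cat’ now share direction in space while ‘fish’ points elsewhere.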

    Word embeddings

    Note that we did not demonstrate dimension reduction here (the representation stayed in 3 dimensions); this illustration focused only on the acquisition of meaning.

    To exemplify dimension reduction we can map the three vectors into a 2-dimensional space:

    mapping into 2 dimensions

    As mentioned, dimension reduction helps reduce storage and compute costs, but do we lose anything? Absolutely we do.

    The Crowding Problem

    Notice that in the 3d space, our one-hot basis can be placed such that all vectors are perpendicular to each other. This is not possible in a 2d space. But we deliberately chose a projection that maintains key distinctions, in our simple example: keeping “dog” and “cat” close to each other while “fish” is distant. What about higher dimensions?

    When reducing dimensionality from the exact one-hot space, with the number of vectors in the hundreds of thousands, to a smaller space, say 768 dimensions, we distort our representation; call it compression costs. Simply put, half a million points, once living large, now have to cram into cheap dorms, with some tokens snoring louder than others. This compression-induced distortion is known by the evocative term ‘the crowding problem’. You may have wondered (I know I did) why we stop at fairly moderate dimensions. Early language models had dimensions of 128, then 512, 768, 1024, 3072 and recently 4096, and that’s about it. Don’t we gain better accuracy if we use, say, 10^4?

    We don’t. Enter the Johnson-Lindenstrauss (JL) Lemma.

    The Johnson-Lindenstrauss (JL) Lemma

    One variation of the lemma is:
    Let 0 < \varepsilon < 1, and let X \subset \mathbb{R}^d be a set of n points. Then, for any integer k \geq \frac{4 \ln n}{\varepsilon^2}, there exists a linear map f: \mathbb{R}^d \rightarrow \mathbb{R}^k such that for all a, b \in X:

        \[ (1 - \varepsilon) \|a - b\|^2 \leq \|f(a) - f(b)\|^2 \leq (1 + \varepsilon) \|a - b\|^2,  \]

    where f is the function that maps points from the high-dimensional space to the low-dimensional space; f(a), f(b) are the projections of points a and b in the lower-dimensional space. Specifically f(x) = Rx where R is the projection matrix. If you are reading this you probably know what PCA is, so think of R as the rotation matrix: to find the first few factors you multiply the original variables with the first few rotation-vectors. \|\cdot\| is the Euclidean norm (so \|\cdot\|^2 is the squared distance), and finally \varepsilon is the distortion parameter (between 0 and 1).

    Simply speaking, the JL lemma states that a set of n points (a point in our context is a word-vector) in high-dimensional space can be mapped into a space of dimension O\left(\frac{\log n}{\varepsilon^2}\right) (much lower than n) while still preserving pairwise distances up to a factor of (1 \pm \varepsilon). As an example, 100,000 points (a vocabulary of 100,000) can theoretically be mapped to

        \[k \approx \frac{\log_2(100{,}000)}{0.1^2} \approx \frac{17}{0.01} = 1700.\]

    Setting \varepsilon = 0.1 means that we accept a distortion of 10%; so if the original distance between vector a and vector b is d, then after the compression it will be between 0.9d and 1.1d.
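    The arithmetic above is easy to check directly. A quick sketch; note the lemma's stated bound uses the natural log with a constant of 4, while the back-of-the-envelope figure above drops the constant and uses log base 2:

```python
import math

n_points = 100_000   # vocabulary size
eps = 0.1            # accepted distortion: distances preserved within 10%

# Bound as stated in the lemma: k >= 4 ln(n) / eps^2
k_lemma = 4 * math.log(n_points) / eps**2
print(math.ceil(k_lemma))   # 4606 dimensions

# Back-of-the-envelope version used in the text: log2(n) / eps^2
k_rough = math.log2(n_points) / eps**2
print(math.ceil(k_rough))   # 1661, i.e. roughly 1700
```

    Either way, the required dimension depends on the number of points only through a logarithm, which is why a six-figure vocabulary fits into a few thousand dimensions.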

    Remarks:

    • Fun fact: in a sufficiently large dimension even a random projection, where f(x) = Rx with the entries of R simply drawn randomly, also approximately preserves pairwise distances. You can (and should) compress large-dimensional data without carefully engineering R; for example, you don’t always have to spend time doing a singular value decomposition. I leave it to the curious reader to check this counter-intuitive fact.
    • As is typical of real data, most of the structure is often captured by the first few dimensions. Again, you can think of the first few factors as capturing most of the variation in the data. Adding factors beyond a certain point can add value, but with sharply diminishing returns. Another fun fact (perhaps I should rethink my perception of fun 🥴): you capture most of the variation with the first few dimensions even if the data is completely random (meaning there is no structure to be captured whatsoever). This fact was recently flagged in the prestigious Econometrica journal under the name spurious factors (it looks like a factor but is simply the result of the numerical procedure).
    • As opposed to the curse of dimensionality, we can think of the ability to compress high-dimensional data without losing much information as the blessing of dimensionality, because it’s simply the flip side of the curse. It cuts both ways: it’s fairly easy to shrink high-dimensional spaces down because points in that space are so spread out, which is exactly why finding meaningful patterns in high-dimensional space is such a headache to begin with.
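    The random-projection fun fact from the first remark can be verified empirically. Below is a minimal sketch; the specific sizes (30 points, 5,000 dimensions compressed to 1,000) and the 1/\sqrt{k} scaling of R are my illustrative choices, not from the post:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, k = 30, 5_000, 1_000   # 30 points, compressed from 5,000 to 1,000 dims

X = rng.normal(size=(n, d))                # points in high-dimensional space
R = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection: no SVD, no engineering
Y = X @ R                                  # project every point (row) at once

def pairwise_dists(A):
    """Euclidean distances between all pairs of rows."""
    i, j = np.triu_indices(len(A), k=1)
    return np.linalg.norm(A[i] - A[j], axis=1)

# Ratio of projected distance to original distance for every pair of points
ratios = pairwise_dists(Y) / pairwise_dists(X)
print(ratios.min(), ratios.max())  # all ratios land close to 1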

    In sum

    We moved from one-hot representation of words in a high-dimensional space to a lower-dimensional “semantic space” where word relationships are captured. We showed how word vectors can be rotated without changing their relative positions. This transformation is crucial because it allows us to represent word meaning and relationships, reducing storage and computation costs. We then moved on to the “crowding problem” that arises from dimensionality reduction and introduced the Johnson-Lindenstrauss (JL) Lemma, which provides the theoretical legitimacy for compressing high-dimensional text data.

    I hope you now have a better grip on:

  • Why we can refer to word-vectors as coordinates in space
  • Why it’s ok to do, and why we don’t lose much information about word relationships
  • Together with the previous post providing the algebraic foundation for word-embeddings, you are well-positioned to actually understand transformers and LLMs better.