Eran Raviv https://eranraviv.com – Modern statistics and econometrics with applications to financial markets

Correlation and Correlation Structure (11) – Estimation using Random Matrix Theory
https://eranraviv.com/correlation-correlation-structure-11-estimation-using-random-matrix-theory/
Fri, 23 Jan 2026

In the classical regime, when we have plenty of observations relative to what we need to estimate, we can rely on the sample covariance matrix as a faithful representation of the underlying covariance structure. However, in the high-dimensional settings common to modern data science – where the number of attributes/features p is comparable to the number of observations n – the sample covariance matrix is a bad estimator. It is not merely noisy; it is misleading. The eigenvalues of such matrices undergo a predictable yet deceptive dispersion, inflating into “signals” that have no basis in reality. Machinery from Random Matrix Theory offers one way to correct for the biases that high-dimensional data creates.

Suppose we observe n independent vectors \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n from a p-dimensional distribution with covariance matrix \boldsymbol{\Sigma}. The obvious estimate is the sample covariance matrix:

    \[\mathbf{S} = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top.\]

The Problem That Won’t Go Away

When p is fixed and n \to \infty, we’re in the classical regime where \mathbf{S} \to \boldsymbol{\Sigma} and all is well. But modern datasets, say in genomics or finance, routinely present us with p \approx n or even p > n. Here, \mathbf{S} becomes a terrible estimate of \boldsymbol{\Sigma}.

How terrible? Consider the eigenvalues. And remind yourself that

Each eigenvalue = variance along its eigenvector direction

Even when the true covariance is the identity matrix \boldsymbol{\Sigma} = \mathbf{I} (pure noise, all eigenvalues equal to 1 in actuality), the eigenvalues of \mathbf{S} don’t cluster near 1. Instead, they spread along an interval roughly [(1-\sqrt{p/n})^2, (1+\sqrt{p/n})^2]. The largest eigenvalue can be inflated by a factor of 4 when p = n. This isn’t sampling variability we can ignore – it’s a systematic, serious distortion.
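To see this dispersion in action, here is a quick R sketch (the variable names are mine, not from the code later in the post):

```r
# Pure noise: the true covariance is the identity, yet sample eigenvalues
# spread over roughly [(1 - sqrt(p/n))^2, (1 + sqrt(p/n))^2].
set.seed(1)
n <- 400
p <- 200                                   # aspect ratio c = p/n = 0.5
X <- matrix(rnorm(n * p), n, p)            # Sigma = I by construction
ev <- eigen(cov(X), symmetric = TRUE, only.values = TRUE)$values

c_ratio <- p / n
mp_lower <- (1 - sqrt(c_ratio))^2          # about 0.086
mp_upper <- (1 + sqrt(c_ratio))^2          # about 2.914

range(ev)                                  # close to the MP bounds, far from 1
```

Despite every population eigenvalue being exactly 1, the sample eigenvalues stretch over almost the whole Marčenko-Pastur interval.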

Enter Random Matrix Theory

In 1967, mathematicians Marčenko and Pastur proved a remarkable theorem: as n, p \to \infty with p/n \to c \in (0,1], the empirical distribution of the eigenvalues of \mathbf{S} converges to a specific limiting density, now called the Marčenko-Pastur law:

    \[f_{MP}(\lambda) = \frac{1}{2\pi c \lambda} \sqrt{(\lambda - \lambda_{-})(\lambda_{+} - \lambda)}, \quad \lambda \in [\lambda_{-}, \lambda_{+}]\]

where \lambda_{\pm} = (1 \pm \sqrt{c})^2 are the edges of the support of the density.

This is beautiful mathematics, but what can we do with it? The key insight came from physicists (Laloux et al., 1999) studying financial correlation matrices. They noticed that most eigenvalues of real covariance matrices pile up in a “bulk” matching the Marčenko-Pastur prediction, while a few large eigenvalues stick out above. The bulk represents noise; the outliers represent signal.

A Simple Cleaning Procedure

This suggests an obvious strategy:

  • Identify the bulk: Estimate the noise level \sigma^2 and compute the Marčenko-Pastur bounds \lambda_{\pm} = \sigma^2(1 \pm \sqrt{c})^2
  • Classify eigenvalues:
    • Signal: l_i > \lambda_+ (or l_i < \lambda_- if c < 1)
    • Noise: l_i \in [\lambda_-, \lambda_+]
  • Shrink the noise: Replace noise eigenvalues with their average \bar{l} = \frac{1}{|\text{noise}|}\sum_{i \in \text{noise}} l_i
  • Reconstruct: \tilde{\mathbf{S}} = \mathbf{U}\tilde{\mathbf{L}}\mathbf{U}^\top where \tilde{\mathbf{L}} contains the cleaned eigenvalues
    This is simple. But we’re not done yet – the signal eigenvalues themselves are inflated by noise.

    The Baik-Silverstein Correction

    In 2006, mathematicians Baik and Silverstein analyzed the “spiked covariance model” where \boldsymbol{\Sigma} has a few large eigenvalues (the “spikes”) and the rest equal to 1. They derived the precise relationship between sample and population eigenvalues. Specifically, for a signal eigenvalue \lambda > 1 + \sqrt{c}, the corresponding sample eigenvalue l satisfies (asymptotically):

        \[l \approx \lambda + \frac{c \lambda}{\lambda - 1}.\]

    This can be inverted: given l, we estimate

        \[\hat{\lambda} = \frac{1}{2}\left((l - c + 1) + \sqrt{(l - c + 1)^2 - 4l}\right)\]

    And so our \hat{\lambda} can now serve as a proper estimate for the true eigenvalue in the reconstruction \tilde{\mathbf{S}} = \mathbf{U}\tilde{\mathbf{L}}\mathbf{U}^\top.
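As a sketch, here is that inversion in R (the function name invert_spike is mine; the optional rescaling by a noise level \sigma^2 other than 1 anticipates the code further below):

```r
# Sketch: invert the spiked-model relation l = lambda + c*lambda/(lambda - 1)
# to recover the population spike from a sample eigenvalue.
invert_spike <- function(l, c_ratio, sigma_sq = 1) {
  l <- l / sigma_sq                        # move to the unit-noise scale
  disc <- (l - c_ratio + 1)^2 - 4 * l
  if (disc < 0) return(NA)                 # eigenvalue inside the bulk: no spike
  sigma_sq * ((l - c_ratio + 1) + sqrt(disc)) / 2
}

# Round trip: forward map, then invert.
lambda <- 3
c_ratio <- 0.5
l <- lambda + c_ratio * lambda / (lambda - 1)   # forward map gives 3.75
invert_spike(l, c_ratio)                        # recovers 3
```

A sample eigenvalue sitting inside the bulk yields a negative discriminant, which we read as “no detectable spike”.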

    Here is how it looks:
    RMT example
    What is plotted is the empirical distribution of sample eigenvalues (blue histogram) from a data matrix of n = 200 observations on p = 100 variables, compared to the MP density (in black). The population covariance contains three (real) signal eigenvalues, indicated by the green dash-dot lines. c = 0.5 in this example. Red dashed lines mark the MP bounds. Magenta dotted lines show the expected locations of the sample signal eigenvalues under the spiked-model formula above.

    A valuable addition to your covariance estimation arsenal.

    Code

    set.seed(123)  # For reproducibility
    
    n <- 200  # Number of observations (rows)
    p <- 100  # Number of variables (columns)
    
    # Define eigenvalues
    signal_eigenvalues <- c(0.1, 3, 4)
    noise_eigenvalue <- 0.5
    n_signal <- length(signal_eigenvalues)
    
    # Create the full eigenvalue vector
    eigenvalues <- c(signal_eigenvalues, rep(noise_eigenvalue, p - n_signal))
    
    # Generate random orthogonal matrix for eigenvectors
    # Use QR decomposition of a random matrix
    random_matrix <- matrix(rnorm(p * p), p, p)
    Q <- qr.Q(qr(random_matrix))  # Orthogonal matrix (eigenvectors)
    
    # QR decomposition background:
    # Any matrix can be factored as M = QR where
    #   Q is orthogonal (columns are orthonormal)
    #   R is upper triangular
    # qr.Q() extracts just the Q matrix.
    # Bottom line: a standard simulation trick for generating random covariance
    # matrices with controlled eigenvalues – you specify the eigenvalues
    # (diagonal matrix) and use a random orthogonal matrix for the directions
    # (eigenvectors).
    
    # Construct the population covariance matrix
    Sigma <- Q %*% diag(eigenvalues) %*% t(Q)
    
    Sigma[1:10,1:10]
    # Generate data from multivariate normal with this covariance
    # Mean vector is zero
    mu <- rep(0, p)
    
    # Generate the data matrix using the Cholesky decomposition
    # X = Z * L where L is Cholesky factor of Sigma
    L <- chol(Sigma)  # Upper triangular Cholesky factor
    Z <- matrix(rnorm(n * p), n, p)  # Standard normal random variables
    X <- Z %*% L  # n x p data matrix
    
    dim(X)
    
    # Verify the structure
    cat("Data matrix dimensions:", dim(X), "\n")
    cat("Population eigenvalues (first 10):", eigenvalues[1:10], "\n\n")
    
    # Compute sample covariance matrix
    S <- cov(X)
    
    # Check sample eigenvalues
    sample_eigenvalues <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
    cat("Sample eigenvalues (first 10):", round(sample_eigenvalues[1:10], 3), "\n")
    cat("Sample eigenvalues (last 3):", round(tail(sample_eigenvalues, 3), 3), "\n\n")
    barplot(sample_eigenvalues)
    
    # Compute aspect ratio and Marchenko-Pastur bounds
    c_ratio <- p / n
    lambda_plus <- noise_eigenvalue * (1 + sqrt(c_ratio))^2
    lambda_minus <- noise_eigenvalue * (1 - sqrt(c_ratio))^2
    
    cat("Aspect ratio c = p/n:", round(c_ratio, 3), "\n")
    cat("Marchenko-Pastur bounds for noise (sigma^2 = 0.5):", 
        round(lambda_minus, 3), "to", round(lambda_plus, 3), "\n\n")
    
    # Expected locations of signal eigenvalues
    phi <- function(lambda, c, sigma_sq = 0.5) {
      lambda_scaled <- lambda / sigma_sq
      result <- sigma_sq * (lambda_scaled + c * lambda_scaled / (lambda_scaled - 1))
      return(result)
    }
    
    # this gives us the locations of the sample eigenvalues that we should expect for our signal, based on the theory.
    cat("Expected sample eigenvalue locations:\n")
    for(i in 1:n_signal) {
      expected <- phi(signal_eigenvalues[i], c_ratio)
      cat(sprintf("  Population λ = %.1f → Expected sample eigenvalue ≈ %.3f\n", 
                  signal_eigenvalues[i], expected))
    }
    
    mp_density <- function(x, c, sigma_sq) {
      lambda_plus <- sigma_sq * (1 + sqrt(c))^2
      lambda_minus <- sigma_sq * (1 - sqrt(c))^2
      density <- ifelse(x >= lambda_minus & x <= lambda_plus,
                        sqrt((lambda_plus - x) * (x - lambda_minus)) / (2 * pi * sigma_sq * c * x), 0)
      return(density)
    }
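The code above stops just short of the cleaning itself. Here is a hedged, self-contained sketch of the full pipeline (my own continuation; the settings mirror the simulation above, but with two spikes, 4 and 3, on a noise level of 0.5, for brevity):

```r
# Sketch: the full RMT cleaning pipeline, end to end.
set.seed(123)
n <- 200; p <- 100
c_ratio <- p / n
sigma_sq <- 0.5
pop_eig <- c(4, 3, rep(sigma_sq, p - 2))

Q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))
Sigma <- Q %*% diag(pop_eig) %*% t(Q)
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

eS <- eigen(cov(X), symmetric = TRUE)
l  <- eS$values

# 1) Marchenko-Pastur bounds for the noise bulk
lam_plus  <- sigma_sq * (1 + sqrt(c_ratio))^2
lam_minus <- sigma_sq * (1 - sqrt(c_ratio))^2

# 2)-3) classify eigenvalues, shrink the bulk to its average
is_noise <- l >= lam_minus & l <= lam_plus
l_clean  <- l
l_clean[is_noise] <- mean(l[is_noise])

# 4) de-bias the spikes with the inverted Baik-Silverstein relation
for (i in which(!is_noise)) {
  ls   <- l[i] / sigma_sq                  # move to the unit-noise scale
  disc <- (ls - c_ratio + 1)^2 - 4 * ls
  if (disc > 0) l_clean[i] <- sigma_sq * ((ls - c_ratio + 1) + sqrt(disc)) / 2
}

# 5) reconstruct with the original eigenvectors
S_clean <- eS$vectors %*% diag(l_clean) %*% t(eS$vectors)
round(head(l_clean, 3), 2)   # leading entries should land near the true 4 and 3
```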

    References

    • Laloux, L., Cizeau, P., Bouchaud, J.-P., & Potters, M. (1999). Noise dressing of financial correlation matrices. Physical Review Letters, 83, 1467.
    • Baik, J., & Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97, 1382-1408.
    • Marčenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4), 457-483.
Most popular posts – 2025
https://eranraviv.com/popular-posts-2025/
Wed, 31 Dec 2025

    Today is the last day of 2025. Depending on where you’re reading this, the party might have already begun. I begin this end-of-year post by wishing you a safe and fun time tonight.

    This blog is just a personal hobby. When I’m extra busy, as I was this year, the blog is a front-line casualty. That is why 2025 saw a weaker posting stream. As I do almost every year, I checked the analytics stats to see what generated the most interest. I am pleased with around 13K (active) visits this year, and an average of 37 seconds spent per visit (counted only while the browser tab is in the foreground and active). Given our collective TikTok-wired brains, I consider those 37 seconds a compliment.

    The two most popular posts this year were also, coincidentally, the two pieces I most enjoyed creating. Both dive into the all-important world of word embeddings:

    The first is Understanding Word Embeddings (1) – Algebra and the second is Understanding Word Embeddings (2) – Geometry.

    On the left (scroll down) you can find the most popular posts from previous years.

    To my readership. Thank you for reading, sharing, for your emails, corrections, comments and good questions. Happy, healthy, and productive 2026!


    pic credit: Morvanic Lee

Correlation and correlation structure (10) – Inverse Covariance
https://eranraviv.com/correlation-correlation-structure-10-inverse-covariance/
Mon, 29 Sep 2025

    The covariance matrix is central to many statistical methods. It tells us how variables move together, and its diagonal entries – variances – are very much our go-to measure of uncertainty. But the real action lives in its inverse. We call the inverse covariance matrix either the precision matrix or the concentration matrix. Where did these terms come from? I doubt this question has kept you up at night, but I still think you’ll find the answer interesting.

    Why is the Inverse Covariance Called Precision?

    Variance is just a noisy soloist; if you want to know who really controls the music – who depends on whom – you listen to the precision \Omega. While a variable may look wiggly and wild on its own, you can often tell where it lands quite precisely, conditional on the other variables in the system. The inverse of the covariance matrix encodes the conditional dependence between any two variables after controlling for the rest. The mathematical details appear in an earlier post, which the curious reader should consult.

    The following code and figure illustrate the precision terminology. Consider this little experiment:

        \[ X_2, X_3 \sim \mathcal{N}(0,1) \text{, independent and ordinary.} \]

        \[ X_1 = 2X_2 + 3X_3 + \text{small noise}.\]

    Now, X_1 has a large (marginal) variance; look, it’s all over the place:
    X1 variance
    But but but… given the other two variables you can determine X_1 quite accurately (because it doesn’t carry much noise of its own); hence the term precision. The precision matrix captures exactly this phenomenon. Its diagonal entries are not about marginal uncertainty, but conditional uncertainty: how much variability remains when the values of the other variables are given. The inverse of the precision entry \Omega_{11} is the residual variance of X_1 after regressing it on the other two variables. The math behind it is found in an earlier post; for now it suffices to write:

        \[\text{For each } i=1,\dots,p: \quad  X_i = \sum_{j \neq i} \beta_{ij} X_j + \varepsilon_i,  \quad \text{with } \mathrm{Var}(\varepsilon_i) = \sigma_i^2.\]

        \[\quad \Omega_{ii} = \tfrac{1}{\sigma_i^2},  \qquad \Omega_{ij} = -\tfrac{\beta_{ij}}{\sigma_i^2}.\]

    So after accounting for the other two variables, you are left with

        \[\text{small noise} \;\Rightarrow\; \frac{1}{\text{small noise}} \;\Rightarrow\; \text{high precision}\]

    which in this case looks as follows:
    X1 precise given X2,X3

    This small illustration also reveals a useful computational insight. Instead of directly inverting the covariance matrix (expensive for high dimensions), you can also run parallel regressions of each variable on all others, which may scale better on distributed systems.
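A sketch of that regression route (names and toy data are mine, mirroring the experiment above; dividing the residual sum of squares by n-1 matches the divisor cov() uses, which makes the identity exact):

```r
# Sketch: precision-matrix entries via per-variable regressions,
# with no matrix inversion of Sigma needed for the entries themselves.
set.seed(123)
n <- 5000
X2 <- rnorm(n); X3 <- rnorm(n)
X1 <- 2 * X2 + 3 * X3 + rnorm(n, 0, 0.5)
X  <- cbind(X1, X2, X3)

Omega <- solve(cov(X))                     # direct inversion, kept as reference

fit1 <- lm(X1 ~ X2 + X3)                   # regress variable 1 on the others
s2_1 <- sum(residuals(fit1)^2) / (n - 1)   # residual variance, divisor n-1

omega_11 <- 1 / s2_1                       # reproduces Omega[1, 1]
omega_12 <- -coef(fit1)[["X2"]] / s2_1     # reproduces Omega[1, 2]

c(Omega[1, 1], omega_11)
c(Omega[1, 2], omega_12)
```

Each variable’s regression can run independently, which is what makes this route attractive on distributed systems.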

    Why is the Inverse Covariance Called Concentration?

    Now, what motivates the concentration terminology? What is concentrated? Let’s unwrap it. Let’s begin by first looking at the density of a single normally distributed random variable:

        \[f(x) \propto \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right).\]

    So if x = \mu we have e^{-(x-\mu)^2}= e^0=1, and otherwise we have e^{-(x-\mu)^2}= e^{(\text{negative number})}. This negative number will then be divided by the variance – or, in our context, multiplied by the precision (which is the reciprocal of the variance for a single variable). A higher precision value makes for a negativier (😀) exponent. In turn, it reduces the density the further we drift from the mean (think faster mass-drop in the tails), so we get a sharper, more peaked density whose values are tightly concentrated around the mean. A numeric sanity check. Below are two cases with mean zero, one with variance 1 (so precision \tau=1), and the other with variance 1/4 (\tau=4). We look at two values, one at the mean (x=0) and one farther away (x=1), and check the density at those values for the two cases (p_{1}(0), p_{1}(1), p_{4}(0), and p_{4}(1)):

        \[ X\sim\mathcal N(0,\sigma^2),\quad \tau=\frac{1}{\sigma^2},\quad p_\tau(x)=\frac{\sqrt{\tau}}{\sqrt{2\pi}}\exp\!\left(-\tfrac12\tau x^{2}\right) \]

        \[ p_{1}(0)=\frac{1}{\sqrt{2\pi}}\approx 0.39,\qquad p_{4}(0)=\frac{2}{\sqrt{2\pi}}=\sqrt{\frac{2}{\pi}}\approx 0.79 \]

        \[ p_{1}(1)=\frac{1}{\sqrt{2\pi}}e^{-1/2}\approx 0.24,\qquad p_{4}(1)=\frac{2}{\sqrt{2\pi}}e^{-2}\approx 0.10 \]

        \[ \tau\uparrow\;\Rightarrow\;p(0)\uparrow,\;p(1)\downarrow \]

    In words: higher precision leads to lower density away from the mean and, therefore, higher density around the mean (because the density has to integrate to one, and the mass must go somewhere).
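The same sanity check in two lines of R (dnorm() is parameterized by the standard deviation, so \tau = 4 means sd = 1/2):

```r
# Density at the mean and one unit away, for precision tau = 1 and tau = 4.
tau <- c(1, 4)
sds <- 1 / sqrt(tau)

round(dnorm(0, sd = sds), 2)   # 0.40 0.80 -> more mass at the mean as tau grows
round(dnorm(1, sd = sds), 2)   # 0.24 0.11 -> less mass away from the mean
```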

    Moving to the multivariate case. Say that also X_1 is normally distributed, then the joint multivariate Gaussian distribution of our 3 variables is proportional to:

        \[f(\mathbf{x}) \propto \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \mathbf{\Omega} (\mathbf{x}-\boldsymbol{\mu})\right)\]

        \[\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}, \quad \mathbf{\Omega} = \begin{bmatrix} \Omega_{11} & \Omega_{12} & \Omega_{13} \\ \Omega_{21} & \Omega_{22} & \Omega_{23} \\ \Omega_{31} & \Omega_{32} & \Omega_{33} \end{bmatrix}\]

    In the same fashion, \Omega directly sets the shape and orientation of the contours of the multivariate density. If there is no correlation and little precision (think a diagonal \Omega with small entries), what would you expect to see? A wide, diffused, spread-out cloud, indicating little concentration. By way of contrast, a full \Omega weights the directions differently; it determines how much probability mass gets concentrated in each direction through the space.

    Another way to see this is to remember that in the multivariate Gaussian density, \mathbf{\Omega} appears in the numerator, so its inverse, the covariance \boldsymbol{\Sigma}, would sit in the denominator. Larger covariance entries mean more spread and, as a result, lower density values at individual points, and thus a more diffused multivariate distribution overall.
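A compact way to quantify “more/less concentrated” is the height of the density peak, which for a p-dimensional Gaussian equals (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2}. A sketch for the two 3 \times 3 scenarios used below (the helper peak_height is mine):

```r
# Peak height of a p-dimensional Gaussian: (2*pi)^(-p/2) / sqrt(det(Sigma)).
# A larger determinant (more spread) means a flatter peak.
peak_height <- function(Sigma) {
  p <- nrow(Sigma)
  (2 * pi)^(-p / 2) / sqrt(det(Sigma))
}

Sigma_round <- diag(3); Sigma_round[1, 2] <- Sigma_round[2, 1] <- 0.1
Sigma_ridge <- diag(3); Sigma_ridge[1, 2] <- Sigma_ridge[2, 1] <- 0.9

peak_height(Sigma_round)   # lower peak: mass spread out
peak_height(Sigma_ridge)   # higher peak: mass piled along the ridge
```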

    The following two simple 3 \times 3 scenarios illustrate the concentration principle explained above. In the code below you can see that while I plot only the first 2 variables, there are actually 3 variables; the third one is independent, so a high covariance between the first two would remain high even after we account for the third (I mention this so that you don’t get confused that we now work with the covariance, rather than with the inverse). Here are the two scenarios:

    \textbf{Round hill:} \boldsymbol{\Sigma}_1 with correlation \rho = 0.1 creates spherical contours.

    \textbf{Elongated ridge:} \boldsymbol{\Sigma}_2 with correlation \rho = 0.9 creates elliptical contours stretched along the correlation direction.

    Rotate the below interactive plots to get a clearer sense of what we mean by more/less concentration. Don’t forget to check the density scale.

    Round hill: diffused, less concentrated

    Elongated ridge: steep/peaky, more concentrated

    Hopefully, this explanation makes the terminology for the inverse covariance clearer.

    Code

    For Precision

    set.seed(123)
    n <- 100
    # Generate correlated data where X1 has high variance but is predictable from X2, X3
    # X1 = 2*X2 + 3*X3 + small noise
    X2 <- rnorm(n, 0, 1)
    X3 <- rnorm(n, 0, 1)
    X1 <- 2*X2 + 3*X3 + rnorm(n, 0, 0.5)  # Small conditional variance!
    
    # Combine into matrix
    X <- cbind(X1, X2, X3)
    
    # Compute covariance and precision
    Sigma <- cov(X)
    Omega <- solve(Sigma)
    
    # Display results
    cat("MARGINAL variances (diagonal of Sigma):\n")
    print(diag(Sigma))
            X1         X2         X3 
    11.3096683  0.8332328  0.9350631 
    
    cat("\nPRECISION values (diagonal of Omega):\n")
    print(diag(Omega))
           X1        X2        X3 
     4.511182 18.066305 41.995608 
    
    cat("\nCONDITIONAL variances (1/diagonal of Omega):\n")
    print(1/diag(Omega))
            X1         X2         X3 
    0.22167138 0.05535166 0.02381201 
    
    # Verification - residual variance from regression:
    fit <- lm(X1 ~ X2 + X3)
    cat("Var(X1|X2,X3) =", var(fit$residuals), "\n")
    Var(X1|X2,X3) = 0.2216714 
    cat("1/Omega[1,1] =", 1/Omega[1,1], "\n")
    1/Omega[1,1] = 0.2216714

    For Concentration

    library(MASS)
    library(rgl)
    library(htmltools)
    set.seed(123)
    n <- 5000
    mu <- c(0,0,0)
    
    # Case 1: Low correlation (round hill)
    Sigma1 <- matrix(c(1, 0.1, 0,
                       0.1, 1, 0,
                       0, 0, 1), 3, 3)
    X1 <- mvrnorm(n, mu=mu, Sigma=Sigma1)
    
    # Case 2: High correlation (elongated ridge)
    Sigma2 <- matrix(c(1, 0.9, 0,
                       0.9, 1, 0,
                       0,   0, 1), 3, 3)
    X2 <- mvrnorm(n, mu=mu, Sigma=Sigma2)
    
    # Density for marginal (X1,X2)
    kd1 <- kde2d(X1[,1], X1[,2], n=150, lims=c(range(X1[,1]), range(X1[,2])))
    kd2 <- kde2d(X2[,1], X2[,2], n=150, lims=c(range(X2[,1]), range(X2[,2])))
    
    # Plot 1: Low correlation → "round mountain"
    open3d(useNULL=TRUE)
    persp3d(kd1$x, kd1$y, kd1$z,
            col = terrain.colors(100)[cut(kd1$z, 100)],
            aspect = c(1,1,0.4),
            xlab="X1", ylab="X2", zlab="Density",
            smooth=TRUE, alpha=0.9)
    title3d("Low correlation (round, diffuse)", line=2)
    widget1 <- rglwidget(width=450, height=450)
    # Plot 2: High correlation → "ridge mountain"
    open3d(useNULL=TRUE)
    persp3d(kd2$x, kd2$y, kd2$z,
            col = terrain.colors(100)[cut(kd2$z, 100)],
            aspect = c(1,1,0.4),
            xlab="X1", ylab="X2", zlab="Density",
            smooth=TRUE, alpha=0.9)
    title3d("High correlation (elongated, concentrated)", line=2)
    widget2 <- rglwidget(width=450, height=450)

Dot Product in the Attention Mechanism
https://eranraviv.com/dot-product-attention-mechanism/
Mon, 28 Jul 2025

    The dot product of two embedding vectors \mathbf{x} and \mathbf{y} with dimension d is defined as

        \[\mathbf{x} \cdot \mathbf{y} = x_1y_1 + x_2y_2 + \dots + x_dy_d.\]

    Hardly the first thing that jumps to mind when thinking about a “similarity score”. Indeed, the result of a dot product is a single number (a scalar), with no predefined range (e.g. not between zero and one), so it’s hard to judge whether a particular score is high or low on its own. Still, the Transformer family of deep learning models relies heavily on the dot product in the attention mechanism, to weigh the importance of different parts of the input sentence. This post explains why the dot product, which seems like an odd pick for a “similarity score”, actually makes good sense.

    Dot Product and Vector Similarity

    I assume here you already understand the geometry of word embeddings.

    Consider the following: two vectors pointing the same way (0° angle), so we want a maximum similarity score; two vectors pointing in opposite ways (180° angle), so we want a minimum (least similar) score, and two perpendicular vectors (90° angle), so we want a zero similarity score, meaning no relationship.

    If we just used the angle itself, the numbers would be backward: 180° (least similar) is a higher number than 0° (most similar). That’s not at all intuitive for a “similarity” scale. The solution: feed the angle to the cosine function. This gives us exactly the order we want:

    \theta (degrees)    \cos(\theta)
    180                 -1.000
    150                 -0.866
    120                 -0.500
    90                   0.000
    60                   0.500
    30                   0.866
    0                    1.000

    The formula to find the cosine of the angle between two vectors \mathbf {x} and \mathbf {y} of size d is:

        \[\cos(\theta )={\mathbf {x} \cdot \mathbf {y} \over \|\mathbf {x} \|\|\mathbf {y} \|}={\frac {\sum \limits _{i=1}^{d}{x_{i}y_{i}}}{{\sqrt {\sum \limits _{i=1}^{d}{x_{i}^{2}}}}{\sqrt {\sum \limits _{i=1}^{d}{y_{i}^{2}}}}}}.\]

    Manipulating the above just a bit, to get our dot-product friend on the left-hand side, gives us:

        \[\mathbf {x} \cdot \mathbf {y} = \|\mathbf {x} \| ( \|\mathbf {y} \| \cos(\theta ))\]

    Now what is \|\mathbf{y}\|\cos(\theta)? It’s how closely the two vectors point in the same direction, multiplied by the magnitude of \mathbf{y}: \|\mathbf{y}\| = \sqrt{y_1^2 + y_2^2 + \cdots + y_d^2}. Put differently: how much of \mathbf{y} lies in the direction of \mathbf{x}; in other words, if I project \mathbf{y} onto \mathbf{x}, how much is “captured”? (Answer: exactly \|\mathbf {y} \| \cos(\theta ).) Then we multiply by \|\mathbf {x} \|, so as to account for the magnitude of \mathbf{x} as well.
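A tiny numeric illustration (the vectors are mine, chosen so the angle is 45°):

```r
# dot(x, y) = |x| * |y| * cos(theta): check on a concrete pair.
x <- c(3, 0)
y <- c(2, 2)

dot_xy    <- sum(x * y)                    # 6
norm_x    <- sqrt(sum(x^2))                # 3
norm_y    <- sqrt(sum(y^2))                # 2 * sqrt(2)
cos_theta <- dot_xy / (norm_x * norm_y)    # cos(45 deg), about 0.707

projection <- norm_y * cos_theta   # how much of y lies along x: exactly 2
norm_x * projection                # recovers the dot product: 6
```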

    Can a Vector Be More “Similar” to Others Than to Itself?

    If you google self-attention matrix images you will see that the diagonal is quite high, for example:
    self-attention matrix
    That is because, as we mentioned, if the angle is zero the cosine is 1, and the score boils down to multiplying the magnitudes of the two embedding vectors.

    That said, I got to thinking about this topic because of the observation that while a vector should theoretically be most similar to itself, that’s not always the case in some of the images I saw in papers or generated myself (the diagonal does not always hold the highest scores, as you can see in the figure above as well). So it is possible for different vectors to receive higher similarity scores than a vector does with itself – even in self-attention matrices. This is because the embedding coordinates are not always normalized to a consistent length. So even if two vectors point in almost the same direction (a very small angle), the sheer “size” of one of them can still make their similarity score surprisingly high. This is why I explained it that way above.

    By the way, that’s also the reason for the normalizing factor \sqrt{d_k}. As d grows, the variability of the dot product grows; the numbers can become very large because of the magnitude of the vectors themselves (larger d –> more elements to square and sum). It’s not good to feed large numbers to the softmax, because the exponent e^{(\cdot)} is going to explode, so before we exponentiate we “stabilize” the dot product by dividing it by \sqrt{d_k}. Why \sqrt{d_k}? No real reason. You can choose another term if you wish. The story is that \sqrt{d_k} is the standard deviation of the dot product if you assume the entries of the embedding vectors are standard normally distributed (they are not, but this Gaussian assumption provides a good excuse for using \sqrt{d_k}).
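A quick simulation of that story (a sketch under the stated Gaussian assumption; d = 64 is an arbitrary choice of mine):

```r
# For entries i.i.d. N(0,1), the dot product of two d-vectors has mean 0
# and standard deviation sqrt(d); dividing by sqrt(d) stabilizes it.
set.seed(42)
d <- 64
dots <- replicate(20000, sum(rnorm(d) * rnorm(d)))

sd(dots)             # close to sqrt(64) = 8
sd(dots / sqrt(d))   # close to 1 after the scaling
```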

Understanding Word Embeddings (2) – Geometry
https://eranraviv.com/understanding-word-embeddings-2-geometry/
Mon, 02 Jun 2025

    I have noticed that when I use the term “coordinates” to talk about vectors, it doesn’t always click for everyone in the room. The previous post covered the algebra of word embeddings; now we explain why you should think of the word-embedding vector simply as coordinates in space. We skip the trivial 1D and 2D cases since they are straightforward. Four dimensions is too complicated for me to gif around with, so 3D will have to suffice for our illustrations.

    Geometric interpretation of a word embedding

    Take a look at this basic matrix:

        \[ \begin{array}{c|ccc} & \text{dog} & \text{cat} & \text{fish} \\ \hline \text{word}_1 & 1 & 0 & 0 \\ \text{word}_2 & 0 & 1 & 0 \\ \text{word}_3 & 0 & 0 & 1 \end{array} \]

    The one-hot word-vectors are visualized in the 3D space as follows:

    To better understand this coordinate system, we can rotate these vectors. This changes the words’ coordinates but preserves their relationships – specifically, the angles between them remain at 90°.

    Observe that the word fish remained where it was; the word cat now sits at [-0.707, 0.707, 0] and the word dog sits at [0.707, 0.707, 0], but the relationship between the words has not changed (it’s a 3D image; the angles are still 90° apart). This illustrates a specific example of what is called a “basis transformation” (the term “basis” is explained in the previous post).
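That rotation can be checked directly in R (a sketch; the 45° rotation in the dog-cat plane is my choice so as to reproduce the coordinates quoted above):

```r
# Rotate the one-hot basis by 45 degrees in the dog-cat plane;
# coordinates change, but all pairwise angles remain 90 degrees.
theta <- pi / 4
R <- rbind(c(cos(theta), -sin(theta), 0),
           c(sin(theta),  cos(theta), 0),
           c(0,           0,          1))

E     <- diag(3)          # rows: dog, cat, fish (one-hot vectors)
E_rot <- E %*% t(R)       # rotated coordinates

round(E_rot, 3)                 # dog -> (0.707, 0.707, 0), cat -> (-0.707, 0.707, 0)
round(E_rot %*% t(E_rot), 10)   # still the identity: all angles stay at 90 degrees
```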

    Basis transformation in our context of word embeddings means that we change our rudimentary one-hot representation, where words are represented in the standard basis, to the embedding representation where words are represented in a semantic basis.

    Semantic basis 🤨 ? Yes, let me explain. Semantics is the branch of linguistics and logic concerned with meaning. But “meaning” is a latent and ill-defined concept. There are different ways to describe what one means: both “inflation” and “rising-prices” map to almost identical meaning. Related to that, a frequent misconception in the field of NLP topic modeling is the belief that topics are actual, well-defined things, which they are not. In fact, we might consider ourselves fortunate if a topic can even indirectly be inferred from the words assigned to the same cluster (e.g., “topic 3”). For instance, if words like deflation, stagflation, and inflation appear together in the same cluster, we could interpret that cluster as a “price stability” topic – even if, as it happens, the cluster also includes many other, unrelated, words. So when we refer to semantics, we’re basically talking about the underlying, unobserved/abstract/latent and ill-defined term: meaning.

    What makes a space semantic is how words connect to each other. That’s different from the example above, where each word just stands on its own – ‘cat’ doesn’t have any special relationship to ‘dog’ or ‘fish’; all three words stand alone.

    Now that we understand the term “semantic”, let’s see what is gained by shifting from our clear and precise one-hot encoding space, to a semantic space.

    Word Coordinates in Semantic Space

    Semantic space is better than a formal/symbolic/non-semantic space largely due to these two advantages:

    1. Dimension reduction (we save storage- and computational costs).
    2. We can relate words to each other. It’s beneficial if words like revenues and earnings aren’t entirely independent algebraically, because in reality they are not independent in what they mean to us (e.g. both words point to potentially higher profits).

    Unpacking in progress:

    1. Dimension reduction: In the one-hot word-representation each word is completely distinct (all 90°, i.e. independent). Words share no common components and are completely dissimilar (similarity=0) in that representation. The upside is that we capture all words in our vocabulary and each word has a clear, specific location. But that space is massive: each vector is of size equal to the number of unique words (the vocabulary size V). When we embed word-vectors we cast each word-vector into a lower-dimensional space. Instead of a coordinate system with V dimensions, we use a lower-dimension coordinate system, say 768. Now each word will not be where it should be exactly. Why not? Because we don’t have V entries to place that word in space, we only have 768, so each word will be placed somewhere inside our 768-dimensional coordinate system. By compressing all those V words into just 768 dimensions, we produce denser representations instead of the extremely sparse one-hot vectors. We inevitably lose the independence that one-hot encodings provided, but this also presents an opportunity to compress the V words in a way that places related words closer together in the compressed 768-dimensional space.
    2. We can relate words to each other: For the following dense word-vectors representation (being silent for now about how to find these vectors)

          \[\begin{pmatrix} \text{dog:} & 0.8 & 0.2 & 0.3 \\ \text{cat:} & 0.7 & 0.3 & 0.4 \\ \text{fish:} & 0.1 & 0.9 & 0.2 \end{pmatrix}\]

      The plot below shows ‘cat’ and ‘dog’ are spatially closer, indicating their greater semantic similarity compared to their similarity with ‘fish’.

      I won’t get into the details of how we find these dense word-vectors. The short version is that we use some transformer model to create those vectors for us. The transformers family of models has the great power to, well.. transform an initial and imprecise set of coordinates (word-vectors), a guess if you will, into a much more reasonable one. Reasonable in that words eventually end up placed near other words that relate to them (think back to our ‘revenues’ and ‘earnings’ example).
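    To make the closeness claim concrete, here is a minimal sketch using the three illustrative vectors from the matrix above. The `cosine` helper is defined here for the example, not taken from any library; a similarity of 0 corresponds to the orthogonal one-hot case.

```python
import numpy as np

# Illustrative dense word-vectors from the matrix above
vectors = {
    "dog":  np.array([0.8, 0.2, 0.3]),
    "cat":  np.array([0.7, 0.3, 0.4]),
    "fish": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: 1 = same direction, 0 = orthogonal (like one-hot vectors)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(vectors["dog"], vectors["cat"]), 2))   # ~0.98: close in space
print(round(cosine(vectors["dog"], vectors["fish"]), 2))  # ~0.39: far apart
```

    The dense representation has traded away exact orthogonality, and in return ‘dog’ and ‘cat’ now share direction in space while ‘fish’ points elsewhere.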

    Word embeddings

    Note that we did not demonstrate dimension reduction here (the representation stayed in 3 dimensions); this illustration focused only on the acquisition of meaning.

    To exemplify dimension reduction we can map the three vectors into a 2-dimensional space:

    mapping into 2 dimensions

    As mentioned, dimension reduction helps reduce storage and compute costs, but do we lose anything? Absolutely we do.

    The Crowding Problem

    Notice that in the 3d space, our one-hot basis can be placed such that all vectors are perpendicular to each other. This is not possible in a 2d space. But we deliberately chose a projection that maintains key distinctions, in our simple example: keeping “dog” and “cat” close to each other while “fish” is distant. What about higher dimensions?

    When reducing dimensionality from the exact one-hot space, with the number of vectors in the hundreds of thousands, to a smaller space, say 768 dimensions, we distort our representation; call it compression costs. Simply put, half a million points, once living large, now have to cram into cheap dorms, with some tokens snoring louder than others. This compression-induced distortion is known by the evocative term ‘the crowding problem’. You may have wondered (I know I did) why we stop at fairly moderate dimensions. Early language models had dimensions of 128, then 512, 768, 1024, 3072 and recently 4096, and that’s about it. Don’t we gain better accuracy if we use, say, 10^4?

    We don’t. Enter the Johnson-Lindenstrauss (JL) Lemma.

    The Johnson-Lindenstrauss (JL) Lemma

    One variation of the lemma is:
    Let 0 < \varepsilon < 1, and let X \subset \mathbb{R}^d be a set of n points. Then, for any integer k \geq \frac{4 \ln n}{\varepsilon^2}, there exists a linear map f: \mathbb{R}^d \rightarrow \mathbb{R}^k such that for all a, b \in X:

        \[ (1 - \varepsilon) \|a - b\|^2 \leq \|f(a) - f(b)\|^2 \leq (1 + \varepsilon) \|a - b\|^2,  \]

    where f is the function that maps points from the high-dimensional space to the low-dimensional space; f(a), f(b) are the projections of points a and b in the lower-dimensional space. Specifically f(x) = Rx where R is the projection matrix. If you are reading this you probably know what PCA is, so think of R as the rotation matrix: to find the first few factors you multiply the original variables with the first few rotation-vectors. \|\cdot\| is the Euclidean norm (so \|\cdot\|^2 is the squared distance), and finally \varepsilon is the distortion parameter (between 0 and 1).

    Simply speaking, the JL lemma states that a set of n points (a point in our context is a word-vector) in high-dimensional space can be mapped into a space of dimension O\left(\frac{\log n}{\varepsilon^2}\right) (much lower than n) while still preserving pairwise distances up to a factor of (1 \pm \varepsilon). As an example, 100,000 points (a vocabulary of 100,000) can theoretically be mapped to

        \[k \approx \frac{\log_2(100{,}000)}{0.1^2} \approx \frac{17}{0.01} = 1700.\]

    Setting \varepsilon = 0.1 means that we accept a distortion of 10%; so if the original distance between vector a and vector b is d, then after the compression it will be between 0.9d and 1.1d.
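    The arithmetic above is easy to check directly. A quick sketch; note the lemma's stated bound uses the natural log with a constant of 4, while the back-of-the-envelope figure above drops the constant and uses log base 2:

```python
import math

n_points = 100_000   # vocabulary size
eps = 0.1            # accepted distortion: distances preserved within 10%

# Bound as stated in the lemma: k >= 4 ln(n) / eps^2
k_lemma = 4 * math.log(n_points) / eps**2
print(math.ceil(k_lemma))   # 4606 dimensions

# Back-of-the-envelope version used in the text: log2(n) / eps^2
k_rough = math.log2(n_points) / eps**2
print(math.ceil(k_rough))   # 1661, i.e. roughly 1700
```

    Either way, the required dimension depends on the number of points only through a logarithm, which is why a six-figure vocabulary fits into a few thousand dimensions.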

    Remarks:

    • Fun fact: in a sufficiently large dimension even a random projection, where f(x) = Rx with the entries of R simply drawn randomly, also approximately preserves pairwise distances. You can (and should) compress large-dimensional data without carefully engineering R; for example, you don’t always have to spend time doing a singular value decomposition. I leave it to the curious reader to check this counter-intuitive fact.
    • As is typical of real data, most of the structure is often captured by the first few dimensions. Again, you can think of the first few factors as capturing most of the variation in the data. Adding factors beyond a certain point can add value, but with sharply diminishing returns. Another fun fact (perhaps I should rethink my perception of fun 🥴): you capture most of the variation with the first few dimensions even if the data is completely random (meaning there is no structure to be captured whatsoever). This fact was recently flagged in the prestigious Econometrica journal under the name spurious factors (it looks like a factor but is simply the result of the numerical procedure).
    • As opposed to the curse of dimensionality, we can think of the ability to compress high-dimensional data without losing much information as the blessing of dimensionality, because it’s simply the flip side of the curse. It cuts both ways: it’s fairly easy to shrink high-dimensional spaces down because points in that space are so spread out, which is exactly why finding meaningful patterns in high-dimensional space is such a headache to begin with.
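    The random-projection fun fact from the first remark can be verified empirically. Below is a minimal sketch; the specific sizes (30 points, 5,000 dimensions compressed to 1,000) and the 1/\sqrt{k} scaling of R are my illustrative choices, not from the post:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, k = 30, 5_000, 1_000   # 30 points, compressed from 5,000 to 1,000 dims

X = rng.normal(size=(n, d))                # points in high-dimensional space
R = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection: no SVD, no engineering
Y = X @ R                                  # project every point (row) at once

def pairwise_dists(A):
    """Euclidean distances between all pairs of rows."""
    i, j = np.triu_indices(len(A), k=1)
    return np.linalg.norm(A[i] - A[j], axis=1)

# Ratio of projected distance to original distance for every pair of points
ratios = pairwise_dists(Y) / pairwise_dists(X)
print(ratios.min(), ratios.max())  # all ratios land close to 1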

    In sum

    We moved from one-hot representation of words in a high-dimensional space to a lower-dimensional “semantic space” where word relationships are captured. We showed how word vectors can be rotated without changing their relative positions. This transformation is crucial because it allows us to represent word meaning and relationships, reducing storage and computation costs. We then moved on to the “crowding problem” that arises from dimensionality reduction and introduced the Johnson-Lindenstrauss (JL) Lemma, which provides the theoretical legitimacy for compressing high-dimensional text data.

    I hope you now have a better grip on:

  • Why we can refer to word-vectors as coordinates in space
  • Why it’s ok to do, and why we don’t lose much information about word relationships
  • Together with the previous post providing the algebraic foundation for word-embeddings, you are well-positioned to actually understand transformers and LLMs better.