blank

Link Drift

2026-03-07T14:30:00+00:00

I have a Links page on this site that I keep adding to whenever I run into something I want to remember. Over time it became useful, but also a little stiff. It started to feel less like a living collection of things I genuinely enjoyed and more like a shelf.

So I made Link Drift, a small playground where those links float around the screen. You can click one, drag one, fling one, or hit “I’m feeling lucky” and let the page pick for you.

What I wanted was a little more serendipity. When I look back at the things that have shaped how I think, a lot of them did not arrive in some tidy, optimized order. They came from wandering around, opening the wrong tab, following a footnote too far, or revisiting something I had forgotten. Link Drift is my attempt to keep a bit of that feeling.

I still like having the plain list. Sometimes you just want the clean version. But I also like that there is now a more playful door into the same archive. The links are the same. The experience is not.

DeepSeek OCR, and why I think vision eats language

2025-10-24T12:00:00+00:00

Recently I’ve been trying to get along with school and all the coursework and settling into a new environment. So I wasn’t putting much effort into reading papers, which I used to enjoy. But this week was different. I found a paper that pulled me back in: DeepSeek OCR. It’s a peculiar paper in a good way. I like DeepSeek’s papers: they’re thorough and open, they don’t hide things from researchers.

I think its findings are very much aligned with the Bitter Lesson.

Key Takeaways

The core takeaway from the paper is that using a vision encoder to process documents as images they were able to achieve ten times more efficiency than using text tokens. In other words, it’s vastly more efficient to screenshot a document and feed it to a large language model than to paste the raw text, with little to no loss in performance.

But this isn’t entirely new. About a year ago, when Gemini 2.5 Pro came out, I was playing with it in Google AI Studio. I pasted a document and compared it to drag‑and‑drop upload. The full text token count was much larger than the image‑based upload. They were converting files to images and counting vision tokens. So they already kind of knew this.

Also, I don’t know about ChatGPT, but for a long time, Anthropic’s Claude seemed to feed uploaded documents as images, not parsed text, from their webpage. They must have tested it and green‑lit it because it was more feasible and efficient.

Where does vision stop and language begin?

This also reminded me of a 2020 Lex Fridman conversation with Ilya Sutskever:

Ilya Sutskever: Where does vision stop and language begin? … You have a vision system, you say it’s the best human‑level vision system, I open a book and show you letters: will it understand how these letters form into words and sentences and meaning? Is this part of the vision problem?

Lex Fridman: Thats an really interesting question… One possibility is that it’s impossible to achieve really deep understanding in either images or language without basically using the same kind of system, so you’re going to get the other for free…
Ilya Sutskever: A lot of it depends on your definitions of perfect vision: because really, you know, reading is vision. but should it count?

I emphasize again: this was recorded in 2020. Five years is a long time in this field. But still, when I think of this converstation, there is a lot of food for thought.

My opinion: vision precedes language here; or said differently, for documents, language sits inside vision. If you can see the page, you can get the language. Starting from text alone can’t recover layout, typography, figures, or spatial structure. All that subtle rich signals are lost when you flatten the document into text. It’s not just about efficiency, it’s about the nature of the task.

Bitter Lesson vibes

This feels like the Bitter Lesson again. Methods that scale win. Text is one‑dimensional and throws away structure; you end up with fragile pipelines. Vision is two‑dimensional and more general. If we want to be more Bitter‑Lesson‑aligned and pursue methods that scale, language tasks will increasingly be subsumed by vision tasks for documents. Looking at human modality: we have five senses. There is no “text modality.” What we do for “natural language processing” in our brains, the modality is auditory and vision. For text, it’s purely vision.

Twenty years ago, to take notes people have to type and store strings. Today, if I want to copy a whiteboard or an announcement in a classroom, everybody takes pictures. Nobody writes it down. It’s more general. I think the same thing will happen with language models. Twenty years from now, god knows how we’ll interact with AI models, but we’ll probably do the equivalent of taking pictures of text for much more capable models, and look back at “paste-the-whole-document-as-text-language-models” as quaint.

Final thoughts

I think we’re getting the answer Ilya hinted at back in 2020. Where does vision stop and language begin? For documents, language lives inside vision. DeepSeek OCR is interesting not because it invents a new modality, but because it treats the obvious with rigor: for documents, seeing beats parsing. Once you accept that, a lot of design choices get simpler.

The fact that labs like Anthropic have long defaulted to image‑based document uploads suggests they already tested this and know it’s more feasible and efficient. It makes you wonder how much frontier models already know, and how far ahead they are.

My Take on GPT-5

2025-08-18T04:51:00+00:00

OpenAI recently released GPT-5, with claims of a new state-of-the-art model that tops benchmarks. After spending some time with it, my initial impression is that while it’s a decent model, it doesn’t feel groundbreaking. However, I’ve come to realize that this release wasn’t really intended for power users like me. For the majority of people, this model is a game-changer.

The Divide Between Power Users and Everyday Users

Before GPT-5, I used OpenAI’s o3 model almost exclusively since March. As I’ve discussed in a previous post, I have high regard for o3, mainly because of its “agentic” nature. It could actively surf the web to gather context and provide more reliable answers. This ability to use tools like web search and retrieve context on its own, in my opinion, separates a useful AI from a toy.

This is why it sometimes frustrates me to see friends and colleagues, even those with a ChatGPT Plus subscription, stick to the basic GPT-4o model. They often complain that it hallucinates or makes things up, and when I ask which model they’re using, most of the time it’s the 4o model. A model without a dedicated reasoning process and tool usage is going to be less reliable for complex tasks. I’ve made it a personal rule to never trust a non-reasoning model for anything beyond simple tasks like drafting an email or editing my writing.

The value of a “thinking” model comes from test-time compute scaling. When you allow a model to “think harder” about a problem, the quality of the result is much better than what a non-reasoning model can produce. With GPT-5, this capability is now dynamically available to everyone.

The Router: A Cornerstone for a New Business Model

The most significant change with GPT-5 isn’t the base model itself, but the introduction of the router. This system dynamically decides whether a query requires the deeper “GPT-5 Thinking” model or can be handled by a simpler one.

A recent article from SemiAnalysis by Dylan Patel and his team really opened my eyes to the business implications of this. They argue that the router is the cornerstone for OpenAI to finally monetize its massive base of free users. The router can distinguish between a trivial query like, “What is the capital of France?” and a commercially valuable one like, “What are the best running shoes I can buy?”

The first query doesn’t require deep reasoning and is cheap to answer. The second, however, has high commercial intent. The router can allocate more resources to it, use web search, and provide a detailed, reasoned recommendation. This creates an opportunity for OpenAI to take a transaction fee or affiliate revenue, turning the chatbot into a monetizable super-app. It’s a way to monetize without resorting to intrusive ads, which Sam Altman has expressed a distaste for.

While I agree that the router enables this, I’d push back slightly and argue that a sufficiently advanced model could theoretically make these decisions on its own. Still, implementing it as a dedicated router is a clear and deliberate step toward this new paradigm.

Final Thoughts

My experience with GPT-5 has solidified a key belief: always use a thinking model. Since its release, I’ve used “GPT-5 Thinking” exclusively, and I don’t care about the automatic routing for my own use.

If you’re reading this, the main takeaway I want to leave you with is this: whenever you have the choice, opt for the model that thinks. The difference in quality and reliability is night and day. For the average user, GPT-5’s greatest gift is making that choice for them, seamlessly bringing the power of reasoning to hundreds of millions of users for the first time.

The Murmuring Woman

2025-05-10T11:00:00+00:00

I had a peculiar experience today that I wanted to share. I was at a cafe with my girlfriend, planning our vacation. Nearby, there was a woman, maybe in her mid-40s, wearing a face mask. She was constantly murmuring to herself – not loudly, I couldn’t make out the words, but it was non-stop. She looked really agitated, getting up, sitting down slightly differently, pacing out of the room and then returning to the same seat. This went on and on. As we wrapped up our trip planning, I looked up, and she was just gone. That quick.

Our vacation plans came together well, but the image of that woman kind of lingered in my head. When I saw her, it really reminded me of large language models. I had to admit that my brain is so stuffed with AI these days – for better or worse.

Hear me out.

From Self-Talk to Chain-of-Thought

The woman’s constant self-talking, that murmuring, felt exactly like what chain-of-thought reasoning models are currently doing. It’s a very simple, almost too easy analogy, but if you start to put weight on it, it feels really, really profound. I don’t know why this happens, but I find that anthropomorphizing large language models sometimes helps me see what capabilities they might need or what data we should give them to make them more capable. These kinds of analogies make it easier for me to see things.

There’s a sort of stack here, a progression:

Traditional LLMs: These models don’t really “think” in a step-by-step way. They just generate, often verbatim, without much pause – a kind of knee-jerk reaction. This is like System 1 thinking.
Reasoning Models (Chain-of-Thought): When these came along, they blew the older models out of the water. This introduced a new scaling paradigm: test-time compute. Introspection, or thinking step-by-step, is much better than a knee-jerk response for many tasks. This is System 2, and it’s really good for improving capabilities. Noam Brown’s work really pioneered this area.

The Limits of Introspection and the Need for Tools

Now, what current models are doing is moving towards becoming agents. And here’s where the analogy with the woman (and I want to be clear, I don’t know her or what she was going through, but she really looked like she was having a tough time – this is just an observation for the sake of analogy) becomes even more interesting.

Constant introspection, just talking to oneself, only gets you so far. And that’s exactly the limit I see with first-generation reasoning models, like some of the DeepSeek models or OpenAI’s o1. They can think, they can “talk to themselves” on and on, but they can’t verify their own thoughts quite reliably.

Compare this to how people generally operate. When “normal” people think, they can self-verify using external tools or interactions. They might talk something through with someone else for verification, or rely on external aids like their iPhone, a book, or a quick search. This analogy is simple. And that’s what models like Anthropic’s Claude 3.7 Sonnet and OpenAI’s o3 are doing now. They are good at interacting with the real world via an external pipeline, a bridge we call “tools.”

The Fine Line of Anthropomorphism

When you anthropomorphize a large language model this way, the need for tools and external interaction becomes obvious. But there’s a key caveat: the internal modeling of an LLM is very different from human cognition. I’m anthropomorphizing for the sake of seeing what LLMs might benefit from, what might make them more capable. It’s a fine line to walk.

Aren’t We All Just Next-Word Predictors?

As this kind of anthropomorphization continues, and it’s easy to do because language models can seem persuasive and lifelike, it reminded me of something Scott Aaronson said a year ago. When LLMs first emerged and there were naysayers arguing “it’s just next-word prediction, just statistical modeling,” he’d retort (paraphrasing) “But what about you? Aren’t you just a next-word predictor? What about your mom?”

It really kind of cracked me up at the time. If I’d said that to some of my close friends who looked down upon LLMs, they would have fumed, they’d be outraged! But when ChatGPT came out, I intuitively, wholeheartedly agreed with Aaronson’s point. My mind hasn’t changed on that.

I think as models get more capable, Aaronson’s quip: “aren’t we just the next-word predictor?”, will become true in a functional sense. Recently, LLMs passed the turing test, but society has moved on like nothing happened. Sooner or later, for every verifiable task, model capabilities will likely exceed human capabilities. And still, when that happens, they will be, at their core, next-word predictors. Superhuman next-word predictors, better than us at any given task.

Then what would we become?

Dropout - Review

2025-04-28T10:00:00+00:00

Dropout is one of those techniques in deep learning that feels ubiquitous, yet revisiting the original 2014 paper, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” by Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov, reveals layers of insights.

The Evolutionary Analogy: Mix-ability over Brittle Co-adaptation

One of the most striking parts of the paper is the motivation drawn from evolutionary biology, specifically the role of sexual reproduction.

“One possible explanation for the superiority of sexual reproduction is that, over the long term, the criterion for natural selection may not be individual fitness but rather mix-ability of genes.”

The paper contrasts this with asexual reproduction, where a well-adapted set of genes might be perfectly optimized for a specific environment but could be brittle if conditions change. Sexual reproduction, by constantly shuffling genes, forces individual genes to be effective in collaboration with a random set of other genes. This “mix-ability” fosters robustness.

This analogy maps beautifully onto neural networks. A standard network might develop complex “co-adaptations” between hidden units, perfectly fitting the training data but failing on unseen examples. Dropout, by randomly removing units during training, acts like gene shuffling. It forces each unit to be useful on its own or in conjunction with various randomly chosen subsets of other units. This prevents the network from relying on fragile partnerships that only exist in the training data, promoting robustness. As the paper humorously adds, ten small conspiracies might be more robust than one large one requiring everyone to play their part perfectly.

The Real Goal: Generalizing to the Test Set

This ties into a crucial point, echoing sentiments sometimes expressed by researchers like Ilya Sutskever: the ultimate objective isn’t just fitting the training data, but generalizing to the test set. The paper highlights this early on:

“With limited training data, however, many of these complicated relationships will be the result of sampling noise, so they will exist in the training set but not in real test data even if it is drawn from the same distribution. This leads to overfitting…”

Dropout directly attacks this problem. Overfitting often involves learning spurious correlations, patterns that exist purely by chance in the training sample. Standard networks, especially high-capacity ones, have the “luxury” of using their parameters to memorize this noise to minimize training loss.

Dropout’s Incentive: Learning Robust Features, Not Noise

Dropout changes the incentive structure during training. By constantly disrupting pathways (randomly dropping units), it makes it significantly harder for the network to rely on specific, complex interactions between neurons that might only capture spurious correlations. The “reward” (gradient signal) for learning these fragile patterns becomes inconsistent.

Conversely, strong, prominent features that reflect the true underlying data structure are likely detectable through multiple, more robust pathways or redundant representations. These features “survive” the dropout process more reliably, receiving more consistent positive reinforcement. Dropout, therefore, incentivizes the network to invest its capacity in learning features that are resilient to this random disruption – precisely the features most likely to generalize.

Approximating an Exponential Ensemble

The core mechanism is simple:

During Training: For each training case (or minibatch), randomly “thin” the network by dropping units (setting their output to zero) with a certain probability 1-p. This means training an exponentially large ensemble of networks (potentially 2^N for N units) that all share weights.
At Test Time: Explicitly averaging the predictions of all possible thinned networks is intractable. Instead, use the single, full network but scale down the outgoing weights of units by the retention probability p. This simple scaling provides a good approximation of the average prediction of the ensemble.

This allows training what is effectively a huge ensemble but performing inference efficiently with a single model.

Synergy with Max-Norm Regularization

The paper notes that dropout often works best with high learning rates and momentum. However, this can risk weights growing uncontrollably. They found a specific technique particularly helpful: Max-Norm Regularization.

“…constraining the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant c. In other words, if w represents the vector of weights incident on any hidden unit, the neural network was optimized under the constraint IIwII₂ ≤ c.”

This acts as an important stabilizer. By capping the magnitude (L2 norm) of incoming weights to each neuron, it prevents weights from exploding, allowing the use of aggressive learning rates needed to overcome the noise introduced by dropout, without sacrificing stability.

Sparsity as a Side-Effect

Interestingly, the paper demonstrates (Figures 7 & 8) that dropout often leads to sparser activations in hidden units, even without explicit sparsity-inducing penalties. Neurons learn to be more selective, potentially making the learned representations more interpretable or efficient.

Final Thoughts: Foundational Simplicity

Dropout shows the power of simple, well-motivated ideas. It provides a computationally feasible way to prevent overfitting by mimicking evolutionary pressure towards robustness and discouraging the memorization of spurious, training-set-specific correlations. While not a silver bullet (factors like training time and interaction with Batch Normalization mean it’s not always the best choice), its impact on deep learning has been large.

Rethinking Sequence-to-Sequence - Review

2025-04-26T09:00:00+00:00

Reading foundational papers often provides a clearer perspective on how current ideas evolved. Recently, I went through the 2015 ICLR paper Neural Machine Translation by Jointly Learning to Align and Translate by Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. It tackles a core problem in early sequence-to-sequence models for machine translation.

The main issue they identified was the “bottleneck” inherent in the standard RNN Encoder-Decoder framework popular at the time. These models tried to compress the entire meaning of a source sentence, regardless of its length, into a single fixed-length vector. As the paper noted, this makes it difficult to handle long sentences well, performance tended to drop off significantly as sentences got longer.

Their proposed solution was to allow the decoder to look back at the source sentence and selectively focus on relevant parts when generating each target word. This avoids forcing all information through one fixed vector.

Key Concepts

Here’s a breakdown of the core ideas discussed:

The Problem: Fixed-Length Vector Bottleneck: Standard encoder-decoders map an input sequence x = (x_1, ..., x_{T_x}) to a fixed context vector c. The decoder then generates the output y = (y_1, ..., y_{T_y}) based solely on c and previously generated words. This compression limits the model’s capacity, especially for long inputs.
The Solution: Alignment Mechanism (Decoder Focus): Instead of one c, the proposed model computes a distinct context vector c_i for each target word y_i. This c_i is a weighted sum of annotations (h_1, ..., h_{T_x}) from the encoder. Each h_j corresponds to a source word x_j (or rather, the hidden state around it).
How it Works: Alignment Model & Context Vector:
- The weight a_{ij} for each annotation h_j when generating y_i depends on how well the input around position j aligns with the output at position i.
- These weights are calculated using an “alignment model” a, which takes the previous decoder hidden state s_{i-1} and the encoder annotation h_j as input to produce a score e_{ij}.
- e_{ij} = a(s_{i-1}, h_j)
- The weights a_{ij} are obtained by normalizing these scores with a softmax: a_{ij} = exp(e_{ij}) / Σ_k exp(e_{ik}).
- The context vector c_i is then the weighted sum: c_i = Σ_j a_{ij} h_j.
- Crucially, the alignment model a (parameterized as a small feedforward network) is trained jointly with the rest of the system.
Soft vs. Hard Alignment: The paper uses the term “soft alignment.” This contrasts with “hard alignment,” which would involve making a deterministic choice of which single source word aligns with the target word. Soft alignment uses a weighted average over all source annotations. This makes the mechanism differentiable and allows the model to learn alignments implicitly through backpropagation. It also naturally handles situations where a target word might depend on multiple source words, or vice-versa.
The Encoder: Bidirectional RNN (BiRNN): To ensure the annotation h_j captures context from both before and after the source word x_j, they used a BiRNN. This consists of a forward RNN processing the sequence from x_1 to x_{T_x} and a backward RNN processing it from x_{T_x} to x_1. The annotation h_j is the concatenation of the forward hidden state \vec{h}_j and the backward hidden state \cev{h}_j. While BiRNNs weren’t new, their use here makes sense for creating richer annotations.

Key Takeaways

Reflecting on the paper, several points stand out:

Performance Improvement (Especially on Long Sentences): The results clearly show the benefit. The standard RNNencdec model’s performance drops sharply with sentence length, while the proposed RNNsearch model remains much more robust. The BLEU scores confirm a significant improvement, bringing NMT closer to traditional phrase-based systems of the time.
Interpretability via Alignment: The alignment weights a_{ij} can be visualized. This provides insight into what parts of the source sentence the model focuses on when generating a specific target word. The visualizations showed mostly monotonic alignments (as expected between English and French) but also the ability to handle local reordering (like adjective-noun flips) correctly. This interpretability is a nice side effect compared to trying to understand a monolithic RNN.
Handling Reordering and Length Differences: The soft alignment naturally deals with source and target phrases having different lengths or requiring non-trivial mappings, without needing explicit mechanisms like NULL tokens used in traditional SMT.
Evolutionary Link to Transformers: Reading this after knowing about Transformers makes the connection clear. The core mechanism, scoring source annotations based on the current decoder state, using softmax for weights, and computing a weighted sum, is essentially the attention mechanism. It reads as a precursor; the Transformer built upon this by removing recurrence and adding multi-head attention, positional encodings, etc. It’s like seeing an earlier stage in the “evolution” of sequence models.

Summary & Final Thoughts

This paper reads as a big step in NMT. It addressed a clear limitation (the fixed-length vector bottleneck) with a straightforward solution: allowing the model to learn where to focus in the source sequence. The “soft alignment” mechanism introduced is, in essence, the attention mechanism that became central to later architectures like the Transformer.

Looking back now, the ideas seem intuitive, but implementing this effectively and showing its benefits in 2014/2015 was a contribution. It’s a clear paper that explains the problem, the proposed solution, and provides evidence. Reading it helps appreciate the progression of ideas leading to the models we use today.

Knowledge Distillation - Review

2025-04-22T14:00:00+00:00

I’ve known about the concept of knowledge distillation for a while – the core idea is simple: soft labels (the full probability distribution from a model) contain richer information about class relationships than hard labels alone. I first encountered it in a lecture by Geoffrey Hinton (like this one discussing paths to intelligence) and decided to read the original 2015 paper, “Distilling the Knowledge in a Neural Network,” co-authored with Oriol Vinyals and Jeff Dean. It’s short, but with clear insight.

The Insect Analogy: Training vs. Deployment

What struck me immediately was the opening analogy:

“Many insects have a larval form that is optimized for extracting energy and nutrients from the environment and a completely different adult form that is optimized for the very different requirements of traveling and reproduction.”

I haven’t seen many ML papers start with a biological analogy like this. I hadn’t thought about insect life stages this way before. The larva is about consumption and growth, slow-moving, maybe not complex, but efficient at extracting resources (like a large training model absorbing information from data). The adult form is optimized for different tasks, lightweight, fast, mobile, focused on specific functions like reproduction (like an efficient deployment model needing low latency and computational cost).

The analogy fits perfectly with the challenge in machine learning:

Training: We often use huge, “cumbersome” models (or ensembles) that take lots of computation and time but are great at extracting every bit of signal from large datasets.
Deployment: We need models that are fast, efficient, and have low latency for real-world use.

Distillation, then, is like the metamorphosis: transforming the knowledge captured by the cumbersome larva/training model into the efficient adult/deployment model.

Knowledge Beyond Weights

The paper points out a potential “conceptual block”:

“…we tend to identify the knowledge in a trained model with the learned parameter values.”

This makes it hard to think about transferring knowledge without just copying weights. Prior work like Rich Caruana’s model compression focused on matching the outputs before the final softmax (the logits). Hinton et al.’s approach refines this by using the probabilities from the softmax, arguing that this captures the learned distribution more meaningfully.

The Value of “Wrong” Answers

A key insight is how the large, cumbersome model generalizes. It’s not just about getting the right answer.

“…a side-effect of the learning is that the trained model assigns probabilities to all of the incorrect answers… The relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize.”

The example they give is clear: an image of a BMW might have a tiny probability of being mistaken for a garbage truck, but that probability, however small, is likely higher than it being mistaken for a carrot. This network of similarities and differences between classes is knowledge learned by the teacher model. Hard labels (just “BMW”) throw this information away. Soft labels (the full probability distribution) preserve it.

This aligns with the objective: we don’t just want models to perform well on training data, we want them to generalize well to new data. Soft targets directly transfer the generalization behavior of the teacher model to the student.

The Mechanism: Temperature Scaling

So how do we use these soft labels? If the teacher model is very confident (assigns probability ~1.0 to the correct class), the probabilities for incorrect classes are tiny. Even if their ratios contain information, they have almost no impact on the cross-entropy loss during student training.

The solution is to “raise the temperature” T of the softmax function:

q_i = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i are the logits. Normally T=1. Using a higher T > 1 “softens” the probability distribution, increasing the probabilities of incorrect classes and allowing them to contribute more to the loss function. The student model is trained to match this softened distribution, using the same high temperature T. (After training, the student uses T=1 for inference).

This temperature scaling is the core mechanism. The paper notes that in the high-temperature limit, this method becomes equivalent to matching the logits (Caruana’s approach), but at intermediate temperatures, it focuses more on matching the more probable incorrect classes, potentially ignoring noise from very negative logits.

Training the Student

The best results often come from combining two objectives:

Matching the soft targets from the teacher (using cross-entropy with high temperature T).
Matching the true hard labels (using cross-entropy with T=1).

They found a weighted average works well, often with a lower weight on the hard target loss. As they say: “Typically, the small model cannot exactly match the soft targets and erring in the direction of the correct answer turns out to be helpful.”

Proof of Generalization: The MNIST Experiment

A clear experiment highlights the value of this approach. They trained a student model on MNIST, but omitted all examples of the digit ‘3’ from the transfer set. From the student’s perspective, ‘3’ was a “mythical digit” it had never directly seen.

Despite this, the distilled model performed well on classifying ‘3’s at test time (with a bias adjustment). It had learned about ‘3’ indirectly, through the soft targets for other digits, for example, by learning which ‘8’s looked a bit like a ‘3’ according to the teacher model. This is evidence that soft targets transfer generalization capabilities, not just labels.

Final Thoughts

This paper is a classic example of clear insight. The claim is simple, but such a powerful one:

“…a lot of helpful information can be carried in soft targets that could not possibly be encoded with a single hard target.”

Neural Probabilistic Language Model - Review

2025-04-20T15:00:00+00:00

I recently dove into Yoshua Bengio et al.’s 2003 paper, “A Neural Probabilistic Language Model”. Reading such an old paper, foundational work from over two decades ago, is fascinating. What struck me most wasn’t just the specific model (which is simple by today’s standards), but the clarity with which Bengio laid out the core problems and principles of language modeling, principles that are still relevant. I got a respect for his vision; it feels like this paper set the trajectory for much of what followed.

The Problem: The Curse of Dimensionality

Bengio starts by framing the fundamental challenge: the curse of dimensionality. As he puts it,

“…a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training.”

This is because the number of possible sentences is essentially infinite, like the Library of Babel. Any specific sentence has almost zero probability of occurring randomly.

The “curse” goes deeper than just the sheer number of sequences. As the number of dimensions (e.g., the length of the sequence, or the number of features considered) increases:

Space Expands Exponentially: The volume of the space grows very fast, making the available data extremely sparse.
Distance Intuition Breaks: In high dimensions, points tend to become equidistant from each other, and most of the volume is concentrated far from the center, near the “surface” of the high-dimensional space. Our low-dimensional intuitions about proximity and density fail.
Spurious Correlations: With so many dimensions, it becomes easy to find apparent patterns in data that are just noise.

This is a core challenge for many real-world problems, especially with rich sensory data spanning many dimensions. How do you find the signal in such a vast, sparse space without getting lost?

The Solution: Fighting Fire with Fire

Bengio and his colleagues proposed a way to fight this curse:

“…learning a distributed representation for words…”

Essentially, they proposed learning dense, low-dimensional feature vectors (embeddings) for each word in the vocabulary. This is like fighting fire with fire: while the vocabulary space is huge and discrete, the learned feature space is much smaller (e.g., 30-100 dimensions in their experiments vs. 17k+ words) but continuous. Because it’s a dense, continuous space, even a relatively low-dimensional one has a large capacity to represent complex relationships. They are mapping the discrete, high-dimensional vocabulary into a structured, continuous latent space.

The Magic: How Generalization Happens

So how does this help? The paper explains:

“Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence.”

This, for me, is the crux of it. The model learns which words play similar roles (semantically, syntactically) and places them close together in the embedding space. Because the probability function operates smoothly over this continuous space, seeing “The cat sat on the mat” helps the model assign a higher probability to the unseen sentence “A dog rested on the rug,” because the corresponding words have similar learned representations. It’s this mapping from discrete symbols to a meaningful continuous space that allows generalization beyond simply memorizing n-grams. This is fundamentally how current LLMs achieve their (still limited, but powerful) generalization capabilities.

Learning End-to-End

A key part of their proposal was point 3:

“learn simultaneously the word feature vectors and the parameters of that probability function.”

They recognized that the embeddings and the prediction mechanism need to learn from each other. You can’t just fix one and train the other; they have to be optimized together, end-to-end, for the embeddings to become useful for the prediction task and vice-versa.

A Historical Aside: Parallel Processing with CPUs

What also caught my eye was the extensive discussion on parallelizing the training process. Remember, this was 2003 when widespread GPU computing for ML wasn’t a thing yet. They detail their efforts using parameter-parallel processing across multiple CPUs (up to 64 Athlon processors in their cluster!). They discuss asynchronous updates and communication overhead (MPI). It feels like they were laying the conceptual groundwork for the kind of massive parallelization (now mostly on GPUs/TPUs) that is essential for training today’s large models.

Lasting Impact

While the specific MLP architecture used in the paper is rudimentary now, the core ideas, tackling the curse of dimensionality with learned distributed representations, enabling generalization through semantic similarity in embedding space, and the need for end-to-end training, remain central to modern NLP and deep learning. Reading this paper felt like a clear early articulation that illuminated the path forward for the field. It helped define the paradigm we’re still working within.

Revisiting the 2014 Sequence-to-Sequence Paper

2025-04-20T11:00:00+00:00

I recently went back to read the 2014 paper “Sequence to Sequence Learning with Neural Networks” by Sutskever, Vinyals, and Le. Since it’s such a seminal paper, practically ancient by today’s ML standards, I thought it would be interesting to look back. While the method itself (LSTM encoder-decoder) is relatively simple compared to modern architectures, focusing on their thinking process back when deep learning was still in its infancy was insightful. I found some things they mentioned almost casually that felt non-trivial to me now.

Here are some of my main takeaways:

Refreshing Views on DNN Power

The paper starts by framing Deep Neural Networks (DNNs) in ways I hadn’t explicitly considered before. They state:

“DNNs are powerful because they can perform arbitrary parallel computation for a modest number of steps.”

This sentence, while true and maybe obvious in retrospect, struck me. Of course, we know matrix multiplications are parallelizable and run well on GPUs, but thinking about it from the perspective of individual neurons performing computations in parallel felt like a useful angle on why NNs are suited for this hardware.

They also highlight:

“…their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size…”

Again, a specific example of computational power packed into a relatively simple network that I hadn’t really internalized.

The Core Problem and the LSTM Solution

The authors clearly state the limitation they were tackling:

“Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality.”

This was a major hurdle for many important tasks like machine translation or question answering, where sequence lengths vary. This is where the Long Short-Term Memory (LSTM) network comes in.

For me, the biggest contribution of this paper is that they took the LSTM and successfully trained an encoder-decoder architecture at scale on a difficult task (English-to-French translation). They showed that LSTMs weren’t just a theoretical curiosity; they were practical for real-world, large-scale NLP problems. This was the first paper that demonstrated LSTMs could work in this way, paving the path for much future research.

The Input Reversal Trick

One specific technical detail that stood out was their trick of reversing the input sentence:

“…reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.”

My first reaction was: Okay, reversing the input brings the first source word closer to the first target word, which makes sense for translation since the beginning often sets the context. But what about the last words of the source sentence? Don’t they get pushed really far away from the end of the target sentence?

That’s a valid point, but I think it highlights a trade-off. Getting the beginning of the translation right is often very important; it lays the groundwork. By reversing the input, they made it easier for the optimization process (SGD) to “establish communication” between the early parts of the source and target sequences. The performance gains they reported (perplexity dropping from 5.8 to 4.7, BLEU jumping from 25.9 to 30.6) suggest this was an effective trade-off, making the model learn better, even if it seems counter-intuitive for the tail end of the sequences.

Final Thoughts

Reading this paper was a great reminder of how far the field has come, but also of the clear thinking that set the foundations. I admire Ilya Sutskever’s way of thinking; when I listen to him on podcasts, he often speaks with clarity. Looking at this early work reinforces that impression. Maybe I should make a point of reading through more of his papers.

AI as Personal Guardians

2025-04-19T11:00:00+00:00

I’ve been thinking a lot about how LLMs might reshape society, and a thought clicked: AI could become personal guardians for each of us.

The background is this: As I’ve discussed before, context is important for LLMs. Even now, the insights they provide in split-second decisions can be helpful. They aren’t perfect, but the intelligence they offer brings value. The main thing limiting their ability to help us more consistently is access. If we don’t actively query the model with the right context, it can’t respond to our specific, nuanced needs. Our lives are complex, and the same query can mean different things depending on the web of our individual circumstances.

So, the context an LLM needs to be truly helpful is immense and deeply personal.

Beyond Universal Assistants

Now, the idea of a universal AI assistant isn’t new – think Her or Jarvis. We all nod along, assuming something like that is coming. But I don’t think most people grasp the gravity of this, the potential impact if put into the palm of our hands. It will touch on the nature of our experience.

What I envision is this: Imagine wearing a small device, maybe a pendant, that continuously and passively records context from your daily life – conversations you have, things you hear, places you go, maybe even subtle reactions. Right now, most of this rich contextual data is ephemeral, lost the moment it happens because we don’t record our everyday lives.

If this data were captured objectively, it could provide the grounding LLMs need to become more helpful.

Confronting Our Subjectivity

Here’s why I think people underestimate the impact: we don’t fully appreciate how subjective, limited, fragile, and unreliable our own perception and memory are. Psychological literature makes it clear: human memory isn’t a perfect recording device. We reshape memories, constructing narratives to make sense of the world. Our accounts of the same event differ from person to person, filtered through our limited viewpoints and emotional states. We aren’t purely rational decision-makers.

An AI, fed with continuous, objective context, could hold up a mirror to this subjectivity. It could help us see patterns and realities that our own minds obscure.

The Guardian’s Role

Imagine the possibilities:

Objective Recall & Comparison: The AI could provide an objective summary of your day, week, month, or even year. It could compare your activities, moods, or interactions over time in ways impossible for our biased human memory. “How does my interaction pattern today compare to last month?” is a question we can barely guess at; the AI could answer with data.
Personalized Planning: Based on this deep, objective understanding of your past actions, goals, and context, it could suggest optimal plans for tomorrow, complete with relevant reminders grounded in your actual history.
Social Shield: For interactions, it could offer insights or warnings. Imagine someone easily manipulated, such as an elderly person. This AI could recognize patterns of deception or fraud that the person might miss, acting as a protective layer by providing information they didn’t previously have.

Thinking about this “social shield” aspect, particularly its potential to guardrail individuals away from harmful decisions, is when the core concept clicked for me. Imagine the AI noticing subtle health patterns and suggesting a check-up, or recognizing manipulative language in a conversation. Preventing bad outcomes by providing timely information, nipping problems in the bud, could be one of the most impactful aspects of this technology. That realization solidified the idea of “A personal guardian for everyone.”

With such guardians, society could evolve. Individuals might become better decision-makers overall. Imagine consulting your guardian in-depth before making major life choices: which university course to take, which habit you didn’t notice is detrimental to your health, which job offer is best aligned with your long-term patterns and goals.

A Guardian in the Cloud

This isn’t about the AI becoming a godlike entity dictating our lives. It’s about having an immensely valuable and practical tool – an intelligent counterpart striving to help us make better decisions, understand ourselves more clearly, and navigate the world more effectively.

I believe this is possible with current technology, perhaps needing refinement and scale. When (not if) this kind of personalized, context-aware AI guardian becomes widespread, the impact on individual productivity, efficiency, and overall well-being could be large. Everyone could be better off with their guardian than without.

It leads directly to the future Yuval Noah Harari described, where algorithms might genuinely know you better than you know yourself in certain aspects. What a fascinating, and perhaps slightly unnerving, time to be alive.