<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[RecsysML + LLMs]]></title><description><![CDATA[State of the art advances in applied machine learning with a focus on recommender systems]]></description><link>https://recsysml.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!acOO!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2b277d6-34ae-4d98-8e47-b2be639d7d6b_500x500.png</url><title>RecsysML + LLMs</title><link>https://recsysml.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 11 Apr 2026 04:22:43 GMT</lastBuildDate><atom:link href="https://recsysml.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[RecSysML Journal team]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[recsysml@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[recsysml@substack.com]]></itunes:email><itunes:name><![CDATA[Gaurav Chakravorty]]></itunes:name></itunes:owner><itunes:author><![CDATA[Gaurav Chakravorty]]></itunes:author><googleplay:owner><![CDATA[recsysml@substack.com]]></googleplay:owner><googleplay:email><![CDATA[recsysml@substack.com]]></googleplay:email><googleplay:author><![CDATA[Gaurav Chakravorty]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[LLM Evals From Scratch: Run Your First Benchmarks ]]></title><description><![CDATA[Part 1 of the Bonsai LLM eval series &#8212; Beginner]]></description><link>https://recsysml.substack.com/p/llm-evals-from-scratch-run-your-first</link><guid 
isPermaLink="false">https://recsysml.substack.com/p/llm-evals-from-scratch-run-your-first</guid><dc:creator><![CDATA[Vish Sangale]]></dc:creator><pubDate>Thu, 09 Apr 2026 18:03:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Dcyv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d19828a-d892-4199-b7bd-2f09d5f4c953_1220x498.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>Prerequisites</strong></h2><ul><li><p>Python 3.9+</p></li><li><p>A terminal and a virtual environment</p></li><li><p>No GPU required (CPU is fine for GPT-2)</p></li></ul><div><hr></div><h2><strong>Step 1 &#8212; Why training loss isn&#8217;t enough</strong></h2><p>Here&#8217;s a failure mode worth knowing before you spend time training anything. You run a model for a few thousand steps, validation loss drops steadily, you call it converged. Then you throw a basic factual question at it and it hallucinates confidently.</p><p>This happens because loss only measures how well the model predicts the next token on your training data. 
A model that memorizes plausible-sounding token sequences can get low loss while being wrong about nearly everything testable.</p><p>An <strong>eval</strong> ties a task, a metric, and a decision together &#8212; so the number you get means something beyond &#8220;did loss go down.&#8221;</p><p>The taxonomy is worth knowing too, because people mix up format and scoring:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/HKEcd/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d19828a-d892-4199-b7bd-2f09d5f4c953_1220x498.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34d57dd3-80ca-470b-b0fe-87430c11fb7e_1220x498.png&quot;,&quot;height&quot;:242,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/HKEcd/2/" width="730" height="242" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>GSM8K asks for a final numeric answer &#8212; it&#8217;s a short-answer benchmark scored with exact match, not a &#8220;generation eval&#8221; just because the model generates text. This tutorial uses multiple-choice with log-prob scoring.</p><div><hr></div><h2><strong>Step 2 &#8212; The three benchmarks we&#8217;ll run</strong></h2><h3><strong>ARC-Easy (AI2 Reasoning Challenge)</strong></h3><p>4-way multiple choice, grade-school science. 
Here&#8217;s a real item from the dataset:</p><blockquote><p><em>&#8220;Which of the following best describes a physical change? (A) Burning wood (B) Rusting iron (C) Melting ice (D) Digesting food&#8221;</em></p></blockquote><p>Answer: C. The question requires knowing that melting is reversible and doesn&#8217;t change chemical composition. Random baseline: 25%.</p><h3><strong>PIQA (Physical Interaction QA)</strong></h3><p>2-way multiple choice about physical tasks. Real item:</p><blockquote><p><em>Goal: Separate egg whites from yolk using a water bottle.</em> <em>Solution 1: Squeeze the bottle, hold it over the yolk, then release &#8212; the suction pulls the yolk in.</em> <em>Solution 2: Fill the bottle with water and pour it over the egg to wash away the whites.</em></p></blockquote><p>Answer: Solution 1. The model has to reason about suction and physical manipulation. Random baseline: 50%.</p><h3><strong>HellaSwag</strong></h3><p>4-way sentence completion, adversarially constructed. Correct completions describe what naturally happens next; wrong options were specifically written to fool models that rely on surface-level patterns rather than understanding. That makes it harder than it looks, and harder than ARC and PIQA for most small models. Random baseline: 25%.</p><div><hr></div><h2><strong>Step 3 &#8212; Why we use acc_norm in this tutorial</strong></h2><p>For multiple-choice tasks, the harness scores each answer choice by its log-probability under the model. The problem: a longer choice sums log-probability over more tokens, and each token contributes a negative term, so raw summing biases the comparison toward shorter choices.</p><p><strong>acc_norm</strong> normalizes each choice&#8217;s log-probability by its length (the harness divides by the byte length of the choice text) before comparing. When answer lengths are similar (like PIQA&#8217;s two short options), the difference from raw <code>acc</code> is small. 
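</p>

<p>The two scoring rules can disagree on the same item. Here is a minimal, self-contained sketch with toy numbers (not the harness&#8217;s real code): the raw score sums per-token log-probabilities, and the normalized score divides that sum by the choice&#8217;s byte length.</p>

```python
# Toy per-token log-probs for three answer choices (higher is better).
# These numbers are made up for illustration.
choices = {
    "ice melts":                       [-1.2, -0.9],
    "the ice melts into liquid water": [-0.4, -0.5, -0.6, -0.5, -0.4, -0.5],
    "wood burns":                      [-2.0, -2.1],
}

def raw_score(logprobs):
    # What plain `acc` compares: the summed log-probability.
    return sum(logprobs)

def norm_score(text, logprobs):
    # `acc_norm`-style: divide the sum by the choice's byte length.
    return sum(logprobs) / len(text.encode("utf-8"))

raw_pick  = max(choices, key=lambda c: raw_score(choices[c]))
norm_pick = max(choices, key=lambda c: norm_score(c, choices[c]))
```

<p>Here the raw sum picks the two-token choice while the normalized score picks the six-token completion: summing penalizes length, and normalizing cancels that penalty.</p>

<p>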
When they vary significantly, it matters &#8212; so we use <code>acc_norm</code> throughout to stay consistent.</p><blockquote><p>The harness outputs both <code>acc</code> and <code>acc_norm</code>. The numbers for PIQA will be nearly identical. For HellaSwag, they can differ by a few points because the wrong answers are sometimes much longer than the correct one.</p></blockquote><div><hr></div><h2><strong>Step 4 &#8212; Set up your environment</strong></h2><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;6815ffb7-d764-4ece-9b14-275b6124de71&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

pip install lm_eval transformers torch</code></pre></div><p><code>lm_eval</code> is the <a href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI lm-evaluation-harness</a>.</p><div><hr></div><h2><strong>Step 5 &#8212; Get the script</strong></h2><p>Clone the full repo:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;2e081a18-0395-4f50-9169-f3708d057b44&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">git clone https://github.com/vishsangale/bonsai-llm
cd bonsai-llm/posts/part1-llm-evals-intro
python eval_gpt2.py</code></pre></div><div><hr></div><h2><strong>Step 6 &#8212; What the script does</strong></h2><p>The core call:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;06fccfa0-eadf-4ff3-bff9-f53fb15bdc23&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["arc_easy", "piqa", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
)</code></pre></div><p><code>num_fewshot=0</code> means zero-shot: just the question, no worked examples in the prompt.</p><p>On the first run, the harness downloads benchmark datasets from HuggingFace into <code>~/.cache/huggingface/datasets/</code> &#8212; you&#8217;ll see progress bars for each task. If you&#8217;re on a slow connection this is where most of the wait time goes. The GPT-2 weights are about 550MB and land in <code>~/.cache/huggingface/hub/</code>. Subsequent runs skip all of this.</p><p>One thing that trips up beginners: if a task name is wrong or not installed, the harness fails with a <code>KeyError</code> rather than a clear error message. If that happens, run <code>lm_eval --tasks list</code> to see what&#8217;s available and check your spelling.</p><div><hr></div><h2><strong>Step 7 &#8212; Read the output</strong><br></h2><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/gifVE/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cea89657-19ba-425c-956a-059fb6af6569_1220x500.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/444d6fe0-f71d-4afb-83ca-05bc61617b67_1220x500.png&quot;,&quot;height&quot;:243,&quot;title&quot;:&quot;[ Insert title here ]&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/gifVE/1/" width="730" height="243" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The 
<code>Stderr</code> column tells you how reliable the estimate is. For ARC-Easy (~2600 questions) it&#8217;s &#177;0.01, meaning the true number is probably within one percentage point of what&#8217;s shown. For tasks with fewer examples, stderr grows &#8212; keep that in mind when comparing small differences between models.</p><div><hr></div><h2><strong>Step 8 &#8212; Interpret the results</strong></h2><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/tva4Q/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bcd0c339-c1c5-44e8-9d7b-61fe9eb41a97_1220x434.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83151fba-a496-4e78-87a8-7f533b610d0c_1220x434.png&quot;,&quot;height&quot;:209,&quot;title&quot;:&quot;[ Insert title here ]&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/tva4Q/1/" width="730" height="209" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The PIQA score is the interesting one. GPT-2 at 62.5% on a 50% baseline is a larger gain than ARC-Easy&#8217;s 39.5% on a 25% baseline, in relative terms. That&#8217;s not because PIQA is easier &#8212; it&#8217;s because GPT-2&#8217;s training data (web text) is saturated with &#8220;how to&#8221; content about physical tasks. 
GPT-2 picked up enough of that statistical regularity for it to transfer to this benchmark.</p><p>HellaSwag at 31.1% is close to chance. That&#8217;s expected here: GPT-2 was trained to predict tokens, not to reason about what comes next in a scenario. HellaSwag was designed so that wrong answers score high on surface-level statistics &#8212; exactly the thing GPT-2 is good at &#8212; which is why small base completion models consistently score poorly on it.</p><div><hr></div><h2><strong>Step 9 &#8212; Compare model sizes</strong></h2><p>Edit the <code>MODEL</code> variable at the top of <code>eval_gpt2.py</code>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;36d2f367-afee-42b2-90b5-08f8e51ef6ac&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">MODEL = "gpt2"         # 124M parameters
MODEL = "gpt2-medium"  # 355M parameters
MODEL = "gpt2-large"   # 774M parameters
MODEL = "gpt2-xl"      # 1.5B parameters</code></pre></div><p>All three scores improve with size. Here are the actual numbers across all four variants (acc_norm, zero-shot):</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/JjhuC/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e51e1cb-ab14-4cf2-9bbc-72a1c97da14b_1220x470.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51b85a8c-6975-41a0-b817-aea96fa2a5ad_1220x470.png&quot;,&quot;height&quot;:228,&quot;title&quot;:&quot;[ Insert title here ]&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/JjhuC/2/" width="730" height="228" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>HellaSwag scales the most &#8212; +19.8pp from base to xl &#8212; consistent with the adversarial task benefiting more from additional capacity. PIQA shows diminishing returns: GPT-2&#8217;s web text training already gives it a strong foundation for physical intuition tasks, so additional parameters help less. ARC-Easy sits in the middle.</p><p>The script saves <code>results_&lt;model&gt;.json</code> after each run. 
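</p>

<p>A minimal comparison sketch. The flat <code>{"results": {task: {metric: value}}}</code> layout below is simplified for illustration, and the candidate numbers are hypothetical; recent harness versions name the metric key <code>"acc_norm,none"</code>, so adjust the lookup to match your files.</p>

```python
import json  # in practice: baseline = json.load(open("results_gpt2.json"))

def score_deltas(baseline, candidate, metric="acc_norm"):
    """Per-task score change between two runs; negative values are regressions."""
    return {
        task: round(candidate["results"][task][metric]
                    - baseline["results"][task][metric], 4)
        for task in baseline["results"]
    }

# Baseline numbers from the GPT-2 run above; candidate numbers are made up
# to stand in for a hypothetical post-fine-tuning run.
baseline  = {"results": {"arc_easy": {"acc_norm": 0.395},
                         "piqa":     {"acc_norm": 0.625}}}
candidate = {"results": {"arc_easy": {"acc_norm": 0.410},
                         "piqa":     {"acc_norm": 0.601}}}

deltas = score_deltas(baseline, candidate)
regressions = {t: d for t, d in deltas.items() if d < 0}
```

<p>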
Diff two JSON files to track which tasks regress after fine-tuning &#8212; that&#8217;s the main practical use of these baselines.</p><div><hr></div><h2><strong>Step 10 &#8212; Limits and contamination</strong></h2><p><strong>Benchmark contamination</strong> is worth knowing about before you trust leaderboard numbers. If a model&#8217;s training data contains benchmark test items &#8212; even partially &#8212; the scores overstate real capability. This is a genuine concern for models trained on large web crawls, many of which postdate these benchmarks. GPT-2 is old enough that contamination is less of a concern here.</p><p>The practical implication: public benchmarks are reliable for comparing models trained under similar conditions and for catching regressions. For product decisions, use held-out evals that match your actual use case &#8212; ones you control and the model has never seen.</p><div><hr></div><h2><strong>When to use these evals &#8212; and when not to</strong></h2><p></p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/krw7G/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1dde6280-e8ed-4229-a2eb-43646090934c_1220x600.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef3a47df-b744-4c2c-b041-f63eb969a9d9_1220x600.png&quot;,&quot;height&quot;:294,&quot;title&quot;:&quot;[ Insert title here ]&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/krw7G/1/" width="730" height="294" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var 
r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><p>My rule of thumb: if you can&#8217;t say what decision changes based on the eval result, you probably don&#8217;t need that eval yet.</p><div><hr></div><h2><strong>What&#8217;s next</strong></h2><p><strong>Part 2</strong> moves into open-ended generation evals. We&#8217;ll use a small instruction-tuned model, build a RAG pipeline, and measure it with RAGAS &#8212; scoring faithfulness, answer relevance, and context recall.</p><p><strong>Part 3</strong> compares the major eval frameworks side-by-side: lm-eval-harness, DeepEval, RAGAS, and Inspect.</p><div><hr></div><p><em>Full code: <a href="https://github.com/vishsangale/bonsai-llm/tree/main/posts/part1-llm-evals-intro">github.com/vishsangale/bonsai-llm/tree/main/posts/part1-llm-evals-intro</a></em></p>]]></content:encoded></item><item><title><![CDATA[Personalization at Bluesky]]></title><description><![CDATA[The past, present, and future of personalization of the Discover feed]]></description><link>https://recsysml.substack.com/p/personalization-at-bluesky</link><guid isPermaLink="false">https://recsysml.substack.com/p/personalization-at-bluesky</guid><dc:creator><![CDATA[Ian Wesley-Smith]]></dc:creator><pubDate>Mon, 23 Feb 2026 19:03:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ybxd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At <a href="https://bsky.app/">Bluesky</a>, we are building an open foundation for the social internet, where anyone can create a feed, such as the <a href="https://bsky.app/profile/did:plc:jfhpnnst6flqway4eaeqzj2a/feed/for-science">Science feed</a>, <a href="https://bsky.app/profile/spacecowboy17.bsky.social/feed/for-you">For You feed by 
spacecowboy</a>, or <a href="https://bsky.app/profile/did:plc:rea3amxwqfkfzhilivubtrib/feed/aaabfz334lr66">GLAMS feed</a>. We also aim to provide a great default Discover feed. This post discusses personalization of the Discover feed, from historical attempts to current deployment, and a path forward inspired by Pinterest&#8217;s work. If interested, come work with me at <a href="https://jobs.gem.com/bluesky/am9icG9zdDpJ8TCGYh93XAC00AkK4gXz">Bluesky</a>!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ybxd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ybxd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png 424w, https://substackcdn.com/image/fetch/$s_!Ybxd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png 848w, https://substackcdn.com/image/fetch/$s_!Ybxd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png 1272w, https://substackcdn.com/image/fetch/$s_!Ybxd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ybxd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png" width="528" height="514" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:514,&quot;width&quot;:528,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:185213,&quot;alt&quot;:&quot;A screenshot of the Bluesky Discover feed.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/188820301?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A screenshot of the Bluesky Discover feed." title="A screenshot of the Bluesky Discover feed." 
srcset="https://substackcdn.com/image/fetch/$s_!Ybxd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png 424w, https://substackcdn.com/image/fetch/$s_!Ybxd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png 848w, https://substackcdn.com/image/fetch/$s_!Ybxd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png 1272w, https://substackcdn.com/image/fetch/$s_!Ybxd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff590eeaf-58c9-4c4a-9760-0e8edebfee27_528x514.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">An example of the Bluesky Discover feed</figcaption></figure></div><p>As the first MLE at Bluesky, I initially attempted a <a href="https://recsysml.substack.com/p/two-tower-models-for-retrieval-of">two tower model</a>, but it failed to converge, possibly due to insufficient data or being a poor fit for Bluesky&#8217;s short-lifetime items and skewed interaction distributions. Bluesky was (and still is) a small team, so I couldn&#8217;t spend forever debugging this issue. Instead I switched to building a system that would generate post embeddings based on the content of a post, with the idea that I could build a personalization system on top of that.</p><p>Currently, posts are embedded using <a href="https://arxiv.org/abs/2301.12597">BLIP2</a>, a variant of CLIP, which powers our topic models (27 topics users select during onboarding). While this topic model is accurate, it is also quite broad, which hurts the user experience. I&#8217;ve also run <a href="https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html">HDBSCAN</a> over a sample of the post embedding space to generate ~600 clusters that provide finer-grained grouping of content. By measuring a user&#8217;s interaction with content from these clusters or topics, we have a rudimentary personalization system that can help users find content they might be interested in.</p><p>My goal is to substantially improve personalization of the Discover feed. After reviewing papers, I chose to investigate techniques from Pinterest, specifically their <a href="https://arxiv.org/abs/2007.03634">PinnerSage</a> paper. 
This choice was based on budget fit, simplicity, avoiding extensive fine-tuning, and the requirement to treat user and post representations separately. There are a lot of similarities between the papers published by Pinterest and Twitter, but I chose to use the Pinterest papers because they&#8217;ve continued publishing, providing a path to utilize more advanced models as the ML team at Bluesky grows.</p><h2>Bluesky is hiring!</h2><p>Speaking of growing the team, are you a mid-senior MLE with experience in recommender systems? Do you want to join a team laying the groundwork for how ML will operate at a fast-growing social media platform? Do you want to increase your scope of work? Want to experiment with new, unconventional ideas? Think distributed social media is the way of the future? Then come work with me at <a href="https://jobs.gem.com/bluesky/am9icG9zdDpJ8TCGYh93XAC00AkK4gXz">Bluesky</a>!</p><h2>PinnerSage</h2><p>Published in 2020, <a href="https://arxiv.org/abs/2007.03634">PinnerSage</a> addresses the issue of a single user preference embedding failing to capture a user&#8217;s full range of interests, especially short- and long-term interests. It does this by generating several (10-100) user preference embeddings via an offline path (last 90 days) and an online path (today&#8217;s interactions). This resulted in a 2% increase in user engagement propensity and a 4% increase in engagement volume in online A/B tests.</p><h3>How it works</h3><p>PinnerSage is a rather simple approach to the problem, with intentional design choices that match my own. They specifically mention that item embeddings should be fixed, which is a requirement for me.</p><h4>Step 1: Cluster User Interactions</h4><p>First, for a given user they take the last 90 days of item interactions (i.e. action pins) and gather the item embeddings. 
Next they cluster these embeddings using <a href="https://en.wikipedia.org/wiki/Ward%27s_method">Ward clustering</a>, generating a &#8216;small number&#8217; (10-100) of clusters for a user. Their specific Ward implementation is based on the <a href="https://en.wikipedia.org/wiki/Ward%27s_method#Lance%E2%80%93Williams_algorithms">Lance-Williams algorithm</a>, and has a complexity of <em>O(n^2)</em> where <em>n</em> is the number of items being clustered.</p><h4>Step 2: Calculate the Medoid</h4><p>Second, for each cluster, a medoid&#8212;an actual member of the cluster that minimizes the sum of the squared distances with other members&#8212;is calculated. This simplifies deployment by allowing Pinterest to reuse existing pin infrastructure.</p><h4>Step 3: Importance Scoring</h4><p>Finally, they calculate a user-cluster importance score. Since a user can have 10-100 clusters we need a way to choose which clusters to use during retrieval. They use a simple time decay average model: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Importance(C, \\lambda) = \\sum_{i \\in C} e^{-\\lambda(\\tau_{now} - \\tau_{[i]})}&quot;,&quot;id&quot;:&quot;DLPNRPHLTZ&quot;}" data-component-name="LatexBlockToDOM"></div><p><em>lambda</em> is a hyper-parameter that controls recency, with 0 ignoring time effects and 0.1 emphasizing recent interaction. Pinterest found 0.01 to be a good balance.</p><p>With these three steps we now have a set of per-user interest medoids (i.e. pins) and weights for how much a user interacts with those pins.</p><h3>Integrating with your Recommender System</h3><p>Applying this to retrieval is fairly straightforward. The medoids can be sampled, weighted by importance, and used as candidate sources for an ANN-based candidate generator. 
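</p>

<p>The three steps above can be sketched compactly with scipy. This is a toy illustration on random vectors, not Pinterest&#8217;s implementation: the embeddings, timestamps, and 5-cluster cap are made up, with lambda = 0.01 as in the paper.</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 16))          # toy item embeddings (90-day history)
ages_days = rng.uniform(0, 90, size=50)  # days since each interaction

# Step 1: Ward clustering of the user's interacted-item embeddings,
# capped at a small number of clusters (5 here, purely illustrative).
labels = fcluster(linkage(emb, method="ward"), t=5, criterion="maxclust")

medoids, importance = {}, {}
lam = 0.01                               # recency hyper-parameter from the paper
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    # Step 2: medoid = the member minimizing summed squared distance
    # to the rest of its cluster.
    d2 = squareform(pdist(emb[idx])) ** 2
    medoids[c] = idx[d2.sum(axis=1).argmin()]
    # Step 3: time-decayed importance score for the cluster.
    importance[c] = np.exp(-lam * ages_days[idx]).sum()
```

<p>The output is exactly what the paper deploys per user: a handful of medoid item ids plus a weight saying how much that interest currently matters.</p>

<p>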
Pinterest sampled up to 3 medoids at a time, and applied additional (though unspecified) filtering to remove near duplicates and poor-quality candidates.</p><p>One weakness with PinnerSage is the difficulty of using these user preference embeddings during ranking. Traditionally you create a feature for each item that is the similarity between that item&#8217;s embedding and the user preference embedding. With PinnerSage there are anywhere from 10-100 preference embeddings for each user, so it is unclear which of these embeddings you should choose. You could try using all of them and taking the maximum similarity between a given item and each of the user preference embeddings, but this is expensive to do at runtime (i.e. 100 embeddings x 1,000 items = 100,000 ops). Another option is to take a weighted average of the user preference embeddings to combine them into a single user-preference embedding, but this naive approach will likely lose accuracy by smearing the user&#8217;s preferences together.</p><p>The difficulty of integrating multiple user preference embeddings into ranking was a key motivator for <a href="http://arxiv.org/abs/2205.04507">PinnerFormer (Pancha et al., 2022)</a>, which Pinterest developed to generate a single user preference embedding using Transformers to better capture user interests. 
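</p>

<p>The two work-arounds described above can be contrasted in a few lines of numpy. The shapes and weights here are illustrative, not from the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
user_embs = rng.normal(size=(100, 64))   # 10-100 user preference embeddings
weights = rng.uniform(size=100)
weights /= weights.sum()                 # importance weights, summing to 1
items = rng.normal(size=(1000, 64))      # candidate items to rank

# Option 1: max similarity over all preference embeddings.
# 100 embeddings x 1,000 items = 100,000 dot products at runtime.
max_sim = (items @ user_embs.T).max(axis=1)   # shape (1000,)

# Option 2: collapse to a single importance-weighted embedding first.
# Only 1,000 dot products, but distinct interests get smeared together.
pooled = weights @ user_embs                  # shape (64,)
avg_sim = items @ pooled                      # shape (1000,)
```

<p>Because the pooled score is a convex combination of the per-embedding similarities, it can never exceed the maximum; what it loses is the ability to score an item highly against one niche interest.</p>

<p>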
We will discuss PinnerFormer in a future blogpost.</p><h3>Short Term Interests &amp; Item Embeddings</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!puhE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!puhE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png 424w, https://substackcdn.com/image/fetch/$s_!puhE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png 848w, https://substackcdn.com/image/fetch/$s_!puhE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png 1272w, https://substackcdn.com/image/fetch/$s_!puhE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!puhE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png" width="790" height="626" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:790,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A diagram depicting the PinnerSage recommendation system architecture. It shows the batch and real-time systems, with the batch feeding into a Key-Value store while the real time feeds directly into the candidate generator.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A diagram depicting the PinnerSage recommendation system architecture. It shows the batch and real-time systems, with the batch feeding into a Key-Value store while the real time feeds directly into the candidate generator." title="A diagram depicting the PinnerSage recommendation system architecture. It shows the batch and real-time systems, with the batch feeding into a Key-Value store while the real time feeds directly into the candidate generator." 
srcset="https://substackcdn.com/image/fetch/$s_!puhE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png 424w, https://substackcdn.com/image/fetch/$s_!puhE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png 848w, https://substackcdn.com/image/fetch/$s_!puhE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png 1272w, https://substackcdn.com/image/fetch/$s_!puhE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0a2d8c-c0ac-4460-82e3-848b55c6a788_790x626.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">PinnerSage architecture diagram from Pal et al., &#8220;PinnerSage.&#8221;</figcaption></figure></div><p>Earlier we alluded to an online system that captures short-term interests: an event-based streaming system performs the same clustering and importance-estimation steps on the twenty most recent actions since the last batch job. These results are combined with the batch results.</p><p>One thing not discussed in this paper is how item embeddings are generated. At the time of publication (2020), Pinterest used a sophisticated graph-based embedding model called <a href="http://arxiv.org/abs/1706.02216">PinSage (Hamilton et al., &#8220;Inductive Representation Learning on Large Graphs.&#8221;)</a>. At Bluesky we are using BLIP2 to generate post embeddings. If you don&#8217;t already have an item embedding model, then you can&#8217;t deploy PinnerSage.</p><h2>Conclusion</h2><p>This blog post presented an overview of PinnerSage, a clustering-based approach to generating user preference embeddings while keeping item embeddings fixed. I also discussed a brief history of personalization at Bluesky, and provided my motivation for investigating PinnerSage. My current plans are to implement PinnerSage as a candidate generator, then move to PinnerFormer to generate a single user preference embedding for ranking. As we make progress on various parts of the stack we will share our work.</p><h2>Bibliography</h2><p>Pal, Aditya, Chantat Eksombatchai, Yitong Zhou, Bo Zhao, Charles Rosenberg, and Jure Leskovec. 
&#8220;PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest.&#8221; <em>Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</em>, August 23, 2020, 2311&#8211;20.<a href="https://doi.org/10.1145/3394486.3403280"> https://doi.org/10.1145/3394486.3403280</a>.</p><p>Hamilton, William L., Rex Ying, and Jure Leskovec. &#8220;Inductive Representation Learning on Large Graphs.&#8221; <em>arXiv:1706.02216 [Cs, Stat]</em>, June 7, 2017.<a href="http://arxiv.org/abs/1706.02216"> http://arxiv.org/abs/1706.02216</a>.</p><p>Pancha, Nikil, Andrew Zhai, Jure Leskovec, and Charles Rosenberg. &#8220;PinnerFormer: Sequence Modeling for User Representation at Pinterest.&#8221; arXiv:2205.04507. Preprint, arXiv, May 9, 2022.<a href="http://arxiv.org/abs/2205.04507">http://arxiv.org/abs/2205.04507</a>.</p>]]></content:encoded></item><item><title><![CDATA[The Mathematics of Intelligence: A Deep Dive into LLM Training]]></title><description><![CDATA[Creating a state-of-the-art Large Language Model (LLM) is not a single act of training.]]></description><link>https://recsysml.substack.com/p/the-mathematics-of-intelligence-a</link><guid isPermaLink="false">https://recsysml.substack.com/p/the-mathematics-of-intelligence-a</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Sat, 31 Jan 2026 17:17:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-K4v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Creating a state-of-the-art Large Language Model (LLM) is not a single act of training. 
It is a multi-stage evolution that transforms a raw neural network from a statistical pattern-matcher into a refined, reliable assistant.</p><p>This post breaks down the technical pipeline&#8212;from foundational knowledge acquisition to parameter-efficient fine-tuning with <strong>LoRA</strong>, and the critical &#8220;leash&#8221; provided by <strong>KL-Divergence</strong>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://recsysml.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading RecsysML + LLMs! Subscribe for free to receive new posts and support our work. If you want to write with us, we welcome that too!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>1. Pre-training: The Foundation of Knowledge</h3><p>In this first stage, the model absorbs the statistical structure of language. It is exposed to trillions of tokens with the objective of <strong>Next-Token Prediction</strong>. At this level, the model is learning the &#8220;world model&#8221;&#8212;facts, grammar, and reasoning&#8212;but it has no concept of a &#8220;user&#8221; or a &#8220;task.&#8221;</p><p><strong>The Logic:</strong> We measure how &#8220;surprised&#8221; the model is by the actual next word in a sentence. The goal is to minimize this surprise across the entire internet.</p><p><strong>The Math: Cross-Entropy Loss</strong> The model learns a probability distribution P over its vocabulary. 
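</p><p>As a tiny, self-contained illustration (pure Python, toy numbers, not a real model), the per-token &#8220;surprise&#8221; is just the negative log of the probability the model assigned to the actual next token:</p>

```python
import math

# Toy distribution over a 4-word vocabulary, predicted for the next token.
probs = {"cat": 0.70, "dog": 0.20, "car": 0.05, "sky": 0.05}

actual_next = "cat"
nll = -math.log(probs[actual_next])          # low surprise: ~0.357

surprising_next = "sky"
nll_bad = -math.log(probs[surprising_next])  # high surprise: ~3.0
```

<p>Summed over every position in the corpus, this per-token quantity is the cross-entropy loss minimized during pre-training.</p><p>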
We minimize the <strong>Negative Log-Likelihood (NLL)</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{Pre} = -\\sum_{t=1}^{T} \\log P(w_t | w_{<t}; \\theta)&quot;,&quot;id&quot;:&quot;UDPWELZEKP&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p><strong>&#952;:</strong> The massive parameter set of the base model.</p></li><li><p><strong>The Result:</strong> The model develops a &#8220;world model,&#8221; internalizing facts, grammar, and reasoning. However, it doesn&#8217;t yet know how to follow instructions.</p></li></ul><div><hr></div><h3>2. Instruction Tuning (SFT) &amp; LoRA</h3><p>To transform a knowledge base into an assistant, we use <strong>Supervised Fine-Tuning (SFT)</strong>. We provide the model with &#8220;Gold Standard&#8221; examples of how to follow a prompt. To make this efficient, we use <strong>LoRA (Low-Rank Adaptation)</strong>.</p><p><strong>The Logic:</strong> Instead of updating all 70+ billion parameters (which is slow and expensive), we freeze the original model and add tiny, specialized &#8220;adapter&#8221; matrices. These matrices are &#8220;Low-Rank,&#8221; meaning they compress the complex task of &#8220;being an assistant&#8221; into a much smaller mathematical space.</p><p><strong>The Math of LoRA:</strong> Instead of updating the full weight matrix W0&#8203;, we freeze it and train two small, low-rank matrices A and B. 
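</p><p>A minimal numerical sketch of the LoRA idea (numpy, toy dimensions; in a real model each attention/MLP projection gets its own A and B):</p>

```python
import numpy as np

d, k, r = 8, 8, 2                     # full dimensions vs. low rank (r << d, k)
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))          # frozen pre-trained weight
A = rng.normal(size=(r, k)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-init so the delta starts at 0

x = rng.normal(size=k)
h = W0 @ x + B @ (A @ x)              # h = W0 x + B(Ax); identical to W0 x at initialization

n_trainable = A.size + B.size         # r*(d+k) = 32, versus d*k = 64 for the full matrix
```

<p>Only A and B receive gradients; at 70B-parameter scale the savings are far more dramatic than in this toy example.</p><p>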
The rank r is much smaller than the original dimensions (r&#8810;d,k).</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h = W_0 x + \\Delta W x = W_0 x + B(Ax)&quot;,&quot;id&quot;:&quot;IUPOGOPGMS&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p><strong>The Loss:</strong> The goal is to maximize the likelihood of the specific human-labeled response: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{SFT} = -\\mathbb{E}_{(x,y)} [\\log P(y | x; \\theta)]&quot;,&quot;id&quot;:&quot;RMKLHVPEQN&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><div><hr></div><h3>3. Reinforcement Learning (PPO): Joint Optimization</h3><p>Instruction tuning teaches the model the <strong>format</strong> of being an assistant, but <strong>Reinforcement Learning from Human Feedback (RLHF)</strong> teaches it <strong>quality and safety</strong>. In this stage, we often use a &#8220;Joint Optimization&#8221; where we update the <strong>same LoRA adapter</strong> from the SFT stage.</p><h4>A. The Bradley-Terry Reward Model</h4><p>We don&#8217;t just give the model a raw score. We use an &#8220;Advantage&#8221; calculation.</p><p><strong>The Logic:</strong> First, a <strong>Reward Model</strong> is trained on human preference data over model responses, so that we can predict human preferences for any response, not just the ones we have labels for.</p><p>The Reward Model is a static judge that gives a score. However, a score of &#8220;7/10&#8221; is meaningless without context. Is 7 good for a complex coding prompt? Or is it bad for a simple &#8220;Hello&#8221;? The <strong>Advantage</strong> instead measures how much <em>better</em> a specific answer was compared to what the model usually produces. 
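</p><p>A toy numeric sketch of these two pieces (made-up scores, pure Python), using the Bradley-Terry form for preference probability:</p>

```python
import math

def preference_prob(r_w, r_l):
    """Bradley-Terry: probability the higher-scored answer is preferred."""
    return math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))

# Reward-model scores for two candidate answers to the same prompt.
p = preference_prob(r_w=2.0, r_l=0.5)  # ~0.82: the first answer is likely preferred

# Advantage: actual reward minus the value model's expected reward for this prompt.
reward, expected = 2.0, 1.4
advantage = reward - expected          # positive: push this response to be more likely
```

<p>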
This prevents the model from getting &#8220;lazy&#8221; and ensures it is constantly striving for a higher-than-average response.</p><p><strong>The Math:</strong> The Reward Model (r&#981;) is trained to predict which answer humans prefer (y_w vs y_l): </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P(y_w \\succ y_l | x) = \\frac{\\exp(r_\\phi(x, y_w))}{\\exp(r_\\phi(x, y_w)) + \\exp(r_\\phi(x, y_l))}&quot;,&quot;id&quot;:&quot;SSSYAJDYEX&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then we use a <strong>Value Model</strong> (V) to estimate the &#8220;expected reward&#8221; for a prompt. The Advantage is the difference between the actual Reward we received and what the Value Model expected: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{A}_t = r_\\phi(x, y) - V(x)&quot;,&quot;id&quot;:&quot;NSJAFFCJJZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>If the Advantage &gt; 0, the model performed better than its own baseline, and we &#8220;push&#8221; the LoRA weights to make that response more likely.</p><h4>B. The PPO Clipped Objective (L_PPO)</h4><p>When we &#8220;push&#8221; the model to be better, we have to be careful not to push it so hard that it &#8220;breaks&#8221; and starts generating gibberish.</p><p><strong>The Logic:</strong> We use a <strong>Clipped Objective</strong>. This acts as a mathematical guardrail that prevents the LoRA adapter from changing too drastically in a single update. 
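</p><p>A minimal sketch of one term of the clipped objective (pure Python, a single action; real training averages this over batches of sequences):</p>

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """One term of L_PPO: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# New policy made a good action 1.5x more likely: the gain is capped at (1+eps) * A.
capped = ppo_clipped_term(ratio=1.5, advantage=2.0)    # 2.4, not 3.0
# Small updates inside the trust band pass through unchanged.
uncapped = ppo_clipped_term(ratio=1.1, advantage=2.0)  # 2.2
```

<p>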
If the model wants to change its behavior by 50%, the clip function forces it to only change by a safe 10-20%.</p><p><strong>The Math:</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{PPO} = \\mathbb{E}_t [\\min(r_t(\\theta) \\hat{A}_t, \\text{clip}(r_t(\\theta), 1-\\epsilon, 1+\\epsilon) \\hat{A}_t)]&quot;,&quot;id&quot;:&quot;AOYAUOGAYZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;r_t(\\theta) = \\frac{\\pi_\\theta(a_t|s_t)}{\\pi_{old}(a_t|s_t)}&quot;,&quot;id&quot;:&quot;LFZASNMUEA&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>&#1013;:</strong> A hyperparameter (usually 0.1 or 0.2) that limits how much the policy can change.</p><h4>C. The Joint Loss</h4><p>On the same adapter, we combine the RL objective with SFT regularization and the KL penalty:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; L_{Total} = L_{PPO} + \\alpha L_{SFT} - \\beta \\text{KL}(\\pi_\\theta || \\pi_{ref})&quot;,&quot;id&quot;:&quot;MWSRFULIHZ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p><strong>The Logic:</strong> The KL penalty measures the divergence between the predictions of the new model (with LoRA) and the frozen reference model, ensuring the policy does not drift far from the pre-trained model that learned the language structure.</p><div><hr></div><h3>4. Direct Preference Optimization (DPO): The Sequential Shortcut</h3><p><strong>DPO</strong> is a modern alternative that skips the Reward Model entirely. 
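</p><p>A scalar sketch of the DPO loss (toy log-probabilities, pure Python; real training averages this over a dataset of preference pairs):</p>

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (winner log-ratio - loser log-ratio))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy raised the preferred answer's log-prob and lowered the rejected one's:
improving = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
# No change from the reference model: margin is 0, loss is log(2).
neutral = dpo_loss(logp_w=-2.0, logp_l=-2.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
```

<p>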
Unlike PPO, it is typically a <strong>sequential</strong> process regarding adapters.</p><p><strong>The Strategy:</strong></p><ol><li><p><strong>Reference:</strong> You freeze the SFT LoRA adapter as &#960;ref&#8203;.</p></li><li><p><strong>Policy:</strong> You train a <strong>new, separate LoRA adapter</strong> (&#960;&#952;&#8203;) that learns to maximize the log-ratio between preferred and unpreferred responses.</p></li></ol><p><strong>The Math:</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; L_{DPO} = -\\mathbb{E} \\left[ \\log \\sigma \\left( \\beta \\log \\frac{\\pi_\\theta(y_w|x)}{\\pi_{ref}(y_w|x)} - \\beta \\log \\frac{\\pi_\\theta(y_l|x)}{\\pi_{ref}(y_l|x)} \\right) \\right]&quot;,&quot;id&quot;:&quot;VWUPZLBGYU&quot;}" data-component-name="LatexBlockToDOM"></div><div><hr></div><h3>Summary Table: Adapter &amp; Loss Strategies</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-K4v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-K4v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png 424w, https://substackcdn.com/image/fetch/$s_!-K4v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png 848w, https://substackcdn.com/image/fetch/$s_!-K4v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png 
1272w, https://substackcdn.com/image/fetch/$s_!-K4v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-K4v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png" width="1208" height="544" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:544,&quot;width&quot;:1208,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:78933,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/185125411?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-K4v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png 424w, https://substackcdn.com/image/fetch/$s_!-K4v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png 848w, 
https://substackcdn.com/image/fetch/$s_!-K4v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png 1272w, https://substackcdn.com/image/fetch/$s_!-K4v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55251318-3711-47b6-b76c-2add5d32b2ff_1208x544.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h3>Conclusion</h3><p>Modern LLM training is a balancing act. 
PPO uses <strong>Advantage</strong> and <strong>Clipping</strong> on a single LoRA adapter to incrementally move toward human preferences, while DPO uses a <strong>Log-Ratio</strong> on sequential adapters to simplify the math. In both cases, the goal is to close the gap between what the model <em>did</em> and what the human <em>wanted</em>.</p><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://recsysml.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading RecsysML + LLMs! Subscribe for free to receive new posts and support my work. 
If you want to write with us, we would love to collaborate!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Modernizing GPT-2: A Journey from 2019 to 2025]]></title><description><![CDATA[How we injected state-of-the-art features into a classic architecture and what we learned about scaling.]]></description><link>https://recsysml.substack.com/p/modernizing-gpt-2-a-journey-from</link><guid isPermaLink="false">https://recsysml.substack.com/p/modernizing-gpt-2-a-journey-from</guid><dc:creator><![CDATA[Vish Sangale]]></dc:creator><pubDate>Sat, 24 Jan 2026 15:01:44 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/183977232/9f7245593d5c4243f453439434fa9ee3.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<blockquote><p><a href="https://recsysml.substack.com/p/training-gpt-2-on-a-budget">Part I</a></p><p><a href="https://github.com/vishsangale/gpt-2">Link to GitHub Repository</a></p></blockquote><h2><strong>Introduction</strong></h2><blockquote><p>GPT-2 is a legendary model, effectively the &#8220;Hello World&#8221; of modern LLMs thanks to Andrej Karpathy&#8217;s nanoGPT. 
But in the fast-moving world of AI, 2019 might as well be ancient history.</p><p>We set out to answer a simple question: <strong>What happens if we take the classic GPT-2 architecture and inject the architectural improvements that power today&#8217;s leading models like Llama 3?</strong></p><p>This post details our journey of implementing RoPE, RMSNorm, and SwiGLU into GPT-2, the backward-compatibility challenges we solved, and the surprising results we found when testing on datasets of different sizes.</p></blockquote><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://recsysml.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading RecsysML + LLMs! Subscribe for free to receive new posts early</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3><strong>1. The Upgrades: Why We Made Them</strong></h3><blockquote><p>We focused on three key modernizations that have become standard in post-2023 LLMs.</p><p><strong>&#128260; Rotary Positional Embeddings (RoPE)</strong></p></blockquote><ul><li><p><strong>The Old Way:</strong> Standard GPT-2 uses learned absolute positional embeddings. The model learns a unique vector for position 0, position 1, etc. This doesn&#8217;t scale well to longer contexts and fails to capture the relative distance between tokens effectively.</p></li><li><p><strong>The Upgrade:</strong> We replaced this with RoPE. 
Instead of adding a vector, we rotate the query and key vectors in the attention mechanism based on their position. This allows the model to naturally understand &#8220;token A is 5 steps before token B&#8221; regardless of where they appear in the sequence.</p></li></ul><blockquote><p><strong>&#128208; RMSNorm (Root Mean Square Normalization)</strong></p></blockquote><ul><li><p><strong>The Old Way:</strong> LayerNorm. It centers and scales the input.</p></li><li><p><strong>The Upgrade:</strong> RMSNorm. It skips the centering step and only re-scales. It&#8217;s computationally simpler and, empirically, often leads to more stable training at scale. It&#8217;s a small simplification that has helped models like Llama scale to massive sizes.</p></li></ul><blockquote><p><strong>&#129504; SwiGLU Activation</strong></p></blockquote><ul><li><p><strong>The Old Way:</strong> GeLU (Gaussian Error Linear Unit).</p></li><li><p><strong>The Upgrade:</strong> SwiGLU (Swish Gated Linear Unit). This is one of the most impactful changes. It essentially gives the Feed-Forward Network (FFN) a &#8220;gate&#8221; mechanism, increasing the model&#8217;s capacity and expressivity. It requires slightly more parameters (due to the extra gate projection), but performance per parameter is generally higher.</p></li></ul><div><hr></div><h3><strong>2. Challenge: The Backward Compatibility Trap</strong></h3><p>We didn&#8217;t just want a new model; we wanted a unified codebase. We needed to ensure that we could still load old, vanilla GPT-2 checkpoints.</p><pre><code><code>if config.use_rmsnorm:
    self.ln1 = RMSNorm(config.n_embd)
else:
    self.ln1 = nn.LayerNorm(config.n_embd)
</code></code></pre><p>We even wrote a </p><div><hr></div><h3><strong>3. Experiment 1: The Overfitting Trap (Tiny Shakespeare)</strong></h3><p>Our first test was on the classic Tiny Shakespeare dataset. We anticipated the modernized model would crush the baseline.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DUkC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DUkC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png 424w, https://substackcdn.com/image/fetch/$s_!DUkC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png 848w, https://substackcdn.com/image/fetch/$s_!DUkC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png 1272w, https://substackcdn.com/image/fetch/$s_!DUkC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DUkC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png" width="647" height="143" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:143,&quot;width&quot;:647,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14516,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vishsangale.substack.com/i/181739736?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DUkC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png 424w, https://substackcdn.com/image/fetch/$s_!DUkC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png 848w, https://substackcdn.com/image/fetch/$s_!DUkC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png 1272w, https://substackcdn.com/image/fetch/$s_!DUkC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280e711f-6297-4143-8bb3-924d86a22a7d_647x143.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>What happened?</strong></p><div><hr></div><h3><strong>4. 
Experiment 2: Redemption (FineWeb)</strong></h3><p>To prove the architecture works, we needed a dataset that could withstand the power of the modernized model. We switched to FineWeb, a high-quality, massive web dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_6ba!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_6ba!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png 424w, https://substackcdn.com/image/fetch/$s_!_6ba!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png 848w, https://substackcdn.com/image/fetch/$s_!_6ba!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png 1272w, https://substackcdn.com/image/fetch/$s_!_6ba!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_6ba!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png" width="809" height="451" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:809,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50060,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vishsangale.substack.com/i/181739736?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!_6ba!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png 424w, https://substackcdn.com/image/fetch/$s_!_6ba!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png 848w, https://substackcdn.com/image/fetch/$s_!_6ba!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png 1272w, https://substackcdn.com/image/fetch/$s_!_6ba!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f10d872-7c49-4081-a141-a13bcf52dab8_809x451.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Training Loss</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rIRR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rIRR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png 424w, 
https://substackcdn.com/image/fetch/$s_!rIRR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png 848w, https://substackcdn.com/image/fetch/$s_!rIRR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png 1272w, https://substackcdn.com/image/fetch/$s_!rIRR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rIRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png" width="647" height="106" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:106,&quot;width&quot;:647,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10289,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vishsangale.substack.com/i/181739736?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!rIRR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png 424w, https://substackcdn.com/image/fetch/$s_!rIRR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png 848w, https://substackcdn.com/image/fetch/$s_!rIRR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png 1272w, https://substackcdn.com/image/fetch/$s_!rIRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42acd906-7b1a-4be6-aca7-abf8dd9fc02d_647x106.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Validation Loss</figcaption></figure></div><p>On the larger dataset, the overfitting vanished. The modernized model leveraged its superior architecture to learn more generalized patterns, achieving an </p><h4><strong>Seeing is Believing</strong></h4><p>We generated samples from the FineWeb models starting with </p><div><hr></div><h3><strong>5. 
Experiment 3: The Ablation Study (Who contributed what?)</strong></h3><p>We also ran a quick ablation to see which architectural feature contributes most to early convergence (at Step 100).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GYLa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GYLa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png 424w, https://substackcdn.com/image/fetch/$s_!GYLa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png 848w, https://substackcdn.com/image/fetch/$s_!GYLa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png 1272w, https://substackcdn.com/image/fetch/$s_!GYLa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GYLa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png" width="647" height="228"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:228,&quot;width&quot;:647,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26199,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vishsangale.substack.com/i/181739736?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!GYLa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png 424w, https://substackcdn.com/image/fetch/$s_!GYLa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png 848w, https://substackcdn.com/image/fetch/$s_!GYLa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png 1272w, https://substackcdn.com/image/fetch/$s_!GYLa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a4b45a5-32d6-4dfa-8766-35bcd104ef89_647x228.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><div><hr></div><h3><strong>Conclusion</strong></h3><p>Modernizing legacy architectures isn&#8217;t just about pasting in new code. 
It&#8217;s about understanding the relationship between model expressivity and data scale.</p><ul><li><p><strong>RoPE + SwiGLU + RMSNorm</strong> make the model a more efficient learner.</p></li><li><p>On small data, this efficiency manifests as <strong>overfitting</strong>.</p></li><li><p>On large data, it manifests as <strong>superior performance and generalization</strong>.</p></li></ul><p>We now have a GPT-2 codebase that is backward compatible with 2019 checkpoints but capable of 2025 performance when fed the right data.</p><p></p><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions or opinions stated here are theirs and not representative of or attributable to their current or any prior employer(s). Apart from publicly available information, any other information here is not claimed to refer to any company, including ones the author(s) may have worked in or been associated with.</em></p>]]></content:encoded></item><item><title><![CDATA[A framework for ML Design Interviews]]></title><description><![CDATA[Sharing our ML design framework that has been very successful in FAANG interviews]]></description><link>https://recsysml.substack.com/p/a-framework-for-ml-design-interviews</link><guid isPermaLink="false">https://recsysml.substack.com/p/a-framework-for-ml-design-interviews</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Sat, 10 Jan 2026 15:02:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fd4e5c2d-31fd-4565-83c8-3bfd10c4921d_1456x648.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;067ec062-1e70-4776-9804-9636eeb6417c&quot;,&quot;duration&quot;:null}"></div><p>Over time, we have developed a framework for approaching ML Design interviews. Using this framework has had a high success rate.
(11+ successful offers at levels E6 to E9 at Meta, Google, and others; 1 failure). Having a framework allows you to (a) pace yourself to cover all key elements, (b) deliver a coherent narrative, and (c) help the interviewer know what you would have covered if you had more time.</p><p>Please <strong>message us on Substack</strong> and we will share the framework with you. (We don&#8217;t want to post it publicly to avoid it being used by too many people).</p><p></p>]]></content:encoded></item><item><title><![CDATA[The Case for RL-Aligned Ranking in RecSys]]></title><description><![CDATA[Why the recsys industry is stuck on probability estimation, and how RLHF is the missing link.]]></description><link>https://recsysml.substack.com/p/stop-predicting-ctr-start-optimizing</link><guid isPermaLink="false">https://recsysml.substack.com/p/stop-predicting-ctr-start-optimizing</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Sat, 03 Jan 2026 11:02:22 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/180256469/125c4b6ad9a54e97dc24ff704494a798.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>LLMs and Recommender
systems, like the ones used in <a href="https://recsysml.substack.com/p/personalized-short-video-recommender">video recommendation</a> and <a href="https://recsysml.substack.com/p/friend-recommendation-retrieval-in">friend recommendation</a>, may seem very different, but in this post we compare them and show they are surprisingly similar. We also highlight one key opportunity for the recsys community to improve. Under the surface they solve the same problem:</p><blockquote><p><strong>Given a context, choose the next action that maximizes value.</strong><br>The action is the &#8220;next token&#8221; for LLMs and the &#8220;next item/action&#8221; for RecSys.</p></blockquote><p></p><h2>1. Retrieval &#8776; Pretraining+SFT. Ranking is missing Reward optimization.</h2><p>LLMs are developed in three stages:</p><ol><li><p>pretraining with a next-token prediction loss on a large amount of data not specific to any domain,</p></li><li><p>supervised fine-tuning, still with the next-token prediction loss but only on high-quality data for the target domain, typically at smaller learning rates, and</p></li><li><p>reinforcement learning to maximize expected reward (e.g.
using <a href="https://www.youtube.com/watch?v=tqrcjHuNdmQ">Policy Gradient</a>).</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0U39!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0U39!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png 424w, https://substackcdn.com/image/fetch/$s_!0U39!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png 848w, https://substackcdn.com/image/fetch/$s_!0U39!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png 1272w, https://substackcdn.com/image/fetch/$s_!0U39!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0U39!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png" width="1200" height="395.16616314199393" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:436,&quot;width&quot;:1324,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:138009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/180256469?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0U39!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png 424w, https://substackcdn.com/image/fetch/$s_!0U39!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png 848w, https://substackcdn.com/image/fetch/$s_!0U39!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png 1272w, https://substackcdn.com/image/fetch/$s_!0U39!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba3987d-3930-4211-8de2-0a6ab9684c7b_1324x436.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Retrieval in recsys is done by finding user and item embeddings that maximize the probability of the next item the user interacts with. <a href="https://recsysml.substack.com/p/recsys-retrieval-llm-pretrainingsft">This post</a> shows that Pretraining/SFT = Recsys Retrieval, and how Semantic ID + clustering blur the line even further.</p><h2>2.
Reward optimization in LLMs vs &#8220;Ranking&#8221; in RecSys</h2><p>This is where the two worlds <strong>spiritually reconnect</strong> but <strong>architecturally diverge</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XtnG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XtnG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png 424w, https://substackcdn.com/image/fetch/$s_!XtnG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png 848w, https://substackcdn.com/image/fetch/$s_!XtnG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png 1272w, https://substackcdn.com/image/fetch/$s_!XtnG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XtnG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png" width="928" height="426" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:426,&quot;width&quot;:928,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102353,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/180256469?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XtnG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png 424w, https://substackcdn.com/image/fetch/$s_!XtnG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png 848w, https://substackcdn.com/image/fetch/$s_!XtnG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png 1272w, https://substackcdn.com/image/fetch/$s_!XtnG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ed9e4b-1b1e-4385-8f59-38f0fd2f2c65_928x426.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><strong>Note: RL-Aligned vs. Generative RecSys</strong> It is important to distinguish this proposal from &#8220;Generative Recommendation&#8221; (where an LLM directly generates Item IDs as tokens). What I am proposing here is an evolution of the training paradigm, not a replacement of the inference engine. We move from minimizing classification error to maximizing policy reward. It just changes the training loss without any degradation of inference latency or increase in inference cost.</p><p></p><h2>3. 
LLM Alignment (RLHF, etc.) uses a reward model to maximize user value</h2><ul><li><p>Train a <strong>reward model</strong> that scores outputs.</p></li><li><p>Train the <strong>policy</strong> (the LLM) to maximize that reward using some form of policy gradient.</p></li><li><p>Use KL-regularization to keep the policy safe, stable, and within distribution.</p></li></ul><p>The key point: LLMs <strong>actively optimize</strong> against a reward model.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\max_{\\theta} E_{y \\sim \\pi_{\\theta}(\\cdot|x)}[R(x,y)]&quot;,&quot;id&quot;:&quot;CSGUUGOZKY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Training finds parameters &#952; such that the policy &#960;&#952;(y|x) assigns the highest probability to the outputs y that earn the highest reward under the reward model it is given.</p><h2>4. The RecSys ranking model is a probability estimator</h2><p>See <a href="https://recsysml.substack.com/p/ranking-models-explained-deep-dive">this post for a deep dive on Ranking models in recsys</a>. &#8220;Ranking&#8221; models are, as of today, trained to predict a <em>vector of probabilities</em> like:</p><ul><li><p>p(click)</p></li><li><p>p(engagement)</p></li><li><p>p(return next day)</p></li></ul><p>On top of these predictions, engineers write a <strong>hand-coded value function</strong>, for example: Value/Reward = a * p(click) + b * p(engage) + c * p(return next day)</p><p>This is a modular, transparent, and auditable solution that enables rapid, component-wise experimentation and deployment.</p><p>But critically:</p><blockquote><p><strong>RecSys ranking models do not optimize the value function.<br>They only predict the labels that go into it.</strong></p></blockquote><p>There is <em>no policy gradient step</em>. No RLHF stage.<br>No optimization w.r.t.
actual business or user value.</p><p>This is a <strong>fundamental architectural gap</strong>.</p><p>And unlike LLMs, RecSys systems operate in a near-live feedback loop:</p><ul><li><p>The policy (Ranking estimator + Value model) determines the user experience,</p></li><li><p>the experience determines user actions,</p></li><li><p>those actions populate the training data,</p></li><li><p>and the model is trained on that data again.</p></li></ul><p>Yet RecSys systems still treat the &#8220;ranking&#8221; model as a <em>static classifier</em> rather than as a <em>policy</em>.</p><h2>5. The Opportunity: RecSys Needs Its &#8220;RLHF Moment&#8221;</h2><p>The RecSys world already has all the ingredients LLMs needed for RLHF:</p><ul><li><p>a probability estimator (the ranking model)</p></li><li><p>a scalar value model (the downstream business/value estimator)</p></li><li><p>logged human preference data</p></li><li><p>a feedback loop</p></li><li><p>constraints on drift and safety (analogous to KL-regularization in LLMs)</p></li></ul><p>But RecSys stops short of the final step:</p><blockquote><p><strong>Treating the ranking model as a policy and training it to maximize reward.</strong></p></blockquote><p>Imagine a RecSys training pipeline where:</p><ol><li><p>The <strong>value model becomes the &#8220;reward model&#8221;</strong> (both inference reward model and training reward model).</p></li><li><p>The ranking model is updated to maximize this reward:</p><ol><li><p>The value model score for each item is computed using task predictions.
This is an invocation of the &#8220;inference reward model&#8221;.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s_i = \\sum\\limits_{t=1}^{T}\\text{vm}_t * p_{\\theta, i}(\\text{t})&quot;,&quot;id&quot;:&quot;YUJUCQRBLZ&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>From this we compute the probability of this item being ranked first (<a href="https://www.statisticshowto.com/plackett-luce-model/">Plackett-Luce model</a>). &#8216;N&#8217; refers to the number of items being ranked.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\pi_\\theta(i|x)= \\frac{e^{s_i}}{\\sum\\limits_{j=1}^{N} e^{s_j}}&quot;,&quot;id&quot;:&quot;HKXQZUGJXY&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>Optional but recommended: compute the probability of the logging policy ranking this item first. You will need to have logged the probabilities estimated by that model. If you don&#8217;t have them, you can assume &#960;&#946;(i|x) = 1.</p></li><li><p>Compute the inverse propensity estimate:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\rho_i = \\frac{\\pi_{\\theta}(i|x)}{\\pi_{\\beta}(i|x)}&quot;,&quot;id&quot;:&quot;AGYWPMHBFD&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>Compute an &#8220;observed reward&#8221;. We can use the &#8220;value model&#8221; as this training reward model.
For instance, this could be</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;r_i = \\sum\\limits_{t=1}^{T}\\text{vm}_t * \\text{user_action}_{i}(\\text{t})&quot;,&quot;id&quot;:&quot;KBYLXLHMUM&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>Add an RL loss that, when minimized, trains the ranking model to maximize the reward.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{RL}}(i) = - \\rho_i \\cdot r_i&quot;,&quot;id&quot;:&quot;QCLPFKQGNZ&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>Try improvements like GRPO (<a href="https://arxiv.org/abs/2402.03300">DeepSeekMath</a>) / ECPO (<a href="https://arxiv.org/abs/2506.13695v3">OneRec</a>) / GBPO (<a href="https://arxiv.org/abs/2508.20900">OneRec-V2</a>) / CISPO (<a href="https://arxiv.org/abs/2506.13585">Minimax-M1</a>) to improve the off-policy estimate of the reward. These methods reduce the variance of &#961;.</p></li><li><p>Note: The only terms in this loss that are affected by model parameters are the predictions, just as in the current ranking setup. The gradient flows back through the probability estimator p_&#952;(t).</p></li></ol></li></ol><h3>Why is this better?</h3><ol><li><p>The system optimizes the actual long-term metric end-to-end.</p></li><li><p>Training pays more attention to instances that actually deliver metrics instead of chasing accuracy in low-ROI parts of the training data.</p></li><li><p>This opens up a new growth lever for your recsys: improving the training reward model.</p></li></ol><p><strong>&#8220;Ranking&#8221; would finally become reward optimization.</strong></p><h2>6. 
Code implementation</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8UHA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8UHA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png 424w, https://substackcdn.com/image/fetch/$s_!8UHA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png 848w, https://substackcdn.com/image/fetch/$s_!8UHA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png 1272w, https://substackcdn.com/image/fetch/$s_!8UHA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8UHA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png" width="596" height="466.239010989011" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1139,&quot;width&quot;:1456,&quot;resizeWidth&quot;:596,&quot;bytes&quot;:323940,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/180256469?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8UHA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png 424w, https://substackcdn.com/image/fetch/$s_!8UHA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png 848w, https://substackcdn.com/image/fetch/$s_!8UHA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png 1272w, https://substackcdn.com/image/fetch/$s_!8UHA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b03e54-aa79-47e8-bf74-9639b72d6c49_1928x1508.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://github.com/gauravchak/reward_maximizing_ranking/tree/main&quot;,&quot;text&quot;:&quot;See code here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://github.com/gauravchak/reward_maximizing_ranking/tree/main"><span>See code here</span></a></p><h2>7. 
Closing Thoughts</h2><p>LLMs and RecSys systems share a deeper architectural similarity than most people realize.<br>Pretraining mirrors retrieval.<br>Policy sampling at inference mirrors top-K ranking.<br>Reward models mirror value models.</p><p>But LLMs have unlocked remarkable capabilities through <strong>policy-gradient-based preference optimization</strong>,<br>while RecSys still primarily relies on <strong>probability estimation + handcrafted value models</strong>.</p><p>The RecSys field is on the cusp of the same evolution.</p><blockquote><p><strong>The next major leap in recommender systems will come when ranking models shift from estimating probabilities to maximizing reward&#8212;just like modern LLMs.</strong></p></blockquote><p>We already have retrieval.<br>We already have value models.<br>We already have logged user preferences.<br>We already have constraints and safety layers.</p><p>All that&#8217;s missing is the optimization layer.</p><p>RecSys is ready for its RLHF moment.</p><p></p><p></p><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</em></p>]]></content:encoded></item><item><title><![CDATA[Learning ML by doing Part1 | GPT-2]]></title><description><![CDATA[Replicating the 124M parameter model on a single consumer GPU.]]></description><link>https://recsysml.substack.com/p/training-gpt-2-on-a-budget</link><guid isPermaLink="false">https://recsysml.substack.com/p/training-gpt-2-on-a-budget</guid><dc:creator><![CDATA[Vish Sangale]]></dc:creator><pubDate>Sat, 20 Dec 2025 14:02:40 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/182108517/34a78d925a82866290b3bd768fed4154.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><strong><a href="https://github.com/vishsangale/gpt-2">Link to GitHub Repository</a></strong></p><div><hr></div><blockquote><p><em>&#8220;What I cannot create, I do not understand.&#8221; &#8212; Richard Feynman</em></p></blockquote><p>In the age of massive API-based LLMs, the art of training your own model from scratch can feel like lost knowledge. It is often assumed that you need a cluster of H100s to do anything meaningful. 
But I wanted to challenge that assumption.</p><p>My goal was simple yet ambitious: <strong>Replicate the 124M parameter GPT-2 Small model from scratch on a single consumer GPU (RTX 5080) and engineer it to run as fast as possible.</strong></p><p>This post describes the journey from a blank Python file to a highly optimized training pipeline that processes over <strong>92,000 tokens per second</strong>&#8212;a <strong>3.1x speedup</strong> over the baseline implementation.</p><h2>1. The Architecture: What is GPT-2?</h2><p>At its core, <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">GPT-2</a> is a <strong>Decoder-only Transformer</strong>. Unlike BERT (which uses an Encoder) or T5 (Encoder-Decoder), GPT-2 is built to predict the <em>next token</em> in a sequence. 
This autoregressive property is what makes it a &#8220;Generative&#8221; model.</p><p>Our implementation (<code>model.py</code>) mirrors the original OpenAI specifications for the 124M model:</p><ul><li><p><strong>Parameters:</strong> ~124 Million</p></li><li><p><strong>Layers:</strong> 12 Transformer Blocks</p></li><li><p><strong>Attention Heads:</strong> 12</p></li><li><p><strong>Embedding Dimension:</strong> 768</p></li><li><p><strong>Context Length:</strong> 1024 tokens</p></li><li><p><strong>Tokenizer:</strong> Byte Pair Encoding (BPE) using OpenAI&#8217;s <code>tiktoken</code> library.</p></li></ul><h3>Key Components</h3><ol><li><p><strong>Causal Self-Attention:</strong> The heart of the model. It allows each token to attend to previous tokens but <em>masks</em> future tokens so the model can&#8217;t &#8220;cheat&#8221; by seeing the answer.</p></li><li><p><strong>Learned Positional Embeddings:</strong> Since Transformers process tokens in parallel, they have no inherent sense of order. We learn a vector for each position (0 to 1023) to give the model spatial awareness.</p></li><li><p><strong>Weight Tying:</strong> A clever memory-saving trick where the embedding layer weights are reused for the final output projection.</p></li></ol><h2>2. The Implementation</h2><p>We started with a clean slate. The dataset of choice was <strong>TinyShakespeare</strong>, a classic character modeling benchmark that fits easily into memory but is complex enough to learn grammar and structure.</p><p>Everything was written in pure PyTorch. 
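The standard loop described next can be sketched in a few lines of PyTorch. The tiny model below is a stand-in for illustration only, not the repo's actual GPT-2:

```python
import torch
import torch.nn.functional as F

vocab_size = 100  # toy vocab; the real run uses tiktoken's 50,257 BPE tokens

# Stand-in model: embedding + linear head instead of the full 12-layer GPT-2.
model = torch.nn.Sequential(torch.nn.Embedding(vocab_size, 32),
                            torch.nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(5):
    batch = torch.randint(0, vocab_size, (4, 16))      # 1. sample a batch of token ids
    inputs, targets = batch[:, :-1], batch[:, 1:]      #    targets are the shifted input
    logits = model(inputs)                             # 2. forward pass (compute logits)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                           targets.reshape(-1))        # 3. cross-entropy vs. targets
    optimizer.zero_grad()
    loss.backward()                                    # 4. backward pass (gradients)
    optimizer.step()                                   # 5. optimizer step (update weights)
```
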
The training loop was standard:</p><ol><li><p>Sample a batch of text.</p></li><li><p>Forward pass (compute logits).</p></li><li><p>Compute Cross-Entropy Loss against the targets (shifted input).</p></li><li><p>Backward pass (compute gradients).</p></li><li><p>Optimizer step (update weights).</p></li></ol><h3>The &#8220;Naive&#8221; Baseline</h3><p>My first attempt was functional but fragile.</p><ul><li><p><strong>Batch Size:</strong> 4 (The GPU ran out of memory at 8 or 12).</p></li><li><p><strong>Throughput:</strong> ~29,000 tokens/second.</p></li><li><p><strong>GPU Utilization:</strong> Spiky and inefficient.</p></li></ul><p>It worked, but it was painfully slow. At this rate, convergence would take days.</p><h2>3. Engineering the Speedup &#128640;</h2><p>I didn&#8217;t want to wait days. I wanted to engineer my way out of the bottleneck. Here is how we optimized the pipeline to achieve a <strong>3.1x speedup</strong>.</p><h3>Flash Attention 2 &#9889;</h3><p>The standard attention mechanism requires calculating an N&#215;N matrix (Attention Scores). For long sequences, this consumes massive amounts of VRAM (O(<code>N^2</code>) memory complexity).</p><p>We switched to <code>torch.nn.functional.scaled_dot_product_attention</code>, which leverages <strong>Flash Attention 2</strong>. This fused kernel computes attention without materializing the full matrix in high-bandwidth memory.</p><ul><li><p><strong>Result:</strong> Memory usage plummeted. I could instantly double the batch size from 4 to 8, and later 16.</p></li></ul><h3>Vocabulary Padding &#128207;</h3><p>The standard GPT-2 vocabulary size is <strong>50,257</strong>. 
This is an odd number that plays poorly with GPU hardware, which prefers powers of 2 (or multiples of 64/128) for efficient tiling.</p><ul><li><p><strong>Fix:</strong> We padded the vocabulary to <strong>50,304</strong> (the nearest multiple of 64).</p></li><li><p><strong>Result:</strong> The final linear projection layer became significantly faster due to memory alignment.</p></li></ul><h3>Feeding the Beast (Data Loading) &#127869;&#65039;</h3><p>Profiling via <code>nvidia-smi</code> showed the GPU dropping to 0% utilization periodically. This meant the GPU was starving&#8212;waiting for the CPU to load the next batch of data.</p><ul><li><p><strong>Fix:</strong> We implemented a custom <code>GPT2Dataset</code> and used PyTorch&#8217;s <code>DataLoader</code> with <code>num_workers=4</code> and <code>pin_memory=True</code>.</p></li><li><p><strong>Result:</strong> Asynchronous data prefetching kept the GPU pinned at 99% usage. Throughput jumped to <strong>63k tokens/sec</strong>.</p></li></ul><h3>Fused AdamW &#128293;</h3><p>The AdamW optimizer updates all 124 million parameters. Doing this sequentially in Python is slow due to interpreter overhead.</p><ul><li><p><strong>Fix:</strong> We set <code>fused=True</code> in <code>torch.optim.AdamW</code>.</p></li><li><p><strong>Result:</strong> The entire optimizer step is launched as a single CUDA kernel, removing CPU overhead.</p></li></ul><h3><code>torch.compile</code> (The Final Boss) &#128736;&#65039;</h3><p>PyTorch 2.0 introduced <code>torch.compile</code>, which JIT-compiles your model into optimized kernels (Triton). It fuses operations (like LayerNorm + Linear) to reduce memory access.</p><ul><li><p><strong>Result:</strong> This provided the final massive boost, pushing us over <strong>92,000 tokens/sec</strong>.</p></li></ul><h2>4. 
The Results &#128200;</h2><p>After all optimizations, the training run was stable and lightning-fast.</p><p><strong>Loss Curve:</strong> Over 150 iterations, loss drops steadily, showing the model converging.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HM91!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HM91!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png 424w, https://substackcdn.com/image/fetch/$s_!HM91!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png 848w, https://substackcdn.com/image/fetch/$s_!HM91!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png 1272w, https://substackcdn.com/image/fetch/$s_!HM91!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HM91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png" width="400" height="367" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:367,&quot;width&quot;:400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19456,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vishsangale.substack.com/i/181642131?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HM91!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png 424w, https://substackcdn.com/image/fetch/$s_!HM91!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png 848w, https://substackcdn.com/image/fetch/$s_!HM91!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png 1272w, https://substackcdn.com/image/fetch/$s_!HM91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6024979a-3383-4f6d-9124-6c2f1634f35a_400x367.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Performance Breakdown:</strong></p><h3>What did it learn?</h3><p>After training for just a few hours, the model learned to generate Shakespearean-style text. It captures the vocabulary, the archaic grammar, and the dramatic structure of plays (Speakers, dialogue, stage directions).</p><p><strong>Sample Output:</strong></p><pre><code><code>KING RICHARD II:
What implies this silence?
Tell me, what is the matter?

BOLINGBROKE:
My lord, I have acceptance of your grace,
And with a soul of love, I do beseech needed
remedy.
</code></code></pre><h2>Next Steps:</h2><p>As a next step, I am going to update the architecture with recent advancements:</p><h3><strong>1. Rotary Positional Embeddings (RoPE) &#128260;</strong></h3><ul><li><p><strong>What it replaces</strong>: Learned Absolute Positional Embeddings (<code>nn.Embedding(block_size, n_embd)</code>).</p></li><li><p><strong>Benefits</strong>:</p><ul><li><p>Better generalization to sequence lengths longer than seen during training.</p></li><li><p>Relative position encoding property (tokens know how far apart they are).</p></li><li><p>Standard in Llama, PaLM, Mistral.</p></li></ul></li><li><p><strong>Implementation</strong>: Requires modifying <strong>CausalSelfAttention</strong> to rotate Q and K vectors.</p></li></ul><h3><strong>2. RMSNorm (Root Mean Square Normalization) &#128207;</strong></h3><ul><li><p><strong>What it replaces</strong>: <code>nn.LayerNorm</code>.</p></li><li><p><strong>Benefits</strong>:</p><ul><li><p>Computationally cheaper (re-scaling invariance, no mean calculation).</p></li><li><p>Often leads to slightly better training stability.</p></li></ul></li><li><p><strong>Implementation</strong>: Drop-in replacement for LayerNorm.</p></li></ul><h3><strong>3. SwiGLU Activation &#9889;</strong></h3><ul><li><p><strong>What it replaces</strong>: <code>GELU</code> (Gaussian Error Linear Unit).</p></li><li><p><strong>Benefits</strong>:</p><ul><li><p>Demonstrated better performance than GELU/ReLU in compute-matched experiments (PaLM paper).</p></li></ul></li><li><p><strong>Implementation</strong>: Changes the MLP structure from <code>x -&gt; gelu(x * W_1) W_2</code> to the gated form <code>x -&gt; (swish(x * W_1) &#8857; (x * W_3)) W_2</code>, where &#8857; is an elementwise product.</p><p>This adds parameters, so we usually reduce the hidden dimension from <code>4d</code> to <code>8/3d</code> (or similar) to keep parameter count roughly the same.</p></li></ul><h3><strong>4. 
Grouped Query Attention (GQA) &#127950;&#65039;</strong></h3><ul><li><p><strong>What it replaces</strong>: Multi-Head Attention (MHA).</p></li><li><p><strong>Benefits</strong>:</p><ul><li><p>Massively reduces KV cache size during inference.</p></li><li><p>Faster decoding speed at inference time.</p></li><li><p>Slightly degrades quality compared to MHA, but a huge efficiency win.</p></li></ul></li><li><p><strong>Implementation</strong>: Sharing Key/Value heads across multiple Query heads.</p></li></ul><h3><strong>5. Mixture of Experts (MoE) &#129504;</strong></h3><ul><li><p><strong>What it replaces</strong>: Dense MLP layers.</p></li><li><p><strong>Benefits</strong>:</p><ul><li><p>Scale total parameters (capacity) without increasing compute per token (only top-k experts active).</p></li><li><p>&#8220;Sparse&#8221; model.</p></li></ul></li><li><p><strong>Complexity</strong>: High. Requires complex routing logic and a load-balancing loss to prevent expert collapse. Might be overkill for a small 124M experiment, but fun to try.</p></li></ul><h2>Takeaway</h2><p>You don&#8217;t always need more GPUs. Sometimes you just need better engineering.</p><p>By respecting the hardware&#8212;aligning memory, fusing kernels, and loading data asynchronously&#8212;we more than tripled the throughput of the same card.</p><p>Check out the full code and try it yourself: <strong><a href="https://github.com/vishsangale/gpt-2">View Project on GitHub</a></strong></p><p>Connect with Vish on <a href="https://www.linkedin.com/in/vishsangale/">LinkedIn</a></p><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of or attributable to their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</em></p>]]></content:encoded></item><item><title><![CDATA[Recsys Retrieval ~ LLM Pretraining/SFT]]></title><description><![CDATA[Showing equivalence of Embedding based retrieval in recommender systems and next token prediction based pretraining and supervised fine tuning in LLMs, and how semantic/cluster ids are helping further]]></description><link>https://recsysml.substack.com/p/recsys-retrieval-llm-pretrainingsft</link><guid isPermaLink="false">https://recsysml.substack.com/p/recsys-retrieval-llm-pretrainingsft</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Sun, 30 Nov 2025 22:49:58 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/180265550/c3948b30337fec1c988161207a46387e.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>LLM pretraining is next-token prediction over ~100k vocabulary tokens.<br>RecSys retrieval is next-item prediction over millions of candidates. 
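As an illustrative sketch (shapes and names here are hypothetical, not either system's real code), the shared computation in both cases is a dot product against an embedding table followed by softmax cross-entropy:

```python
import torch
import torch.nn.functional as F

D, V = 64, 1000    # embedding dim; vocab size (LLM) or candidate corpus size (RecSys)
torch.manual_seed(0)
h = torch.randn(D)       # context embedding: transformer hidden state / user-tower output
W = torch.randn(D, V)    # token embedding matrix / item embedding table

logits = h @ W                          # logit_i = h . W[:, i], shape [V]
probs = torch.softmax(logits, dim=0)    # distribution over tokens / items
target = torch.tensor([42])             # observed next token / interacted item
loss = F.cross_entropy(logits.unsqueeze(0), target)   # same training objective
```

Swapping the interpretation of `h` and `W` is the only difference between the two settings.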
Otherwise they are the same mathematically.</p><p>Both systems:</p><ul><li><p>embed the context into a vector</p></li><li><p>compute dot-products between this vector and a large embedding matrix</p></li><li><p>produce a distribution over discrete items/tokens with softmax or sampled softmax</p></li><li><p>train using cross-entropy</p></li></ul><p>In fact, if you look at the final linear layer of an LLM, it <em>is</em> the token embedding matrix.<br>The model outputs logits by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{logit}_i = h^\\top W_i&quot;,&quot;id&quot;:&quot;ZHLBHFUNHH&quot;}" data-component-name="LatexBlockToDOM"></div><ol><li><p>h: the context embedded into a D-dimensional vector.</p></li><li><p>W [D, V]: the final linear layer, which produces V logits, one for each of the V possible tokens.</p></li></ol><p>Hence the logit (~ log probability) of token i in an LLM is just a function of the dot product of vectors &#8216;h&#8217; and &#8216;W[:, i]&#8217;. </p><p>If you think of W[:, i] as the item i&#8217;s embedding, this is exactly how <a href="https://recsysml.substack.com/p/scalable-embedding-based-retrieval">embedding based retrieval</a>, a.k.a. &#8220;<a href="https://recsysml.substack.com/p/two-tower-models-for-retrieval-of">Two Tower Model</a>&#8221;, computes similarity to candidates.</p><h2>Same loss function - just applied based on corpus size</h2><p>The loss is computed in two steps:</p><ol><li><p>Estimate the probability of each valid candidate (token or item in batch). Here V is the number of valid candidates during training. The expression below is also called &#8220;softmax&#8221;. See equation 2 in <a href="https://arxiv.org/pdf/1310.4546">&#8220;word2vec&#8221; paper Mikolov et al. 
2013</a></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P(i) = \\frac{e^{\\text{logit}_i}}{\\sum_{j=1}^{V} e^{\\text{logit}_j}}&quot;,&quot;id&quot;:&quot;DHHJTQIDFX&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>The loss measures how different these probabilities are from the ground-truth observation (the next token in the training data, or the item the user interacted with). This is also called &#8220;classification loss&#8221;.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}(i) = -\\log\\left(\\frac{e^{\\text{logit}_i}}{\\sum_{j=1}^{V} e^{\\text{logit}_j}}\\right)&quot;,&quot;id&quot;:&quot;ZCPLBRNMYP&quot;}" data-component-name="LatexBlockToDOM"></div></li></ol><p>LLMs can compute this loss since the number of valid tokens, V, is about 100k and this is small enough for modern GPUs to do the operations above. For industrial recommenders, with 100 million+ valid items to recommend, this is infeasible.</p><p>Mikolov et al. proposed two solutions for this:</p><ol><li><p>Hierarchical softmax</p></li><li><p>Sampled softmax: This is mostly what is used in recsys retrieval due to good results and training efficiency.</p></li></ol><h2>Semantic IDs</h2><p>Recent efforts using &#8220;semantic IDs&#8221; are trying to bring these even closer. &#8220;<a href="https://recsysml.substack.com/p/how-to-implement-generative-retrieval">Generative retrieval</a>&#8221; efforts like <a href="https://arxiv.org/abs/2305.05065">TIGER</a> try to bridge this gap with full softmax using hierarchical clustering.</p><p><a href="https://arxiv.org/abs/2509.03746">Another Google Research paper</a> shows that clustering is the key, not inference-time generation. It finds high information density pathways that allow retrieval to discard large irrelevant areas of the item corpus. 
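A rough sketch of that cluster-level idea (all names and shapes hypothetical): take a full softmax over cluster centroids, keep only the top clusters, and score just the items inside them:

```python
import torch

torch.manual_seed(0)
D, C, items_per_cluster = 64, 8, 100
h = torch.randn(D)                              # context/user embedding
centroids = torch.randn(C, D)                   # one centroid per item cluster
corpus = torch.randn(C, items_per_cluster, D)   # item embeddings grouped by cluster

# Full softmax over clusters is cheap because C << corpus size.
cluster_probs = torch.softmax(centroids @ h, dim=0)
top_clusters = torch.topk(cluster_probs, k=2).indices

# Only items in the top clusters are scored; the rest of the corpus is discarded.
candidates = corpus[top_clusters].reshape(-1, D)   # [2 * items_per_cluster, D]
top_items = torch.topk(candidates @ h, k=10).indices
```
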
Adding full softmax over item clusters in the current retrieval pipeline produces results as good or better than generative retrieval while being an order of magnitude faster.</p><p></p><h2>Prior posts to understand generative retrieval and semantic IDs</h2><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e239703f-77ab-4af4-8d98-594d9a3918ee&quot;,&quot;caption&quot;:&quot;Improving Recsys with GenAI&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How to implement Generative Retrieval&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5668214,&quot;name&quot;:&quot;Gaurav Chakravorty&quot;,&quot;bio&quot;:&quot;- Applied ML in Recommender systems (Facebook / Instagram, Google, Discord)\n- 20 years in Applied ML&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f40461d9-d0cc-4a2b-bc68-d46e2c022079_401x401.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:347886544,&quot;name&quot;:&quot;Samson Komo&quot;,&quot;bio&quot;:&quot;I geek on RecSys&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22214e93-54fd-4828-9d13-d76e1ee66166_144x144.png&quot;,&quot;is_guest&quot;:true,&quot;bestseller_tier&quot;:null,&quot;primaryPublicationSubscribeUrl&quot;:&quot;https://samsonkomo.substack.com/subscribe?&quot;,&quot;primaryPublicationUrl&quot;:&quot;https://samsonkomo.substack.com&quot;,&quot;primaryPublicationName&quot;:&quot;Samson 
Komo&quot;,&quot;primaryPublicationId&quot;:5110422}],&quot;post_date&quot;:&quot;2025-06-05T14:30:25.993Z&quot;,&quot;cover_image&quot;:&quot;https://substack-video.s3.amazonaws.com/video_upload/post/164981049/f0dd71d3-9b65-4fd9-85db-402d73c4a9ea/transcoded-373017.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://recsysml.substack.com/p/how-to-implement-generative-retrieval&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:&quot;f0dd71d3-9b65-4fd9-85db-402d73c4a9ea&quot;,&quot;id&quot;:164981049,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:20,&quot;comment_count&quot;:7,&quot;publication_id&quot;:274781,&quot;publication_name&quot;:&quot;Applied ML | Recommender systems&quot;,&quot;publication_logo_url&quot;:&quot;&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a189b209-d9a5-47fb-b802-f46f469ce169&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Building Generative Friend Recommendations&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5668214,&quot;name&quot;:&quot;Gaurav Chakravorty&quot;,&quot;bio&quot;:&quot;- Applied ML in Recommender systems (Facebook / Instagram, Google, Discord)\n- 20 years in Applied 
ML&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f40461d9-d0cc-4a2b-bc68-d46e2c022079_401x401.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-08-29T23:10:30.276Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/youtube/w_728,c_limit/gc0Jfq3njV8&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://recsysml.substack.com/p/building-generative-friend-recommendations&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:171806043,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:16,&quot;comment_count&quot;:0,&quot;publication_id&quot;:274781,&quot;publication_name&quot;:&quot;Applied ML | Recommender systems&quot;,&quot;publication_logo_url&quot;:&quot;&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;250008c2-03fe-44f6-bf8f-3cac1e0fd2d6&quot;,&quot;caption&quot;:&quot;The intent of the post is to explain the ranking model in detail to tee up future posts explaining how this should change learning from LLM advances.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Ranking Models explained: Deep Dive into RecSys Architecture (Features, Embeddings, &amp; Attention)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5668214,&quot;name&quot;:&quot;Gaurav Chakravorty&quot;,&quot;bio&quot;:&quot;- Applied ML in Recommender systems (Facebook / Instagram, Google, Discord)\n- 20 years in Applied 
ML&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f40461d9-d0cc-4a2b-bc68-d46e2c022079_401x401.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-11-30T19:48:02.356Z&quot;,&quot;cover_image&quot;:&quot;https://substack-video.s3.amazonaws.com/video_upload/post/180303168/41cd1fdc-9ad5-40fd-ba4d-97350d857d4d/transcoded-00001.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://recsysml.substack.com/p/ranking-models-explained-deep-dive&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:&quot;41cd1fdc-9ad5-40fd-ba4d-97350d857d4d&quot;,&quot;id&quot;:180303168,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:5,&quot;comment_count&quot;:1,&quot;publication_id&quot;:274781,&quot;publication_name&quot;:&quot;Applied ML | Recommender systems&quot;,&quot;publication_logo_url&quot;:&quot;&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</em></p>]]></content:encoded></item><item><title><![CDATA[Ranking Models explained: Deep Dive into RecSys Architecture (Features, Embeddings, & Attention)]]></title><description><![CDATA[Watch now (18 mins) | What happens post retrieval in a recommender system]]></description><link>https://recsysml.substack.com/p/ranking-models-explained-deep-dive</link><guid isPermaLink="false">https://recsysml.substack.com/p/ranking-models-explained-deep-dive</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Sun, 30 Nov 2025 19:48:02 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/180303168/94884a7e8795f21bfb4b816c899c4643.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>The intent of the post is to explain the ranking model in detail to tee up future posts explaining how this should change learning from LLM advances.</p><h2>High level structure</h2><p>Retrieval &#8212;&gt; a relatively small set of candidates &#8212;&gt; [ Ranking + Value Model ] &#8212;&gt; Sorted order presented to the user.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tmlh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tmlh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png 424w, 
https://substackcdn.com/image/fetch/$s_!Tmlh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png 848w, https://substackcdn.com/image/fetch/$s_!Tmlh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!Tmlh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tmlh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png" width="410" height="421.98830409356725" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1056,&quot;width&quot;:1026,&quot;resizeWidth&quot;:410,&quot;bytes&quot;:108284,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/180303168?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Tmlh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png 424w, https://substackcdn.com/image/fetch/$s_!Tmlh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png 848w, https://substackcdn.com/image/fetch/$s_!Tmlh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!Tmlh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3833c13b-ddda-4ded-bc17-acd777cd4718_1026x1056.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So &#8220;Ranking&#8221; = Estimator model + VM which:</p><ol><li><p>compute estimates of probabilities of user actions / user experience labels on presenting this item</p></li><li><p>use a <a href="https://recsysml.substack.com/p/declarative-value-model-tuning?utm_source=publication-search">Value model</a> to compute a weighted sum of these probabilities into a single score and rank with it.</p></li></ol><h2>Ranking model architecture</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W1wO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W1wO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png 424w, https://substackcdn.com/image/fetch/$s_!W1wO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png 848w, https://substackcdn.com/image/fetch/$s_!W1wO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png 1272w, 
https://substackcdn.com/image/fetch/$s_!W1wO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W1wO!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png" width="1280" height="988.1318681318681" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1124,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1280,&quot;bytes&quot;:204128,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/180303168?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!W1wO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png 424w, https://substackcdn.com/image/fetch/$s_!W1wO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png 848w, 
https://substackcdn.com/image/fetch/$s_!W1wO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png 1272w, https://substackcdn.com/image/fetch/$s_!W1wO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05818ea-46a8-4537-a531-dfc93f60794e_1853x1430.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Explanation of terms:</p><ol><li><p>&#8220;float features&#8221; are typically single values, like number of mutual friends between viewer and 
target in <a href="https://recsysml.substack.com/p/friend-recommendation-retrieval-in">friend recommendations</a>, or the cosine similarity between graph neural network embeddings of viewer and target. These are also sometimes called &#8220;dense features&#8221;.</p></li><li><p>&#8220;embedding features&#8221; are full embeddings, like the 128 floating-point values of a 128-dimensional graph neural network embedding. These are not simply treated as &#8220;float features&#8221; because processing them in the model as a vector/tensor leads to better model accuracy.</p></li><li><p>&#8220;sparse features&#8221; can be either class-based features, like the category-id of a song in music recommendation, or lists of ids, like &#8220;last N user ids messaged by the user&#8221;, if they are summed into a single embedding rather than retained as a list.</p></li><li><p>&#8220;User history sequences&#8221; are also typically lists of ids, like &#8220;last N user ids messaged by the user&#8221;, but here the entire sequence is retained and available to the model.</p></li><li><p>The &#8220;Interaction Arch&#8221; can be thought of as processing the non-sequence features into a &#8220;user static pathway&#8221;, to borrow the term from the <a href="https://arxiv.org/pdf/2506.13695">OneRec Technical Report</a>.</p></li><li><p>To read the dimensions in the image above: for example, [B, E, D] refers to [batch size, number of embeddings, embedding dimension], where D is a fixed dimension shared by most embeddings, similar to d_model in LLMs or <a href="https://arxiv.org/pdf/2506.13695">OneRec</a>.</p></li></ol><h1>Prior posts on ranking</h1><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;4bd94174-0cd4-4a95-bfde-405cabeb6f09&quot;,&quot;caption&quot;:&quot;We show the importance of calibration in ranking models and how to implement it efficiently.&quot;,&quot;cta&quot;:&quot;Read full 
story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Ranking model calibration in recommender systems&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5668214,&quot;name&quot;:&quot;Gaurav Chakravorty&quot;,&quot;bio&quot;:&quot;- Applied ML in Recommender systems (Facebook / Instagram, Google, Discord)\n- 20 years in Applied ML&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f40461d9-d0cc-4a2b-bc68-d46e2c022079_401x401.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:1502916,&quot;name&quot;:&quot;Marc Ferradou&quot;,&quot;bio&quot;:&quot;Recsys + neovim = &#10084;&#65039;&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552c29ae-6319-4174-a8f6-84c5998eca15_512x512.png&quot;,&quot;is_guest&quot;:true,&quot;bestseller_tier&quot;:null,&quot;primaryPublicationSubscribeUrl&quot;:&quot;https://codingisfun.substack.com/subscribe?&quot;,&quot;primaryPublicationUrl&quot;:&quot;https://codingisfun.substack.com&quot;,&quot;primaryPublicationName&quot;:&quot;Coding is 
Fun&quot;,&quot;primaryPublicationId&quot;:2670520}],&quot;post_date&quot;:&quot;2024-06-09T00:27:32.696Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://recsysml.substack.com/p/ranking-model-calibration-in-recommender&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144999165,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:22,&quot;comment_count&quot;:2,&quot;publication_id&quot;:274781,&quot;publication_name&quot;:&quot;Applied ML | Recommender systems&quot;,&quot;publication_logo_url&quot;:&quot;&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;fef6e231-757b-41d5-96fa-f0abbe93e8f1&quot;,&quot;caption&quot;:&quot;- Jointly with Ameya Raul author of Conformity-Aware Multi-task Ranking&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Reducing selection bias / popularity bias in ranking&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5668214,&quot;name&quot;:&quot;Gaurav Chakravorty&quot;,&quot;bio&quot;:&quot;- Applied ML in Recommender systems (Facebook / Instagram, Google, Discord)\n- 20 years in Applied ML&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f40461d9-d0cc-4a2b-bc68-d46e2c022079_401x401.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:197346736,&quot;name&quot;:&quot;Ameya Raul&quot;,&quot;bio&quot;:&quot;Ranking engineer in Video recommendations at 
Meta&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f88810c-f5a7-4bd0-870c-e1c58aca7201_144x144.png&quot;,&quot;is_guest&quot;:true,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-01-20T15:15:12.475Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F385c0174-b4ce-49f9-a60c-79d5afa8821f_1542x1120.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://recsysml.substack.com/p/reducing-selection-bias-popularity&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:140439645,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:0,&quot;publication_id&quot;:274781,&quot;publication_name&quot;:&quot;Applied ML | Recommender systems&quot;,&quot;publication_logo_url&quot;:&quot;&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;c2481555-c9b7-4c7b-886b-690ac471fe28&quot;,&quot;caption&quot;:&quot;Summary&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Does your model get better at task T when you rank by estimated probability p(T) ?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5668214,&quot;name&quot;:&quot;Gaurav Chakravorty&quot;,&quot;bio&quot;:&quot;- Applied ML in Recommender systems (Facebook / Instagram, Google, Discord)\n- 20 years in Applied 
ML&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f40461d9-d0cc-4a2b-bc68-d46e2c022079_401x401.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-12-22T19:38:09.490Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/youtube/w_728,c_limit/IHH47nZ7FZU&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://recsysml.substack.com/p/does-your-model-get-better-at-task&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:139870906,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:6,&quot;comment_count&quot;:0,&quot;publication_id&quot;:274781,&quot;publication_name&quot;:&quot;Applied ML | Recommender systems&quot;,&quot;publication_logo_url&quot;:&quot;&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p><p></p><p><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</p>]]></content:encoded></item><item><title><![CDATA[Building Generative Friend Recommendations]]></title><description><![CDATA[OneRec has established a blueprint for Generative Video Recs. 
This post shows what to change for Generative Friend Recommendations]]></description><link>https://recsysml.substack.com/p/building-generative-friend-recommendations</link><guid isPermaLink="false">https://recsysml.substack.com/p/building-generative-friend-recommendations</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Fri, 29 Aug 2025 23:10:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/gc0Jfq3njV8" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;def92794-7e21-402c-a1b7-0bf22f40c4ff&quot;,&quot;duration&quot;:null}"></div><p>The main difference between <a href="https://recsysml.substack.com/p/personalized-short-video-recommender">recommending videos</a> and <a href="https://recsysml.substack.com/p/friend-recommendation-retrieval-in">recommending friends</a> is that in friend recs we are working with 1000 times less positive signal, most of it delayed.</p><p>Of the five parts of <a href="https://recsysml.substack.com/p/how-to-implement-generative-retrieval">generative recs</a> illustrated in the <a href="https://arxiv.org/abs/2506.13695">OneRec Technical Report</a>, namely (a) Semantic Embeddings, (b) Tokenization, (c) Modeling, (d) Training Losses for modeling, and (e) Reward modeling<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, <strong>Semantic Embeddings</strong> and <strong>Training Losses</strong> are chiefly the ones that need to change when building generative recommendations to <a href="https://recsysml.substack.com/p/friend-recommendation-retrieval-in">recommend friends</a> like &#8220;people you may know&#8221;. 
<br>(Yes, this is a simplified view, but one that helps you get to a good-enough MVP.)</p><p><strong>Outline:</strong> In the rest of the post:</p><ol><li><p>We explain the seismic shift happening in the recommender systems industry after the OneRec paper, and why.</p></li><li><p>A high-level summary of OneRec</p></li><li><p>Which parts need to change for Generative Friend Recs, and why</p></li></ol><div id="youtube2-gc0Jfq3njV8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;gc0Jfq3njV8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/gc0Jfq3njV8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>^ video explaining the post and presenting slides</p><h2>Generative recs is changing the world of recsys</h2><p><a href="https://arxiv.org/abs/2506.13695">OneRec</a> achieves lower system complexity, better app stay time, and lower organizational investment using generative 
recommendations.</p><p>To elaborate, <a href="https://arxiv.org/abs/2305.05065">Youtube&#8217;s TIGER paper</a> demonstrated that if we can represent the videos to be recommended in under 100K tokens<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, we can use <a href="https://recsysml.substack.com/p/how-to-implement-generative-retrieval">LLM models to recommend</a>. <a href="https://arxiv.org/abs/2506.13695">OneRec</a> goes one step further and builds a reward model that enables them to replace the <a href="https://recsysml.substack.com/p/early-stage-ranking-in-recommender">entire recommender and not just retrieval</a>. Now they can retire their multi-stage recommender system and just use this LLM-like model end to end to directly output a list of recommendation items.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bfAR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bfAR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png 424w, https://substackcdn.com/image/fetch/$s_!bfAR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png 848w, 
https://substackcdn.com/image/fetch/$s_!bfAR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png 1272w, https://substackcdn.com/image/fetch/$s_!bfAR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bfAR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png" width="1456" height="485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58024f06-6d59-4c71-aa65-185628115b87_2076x692.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108034,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/171806043?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bfAR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png 424w, 
https://substackcdn.com/image/fetch/$s_!bfAR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png 848w, https://substackcdn.com/image/fetch/$s_!bfAR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png 1272w, https://substackcdn.com/image/fetch/$s_!bfAR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58024f06-6d59-4c71-aa65-185628115b87_2076x692.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Fig 1: From-To states of OneRec</figcaption></figure></div><p>Most big-tech recsys teams are making generative recommendations their big bet, for three reasons:</p><ol><li><p>OneRec collapses recsys from the traditional multi-stage process into a single-stage process, which makes organizational investment much more streamlined. If OneRec can do it all, you no longer need separate <a href="https://recsysml.substack.com/p/two-tower-models-for-retrieval-of">retrieval</a>, <a href="https://recsysml.substack.com/p/early-stage-ranking-in-recommender">early-stage ranking</a>, final-stage ranking, <a href="https://recsysml.substack.com/p/declarative-value-model-tuning">value modeling</a>, and list-generation (a.k.a. post-ranking) teams.</p></li><li><p>This makes the recommendation system amenable to being &#8220;driven&#8221; by product. Imagine you don&#8217;t want viral, clickbaity videos: that used to be a costly multi-team effort, but with OneRec, specifying it in the reward works. Likewise, if you want the recommender system to drive Daily Active Users instead of sessions (or app opens), you can do that by shaping the reward.</p></li><li><p>The single most effective ML strategy has been, metaphorically, to get on trains others have built for you and ride them close to your destination. The LLM world is building such a train, with optimized kernels and infrastructure. 
If recsys can ride it, that can unlock a step change, as OneRec&#8217;s results demonstrate.</p></li></ol><h2>High-level summary of OneRec</h2><p>Let&#8217;s start with an overall schematic, excluding reward modeling since it does not change for Friend-Recs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3wcw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3wcw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png 424w, https://substackcdn.com/image/fetch/$s_!3wcw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png 848w, https://substackcdn.com/image/fetch/$s_!3wcw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png 1272w, https://substackcdn.com/image/fetch/$s_!3wcw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3wcw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png" width="1456" height="865" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:165356,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/171806043?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3wcw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png 424w, https://substackcdn.com/image/fetch/$s_!3wcw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png 848w, https://substackcdn.com/image/fetch/$s_!3wcw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png 1272w, https://substackcdn.com/image/fetch/$s_!3wcw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d41396c-0019-43dd-ae5a-e8ca245487f6_1834x1090.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 2: Process everything we know about the actor into a summary. The decoder then attends to this summary and starts generating tokens. All we need now is a way to tokenize the recommendable items (videos, friends, shopping items, whatever) into tokens.</figcaption></figure></div><p>The parts of OneRec:</p><ol><li><p>Tokenization of recommendable items into a &#8220;vocabulary&#8221; of a few thousand tokens, so that LLM-like generation can train reliably.</p><ol><li><p>Semantic Embeddings: this is a sort of high-dimensional postcode for each recommendable item that captures similarity in this domain.</p></li><li><p>Multi-stage &#8220;coarse-to-fine&#8221; clustering to create tokens of the item. 
This is similar to an e-commerce product catalog, or the Yahoo.com homepage in the 2000s!</p></li><li><p><em>Note that if your recsys has fewer than 100K items, you can skip this step and just use ids as tokens.</em></p></li></ol></li><li><p>How to summarize user history (a.k.a. the &#8220;Encoder&#8221;)</p><ol><li><p>There is no use of tokens in OneRec&#8217;s encoder<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. In some ways it is actually simpler than <a href="https://arxiv.org/abs/2402.17152">HSTU</a> by Zhai et al., a state-of-the-art user-history encoder from Meta.</p></li><li><p>What I appreciate is how they have designed it from the standard LLM block of Self-Attention &#8212;&gt; FeedForward, enabling common Triton kernels and infra optimizations from LLMs to be reused.</p></li></ol></li><li><p>How to generate (a.k.a. the &#8220;Decoder&#8221;)</p><ol><li><p>The decoder attends to the summary from step 2 via Cross-Attention and generates the tokens of the recommended item one token at a time.</p></li></ol></li></ol><p>Note that this alone is not enough to produce high-quality recommendations. This quote from OneRec</p><blockquote><p>&#8220;The pre-trained model only fits the distribution of the exposed item space through next token prediction, and the exposed items are obtained from the past traditional recommendation system.&#8221;</p></blockquote><p>indicates that reward modeling and preference alignment are critical. 
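</p><p>To make the quoted pre-training objective concrete: it is ordinary next-token cross-entropy, except the &#8220;vocabulary&#8221; is the codebook of semantic-ID tokens rather than words. Below is a minimal, framework-free sketch; the level count and codebook size are illustrative assumptions, not OneRec&#8217;s published configuration.</p>

```python
import numpy as np

# Illustrative sizes only (assumptions, not OneRec's published config):
# 3 coarse-to-fine codebook levels, 256 codes per level.
NUM_LEVELS, CODEBOOK_SIZE = 3, 256

def next_token_loss(logits, target_codes):
    """Mean cross-entropy of the ground-truth semantic-ID tokens.

    logits:       (batch, NUM_LEVELS, CODEBOOK_SIZE) decoder outputs
    target_codes: (batch, NUM_LEVELS) integer ground-truth codes
    """
    # Numerically stable log-softmax over the codebook dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick out the log-probability of the correct code at every level.
    b, l = np.indices(target_codes.shape)
    return -log_probs[b, l, target_codes].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, NUM_LEVELS, CODEBOOK_SIZE))
targets = rng.integers(0, CODEBOOK_SIZE, size=(4, NUM_LEVELS))
loss = next_token_loss(logits, targets)
```

<p>A handy sanity check when wiring this up: with all-zero logits the loss reduces to log(CODEBOOK_SIZE), the uniform-guessing baseline.</p><p>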
We are not going further into them in this post because they stay largely the same in Friend-Recs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-RCD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-RCD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png 424w, https://substackcdn.com/image/fetch/$s_!-RCD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png 848w, https://substackcdn.com/image/fetch/$s_!-RCD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png 1272w, https://substackcdn.com/image/fetch/$s_!-RCD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-RCD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png" width="1300" height="1114" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1114,&quot;width&quot;:1300,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:172849,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/171806043?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-RCD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png 424w, https://substackcdn.com/image/fetch/$s_!-RCD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png 848w, https://substackcdn.com/image/fetch/$s_!-RCD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png 1272w, https://substackcdn.com/image/fetch/$s_!-RCD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff99a660f-9311-4e82-8986-44c44dcbe2a9_1300x1114.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 3: High level components of Generative recommendation</figcaption></figure></div><p></p><h2>How Generative Friend-Recs differs from Generative Video-Recs.</h2><h3>The process of creating Semantic Embeddings is different because the &#8220;item&#8221; is also a &#8220;user&#8221;.</h3><p><strong>Closeness is 3-way and not 2-way</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cmeK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!cmeK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png 424w, https://substackcdn.com/image/fetch/$s_!cmeK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png 848w, https://substackcdn.com/image/fetch/$s_!cmeK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png 1272w, https://substackcdn.com/image/fetch/$s_!cmeK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cmeK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png" width="356" height="308.80691642651294" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:602,&quot;width&quot;:694,&quot;resizeWidth&quot;:356,&quot;bytes&quot;:53431,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/171806043?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cmeK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png 424w, https://substackcdn.com/image/fetch/$s_!cmeK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png 848w, https://substackcdn.com/image/fetch/$s_!cmeK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png 1272w, https://substackcdn.com/image/fetch/$s_!cmeK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fcb86b-60c7-48a6-92b1-6968543ac64b_694x602.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 4: When the actor positively interacts with a new item, we can infer similarity between this new item and a previous item the same actor recently interacted with.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!erLc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!erLc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png 424w, https://substackcdn.com/image/fetch/$s_!erLc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png 848w, https://substackcdn.com/image/fetch/$s_!erLc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png 1272w, https://substackcdn.com/image/fetch/$s_!erLc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!erLc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png" width="368" height="323.04545454545456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:618,&quot;width&quot;:704,&quot;resizeWidth&quot;:368,&quot;bytes&quot;:61180,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/171806043?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!erLc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png 424w, https://substackcdn.com/image/fetch/$s_!erLc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png 848w, https://substackcdn.com/image/fetch/$s_!erLc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png 1272w, https://substackcdn.com/image/fetch/$s_!erLc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22eec902-876e-45cf-a6e2-f2876e3d9c58_704x618.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Fig 5: When the actor makes a new &#8220;friend&#8221;, we can add a similarity loss not just from a previous friend of the actor to the new friend, but also from the actor to the new friend. 
The actor and friends are all the same type of entity.</figcaption></figure></div><p></p><p><strong>Grounding in demographic features / entities relevant to the use case.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YQfV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YQfV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png 424w, https://substackcdn.com/image/fetch/$s_!YQfV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png 848w, https://substackcdn.com/image/fetch/$s_!YQfV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png 1272w, https://substackcdn.com/image/fetch/$s_!YQfV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YQfV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png" width="1252" height="750" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1252,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112262,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/171806043?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YQfV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png 424w, https://substackcdn.com/image/fetch/$s_!YQfV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png 848w, https://substackcdn.com/image/fetch/$s_!YQfV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png 1272w, https://substackcdn.com/image/fetch/$s_!YQfV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf241683-4a42-4586-8889-be4f42a6a1b3_1252x750.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 6: Before applying the collaborative loss, OneRec starts with a representation grounded in content understanding (blue part of the image). Green shows that the features and auxiliary losses in social recommendations are likely to be different, requiring the training of a Social Foundation Model for embeddings before applying the 3-way closeness loss of Fig 5.</figcaption></figure></div><p></p><h3>Training losses: more losses per training example, since we have 1000X fewer examples</h3><p><strong>Problems:</strong></p><ol><li><p>Unlike video recs, where platforms have 10+ billion video watches to train on every day, friend-recommendation events are far fewer. </p></li><li><p>We have two embeddings to learn: the item-id embedding and the semantic-id (STU) embedding. 
We need more signal.</p></li></ol><p><strong>Solution for problem #2:</strong> we add two other losses (Fig 7 &#8212;&gt; Fig 8).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q3MQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q3MQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png 424w, https://substackcdn.com/image/fetch/$s_!q3MQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png 848w, https://substackcdn.com/image/fetch/$s_!q3MQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!q3MQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q3MQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png" width="636" height="621.9188191881918" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1060,&quot;width&quot;:1084,&quot;resizeWidth&quot;:636,&quot;bytes&quot;:79406,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/171806043?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q3MQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png 424w, https://substackcdn.com/image/fetch/$s_!q3MQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png 848w, https://substackcdn.com/image/fetch/$s_!q3MQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!q3MQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb228ed39-9277-4e21-8290-a276fa2a8a81_1084x1060.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 7: As in LLMs, the main loss in OneRec is the classification loss on the ground-truth tokens.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xVjG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xVjG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png 424w, 
https://substackcdn.com/image/fetch/$s_!xVjG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png 848w, https://substackcdn.com/image/fetch/$s_!xVjG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!xVjG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xVjG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png" width="1456" height="1184" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156848,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/171806043?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!xVjG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png 424w, https://substackcdn.com/image/fetch/$s_!xVjG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png 848w, https://substackcdn.com/image/fetch/$s_!xVjG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!xVjG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78164a6-616c-4a78-87fe-d0ddd2622816_1456x1184.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 8: We can also augment with two other losses which, although easier than the token generation loss, help improve representation learning.</figcaption></figure></div><p><strong>Summary:</strong></p><ol><li><p>The true token generation loss is the classification loss for the target.</p></li><li><p>In-batch softmax loss from the codebook embeddings of the target can help separate true positives from weak negatives.</p></li><li><p>In-batch softmax loss from the id embeddings of the target can provide additional signal for id-representation learning. These embeddings are used in the encoder.</p></li></ol><p><strong>Solution for problem #1</strong>: pretrain from user history splices and compute, potentially, all three losses for each spliced target. 
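</p><p>To make the three losses concrete, here is a minimal pure-Python sketch: token generation cross-entropy (L_AR), in-batch softmax against codebook embeddings (L_code), and in-batch softmax against id embeddings (L_target). All shapes, variable names, the random "weights", and the unweighted sum are illustrative assumptions, not details from OneRec.</p>

```python
import math
import random

def log_softmax(logits):
    # numerically stable log-softmax over a list of floats
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cross_entropy(logits, target_idx):
    # classification loss for one STU token of the target (L_AR piece)
    return -log_softmax(logits)[target_idx]

def in_batch_softmax(h_batch, emb_batch, tau=0.1):
    # InfoNCE-style loss: row i of h_batch should match row i of emb_batch;
    # the other rows in the batch serve as negatives (L_code / L_target piece)
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    losses = []
    for i, h in enumerate(h_batch):
        sims = [dot(h, e) / tau for e in emb_batch]
        losses.append(-log_softmax(sims)[i])
    return sum(losses) / len(losses)

random.seed(0)
B, d, C = 4, 8, 16                      # batch, model dim, codebook size
rand_vec = lambda n: [random.gauss(0, 1) for _ in range(n)]
h = [rand_vec(d) for _ in range(B)]     # encoder hidden states h_i
z = [rand_vec(d) for _ in range(B)]     # codebook embeddings of the targets
e = [rand_vec(d) for _ in range(B)]     # id-embedding-table rows of the targets

# one STU-token classification head, d -> C logits (weights are illustrative)
W = [rand_vec(C) for _ in range(d)]
logits = [[sum(h[i][k] * W[k][c] for k in range(d)) for c in range(C)]
          for i in range(B)]
targets = [random.randrange(C) for _ in range(B)]

L_AR = sum(cross_entropy(logits[i], targets[i]) for i in range(B)) / B
L_code = in_batch_softmax(h, z)
L_target = in_batch_softmax(h, e)
total = L_AR + L_code + L_target        # per-loss weights omitted for brevity
```

<p>In production these would be batched tensor operations over learned parameters; the sketch only fixes the shapes involved and which parameters each loss trains.</p><p>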
Schematic in Fig 9 and losses below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DOcF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DOcF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png 424w, https://substackcdn.com/image/fetch/$s_!DOcF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png 848w, https://substackcdn.com/image/fetch/$s_!DOcF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png 1272w, https://substackcdn.com/image/fetch/$s_!DOcF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DOcF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png" width="1188" height="596" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:596,&quot;width&quot;:1188,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100293,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/171806043?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DOcF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png 424w, https://substackcdn.com/image/fetch/$s_!DOcF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png 848w, https://substackcdn.com/image/fetch/$s_!DOcF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png 1272w, https://substackcdn.com/image/fetch/$s_!DOcF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ce121c7-973c-40b3-9299-d32785e64bb2_1188x596.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 9 shows the additional losses we can extract from user history, letting the model learn from each data point in the context of the social graph. While the model trains on a new social connection, it has an incentive to stay grounded in previously learned connections.</figcaption></figure></div><p></p><h4>1. 
Temporal Autoregressive Loss L_{AR} - (true token generation loss for spliced history)</h4><ul><li><p><strong>What it does</strong>: Predicts the next <strong>STU tokens</strong> (social tokenized user IDs) of the <strong>target user</strong> at time t+1, given the actor&#8217;s history up to t.</p></li><li><p><strong>How</strong>:</p><ul><li><p>Encoder produces h_i (hidden state for actor&#8217;s history prefix up to step i).</p></li><li><p>A classification head: <code>Linear(d_model, Codebook_size C)</code> outputs logits for each of the M STU tokens of the target user.</p></li><li><p>Compute cross-entropy between predicted logits and the actual target tokens.</p></li></ul></li></ul><p>&#128073; <strong>Trains</strong>:</p><ul><li><p>Encoder parameters.</p></li><li><p>The classification heads.</p></li></ul><div><hr></div><h4>2. Target&#8217;s codebook embeddings : L_{code}</h4><ul><li><p><strong>What it does</strong>: Encourages the <strong>semantic codebook embeddings Z</strong> to represent users well in a contrastive sense.</p></li><li><p><strong>How</strong>:</p><ul><li><p>For the true target user at step i+1, we take its STU token embeddings z_{t_{i+1}} (e.g., by summing/averaging its M codebook vectors).</p></li><li><p>Compare with h_i using an InfoNCE / sampled softmax style loss against in-batch negatives.</p></li></ul></li></ul><p>&#128073; <strong>Trains</strong>:</p><ul><li><p>Encoder (so h_i is predictive).</p></li><li><p>Codebook embeddings Z directly.</p></li></ul><div><hr></div><h4>3. 
Target&#8217;s id embeddings : L_{target}</h4><ul><li><p><strong>What it does</strong>: Makes the encoder&#8217;s hidden state h_i useful for directly predicting the <strong>continuous target embeddings e_{t_{i+1}}</strong> (from the user embedding table).</p></li><li><p><strong>How</strong>:</p><ul><li><p>Contrastive similarity between h_i and the &#8220;target&#8221; embedding e_{t_{i+1}}, with in-batch negatives.</p></li></ul></li></ul><p>&#128073; <strong>Trains</strong>:</p><ul><li><p>Encoder (so its hidden states align with ground-truth future interactions).</p></li><li><p>Continuous embedding table {e_j} for all users.</p></li></ul><div><hr></div><p><strong>Summary of Solution to Problem #1:</strong></p><ul><li><p><strong>L_{AR}:</strong> trains M classification heads + encoder (discrete token prediction).</p></li><li><p><strong>L_{code}:</strong> trains encoder + codebook embeddings Z.</p></li><li><p><strong>L_{target}:</strong> trains encoder + continuous target embeddings {e_j}.</p></li></ul><p></p><p><strong>Conclusion</strong>: It&#8217;s a no-brainer to invest in generative recs given the results from OneRec and others. This post shows one approach to building generative recommendations in a social recommendation use case. We hope it helps in yours.</p><p></p><p><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions or opinions stated here are theirs alone and not representative of their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company, including ones the author(s) may have worked in or been associated with.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I did not mention &#8220;Reward Modeling&#8221; since it differs between any two recsys, even two video recommenders, because it is a representation of each product&#8217;s market fit. So the difference is not due to this being a friend recommender system.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Why can&#8217;t we use a billion-sized vocabulary? The short answer: even if we could find enough GPUs for it, training would overfit and produce poor recommendations.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Disclaimer: As elaborated in <a href="https://arxiv.org/pdf/2506.13695">Table 12 of OneRec</a>, while generative OneRec is an improvement over a multi-stage recsys, OneRec with a Reward model (that is, using OneRec as a candidate generator) is currently much better. It is still open research to make the generative model competent enough to not need a Reward model layer after it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This is a simplification. 
As shown in section 4.2.2 of <a href="https://arxiv.org/pdf/2506.13695">OneRec</a>, they have also experimented with representing the history in terms of semantic ids rather than item ids. They are seeing improved results, but those will be packaged in a separate publication. We describe the User Encoder as unchanged because, evidently, using semantic ids in the User Encoder is not essential in OneRec, and this helps us hone in on the core innovation.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Using RL to maximize ad revenue without retention tradeoffs]]></title><description><![CDATA[Using Contextual Bandits and Predictive Modeling to Optimize Personalized Ad-Placement Policies]]></description><link>https://recsysml.substack.com/p/reinforcement-learning-for-balancing</link><guid isPermaLink="false">https://recsysml.substack.com/p/reinforcement-learning-for-balancing</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Sun, 24 Aug 2025 14:04:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!I6Pm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;38c8ca7d-a39a-476f-8981-bdb54beea9e7&quot;,&quot;duration&quot;:null}"></div><p><a href="https://www.investors.com/news/advertising-industry-to-hit-1-trillion-dominated-by-the-new-big-5/">Global Ad industry hits $1 trillion in revenue</a>. 
In this blog, we will learn how to maximize ad revenue with minimal impact to retention and engagement.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I6Pm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I6Pm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png 424w, https://substackcdn.com/image/fetch/$s_!I6Pm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png 848w, https://substackcdn.com/image/fetch/$s_!I6Pm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png 1272w, https://substackcdn.com/image/fetch/$s_!I6Pm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I6Pm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png" width="1456" height="978" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115716,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I6Pm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png 424w, https://substackcdn.com/image/fetch/$s_!I6Pm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png 848w, https://substackcdn.com/image/fetch/$s_!I6Pm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png 1272w, https://substackcdn.com/image/fetch/$s_!I6Pm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9514f3ef-f317-451d-9d62-b320451fef32_1774x1192.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In a content platform like YouTube, deciding whether to show an ad as the very first piece of content is a subtle but critical problem. The trade-off is intuitive:</p><ul><li><p><strong>If you show an ad first:</strong> You gain immediate ad revenue, but you risk users dropping off or engaging less with the subsequent content.</p></li><li><p><strong>If you don&#8217;t show an ad first:</strong> You keep the user more engaged (likely increasing future content consumption and potentially downstream revenue), but you sacrifice the immediate revenue opportunity of that first impression.</p></li></ul><p>In this post, we&#8217;ll explore how to frame this decision using a data-driven approach. We&#8217;ll start from a simplified viewpoint&#8212;just deciding <strong>whether</strong> to show an ad first&#8212;and build up to a strategy that integrates both engagement-based revenue modeling and personalized ad revenue predictions. 
We&#8217;ll then discuss how a contextual bandit approach can be applied to learn these policies from historical data.</p><h2><strong>#1: Blending Approach &#8212;&gt; Engagement Loss improvement</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zazr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zazr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png 424w, https://substackcdn.com/image/fetch/$s_!zazr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png 848w, 
https://substackcdn.com/image/fetch/$s_!zazr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png 1272w, https://substackcdn.com/image/fetch/$s_!zazr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zazr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png" width="1428" height="790" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:790,&quot;width&quot;:1428,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100412,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zazr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png 424w, https://substackcdn.com/image/fetch/$s_!zazr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png 848w, 
https://substackcdn.com/image/fetch/$s_!zazr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png 1272w, https://substackcdn.com/image/fetch/$s_!zazr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce46df23-0918-4e20-bd86-abfa3c5a7777_1428x790.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 1 : Blending just treats ads and content as options and chooses the one with the highest 
value</figcaption></figure></div><p>A first, naive approach might consider the expected revenue from showing an ad in the first position versus the expected long-term engagement if we skip that ad. We could imagine we have two signals:</p><ol><li><p><strong>Expected Ad Revenue (E[rev|u, ad]):</strong> A personalized model that, given the user and session context, estimates the immediate revenue we would earn by showing an ad. This might be well-approximated by an advanced personalized ads model trained on historical ad-serving data.</p></li><li><p><strong>Expected Engagement:</strong> A measure or score indicating how much content consumption (views, watch time, etc.) we expect if we place content first. We assume we already have a way to translate engagement into revenue via a function <code>f_eng_to_rev(engagement)</code>, which converts user engagement into an expected revenue value (for example, predicting future watch-time-based monetization).</p></li></ol><p>In a simple scenario, if we had some expected engagement value for showing content and some expected engagement value for not showing content, we might attempt to &#8220;blend&#8221; them with the direct ad revenue to decide. However, it&#8217;s not straightforward: the presence of an initial ad does not necessarily mean all engagement is lost&#8212;it just might reduce it. 
What we really need is the <strong>expected reduction in engagement</strong> caused by showing the ad.</p><blockquote><p>For example, if a deeply satisfied user&#8217;s engagement would be mostly unaffected by the ad, it should be fine to show it.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5rgY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5rgY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png 424w, https://substackcdn.com/image/fetch/$s_!5rgY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png 848w, https://substackcdn.com/image/fetch/$s_!5rgY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png 1272w, https://substackcdn.com/image/fetch/$s_!5rgY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5rgY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png" width="492" height="403.8953722334004" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:994,&quot;resizeWidth&quot;:492,&quot;bytes&quot;:81160,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5rgY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png 424w, https://substackcdn.com/image/fetch/$s_!5rgY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png 848w, https://substackcdn.com/image/fetch/$s_!5rgY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png 1272w, https://substackcdn.com/image/fetch/$s_!5rgY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21950079-0c1b-46f2-91fa-72d5fceaedc2_994x816.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 2: We think it is better to look at the expected engagement loss of the ad and compare it to the revenue gained.</figcaption></figure></div><p></p><p>Let&#8217;s say we have estimated the expected engagement loss if we show the ad first. 
Our decision rule could look like this:</p><blockquote><p>Show Ad if and only if: (engagement_revenue_conversion * expected engagement loss) &lt; (expected revenue of showing an ad)</p></blockquote><p>or mathematically</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Show\\ Ad : f_{eng\\_to\\_rev}(E[eng\\_loss|u, ad]) < E[rev|u, ad]&quot;,&quot;id&quot;:&quot;WTBTOYRFAV&quot;}" data-component-name="LatexBlockToDOM"></div><p>This equation says: &#8220;If the immediate ad revenue gain from showing the ad exceeds the revenue-equivalent cost of reduced engagement, then show the ad.&#8221; It relies on three components that can be learned independently:</p><ul><li><p><strong>E[rev|u, ad]:</strong> The expected incremental ad revenue from showing the ad first for a given user/session.</p></li><li><p><strong>f_eng_to_rev:</strong> The function that converts changes in engagement into revenue terms.</p></li><li><p><strong>Expected engagement loss:</strong> Learned from data, comparing engagement outcomes between sessions where an ad was shown first and sessions where it was not.</p></li></ul><p></p><h2><strong>#2 Bridging Theory and Practice: Two Approaches in Our GitHub Repository</strong></h2><p>In our GitHub repository, <a href="https://github.com/gauravchak/ad-placement-rl">https://github.com/gauravchak/ad-placement-rl</a>, we demonstrate two different approaches to optimizing session revenue directly: a <strong>contextual bandit</strong> method and a <strong>reinforcement learning (policy gradient)</strong> method. In both cases, the setup is the same: at each session (context), we must make a binary decision&#8212;whether or not to show an ad&#8212;and we then observe a numerical reward. 
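</p><p>As a minimal sketch of the decision rule above: assume a simple linear <code>f_eng_to_rev</code>. The conversion rate and function names here are illustrative assumptions, not the repository&#8217;s API.</p>

```python
# Hypothetical sketch of the show-ad decision rule; the conversion rate
# and the function names are illustrative assumptions, not the repo's API.

ENG_TO_REV_RATE = 0.002  # assumed revenue value of one unit of engagement


def f_eng_to_rev(engagement: float) -> float:
    """Convert an engagement quantity into its revenue equivalent."""
    return ENG_TO_REV_RATE * engagement


def should_show_ad_first(expected_eng_loss: float, expected_ad_rev: float) -> bool:
    """Show the ad iff f_eng_to_rev(E[eng_loss|u, ad]) < E[rev|u, ad]."""
    return f_eng_to_rev(expected_eng_loss) < expected_ad_rev


# A deeply satisfied user whose engagement barely drops: show the ad.
print(should_show_ad_first(expected_eng_loss=10.0, expected_ad_rev=0.05))   # True
# A fragile session where the ad would cost a lot of engagement: skip it.
print(should_show_ad_first(expected_eng_loss=500.0, expected_ad_rev=0.05))  # False
```

<p>The same comparison holds whatever form <code>f_eng_to_rev</code> takes, as long as it is monotone in engagement.</p><p>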
This reward is a combined metric that integrates both immediate session-level revenue and the engagement-based revenue equivalent.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;net\\_session\\_reward = session\\_revenue + f_{eng\\_to\\_rev}(session\\_engagement)&quot;,&quot;id&quot;:&quot;STLVSMRRQK&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Notably, these approaches can benefit from additional features in the user context. For example, even before we integrate the personalized ad estimator&#8217;s output (<code>E[rev|u, ad]</code>) into the reward function, we can include it as part of <code>user_features</code>. By doing so, both the contextual bandit and the policy gradient models can leverage this personalized signal at inference time, potentially improving decision-making and anticipating the value of showing an ad first.</p><p><strong>Building steerability into the net reward</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;net\\_session\\_reward = session\\_revenue + \\textbf{dial} * f_{eng\\_to\\_rev}(session\\_engagement)&quot;,&quot;id&quot;:&quot;VIKZWUCVMX&quot;}" data-component-name="LatexBlockToDOM"></div><p>There may be times of the year, such as the holiday season around Thanksgiving, when the business wants to prioritize revenue. Building a dial into the reward makes that trade-off explicit and easy to steer.</p><h3><strong>2a: Optimizing session net reward with Contextual Bandits</strong></h3><p>Code snippet deciding whether to show the ad:</p><pre><code>def should_show_ad(reward_model_ad, reward_model_no_ad, user_features):
    """
    Given trained reward models and user_features, return a decision:
    show_ad = 1 if predicted_reward_if_ad &gt; predicted_reward_if_no_ad else 0
    """
    pred_net_reward_if_ad, pred_net_reward_if_no_ad = expected_reward(
        reward_model_ad=reward_model_ad,
        reward_model_no_ad=reward_model_no_ad,
        user_features=user_features)

    # True if predicted net reward of ad action is higher
    return (pred_net_reward_if_ad &gt; pred_net_reward_if_no_ad)
</code></pre><p>This contextual bandit approach trains separate models to predict the expected reward under each action. By comparing these predictions, the policy picks the action that yields the higher expected combined value at inference time. It&#8217;s conceptually simple and can be trained effectively on logged data, making it a practical first step toward data-driven ad placement decisions.</p><p></p><h3>2b: <strong>REINFORCE and Policy Gradients</strong></h3><p>While contextual bandits are a simple and effective way to learn a policy directly from logged data, they still treat each decision as a one-step problem. Another approach, inspired by reinforcement learning (RL), is to use <strong>policy gradient methods</strong> such as REINFORCE. </p><p>A great tutorial on using deep RL for a binary decision is Andrej Karpathy&#8217;s talk on Pong below:</p><div id="youtube2-tqrcjHuNdmQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;tqrcjHuNdmQ&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/tqrcjHuNdmQ?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><strong>Conceptual Overview</strong><br>In the REINFORCE framework, rather than training separate models to predict the reward for each action and then comparing them, you directly parameterize a probabilistic policy that decides how likely it is to show an ad or not. For each training example, you know which action was taken and what the resulting reward was. 
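</p><p>As a minimal sketch (not the repository&#8217;s code), a logistic policy over session features can be trained with the basic REINFORCE update. The feature sizes and the toy reward below are made-up assumptions.</p>

```python
import numpy as np

# Minimal REINFORCE sketch for the binary show-ad decision, assuming a
# logistic policy over session features. Feature sizes and the toy
# reward are illustrative assumptions, not the repository's setup.

rng = np.random.default_rng(0)
theta = np.zeros(4)  # 3 context weights + 1 bias weight


def pi_show_ad(feat, theta):
    """Probability of showing the ad under the current logistic policy."""
    return 1.0 / (1.0 + np.exp(-feat @ theta))


def reinforce_update(feat, action, reward, theta, lr=0.01, baseline=0.0):
    """One REINFORCE step: move theta along (reward - baseline) * grad log pi."""
    p = pi_show_ad(feat, theta)
    grad_log_pi = (action - p) * feat  # gradient of log pi for a Bernoulli policy
    return theta + lr * (reward - baseline) * grad_log_pi


for _ in range(2000):
    ctx = rng.normal(size=3)        # user/session features
    feat = np.append(ctx, 1.0)      # constant bias feature
    # Sampling from the policy gives the built-in exploration described above
    action = int(rng.random() < pi_show_ad(feat, theta))
    # Toy environment: in this sketch, showing the ad always nets +1 reward
    reward = 1.0 if action == 1 else 0.0
    theta = reinforce_update(feat, action, reward, theta)
```

<p>Because the toy reward always favors showing the ad, the learned policy drifts toward showing it; with a real <code>net_session_reward</code> label, the same update would instead balance ad revenue against engagement loss.</p><p>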
The goal is to adjust the policy parameters to increase the probability of actions that led to higher rewards and decrease the probability of actions that led to lower rewards.</p><p>By doing this, REINFORCE naturally fits the problem of deciding whether to show an ad first: you have a distribution over actions, and you tune it to favor whichever action yields better long-term value. If showing ads first consistently produces higher combined revenue (ad revenue plus engagement-based returns), the policy naturally shifts towards showing the ad. If it reduces future engagement too much, the policy learns to refrain from showing the ad.</p><p><strong>Why REINFORCE?</strong></p><ul><li><p><strong>Direct Optimization:</strong> REINFORCE directly optimizes the expected reward of the policy. Instead of separately learning a model for each action and then deriving a policy, you adjust the policy parameters to maximize the observed rewards.</p></li><li><p><strong>Built-in Exploration:</strong><br>By modeling a probability distribution over actions instead of choosing them deterministically, the policy samples actions according to their probabilities rather than always taking the single top-scoring action. This stochasticity means the model inherently tries different actions over time, which can lead to discovering better policies than a purely greedy strategy would.</p></li><li><p><strong>Generalization to RL Settings:</strong> Although we&#8217;re currently working in a single-step contextual bandit setting, policy gradient methods can easily generalize to multi-step reinforcement learning problems. 
This opens the door to modeling scenarios where the consequences of showing (or not showing) an ad extend beyond the first position, or even the current session.</p></li></ul><p><strong>Pros and Cons of REINFORCE vs. Contextual Bandits</strong></p><ul><li><p><strong>Pros (REINFORCE):</strong></p><ul><li><p>Directly learns a stochastic policy that can be generalized to more complex sequential or multi-step decision-making scenarios.</p></li><li><p>Conceptually simple: the update rule just scales the gradient of the log probability of the chosen action by the reward.</p></li></ul></li><li><p><strong>Cons (REINFORCE):</strong></p><ul><li><p>High variance: The basic REINFORCE update can be noisy and may require techniques like baselines or variance reduction to stabilize training.</p></li><li><p>On-policy: The method is conceptually on-policy, meaning it learns best when data is collected from its own evolving policy. Using strictly logged offline data (collected by a different policy) can introduce bias unless steps are taken to correct it.</p></li></ul></li><li><p><strong>Pros (Contextual Bandits):</strong></p><ul><li><p>Simpler, more direct learning: Estimate the reward for each action and pick the best one. Straightforward modeling from logged data.</p></li><li><p>Lower variance estimates: Predictive models for each action can produce more stable estimates with offline data.</p></li></ul></li><li><p><strong>Cons (Contextual Bandits):</strong></p><ul><li><p>Limited to single-step decisions: The approach doesn&#8217;t naturally extend to sequential decision-making.</p></li><li><p>Requires a separate model for each action, or a common model architecture that outputs multiple action values.</p></li></ul></li></ul><h2>#3 - Integrating the Ad Revenue Estimator <strong>(E[rev|u, ad]) </strong>deeply</h2><p>In section #2, we discussed incorporating <code>E[rev|u, ad]</code> as an input feature. This is helpful, but we can push the idea further by explicitly decomposing the reward. 
Instead of lumping all revenue and engagement value into a single session label, we can separate the immediate revenue from showing an ad (predicted by <code>E[rev|u, ad]</code>) from the delayed, engagement-based revenue. In other words, we estimate:</p><ul><li><p><strong>Immediate Reward (if ad is shown):</strong> <code>E[rev|u, ad]</code>&#8212;our personalized ad revenue estimate.</p></li><li><p><strong>Delayed Reward (if ad is shown):</strong> Session value minus the immediate ad revenue, converted into engagement-equivalent terms.</p></li></ul><p>At inference time, this lets us approximate the decision boundary more explicitly. We compare the sum of the immediate and delayed rewards for showing the ad against the expected session value if we do not show the ad:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P(show\\_ad) = \\sigma (\n    (E[rev | u, ad] + E(delayed\\_reward | show\\_ad))\n    - E(net\\_session\\_revenue | no\\_show\\_ad)\n)\n&quot;,&quot;id&quot;:&quot;ZRWPTOCCFF&quot;}" data-component-name="LatexBlockToDOM"></div><p>This approach leverages the fact that our immediate revenue estimates are likely of higher accuracy, and it reduces what we must learn to a &#8220;residual&#8221; (delayed) effect. It ties together the personalized ad estimator with a session-level RL framework, enabling more <strong>accurate</strong> and <strong>modular</strong> policy decisions.</p><h1>Summary</h1><ol><li><p>We show how to use RL to learn when to show an ad.</p></li><li><p>The framework extends to other positions, not just the first position. The RL policy encapsulates exploration to enable continuous improvement.</p></li><li><p>We do so in a way that maximally uses our high-accuracy ad revenue estimator model. Thus we have set up the problem in a modular fashion, with multiple teams working in synergy.</p></li></ol><p></p><p><strong>Disclaimer:</strong> These are the personal opinions of the author(s). 
Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</p>]]></content:encoded></item><item><title><![CDATA[How to implement Generative Retrieval]]></title><description><![CDATA[GenAI meets recommender systems]]></description><link>https://recsysml.substack.com/p/how-to-implement-generative-retrieval</link><guid isPermaLink="false">https://recsysml.substack.com/p/how-to-implement-generative-retrieval</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Thu, 05 Jun 2025 14:30:25 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/164981049/aa24ca5118f90d625067bf026601221c.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h3>Improving Recsys with GenAI</h3><p>We're excited about the potential of Large Language Models (LLMs) in recommender systems, given their high accuracy in multiple domains. Building on this potential, we'll explore how to harness LLMs for recommendation tasks.</p><h3>The Challenge of Using LLMs in RecSys</h3><p>One key challenge is tokenizing billions of recommendable items, which makes it hard to apply LLMs directly. If only we could break each item into a few tokens, the way LLMs break long words like &#8220;happiness&#8221; into subword tokens. Once we have a vocabulary of meaningful tokens that reliably describe interaction probabilities, we can leverage LLM machinery for prediction.</p><h3>Proposed Solution: Generative Retrieval</h3><p>The paper <a href="https://arxiv.org/abs/2305.05065">Generative Retrieval</a> generates "semantic" embeddings using RQVAE (a type of vector quantized variational autoencoder), enabling LLMs to learn meaningful item representations. 
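</p><p>To make the residual-quantization idea concrete, here is a hedged sketch of how a trained RQVAE&#8217;s codebooks would map an item embedding to a tuple of semantic-ID tokens. The codebooks below are random stand-ins; RQVAE learns them jointly with an encoder and decoder.</p>

```python
import numpy as np

# Illustrative sketch of residual quantization, the step behind semantic IDs:
# each level's codebook quantizes the residual left by the previous level.
# The codebooks here are random stand-ins; an RQVAE learns them jointly
# with an encoder/decoder, so treat sizes and values as assumptions.

rng = np.random.default_rng(42)
DIM, CODEBOOK_SIZE, LEVELS = 8, 16, 3
codebooks = rng.normal(size=(LEVELS, CODEBOOK_SIZE, DIM))


def semantic_id(item_embedding):
    """Map an item embedding to a tuple of codeword indices (its semantic ID)."""
    residual = np.asarray(item_embedding, dtype=float).copy()
    ids = []
    for level in range(LEVELS):
        # Pick the codeword closest to the current residual at this level
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        ids.append(idx)
        residual = residual - codebooks[level][idx]
    return tuple(ids)


item = rng.normal(size=DIM)
print(semantic_id(item))  # a LEVELS-length tuple of codeword indices
```

<p>Because nearby embeddings tend to pick the same early codewords, similar items share semantic-ID prefixes, which is what lets an LLM generate them token by token.</p><p>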
By creating semantic embeddings where similar items are closer together, one can generate semantic IDs that capture nuanced relationships between items.</p><h3>Showing You How to Implement It</h3><p>To make this approach more accessible, <a href="https://www.linkedin.com/in/sam-komo-5247a494/">Samson Komo</a> has prepared for you:</p><p>A video tutorial walking through paper code and colab: </p><div id="youtube2-OE5iJcFLS7o" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;OE5iJcFLS7o&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/OE5iJcFLS7o?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>GitHub repo: <a href="https://github.com/komosam/Generative-Retrieval">https://github.com/komosam/Generative-Retrieval</a></p><p></p><h3>Street Cred of the Approach</h3><p>Generative retrieval has already shown impressive results, with over 40% share of retrieval in some state-of-the-art video and ad recommender systems. By implementing this method, you can unlock more accurate and diverse recommendations.</p><p></p><h3>Conclusion</h3><p>By implementing generative retrieval, you can tap into the power of LLMs for recommendation tasks. Explore our resources to get started and discover how this approach can enhance your recommender systems.</p><p></p><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</em></p>]]></content:encoded></item><item><title><![CDATA[Attention Explained: When to use Self, Graph, and Target-Aware Attention]]></title><description><![CDATA[Unlocking the Power of AI: A Beginner's Guide to Attention Architectures]]></description><link>https://recsysml.substack.com/p/attention-explained-when-to-use-self</link><guid isPermaLink="false">https://recsysml.substack.com/p/attention-explained-when-to-use-self</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Sun, 25 May 2025 15:30:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mKpe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0799e93-a753-45f4-894f-ae8673444fd2_1600x957.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>TL;DR:</p><ol><li><p><strong>Self-attention</strong> summarizes information from a list (e.g., recent videos watched or chatbot text) to create a relevant summary.</p></li><li><p><strong>Graph attention</strong> understands relationships within a network (e.g., social circles in a social network).</p></li><li><p><strong>Target-aware attention</strong> evaluates the relevance of items being ranked to a user's history or query.</p></li></ol><p><a href="https://arxiv.org/abs/1706.03762">Attention</a> is a powerful tool in AI, but its applications and types can be confusing. 
In this article, we'll break down three common attention architectures - self-attention, graph attention, and target-aware attention - and explore their use cases and strengths.</p><p></p><h2>Basic building block of attention</h2><p>The unit shown below in Fig 1 finds a weighted sum of the input sequence (aka &#8220;keys&#8221;) using the query embedding. 
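</p><p>As a concrete sketch, this weighted-sum unit can be written in a few lines of NumPy; the shapes and example data below are illustrative assumptions.</p>

```python
import numpy as np

# Minimal sketch of the basic attention unit: the query scores each key,
# softmax turns the scores into weights, and the output is the weighted
# sum of the values. Shapes and data are illustrative assumptions.


def attention_summary(query, keys, values):
    """Return a summary of `values` most relevant to `query` (single head)."""
    scores = keys @ query / np.sqrt(query.shape[0])  # scaled dot products
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                # softmax over the keys
    return weights @ values                          # weighted sum


rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))   # e.g. embeddings of 5 recently watched videos
values = keys                    # here the values are the items themselves
query = keys[0]                  # summarize the list relative to the first item
summary = attention_summary(query, keys, values)
print(summary.shape)  # (4,)
```

<p>The output lives in the same embedding space as the values: it is the list&#8217;s content, reweighted toward whatever the query resembles most.</p><p>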
One way to think about it is to find a summary in the keys that is most relevant to the query.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!mKpe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0799e93-a753-45f4-894f-ae8673444fd2_1600x957.png" width="1456" height="871" alt=""><figcaption class="image-caption">Fig 1: Basic building block of attention</figcaption></figure></div><h2>Self attention</h2><p>In self attention we generate an equal number of embeddings as the input. We do that by taking each of the inputs as a query. So after a layer of attention, each item in the list is replaced by a sort of smoothened version of it. 
(Fig 2)</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Drke!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a29331a-57ee-47f5-8847-151672c3212e_1220x436.png" width="1220" height="436" alt=""><figcaption class="image-caption">Fig 2: Each item is used as a query in self attention. Hence the output is the same length as input.</figcaption></figure></div><p>In certain scenarios like language and content recommender systems, where positions matter to the relevance of different items, positional encoding is also useful to find better attention weights. 
(Fig 3)</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!St_m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc05c90e7-3088-4d23-9a47-62cd1d3ce2be_1208x568.png" width="1208" height="568" alt=""><figcaption class="image-caption">Fig 3: Positional encoding is added to the item encoding</figcaption></figure></div><p>Code implementation</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h9hl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h9hl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png 424w, https://substackcdn.com/image/fetch/$s_!h9hl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png 848w, 
https://substackcdn.com/image/fetch/$s_!h9hl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!h9hl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h9hl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png" width="1098" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1098,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h9hl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png 424w, https://substackcdn.com/image/fetch/$s_!h9hl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png 848w, 
https://substackcdn.com/image/fetch/$s_!h9hl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!h9hl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225a4c42-a9c0-460d-b973-7a92c06c9da7_1098x1000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Graph Attention</h2><p>Graph Attention is similar to Fig 2. 
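</p><p>Before the exact formulation, one such layer can be sketched minimally in NumPy (single head, illustrative names, not the post&#8217;s exact implementation): a masked softmax restricts each node to its neighbors, and a residual connection carries the node&#8217;s previous value forward:</p>

```python
import numpy as np

def graph_attention_layer(h, adj, rng):
    """One single-head graph-attention layer (illustrative).

    h:   (n, d) node embeddings
    adj: (n, n) binary adjacency matrix, self-loops included
    """
    n, d = h.shape
    W = rng.normal(size=(d, d)) / np.sqrt(d)         # shared linear projection
    a_src, a_dst = rng.normal(size=(2, d))           # additive-attention vectors
    z = h @ W
    e = (z @ a_src)[:, None] + (z @ a_dst)[None, :]  # pairwise logits
    e = np.where(e > 0, e, 0.2 * e)                  # LeakyReLU
    e = np.where(adj > 0, e, -1e9)                   # mask out non-neighbors
    w = np.exp(e - e.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                # softmax over neighbors
    return w @ z + h                                 # aggregate + residual

rng = np.random.default_rng(0)
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])                       # 4-node path graph + self-loops
h = rng.normal(size=(4, 8))
h_next = graph_attention_layer(h, adj, rng)
print(h_next.shape)                                  # (4, 8)
```

<p>The residual term is what distinguishes this from plain neighborhood averaging: each node keeps part of its own signal across layers, which is one way to counter over-smoothing.</p><p>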
That is, in each attention layer a node&#8217;s embedding is updated using its neighbors&#8217; embeddings together with its own. The formulation, however, differs slightly: Graph Attention Transformers <a href="https://arxiv.org/abs/2305.16102">decelerate over-smoothing</a> by feeding the node&#8217;s previous value as an extra input to the next layer&#8217;s neural network.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pn3R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pn3R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png 424w, https://substackcdn.com/image/fetch/$s_!pn3R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png 848w, https://substackcdn.com/image/fetch/$s_!pn3R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png 1272w, https://substackcdn.com/image/fetch/$s_!pn3R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pn3R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png" width="1456" height="904"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:904,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pn3R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png 424w, https://substackcdn.com/image/fetch/$s_!pn3R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png 848w, https://substackcdn.com/image/fetch/$s_!pn3R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png 1272w, https://substackcdn.com/image/fetch/$s_!pn3R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989e0d63-db80-4b1a-97f5-0dc7f7bf3e77_1600x993.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 4: Graph Attention</figcaption></figure></div><p>Code implementation</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0LqY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0LqY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png 424w, https://substackcdn.com/image/fetch/$s_!0LqY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png 848w, 
https://substackcdn.com/image/fetch/$s_!0LqY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png 1272w, https://substackcdn.com/image/fetch/$s_!0LqY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0LqY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png" width="1152" height="806" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1152,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:130718,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/164251342?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0LqY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png 424w, 
https://substackcdn.com/image/fetch/$s_!0LqY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png 848w, https://substackcdn.com/image/fetch/$s_!0LqY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png 1272w, https://substackcdn.com/image/fetch/$s_!0LqY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a3efb3b-44af-4355-8e85-c637f586f265_1152x806.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zSdQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zSdQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png 424w, https://substackcdn.com/image/fetch/$s_!zSdQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png 848w, https://substackcdn.com/image/fetch/$s_!zSdQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!zSdQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zSdQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png" width="1000" height="1108" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1108,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:418446,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://recsysml.substack.com/i/164251342?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zSdQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png 424w, https://substackcdn.com/image/fetch/$s_!zSdQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png 848w, https://substackcdn.com/image/fetch/$s_!zSdQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!zSdQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9702eff0-8f15-4dea-b7ad-7858629cd4d3_1000x1108.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Target-aware attention</h2><p>In ranking applications, we estimate the probability of a successful outcome for each of several candidate items, aka &#8220;targets&#8221;.
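</p><p>The core idea can be sketched minimally in NumPy (illustrative names, single head, dot-product scoring): the candidate target&#8217;s embedding acts as the query, while the user&#8217;s history provides the keys and values:</p>

```python
import numpy as np

def target_aware_pool(target, history):
    """target:  (d,) embedding of the candidate item being ranked.
    history: (n, d) embeddings of the user's past items (keys = values).
    Returns a (d,) summary of the history, weighted by relevance
    to this particular target."""
    d = target.shape[-1]
    scores = history @ target / np.sqrt(d)         # one score per history item
    w = np.exp(scores - scores.max())
    w /= w.sum()                                   # softmax over the history
    return w @ history

rng = np.random.default_rng(0)
history = rng.normal(size=(6, 8))                  # 6 past items, d = 8
candidate_a, candidate_b = rng.normal(size=(2, 8))
# Different targets pool the same history differently.
pool_a = target_aware_pool(candidate_a, history)
pool_b = target_aware_pool(candidate_b, history)
print(pool_a.shape)
```

<p>The pooled vector is typically concatenated with the target&#8217;s own features and fed to the ranking model, so each candidate sees the history summarized from its own point of view.</p><p>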
A successful application of attention in ranking stems from using attention with the target as the query and user&#8217;s history or query text sequence as keys.</p><p>In contrast to self-attention, target-aware attention uses a specific target as the query to compute attention weights over a sequence of items.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hsye!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hsye!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png 424w, https://substackcdn.com/image/fetch/$s_!hsye!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png 848w, https://substackcdn.com/image/fetch/$s_!hsye!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png 1272w, https://substackcdn.com/image/fetch/$s_!hsye!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hsye!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png" width="1456" height="1143" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1143,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hsye!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png 424w, https://substackcdn.com/image/fetch/$s_!hsye!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png 848w, https://substackcdn.com/image/fetch/$s_!hsye!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png 1272w, https://substackcdn.com/image/fetch/$s_!hsye!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b8eadfd-a362-474c-b3c1-81cebe351d2d_1518x1192.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 5: The item being ranked is the query of the attention module. 
For instance, while ranking a video in a video recommender, the target video&#8217;s embedding is the query, the user&#8217;s history is the sequence (or &#8220;keys&#8221;)</figcaption></figure></div><p>Code implementation</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SZI2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SZI2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png 424w, https://substackcdn.com/image/fetch/$s_!SZI2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png 848w, https://substackcdn.com/image/fetch/$s_!SZI2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png 1272w, https://substackcdn.com/image/fetch/$s_!SZI2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SZI2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png" width="1140" height="982" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:982,&quot;width&quot;:1140,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SZI2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png 424w, https://substackcdn.com/image/fetch/$s_!SZI2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png 848w, https://substackcdn.com/image/fetch/$s_!SZI2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png 1272w, https://substackcdn.com/image/fetch/$s_!SZI2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de8f4fb-8be0-40dc-a325-53d3ad94c5c1_1140x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Conclusion</h2><p>Whether it's processing sequences with self-attention, modeling relationships with graph attention, or ranking items with target-aware attention, each mechanism offers unique strengths. Use what is most applicable or a combination as needed. If you want to talk about your use case and what might fit the best, please reach out to us.</p><p>Prior posts on recsys stacks that can use attention:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;eede9593-983b-4a20-9e64-123a70a454e7&quot;,&quot;caption&quot;:&quot;The first tech stack you should build today for personalized recommendations is retrieval using two tower models[1, 2] and ranking on top of it. In this article we will learn about two-tower models and ranking will be covered in a future post. Using two tower models has helped leading tech companies improve the quality of their recommendations, online a&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Two tower models for retrieval of recommendations&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5668214,&quot;name&quot;:&quot;Gaurav Chakravorty&quot;,&quot;bio&quot;:&quot;- Applied ML in Recommender systems (Facebook / Instagram, Google, Discord)\n- 20 years in Applied 
ML&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f40461d9-d0cc-4a2b-bc68-d46e2c022079_401x401.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2021-02-19T14:16:01.838Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1124e0c7-447f-4d94-8148-5fcb26c25d63_1686x1001.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://recsysml.substack.com/p/two-tower-models-for-retrieval-of&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:32375010,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:16,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Applied ML | Recommender systems&quot;,&quot;publication_logo_url&quot;:&quot;&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;c44c1b8a-d4ba-4b2d-ae1d-c3dc44eae1d6&quot;,&quot;caption&quot;:&quot;Why is an Early Ranker needed&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Early (Stage) Ranking in recommender systems&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5668214,&quot;name&quot;:&quot;Gaurav Chakravorty&quot;,&quot;bio&quot;:&quot;- Applied ML in Recommender systems (Facebook / Instagram, Google, Discord)\n- 20 years in Applied 
ML&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f40461d9-d0cc-4a2b-bc68-d46e2c022079_401x401.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-11-26T22:04:01.098Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338dae91-781f-45c4-89d2-7575998a6095_1612x786.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://recsysml.substack.com/p/early-stage-ranking-in-recommender&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:135985725,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:23,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Applied ML | Recommender systems&quot;,&quot;publication_logo_url&quot;:&quot;&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p><p><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</p>]]></content:encoded></item><item><title><![CDATA[Scalable Embedding based retrieval for target side value]]></title><description><![CDATA[Addressing Scalability Challenges in Two-Sided Embedding based Recommendations]]></description><link>https://recsysml.substack.com/p/scalable-embedding-based-retrieval</link><guid isPermaLink="false">https://recsysml.substack.com/p/scalable-embedding-based-retrieval</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Sat, 17 May 2025 17:54:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Friending recommendations in a social network have been known to deliver value by both finding targets that drive viewers to visit the app (&#8220;viewer visitation&#8221;) and initiating connections from viewers that lead targets to visit the app (&#8220;target visitation&#8221;).</p><p>In the post <a href="https://recsysml.substack.com/p/friend-recommendation-retrieval-in">friend recommendation retrieval in a social network</a> we have also covered embedding based retrieval. In this post we will focus on scalability and target side value.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://recsysml.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Applied ML | Recommender systems! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Scalability</h2><p>As noted in the <a href="https://recsysml.substack.com/p/friend-recommendation-retrieval-in">previous post</a>, one of the most challenging aspects of embedding-based retrieval is scalability: the query and candidate sets exceed 5 billion for large social networks.</p><p><strong>Key insight:</strong> The retention value of people/friending recommendations is highest for users who are still building meaningful connections, i.e. &#8220;graph builders&#8221;.</p><p>We will use this insight below to engineer a low-capacity, high-ROI retrieval system.</p><h2>Target side value</h2><p>Traditionally, write-ups on <a href="https://recsysml.substack.com/p/two-tower-models-for-retrieval-of">two tower models</a> discuss only viewer &#8594; best recommendations, because most were written to maximize viewer value.
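As a minimal sketch of this viewer &#8594; recommendations lookup (the array names and brute-force dot-product scoring are illustrative assumptions; production systems query an approximate nearest-neighbor index over precomputed candidate embeddings):

```python
import numpy as np

def knn_recommend(viewer_emb, target_embs, k=10):
    """Return indices of the k targets whose embeddings best match the viewer.

    viewer_emb: shape (d,), the query-tower output for one viewer.
    target_embs: shape (n, d), candidate-tower outputs for all targets.
    """
    # Dot-product similarity between the viewer and every target, shape (n,)
    scores = target_embs @ viewer_emb
    # Indices of the k highest-scoring targets, best first
    return np.argsort(-scores)[:k]

# Toy example: 3 targets in a 2-d embedding space
viewer = np.array([1.0, 0.0])
targets = np.array([[0.9, 0.1],
                    [0.1, 0.9],
                    [1.0, 0.2]])
print(knn_recommend(viewer, targets, k=2))  # -> [2 0]
```

In practice the target embeddings are computed offline and served from an ANN index such as FAISS or ScaNN rather than scored exhaustively.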
You will see a diagram like below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M3Rv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M3Rv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png 424w, https://substackcdn.com/image/fetch/$s_!M3Rv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png 848w, https://substackcdn.com/image/fetch/$s_!M3Rv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!M3Rv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M3Rv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png" width="465" height="714.0633245382586" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1164,&quot;width&quot;:758,&quot;resizeWidth&quot;:465,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M3Rv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png 424w, https://substackcdn.com/image/fetch/$s_!M3Rv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png 848w, https://substackcdn.com/image/fetch/$s_!M3Rv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!M3Rv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd922d31-ce9c-4931-8818-9cf5d9711edb_758x1164.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Schematic of two tower inference for viewer value. It uses K-nearest-neighbors to find recommendations using the query embedding.</figcaption></figure></div><p></p><p>Even when accounting for candidate side value like the <a href="https://recsysml.substack.com/p/personalized-short-video-recommender">multi-stage system in short-video</a>, the goal is to handle uncertainty of newer candidates. 
It is not to deliver value to candidates at the same level of importance as viewers.</p><p>If targets are as important, one option is to run a similar query with each target &#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1V9n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1V9n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png 424w, https://substackcdn.com/image/fetch/$s_!1V9n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png 848w, https://substackcdn.com/image/fetch/$s_!1V9n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!1V9n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1V9n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png" width="594" height="677.3452115812918" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:898,&quot;resizeWidth&quot;:594,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1V9n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png 424w, https://substackcdn.com/image/fetch/$s_!1V9n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png 848w, https://substackcdn.com/image/fetch/$s_!1V9n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!1V9n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8ec41a-f34a-48e4-a177-bcfdd3e4102a_898x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: Parallel K-Nearest-Neighbors (KNN) for queries and items to find best recs for both sets.</figcaption></figure></div><p></p><p>&#8230; and flip to add the targets to recommended lists for each viewer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oMPN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oMPN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png 424w, 
https://substackcdn.com/image/fetch/$s_!oMPN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png 848w, https://substackcdn.com/image/fetch/$s_!oMPN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png 1272w, https://substackcdn.com/image/fetch/$s_!oMPN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oMPN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png" width="442" height="443.07542579075425" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:824,&quot;width&quot;:822,&quot;resizeWidth&quot;:442,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oMPN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png 424w, 
https://substackcdn.com/image/fetch/$s_!oMPN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png 848w, https://substackcdn.com/image/fetch/$s_!oMPN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png 1272w, https://substackcdn.com/image/fetch/$s_!oMPN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062b0fe2-ee66-46ed-9298-4b066e722c02_822x824.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3: Since viewers receive the recommendations, we still have to flip the output of target KNN so that we can find the viewers to whom we should recommend these targets.</figcaption></figure></div><p></p><p>By now we have the basic argument in place; what is missing is scalability. Within the capacity, latency, and time budgets of modern social networks, it is not feasible to run 5B+ KNN queries against 5B+ candidates. Instead, we can identify the cohorts most in need of good recommendations and run the KNNs only for them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SHtT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SHtT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png 424w, https://substackcdn.com/image/fetch/$s_!SHtT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png 848w, https://substackcdn.com/image/fetch/$s_!SHtT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png 1272w, https://substackcdn.com/image/fetch/$s_!SHtT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SHtT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png" width="1254" height="1166" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1166,&quot;width&quot;:1254,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SHtT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png 424w, https://substackcdn.com/image/fetch/$s_!SHtT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png 848w, https://substackcdn.com/image/fetch/$s_!SHtT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png 1272w, https://substackcdn.com/image/fetch/$s_!SHtT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9623f659-3cbc-4b5d-b2c7-ed7c018d2035_1254x1166.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4: For scalability, we can independently limit queries to viewers and targets who most need people recommendations.</figcaption></figure></div><p></p><h2>Algorithm: Maximizing target participation.</h2><p>One part of this that we hand waved over above is how to &#8220;flip&#8221; the list of tgt &#8594; [list of recommended viewers] to viewer &#8594; list of targets.</p><p>A naive solution would be to just do it in Presto, etc., but what potential problems could arise with this approach?</p><ol><li><p>Viewer flooding: Some viewers are recommended too many graph-building targets. 
This might degrade their experience and cause recommendation blindness (similar to <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43887.pdf">ad blindness</a>).</p></li><li><p>Target starvation: Some targets might receive little &#8220;attention&#8221; because they appear only in the lists of viewers who already have too many graph-building targets.</p></li></ol><p>This is a variant of the &#8220;Set-Cover&#8221; problem, which is <a href="https://www.cs.cornell.edu/courses/cs482/2007su/NPComplete.pdf">NP-complete</a>. However, below we show an approximation that works well at large scale.</p><p></p><p><strong>Solution:</strong> Let&#8217;s say your target-side KNN produces a table named <em>tgt_to_top_ten_viewers_knn_table</em> with columns &#8220;target_id&#8221; and &#8220;viewers&#8221;, an array of viewer_ids. The algorithm below maximizes the number of targets that appear in some viewer&#8217;s list and receive attention from them. The basic idea is to cap each viewer at roughly 3 graph-building targets in expectation.</p><pre><code>-- Expand tgt_to_top_ten_viewers_knn_table into one row
-- per (target_id, viewer_id) pair
WITH expanded_table AS (
  SELECT
    target_id,
    viewer_id
  FROM tgt_to_top_ten_viewers_knn_table
  CROSS JOIN UNNEST(viewers) AS t(viewer_id)
),
-- Count how many target lists each viewer appears in
viewer_target_count AS (
  SELECT
    viewer_id,
    COUNT(*) AS viewer_app_count
  FROM expanded_table
  GROUP BY viewer_id
),
-- Keep each (target, viewer) pair with probability
-- LEAST(1, 3 / viewer_app_count), so every viewer retains
-- about 3 graph-building targets in expectation
selected_targets AS (
  SELECT
    et.target_id,
    et.viewer_id,
    vtc.viewer_app_count
  FROM expanded_table et
  JOIN viewer_target_count vtc ON et.viewer_id = vtc.viewer_id
  WHERE RAND() &lt; LEAST(1, 3.0 / vtc.viewer_app_count)
)
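-- Note: RAND() makes the selection non-deterministic across runs.
-- For reproducible output, threshold a deterministic hash of
-- (viewer_id, target_id) scaled to [0, 1) instead of RAND().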
-- Group the selected targets by viewer_id and 
-- aggregate the target_ids
SELECT 
  viewer_id,
  ARRAY_AGG(DISTINCT target_id) AS target_ids
FROM selected_targets
GROUP BY viewer_id;</code></pre><p></p><h2>Conclusion</h2><p>Model based retrieval is powerful in social network recommendations like Linkedin, Snap and Facebook. Noting that these recommendations deliver value to both viewers and the targets recommended can maximize retention for the app. In the article we propose approaches to deliver such viewer and target side value while handling scalability challenges of searching large sets of users.</p><p></p><p>We hope this spurs your imagination the next time you are thinking about building model based retrieval.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://recsysml.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Applied ML | Recommender systems! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</em></p>]]></content:encoded></item><item><title><![CDATA[Friend Recommendation Retrieval in a social network]]></title><description><![CDATA[From Graph Search to Deep Neural Two-Tower Models]]></description><link>https://recsysml.substack.com/p/friend-recommendation-retrieval-in</link><guid isPermaLink="false">https://recsysml.substack.com/p/friend-recommendation-retrieval-in</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Sun, 24 Nov 2024 16:31:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mMxl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>Introduction</strong></h3><p>Social networking platforms, e.g. <a href="https://www.linkedin.com/blog/engineering/recommendations/building-a-large-scale-recommendation-system-people-you-may-know">LinkedIn:PYMK</a> and <a href="https://transparency.meta.com/features/explaining-ranking/ig-suggested-accounts/">IG:SA</a>, help users find friends on the platform and forge meaningful connections. Traditionally, an effective approach for friend recommendations has been to suggest "friends of friends" i.e. using mutual connections as a proxy for relevance. This article begins by exploring graph search as a way to implement this baseline and traces the evolution of friend recommendation systems beyond this foundational approach. 
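As a toy illustration of that mutual-connections baseline (the in-memory adjacency dict and function name are assumptions for clarity; at social-network scale this runs as a distributed graph query):

```python
from collections import Counter

def friends_of_friends(user, adj, k=10):
    """Rank non-friends of `user` by their number of mutual connections.

    adj: dict mapping each user id to the set of their friends' ids.
    """
    counts = Counter()
    for friend in adj[user]:
        for fof in adj[friend]:
            # Skip the user themselves and existing friends
            if fof != user and fof not in adj[user]:
                counts[fof] += 1
    # Most mutual connections first
    return [u for u, _ in counts.most_common(k)]

adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d", "e"},
    "d": {"b", "c"},
    "e": {"c"},
}
print(friends_of_friends("a", adj))  # -> ['d', 'e']  (d has 2 mutuals, e has 1)
```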
It concludes with a discussion of <a href="https://github.com/gauravchak/two_tower_models">Two Tower</a> models, which leverage compute power and data to drive greater accuracy in retrieval.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://recsysml.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Applied ML | Recommender systems! Subscribe for free to receive ideas on the state of the art in AI / ML.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><h3><strong>Graph Search: The First Principles Approach</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mMxl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mMxl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png 424w, https://substackcdn.com/image/fetch/$s_!mMxl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png 848w, 
https://substackcdn.com/image/fetch/$s_!mMxl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png 1272w, https://substackcdn.com/image/fetch/$s_!mMxl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mMxl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png" width="433" height="362.28391959798995" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:666,&quot;width&quot;:796,&quot;resizeWidth&quot;:433,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mMxl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png 424w, https://substackcdn.com/image/fetch/$s_!mMxl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png 848w, 
https://substackcdn.com/image/fetch/$s_!mMxl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png 1272w, https://substackcdn.com/image/fetch/$s_!mMxl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ed596d-22f0-42d7-8838-691bb914a94a_796x666.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 1: Showing two hop paths from C. First hop nodes are colored purple and Second hop are colored blue. 
The number of 2-hop paths converging on each node is written alongside the node. Here all paths have been weighted the same. We can consider an extension where paths are weighted lower if they travel through nodes with high degree. For instance, the weight of path C,A,J could be &#8531; and that of C,E,D could be &#8533;, based on the degree of the 1-hop node. There are other approaches to weighting as well; in <a href="https://tkipf.github.io/graph-convolutional-networks/">Kipf et al. 2017</a> they show 1/sqrt(degree) to work well.</figcaption></figure></div><p>To implement &#8220;Friends of Friends&#8221; at scale, one approach, e.g. <a href="https://research.facebook.com/publications/unicorn-a-system-for-searching-the-social-graph/">FB:Unicorn</a>, has been to create a graph datastore of relationships that enables counting the number of two-hop paths between a source user and other nodes. Ranking nodes by the number of two-hop paths is akin to ranking by mutual-friend count, providing a basic but efficient recommendation mechanism.</p><h3><strong>Improving Graph Search with Weighted Paths</strong></h3><p>Besides scaling up, one can use weighted paths in the graph between the viewer and the recommended friend to refine results. By assigning weights to edges (friendship connections) and nodes (users) based on various attributes, platforms can calculate a "path score" for potential connections. These scores are derived from the cumulative weight of paths ending at a given node, allowing the platform to rank potential connections by relevance. (<a href="https://link.springer.com/article/10.1007/s11432-017-9243-7">Gong et al. 
2017</a>)</p><p>For weighting, one might consider incorporating factors like the interaction frequency of the users on an edge, connection recency (how recently that connection was made), common interests, or shared communities. For example, we might boost an edge&#8217;s weight if the two users have shared a conversation in the last 7 days, effectively increasing the likelihood of sourcing that candidate. The hypothesis here is that a user with multiple paths of highly weighted connections is more likely to be a relevant friend recommendation. Different weighting hypotheses can be validated through online experimentation, making this approach more robust. This weighted-path method aligns with traditional graph search while enhancing it with more granular social signals.</p><h3><strong>Embedding-Based Approaches: DeepWalk and Node2Vec</strong></h3><p>Credit where it is due: two seminal papers that demonstrated the feasibility of embedding techniques are <a href="https://arxiv.org/abs/1403.6652">DeepWalk</a> [4] and <a href="https://arxiv.org/abs/1607.00653">Node2Vec</a> [5]. These methods sample paths and learn latent representations (embeddings) of nodes based on their connectivity patterns, capturing the structure of the network more efficiently than traditional graph search. This embedding-based approach effectively reconstructs paths in the graph, with similar embeddings indicating a higher likelihood of connection. 
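</p><p>To make the walk-sampling step concrete, here is a minimal sketch (our illustration, not code from either paper) of DeepWalk-style corpus generation; the resulting walks would then be fed to a skip-gram model to learn node embeddings.</p>

```python
import random

def random_walks(adj, num_walks=10, walk_len=5, seed=0):
    """DeepWalk-style corpus generation: uniform random walks over the
    friendship graph; each walk is treated like a sentence for skip-gram."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            for _step in range(walk_len - 1):
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy friendship graph (hypothetical): D is a friend-of-friend of A via C.
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
corpus = random_walks(adj)
```

<p>In practice one would run many more walks per node and train a skip-gram model such as gensim&#8217;s Word2Vec on this corpus. 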
Sharing two images of DeepWalk below to build intuition.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jUOL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jUOL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png 424w, https://substackcdn.com/image/fetch/$s_!jUOL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png 848w, https://substackcdn.com/image/fetch/$s_!jUOL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!jUOL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jUOL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png" width="457" height="370.2028776978417" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1126,&quot;width&quot;:1390,&quot;resizeWidth&quot;:457,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jUOL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png 424w, https://substackcdn.com/image/fetch/$s_!jUOL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png 848w, https://substackcdn.com/image/fetch/$s_!jUOL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!jUOL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a6f387-a90c-4647-83ec-7f1006749242_1390x1126.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5S8I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5S8I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png 424w, https://substackcdn.com/image/fetch/$s_!5S8I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png 848w, 
https://substackcdn.com/image/fetch/$s_!5S8I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png 1272w, https://substackcdn.com/image/fetch/$s_!5S8I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5S8I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png" width="727" height="304.58104395604397" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:610,&quot;width&quot;:1456,&quot;resizeWidth&quot;:727,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5S8I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png 424w, https://substackcdn.com/image/fetch/$s_!5S8I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png 848w, 
https://substackcdn.com/image/fetch/$s_!5S8I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png 1272w, https://substackcdn.com/image/fetch/$s_!5S8I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb955266f-b7a7-44b6-a09e-e7731496f034_1474x618.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figures 2 &amp; 3 from <a href="https://arxiv.org/abs/1403.6652">DeepWalk paper</a> show how they are building 
intuition from paths to clusters to embeddings.</figcaption></figure></div><p>DeepWalk, which uses random walks on the graph to generate training samples, laid the foundation for this approach. Node2Vec extends it by enabling biased sampling, which helps control the balance between exploring global and local structures. These methods allow platforms to efficiently compute similarity scores between users, which are then used to make friend recommendations.</p><h3><strong>Clustering with Spectral Analysis</strong></h3><p>Spectral clustering on the friendship graph is another approach to friend recommendations, particularly useful for identifying groups of users likely to share interests or connections. In this method, clusters are formed based on dense areas in the graph, often revealing latent communities. Nodes with many paths ending on them, but without direct connections, are likely in the same cluster as the source node, making them promising friend recommendations.</p><p>Though historically computationally complex, some recent approaches to spectral clustering are more efficient: AROPE, NetSMF, and ProNE ([8], [9], [10] respectively). For example, AROPE scales linearly with graph size, and NetSMF reports taking &#8220;only&#8221; 24 hours to train on a network of tens of millions of nodes.</p><p>Spectral clustering can yield high-quality recommendations by grouping users into meaningful social clusters, but due to its high computational complexity it is no longer popular for generating production-scale friending recommendations. It is, however, very useful for reducing network-interference bias in A/B tests of friending recommendation models: the core idea is to use spectral clustering to identify dense user groups and assign each entire group to either treatment or control [<a href="https://arxiv.org/pdf/1903.08755">14</a>]. 
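</p><p>As a small sketch of that idea (the function name and salt are hypothetical), the assignment can simply hash the cluster id so that every member of a cluster lands in the same arm:</p>

```python
import hashlib

def arm_for_cluster(cluster_id: str, salt: str = "friend-rec-exp-1") -> str:
    """Assign a whole user cluster to one experiment arm so that densely
    connected users share an arm, reducing network interference."""
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Every user inherits the arm of their cluster, not of their own user id.
user_to_cluster = {"u1": "c9", "u2": "c9", "u3": "c4"}
arms = {u: arm_for_cluster(c) for u, c in user_to_cluster.items()}
```

<p>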
Running A/B tests at a coarser granularity (at a user-group level instead of at a user level) helps reduce network interference in friending systems.</p><p>Challenges remain, though: at the billion-node scale these approaches can still be problematic, and fine-tuning models with updated data can be challenging.</p><h3><strong>Two Tower Models: An Extension of Clustering Approaches</strong></h3><p>Two Tower models ([<a href="https://github.com/gauravchak/two_tower_models">7</a>], [<a href="https://arxiv.org/abs/2407.13218">1</a>], [<a href="https://www.linkedin.com/blog/engineering/recommendations/candidate-generation-in-a-large-scale-graph-recommendation-system-people-you-may-know">2</a>]), also the workhorse of ads and video recommendations, are a natural evolution from clustering techniques: they learn a representation for each user independently, which is then used to predict likely friend connections. In this architecture, one "tower" processes the features of the viewing user (i.e., the one looking for new friends), while the other tower processes features of the candidate user. 
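</p><p>A minimal numpy sketch of the two-tower scoring path (the shapes and the single dense layer per tower are illustrative assumptions, not a production architecture):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32  # embedding dimension (hypothetical)

def tower(features, weights):
    """One 'tower': map raw features to an L2-normalized embedding.
    A single dense layer stands in for a real multi-layer network."""
    emb = np.tanh(features @ weights)
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)

w_viewer = rng.normal(size=(8, DIM))   # viewer-tower parameters
w_cand = rng.normal(size=(8, DIM))     # candidate-tower parameters

viewer_emb = tower(rng.normal(size=(1, 8)), w_viewer)   # one viewer
cand_emb = tower(rng.normal(size=(100, 8)), w_cand)     # 100 candidates

scores = (cand_emb @ viewer_emb.T).ravel()  # cosine similarity per candidate
top10 = np.argsort(-scores)[:10]            # retrieve the 10 best candidates
```

<p>In production the top-k step is typically served by an approximate-nearest-neighbor index rather than a full scan over candidates. 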
The model is trained to maximize the similarity between embeddings of users who are friends and minimize it for non-friends.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y0fA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y0fA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png 424w, https://substackcdn.com/image/fetch/$s_!Y0fA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png 848w, https://substackcdn.com/image/fetch/$s_!Y0fA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png 1272w, https://substackcdn.com/image/fetch/$s_!Y0fA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y0fA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png" width="1366" height="358" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:1366,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y0fA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png 424w, https://substackcdn.com/image/fetch/$s_!Y0fA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png 848w, https://substackcdn.com/image/fetch/$s_!Y0fA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png 1272w, https://substackcdn.com/image/fetch/$s_!Y0fA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1f7f89-aa13-4a2a-988a-5ab2191bf0eb_1366x358.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 4: Two Tower models (schematic inspired from <a href="https://www.linkedin.com/pulse/personalized-recommendations-iv-two-tower-models-gaurav-chakravorty/">post</a>)</figcaption></figure></div><p></p><h4><strong>Pros of Two Tower Models over Graph Search</strong></h4><p>Two Tower models offer several advantages over traditional graph search:</p><ul><li><p><strong>Nuanced Relationship Capture</strong>: These models can learn complex, latent relationships by processing a variety of features and considering recency or frequency of interactions, which graph search approaches struggle to incorporate effectively.</p></li><li><p><strong>Multi-Task Potential</strong>: Since Two Tower models capture user relationships in embeddings, they are flexible enough to be used for tasks beyond friend recommendations, like predicting message frequency or engagement with friends' content. 
This multi-task ability leads to enhanced alignment between the sourced candidates and the final ranking model; both can be trained to value the same interactions and training samples.</p></li><li><p><strong>Scalability</strong>: Unlike graph search, which can become computationally intensive at scale, Two Tower models are optimized for distributed computation, making them well-suited for large networks.</p></li><li><p><strong>Compute-Enhanced Accuracy</strong>: Two Tower models allow state-of-the-art architectural techniques, like transformers or mixture-of-experts, to complement increased training data, boosting accuracy in retrieval.</p></li></ul><p>Two Tower models for friend recommendations are a deep area of research, and there is a lot more to say about them. For instance, the multi-task ability above is a rich and impactful vein of exploration.</p><h3><strong>Priming Friend Recommendations with Existing Connections</strong></h3><p>One effective way to enhance a Two Tower model for friend recommendation is to prime the training dataset with existing friends. By including all current connections as positive samples, the model can better understand what a "friendship" looks like, helping it differentiate friends from non-friends more effectively. This approach enriches the embedding space, offering the model a broader understanding of social connections and improving its performance in predicting new friendships. It is even more useful if the embedding table of user and target ids is shared with other sparse features that use user ids. During inference with K-nearest neighbors, you will need to filter current friends out of the new-friend recommendations. Open research question: Can this be avoided? 
Can we directly train embeddings which are close for new friends but aren&#8217;t for existing friends?</p><h3><strong>Using Impact Weighting for Topline Alignment and to Prevent Embedding Collapse</strong></h3><p>A risk in embedding-based approaches is that the embeddings may <a href="https://arxiv.org/abs/2310.04400">collapse</a> into a low-rank representation, meaning they do not adequately capture the uniqueness of individual users. Collapsed embeddings are often overly fixated on high-activity "power users," limiting the model's ability to recommend relevant connections for a broader user base. Put another way, the neural net is learning the simplest solution to a challenging problem.</p><p>To address this, we need to force the model to avoid the simple solution by making the loss depend less on these &#8220;easy&#8221; training examples. Try weighting each example by the inverse square root of the friend counts of the user and the target in the Two Tower training data [<a href="https://arxiv.org/abs/1609.02907">3</a>]. This focuses the model on users who are likely to derive greater value from new friendships. By emphasizing these users, you not only mitigate the risk of embedding collapse but also develop an embedding space that aligns more closely with business goals, improving the quality and diversity of friend recommendations and supporting meaningful connections for a wider range of users.</p><p>Embedding collapse is a deep topic, e.g. [<a href="https://arxiv.org/abs/2310.04400">6</a>], and we will share a separate post on it.</p><h3><strong>Blueprint for Adding Two Tower Models to Graph Search Implementations</strong></h3><p>If you currently have a friend-of-friends implementation and are considering adding Two Tower-based retrieval, it can be a challenging transition. Two Tower model development is a new skill for many teams, and without a proper validation path, delivering top-line results may take time. 
The following trajectory is suggested:</p><ol><li><p><strong>Optimize Offline Hit Rate</strong>: Improve the model to achieve a strong offline hit rate at rank 1, measured with a batch size of 1024, targeting a hit rate of 0.7.</p></li><li><p><strong>Begin with Offline Inference</strong>: Before deploying embedding-based retrieval in production, start with offline computations.</p><ul><li><p>Compute embeddings for both user and candidate towers.</p></li><li><p>For each user, generate the top 100 candidates.</p></li><li><p>For each candidate, generate the top 100 users, and cross-reference in SQL to produce candidate lists for users.</p></li></ul></li><li><p><strong>Validate Recall</strong>: Use the candidate lists to evaluate recall against ground truth (organically added friends) in offline experiments. This recall should be measured at the retrieval stage, as the ranking model has not yet been trained on this distribution.</p></li><li><p><strong>Develop Ranking Features</strong>: Using the lists from the retrieval stage, develop ranking features for friend recommendations. Validate that these features improve performance by reducing offline normalized entropy.</p></li><li><p><strong>Generator efficiency</strong>: After step 4, check whether the impression rate of candidates from your Two Tower model is higher than their share of the retrieval distribution. If it is, the ranking layer considers your generator&#8217;s candidates better than the alternatives.</p></li></ol><p>If that is not happening, it is too soon to expect topline gains, and you should iterate.</p><p>More generally on embedding-based retrieval, Snap&#8217;s retrieval team found ([15]) that adding Embedding-Based Retrieval (EBR) increased friendships made from friend recommendations by 5% to 10%, and that the overlap of these with graph search (Friends-of-Friends) is low. In a follow-up ([16]) they found an 11% improvement in friends-made-with-communication. This can be seen as a hybrid of the two approaches. 
Here they sample (up to) 5 of the friends of the viewer and find nearest neighbors with them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e506!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb94f259c-00b7-488f-912d-7e999c411179_1272x884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e506!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb94f259c-00b7-488f-912d-7e999c411179_1272x884.png 424w, https://substackcdn.com/image/fetch/$s_!e506!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb94f259c-00b7-488f-912d-7e999c411179_1272x884.png 848w, https://substackcdn.com/image/fetch/$s_!e506!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb94f259c-00b7-488f-912d-7e999c411179_1272x884.png 1272w, https://substackcdn.com/image/fetch/$s_!e506!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb94f259c-00b7-488f-912d-7e999c411179_1272x884.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e506!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb94f259c-00b7-488f-912d-7e999c411179_1272x884.png" width="1272" height="884" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b94f259c-00b7-488f-912d-7e999c411179_1272x884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:884,&quot;width&quot;:1272,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e506!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb94f259c-00b7-488f-912d-7e999c411179_1272x884.png 424w, https://substackcdn.com/image/fetch/$s_!e506!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb94f259c-00b7-488f-912d-7e999c411179_1272x884.png 848w, https://substackcdn.com/image/fetch/$s_!e506!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb94f259c-00b7-488f-912d-7e999c411179_1272x884.png 1272w, https://substackcdn.com/image/fetch/$s_!e506!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb94f259c-00b7-488f-912d-7e999c411179_1272x884.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 5: Illustration of sampling 1-hop before embedding based retrieval in [<a href="https://dl.acm.org/doi/10.1145/3626772.3661367">16</a>]</figcaption></figure></div><p></p><h3><strong>Conclusion</strong></h3><p>The journey from graph search to Two Tower models represents a significant advancement in friend recommendation systems, enabling platforms to deliver recommendations that are both accurate and meaningful. While graph search techniques provide a solid foundation, the advent of embedding-based and deep neural models like Two Tower architectures has opened new avenues for friend retrieval, allowing social networks to capture the nuances of human connections in increasingly sophisticated ways. 
As these systems continue to evolve, they will foster richer, more personalized experiences that mirror the dynamic nature of social relationships.</p><p>Though we discuss Two Tower models in the Friending use case, this architecture lends itself to efficient, scalable ad &amp; video recommendation, facial recognition, and, more generally, learned embedding database search.</p><p></p><h3><strong>References &amp; Further reading</strong></h3><ol><li><p><a href="https://arxiv.org/abs/2407.13218">[2407.13218] LiNR: Model Based Neural Retrieval on GPUs at LinkedIn</a></p></li><li><p><a href="https://www.linkedin.com/blog/engineering/recommendations/candidate-generation-in-a-large-scale-graph-recommendation-system-people-you-may-know">Candidate Generation in a Large Scale Graph Recommendation System: People You May Know</a></p></li><li><p><a href="https://arxiv.org/abs/1609.02907">[1609.02907] Semi-Supervised Classification with Graph Convolutional Networks</a>: the authors propose a layer-wise propagation rule that includes a normalization factor 1/&#8730;(d<sub>i</sub>d<sub>j</sub>), where d<sub>i</sub> and d<sub>j</sub> are the degrees of the nodes connected by an edge. This normalization helps to stabilize training by scaling the contributions of each node's neighbors according to their degrees.
This motivation was used above in the weighted paths idea.</p></li><li><p><a href="https://arxiv.org/abs/1403.6652">[1403.6652] DeepWalk: Online Learning of Social Representations</a>&nbsp;</p></li><li><p><a href="https://arxiv.org/abs/1607.00653">[1607.00653] node2vec: Scalable Feature Learning for Networks</a></p></li><li><p><a href="https://arxiv.org/abs/2310.04400">[2310.04400] On the Embedding Collapse when Scaling up Recommendation Models</a></p></li><li><p><a href="https://github.com/gauravchak/two_tower_models">GitHub - gauravchak/two_tower_models: Repo to guide implementation of Two Tower models</a></p></li><li><p><a href="https://pengcui.thumedialab.com/papers/NE-ArbitraryProximity.pdf">AROPE - Arbitrary-Order Proximity Preserved Network Embedding&nbsp;</a></p></li><li><p><a href="https://arxiv.org/abs/1906.11156">[1906.11156] NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization</a></p></li><li><p><a href="https://www.ijcai.org/proceedings/2019/0594.pdf">ProNE: Fast and Scalable Network Representation Learning</a></p></li><li><p><a href="https://github.com/facebookresearch/faiss">FAISS</a> - Facebook AI Similarity Search</p></li><li><p><a href="https://link.springer.com/article/10.1007/s11432-017-9243-7">Integrating a weighted-average method into the random walk framework to generate individual friend recommendations - Gong et al. 
2017</a></p></li><li><p><a href="https://tkipf.github.io/graph-convolutional-networks/">Graph Convolutional Networks | Thomas Kipf | Google DeepMind</a></p></li><li><p><a href="https://arxiv.org/pdf/1903.08755">[1903.08755] Using Ego-Clusters to Measure Network Effects at LinkedIn</a></p></li><li><p><a href="https://dl.acm.org/doi/10.1145/3539618.3591848">Embedding Based Retrieval in Friend Recommendation (Snap 2023)</a></p></li><li><p><a href="https://dl.acm.org/doi/10.1145/3626772.3661367">Improving Embedding-Based Retrieval in Friend Recommendation with ANN Query Expansion (Snap 2024)</a></p></li></ol><div id="youtube2-checTInZguM" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;checTInZguM&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/checTInZguM?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</p>]]></content:encoded></item><item><title><![CDATA[Declarative Value-Model Tuning]]></title><description><![CDATA[Code to show a couple of approaches to achieve the desired task importance in value model]]></description><link>https://recsysml.substack.com/p/declarative-value-model-tuning</link><guid isPermaLink="false">https://recsysml.substack.com/p/declarative-value-model-tuning</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Tue, 10 Sep 2024 06:02:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/xMQEFyNzsHc" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Normally Value Models in recommender systems are hand tuned. We provide a couple of utilities to derive VM weights from targeted task importance.</p><p></p><h3><strong>Context</strong></h3><p>Value model (VM) weights are used to combine multiple task predictions in a recommender system. For instance the following could be a config to produce a ranked list by (0.1 * P(watch &gt; 3s) + 0.3 * P(watch &gt; 30s) + 20 * P(watch &amp; share) + 2 * P(watch &amp; like) + 5 * P(watch &amp; follow)).</p><pre><code><code>{
  "weights": {
    "p_watch_3s": 0.1,
    "p_watch_30s": 0.3,
    "p_watch_and_share": 20,
    "p_watch_and_like": 2,
    "p_watch_and_follow": 5
  }
}</code></code></pre><p>More info on VM <a href="https://recsysml.substack.com/p/ranking-model-calibration-in-recommender">here</a> and <a href="https://arxiv.org/abs/2208.04560">here</a>.</p><h3>Normal workflow</h3><p>Normally, VM weights are tuned either by grid search or by multiplying the current task weight by 2 or 1/2 and running experiments.</p><p>It would be empowering if practitioners had a tool to specify the desired importance of each task and compute the VM weights from it. The result could be a great baseline to jump to and then search around.</p><h3>Two approaches to task importance</h3><p>In <a href="https://github.com/gauravchak/value_model_tuning">github.com/gauravchak/value_model_tuning</a>, we look at two approaches to declarative VM tuning.</p><ol><li><p><strong>NDCG Gap Targeting:</strong> This computes a leave-one-out ranking for each task and then computes the NDCG gap of this ranking from the current ranking. This gap, or delta, is the importance of that task. The tool then adjusts weights to achieve the desired relative gaps.</p></li><li><p><strong>Per-task regret targeting:</strong> In the spirit of <a href="https://www.kdd.org/kdd2020/accepted-papers/view/the-nodehopper-enabling-low-latency-ranking-with-constraints-via-a-fast-dua.html">this Google-Deepmind paper</a>, this measures, per task, the regret of the current weights compared to ranking purely by that task, i.e., how much worse the current ranking is doing.
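A minimal sketch of this per-task regret (pure Python; the objective in the repo and the paper is richer, and here regret is simply measured as top-k expected value lost versus ranking purely by that task):

```python
def vm_score(pred, weights):
    """Value-model score: weighted sum of task predictions,
    mirroring the JSON config above."""
    return sum(weights[t] * p for t, p in pred.items())

def per_task_regret(items, weights, k=1):
    """For each task: expected value (sum of that task's probabilities
    over the top-k items) lost by ranking with the current VM weights
    instead of ranking purely by that task."""
    by_vm = sorted(items, key=lambda p: vm_score(p, weights), reverse=True)
    regrets = {}
    for t in weights:
        by_t = sorted(items, key=lambda p: p[t], reverse=True)
        best = sum(p[t] for p in by_t[:k])      # best achievable for task t
        got = sum(p[t] for p in by_vm[:k])      # achieved by current weights
        regrets[t] = best - got
    return regrets
```

A task whose regret is near zero is already dominating the ranking; a large regret means the current weights are sacrificing that task.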
It then adjusts weights so that these per-task regrets have the relative importance specified by the user.</p></li></ol><p></p><div id="youtube2-xMQEFyNzsHc" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;xMQEFyNzsHc&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/xMQEFyNzsHc?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of or attributable to their current or any prior employer(s).
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</em></p>]]></content:encoded></item><item><title><![CDATA[Ranking model calibration in recommender systems]]></title><description><![CDATA[We define calibration of ranking models, the benefit that prioritizing calibration can bring, and how to achieve it without affecting normalized cross entropy / AUC metrics.]]></description><link>https://recsysml.substack.com/p/ranking-model-calibration-in-recommender</link><guid isPermaLink="false">https://recsysml.substack.com/p/ranking-model-calibration-in-recommender</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Sun, 09 Jun 2024 00:27:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We show the importance of calibration in ranking models and how to implement it efficiently.</p><h2>Context setting</h2><p>Most recommender systems have a multi-task estimator model that estimates the probability of various user actions on the recommendation. After that, there is usually a &#8220;value model&#8221; (a.k.a. <a href="https://arxiv.org/abs/2208.04560">multi-task fusion</a>) to combine these into a single score to rank by.
However, as we will show below the emitted probabilities might not be calibrated (explained below) with the observed probabilities.</p><blockquote><p>Fixing model calibration can improve the topline metrics of your recsys.<br>(see benefit section)</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6ZjU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6ZjU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png 424w, https://substackcdn.com/image/fetch/$s_!6ZjU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png 848w, https://substackcdn.com/image/fetch/$s_!6ZjU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!6ZjU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6ZjU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png" width="1456" height="1003" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1003,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105687,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6ZjU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png 424w, https://substackcdn.com/image/fetch/$s_!6ZjU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png 848w, https://substackcdn.com/image/fetch/$s_!6ZjU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!6ZjU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef57cd3-7fb4-4693-9702-f6e8066c9565_1620x1116.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 1: Multi-task fusion / Value model is used often in modern recommender system. See <a href="https://github.com/gauravchak/causal_debiased_ranking/blob/main/src/top_item_selector.py#L73">code:L73</a> for an example implementation. 
When it is used, it basically assumes that your ranking model / estimator is self-calibrated.</figcaption></figure></div><p><strong>Optional Context: </strong>For readers interested in late stage ranking, various aspects have been covered in</p><ul><li><p>[<a href="https://recsysml.substack.com/p/does-your-model-get-better-at-task">ranking_basics</a>]</p></li><li><p>[<a href="https://recsysml.substack.com/p/how-to-reduce-cost-of-ranking-by">interplay-with-ESR</a>]</p></li><li><p>[<a href="https://recsysml.substack.com/p/reducing-selection-bias-popularity">reducing biases</a>]</p></li><li><p>[<a href="https://recsysml.substack.com/p/user-preference-modeling-in-a-recommender">how to model userid</a>]</p></li><li><p>[<a href="https://recsysml.substack.com/p/retention-modeling-at-feed-entry">how to rank for long-term user satisfaction</a>].</p></li></ul><p></p><p><strong>Code &amp; Video</strong></p><p>PTAL code here: <a href="https://github.com/gauravchak/calibration_arch_in_ranking_mtml/">github - gauravchak/calibration_arch_in_ranking_mtml</a></p><div id="youtube2-kVFkEOieqNg" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;kVFkEOieqNg&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/kVFkEOieqNg?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>What will calibration bring you in recsys?</h2><p>But hang on
&#8230; you say: these &#8220;late stage ranking&#8221; models are trained against binary user labels with binary cross-entropy loss, so do we really need them to emit actual probabilities, given there is a value model on top? </p><p>Technically, yes, you would be right: if this isn&#8217;t ads ranking or feed blending, you don&#8217;t strictly need the outputs to be probabilities. Any score that is comparable between two options will work to order them. However, since you have a value model (a.k.a. multi-task fusion model) that combines these scores, calibration provides a sort of decoupling: a contract, if you will, about the scale and distribution between your model and your value model. </p><blockquote><p>Without calibration, many of your ranking model accuracy improvements <strong>will fail to be launched</strong> because they change the scale / distribution, especially for some country or user cohort. This will make your experiments look unintentionally soft in metrics.</p></blockquote><p>In addition, since the goal of calibration is to provide the true likelihood of the predicted outcomes, it can directly help your ranking: wouldn't you prefer to put in the top position the item you believe has the highest likelihood? For example, <a href="https://arxiv.org/pdf/2211.01494">Youtube</a> found that using calibration increased their search CTR by 0.66%. This might seem small, but in the context of a model like Youtube it is no small feat!<br><a href="https://arxiv.org/pdf/2105.04651">Microsoft</a> found similar results in the context of search retrieval. They also argue that calibration helps find a threshold for when not to show results for a query.</p><h2>Prior work &amp; Insights</h2><h3><strong>1.
Post-trained model on logits</strong></h3><p>In the most classical approaches, calibration is applied as a post-training step: training a separate model on the logits of the first model using an independent dataset [<a href="https://scikit-learn.org/stable/modules/calibration.html#calibrating-a-classifier">1</a>]. For example, in <a href="https://arxiv.org/abs/1706.04599">On Calibration of Modern Neural Networks</a>, Guo et al. use a calibration model to improve the reliability of probability estimates from classification models, <strong>using a learned temperature to do the calibration</strong>.<br>An <strong>advantage</strong> of this method is that, by using an independent dataset rather than the training dataset of the model that produced the logits, it should generalize better to new data. One obvious <strong>disadvantage</strong> is that it complicates your overall architecture, and training a second model can be costly. In addition, scalability can quickly become an issue if you wish to ensure calibration across different features.
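A minimal sketch of temperature scaling for a binary task (numpy; a coarse grid search stands in for the gradient-based fit used in the paper, and the function name is illustrative):

```python
import numpy as np

def fit_temperature(logits, labels, grid=None):
    """Post-hoc temperature scaling: pick T minimizing held-out binary
    cross-entropy of sigmoid(logit / T). T > 1 softens an overconfident
    model; dividing by T never changes the ranking order of scores."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    if grid is None:
        grid = np.linspace(0.1, 10.0, 200)      # coarse search grid for T

    def nll(t):
        p = np.clip(1.0 / (1.0 + np.exp(-logits / t)), 1e-12, 1 - 1e-12)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    return grid[int(np.argmin([nll(t) for t in grid]))]
```

Because the held-out set is independent of the ranker's training data, the fitted temperature corrects the scale of the emitted probabilities without touching AUC.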
But don&#8217;t worry as with anything with machine learning there are other ways!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z5H3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z5H3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png 424w, https://substackcdn.com/image/fetch/$s_!Z5H3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png 848w, https://substackcdn.com/image/fetch/$s_!Z5H3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png 1272w, https://substackcdn.com/image/fetch/$s_!Z5H3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z5H3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png" width="1456" height="419" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z5H3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png 424w, https://substackcdn.com/image/fetch/$s_!Z5H3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png 848w, https://substackcdn.com/image/fetch/$s_!Z5H3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png 1272w, https://substackcdn.com/image/fetch/$s_!Z5H3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9415fadc-5336-4560-bb13-25a1934beb0b_1600x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 2: Apply post-training model on logits <a href="https://www.overleaf.com/read/mdftjvtbjwhk#c36916">source</a></figcaption></figure></div><h3><strong>2. Loss function</strong></h3><p>In <a href="https://arxiv.org/pdf/2211.01494">Regression Compatible Listwise Objectives for Calibrated Ranking with Binary Relevance</a>, Google explores how creating a <a href="https://dl.acm.org/doi/pdf/10.1145/3534678.3539072">scale-calibrated</a> multi-objective loss function can not only allow you to increase calibration but also benefit your ranking score in a listwise context. Interestingly, in this paper they argue that using a multi-objective where one is for ranking and the second is for calibration is <strong>not always compatible</strong>. More precisely, they prove that the common loss used (sigmoid for regression and softmax for ranking) actually pushes the gradient scores in different directions. 
Therefore, we will not cover others multi-objectives as a means for calibration here.<br>Also please note that for pointwise there are also <a href="https://arxiv.org/pdf/2209.05310">ways</a> to include calibration as part of the loss.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h-yw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h-yw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png 424w, https://substackcdn.com/image/fetch/$s_!h-yw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png 848w, https://substackcdn.com/image/fetch/$s_!h-yw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png 1272w, https://substackcdn.com/image/fetch/$s_!h-yw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h-yw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png" width="1456" height="432" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h-yw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png 424w, https://substackcdn.com/image/fetch/$s_!h-yw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png 848w, https://substackcdn.com/image/fetch/$s_!h-yw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png 1272w, https://substackcdn.com/image/fetch/$s_!h-yw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6737d7c3-14f2-4770-8e9e-900d405f11fe_1600x475.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 3: <a href="https://arxiv.org/pdf/2211.01494">Regression Compatible Ranking (RCR) loss</a> for a single query in a logistic-regression ranking task (i.e. ranking with binary relevance labels)</figcaption></figure></div><h3><strong>3. Layer</strong></h3><p>Another way to achieve calibration is by <strong>introducing a last layer into your network</strong> that achieves the calibration for you. For example, <a href="https://arxiv.org/pdf/2402.06859">Linkedin</a> created an Isotonic Calibration Layer for their ranker which <strong>helped increase offline and online metrics</strong>. 
In the repository, we also included a <a href="https://github.com/gauravchak/calibration_arch_in_ranking_mtml/blob/main/src/platt_scaling_calibration.py">layer</a> to represent Platt scaling.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hp86!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hp86!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png 424w, https://substackcdn.com/image/fetch/$s_!Hp86!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png 848w, https://substackcdn.com/image/fetch/$s_!Hp86!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!Hp86!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hp86!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png" width="1044" height="1064" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1064,&quot;width&quot;:1044,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hp86!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png 424w, https://substackcdn.com/image/fetch/$s_!Hp86!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png 848w, https://substackcdn.com/image/fetch/$s_!Hp86!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!Hp86!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1108129c-7a1f-40b4-a15e-b9d515469802_1044x1064.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 4: Isotonic Layer by <a href="https://arxiv.org/pdf/2402.06859">Linkedin LiRank</a></figcaption></figure></div><h3><strong>Which one to pick?</strong></h3><p>As with everything in life, it depends. Every problem is different: do you use a pointwise or a listwise loss? Are you working on a retrieval system? Is scale an issue for you? Do you want to add a Bayesian layer at the end of your model to do some exploration?<br>Sadly, there is no one solution that fits all, and this post does not cover every possible way to do calibration either. What matters, as always, is to start with something simple, test it, and gradually improve it.</p><h2>Defining calibration</h2><h3><strong>1. Overall calibration (per task)</strong></h3><p>For each task, the average predicted probability should match the average observed rate of the label.</p><h3>2. 
Calibration on each prediction bucket/bin</h3><p>It is possible that your model overpredicts or underpredicts in some ranges of the prediction. </p><p>For instance, if you split the eval dataset into 5 equal buckets based on the predicted probabilities and compare the average predicted probability with the average observed label per bucket, do you see buckets with a significant gap between prediction and observation?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yOqo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yOqo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png 424w, https://substackcdn.com/image/fetch/$s_!yOqo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png 848w, https://substackcdn.com/image/fetch/$s_!yOqo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png 1272w, https://substackcdn.com/image/fetch/$s_!yOqo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yOqo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png" 
width="512" height="443.15966386554624" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1236,&quot;width&quot;:1428,&quot;resizeWidth&quot;:512,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Example 1&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Example 1" title="Example 1" srcset="https://substackcdn.com/image/fetch/$s_!yOqo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png 424w, https://substackcdn.com/image/fetch/$s_!yOqo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png 848w, https://substackcdn.com/image/fetch/$s_!yOqo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png 1272w, https://substackcdn.com/image/fetch/$s_!yOqo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338ecd74-c686-4d41-908a-1f72249165c7_1428x1236.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 5: Showing how your prediction could be over-calibrated in certain ranges and under-calibrated in another range while being overall well calibrated.</figcaption></figure></div><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DsDo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DsDo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png 424w, 
https://substackcdn.com/image/fetch/$s_!DsDo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png 848w, https://substackcdn.com/image/fetch/$s_!DsDo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png 1272w, https://substackcdn.com/image/fetch/$s_!DsDo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DsDo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png" width="1456" height="1194" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1194,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:767178,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DsDo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png 424w, 
https://substackcdn.com/image/fetch/$s_!DsDo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png 848w, https://substackcdn.com/image/fetch/$s_!DsDo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png 1272w, https://substackcdn.com/image/fetch/$s_!DsDo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819a4501-8691-4953-8f68-3a13958235a4_1534x1258.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 6: Another way to look at this is to plot the fraction of the positive label by binning your data along your predictions.</figcaption></figure></div><h3>3. Calibrated per user-cohort (or in general based on a feature)</h3><p>Here we are concerned with cases where your multi-task model is overall well calibrated but mis-calibrated for certain segments. For instance:</p><ol><li><p>You are building a music recommender system and your system could be under-calibrated for &#8220;timeless classics&#8221;. That means for timeless classics the predicted probability of listening might be lower than what you observe in the data.</p></li><li><p>You are building a video recommender system where the training data is dominated by short videos, and you find that you could be under-predicting the probability of &#8220;like&#8221; on longer videos.</p></li></ol><h2>Implementation suggestions for calibrated ranking</h2><h3>Platt Scaling</h3><p>In <a href="https://github.com/gauravchak/calibration_arch_in_ranking_mtml/blob/main/src/platt_scaling_calibration.py">src/platt_scaling_calibration.py</a> we show an option of adding Platt Scaling to improve Overall Calibration.</p><pre><code># in init: set the weight and bias for Platt Scaling
self.weights = nn.Parameter(torch.ones(num_tasks))  # identity init so the
self.bias = nn.Parameter(torch.zeros(num_tasks))    # layer starts as a pass-through

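# A possible way to fit these Platt parameters (an illustrative sketch, not
# from the repository): freeze the backbone model and minimize BCE on
# held-out logits and labels. The helper name and hyperparameters below are
# assumptions for illustration.
def fit_platt(weights, bias, val_logits, val_labels, steps=200, lr=0.1):
    # weights/bias are [T] parameters; val_logits/val_labels are [B, T]
    opt = torch.optim.Adam([weights, bias], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(
            weights * val_logits + bias, val_labels.float()
        )
        loss.backward()
        opt.step()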
# during inference: computing task estimates
calibrated_logits = self.weights * ui_logits + self.bias</code></pre><h3>Add a loss to improve calibration per prediction bucket</h3><p>In <a href="https://github.com/gauravchak/calibration_arch_in_ranking_mtml/blob/main/src/prediction_buckets_calibration.py">src/prediction_buckets_calibration.py</a>, we add a loss based on #2 &#8220;Calibration on each prediction bucket/bin&#8221;. To do that, we compute the mean squared error (MSE) between the mean of the label and the mean of the prediction in each equally spaced bucket/bin (like histogram) of the prediction values.</p><pre><code># Compute ECE-MeanSquaredError loss
# These steps have been verified <a href="https://colab.research.google.com/drive/1EkubNvQ3X_fFLOSb6KDbUGCogk02Ae8b#scrollTo=JNzZTr5BoOkg">in this google colab</a>

# Sigmoid to go from logits to predicted probabilities
preds: torch.Tensor = torch.sigmoid(ui_logits)

# Assuming preds and labels are of shape [B, T], sort preds along the batch
# dimension. sorted_indices[i, t] is then the index (from 0 to B-1) of the
# i-th smallest predicted probability for the t-th task.
sorted_indices: torch.Tensor = torch.argsort(preds, dim=0)
# Hence sorted_preds[i, t] is the i-th smallest predicted probability
sorted_preds: torch.Tensor = torch.gather(input=preds, dim=0, index=sorted_indices)
# sorted_labels[i, t] is the corresponding label
sorted_labels: torch.Tensor = torch.gather(input=labels.float(), dim=0, index=sorted_indices)

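# Note: scale_proj_mat (used below) is built in __init__ as a [PB, B] matrix
# whose matmul with a sorted [B, T] tensor yields per-bin means. One possible
# construction, assuming PB equal-count bins (this helper and its name are
# illustrative, not from the repository):
def make_bin_averaging_matrix(num_bins, batch_size):
    # Row pb holds 1/bin_size over the columns of bin pb, so that
    # matmul(mat, sorted_x) is the mean of sorted_x within each bin.
    assert batch_size % num_bins == 0, "assumes equal-count bins"
    bin_size = batch_size // num_bins
    mat = torch.zeros(num_bins, batch_size)
    for pb in range(num_bins):
        mat[pb, pb * bin_size:(pb + 1) * bin_size] = 1.0 / bin_size
    return mat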
# Compute the mean prediction in each bin
pred_mean_per_bin: torch.Tensor = torch.matmul(self.scale_proj_mat, sorted_preds)  # [PB, T]
# Compute label_mean in the bucket
label_mean_per_bin: torch.Tensor = torch.matmul(self.scale_proj_mat, sorted_labels)  # [PB, T]

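# For reference: the standard ECE definition uses the absolute deviation per
# bin rather than the squared error used below. With equal-count bins it can
# be sketched as (function name is illustrative):
def ece_abs(pred_mean_per_bin, label_mean_per_bin):
    return (pred_mean_per_bin - label_mean_per_bin).abs().mean(dim=0)  # [T]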
# Compute MSE between mean label and prediction in the bucket.
# First compute per task. This will allow us to later reuse any task specific weights set by the user for cross_entropy_loss.
mse_per_task: torch.Tensor = ((pred_mean_per_bin - label_mean_per_bin)**2).mean(dim=0)
calibration_loss: torch.Tensor = mse_per_task.mean()</code></pre><p>Note&nbsp;</p><ol><li><p>ECE is defined with absolute deviation but we have chosen to use mean squared error in this implementation.</p></li><li><p><a href="https://arxiv.org/abs/1909.10155">Verified Uncertainty Calibration</a> shows that ECE has some bias. However, we think the drawback is not considerable.</p></li></ol><h3>Making sure the model is calibrated for different user cohorts</h3><p>In <a href="https://github.com/gauravchak/calibration_arch_in_ranking_mtml/blob/main/src/feature_based_calibration.py">src/feature_based_calibration.py</a>, we add a loss that captures calibration for both values of a given feature. Imagine you are building a friend recommendation application and you want to ensure that your ranking model works for both new users and tenured users. By setting a feature &#8220;is_tenured&#8221;, this code shows how to ensure your models are calibrated for both tenured and new users. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PPF6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PPF6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png 424w, https://substackcdn.com/image/fetch/$s_!PPF6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png 848w, 
https://substackcdn.com/image/fetch/$s_!PPF6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png 1272w, https://substackcdn.com/image/fetch/$s_!PPF6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PPF6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png" width="1456" height="665" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:665,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:401998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PPF6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png 424w, https://substackcdn.com/image/fetch/$s_!PPF6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png 848w, 
https://substackcdn.com/image/fetch/$s_!PPF6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png 1272w, https://substackcdn.com/image/fetch/$s_!PPF6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0bfdb-0d5f-4154-a1f0-54d8b5139700_1768x808.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 7: Showing how the model could have been miscalibrated for a key feature like say user cohort, and using this 
approach we can fix that. Hence, with the value model above it, we will not be unfair to any user cohort.</figcaption></figure></div><h3>Debiasing model arch against a feature</h3><p>In <a href="https://github.com/gauravchak/calibration_arch_in_ranking_mtml/blob/main/src/feature_bias_capture.py">src/feature_bias_capture.py</a>, we are not adding a loss. Instead, we add a model architecture component, a shallow tower if you will, that computes a per-task logit purely from the single feature, which is then added to the main task logits.</p><h2>Appendix</h2><ol><li><p>Note on nomenclature: calibration here is different from <a href="https://dl.acm.org/doi/pdf/10.1145/3240323.3240372">Steck (2018)</a>.</p><p>What we refer to as calibration is different from what Harald Steck refers to <a href="https://dl.acm.org/doi/pdf/10.1145/3240323.3240372">here</a>. In that paper, he suggests that if you observe the user&#8217;s prior interest in some categories, then by ensuring your current slate of recommendations matches the user&#8217;s prior distribution, you will not be under- or over-representing a category. Here we are talking about matching the rate of the observed true label in your predictions.</p></li></ol><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of or attributable to their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</em></p>]]></content:encoded></item><item><title><![CDATA[Entrypoint retention modeling in recommender systems]]></title><description><![CDATA[Choose/rank items at the entrypoint of a recommended feed to drive retention and not just consumption]]></description><link>https://recsysml.substack.com/p/retention-modeling-at-feed-entry</link><guid isPermaLink="false">https://recsysml.substack.com/p/retention-modeling-at-feed-entry</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Fri, 24 May 2024 13:49:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!h_We!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In our previous post we shared our learnings on how to <a href="https://recsysml.substack.com/p/how-to-account-for-risk-in-recommended">consider the risk of abandonment in a repeated recommender system</a>. In this post, we focus on the initial recommendation, which we call the &#8220;entrypoint&#8221; item. We share our learnings on improved entrypoint recommendations, which in our experience has led to higher daily active users and app-sessions.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://recsysml.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Applied ML | Recommender systems! 
Subscribe for free to receive new posts and please comment / give feedback.</p></div></div></div><h3>What do we mean by &#8220;entrypoint&#8221; recommendations</h3><p><strong>Carousel entrypoint: </strong>In many apps you will find a carousel of options that opens up an immersive UI where you can scroll for more content. For example, the YouTube Shorts carousel shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h_We!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h_We!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png 424w, https://substackcdn.com/image/fetch/$s_!h_We!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png 848w, https://substackcdn.com/image/fetch/$s_!h_We!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png 1272w, 
https://substackcdn.com/image/fetch/$s_!h_We!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h_We!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png" width="370" height="500.90252707581226" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a626b5af-bfa1-428e-814a-f813b5599e40_554x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:554,&quot;resizeWidth&quot;:370,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;How to use YouTube Shorts | Mashable&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to use YouTube Shorts | Mashable" title="How to use YouTube Shorts | Mashable" srcset="https://substackcdn.com/image/fetch/$s_!h_We!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png 424w, https://substackcdn.com/image/fetch/$s_!h_We!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png 848w, https://substackcdn.com/image/fetch/$s_!h_We!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png 1272w, 
https://substackcdn.com/image/fetch/$s_!h_We!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa626b5af-bfa1-428e-814a-f813b5599e40_554x750.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">An example in a modern recommender system of a carousel of entrypoint recommendations.
(<a href="https://mashable.com/article/how-to-use-youtube-shorts">Image source</a>, image chosen without prejudice as the first on the topic on Google image search)</figcaption></figure></div><p><strong>Feed entrypoint: </strong>We have also had success applying these approaches to the first item in an interface where the user can swipe/scroll up to see more recommendations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Aip4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Aip4!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif 424w, https://substackcdn.com/image/fetch/$s_!Aip4!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif 848w, https://substackcdn.com/image/fetch/$s_!Aip4!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif 1272w, https://substackcdn.com/image/fetch/$s_!Aip4!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Aip4!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif" width="224" height="484.3243243243243" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:222,&quot;resizeWidth&quot;:224,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Scroll Through TikTok Hands-Free on Your iPhone or iPad Using Simple Voice  Commands &#171; iOS &amp; iPhone :: Gadget Hacks&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Scroll Through TikTok Hands-Free on Your iPhone or iPad Using Simple Voice  Commands &#171; iOS &amp; iPhone :: Gadget Hacks" title="Scroll Through TikTok Hands-Free on Your iPhone or iPad Using Simple Voice  Commands &#171; iOS &amp; iPhone :: Gadget Hacks" srcset="https://substackcdn.com/image/fetch/$s_!Aip4!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif 424w, https://substackcdn.com/image/fetch/$s_!Aip4!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif 848w, https://substackcdn.com/image/fetch/$s_!Aip4!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif 1272w, https://substackcdn.com/image/fetch/$s_!Aip4!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40bd1bf4-7491-495e-9b09-5ff807f73d16_222x480.gif 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example of a continuous feed of content where the first content recommended in the feed is being referred to as &#8220;entrypoint&#8221; in this article. (<a href="https://ios.gadgethacks.com/how-to/scroll-through-tiktok-hands-free-your-iphone-ipad-using-simple-voice-commands-0384514/">image source</a>, image chosen without prejudice as the first GIF on google image search on the topic)</figcaption></figure></div><p>In both these situations the entrypoint item has a high impact on the value derived by the user from the session. 
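</p><p>One way to make that downstream session value concrete is as a discounted sum of the per-item rewards observed after the entrypoint. A minimal sketch (the function name and the example numbers are hypothetical, for illustration only):</p>

```python
# Sketch: build an entrypoint training label from a logged session.
# rewards[i] is the reward of the i-th item shown (e.g. seconds
# watched); position 0 is the entrypoint itself.

def discounted_retention_label(rewards, alpha=0.9):
    """Discounted sum of rewards at positions 1..n, i.e. the value of
    the session downstream of the entrypoint; position 0's own reward
    is kept as a separate pointwise task."""
    return sum((alpha ** i) * r for i, r in enumerate(rewards) if i >= 1)

# Example session: entrypoint watched 30s, then 20s, 10s, 0s (exit).
label = discounted_retention_label([30.0, 20.0, 10.0, 0.0], alpha=0.5)
# 0.5 * 20 + 0.25 * 10 + 0.125 * 0 = 12.5
```

<p>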
Not getting it right can lead to shallow sessions and to fewer current and future sessions.</p><h3>The benefit of entrypoint retention modeling</h3><h4>Retention modeling in carousel entrypoint</h4><p>If done right, it can lead to:</p><ul><li><p>a reduction in the number of single-item sessions, i.e. sessions that don&#8217;t advance beyond the item in the carousel</p></li><li><p>longer app-sessions</p></li><li><p>an increase in app-sessions and daily active users, especially for infrequent (non-power) users.</p></li><li><p>increased revenue caused by the effects above.</p></li></ul><h4>Retention modeling in feed entrypoint</h4><p>If done right, it can lead to:</p><ul><li><p>a reduction in &#8220;skip rate&#8221;, the rate at which users skip the recommended item.</p></li><li><p>longer app-sessions</p></li><li><p>an increase in app-sessions and daily active users, especially for infrequent (non-power) users.</p></li><li><p>increased revenue caused by the effects above.</p></li></ul><h2>Mathematical derivation</h2><p>The cumulative value of a trajectory, given that the user is guaranteed to see the first (&#8220;entrypoint&#8221;) recommendation, is <strong>Eq(1)</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V(\\tau) = V(0) + \\sum_{i=1}^{\\infty} \\left( \\prod_{j=0}^{i-1} (1 - \\text{exit}(j)) \\right) V(i)&quot;,&quot;id&quot;:&quot;VCDZHFMRWM&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p>V(i) refers to the value derived from the i-th recommendation</p></li><li><p>exit(i) is 1 if the user exits the feed at the i-th recommendation, and 0 otherwise</p></li></ul><p>We will use this version, <strong>Eq(1)</strong>, in the implementation section, but before we dive into implementation it might help to look at the problem in a couple of other ways, so that you can connect it to what we discussed in <a href="https://recsysml.substack.com/p/how-to-account-for-risk-in-recommended">our previous post on incorporating
P(exit)</a>.</p><p><strong>Eq(2)</strong>: We can also write this in a way that separates out the value derived at every position in the feed.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V(\\tau) = V(0) + (1 - \\text{exit}(0)) V(1) + (1 - \\text{exit}(0))(1 - \\text{exit}(1)) V(2) + (1 - \\text{exit}(0))(1 - \\text{exit}(1))(1 - \\text{exit}(2)) V(3) + \\cdots&quot;,&quot;id&quot;:&quot;ASOGQNSNMN&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Another way to look at this is <strong>Eq(3)</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V(\\tau \\mid 0) = V(0) + (1 - \\text{exit}(0)) \\cdot V(\\tau \\mid 1)&quot;,&quot;id&quot;:&quot;CAQLWUGRLO&quot;}" data-component-name="LatexBlockToDOM"></div><p>and, in general, the value starting at the k-th position is the value from that recommendation plus, conditional on the user not exiting, the value from the rest of the session starting at position (k+1):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V(\\tau \\mid k) = V(k) + (1 - \\text{exit}(k)) \\cdot V(\\tau \\mid (k+1))&quot;,&quot;id&quot;:&quot;MNQVDYFHFG&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h2>Key Insights</h2><p>Our key insights are:</p><ol><li><p>Optimizing the entrypoint purely by the pointwise reward E[V(0)] ignores the second term in Eq(1).</p></li><li><p>Entrypoint recommendations can affect the value of the full session since, as shown in Eq(3), the recommendation directly affects the probability of exit and hence the second term.</p></li><li><p>Since you expect improved entrypoint recommendations to have a higher impact on users who don&#8217;t yet have entrenched habits, it might help to limit your training data for these tasks to infrequent users.</p></li><li><p>In recommender systems that are responsive and alter what they show based on previous user interactions, entrypoint recommendations with good follow-on
recommendation options can capitalize on the user intent generated by the entrypoint.</p></li><li><p>To account for the weaker causal link between the entrypoint and future terms, we can use a discounted sum of future rewards rather than the plain sum. (See implementation step 2.a below.)</p></li></ol><h2>Recommended Implementation</h2><ol><li><p>Join the user interaction at the entrypoint with the value of the rest of the session, such as time spent, number of items seen by the user, likes, etc.</p></li><li><p>We recommend keeping these downstream tasks separate in the initial iteration, e.g.:</p><ol><li><p>retention_time_spent = <strong>discounted</strong> sum of time spent starting at position 1</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V(\\tau) = V(0) + \\sum_{i=1}^{\\infty} \\left(\\alpha^i * V(i)\\right)&quot;,&quot;id&quot;:&quot;CVCWCJKKMS&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>retention_items_seen = discounted count of items seen starting at position 1</p></li><li><p>retention_conversions, etc.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3yVu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3yVu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png 424w, https://substackcdn.com/image/fetch/$s_!3yVu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png 848w,
https://substackcdn.com/image/fetch/$s_!3yVu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png 1272w, https://substackcdn.com/image/fetch/$s_!3yVu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3yVu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png" width="1456" height="645" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:393327,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3yVu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png 424w, https://substackcdn.com/image/fetch/$s_!3yVu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png 848w, 
https://substackcdn.com/image/fetch/$s_!3yVu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png 1272w, https://substackcdn.com/image/fetch/$s_!3yVu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb70eb1d-4dd3-4272-9ec8-2031f0ea4ccb_1562x692.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Fig1: Compute a decayed sum of future rewards in the session, and use it as a label for the first video.
In the example in the image we have used the watch time of the videos as the reward. Similar to the multi-task paper, we recommend experimenting with engagement and number of views as rewards as well.</figcaption></figure></div></li></ol></li><li><p>Add these tasks to your Multi-task estimator (&#8220;ranking&#8221;) model.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rl8r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rl8r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png 424w, https://substackcdn.com/image/fetch/$s_!Rl8r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png 848w, https://substackcdn.com/image/fetch/$s_!Rl8r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png 1272w, https://substackcdn.com/image/fetch/$s_!Rl8r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rl8r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png" width="278" height="234.30100334448161"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:504,&quot;width&quot;:598,&quot;resizeWidth&quot;:278,&quot;bytes&quot;:138656,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rl8r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png 424w, https://substackcdn.com/image/fetch/$s_!Rl8r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png 848w, https://substackcdn.com/image/fetch/$s_!Rl8r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png 1272w, https://substackcdn.com/image/fetch/$s_!Rl8r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2779d32-4bb3-4849-9bbc-0203aa424642_598x504.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Fig 2: Showing that the new retention tasks are added to the multi task estimator (&#8220;ranking&#8221;) model</figcaption></figure></div></li><li><p>Experiment with conditioning on the right user segment and perhaps on entrypoints where the immediate value to the user was strong enough.</p></li><li><p>Assuming you have a <a 
href="https://arxiv.org/abs/2208.04560">Multi-task fusion</a> (a.k.a. &#8220;value model&#8221;) approach to combining your task estimates to actually pick the entrypoint item to recommend, in this step there will be a fair bit of iteration to use these new tasks. </p></li></ol><p></p><h2>Video explaining the post</h2><div id="youtube2-6sARaA2h1uY" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;6sARaA2h1uY&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/6sARaA2h1uY?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</em></p><p></p>]]></content:encoded></item><item><title><![CDATA[Optimal whole page ranking = reward / risk]]></title><description><![CDATA[We show how tech can learn from finance in using risk models for better feed construction of recommender systems.]]></description><link>https://recsysml.substack.com/p/how-to-account-for-risk-in-recommended</link><guid isPermaLink="false">https://recsysml.substack.com/p/how-to-account-for-risk-in-recommended</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Sat, 11 May 2024 14:00:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/dN5UqaT0O48" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Are we looking at risk enough in recommender systems?</p><p>While it is tempting to equate &#8220;risk&#8221; with the absence of &#8220;reward&#8221;, we think we can learn from portfolio construction in finance, where modeling and accounting for risk in action selection leads to an increase in long-term user value in a multi-iteration setting.</p><div id="youtube2-dN5UqaT0O48" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;dN5UqaT0O48&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/dN5UqaT0O48?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Illustration of the idea in finance</h2><p>There are many decades of beautiful mathematics on optimal portfolio construction that factors in both risk and reward.
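</p><p>For context, a standard result from that literature (Markowitz mean-variance optimization, stated here without derivation): if mu is the vector of expected returns, Sigma the covariance matrix of returns, and lambda the investor&#8217;s risk aversion, the unconstrained optimal weights are</p>

```latex
w^{*} \;=\; \arg\max_{w}\; \left( w^{\top}\mu \;-\; \tfrac{\lambda}{2}\, w^{\top}\Sigma w \right) \;=\; \frac{1}{\lambda}\,\Sigma^{-1}\mu
```

<p>so, all else being equal, allocations shrink as estimated (co)variance grows, which is the &#8220;allocate inversely proportional to risk&#8221; intuition.</p><p>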
In <a href="https://github.com/gauravchak/risk_aware_feed_construction/blob/main/SPY_VIX.ipynb">this Colab</a>, we have shown, using a simplistic example, how allocating inversely proportional to risk (rather than holding a fixed full-stocks allocation) leads to: </p><ol><li><p>higher returns, as evidenced by the final portfolio being 1.81 times the normal full-stocks allocation.</p></li><li><p>lower risk, as evidenced by lower drawdowns during times of crisis and hence a lower risk of the investor having to liquidate.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E6B-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E6B-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png 424w, https://substackcdn.com/image/fetch/$s_!E6B-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png 848w, https://substackcdn.com/image/fetch/$s_!E6B-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png 1272w, https://substackcdn.com/image/fetch/$s_!E6B-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!E6B-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png" width="1248" height="934" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:934,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:534205,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E6B-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png 424w, https://substackcdn.com/image/fetch/$s_!E6B-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png 848w, https://substackcdn.com/image/fetch/$s_!E6B-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png 1272w, https://substackcdn.com/image/fetch/$s_!E6B-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f4ec9df-7863-45bd-9854-b51a31513763_1248x934.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Conclusion from <a href="https://github.com/gauravchak/risk_aware_feed_construction/blob/main/SPY_VIX.ipynb">notebook</a> illustrating the value of factoring risk into long term value optimization in financial portfolio construction.</figcaption></figure></div><p>The above is just an illustration and all the disclaimers that you usually find with things related to financial advice apply here. Based on the personal experience of the authors, there are ways to mess it up and there are ways to deliver 100+ times more value than the above chart as well. 
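</p><p>The allocation rule behind that illustration can be sketched as follows. This is a toy with synthetic numbers, not the notebook&#8217;s actual SPY/VIX data; the function names and the 15% volatility target are assumptions for illustration:</p>

```python
# Toy sketch of inverse-risk allocation: scale exposure to the risky
# asset down as the current volatility estimate rises, capped at 100%.

def risk_scaled_weight(vol_estimate, target_vol=0.15):
    """Fraction of the portfolio held in the risky asset."""
    return min(1.0, target_vol / vol_estimate)

def run_backtest(returns, vols, target_vol=0.15):
    """Compound wealth with a per-period risk-scaled weight; the
    unallocated remainder is assumed to earn zero."""
    wealth = 1.0
    for r, v in zip(returns, vols):
        wealth *= 1.0 + risk_scaled_weight(v, target_vol) * r
    return wealth

# Calm periods get full allocation; the crisis period (vol 0.60) gets
# only a quarter of the exposure, which dampens the -10% drawdown.
wealth = run_backtest(returns=[0.02, -0.10, 0.03], vols=[0.10, 0.60, 0.12])
```

<p>A fully invested portfolio over the same toy path would compound to 1.02 &#215; 0.90 &#215; 1.03 &#8776; 0.946, while the risk-scaled one ends above 1.02, illustrating both the higher ending wealth and the smaller drawdown.</p><p>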
Let&#8217;s take the basic idea and expand on it in the next section.</p><h2>High level idea</h2><div class="pullquote"><p>Take the resources that are limited and make sure to use them optimally.</p></div><p>In financial portfolios, the limited resource is not just money; risk is limited too.</p><ul><li><p>In fact, with leverage, which is largely accessible and cheap, the invested value of the portfolio is not strictly limited to the money that is invested. </p></li><li><p>Risk is limited: an investor can only stomach a certain amount of it. Hence a portfolio that maximizes reward while containing risk is optimal.</p></li></ul><p>In recommender systems, you have a limited amount of <a href="https://www.newyorker.com/magazine/2024/05/06/the-battle-for-attention">attention</a>, or interest, from the user. You are constantly balancing the risk of depleting that resource against trying to deliver value.</p><h2>Reward, Risk and Regret in financial portfolios</h2><ul><li><p>Reward could be the returns, or the increase in the value of the portfolio</p></li><li><p>Risk</p><ul><li><p>short-term risk of negative returns</p></li><li><p>long-term risk of stopping investing altogether. This is all too common. In our experience, if you speak to investors who have been investing personally for around 25 years, more than half of them no longer invest in the stock market, having gone through some period of extreme risk that left them disillusioned with the outcome.</p></li><li><p>medium-term risk of underallocation</p></li></ul></li><li><p>Regret</p><ul><li><p>often people are looking to have some exposure to asset classes or exciting stocks / investments that their friends / peers are invested in. This comes from the fear of missing out.
So if you are a portfolio manager and you are not allocating at all to something like, say, cryptocurrency, you could be incurring the risk of regret from clients whose friends are allocated.</p></li></ul></li></ul><h2>Reward, Risk and Regret in recommended feed construction</h2><ul><li><p>Reward: This is often tied to your definition of business value. For instance, it could be the number of daily active users on your platform, or the total time spent or user activity on your platform.</p></li><li><p>Risk:</p><ul><li><p>users retaining at a lower rate (high severity)</p></li><li><p>users exiting the current session in your app (medium severity)</p></li><li><p>users exiting the feed or skipping the recommendation (low severity)</p></li></ul></li><li><p>Regret</p><ul><li><p>recommendations not capturing some category/topic/creator/job to be done that you consider par for the course. One inspiring way to reduce this risk, we think, is <a href="https://dl.acm.org/doi/10.1145/3240323.3240372">calibrated recommendations</a>.</p></li></ul></li></ul><h2>Final algorithm for feed recommendations</h2><p>Instead of ranking items by expected value / reward, borrowing from the formula in finance, we recommend ranking items in your recommended feed by</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{E[reward]}{E[risk]}&quot;,&quot;id&quot;:&quot;TYYEIBRBOS&quot;}" data-component-name="LatexBlockToDOM"></div><p>A simplistic formulation of this in, say, a video recommendation system could be</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{P[watch > 30s | user, item]}{P[exit\\_app | user, item]}&quot;,&quot;id&quot;:&quot;KOJWVOXQOY&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>This is similar to what <a href="https://www.linkedin.com/in/guanfengliang/">Guanfeng</a> et al. show in <a href="https://patents.google.com/patent/WO2017019548A1/en">Improving feeds by modelling scrolling behavior</a>, i.e. 
the optimal solution is to rank by probability of reward / probability of ending session:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UDse!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90446f3e-4912-409f-b796-4f9f19755143_920x246.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UDse!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90446f3e-4912-409f-b796-4f9f19755143_920x246.png 424w, https://substackcdn.com/image/fetch/$s_!UDse!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90446f3e-4912-409f-b796-4f9f19755143_920x246.png 848w, https://substackcdn.com/image/fetch/$s_!UDse!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90446f3e-4912-409f-b796-4f9f19755143_920x246.png 1272w, https://substackcdn.com/image/fetch/$s_!UDse!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90446f3e-4912-409f-b796-4f9f19755143_920x246.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UDse!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90446f3e-4912-409f-b796-4f9f19755143_920x246.png" width="920" height="246" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90446f3e-4912-409f-b796-4f9f19755143_920x246.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:920,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114134,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UDse!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90446f3e-4912-409f-b796-4f9f19755143_920x246.png 424w, https://substackcdn.com/image/fetch/$s_!UDse!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90446f3e-4912-409f-b796-4f9f19755143_920x246.png 848w, https://substackcdn.com/image/fetch/$s_!UDse!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90446f3e-4912-409f-b796-4f9f19755143_920x246.png 1272w, https://substackcdn.com/image/fetch/$s_!UDse!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90446f3e-4912-409f-b796-4f9f19755143_920x246.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Optimal ranking from <a href="https://patents.google.com/patent/WO2017019548A1/en">Improving feeds by modelling scrolling behavior</a></figcaption></figure></div><p></p><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.</em></p><p></p>]]></content:encoded></item><item><title><![CDATA[User representation in a recommender system | memorization vs generalization]]></title><description><![CDATA[We look at memorization, generalization and mixture of representations based implementations for user preference representation in a recommender system]]></description><link>https://recsysml.substack.com/p/user-preference-modeling-in-a-recommender</link><guid isPermaLink="false">https://recsysml.substack.com/p/user-preference-modeling-in-a-recommender</guid><dc:creator><![CDATA[Gaurav Chakravorty]]></dc:creator><pubDate>Fri, 03 May 2024 16:54:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gJ5E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Representing the preferences of the user is crucial to personalizing recommender systems. In this article, we propose an approach to representing user preferences that we believe is optimal for large scale recommender systems. The approach is inspired by how humans communicate with others. 
We use progressively more memory when there is more historical context to fall back to (see figure below).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gJ5E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gJ5E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png 424w, https://substackcdn.com/image/fetch/$s_!gJ5E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png 848w, https://substackcdn.com/image/fetch/$s_!gJ5E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png 1272w, https://substackcdn.com/image/fetch/$s_!gJ5E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gJ5E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png" width="1456" height="855" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:855,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:434308,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gJ5E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png 424w, https://substackcdn.com/image/fetch/$s_!gJ5E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png 848w, https://substackcdn.com/image/fetch/$s_!gJ5E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png 1272w, https://substackcdn.com/image/fetch/$s_!gJ5E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff929c706-6f18-4edd-bffd-57d896e6cd1c_1590x934.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Similarly the approach presented in the section &#8220;Mixture of representations&#8221; uses table lookup based memory primarily for users for whom we have enough data to specialize, else it relies on a generalized representation.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://recsysml.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support our work. 
We would also love to hear your feedback on what we should write about more in the future.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Introduction: Generalization vs Memorization</h2><p>We have seen successful recsys built following both schools of thought:</p><ol><li><p><strong>Generalization:</strong> Let&#8217;s not have any user-specific memorization and personalize purely based on user features. Look at <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf">this seminal Youtube paper</a> for an example.</p></li><li><p><strong>Memorization:</strong> Let&#8217;s have a large lookup table keyed by user id. This is essentially a way to capture clear causality based specifically on the user&#8217;s preference. One could think of this as a modern recommender system incarnation of collaborative filtering, which even today is quite close to SOTA (see <a href="https://scholar.google.com/citations?view_op=view_citation&amp;hl=en&amp;user=yR-ugIoAAAAJ&amp;sortby=pubdate&amp;citation_for_view=yR-ugIoAAAAJ:fPk4N6BV_jEC">here</a>). 
For those who want to learn collaborative filtering from the best, I recommend reading <a href="https://datajobs.com/data-science-repo/Collaborative-Filtering-[Koren-and-Bell].pdf">this chapter</a>.</p></li></ol><h2>Outline</h2><p>In this post, we will discuss various implementations for capturing user preference:</p><ol><li><p>table lookup</p></li><li><p>deep hash embeddings</p></li><li><p>generalization based on user cohort, not specific to the user</p></li><li><p>mixture of representations</p></li></ol><p>We encourage you to try multiple approaches, since the results can vary with scale, with how dynamic user preferences are, and with how asymmetric the split between power users and marginal users is on your platform.</p><h2>Code</h2><p>We share PyTorch code <a href="https://github.com/gauravchak/user_preference_modeling">here</a>. It is tested and freely available. We have also posted a walkthrough of the code on Youtube. See the first of the 6 videos below.</p><div id="youtube2-pbboLjaAe0s" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;pbboLjaAe0s&quot;,&quot;startTime&quot;:&quot;104s&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/pbboLjaAe0s?start=104s&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Memorization based user representation (Table lookup)</h2><p>In this implementation, we create a (large) embedding table keyed by user id. 
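</p><p>To make the table-lookup idea concrete, here is a minimal plain-Python sketch of hashed id lookup. It is a stand-in for a learnable embedding table (e.g. a PyTorch <code>nn.Embedding</code>); the class name, table size and dimension below are illustrative and not taken from the linked repo.</p>

```python
import random


class UserIdEmbedding:
    """Memorization-based user representation: one vector per hash bucket."""

    def __init__(self, num_buckets: int = 1024, dim: int = 8, seed: int = 0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.dim = dim
        # In a real model this table would be learnable; here it is a
        # randomly initialized stand-in.
        self.table = [
            [rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(num_buckets)
        ]

    def lookup(self, user_id: int) -> list:
        # Hash the user id into a fixed number of buckets. Distinct users can
        # collide; that is the trade-off for keeping the table size bounded.
        return self.table[hash(user_id) % self.num_buckets]


emb = UserIdEmbedding()
vec = emb.lookup(12345)  # dim-sized user embedding, stable across lookups
```

<p>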
Notwithstanding hash collisions, this enables us to memorize the user&#8217;s preferences and use them in future recommendations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n6WP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n6WP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png 424w, https://substackcdn.com/image/fetch/$s_!n6WP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png 848w, https://substackcdn.com/image/fetch/$s_!n6WP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png 1272w, https://substackcdn.com/image/fetch/$s_!n6WP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n6WP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png" width="1268" height="604" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:604,&quot;width&quot;:1268,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187469,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n6WP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png 424w, https://substackcdn.com/image/fetch/$s_!n6WP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png 848w, https://substackcdn.com/image/fetch/$s_!n6WP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png 1272w, https://substackcdn.com/image/fetch/$s_!n6WP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa410d369-724d-4dc0-b517-c55005c0ebb1_1268x604.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Deep Hash Embeddings</h2><p>This is based on <a href="https://arxiv.org/abs/2010.10784">this seemingly magical paper</a>, which achieves performance similar to id lookup without using embedding tables.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-0an!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-0an!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png 424w, 
https://substackcdn.com/image/fetch/$s_!-0an!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png 848w, https://substackcdn.com/image/fetch/$s_!-0an!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png 1272w, https://substackcdn.com/image/fetch/$s_!-0an!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-0an!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png" width="1194" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1194,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:235039,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-0an!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png 424w, 
https://substackcdn.com/image/fetch/$s_!-0an!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png 848w, https://substackcdn.com/image/fetch/$s_!-0an!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png 1272w, https://substackcdn.com/image/fetch/$s_!-0an!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb724e5-3103-4788-b5ba-247fd2293a30_1194x660.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I believe the intuition is that the stacked neural network layers learn a form of generalization that is competitive with memorization with an order of magnitude fewer parameters.</p><h3>User cohort / cluster based representation</h3><p>In this implementation we only look at user features (not including user id). We have an embedding table of smallish size, say 1024, and we try to find the index in this table that the user should map to, based on features such as location, broad interests, etc. This is especially useful when you have very little information about the user.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d81g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d81g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png 424w, https://substackcdn.com/image/fetch/$s_!d81g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png 848w, https://substackcdn.com/image/fetch/$s_!d81g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png 1272w, 
https://substackcdn.com/image/fetch/$s_!d81g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d81g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png" width="1456" height="1095" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1095,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:481225,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d81g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png 424w, https://substackcdn.com/image/fetch/$s_!d81g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png 848w, https://substackcdn.com/image/fetch/$s_!d81g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png 1272w, 
https://substackcdn.com/image/fetch/$s_!d81g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef504cd1-76a4-494e-b047-28add49cb7c7_1588x1194.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Mixture of representations</h3><p>Now we try to combine these ideas. In the image below:</p><ol><li><p>the &#8220;Table Lookup&#8221; refers to the module in the &#8220;Memorization based user representation (Table lookup)&#8221; section. 
</p></li><li><p>the &#8220;Cohort lookup&#8221; refers to the module in the &#8220;User cohort / cluster based representation&#8221; section.</p></li></ol><p>Then we take a weighted sum of the two. The weight is learned and, hopefully, learns to rely on the best embedding for each user.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fOdo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fOdo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png 424w, https://substackcdn.com/image/fetch/$s_!fOdo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png 848w, https://substackcdn.com/image/fetch/$s_!fOdo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!fOdo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fOdo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png" width="986" height="1098" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1098,&quot;width&quot;:986,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:352156,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fOdo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png 424w, https://substackcdn.com/image/fetch/$s_!fOdo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png 848w, https://substackcdn.com/image/fetch/$s_!fOdo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!fOdo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf7cc27-251d-4fd3-b3a9-7003047122b4_986x1098.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>A note about sequential recommendation</h2><p>Please note that this article is about understanding the user&#8217;s preferences beyond the current session&#8217;s ephemeral interests. Sequential recommendation modules may be best at capturing those short-lived interests and keeping your recsys responsive. User representation and sequential recommendation modules should be complementary.</p><p><em><strong>Disclaimer:</strong> These are the personal opinions of the author(s). Any assumptions or opinions stated here are theirs and not representative of their current or any prior employer(s). 
Apart from publicly available information, any other information here is not claimed to refer to any company, including ones the author(s) may have worked at or been associated with.</em></p>]]></content:encoded></item></channel></rss>