David Merrell's Blogsite

Map of the cat

Fri, 10 Feb 2023 05:00:00 +0000

I think of this Richard Feynman story pretty often.

Apparently Feynman spent a short episode of his career dabbling in biology. This story recounts his experience taking a graduate level biology course. The professor has assigned him to present a paper to the class:

“I began to read the paper. It kept talking about extensors and flexors, the gastrocnemius muscle, and so on. This and that muscle were named, but I hadn’t the foggiest idea of where they were located in relation to the nerves or to the cat. So I went to the librarian in the biology section and asked her if she could find me a map of the cat.

“‘A map of the cat, sir?’ she asked, horrified. ‘You mean a zoological chart!’ From then on there were rumors about some dumb biology graduate student who was looking for a ‘map of the cat.’

“When it came time for me to give my talk on the subject, I started off by drawing an outline of the cat and began to name the various muscles.

“The other students in the class interrupt me: ‘We know all that!’

“‘Oh,’ I say, ‘you do? Then no wonder I can catch up with you so fast after you’ve had four years of biology.’ They had wasted all their time memorizing stuff like that, when it could be looked up in fifteen minutes.”

—Richard Feynman (Surely You’re Joking, Mr. Feynman!)

This story resonates with me as a computer scientist/mathematician working on biological problems.

I’ve found that physicists (and computer scientists) have very different sensibilities from biologists. I think Ernest Rutherford had these differences in mind when he said “All science is either physics or stamp collecting.”

Biologists are required to memorize a great many facts. In contrast, computer scientists, mathematicians, and physicists only memorize a small number of “first principles”. Those principles serve as a starting point to reason/solve problems/understand the world—you can get a lot of mileage out of a small number of good abstractions.

Meanwhile, biological systems are like Rube Goldberg machines. Very little of biology can be explained from anything resembling “first principles.” The central dogma is the closest thing to a first principle I can think of, but it explains only a small fraction of the mechanisms people find interesting.

A competent biologist must (a) learn a large number of facts and (b) form a unified understanding from them. I feel optimistic that algorithms, ML, and AI can do a good job of this. I’m not very interested in stamp collecting—but I am very keen to develop algorithms that collect stamps for me.

Whenever I feel my computer scientist mindset collide with the stamp-collecting aspects of biology, I think of it as a “map of the cat” moment. I only wish more people knew this Feynman story, so that I could say “map of the cat” out loud and people would understand what I mean.

$ \blacksquare$

A less-bad blog post about transformers

Mon, 23 Jan 2023 05:00:00 +0000

This is the second post in a series!

I recommend reading the first post (“A less-bad blog post about attention mechanisms”) before reading this one. It explains important prerequisite concepts. For example, “self-attention” and “multi-head attention.”

Why I’m writing this

I suspect “Attention is All You Need” (AIAYN) was written with a very specific audience in mind. I don’t think it actually explains transformers very well to a general ML audience.

As a single example, consider this graphic from the original AIAYN paper:

This was incomprehensible to me when I first tried reading AIAYN.

What’s being stacked N times? And in what fashion? How do connections work between stacked units on the left and right?
Why is there an arrow from outputs into the network?
What does it mean that the outputs are “shifted right?”
Why are there “outputs” in the lower right and “output probabilities” in the upper right?

Despite its (arguable) lack of clarity, this same image is copied and pasted into practically every blog post about transformers. The typical blogger then proceeds to “explain” it by parroting the same explanation given in AIAYN. It’s as though the bloggers only have a superficial understanding of the concepts in AIAYN, or haven’t thought carefully about how to explain them.

This post aims to do a less-bad job of explaining transformers to a broad ML audience.

A gradual explanation of the transformer architecture

We’ll start with a 10,000-foot view of the transformer and gradually zoom in, focusing on important details as appropriate.

Broad brush strokes

Some important big-picture things to understand about the transformer architecture:

The transformer is a sequence-to-sequence model.
- That is, it receives a sequence $x_1, x_2, \ldots, x_M $ as input and produces a new sequence $y_1, y_2, \ldots, y_N$ as output.
- Concretely, AIAYN presents the transformer as a model for translating text from one language to another (like in Google Translate).
- Whenever appropriate, we’ll focus on concrete examples from text translation. But keep in mind that the architecture may accommodate a much broader class of tasks.
The transformer has two primary components: an encoder and a decoder.
- The encoder receives an input sequence and transforms it into a latent representation. The idea is that this latent representation encodes the input in some informative way.
- The decoder receives the latent representation and transforms it into a useful output. (It operates differently from your typical decoder, though—pay special attention to the graphics in this post.)
The transformer is autoregressive. That is:
- it generates the output sequence one item at a time; and
- each new item in the output sequence is a function of the previous items.
- The output sequence terminates when a special “end token” is generated.

The situation is captured in this graphic:

The transformer’s encoder receives the input sequence $x_1, x_2, \ldots, x_M $ and computes a latent representation for it. This latent representation gets passed to the decoder.

In this scenario, the transformer has already generated the first $t $ items in the output sequence—$y_1, y_2, \ldots, y_t$. The decoder generates $y_{t+1}$ as a function of (i) the latent representation and (ii) the first $t$ items. Finally, $y_{t+1} $ is appended to the output sequence and the process repeats. Note that the latent representation remains the same while the output items are being generated. That is, the latent representation only needs to be computed once for the input sequence.

Layers of the model

It’s time to reveal more details. The encoder and decoder are each composed of layers. For example, In AIAYN they both contain $K = 6$ layers.

The layers in the encoder all have identical architecture, though their weights are allowed to differ. The same applies to the decoder: its layers have identical architecture but differing weights. We’ll discuss the encoder and decoder layers later in much more detail.

Here’s the graphic again, updated to show the layers:

The encoder’s final layer produces the latent representation. Interestingly, it passes the latent representation to every layer of the decoder. This ensures that the input sequence thoroughly “informs” the generation of the next item in the output sequence.

Here’s an important detail that isn’t captured in the graphic: the latent representation is actually a collection of $M$ vectors; i.e., a collection as long as the original input sequence. And this collection of $M$ vectors gets passed to each layer of the decoder. Just imagine that each of the edges from encoder to decoder is actually a collection of $M$ edges.

Notice that the next output item, $ y_{t+1} $, is a function of the outputs of the final decoder layer. A more complete explanation is that $y_{t+1} $ is computed via a linear function followed by a softmax. Importantly, the transformer assumes you have a fixed-size vocabulary of possible output items (e.g., the 10,000 most common English words). The decoder selects the next token by (i) assigning probabilities to the vocabulary items and (ii) choosing the most probable item. (This amounts to a one-hot encoding of $y_{t+1} $—which was a point of confusion for me since one-hot encodings are not used anywhere else in the model.)

Much of the credit for this graphic goes to Jay Alammar’s “The Illustrated Transformer,” one of the few blog posts I found useful for understanding transformers. Before reading his post I found it difficult to understand the connections between the encoder and decoder.

Encoder layers and sublayers

TODO

Self-attention layer
Fully connected, position-wise neural network
Layer norm
Residual connections

Decoder layers and sublayers

TODO

Three sublayers, rather than two. Same layer norms and residual connections as before.
First sublayer: masked self-attention
Second sublayer: self-attention, including the input’s latent representation
Third layer: fully connected, position-wise neural network

Input representations

TODO

Positional encoding
Connection to Fourier series
This is one of the very few ways sequence information is preserved in the input.

Beyond sequence-to-sequence tasks

The transformer described in AIAYN has very few attributes tailoring it specifically to sequence-to-sequence tasks:

The positional encoding
The masked self-attention in the decoder
The autoregressive generative process

With the right modification, it can be quite serviceable for other classes of tasks.

Domain-appropriate encodings can encourage relevant pairs of inputs tend to pay attention to each other.
The attention-masking can easily be removed from the decoder, if it’s not appropriate for a given application domain.
It’s not too difficult to define autoregressive processes for non-sequence data.

Creativity and generative AI

Mon, 09 Jan 2023 05:00:00 +0000

Yet another nerd’s take on ChatGPT/stable diffusion/etc.

The thoughts in this post had been swirling in my head for a couple of weeks, but sort of congealed today while I listened to this EconTalk episode.

I theorize there are multiple kinds of creativity.
- Combinatorial creativity: combining existing concepts in a way that hasn’t been done before. (Relatively common.)
  - Example: Harry Potter fan fiction.
- Original creativity: generating novel concepts. (Much rarer.)
  - Example: inventing the Harry Potter universe.
- (In truth I think the reality is more complicated. The difference between “combinatorial” and “original” creativity seems like one of abstraction level, rather than kind. I.e., all new things come about from new configurations of atoms, but a creative person usually works at a higher level of abstraction than atoms.)
The creativity shown by generative models (ChatGPT, stable diffusion, Dall-E, etc.) falls almost completely in the “combinatorial creativity” bucket.
- Modern deep neural networks are essentially interpolators. In many cases they literally interpolate their training sets.
- Abstractly, concepts can be regarded as points in a high-dimensional space. A generative AI is trained on data that induce a point cloud in this space.
- Those points from the dataset have a convex hull. Current generative models seem to do a great job of interpolating within that convex hull. This is, by itself, an impressive and valuable achievement (to say the least). A ton of economic and human value comes from fleshing out that convex hull, finding new combinations of ideas that make our lives better and more interesting.
- What’s less clear is how well they can generate points outside of that convex hull. It’s rare even for humans to do that.

From this perspective, what can knowledge-workers or creatives do to maximize their comparative advantage against AI?

Pay attention to your environment. If you do encounter a novel concept, it will probably arise from some kind of interaction between you and your physical or social surroundings. This is your best chance at punching through that convex hull.
Embrace weirdness. You have something unique to contribute. Let the world see it!
Avoid being too predictable. De-correlate your thinking from that of other people. If I can predict your thoughts, words and actions from simple labels (“democrat”, “republican”, “religious”, “irreligious”, etc.) then you’re unlikely to provide any alpha to society.
Find ways to harness AI to enhance productivity.
There’s still room for a human touch. It helps to be a pleasant person to be around.

You may also invoke the muse:

O Divine Poesy,
Goddess-daughter of Zeus,
[…]
Make the tale live for us
In all its many bearings,
O Muse.

$ \blacksquare$

A less-bad blog post about attention mechanisms

Mon, 02 Jan 2023 05:00:00 +0000

A spectrum runs through the world of machine learning, with “curmudgeon statistician” at one end and “deep learning zealot” at the other. I lean toward the “statistician” end of that spectrum, so I delayed learning about attention mechanisms until recently.

It was surprisingly difficult to find clear explanations of attention. Most sources tended to be poorly-written Medium posts with a formulaic structure:

Point to the famous “Attention Is All You Need” (AIAYN) paper;
bumble through an awkward discussion of “keys,” “values,” and “queries”;
show mathematical formulas for different attention mechanisms;
describe the transformer architecture in too much detail;
show some code snippets.

Maybe that works for some people, but it didn’t work for me. AIAYN is an important paper, but it seems like a poor way to explain attention mechanisms¹.

With that background in mind, I’ve compiled my current understanding of attention mechanisms into this blog post.

This post is not a thorough survey of the literature on attention mechanisms. It only aims to be less bad than other blog posts on the topic. I hope it makes your path easier than my own.

A gradual explanation of attention

We’ll start with attention as humans experience it. Then we’ll present a mathematical description of attention, and show how it fits into machine learning. Finally, we’ll arrive at the keys, values, and queries jargon of AIAYN.

Attention in humans

Machine learning researchers chose the word “attention” on purpose. There is a strong analogy between attention mechanisms in ML and the human notion of attention.

With every waking moment, your brain is flooded with sensory data. How are you able to process it? How are you not overwhelmed?

The answer is that your brain filters out most of the data and only allows a small subset to be perceived. At any given time, only a small amount of the sensory data is considered relevant enough for perception. When data enters your perception, we say you are “paying attention” to it.

Human attention has certain key properties that will carry over to the machine learning version:

Attention assigns importance to items. It filters out irrelevant items and keeps relevant ones.
You have a finite amount of attention. You can concentrate it on few items, or spread it out over many items.

Attention as weight assignment

Our explanation of human attention suggests a mathematical description.

Suppose you have some arbitrary set of items, $x_1, x_2, \ldots, x_N $. Then we can think of attention as an assignment of nonnegative weights $p_1, p_2, \ldots, p_N $ to those items. We constrain the weights such that $\sum_i p_i = 1 $.

If we interpret the weights $p_1, p_2, \ldots, p_N $ as importances, then they satisfy the two properties of attention mentioned in the previous subsection.

The weights can then inform judgments about the set of items, with heavier items being given greater importance (i.e., more attention).

Machine learning on collections of items

This weight assignment becomes relevant for machine learning in settings where each input is a collection of items. Concretely, imagine we have a model that classifies documents. In this case each document would be regarded as a collection of words.

Here’s a simple way our model could employ attention:

Represent each word with an embedding vector: $x_1, x_2, \ldots, x_N$.
Assign a weight to each word: $p_1, p_2, \ldots, p_N $.
Compute a document vector $z$ from the weighted average of word vectors: $z = \sum_i p_i x_i$
Let additional layers infer the class from the document vector.

This is illustrated in the following graphic:

Edges indicate functional dependencies (i.e., an edge from A to B means “B is a function of A”).

A practical advantage of this approach is that it naturally accommodates inputs of varying size. In other words, the document embedding $z$ does not necessarily depend on the document’s length—we could append an arbitrary number of irrelevant words to the document and $z$ would not change.

At this point we should discuss where the weights come from. The big ML idea is to learn a function $f$ that computes the weights. We call this function that assigns weights to items an attention mechanism or attention head. Typically $f$ consists of two stages:

Compute a “relevance” or “compatibility” score for each word in the document;
use a softmax function to transform the relevance scores into nonnegative weights.

The relevance score for a word is usually a function of (1) that word’s vector representation $x_i$ and (2) additional contextual information about that word. For example, contextual information could include a word’s position in the document, its neighboring words, or some other vector embedding of the word. Some of this contextual information may be appended to the word vector $x_i$, but it’s also typical to store context information in a separate vector.

For illustration, contrast this graphic with the previous one:

(Note that it does not depict the full set of edges going into the “$p$” nodes. Since the $p$ nodes come from a softmax function, each $p$ node depends on every input to the softmax.)

So far we’ve focused on documents (collections of words) as a concrete example. However, our discussion could just as easily apply to molecules (collections of atoms), images (collections of pixels) or other domains.

Queries, keys and values

Somehow this terse, jargon-ridden paragraph from AIAYN made it fashionable to describe attention in terms of queries, keys and values:

“An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.”

Our previous discussion maps onto that jargon in a fairly straightforward way. Once again, let’s focus on the concrete example of a document model.

It’s easiest to start with the queries and keys:

“queries” = word vectors
“keys” = context vectors

This means the attention weights are primarily a function of the queries and keys. There are many ways to compute $p_1, p_2, \ldots, p_N$ from the queries and keys: so-called “additive”, “dot-product”, “scaled dot-product”, and so on. I recommend Lilian Weng’s blog post for coverage of those details.

Here’s the figure from before, with “queries” and “keys” substituted in the appropriate places:

That covers queries and keys. But what about “values?”

Values require us to introduce new nodes to our diagram—an additional node for each word. These new nodes will provide more flexibility to the model. Specifically, they will allow the word vectors that form $z$ to differ from the word vectors used to compute the attention weights (i.e., the queries). Here’s the figure again, showing the new value nodes:

Notice that before this, we only had one word vector representing each word (the query). In contrast, we now have two word vectors representing each word (query and value). And we still have the keys, which represent context (roughly speaking).

At this point we’ve arrived at a flexible and broadly applicable class of attention mechanisms. The queries, keys and values will often be generated by other layers of the network. And the output $z$ may be used as a query/key/value in another attention layer!

There are many ways to incorporate these attention mechanisms into the design of a neural network. The following sections describe a few of them.

Global attention

The attention mechanism described above assigns attention weights to every item in the collection, and produces a vector representing the entire collection. For that reason it’s often called global attention. This is the simplest way to incorporate attention into a neural network.

Global attention is interesting from an interpretability standpoint: for each input received by the model, the attention weights will indicate which items in that input are most relevant for that layer of the model.

Multi-head attention

The attention mechanism described above has a fairly strong inductive bias: it assumes the output is a weighted average of item-specific vectors. To counter this bias, an attention layer can include multiple attention heads. That is, the same set of queries/keys/values can be passed to multiple attention mechanisms $f_1, f_2, \ldots, f_K$; and afterward, the outputs of these mechanisms can be recombined in some fashion. For example, AIAYN concatenates their outputs.

Ideally, the different attention heads $f_1, f_2, \ldots, f_K$ “pay attention” to $K$ different aspects of the input; and their recombined output captures all of their “diverse perspectives”. This allows the output to not be a simple weighted average of values.

Multi-head attention is analogous to having multiple channels in a convolutional neural network. Additional channels allow a CNN to learn multiple convolutional kernels that detect distinct patterns in the data.

Self-attention (or intra-attention)

To keep things concrete, let’s once again assume we’re working with text data (documents of words).

The high level idea of self-attention is to allow each word in the document to “pay attention” to the other words in the document. This allows the model to capture pairwise interactions between words. People call this “self-attention” or “intra-attention”; I prefer the term “intra-attention” since it seems more accurate.

Recall that global attention applies a single attention head $f$ to the entire document $x_1, x_2, \ldots, x_N$, producing a single vector $z$. In contrast, imagine we have an attention head for every word in the document.

More accurately, imagine we have a single attention head $f$, but for each word $x_i$ in the document we compute a new set of attention weights tailored to that word: $ p_{i,1}, p_{i,2}, \ldots, p_{i,N} $. These weights encode the strength of pairwise relationships between word $x_i$ and words $x_1, \ldots, x_N$. Finally, suppose we use the attention head $f$ to compute $z_i $; a new vector for word $i$.

Here’s a graphic, showing the situation for $i = 1$:

If we do this for every word in the document, then we end up computing $N^2$ attention weights and producing outputs $z_1, z_2, \ldots, z_N$. We can think of $z_1, z_2, \ldots, z_N$ as new vectors for the words in the document, updated to include information from pairwise relationships with other words in the document.

Some important things to notice:

The complexity of intra-attention grows quadratically with the length of the document. This is no surprise, since it introduces pairwise interactions between words. There are domain-specific strategies for overcoming the $O(N^2)$ complexity, like ignoring words past a certain distance in the document.
Multiple rounds of intra-attention allow a model to capture higher-order relationships between words, rather than just pairwise relationships. Intra-attention can be thought of as a form of message passing, similar to that in graph convolutional networks.

It’s straightforward to define a multi-head version of intra-attention. Do the natural thing—replace the single attention head with a multi-head mechanism, just as in global attention.

Wrapping up

~~I would have liked to cover transformers, but they’re important enough to warrant their own post. (Also, I already spent too much time on this post.)~~

Edit 2023-01-23: I’ve started writing a post about transformers. Read it here:

(“A less-bad blog post about transformers”)

Fusion confusion

Mon, 19 Dec 2022 05:00:00 +0000

I’m bothered by some of the media’s breathless, credulous coverage of recent events in fusion research. People are too excited; they’re forming incorrect conclusions.

This article from Science is one of the better ones I’ve seen on the subject.

What are the facts?

NIF has an inertial confinement reactor that shoots lasers at a pellet of hydrogen fuel; the goal is to ignite a fusion reaction.
In a recent NIF experiment, lasers shot ~2 MJ of energy into a fuel pellet. About 3 MJ came out… which is pretty exciting!
However, the lasers themselves required hundreds of MJ to operate during that single shot. So the reactor at NIF is nowhere close to a self-sustaining fusion reaction.
This experiment was the latest in a series of very similar NIF experiments in recent years. Its success resulted from (a) incremental improvements to the fuel pellet design and (b) increased power for the lasers.

What’s bothering me?

The fact that the reaction is not self-sustaining has not been given enough attention. A lot of people I talk to seem to have missed that crucial detail. I don’t blame those people for missing it. But I do blame journalists for not giving it more prominence in their coverage.
People keep calling this “revolutionary,” or a “breakthrough.” To me it seems more like an exciting—though incremental—improvement in inertial confinement fusion.
I’ve heard some people say that this experiment proves a scientific principle. They say it’s an existence proof that controlled fusion is possible. However, I don’t think there was any doubt controlled fusion was physically possible. The physics was never in question; it has always been fundamentally an engineering problem. And most of the engineering challenges remain unsolved.
A lot of the excitement seems misplaced given that inertial confinement reactors are unlikely to produce economically viable fusion. Most experts consider magnetic confinement devices (e.g., tokamaks or stellarators) much more promising for attaining self-sustaining fusion. I doubt the technological lessons learned at NIF will help improve magnetic confinement designs.
Some people say public excitement about these advances is a good thing, since it will result in increased funding for fusion research. That may be true. But if we still don’t have commercial fusion reactors in 10 or 20 years, tax-payers and investors may become resentful. That funding could quickly disappear if people feel that the researchers oversold and underdelivered.

I voice these concerns as somebody who is a big fan of fusion research. I wish it had more funding. But I think the discussion around recent events—by journalists and politicians—has been confused or dishonest.

$ \blacksquare$

Some amateur utility theory about gift-giving

Sun, 18 Dec 2022 05:00:00 +0000

I remember taking a microeconomics class in college. It covered standard introductory material about utility theory. Preferences, utility functions, and constrained optimization.

At one point the professor made an observation about gift-giving: that cash is the gift of maximal utility for the receiver. The reason is simple. Cash imposes no constraint on the receiver, so the receiver is free to exchange the cash for whatever maximizes their own utility.

I was very receptive to this idea. It made sense to me, since I usually enjoyed receiving cash as a birthday or Christmas present (or gift cards, which are a slightly constrained version of cash). The insight also gave me some solace as a gift giver: I could justify the lazy choice of cash or gift certificates from very sophisticated economic principles.

However, my feelings on cash gifts have changed over the years. I doubt I’ve arrived at any novel insights—I’m guessing there’s a pile of academic literature on the subject, but I don’t know any of it. Here are some of my thoughts anyways:

Nobody truly knows their own utility function. Even if we assume (a) my utility function exists and (b) it remains constant over my lifetime, the reality is that I will never fully explore it. Since I’m still young, there are many goods I haven’t experienced yet. And since life is finite, there are many goods I will never experience. Many (most?) dimensions of my utility function will remain unknown to me.
If someone gives me cash, it enables me to take a utility-increasing step within my known preferences. This serves as a useful baseline for gift-giving: can you do better than giving simple cash?
The interesting thing—the point my professor made—is that you can not do better than cash if you confine yourself to the receiver’s known preferences. This implies that the only way to beat a cash gift is to provide a good that the receiver is not familiar with.
This seems like a tall order. How do I
1. identify the unknown unknowns in another person’s life; and
2. choose one that they would appreciate?
A gift that succeeds at these tasks could be called a thoughtful gift. I have known people who are skilled at giving gifts. I think it is a real skill, and has something to do with competence at tasks 1 and 2.
An additional human reality may make it easier to beat cash gifts: people have limited attention and memory. So there may be preferences that they have explored in the past, but that they’ve forgotten or lost sight of for whatever reason. A gift that brings those preferences back to the receiver’s attention could also beat a cash gift.

There you have it: some theories of gift-giving. Like most theorists, I’m a lousy practitioner.

$ \blacksquare$

Thinking about startup stock options

Wed, 07 Dec 2022 05:00:00 +0000

I’ve been reading up on how equity compensation works at biotech startups.

Some of the more helpful sources I found:

Rapport
BioBuzz
Reddit
Wellfound/AngelList
- This blog post
- This interactive tool
- (I would guess Wellfound’s numbers are more typical of software startups, which are much less capital-intensive than biotechs.)
GPT-3:

With enough reading I was able to arrive at coarse-grained mental model of the decision-landscape presented by equity compensation.

When I consider working for a startup, I expect to (a) take a somewhat lower salary than I would at a more mature company, in exchange for (b) exposure to potential upside contingent on the company’s success. The company will probably not succeed; but if it does, then I would hope it yields life-changing wealth. I would also expect the risk and reward to be higher for employee 10 than for employee 100.

Suppose a startup offers 10,000 shares of stock options. Some things to think about:

\[\text{your stake} = \frac{\text{your shares}}{\text{total shares outstanding}}\]

What fraction of the company do those shares represent? Equivalently: what is the total number of shares outstanding?
Is the total number of shares likely to increase? If the startup is very young then the answer is “yes—by a lot”. If this is the case, then your fraction will shrink. This is called “dilution.”
This fraction of ownership (your “stake”) gives an idea of the potential value of your shares.
- Suppose the 10,000 shares represent 0.1% of the startup. Then imagine all of your dreams come true, and the startup reaches a $1B valuation.
- In that case, your stake would be worth 0.001 x $1B = $1M. Which is great!
- However, suppose there is significant dilution. Then the stake would be worth much less than that.
Some other salient things to be aware of:
- Vesting schedule. You will be given the stock options on a gradual basis. And this usually only begins to happen after you’ve been an employee for a whole year. The typical arrangement is “four year vesting with a one year cliff.” This means you’ll receive 0 options during the first year, and ~3,333 options the following three years (usually on a monthly basis). This schedule ends if you quit or get fired.
- Liquidity. An early stage startup is usually a private company. It can be very difficult to sell the stock of a private company; you typically can’t sell your stock until the company goes public or gets acquired (in which case the acquiring company may buy out your equity, though it’s not guaranteed).
- Strike price. It will cost money to exercise your options and get actual stock. For example, suppose we’re back in the dream scenario and the strike price is $1 per share. Then you would have to pay $10K to obtain all of your stock (which you could then resell for $1M, netting $990K). The strike price may vary over the vesting schedule as the valuation of the company changes.
- Taxes. You will have to pay taxes (a) when you exercise the options and (b) when you sell the stock. In the dream scenario, you will end up with much less than $990K post-tax.
- If you quit or get fired, then you will be given a strict time limit to exercise your options—usually no more than 90 days. Otherwise you forfeit your options. This is a difficult situation if the shares are still illiquid! You either pay taxes and the strike price up front for illiquid stocks, waiting with the hope of a liquidity event in the future; or you throw away the options that you earned over years of employment.

Once the offer is accepted, the value of your stock options is a random quantity that depends on (a) the startup’s success; (b) your employment status in the startup and (c) certain choices you make. It seemed natural to think of the situation as a Markov decision process. I went through the trouble of sketching it out in Inkscape:

You begin at the top of the image. Time flows downward and out, roughly speaking. This doesn’t show every possible state or action. My goal was to capture some of the more salient parts of the decision landscape. Perhaps the most important part of this picture is the large probability of the company failing and the stock price going to zero.

I should emphasize that compensation is one of many criteria for taking a job. The other criteria belong in different blog posts, though.

$ \blacksquare$

A genealogy of probability distributions

Sun, 27 Nov 2022 05:00:00 +0000

Some of the Wikipedia pages I visit the most belong to probability distributions:

My favorite parts of these pages are the “Related distributions” and “Random variate generation” sections. Those sections describe (i) sometimes-surprising mathematical relationships between probability distributions and (ii) algorithmic strategies for generating samples from them.

This paints an interesting picture in my head: a family tree of probability distributions. It’s not actually a tree, but it is a directed graph that suggests some distributions are more ancestor-like while others are more descendant-like.

I’ve sketched it out in the SVG below. Making this diagram required many choices regarding (a) which information to include and (b) how to display it in space. You’re free to disagree with my choices.

I really wanted to capture some of the generative stories underlying these distributions. The diagram contains a combination of exact, approximate, asymptotic, and algorithmic relationships. Various other relationships are excluded: conjugate priors, for example.

Try opening the image in a new tab and zooming in for a closer look.

In this picture, randomness begins with nearly-physical representations of uncertainty: fair coins and urns. The central limit theorem gives the normal distribution a “sink-like” quality in the directed graph (though Cauchy and some Pareto or t-distributions evade capture).

The relationship between e.g., Binomial and Poisson—marked “flips $ \rightarrow \infty$”—glosses over a beautiful mathematical process that I might describe in a future post.

$ \blacksquare$

Edit 2022-12-15: Yesterday I came across Larry Leemis’s website that does something similar, but in much greater detail. Larry Leemis is a professor at William and Mary.

If you're not writing, you're not thinking

Thu, 10 Nov 2022 05:00:00 +0000

If you’re not writing, you’re not thinking.

My lazy Google search didn’t immediately yield an attribution for this quote, so I gave up. I’m pretty sure I first heard it years ago on the Tim Ferriss podcast.

Similar quotes do have attributions:

If you’re thinking without writing, you only think you’re thinking.

—Leslie Lamport

(The fact that this comes from Leslie Lamport makes me like it more.)

If you can’t write clearly, you probably don’t think nearly as well as you think you do.

—Kurt Vonnegut

Regardless of who says it, the concept resonates with my experience.

Most of my writing doesn’t end up on this website. It exists as emails, messages, notes, or journal entries. In each case writing forces me to confront the ideas, intuitions, and feelings bouncing around in my head and impose order on them. Literally impose order, as in, arrange them in a sequence that has some logic to it.

It’s a healthy exercise. Once again, I mean that literally. Cognitive behavioral therapy is based on the idea that thinking dispassionately about our feelings can help us gain control over them and become more mentally healthy. I don’t deal with an unusual amount of negative emotion, but I find that when I do a good journaling session helps me process it. I write down what’s bothering me and the potential causes. I evaluate their plausibilities and consider solutions. Writing about problems helps me think clearly about them, which in turn asserts my agency over them.

Writing’s usefulness translates to my work life as well. I spend a lot of my time in a problem-solving loop where (1) I’m confronted by some problem, (2) I have to think of possible solutions and (3) I try those solutions and see what happens. Writing aids each of these steps.

I can tell when I’ve gone several days without any thoughtful writing. My brain feels noticeably more sluggish. Thoughts, words, and ideas come to me much more slowly.

TBH the whole reason I’m writing this post is that I haven’t done enough writing recently, and the sluggish feeling was returning. This has been a good session… Thanks for humoring me!

$ \blacksquare$

Themes from ICML 2022

Tue, 26 Jul 2022 05:00:00 +0000

I had the fortune of attending ICML 2022 in Baltimore, MD last week.

I didn’t present any of my work there—I went because my advisor was invited to give a talk in the AI for Science workshop. Tony said I could attend the conference, and I was happy to oblige!

It was refreshing to go back to a big machine learning conference, post-pandemic. Apparently ~6,000 people registered for the conference. They had 10 sessions running in parallel for three days, and two more days of workshops after that.

It was a deluge of information; I could only catch a small fraction of the content. I paid attention and took some notes, though.

Some themes stood out to me. I’ll list them here in no particular order:

Biological and chemical applications. The conference had 3 keynote talks, and 2 of them focused on biological applications. There was a computational biology workshop on Friday, and an AI for science workshop on Saturday—biology and chemistry took center stage in the AI for science workshop.
Graph neural networks; transformers. There were several sessions devoted to GNNs and attention mechanisms.
Robustness; distributional shift; out-of-distribution (OOD) detection. There were sessions and workshops devoted to these topics. I didn’t attend them, but I saw many of their posters. UW-Madison was well-represented in this area. Sharon Li’s group presented half a dozen or so papers.
Self-supervised learning and contrastive learning. These have become dominant learning paradigms in recent years—the “general purpose self-supervised training followed by application-specific fine tuning” strategy seems to work great for images and natural language. On a somewhat related note: a surprising number of talks focused on data augmentation.
Multi-modal tasks and representations. Most often learning unified representations from images and natural language. (I can imagine this extending to, e.g., multiomic data).
Certain kinds of mathematical sophistication. SE(3) layers; invariant and equivariant features. Connections between ODEs/PDEs and neural networks. Using neural nets to solve differential equations; using differential equations as a form of prior knowledge; using neural networks to solve ill-posed inverse problems. These use concepts that I recognize from physics, but haven’t noticed in ML until now.

I want to flesh out the first bullet point a bit more; the emphasis on biological and chemical applications. The two relevant keynotes were given by Regina Barzilay (MIT) and Aviv Regev (Genentech). Additionally, I’ll mention workshop talks by Daphne Koller (Insitro) and Chris Langmead (Amgen). All of these talks seemed to emphasize that data is the bottleneck in biological and chemical applications. Several of them seemed to propose closed loops between ML models and laboratories—an active learning framework.

I won’t claim that these themes give an objective picture of ICML. Think of it as an opinionated summary. I’d be interested to hear other people’s observations.

$ \blacksquare$

David Merrell's Blogsite

Map of the cat

A less-bad blog post about transformers

Why I’m writing this

A gradual explanation of the transformer architecture

Broad brush strokes

Layers of the model

Encoder layers and sublayers

Decoder layers and sublayers

Input representations

Beyond sequence-to-sequence tasks

Other reading

Creativity and generative AI

A less-bad blog post about attention mechanisms

A gradual explanation of attention

Attention in humans

Attention as weight assignment

Machine learning on collections of items

Queries, keys and values

Global attention

Multi-head attention

Self-attention (or intra-attention)

Wrapping up

Other reading

Fusion confusion

Some amateur utility theory about gift-giving

Thinking about startup stock options

A genealogy of probability distributions

If you're not writing, you're not thinking

Themes from ICML 2022