<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2021-08-14T18:54:37+00:00</updated><id>/feed.xml</id><title type="html">Musings on Math (++)</title><subtitle>I write here about Math and other topics.
</subtitle><entry><title type="html">Museum of dancing math</title><link href="/jekyll/update/2019/03/25/museum_of_mathematical_gifs.html" rel="alternate" type="text/html" title="Museum of dancing math" /><published>2019-03-25T00:34:40+00:00</published><updated>2019-03-25T00:34:40+00:00</updated><id>/jekyll/update/2019/03/25/museum_of_mathematical_gifs</id><content type="html" xml:base="/jekyll/update/2019/03/25/museum_of_mathematical_gifs.html">&lt;p&gt;To create a bouncy sphere or a wavy sphere, run&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pyray.shapes.sphere&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;draw_wavy_sphere_wrapper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'.&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;im'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;66&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;https://s2.gifyu.com/images/WavySphere.gif&quot; alt=&quot;Image formed by above method&quot; width=&quot;240&quot; height=&quot;240&quot; border=&quot;10&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s2.gifyu.com/images/AndreReflcn.gif&quot; alt=&quot;Image formed by above method&quot; width=&quot;240&quot; height=&quot;240&quot; border=&quot;10&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pyray.shapes.polyhedron&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;basedir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'.&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tetartoid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;31&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;im&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;RGB&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2048&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2048&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;draw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ImageDraw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Draw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;im&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'RGBA'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;general_rotation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]),&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;render_solid_planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;draw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shift&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scale&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;750&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;im&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;save&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;basedir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;im&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;.png&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=OV7c6S32IDU&quot; target=&quot;_blank&quot;&gt;&lt;img src=&quot;https://s2.gifyu.com/images/tetartoid2.gif&quot; alt=&quot;Image formed by above method&quot; width=&quot;240&quot; height=&quot;240&quot; border=&quot;10&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pyray.shapes.pointswarm&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;points_to_bins&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=OV7c6S32IDU&quot; target=&quot;_blank&quot;&gt;&lt;img src=&quot;https://s2.gifyu.com/images/pointswarm.gif&quot; alt=&quot;Image formed by above method&quot; width=&quot;240&quot; height=&quot;240&quot; border=&quot;10&quot; /&gt;&lt;/a&gt;&lt;/p&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">To create a bouncy sphere or a wavy sphere, run</summary></entry><entry><title type="html">Hazard rate</title><link href="/jekyll/update/2018/10/12/hazard_rate.html" rel="alternate" type="text/html" title="Hazard rate" /><published>2018-10-12T05:31:00+00:00</published><updated>2018-10-12T05:31:00+00:00</updated><id>/jekyll/update/2018/10/12/hazard_rate</id><content type="html" xml:base="/jekyll/update/2018/10/12/hazard_rate.html">&lt;h2 id=&quot;what-is-it-good-for&quot;&gt;What is it good for?&lt;/h2&gt;
&lt;p&gt;The hazard rate is a very useful concept when modelling the time to an event. It has “hazard” in its name because the people who named it were 
modelling negative events like deaths or failures. But it’s just as useful when modelling positive events like recovery from a failure.&lt;/p&gt;

&lt;h2 id=&quot;what-is-it&quot;&gt;What is it?&lt;/h2&gt;
&lt;p&gt;The mathematical definition of the hazard rate is the probability that the event we are interested in will occur in the next unit of time, conditional 
on it not having occurred so far. That might or might not have been too useful on its own.&lt;/p&gt;

&lt;p&gt;To understand it intuitively, we’ll need to think about my favorite Batman villain, Two-Face.&lt;/p&gt;

&lt;p&gt;For those unfamiliar, Two-Face is basically a guy with a coin he likes to toss. When he gets heads, bad things happen.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/HazRate/twoface.jpg&quot; alt=&quot;Probability sequences&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now imagine Two-Face standing by the side of a freeway, watching cars zip by. He tosses his coin every millisecond. 
If he gets heads in any of these milliseconds, an accident immediately ensues. Luckily, the probability of heads is quite low, at one in a billion. 
On the other hand, the guy tosses his coin roughly 100 million times a day. So, we should expect about one accident every 10 days.&lt;/p&gt;
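&lt;p&gt;As a quick sanity check on those numbers (a throwaway sketch, not part of any library):&lt;/p&gt;

```python
# Sanity-check of Two-Face's freeway numbers from the text.
p_heads = 1e-9            # one-in-a-billion chance of heads per toss
tosses_per_day = 100e6    # roughly one toss per millisecond, ~100 million a day
accidents_per_day = p_heads * tosses_per_day
days_between_accidents = 1 / accidents_per_day
print(days_between_accidents)  # 10.0 -> about one accident every 10 days
```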

&lt;p&gt;Now, the probability that we will see a heads in the next toss doesn’t depend on how many times he tossed the coin before.&lt;/p&gt;

&lt;p&gt;The coin doesn’t care what it did in previous tosses; it has no memory. If you think it does, you suffer from a condition known as the gambler’s
fallacy (side effects include losing all of one’s pocket money on unsound hypotheses).&lt;/p&gt;

&lt;h2 id=&quot;a-changing-coin&quot;&gt;A changing coin?&lt;/h2&gt;

&lt;p&gt;If Two-Face keeps tossing the same coin (with the same probability of heads), the hazard rate for the next accident doesn’t change with time.
Again, this is because the coin doesn’t care that it didn’t show you a heads for the past \(t\) tosses.&lt;/p&gt;
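&lt;p&gt;We can verify this numerically. With a fixed coin, the toss \(T\) on which the first heads appears is geometrically distributed, and the conditional probability of heads on the next toss is the same no matter how long we have waited (a quick sketch, not tied to any library):&lt;/p&gt;

```python
# For a fixed coin with heads-probability p, the hazard rate
# P(T = t+1 | T > t) of the first-heads time T never changes with t.
p = 0.3
for t in range(5):
    p_survive = (1 - p) ** t        # P(T > t): t tails in a row
    p_next = (1 - p) ** t * p       # P(T = t+1): t tails, then a heads
    print(t, p_next / p_survive)    # always p = 0.3: the coin has no memory
```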

&lt;p&gt;This might be a good assumption in certain scenarios (and modelling freeway accidents might well be one of them), but not in most. For example, 
when modelling the recovery of a machine, it is logical to assume that if a machine didn’t auto-recover within a certain time, it becomes less likely to
do so in the next second. This is equivalent to Two-Face switching to coins with a lower and lower probability of heads the more he tosses.&lt;/p&gt;

&lt;p&gt;Alternately, if we’re modelling the lifetime of a person or machine before catastrophic failure, it is logical to assume that the longer they have lived, the higher the chance they are going
to die or fail in the next year or so. In this case, Two-Face is gradually increasing the probability of heads as the number of tosses increases (perhaps because he’s
getting impatient).&lt;/p&gt;

&lt;p&gt;We can see that most positive events have a decreasing hazard rate while negative ones have an increasing hazard rate.&lt;/p&gt;
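&lt;p&gt;A standard family that exhibits all three behaviours is the Weibull distribution (not used above, purely an illustration): its hazard rate decreases with time when the shape parameter \(k\) is below one, stays constant when \(k=1\) (the fixed-coin case), and increases when \(k\) is above one.&lt;/p&gt;

```python
def weibull_hazard(t, k, lam=1.0):
    # Hazard rate of a Weibull distribution: h(t) = (k/lam) * (t/lam)**(k-1)
    return (k / lam) * (t / lam) ** (k - 1)

for k in (0.5, 1.0, 2.0):
    # k=0.5: decreasing (recovery), k=1: constant (fixed coin), k=2: increasing (wear-out)
    print(k, [round(weibull_hazard(t, k), 3) for t in (0.5, 1.0, 2.0)])
```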

&lt;p&gt;What conclusion you draw from that is up to you.&lt;/p&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">What is it good for? Hazard rate is a very useful concept when modelling the time to an event. It has the “hazard” in it’s name because the people who named it were modelling negative events like deaths or failures. But it’s just as useful when modelling positive events like recovery from a failure.</summary></entry><entry><title type="html">Coin toss markov chains</title><link href="/jekyll/update/2018/10/08/competitive_coin_tossing.html" rel="alternate" type="text/html" title="Coin toss markov chains" /><published>2018-10-08T03:02:40+00:00</published><updated>2018-10-08T03:02:40+00:00</updated><id>/jekyll/update/2018/10/08/competitive_coin_tossing</id><content type="html" xml:base="/jekyll/update/2018/10/08/competitive_coin_tossing.html">&lt;h2 id=&quot;1-the-question&quot;&gt;1. The question&lt;/h2&gt;
&lt;p&gt;Let’s start with a simple question that will motivate the content of this blog. Not only is the answer beautiful, 
but it also helps us develop a framework for answering a whole family of such questions. The question goes like this: 
let’s say I have a fair coin (50-50 chance of heads and tails) and so do you. Both of us start tossing our coins. 
What is the probability that you will get three heads in a row before I get two heads in a row? 
It’s quite clear that I have a higher chance of winning (but don’t worry,
my better odds are balanced by us obsessing over your victory throughout this post). 
Now, how do we go about calculating how much higher?&lt;/p&gt;

&lt;h2 id=&quot;2-thinking-in-terms-of-states&quot;&gt;2. Thinking in terms of states&lt;/h2&gt;
&lt;p&gt;We can think of how close each player is to realizing his goal in terms of states. The first player (I) keeps tossing until he
reaches two consecutive heads. So, we can define his state after n tosses as the number of consecutive heads he has so far.
In the beginning, there are no tosses. So, the number of consecutive heads at that point is obviously zero.&lt;/p&gt;

&lt;p&gt;If he gets a heads on the first toss, then the number of consecutive heads after the first toss is one, whereas if he gets
a tails, it stays at zero. Hence, anytime a player is in the state “zero consecutive heads”, there is a 50% chance they will
go to one consecutive head and a 50% chance they will stay at zero consecutive heads after the next toss. The probability that 
the player will jump from zero consecutive heads to two consecutive heads in one toss is zero.&lt;/p&gt;

&lt;p&gt;We can collect these three numbers into a vector of probabilities.&lt;/p&gt;

\[v = (.5, .5, 0)\]

&lt;p&gt;Similarly, if a player is at one consecutive head so far on any toss, the probability that they will be at two consecutive heads
after the next toss is 50% and the probability of dropping back to zero is 50%. There will be a vector for transitions from this
state as well.&lt;/p&gt;

&lt;p&gt;So, each state has a vector of probabilities of going to each of the other states. We can collect these vectors and create a matrix.&lt;/p&gt;

&lt;p&gt;This is what it will look like for me:&lt;/p&gt;

\[M_3 = \left( \begin{array}{ccc}
		0.5 &amp;amp; 0.5 &amp;amp; 0 \\
		0.5 &amp;amp; 0 &amp;amp; 0.5 \\
		0 &amp;amp; 0 &amp;amp; 1
		\end{array} \right)\]

&lt;p&gt;Note that the third row corresponds to the transitions from state 2. Since I plan on stopping once I reach two consecutive heads,
I will not be making transitions to any other states once I reach state 2. That’s why state 2 (third row) has a probability 1 of
simply transitioning to itself.&lt;/p&gt;

&lt;p&gt;Similarly, you plan on stopping once you reach three consecutive heads. So, your transition matrix will look like this:&lt;/p&gt;

\[M_4 = \left( \begin{array}{cccc}
		0.5 &amp;amp; 0.5 &amp;amp; 0 &amp;amp; 0 \\
		0.5 &amp;amp; 0 &amp;amp; 0.5 &amp;amp; 0\\
		0.5 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0.5\\
		0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1
		\end{array} \right)\]

&lt;h2 id=&quot;3-probabilities-at-nth-toss&quot;&gt;3. Probabilities at nth toss&lt;/h2&gt;
&lt;p&gt;Now, let’s think of the probabilities that I am in each of the possible states after a certain number of tosses. As said before,
the probabilities at zero tosses are pretty clear. Both sequences are at zero consecutive heads. For the first sequence for example,
there is a 100% chance I’m in state 0, a 0% chance I’m in state one (one consecutive head) and 0% chance I’m in state two. These three
probabilities can be collected into a vector:&lt;/p&gt;

\[P_0 = (1, 0, 0)\]

&lt;p&gt;Now, what about after one toss? If I get a heads, I go to state one; if I get a tails, I stay in state zero. In other words, there
is a 50-50 chance I’ll be in state 0 or state 1 after the first toss.&lt;/p&gt;

\[P_1 = (0.5, 0.5, 0)\]

&lt;p&gt;This argument might give you a feeling of déjà vu, since we used the exact same one when constructing the matrix \(M_3\). In fact, we can get \(P_1\)
by simply multiplying \(P_0\) with \(M_3\).&lt;/p&gt;

\[P_0 M_3 = (1,0,0) \left( \begin{array}{ccc}
		0.5 &amp;amp; 0.5 &amp;amp; 0 \\
		0.5 &amp;amp; 0 &amp;amp; 0.5 \\
		0 &amp;amp; 0 &amp;amp; 1
		\end{array} \right) = (.5,.5,0) = P_1\]

&lt;p&gt;Similarly, to get \(P_2\), we multiply \(P_1\) with \(M_3\).&lt;/p&gt;

\[P_1 M_3 = (.5,.5,0) \left( \begin{array}{ccc}
		0.5 &amp;amp; 0.5 &amp;amp; 0 \\
		0.5 &amp;amp; 0 &amp;amp; 0.5 \\
		0 &amp;amp; 0 &amp;amp; 1
		\end{array} \right) = (.5,.25,.25) = P_2\]

&lt;p&gt;Which means that after 2 tosses, there is a .25 probability that I will have reached my goal.&lt;/p&gt;
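&lt;p&gt;This matrix-vector recursion is easy to check numerically (a quick sketch using numpy):&lt;/p&gt;

```python
import numpy as np

# My transition matrix (goal: two consecutive heads).
m_3 = np.array([[0.5, 0.5, 0.0],
                [0.5, 0.0, 0.5],
                [0.0, 0.0, 1.0]])
p_0 = np.array([1.0, 0.0, 0.0])  # P_0: certainly at zero consecutive heads

p_1 = p_0 @ m_3   # P_1 = (0.5, 0.5, 0)
p_2 = p_1 @ m_3   # P_2 = (0.5, 0.25, 0.25)
print(p_1, p_2)
```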

&lt;p&gt;But from the previous equation we can say,&lt;/p&gt;

\[P_2 = P_1 M_3 = P_0 M_3 M_3 = P_0 M_3^2\]

&lt;p&gt;In general we can say,&lt;/p&gt;

\[\begin{equation} P_n = P_0 M_3^n \tag{1}\end{equation}\]

&lt;p&gt;Now, we can of course say this for any transition probability matrix \(M\) (non-negative entries and rows sum to one).&lt;/p&gt;

&lt;p&gt;For your Markov chain (you need three consecutive heads), we can similarly define the probabilities \(Q_n\) that you will be in 
each of the states 0, 1, 2 and your goal state of 3 consecutive heads. Like before,&lt;/p&gt;

\[\begin{equation} Q_n = Q_0 M_4^n \tag{2}\end{equation}\]

&lt;p&gt;\(Q_0\) is your state at zero tosses which is \((1,0,0,0)\) (with a probability of one, you are in state 0 - you haven’t tossed,
so have obtained zero consecutive heads so far).&lt;/p&gt;
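&lt;p&gt;Here is a quick numpy sketch of what equation (2) gives for the first few tosses; note that \(Q_3[3] = 1/8\), the probability of three straight heads:&lt;/p&gt;

```python
import numpy as np

# Your transition matrix (goal: three consecutive heads).
m_4 = np.array([[0.5, 0.5, 0.0, 0.0],
                [0.5, 0.0, 0.5, 0.0],
                [0.5, 0.0, 0.0, 0.5],
                [0.0, 0.0, 0.0, 1.0]])
q_0 = np.array([1.0, 0.0, 0.0, 0.0])  # Q_0: certainly at zero consecutive heads

# Q_n = Q_0 M_4^n for n = 0..3
for n in range(4):
    print(n, q_0 @ np.linalg.matrix_power(m_4, n))
```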

&lt;p&gt;Let’s plot the probabilities that you will be in each of the states as a function of \(n\).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/CompetitiveCoinToss/probs_seq.png&quot; alt=&quot;Probability sequences&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the beginning, the starting state (0 consecutive heads) is obviously at 1.0. But it soon drops to zero. The other two non-absorbing states also climb initially, since the chain must pass through them on its way to the absorbing state. However, they quickly fall to zero as all the probability
mass is sucked up by the absorbing state (in green).&lt;/p&gt;

&lt;p&gt;Here is some simple, self-contained python code that shows how to calculate the probabilities of this sequence getting to the absorbing state as a function of the
number of tosses (the green line in the plot).&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# The transition matrix.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
	 &lt;span class=&quot;p&quot;&gt;[.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
	 &lt;span class=&quot;p&quot;&gt;[.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
	 &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
	 &lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;## Probabilities of getting to absorbing state after n tosses.
# First raise the matrix to nth power
# then get the index of absorbing state
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p_n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;\
						 &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;\
                               &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Calculate this upto 100 tosses.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The plot will look very similar for the elements of the vector \(P_n\) of my states. However, my absorbing state (two consecutive heads)
will be reached much quicker. In fact, let’s compare the probabilities that you and I reach our goals (absorbing states: 2 heads and 3 heads respectively) 
by plotting them side-by-side.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/CompetitiveCoinToss/probs_two_chains.png&quot; alt=&quot;Probability sequences&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;4-using-the-sequences-to-get-the-answer&quot;&gt;4. Using the sequences to get the answer&lt;/h2&gt;

&lt;p&gt;All this is well and good but how is it going to help us get the number we’re looking for? What is the probability that you will
reach three consecutive heads before I reach two? Let’s call this event \(A\). We then want to calculate \(P(A)\) (probability of event \(A\)).&lt;/p&gt;

&lt;p&gt;Since we have the probabilities of both sequences being in each state, \(P_n\) and \(Q_n\), let’s try and use those. We can start by
thinking of the event where you win on the \(n\)th toss. Let’s call this event \(A_n\). The only way \(A\) can happen is if one of the \(A_n\)s happens.&lt;/p&gt;

&lt;p&gt;So, we can say that event \(A\) is the union of the events represented by \(A_n\).&lt;/p&gt;

\[A = \bigcup_{n=1}^{\infty}A_n\]

&lt;p&gt;Also, the \(A_n\)’s are mutually exclusive (meaning you can’t win on the third toss &lt;strong&gt;and&lt;/strong&gt; the fourth toss). Because of this we can say:&lt;/p&gt;

\[P(A) = \sum_{n=1}^{\infty}P(A_n)\]

&lt;p&gt;So, if we can compute the \(P(A_n)\)’s, we can sum them to get the probability we are interested in.&lt;/p&gt;

&lt;p&gt;What needs to happen for you to win on the nth toss?&lt;/p&gt;

&lt;p&gt;1) The losing sequence (mine), given by \(P_n\), should not have reached its absorbing state by the \(n\)th toss. Otherwise, you &lt;strong&gt;didn’t&lt;/strong&gt; reach
three consecutive heads before I reached two. The probability of this is \((1-P_n[2])\).&lt;/p&gt;

&lt;p&gt;2) The winning sequence (yours), given by \(Q_n\), should have reached, by the toss &lt;em&gt;one earlier than&lt;/em&gt;
the current toss, a state where it needs just one more heads to win. In this case, it means you should have reached two consecutive heads by the \((n-1)\)th toss. The probability of this is \(Q_{n-1}[2]\).&lt;/p&gt;

&lt;p&gt;3) And finally, you should get a heads on the \(n\)th toss and complete the coup de grâce. The probability of this is \(\frac{1}{2}\) since the coins are fair.&lt;/p&gt;

&lt;p&gt;Combining those three events, we get:&lt;/p&gt;

\[P(A_n) = (1-P_n[2])(Q_{n-1}[2]) \frac{1}{2}\]

&lt;p&gt;And so we get to the most important equation of this blog,&lt;/p&gt;

\[\begin{equation}P(A) = \sum_{n=1}^{\infty} (1-P_n[2])(Q_{n-1}[2]) \frac{1}{2} \tag{3} \end{equation}\]

&lt;p&gt;There are other ways to represent the equation above. Here, we took the probability that you reach your goal on the \(n\)th toss to be \(\frac {Q_{n-1}[2]} {2}\). Taking it another way, we know that the probability that you will reach your goal on or before the \(n\)th toss is \(Q_{n}[3]\). Then, the probability that you will reach your goal &lt;em&gt;on&lt;/em&gt; the \(n\)th toss is the probability that you reach it on or before the \(n\)th toss minus the probability that you reach it strictly before the \(n\)th toss, which is just \(Q_{n}[3] - Q_{n-1}[3]\).&lt;/p&gt;

&lt;p&gt;This means that equation (3) above can also be written as:&lt;/p&gt;

\[P(A) = \sum_{n=1}^{\infty} (1-P_n[2])(Q_{n}[3]-Q_{n-1}[3])\]

&lt;p&gt;We can get \(P_n\) and \(Q_n\) by using equations (1) and (2) and simply multiplying by the transition matrices \(n\) times. Of course, the sum in equation (3) has infinitely many terms and we can’t multiply the matrices an infinite number of times. However, note that (as we can see from figure 1), \((1-P_n[2])\)
will tend to zero as \(n\) becomes large (the absorbing state - which is state 2 - sucks up all the probability mass as \(n\) increases, 
so one minus its probability tends to zero), 
and \(Q_{n-1}[2]\) will tend
to zero for the same reason. So, the contributions of the terms corresponding to large \(n\) become smaller and smaller. 
Hence, we can get a very, very good approximation by simply ignoring all terms associated with \(n\)s larger than some reasonable value (like 50 or 100).&lt;/p&gt;
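&lt;p&gt;A minimal, self-contained sketch of that truncation (my own illustration of equation (3), truncating the sum at 100 tosses):&lt;/p&gt;

```python
import numpy as np

m_3 = np.array([[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]])
m_4 = np.array([[0.5, 0.5, 0.0, 0.0], [0.5, 0.0, 0.5, 0.0],
                [0.5, 0.0, 0.0, 0.5], [0.0, 0.0, 0.0, 1.0]])

p_n = np.array([1.0, 0.0, 0.0])          # P_0
q_prev = np.array([1.0, 0.0, 0.0, 0.0])  # Q_0

prob_a = 0.0
for n in range(1, 101):    # truncate the infinite sum at 100 tosses
    p_n = p_n @ m_3        # advance to P_n
    # P(A_n) = (1 - P_n[2]) * Q_{n-1}[2] * 1/2
    prob_a += (1.0 - p_n[2]) * q_prev[2] * 0.5
    q_prev = q_prev @ m_4  # advance to Q_n for the next iteration
print(prob_a)              # your probability of winning, roughly 0.21
```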

&lt;p&gt;Here is some python code that demonstrates &lt;a href=&quot;https://gist.github.com/ryu577/89e4ef0b0b7fcba33e08a549f0793c86&quot;&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;start1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;m_3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;start2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;m_4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],[.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# p_n must always be one toss ahead of q_n_minus_1. So, when p_n is at toss 1, q_n_minus_1 
# should be at 0. When it is at 2, q_n_minus_1 should be at 1 and so on.
# that is why p_n starts with 1 and goes to 100 while q_n_minus_1 starts with 0 and goes to 99.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p_n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;\
                                     &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;q_n_minus_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;\
                                     &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Prob(3 consec heads b4 2 consec heads):&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q_n_minus_1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p_n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Using this, we get that the probability we were after (you’ll get 3 consecutive heads before I get 2 consecutive heads) is &lt;strong&gt;21.25%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Similarly, by flipping things, we can get the probability that I’ll get 2 consecutive heads before you get 3 consecutive heads:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Assumes you have run the previous block
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p_n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;q_n_minus_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Prob(2 consec heads b4 3 consec heads):&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q_n_minus_1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p_n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This shows that the probability that I will win is &lt;strong&gt;73.98%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Of course, the two numbers above don’t sum to one since there is also the third possibility of a draw (both reach their goals on the same toss).&lt;/p&gt;
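&lt;p&gt;As a quick sanity check (the simulation helper below is mine, not part of the derivation above), a Monte Carlo run of the race, with each of us flipping our own fair coin once per round, reproduces all three numbers:&lt;/p&gt;

```python
import random

def simulate(trials=200_000, seed=7):
    """Simulate the race: you need 3 consecutive heads, I need 2.

    Each round, we both flip our own fair coin. Whoever completes their
    run on the earlier toss wins; finishing on the same toss is a draw.
    """
    rng = random.Random(seed)
    you_wins = me_wins = draws = 0
    for _ in range(trials):
        you_run = me_run = 0
        you_done = me_done = False
        while not (you_done or me_done):
            # extend each player's current run on heads, reset it on tails
            you_run = you_run + 1 if rng.random() < 0.5 else 0
            me_run = me_run + 1 if rng.random() < 0.5 else 0
            you_done, me_done = you_run == 3, me_run == 2
        if you_done and me_done:
            draws += 1
        elif you_done:
            you_wins += 1
        else:
            me_wins += 1
    return you_wins / trials, me_wins / trials, draws / trials

print(simulate())  # you ≈ 0.21, me ≈ 0.74, draw ≈ 0.05
```

&lt;p&gt;With 200,000 trials the estimates agree with the exact answers to about two decimal places.&lt;/p&gt;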

&lt;p&gt;We have now answered the question we posed at the start of this blog. But often, when one climbs to the top of a mountain, one 
simply sees more mountains on the other side. In that spirit, the next section (5) explores what other questions we can answer 
with the tools developed in seeking the answer to this one.&lt;/p&gt;

&lt;p&gt;Also, note that the approach of building a long sequence by repeated matrix multiplication is somewhat 
ugly and inefficient. In section 6, we will improve on this and solve equation (3) in a cleaner, more efficient manner.&lt;/p&gt;

&lt;h2 id=&quot;5-other-mountains&quot;&gt;5. Other mountains&lt;/h2&gt;

&lt;h3 id=&quot;51-even-the-odds&quot;&gt;5.1. Even the odds&lt;/h3&gt;
&lt;p&gt;In the question we started the blog with, it was clear that the odds were in my favor.
Let’s consider now a contest where that isn’t obvious. I still need to get to two consecutive heads and you still need three.
But to even the odds for you, your heads no longer need to be consecutive. As soon as you see three heads in total, you win.&lt;/p&gt;

&lt;p&gt;This turns out to be a pretty close contest. Who should a gambler bet their money on?&lt;/p&gt;

&lt;p&gt;We can solve this problem in much the same manner as the previous one: construct the two Markov chains, use them to calculate 
the sequences of probabilities of being in their various states after \(n\) tosses and plug those sequences into equation (3).&lt;/p&gt;

&lt;p&gt;My transition matrix will still be the \(M_3\) defined in section 2. Your transition matrix, however, will not be the \(M_4\) defined there.&lt;/p&gt;

&lt;p&gt;Now, since you need three heads in total, when you are in state \(i\), you don’t go back to zero if you get a tail on that toss. Instead, you simply stay 
in the state you were in. In this way, I’m penalized more severely for a tail (I have to start over and you don’t).&lt;/p&gt;

&lt;p&gt;Your new transition matrix, \(O_4\), will look like this:&lt;/p&gt;

\[O_4 = \left( \begin{array}{cccc}
		0.5 &amp;amp; 0.5 &amp;amp; 0 &amp;amp; 0 \\
		0 &amp;amp; 0.5 &amp;amp; 0.5 &amp;amp; 0\\
		0 &amp;amp; 0 &amp;amp; 0.5 &amp;amp; 0.5\\
		0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1
		\end{array} \right)\]

&lt;p&gt;And just like in section 4, we can solve this with the exact same code, just replacing your \(M_4\) matrix with \(O_4\) defined above.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;start1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;m_3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;start2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;o_4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# See previous code block in section 4 for detailed comments.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p_n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;q_n_minus_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Prob(3 running heads b4 2 consec heads):&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q_n_minus_1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p_n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We see that the probability of you getting 3 running heads before my 2 consecutive heads is &lt;strong&gt;36.74%&lt;/strong&gt;. Your chances of winning have improved 
quite a bit (from ~21% to ~37%), and there is more good news. When we flip things, we get the probability that I will win:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Assumes you have run the previous block
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p_n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;q_n_minus_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Prob(2 consec heads b4 3 running heads):&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q_n_minus_1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p_n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And this turns out to be &lt;strong&gt;54.77%&lt;/strong&gt;. So, my chances of winning have gone down from ~74% to ~55%. There is now a higher chance 
the game will end in a draw (1 - .55 - .37 ≈ 8%, up from ~5% earlier).&lt;/p&gt;

&lt;p&gt;So, your odds against me have gone from roughly 2:7 in the previous game to 2:3 in this one. That evens things out quite a bit, but you’re probably still 
not happy, so let’s see what else we can do.&lt;/p&gt;

&lt;h3 id=&quot;52-unfair-coins&quot;&gt;5.2. Unfair coins&lt;/h3&gt;
&lt;p&gt;In this section, we won’t settle for anything less than a completely even game. One way to do that is to go back to the old criterion (you need 
three consecutive heads while I need two). However, we’ll let you cheat and use a biased coin (one with a higher probability of heads than mine).&lt;/p&gt;

&lt;p&gt;So, while my coin still has a 50% chance of coming up heads, yours will be higher. Note that even if your coin has a 100% chance of coming up heads, your win 
is still not guaranteed. I could get heads on my first two tosses, and that would mean you lose. The probability I’ll win is therefore 25% (\(.5 \times .5\) 
for the two heads I need).&lt;/p&gt;

&lt;p&gt;The only way a draw can happen is if I get a tail on the first toss and then two consecutive heads. The probability of this is therefore 12.5% (\(.5\times
.5 \times .5\)).&lt;/p&gt;

&lt;p&gt;So, the probability that you’ll win is 1-.125-.25 = 62.5%.&lt;/p&gt;
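&lt;p&gt;This limiting case is small enough to enumerate exhaustively. As an illustration (the helper function below is hypothetical, written just for this check), we can list all eight equally likely outcomes of my first three tosses:&lt;/p&gt;

```python
from itertools import product

def my_finish_toss(tosses):
    """Toss number at which I first complete 2 consecutive heads, else None."""
    run = 0
    for t, c in enumerate(tosses, start=1):
        run = run + 1 if c == 'H' else 0
        if run == 2:
            return t
    return None

# With your coin always landing heads, you finish exactly on toss 3,
# so only my first three tosses matter (8 equally likely sequences).
i_win = draw = you_win = 0
for seq in product('HT', repeat=3):
    t = my_finish_toss(seq)
    if t is not None and t < 3:
        i_win += 1     # I finish strictly before toss 3
    elif t == 3:
        draw += 1      # we finish on the same toss
    else:
        you_win += 1

print(i_win / 8, draw / 8, you_win / 8)  # → 0.25 0.125 0.625
```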

&lt;p&gt;But that tips things too much in your favor. We want the game to be even. Also note that if your probability of winning is exactly 50%, you’re still 
at an advantage because of the possibility of a draw, which makes my probability of winning less than 50%. We want to find the probability of 
heads \(p\) your coin should have for neither of us to have an advantage.&lt;/p&gt;

&lt;p&gt;For this, we just need to wrap the code from the blocks above in a function that takes the probability of heads as input (instead of hard-coding it to 0.5). 
We can then use a simple numerical method like bisection to find the \(p\) at which our probabilities of winning are the same.&lt;/p&gt;
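&lt;p&gt;Here is a sketch of that idea (the function names are mine, and the horizon of 300 tosses is an assumption chosen so the truncation error is negligible): parameterize your chain by \(p\), reuse the summation from section 4, and bisect on the gap between the two winning probabilities.&lt;/p&gt;

```python
import numpy as np

def consec_chain(k, p):
    """Transition matrix for a player needing k consecutive heads, P(heads) = p."""
    m = np.zeros((k + 1, k + 1))
    for i in range(k):
        m[i, i + 1] = p      # heads: extend the run
        m[i, 0] = 1 - p      # tails: back to square one
    m[k, k] = 1.0            # absorbing state: goal reached
    return m

def win_probs(p, horizon=300):
    """(P(you win), P(I win)): you need 3 consec heads with coin p, I need 2, fair."""
    mine, yours = consec_chain(2, 0.5), consec_chain(3, p)
    # P(I'm absorbed after n tosses), n = 1..horizon
    me_done = np.array([np.linalg.matrix_power(mine, n)[0, 2] for n in range(1, horizon + 1)])
    # P(you are one head short after n tosses), n = 0..horizon-1
    you_pen = np.array([np.linalg.matrix_power(yours, n)[0, 2] for n in range(horizon)])
    you = np.sum(you_pen * (1 - me_done)) * p
    you_done = np.array([np.linalg.matrix_power(yours, n)[0, 3] for n in range(1, horizon + 1)])
    me_pen = np.array([np.linalg.matrix_power(mine, n)[0, 1] for n in range(horizon)])
    me = np.sum(me_pen * (1 - you_done)) * 0.5
    return you, me

# Bisect: your winning probability rises with p, mine falls.
lo, hi = 0.5, 1.0
for _ in range(40):
    mid = (lo + hi) / 2
    you, me = win_probs(mid)
    if you < me:
        lo = mid
    else:
        hi = mid
print(round((lo + hi) / 2, 3))
```

&lt;p&gt;At \(p = 0.5\) this reproduces the 21.25% and 73.98% from section 4, and at \(p = 1\) it reproduces the 62.5% and 25% worked out above.&lt;/p&gt;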

&lt;p&gt;I plotted these probabilities against \(p\) in the code &lt;a href=&quot;https://gist.github.com/ryu577/9464d39e1b40ffdd37773d44ffb81eab&quot;&gt;here&lt;/a&gt;. The result is:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/CompetitiveCoinToss/probs_with_p.png&quot; alt=&quot;Probability sequences&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can see from the plot that the two winning probabilities cross at \(p \sim 0.77\), at which point both of us have a \(\sim 45\%\) chance of winning.&lt;/p&gt;

&lt;h3 id=&quot;53-longer-sequences&quot;&gt;5.3. Longer sequences&lt;/h3&gt;

&lt;p&gt;Now that we’ve evened the odds in the previous sub-section (just make your coin have a 77% instead of a 50% chance of coming up heads), 
we can test the power of these new wings we’ve developed. What if we go back to both of us having fair coins 
and want the probability that I reach 4 consecutive heads before you reach 3? 
What if it were me reaching 5 before you reaching 4? In other words, you still require one more consecutive head than me, but we’re 
gradually increasing the number of heads I need. In general, what is the chance I’ll reach \(n\) consecutive heads before you reach \((n+1)\)?&lt;/p&gt;

&lt;p&gt;You can imagine now that the Markov matrices become larger and larger as we increase \(n\) (the smaller one alone has \((n+1)^2\) entries) 
and the chance that the game is resolved by the hundredth toss becomes smaller and smaller (two or three consecutive heads will almost surely 
have appeared by the 100th toss, but ten consecutive heads are far less likely). So, we need to extend the sequences to a much larger number 
of tosses.&lt;/p&gt;

&lt;p&gt;This makes the method we described earlier very slow. But, we can still use the faster method referenced earlier (will be described in section 6)
to get the exact solution for \(n\) going all the way up to the late teens.&lt;/p&gt;
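&lt;p&gt;Until then, one modest speed-up over the section-4 blocks (my own rearrangement, not the faster section-6 method) is to build the matrix powers incrementally, one toss at a time, instead of calling matrix_power afresh for every \(n\):&lt;/p&gt;

```python
import numpy as np

def consec_chain(k, p=0.5):
    """Transition matrix for a player who needs k consecutive heads."""
    m = np.zeros((k + 1, k + 1))
    for i in range(k):
        m[i, i + 1] = p      # heads: extend the run
        m[i, 0] = 1 - p      # tails: start over
    m[k, k] = 1.0            # absorbing state: goal reached
    return m

def prob_me_first(n, horizon=2000):
    """P(I reach n consecutive heads before you reach n+1), both coins fair."""
    mine, yours = consec_chain(n), consec_chain(n + 1)
    pm, py = np.eye(n + 1), np.eye(n + 2)   # both chains after 0 tosses
    total = 0.0
    for _ in range(horizon):
        one_short = pm[0, n - 1]            # I'm one head short before this toss
        pm, py = pm @ mine, py @ yours      # advance both chains by one toss
        total += one_short * 0.5 * (1 - py[0, n + 1])
    return total

print(prob_me_first(2))  # close to the 0.7398 computed in section 4
```

&lt;p&gt;This avoids recomputing each power from scratch, which matters once the horizon grows into the thousands.&lt;/p&gt;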

&lt;p&gt;In any case, we plot the probabilities of me winning, you winning and a draw in the figure below. If you want to use the more efficient method 
described in section 6, you can do so with the aid of an open source library that goes along with this blog.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# To install the library, pip install stochproc from command line.
# hosted at - https://github.com/ryu577/stochproc
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;stochproc.competitivecointoss.smallmarkov&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arange&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;win_probs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
 	&lt;span class=&quot;c1&quot;&gt;# The losing markov sequence of coin tosses that needs (n-1) heads.
&lt;/span&gt; 	&lt;span class=&quot;n&quot;&gt;lose_seq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MarkovSequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_consecutive_heads_mat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
 	&lt;span class=&quot;c1&quot;&gt;# The winning markov sequence of coin tosses that needs n heads.
&lt;/span&gt; 	&lt;span class=&quot;n&quot;&gt;win_seq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MarkovSequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_consecutive_heads_mat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
 	&lt;span class=&quot;c1&quot;&gt;# If you multiply the two sequence objects, you get the probability
&lt;/span&gt; 	&lt;span class=&quot;c1&quot;&gt;# that the first one beats the second one.
&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;win_prob&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;win_seq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lose_seq&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;win_probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;win_prob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;win_probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/CompetitiveCoinToss/probs_with_n.png&quot; alt=&quot;Probability sequences&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Looking at the figure, something surprising pops out.&lt;/p&gt;

&lt;p&gt;As we increase the number of tosses I need, the probability of you winning starts approaching 33.33% or one third.
Similarly, the probability of me winning starts approaching 66.67% or two thirds. It’s a little surprising that such simple numbers pop out.&lt;/p&gt;

&lt;p&gt;And where there are simple numbers, there are elegant reasons.&lt;/p&gt;

&lt;p&gt;Note that this game is like I have a coin that succeeds once in \(2^n\) flips (the chance of getting \(n\) heads in a row) and you have a coin that succeeds once in \(2^{n+1}\) flips, and we ask who succeeds first.&lt;/p&gt;

&lt;p&gt;What complicates matters is that I win ties. If \(n\) is large however, there will be very few ties.&lt;/p&gt;

&lt;p&gt;Also if \(n\) is large, the chance of a run of \(n\) heads is very small, \(2^{-n}\). We can view each flip as a try by you to start a run of \(n+1\) heads, which happens with probability \(2^{-(n+1)}\) and a try by me to start a run of \(n\) heads, which happens with probability \(2^{-n}\).&lt;/p&gt;

&lt;p&gt;Almost all attempts will fail for both of us. We can ignore those.&lt;/p&gt;

&lt;p&gt;The chance that you win is then \(\frac {2^{-(n+1)}}{2^{-n}+2^{-(n+1)}}= \frac{\frac 12}{1+ \frac12} = \frac 13\).&lt;/p&gt;

&lt;p&gt;The reason your probability starts even lower when \(n\) is small is that we might both start successful runs at the same time, at which point I win because my shorter run finishes first. For large values of \(n\), however, it becomes near impossible for us to succeed at the same time.&lt;/p&gt;
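If the hand-wavy argument leaves you uneasy, a quick Monte Carlo simulation of the actual game backs it up. This is just an illustrative sketch I'm adding (the helper `first_run` is my own name, not something from the library mentioned earlier):

```python
import random

def first_run(k, rng):
    """Number of fair-coin tosses until the first run of k consecutive heads completes."""
    streak, t = 0, 0
    while True:
        t += 1
        streak = streak + 1 if rng.random() < 0.5 else 0
        if streak == k:
            return t

rng = random.Random(42)
n, trials = 6, 5000
you_wins = 0
for _ in range(trials):
    mine = first_run(n, rng)       # I need n consecutive heads
    yours = first_run(n + 1, rng)  # you need n + 1, and lose ties
    if yours < mine:
        you_wins += 1
p_you = you_wins / trials
```

With \(n=6\), `p_you` already lands in the neighbourhood of one third.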

&lt;h2 id=&quot;6-closed-forming-the-markov-sequences&quot;&gt;6. Closed-forming the Markov sequences&lt;/h2&gt;

&lt;p&gt;In the previous section, the approximation of cutting off at 100 tosses worked pretty well. 
This is because there is a very high chance the game would have ended well before 100 tosses (either one of the players reaching their goal).&lt;/p&gt;

&lt;p&gt;What if, however, we wanted the probability that the first player gets
50 consecutive heads before the second one gets 51? The probability that this would get resolved within 100 tosses is quite low, so cutting off
at 100 would not give us a good approximation. And let’s say we needed to go to a million tosses to get a good approximation. Multiplying these 
large matrices millions of times would be so computationally expensive that we might not even be able to get the answer.&lt;/p&gt;

&lt;p&gt;We can make this method more scalable to questions about large numbers of tosses, and not dependent on multiplying matrices many times. 
But we’ll need some linear algebra - in particular, the concept of eigendecomposition - to do so.&lt;/p&gt;

&lt;p&gt;To follow this section, therefore, it is best if you’re somewhat familiar with eigenvalue decomposition. It is basically a way to x-ray matrices
and see what they’re all about. The idea is that for the matrices \(M_3\) and \(M_4\), you can find three- and four-dimensional vectors respectively
that don’t change when multiplied by them. And then there are other vectors that are also special: they change, but are only scaled, not rotated.&lt;/p&gt;

&lt;p&gt;If the matrix is \(n \times n\), you can find \(n\) such directions (including the one that wasn’t changed at all - i.e., was scaled by 1).&lt;/p&gt;

&lt;p&gt;This translates to being able to write:&lt;/p&gt;

\[\begin{equation}M_3 E = E \left( \begin{array}{ccc}
		1 &amp;amp; 0 &amp;amp; 0 \\
		0 &amp;amp; \lambda_1 &amp;amp; 0 \\
		0 &amp;amp; 0 &amp;amp; \lambda_2
		\end{array} \right) \tag{4}\end{equation}\]

&lt;p&gt;where the matrix \(E\) contains, as its columns, the special vectors that are scaled and not rotated.&lt;/p&gt;

&lt;p&gt;Now if we look at equations (1) and (2), we basically need to raise the Markov matrices to the power of \(n\), meaning multiply them with themselves over and
over.&lt;/p&gt;

&lt;p&gt;If we assume the matrix \(E\) from equation (4) is invertible, we can write (post-multiplying both sides by \(E^{-1}\)):&lt;/p&gt;

\[\begin{equation}M_3 = E \Lambda E^{-1}\tag{5}\end{equation}\]

&lt;p&gt;Where&lt;/p&gt;

\[\Lambda = \left( \begin{array}{ccc}
		1 &amp;amp; 0 &amp;amp; 0 \\
		0 &amp;amp; \lambda_1 &amp;amp; 0 \\
		0 &amp;amp; 0 &amp;amp; \lambda_2
		\end{array} \right)\]

&lt;p&gt;Now, if we want to square \(M_3\), we can use equation (5)&lt;/p&gt;

\[M_3^2 = E \Lambda E^{-1} E \Lambda E^{-1} = E \Lambda \Lambda E^{-1} = E \Lambda^2 E^{-1}\]

&lt;p&gt;We can repeat this \(n\) times to conclude:&lt;/p&gt;

\[M_3^n = E\Lambda^n E^{-1}\]

&lt;p&gt;Since \(\Lambda\) is diagonal, we can write:&lt;/p&gt;

\[\Lambda^n = \left( \begin{array}{ccc}
		1 &amp;amp; 0 &amp;amp; 0 \\
		0 &amp;amp; \lambda_1^n &amp;amp; 0 \\
		0 &amp;amp; 0 &amp;amp; \lambda_2^n
		\end{array} \right)\]
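To make this concrete, here is a small numpy check (my own sketch - I'm writing down \(M_3\) as the \(3 \times 3\) transition matrix of the two-consecutive-heads chain, with states for a streak of zero heads, a streak of one head, and the absorbing state):

```python
import numpy as np

# Assumed form of M_3: rows are from-states (no streak, one head, two heads reached).
M3 = np.array([[0.5, 0.5, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.0, 1.0]])

lam, E = np.linalg.eig(M3)   # eigenvalues and the matrix of eigenvectors
n = 20
# E Lambda^n E^{-1} should equal M_3 multiplied by itself n times.
Mn_eig = E @ np.diag(lam ** n) @ np.linalg.inv(E)
Mn_pow = np.linalg.matrix_power(M3, n)
```

For this matrix, the eigenvalues come out to \(1\), \(\phi/2 \approx 0.809\) and \((1-\phi)/2 \approx -0.309\) - none larger than one in magnitude.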

&lt;p&gt;Now from equation (1),&lt;/p&gt;

\[P_n = P_0 M_3^n = (1,0,0) E \Lambda^n E^{-1}\]

&lt;p&gt;If you multiply these out, this is equivalent to saying&lt;/p&gt;

\[\begin{equation}P_n = (1,\lambda_1^n, \lambda_2^n) C \tag{6}\end{equation}\]

&lt;p&gt;Where \(C\) is a constant matrix of coefficients given by:&lt;/p&gt;

\[C = \left( \begin{array}{ccc}
		E_{1,1} &amp;amp; 0 &amp;amp; 0 \\
		0 &amp;amp; E_{1,2} &amp;amp; 0 \\
		0 &amp;amp; 0 &amp;amp; E_{1,3}
		\end{array} \right) E^{-1}\]

&lt;p&gt;and \(E_{1,1}\), \(E_{1,2}\) and \(E_{1,3}\) are the elements of the first row of \(E\).&lt;/p&gt;

&lt;p&gt;In equation (3), we only need \(P_n[2]\) which will be given by:&lt;/p&gt;

\[P_n[2] = C_{1,3}+C_{2,3}\lambda_1^n + C_{3,3} \lambda_2^n\]

&lt;p&gt;You can see &lt;a href=&quot;https://yutsumura.com/eigenvalues-of-a-stochastic-matrix-is-always-less-than-or-equal-to-1/&quot;&gt;here&lt;/a&gt; a proof that the eigenvalues of a stochastic matrix are always at most one in magnitude. The non-unit eigenvalues \(\lambda_1\) and \(\lambda_2\) here are strictly less than one in magnitude,
which means the second and third terms of the expression above will tend to zero as \(n \to \infty\).&lt;/p&gt;

&lt;p&gt;Since the sequence must eventually end up in the absorbing state, we must have,&lt;/p&gt;

\[\lim_{n \to \infty} P_n[2] = C_{1,3} = 1\]

&lt;p&gt;Giving us,&lt;/p&gt;

\[\begin{equation}P_n[2] = 1 + C_{2,3}\lambda_1^n + C_{3,3}\lambda_2^n\tag{7}\end{equation}\]

&lt;p&gt;Similarly, we will have a corresponding matrix \(D\) for your \(Q_n\) sequence and we can write:&lt;/p&gt;

\[\begin{equation}Q_n[2] = 1 + D_{2,3}\mu_1^n+D_{3,3}\mu_2^n+D_{4,3}\mu_3^n\tag{8}\end{equation}\]

&lt;p&gt;Plugging (7) and (8) into equation (3) we get,&lt;/p&gt;

\[P(A) = \frac{1}{2}\sum_{n=1}^{\infty} (C_{2,3}\lambda_1^n + C_{3,3}\lambda_2^n)(1 + D_{2,3}\mu_1^{n-1}+D_{3,3}\mu_2^{n-1}+D_{4,3}\mu_3^{n-1})\]

\[=\frac{1}{2}\sum_{n=1}^{\infty} (C_{2,3}\lambda_1^n + C_{3,3}\lambda_2^n)(1 + \frac{D_{2,3}}{\mu_1}\mu_1^{n}+\frac{D_{3,3}}{\mu_2}\mu_2^{n}+\frac{D_{4,3}}{\mu_3}\mu_3^{n})\]

\[= \frac{1}{2}\left(C_{2,3}\left(\frac{\lambda_1}{1-\lambda_1}\right) + C_{2,3}\frac{D_{2,3}}{\mu_1}\left(\frac{\lambda_1\mu_1}{1-\lambda_1 \mu_1}\right)
+\dots\right)\]

&lt;p&gt;As long as we can construct the transition matrices for the two sequences of coin tosses and get their eigenvalues and eigenvector matrices, we can
get the probability that one of them will beat the other using the approach above.&lt;/p&gt;

&lt;h2 id=&quot;7-other-methods&quot;&gt;7. Other methods&lt;/h2&gt;
&lt;p&gt;I alluded earlier that once you climb a mountain, you’ll sometimes see other mountains on the other side. But there are probably multiple paths 
to the top of the mountain you’re currently on. And climbing via another path is probably a unique experience in itself.&lt;/p&gt;

&lt;p&gt;In this section, we’ll therefore consider some other methods to solve our original problem (what is the chance you get to three consecutive heads before
I get to two consecutive heads?).&lt;/p&gt;

&lt;h3 id=&quot;71-a-different-equation&quot;&gt;7.1. A “different” equation&lt;/h3&gt;

&lt;p&gt;There is a way to tame these processes coming from a completely different direction. By constructing a difference equation 
(see what I did there? Anyone? Anyone?). Ok, it might be too early for that pun if you haven’t heard of difference equations before.&lt;/p&gt;

&lt;p&gt;Here is how this goes - consider my sequence of tosses where I’m aiming for two consecutive heads.&lt;/p&gt;

&lt;p&gt;Let’s see if there is an alternate way to get the sequence of probabilities \(P_n\) used in equation (3). In particular, let \(a_n\)
be the probability that I’ll reach my goal on the \(n\)th toss.&lt;/p&gt;

&lt;p&gt;One thing we can say is that at the time when I reach my goal on the \(n\)th toss, the last two tosses I saw would have both been heads. 
Also, the third-from-final toss would have had to be a tails (otherwise, I would have won one toss earlier). The probability of this sequence
of THH is \(\frac{1}{2}\times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}\).&lt;/p&gt;

&lt;p&gt;Before these three final tosses, I must not already have won. The probability that I won on some earlier toss \(i\) is by definition \(a_i\),
so I need to exclude the events with probabilities \(a_{n-3}\), \(a_{n-4}\) and so on. This means:&lt;/p&gt;

\[a_n=\frac{1}{8}\left(1-\sum_{i=2}^{n-3} a_i\right) = \frac{1}{8}\left(1-\sum_{i=1}^{n-3} a_i\right)\]

&lt;p&gt;The second part of the equality follows since \(a_1 = 0\) (the probability of reaching two consecutive heads on the first toss is zero).&lt;/p&gt;

&lt;p&gt;Now let’s define:&lt;/p&gt;

\[b_n = \sum_{i=1}^{n} a_i\]

&lt;p&gt;which represents the probability I would have won by the \(n\)th toss.&lt;/p&gt;

&lt;p&gt;Plugging this equation into the previous one we get:&lt;/p&gt;

\[b_n-b_{n-1} = \frac{1}{8}(1-b_{n-3})\]

&lt;p&gt;This is what I meant by a “difference equation”: it expresses a relationship involving the difference of two consecutive terms of the series.&lt;/p&gt;

&lt;p&gt;Simplifying further:&lt;/p&gt;

&lt;p&gt;\begin{equation} b_n =  b_{n-1} - \frac{1}{8} b_{n-3} + \frac{1}{8} \tag{9}\end{equation}&lt;/p&gt;

&lt;p&gt;Now, we know \(b_n\) for the first few values of \(n\). When \(n=0\), there is no way I could be in my absorbing state of two consecutive heads. Hence, \(b_0 = 0\). Even after the first toss, there is no chance I would have tossed two heads. So, \(b_1=0\) as well. When \(n=2\), there is a chance I would have seen two consecutive heads: the first two tosses would both have to be heads, and the probability of this event is \(\frac{1}{2}\times\frac{1}{2} = 0.25\). So, \(b_2 = 0.25\).&lt;/p&gt;

&lt;p&gt;Now, we can use these and equation (9) to calculate the other terms of the sequence.&lt;/p&gt;

\[b_3 = b_2 - \frac{b_0}{8} + \frac{1}{8} = 0.25 + 0.125 = 0.375\]

\[b_4 = b_3 - \frac{b_1}{8} + \frac{1}{8} = 0.375 + 0.125 = 0.5\]

\[b_5 = b_4 - \frac{b_2}{8} + \frac{1}{8} = 0.5+\frac{0.75}{8} = 0.59375\]

\[\vdots\]

&lt;p&gt;You get the idea by now: we can extend this as far as we like to get any \(b_n\). This is an alternate way to get the sequence 
we calculated in section 3 (\(P_n[2]\)) and used in section 4 to get the answer.&lt;/p&gt;
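The iteration is only a few lines of Python (a minimal sketch of the recursion just described):

```python
# b_n = b_{n-1} - b_{n-3}/8 + 1/8, seeded with b_0 = b_1 = 0 and b_2 = 0.25.
b = [0.0, 0.0, 0.25]
for n in range(3, 101):
    b.append(b[n - 1] - b[n - 3] / 8 + 1 / 8)
```

The first few entries reproduce \(b_3 = 0.375\), \(b_4 = 0.5\) and \(b_5 = 0.59375\), and by \(n = 100\) the sequence has all but converged to 1.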

&lt;p&gt;But, we still need to patiently iterate all the way to \(n\) starting from \(n=3\). This is similar to the way we were patiently multiplying matrices in section 3 to get the sequence of probabilities. However, we found a way in section 6 to replace the iterative method for calculating the probability sequence with a more elegant closed form. Is there a way we can find a closed form for this difference equation as well?&lt;/p&gt;

&lt;h3 id=&quot;711-closed-form-for-the-difference-equation&quot;&gt;7.1.1. Closed form for the difference equation&lt;/h3&gt;

&lt;p&gt;The standard way to solve a difference equation like (9) is to separate into homogeneous and non-homogeneous parts, solve the homogeneous part using a polynomial guess and then make another guess for how the solution will need to be modified to make it compatible with the original, non-homogeneous equation.&lt;/p&gt;

&lt;p&gt;First, let’s clean up equation (9) a bit:&lt;/p&gt;

&lt;p&gt;\begin{equation}8b_{n}-8b_{n-1}+b_{n-3}=1\tag{10}\end{equation}&lt;/p&gt;

&lt;p&gt;The homogeneous part of this equation is given by:&lt;/p&gt;

&lt;p&gt;\begin{equation}8b_{n}-8b_{n-1}+b_{n-3}=0\tag{11}\end{equation}&lt;/p&gt;

&lt;p&gt;Here, we make an educated guess (an “&lt;a href=&quot;https://en.wikipedia.org/wiki/Ansatz&quot;&gt;ansatz&lt;/a&gt;”, as the Germans call it). We will make two of these guesses in this section, which can make some people feel uneasy. If that is so, skip to the next sub-section (7.1.2), where we solve the same equation in a ‘more systematic’ manner.&lt;/p&gt;

&lt;p&gt;We suspect the solution to this homogeneous equation might be:&lt;/p&gt;

\[b_n' = l^n\]

&lt;p&gt;Substituting this into equation (11), we get the characteristic polynomial:&lt;/p&gt;

\[8l^3-8l^2+1=0\]

&lt;p&gt;The roots of which are given by \(l=\frac{1-\sqrt{5}}{4}, \frac{1+\sqrt{5}}{4}, \frac{1}{2}\)&lt;/p&gt;

&lt;p&gt;Notice that the &lt;a href=&quot;https://en.wikipedia.org/wiki/Golden_ratio&quot;&gt;golden ratio&lt;/a&gt;, \(\phi\) is given by \(\phi = \frac{1+\sqrt 5}{2}\) which means we can express the roots of the characteristic polynomial of the homogeneous equation as:&lt;/p&gt;

\[l = \frac{\phi}{2}, \frac{1-\phi}{2}, \frac 1 2\]
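A quick numerical sanity check (my addition) that these really are the roots of the characteristic polynomial:

```python
import numpy as np

phi = (1 + np.sqrt(5)) / 2  # the golden ratio
roots = [phi / 2, (1 - phi) / 2, 1 / 2]
# Each root should make 8l^3 - 8l^2 + 1 vanish (up to floating point error).
residuals = [8 * l**3 - 8 * l**2 + 1 for l in roots]
```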

&lt;p&gt;Remember, we started off with the assumption that \(b_n'=l^n\). This means that the following will all work for \(b_n'\):&lt;/p&gt;

\[b_n' = \left(\frac{\phi}{2}\right)^n\]

\[b_n' = \left(\frac{1-\phi}{2}\right)^n\]

\[b_n' = \left(\frac{1}{2}\right)^n\]

&lt;p&gt;Meaning that any linear combination of these three solutions will also be a solution of equation (11). So in general,&lt;/p&gt;

&lt;p&gt;\begin{equation}b_n’ = c_1 \left(\frac{\phi}{2}\right)^n + c_2 \left(\frac{1-\phi}{2}\right)^n + c_3 \left(\frac {1} {2}\right)^n\tag{12}\end{equation}&lt;/p&gt;

&lt;p&gt;Now, how do we modify this solution of equation (11) above to get the solution of the original, non-homogeneous equation (equation (10))?&lt;/p&gt;

&lt;p&gt;We know that \(b_n'\) satisfies equation (11), so we can write:&lt;/p&gt;

\[8b_n'-8b_{n-1}'+b_{n-3}' = 0\]

&lt;p&gt;Let’s assume \(b_n' = b_n+c_0\)&lt;/p&gt;

&lt;p&gt;So the equation above becomes:&lt;/p&gt;

\[8(b_n+c_0)-8(b_{n-1}+c_0)+(b_{n-3}+c_0) = 0\]

\[=&amp;gt;8b_n-8b_{n-1}+b_{n-3} = -c_0\]

&lt;p&gt;To make this align with equation (10) we require \(c_0=-1\)&lt;/p&gt;

&lt;p&gt;And this would mean \(b_n = b_n'-c_0 = b_n'+1\)&lt;/p&gt;

&lt;p&gt;So, the general solution (\(b_n\)) of equation (10) can be written using this result and (12):&lt;/p&gt;

&lt;p&gt;\begin{equation}b_n= c_1 \left(\frac{\phi}{2}\right)^n + c_2 \left(\frac{1-\phi}{2}\right)^n + c_3 \left(\frac {1} {2}\right)^n + 1\tag{13}\end{equation}&lt;/p&gt;

&lt;p&gt;And for the two consecutive heads problem, we can use the first few values in the sequence \(b_n\) in particular, \(b_0=0\), \(b_1=0\), \(b_2=0.25\) and \(b_3=0.375\) to get the \(c_i\)’s. This is accomplished with the following python code:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;phi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.618033988749895&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;## The three roots
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;phi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;phi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;    &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;l2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;l3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;## The first four b_n's
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;25&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;375&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solve&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# cc is the vector of coefficients [c_0,c_1,c_2,c_3] = [1., -1.17082039, .170820393, 0]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
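As a sanity check (my addition), we can confirm that equation (13), with the coefficients recovered this way, reproduces the recursion of equation (9):

```python
import numpy as np

phi = (1 + np.sqrt(5)) / 2
l1, l2, l3 = phi / 2, (1 - phi) / 2, 0.5

# Solve for [c_0, c_1, c_2, c_3] from b_0..b_3, exactly as in the snippet above.
mat = np.array([[1, 1, 1, 1],
                [1, l1, l2, l3],
                [1, l1**2, l2**2, l3**2],
                [1, l1**3, l2**3, l3**3]])
cc = np.linalg.solve(mat, np.array([0.0, 0.0, 0.25, 0.375]))

def b_closed(n):
    # Equation (13): b_n = c_0 + c_1 (phi/2)^n + c_2 ((1-phi)/2)^n + c_3 (1/2)^n
    return cc[0] + cc[1] * l1**n + cc[2] * l2**n + cc[3] * l3**n

# The recursion from equation (9) for comparison.
b = [0.0, 0.0, 0.25]
for n in range(3, 31):
    b.append(b[n - 1] - b[n - 3] / 8 + 1 / 8)
```

The solve recovers \(c_0 = 1\), \(c_1 \approx -1.1708\), \(c_2 \approx 0.1708\) and \(c_3 = 0\), matching the comment in the snippet above.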

&lt;h3 id=&quot;712-closed-form-for-difference-equation-using-matrix-approach&quot;&gt;7.1.2. Closed form for difference equation using matrix approach&lt;/h3&gt;

&lt;p&gt;If you were happy with the solution presented in 7.1.1, you can skip this one.&lt;/p&gt;

&lt;p&gt;But some people feel like that approach requires a lot of guessing. How am I supposed to make all the right guesses, after all? So, let’s explore another method to solve the difference equation, based on the eigenvalues of a matrix just like the method in section 6 was.&lt;/p&gt;

&lt;p&gt;To get a matrix, we need a system of equations while we have only one (eqn 9). To create a system of equations, let’s just add two more dummy equations.&lt;/p&gt;

\[b_{n-1}=b_{n-1}\]

\[b_{n-2}=b_{n-2}\]

&lt;p&gt;Now, we can combine the above two equations with equation (9) and express this system in matrix form.&lt;/p&gt;

\[\left( \begin{array}{c}
		b_n \\
		b_{n-1}\\
		b_{n-2}\\
		\end{array} \right) = \left( \begin{array}{ccc}
		1 &amp;amp; 0 &amp;amp; -\frac{1}{8} \\
		1 &amp;amp; 0 &amp;amp; 0\\
		0 &amp;amp; 1 &amp;amp; 0\\
		\end{array} \right) \left( \begin{array}{c}
		b_{n-1} \\
		b_{n-2}\\
		b_{n-3}\\
		\end{array} \right) + \left( \begin{array}{c}
		\frac{1}{8} \\
		0\\
		0\\
		\end{array} \right)\]

&lt;p&gt;Now let \(\beta_n = \left(\begin{array}{c}
		b_n \\
		b_{n-1}\\
		b_{n-2}\\
		\end{array} \right)\), \(\gamma = \left(\begin{array}{c}
		\frac{1}{8} \\
		0\\
		0\\
		\end{array} \right)\) and \(M = \left( \begin{array}{ccc}
		1 &amp;amp; 0 &amp;amp; -\frac{1}{8} \\
		1 &amp;amp; 0 &amp;amp; 0\\
		0 &amp;amp; 1 &amp;amp; 0\\
		\end{array} \right)\).&lt;/p&gt;

&lt;p&gt;We then get:&lt;/p&gt;

\[\beta_n = M \beta_{n-1} + \gamma\]

\[=M(M \beta_{n-2}+\gamma) + \gamma\]

\[=M^2 \beta_{n-2} + (I+M)\gamma\]

\[=M^3 \beta_{n-3} + (I+M+M^2)\gamma\]

\[\vdots\]

&lt;p&gt;And repeating this \((n-2)\) times we get,&lt;/p&gt;

\[\beta_n = M^{n-2}\beta_{2} + (I+M+M^2+ \dots + M^{n-3})\gamma\]
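Before going further, we can verify this unrolled form numerically (a sketch of mine, with \(\beta_2 = (b_2, b_1, b_0)^T = (0.25, 0, 0)^T\)):

```python
import numpy as np

M = np.array([[1.0, 0.0, -1 / 8],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
gamma = np.array([1 / 8, 0.0, 0.0])
beta2 = np.array([0.25, 0.0, 0.0])  # beta_2 = (b_2, b_1, b_0)

n = 20
# Direct iteration: apply beta_k = M beta_{k-1} + gamma, (n - 2) times.
beta = beta2.copy()
for _ in range(n - 2):
    beta = M @ beta + gamma

# Unrolled form: beta_n = M^{n-2} beta_2 + (I + M + ... + M^{n-3}) gamma.
geom = sum(np.linalg.matrix_power(M, k) for k in range(n - 2))
beta_closed = np.linalg.matrix_power(M, n - 2) @ beta2 + geom @ gamma
```

Both routes give the same \(\beta_{20}\), whose first component is \(b_{20} \approx 0.983\).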

&lt;p&gt;Now, assuming \(M\) is diagonalizable (which it is) we can say:&lt;/p&gt;

\[M=E\Lambda E^{-1}\]

&lt;p&gt;And this would imply:&lt;/p&gt;

\[M^n = E \Lambda^n E^{-1}\]

&lt;p&gt;So we get:&lt;/p&gt;

&lt;p&gt;\begin{equation}\beta_n = E\Lambda^{n-2}E^{-1}\beta_2 + E(I+\Lambda+\Lambda^2+\dots+\Lambda^{n-3})E^{-1}\gamma \tag{14}\end{equation}&lt;/p&gt;

&lt;p&gt;Now, if \(\lambda_1\), \(\lambda_2\) and \(\lambda_3\) happen to be the eigenvalues of \(M\), then&lt;/p&gt;

\[\Lambda = \left( \begin{array}{ccc}
		\lambda_1 &amp;amp; 0 &amp;amp; 0 \\
		0 &amp;amp; \lambda_2 &amp;amp; 0 \\
		0 &amp;amp; 0 &amp;amp; \lambda_3 \\
		\end{array} \right)\]

&lt;p&gt;and,&lt;/p&gt;

\[\Lambda \Lambda = \Lambda^2 = \left( \begin{array}{ccc}
		\lambda_1^2 &amp;amp; 0 &amp;amp; 0 \\
		0 &amp;amp; \lambda_2^2 &amp;amp; 0 \\
		0 &amp;amp; 0 &amp;amp; \lambda_3^2 \\
		\end{array} \right)\]

&lt;p&gt;meaning \(\Lambda\) stays diagonal no matter how many times we multiply it by itself.&lt;/p&gt;

&lt;p&gt;Remember, we are interested in the first element of \(\beta_n\) which is \(b_n\) and can be extracted by taking a dot product with a vector that has a 1 in the first position and zero at other positions.&lt;/p&gt;

\[b_n = \beta_n^T \left(\begin{array}{c}1 \\ 0 \\ 0\\\end{array} \right)\]

&lt;p&gt;Now, using the same arguments as in section 6, we can say that the first component of the first term on the R.H.S of equation (14) can be written as:&lt;/p&gt;

\[(E \Lambda^{n-2} E^{-1} \beta_2)^T \left(\begin{array}{c}1 \\ 0 \\ 0\\\end{array} \right) = l_1 \lambda_1^{n-2}+l_2 \lambda_2^{n-2} + l_3 \lambda_3^{n-2} \tag{15}\]

&lt;p&gt;Now to tackle the second term of equation (14). Observe that:&lt;/p&gt;

\[(I+\Lambda+\Lambda^2 + \dots + \Lambda^{n-3}) = 
		\left( \begin{array}{ccc}
			1+\lambda_1+\dots+\lambda_1^{n-3} &amp;amp; 0 &amp;amp; 0 \\
			0 &amp;amp; 1+\lambda_2+\dots+\lambda_2^{n-3} &amp;amp; 0 \\
			0 &amp;amp; 0 &amp;amp; 1+\lambda_3+\dots+\lambda_3^{n-3} \\
		\end{array} \right)\]

\[= \left( \begin{array}{ccc}
			\frac{1-\lambda_1^{n-2}}{1-\lambda_1} &amp;amp; 0 &amp;amp; 0 \\
			0 &amp;amp; \frac{1-\lambda_2^{n-2}}{1-\lambda_2} &amp;amp; 0 \\
			0 &amp;amp; 0 &amp;amp; \frac{1-\lambda_3^{n-2}}{1-\lambda_3} \\
		\end{array} \right)\]

&lt;p&gt;Here, we used the geometric series result:&lt;/p&gt;

\[1+\lambda + \lambda^2 + \dots + \lambda^{n-1} = \frac{1-\lambda^n}{1-\lambda}\]

&lt;p&gt;Again, we see a similar pattern to the one utilized in equation (15) and so we can say:&lt;/p&gt;

\[(E(I+\Lambda+\Lambda^2+\dots+\Lambda^{n-3})E^{-1}\gamma)^T \left(\begin{array}{c}1 \\ 0 \\ 0\\\end{array} \right) = 
\left(\begin{array}{ccc}1&amp;amp;0&amp;amp;0\end{array}\right)
E \left( \begin{array}{ccc}
			\frac{1-\lambda_1^{n-2}}{1-\lambda_1} &amp;amp; 0 &amp;amp; 0 \\
			0 &amp;amp; \frac{1-\lambda_2^{n-2}}{1-\lambda_2} &amp;amp; 0 \\
			0 &amp;amp; 0 &amp;amp; \frac{1-\lambda_3^{n-2}}{1-\lambda_3} \\
		\end{array} \right)E^{-1}\gamma\]

\[= m_1 \frac{1-\lambda_1^{n-2}}{1-\lambda_1} + m_2 \frac{1-\lambda_2^{n-2}}{1-\lambda_2} + m_3 \frac{1-\lambda_3^{n-2}}{1-\lambda_3}\]

&lt;p&gt;meaning,&lt;/p&gt;

\[(E(I+\Lambda+\Lambda^2+\dots+\Lambda^{n-3})E^{-1}\gamma)^T \left(\begin{array}{c}1 \\ 0 \\ 0\\\end{array} \right)\]

\[= 
\left(\frac{m_1}{1-\lambda_1} + \frac{m_2}{1-\lambda_2} + \frac{m_3}{1-\lambda_3}\right)
- \left(\frac{m_1}{1-\lambda_1} \lambda_1^{n-2} + \frac{m_2}{1-\lambda_2} \lambda_2^{n-2} + \frac{m_3}{1-\lambda_3} \lambda_3^{n-2}\right)\]

\[= d_0 + d_1 \lambda_1^{n-2} + d_2 \lambda_2^{n-2} + d_3 \lambda_3^{n-2} \tag{16}\]

&lt;p&gt;Adding equations (15) and (16), we get the first element of the L.H.S of equation (14), which is \(b_n\).&lt;/p&gt;

\[b_n = \beta_n^T\left(\begin{array}{c}1\\0\\0\end{array}\right)\]

\[= (E\Lambda^{n-2}E^{-1}\beta_2 + E(I+\Lambda+\Lambda^2+\dots+\Lambda^{n-3})E^{-1}\gamma)^T \left(\begin{array}{c}1\\0\\0\end{array}\right)\]

\[= d_0 + (d_1+l_1)\lambda_1^{n-2} + (d_2+l_2) \lambda_2^{n-2} + (d_3+l_3) \lambda_3^{n-2}\]

&lt;p&gt;Now, redefining \(c_0 = d_0\), \(c_1 = \frac{d_1+l_1}{\lambda_1^2}\) and so on, we get:&lt;/p&gt;

\[b_n = c_0 + c_1 \lambda_1^n + c_2 \lambda_2^n + c_3 \lambda_3^n\]

&lt;p&gt;Now, the eigenvalues of \(M\) are \(\lambda_1, \lambda_2, \lambda_3 = \frac{\phi}{2}\), \(\frac{1-\phi}{2}\), \(\frac 1 2\), which are the same as the roots of the characteristic polynomial in the previous sub-section.&lt;/p&gt;

&lt;p&gt;Substituting this into the previous equation we get equation (13) through this alternate linear algebra based route.&lt;/p&gt;

&lt;h3 id=&quot;72-ill-give-you-the-answer-on-one-condition&quot;&gt;7.2. I’ll give you the answer on one condition&lt;/h3&gt;

&lt;p&gt;We started this blog with a simple-to-understand problem. The solution provided involved Markov chains and thinking in terms of states.
That solution was like a silver bullet for a whole family of problems of this nature. In this section, we will pursue a simpler solution that involves just conditional probabilities.&lt;/p&gt;

&lt;p&gt;Let’s define \(A\) as the event that you win.&lt;/p&gt;

&lt;p&gt;Now, instead of thinking about the entire sequences of tosses, let’s think of just the very first toss for both sequences (the top one is yours and the bottom one is mine). Conditioning on these first two tosses we get,&lt;/p&gt;

\[P(A) = P\left(\begin{array}{c}
		 H \\
		 H 
		\end{array} \right) P\left(A| \begin{array}{c}
		 H \\
		 H 
		\end{array} \right) + P\left(\begin{array}{c}
		 H \\
		 T 
		\end{array} \right) P\left(A| \begin{array}{c}
		 H \\
		 T
		\end{array} \right)\]

\[+P\left(\begin{array}{c}
		 T \\
		 H  
		\end{array} \right) P\left(A| \begin{array}{c}
		 T  \\
		 H  
		\end{array} \right) + P\left(\begin{array}{c}
		 T \\
		 T 
		\end{array} \right) P\left(A| \begin{array}{c}
		 T \\
		 T 
		\end{array} \right)\]

&lt;p&gt;Now, since all tosses are independent,&lt;/p&gt;

\[P\left(\begin{array}{c}
		 T \\
		 H 
		\end{array} \right) = P\left(\begin{array}{c}
		 H \\
		 T 
		\end{array} \right) = P\left(\begin{array}{c}
		 H \\
		 H 
		\end{array} \right) = P\left(\begin{array}{c}
		 T \\
		 T 
		\end{array} \right) = \frac 1 4\]

&lt;p&gt;Which means that:&lt;/p&gt;

\[P(A) = \frac 1 4 P\left(A| \begin{array}{c}
		 H  \\
		 H  
		\end{array} \right) + \frac 1 4 P\left(A| \begin{array}{c}
		 H \\
		 T
		\end{array} \right)\]

\[+\frac 1 4 P\left(A| \begin{array}{c}
		 T \\
		 H 
		\end{array} \right) + \frac 1 4 P\left(A| \begin{array}{c}
		 T \\
		 T 
		\end{array} \right)\]

&lt;p&gt;Now, if both of us get tails on our first tosses, that’s essentially the same as restarting the game. We might as well throw those tosses out and start from scratch, since neither of us is any closer to their goal.&lt;/p&gt;

&lt;p&gt;Therefore, it’s easy to see that:&lt;/p&gt;

\[P\left(A| \begin{array}{c}
		 T \\
		 T  
		\end{array} \right)  = P(A)\]

&lt;p&gt;Substituting into the previous equation we get:&lt;/p&gt;

\[P(A) = \frac 1 3 \left(P\left(A| \begin{array}{c}
		 H \\
		 H 
		\end{array} \right) +  P\left(A| \begin{array}{c}
		 H \\
		 T
		\end{array} \right) +  P\left(A| \begin{array}{c}
		 T \\
		 H 
		\end{array} \right) \right) \tag{17}\]

&lt;p&gt;Now let’s tackle the next term in order of difficulty, \(P\left(A| \begin{array}{c}
		 T \\
		 H 
		\end{array} \right)\). Let’s consider the various possibilities.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If I get a heads on the next toss, the game is over and I’ve won. So, I &lt;em&gt;have&lt;/em&gt; to get a tails on my second toss for you to win (i.e., for event \(A\) to occur). The probability of this is \(\frac 1 2\). The question then becomes: what did you get on your second toss?
    &lt;ul&gt;
      &lt;li&gt;Now if you get a tails on the second toss (an event with probability \(\frac 1 2\)), the game would have been reset and the probability of you winning from there is \(P(A)\).&lt;/li&gt;
      &lt;li&gt;If you get a heads on your next toss, then we can throw away the first tosses. The probability you will win from here becomes \(P\left(A| \begin{array}{c}
   H \\
   T 
  \end{array} \right)\)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Putting all this together, we get:&lt;/p&gt;

\[P\left(A| \begin{array}{c}
		 T \\
		 H 
		\end{array} \right) = \frac 1 2 \left[ \frac{1}{2} P(A) + \frac 1 2 P\left(A| \begin{array}{c}
		 H \\
		 T 
		\end{array} \right)  \right] \tag{18}\]

&lt;p&gt;Next, let’s tackle \(P\left(A| \begin{array}{c}
		 H \\
		 H 
		\end{array} \right)\).&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If I get a heads on the second toss, the game is over since my first toss was already a heads. So, we don’t have to consider any possibility where that happens; our second tosses can either be a heads for you and a tails for me, or tails for both of us.
    &lt;ul&gt;
      &lt;li&gt;If our second tosses result in a heads for you and a tails for me, the probability of this is \(\frac 1 4\).
        &lt;ul&gt;
          &lt;li&gt;If you then score a heads on the third toss as well (probability \(\frac 1 2\)), you win and it doesn’t matter what I got.&lt;/li&gt;
          &lt;li&gt;If you score a tails (probability \(\frac 1 2\)), things get more interesting.
            &lt;ul&gt;
              &lt;li&gt;If I score a heads (probability \(\frac 1 2\)), it’s effectively the same as me being on one heads and you being on a tails, which is \(P\left(A| \begin{array}{c}
   		T \\
   			H 
  \end{array} \right)\).&lt;/li&gt;
              &lt;li&gt;If I score a tails (probability \(\frac 1 2\)), then we both scored tails on our third tosses and any time that happens the game resets and the probability of you winning again becomes \(P(A)\).&lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;If I get a tails on the second toss (probability \(\frac 1 2\)), it means we now both have tails and the game has reset. So the probability from here is \(P(A)\).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Putting all this together we have:&lt;/p&gt;

\[P\left(A| \begin{array}{c}
		 H \\
		 H 
		\end{array} \right) = \frac 1 4 \left( \frac 1 2 + \frac 1 2 \left( \frac 1 2 P\left(A| \begin{array}{c}
		 T \\
		 H 
		\end{array} \right) + \frac 1 2 P(A)\right)\right) + \frac 1 4 P(A)\tag{19}\]

&lt;p&gt;And using similar reasoning we can get:&lt;/p&gt;

\[P\left(A| \begin{array}{c}
		 H \\
		 T 
		\end{array} \right) = \frac 1 4 \left( \frac 1 2 + \frac 1 2 \left(\frac 1 2 P(A) + \frac 1 2 P\left(A| \begin{array}{c}
		 T \\
		 H 
		\end{array} \right)\right) \right)\]

\[+ \frac 1 4 \frac 1 2 \left(\frac 1 2 + \frac 1 2 P(A)\right) + \frac 1 4 P\left(A| \begin{array}{c}
		 T \\
		 H 
		\end{array} \right) + \frac 1 4 P(A)\tag{20}\]

&lt;p&gt;Now, equations (17), (18), (19) and (20) are four equations in four unknowns and they can be solved to get \(P(A)\), which happens to be one of those unknowns.&lt;/p&gt;
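&lt;p&gt;As a quick sanity check (a sketch, not part of the original derivation), we can hand the four equations to a numerical solver. The names \(p, x, y, z\) are just my shorthand for \(P(A)\), \(P(A|H,H)\), \(P(A|H,T)\) and \(P(A|T,H)\):&lt;/p&gt;

```python
import numpy as np

# Unknowns, in order: p = P(A), x = P(A|H,H), y = P(A|H,T), z = P(A|T,H).
# Each row is one of equations (17)-(20), rearranged to A @ [p,x,y,z] = b.
A = np.array([
    [1.0, -1/3, -1/3, -1/3],   # (17): p = (x + y + z)/3
    [-5/16, 1.0, 0.0, -1/16],  # (19): x = 1/8 + 5p/16 + z/16
    [-3/8, 0.0, 1.0, -5/16],   # (20): y = 3/16 + 3p/8 + 5z/16
    [-1/4, 0.0, -1/4, 1.0],    # (18): z = p/4 + y/4
])
b = np.array([0.0, 1/8, 3/16, 0.0])
p, x, y, z = np.linalg.solve(A, b)
print(f"P(A) = {p:.6f}")
```

If I have set the system up correctly, \(P(A)\) comes out to \(361/1699 \approx 0.2125\), which should match the Markov-chain computation below.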

&lt;h2 id=&quot;73-one-big-chain&quot;&gt;7.3. One big chain&lt;/h2&gt;

&lt;p&gt;The original problem stated in this blog (you get three consecutive heads before I get two consecutive heads) was solved using Markov chains.&lt;/p&gt;

&lt;p&gt;Is the fact that there exists another method to solve this same problem, also relying completely on Markov chains
but having very little to do with the one discussed in section 4, a testament to the versatility and power of Markov chains?&lt;/p&gt;

&lt;p&gt;Let’s go over this new method before we decide.&lt;/p&gt;

&lt;p&gt;Here, instead of keeping track of the two coin toss sequences independently of each other, we define a single state for the whole game.&lt;/p&gt;

&lt;p&gt;This state is a pair of numbers: the number of consecutive heads seen so far by you and by me (first number, second number).&lt;/p&gt;

&lt;p&gt;Before any of us tosses our coin, the state is of course \((0,0)\). After the first toss, if you get a heads and I get a tails, 
the state will be \((1,0)\); if both of us get heads, it will be \((1,1)\); and so on.&lt;/p&gt;

&lt;p&gt;As with the Markov chains in the method described in section 4, we can’t let this one have an unbounded number of states. We need
to bound its state space by defining the states that lead to a victory for either you or me as absorbing states. For example, it will
be impossible to get to state \((4,1)\), since you would have won and stopped as soon as you got three consecutive heads, never reaching four.&lt;/p&gt;

&lt;p&gt;This makes \((3,0)\) and \((3,1)\) absorbing states that result in your victory, while \((0,2)\), \((1,2)\), \((2,2)\) and even \((3,2)\) are absorbing states resulting in my victory (you need to get 3 consecutive heads &lt;em&gt;before&lt;/em&gt; I get two consecutive heads).&lt;/p&gt;

&lt;p&gt;So, the possible states are (we might as well express them in Python code):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;index2state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that in the first six states, neither of us has won. These are called transient states, since the chain eventually leaves them for other states.&lt;/p&gt;

&lt;p&gt;In the last six states, one of us has won and once the game reaches them, it stays in those states forever (since it concluded as soon as it reached them). Those states are called recurrent states.&lt;/p&gt;

&lt;p&gt;Of the six recurrent states ((0,2), (1,2), (2,2), (3,0), (3,1) and (3,2)), (3,0) and (3,1) are the only ones where you win.&lt;/p&gt;

&lt;p&gt;Now, the rules of the game obviously dictate some transition matrix between the states mentioned above.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Any time the game is in state (i,j), if (i,j) is a transient state:
    &lt;ul&gt;
      &lt;li&gt;It will transition to (0,0) if both of us get tails (probability \(\frac 1 4\)).&lt;/li&gt;
      &lt;li&gt;It will transition to (i+1,0) if you get a heads but I get a tails (probability \(\frac 1 4\)).&lt;/li&gt;
      &lt;li&gt;It will transition to (0,j+1) if you get a tails and I get a heads (probability \(\frac 1 4\)).&lt;/li&gt;
      &lt;li&gt;It will transition to (i+1,j+1) if both of us get heads (probability \(\frac 1 4\)).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Any time the game is in an absorbing state, it stays in the absorbing state with probability 1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The matrix these rules correspond to is shown in the figure below. The transient states are represented in black, the absorbing states where you lose are represented in red and the ones where you win in green.&lt;/p&gt;

&lt;p&gt;The larger matrix is divided into four sections. The top-left section is the sub-matrix for transitions between transient states; we call it \(Q\). The top-right is the sub-matrix for transitions from transient to recurrent states; we call it \(R\). The bottom-right covers recurrent-to-recurrent transitions; since recurrent states stay in the same state and don’t transition anywhere else, it is simply an identity matrix (and the bottom-left, recurrent-to-transient, is all zeros).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/CompetitiveCoinToss/BigMatrix.png&quot; alt=&quot;Probability sequences&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now, if we can find the probabilities that the game ends in each of the recurrent states, we can use those to find the probability you’ll win, since the recurrent states corresponding to your victory are (3,0) and (3,1).&lt;/p&gt;

&lt;p&gt;Given that we start in transient state \(i\), the probabilities that the game ends up in each of the absorbing states are given by the \(i\)th row of the matrix:&lt;/p&gt;

\[U = (I-Q)^{-1}R \tag{21}\]

&lt;p&gt;To see why, condition the probability \(u_{ij}\) of being absorbed from transient state \(i\) into absorbing state \(j\) on what happens in the first transition: either absorption happens immediately, with probability \(R_{ij}\), or the chain moves to some other transient state \(k\) with probability \(Q_{ik}\) and is eventually absorbed into \(j\) from there, with probability \(u_{kj}\). In matrix form, this reads&lt;/p&gt;

\[u = Qu + R\]

&lt;p&gt;and solving for \(u\) yields equation (21).&lt;/p&gt;

&lt;p&gt;Here is some Python code that generates the transition matrix \(M\) shown in the figure above, splits it into \(Q\) and \(R\), and then uses them to obtain \(U\) as per equation (21).&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zeros&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;state2index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;index2state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.25&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state2index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index2state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.25&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state2index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index2state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.25&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state2index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index2state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index2state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.25&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eye&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;q&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solve&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eye&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Probability you win:&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
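&lt;p&gt;As an independent cross-check (my own addition, not from the chain above), a short Monte Carlo simulation of the game should agree with \(u[0,3]+u[0,4]\) to within sampling noise:&lt;/p&gt;

```python
import random

def p_you_win(trials=200_000, seed=0):
    """Monte Carlo estimate: probability you get three consecutive heads
    before I get two, with simultaneous fair-coin tosses. Reaching (3,2)
    on the same toss counts as my win, as described in the text."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        you = me = 0  # current runs of consecutive heads
        while True:
            you = you + 1 if rng.random() < 0.5 else 0
            me = me + 1 if rng.random() < 0.5 else 0
            if me == 2:       # checked first, so (3,2) counts as my win
                break
            if you == 3:
                wins += 1
                break
    return wins / trials

print(p_you_win())
```

The estimate should land within simulation noise of the matrix answer, around 0.21.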

&lt;!---
$$
M =
\left( \begin{array}{cccccccccccc}
		0.25 &amp; 0.25 &amp; 0.25 &amp; 0. &amp;  0.25 &amp; 0.25 &amp; 0.25 &amp; 0. &amp; 0. &amp; 0. &amp; 0. &amp; 1. \\
		0.25 &amp; 0. &amp; 0. &amp; 0. &amp;  0.25 &amp; 0. &amp;  0. &amp; 0. &amp; 0. &amp; 0. &amp;  0. &amp; 0. \\
		0. &amp;  0.25 &amp; 0. &amp;  0. &amp;  0. &amp;  0.25 &amp; 0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. \\
		0. &amp;  0. &amp;  0.25 &amp;  1. &amp;  0. &amp;  0. &amp;  0.25 &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. \\
		0.25 &amp; 0.25 &amp; 0.25 &amp; 0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. \\
		0.25 &amp; 0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. \\
		0. &amp;  0.25 &amp; 0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. &amp;  0. \\
		0. &amp;  0. &amp;  0.25 &amp; 0. &amp;  0. &amp;  0. &amp;  0. &amp;  1. &amp;  0. &amp;  0. &amp;  0. &amp;  0. \\
		0. &amp;  0. &amp;  0. &amp;  0. &amp;  0.25 &amp; 0.25 &amp; 0.25 &amp; 0. &amp;  1. &amp;  0. &amp;  0. &amp;  0. \\ 
		0. &amp; 0. &amp; 0. &amp; 0. &amp; 0.25 &amp; 0. &amp; 0. &amp; 0. &amp; 0. &amp; 1. &amp; 0. &amp; 0. \\
 		0. &amp; 0. &amp; 0. &amp; 0. &amp; 0. &amp; 0.25 &amp; 0. &amp; 0. &amp; 0. &amp; 0. &amp; 1. &amp; 0.  \\
        0. &amp; 0. &amp; 0. &amp; 0. &amp; 0. &amp; 0. &amp; 0.25 &amp; 0. &amp; 0. &amp; 0. &amp; 0. &amp; 0. \\
		\end{array} \right)
$$

$$
\left( \begin{array}{ccc}
	\underline{-}| &amp; \underline{a} &amp; \underline{b}  \\
	a| &amp; 0 &amp; 1 \\
	b| &amp; 0 &amp; 1 \\
\end{array} \right)
$$

## 7.4. Without a computer
**Note: This section is under construction**

Let's say this question was posed to you while you were on vacation, sitting somewhere on a beach. Ghastly as that prospect is,
would you have some means of solving it with an advanced stick to draw in the sand? Or is reaching for some kind of modern computer
your only option?

Unfortunately, I don't know a way to get the exact answer without a computer. If you do, please leave a comment. But I do know how to
do the next best thing - get upper and lower bounds on the probability.

Since the premise of this section is that you don't have access to a computer, all the calculations shown in this section can be carried
out with pen and paper (or stick and sand).
--&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">1. The question Let’s start with a simple question that will motivate the content of this blog. Not only is the answer beautiful, but it also helps us develop a framework for answering a whole family of such questions. The question goes like this - let’s say I have a fair coin (50-50 chance of heads and tails) and so do you. Both of us start tossing our coins. What is the probability that you will get three heads in a row before I get two heads in a row? It’s quite clear that I have a higher chance of winning (but don’t worry, my better odds are balanced by us obsessing over your victory throughout this post). Now, how do we go about calculating how much higher?</summary></entry><entry><title type="html">Optimum waiting thresholds</title><link href="/jekyll/update/2018/05/22/optimum_waiting_thresholds.html" rel="alternate" type="text/html" title="Optimum waiting thresholds" /><published>2018-05-22T06:49:40+00:00</published><updated>2018-05-22T06:49:40+00:00</updated><id>/jekyll/update/2018/05/22/optimum_waiting_thresholds</id><content type="html" xml:base="/jekyll/update/2018/05/22/optimum_waiting_thresholds.html">&lt;h2 id=&quot;1-optimum-waiting-threshold-problem&quot;&gt;1. Optimum waiting threshold problem&lt;/h2&gt;
&lt;p&gt;Say it takes ten minutes for you to walk to work. However, there is a bus that also takes you right from your house to work. 
As an added bonus, the bus has internet, so you can start working while on it. The catch is that you don’t know how long it will
take for the bus to arrive.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/opt_thresholds/miss_bus.jpg&quot; alt=&quot;Missing a bus&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now, being the productive person you are, you want to minimize the time you spend being in a state where you can’t work (walking to work or waiting for the bus).&lt;/p&gt;

&lt;p&gt;If we knew exactly how long the bus was going to take on any day, this would be quite easy. If the bus is more than ten minutes away, simply start walking.&lt;/p&gt;

&lt;p&gt;However, we know that real life is rarely so predictable. We probably have some distribution in mind for how much longer it will take for the bus to arrive, 
but not the exact time. Assume we have this distribution and it doesn’t change from one day to the next. Our strategy is simple - wait a certain amount of time
for the bus (say 5 minutes) and if the bus doesn’t arrive by then, give up waiting and simply walk to work.&lt;/p&gt;

&lt;p&gt;Now, assume that in parallel universes, you and your clones pick different wait-times. In one version, your clone always waits one minute, in another one, 
two minutes and so on.&lt;/p&gt;

&lt;p&gt;How can you ensure that the total time you’ll spend waiting for the bus over a long period (say, a year) will be less than that of all the clones? In other words, what wait
time should you pick so that the total time you spend either walking or waiting is minimized?&lt;/p&gt;

&lt;p&gt;This clearly depends on the distribution for how long the bus is going to take. If we know this distribution, what can we do next?&lt;/p&gt;

&lt;h2 id=&quot;2-parametric-survival-models&quot;&gt;2. Parametric survival models&lt;/h2&gt;
&lt;p&gt;Let’s say the time it’s going to take for the bus to arrive is \(T\), which is a random variable. Also, if we give up on the bus and decide to walk, 
the time it’s going to take is \(y\).&lt;/p&gt;

\[E[DT] =  P(T \leq \tau) \cdot E[T | T &amp;lt; \tau] + P(T &amp;gt; \tau) \cdot (\tau + y)\]

&lt;p&gt;Where :&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;\(T\) is the random time it takes for the bus to arrive (with density \(f_T\)).&lt;/li&gt;
  &lt;li&gt;\(\tau\) is the threshold for the time we will wait before starting to walk to the office.&lt;/li&gt;
  &lt;li&gt;\(y\) is the time it will take for you to get to the office once we start walking.&lt;/li&gt;
  &lt;li&gt;\(DT\) is the time we spend not working given waiting threshold, \(\tau\). It is a random variable that depends on \(T\) and \(\tau\).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we can see, the equation above is a simple expectation over two possibilities. Either the bus will arrive before the threshold, \(\tau\), 
or it won’t and we will walk to work, losing a total time of \((\tau + y)\) (since we waited \(\tau\) before starting to walk and then spend \(y\) time walking).
Now, to find the optimal value of \(\tau\) (which will minimize the L.H.S of the equation above), we can differentiate it with respect to \(\tau\) and set
the result to zero. This leads us to the following condition (see the appendix for how this comes about):&lt;/p&gt;

&lt;p&gt;\begin{equation} \Rightarrow \frac{f_T(\tau)}{P(T\geq \tau)} =  \frac{1}{y}\end{equation}&lt;/p&gt;

&lt;p&gt;The expression in equation (1) has a very intuitive meaning. The expression on the L.H.S is called the “Hazard rate” of distributions describing the 
arrival times of certain events. 
The events being modelled are generally negative in nature (like defaults) hence the “hazard” in the name.&lt;/p&gt;

&lt;p&gt;In our case though, the event we are anticipating is positive (the bus arriving - not so hazardous). The hazard rate is simply the rate at which the events 
described by the distribution arrive.&lt;/p&gt;

&lt;p&gt;Here, the rate is described as the inverse of the average time until the next event arrives as seen from the current state, instantaneously. 
Note that for most distributions, this rate itself will change with time so the average time until the next event won’t actually be the inverse of the 
instantaneous rate just as the instantaneous velocity of an accelerating object at a certain time can’t on its own predict the time the object will reach 
a certain point. The R.H.S is the inverse of the (deterministic) time it’ll take for you to get to work once we start walking. It too is then a kind of rate. 
It is when the rates corresponding to the two competing processes align that the optimal \(\tau\) is achieved.&lt;/p&gt;

&lt;p&gt;In a nutshell, this points to the strategy - “at any point in time, go with the option that gives you a higher hazard rate”.&lt;/p&gt;
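&lt;p&gt;As a concrete illustration with toy numbers of my choosing (not data from any real bus): a Lomax distribution has hazard rate \(\frac{\alpha}{\lambda + t}\), so the condition above can be solved in closed form, and we can cross-check it against a brute-force search over the expected downtime:&lt;/p&gt;

```python
import numpy as np

# Hypothetical numbers: Lomax(alpha, lam) arrival times, ten-minute walk.
alpha, lam, y = 1.5, 10.0, 10.0

# The Lomax hazard rate is alpha / (lam + t); equating it to 1/y
# gives the closed-form optimal threshold tau = alpha * y - lam.
tau_closed_form = alpha * y - lam

def expected_downtime(tau, n=100_000):
    # E[downtime] = integral_0^tau of t f(t) dt  plus  S(tau) * (tau + y)
    t, dt = np.linspace(0.0, tau, n, retstep=True)
    f = (alpha / lam) * (1 + t / lam) ** -(alpha + 1)   # Lomax density
    surv = (1 + tau / lam) ** -alpha                    # Lomax survival
    return np.sum(t * f) * dt + surv * (tau + y)

taus = np.linspace(0.1, 30.0, 300)
tau_numeric = taus[np.argmin([expected_downtime(t) for t in taus])]
print(tau_closed_form, tau_numeric)   # closed form is 5.0; the search agrees
```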

&lt;p&gt;Now, how do we get the distribution, \(T\)? We can just observe how long the bus takes each and every day and then use the data to fit a distribution. 
This can be done using maximum likelihood estimation.
In this context though, there is a bit of a wrinkle. And this is discussed next.&lt;/p&gt;

&lt;h3 id=&quot;21-dealing-with-censored-data&quot;&gt;2.1 Dealing with censored data&lt;/h3&gt;
&lt;p&gt;So, we record the time it takes each day for the bus to arrive and fit a distribution to these observations. Simple, right? Well, not quite. 
The problem is, you probably can’t wait beyond a certain point for the bus. So, what about the times you waited say seven minutes, gave up and just 
walked to work? You know for sure that the bus took more than seven minutes. You just don’t know how much longer than seven minutes it took. What do you do with 
these observations? Should you simply throw them away? That would be a mistake since it would bias the data. The fitting method would never see values that are greater
than the value at which they are getting censored, and wrongly assume those extreme values never occur in the real data.&lt;/p&gt;

&lt;p&gt;To take these censored values into account, we need to modify the likelihood function. If there were no censoring, we might fit the distribution parameters
to the data with a likelihood function that looks something like this:&lt;/p&gt;

\[L(\Theta; t_1, t_2, \dots t_n) = f_T(t_1, t_2, \dots t_n | \Theta) = \prod_{i=1}^n f_T(t_i | \Theta)\]

&lt;p&gt;However, now we need to account for the censored data points as well. For this, we modify our likelihood function, multiplying in the survival function \(S(x|\Theta) = P(T &amp;gt; x | \Theta)\) for each censored observation \(x_j\):&lt;/p&gt;

\[L(\Theta; t_1, t_2, \dots t_n, x_1, x_2, \dots x_m) =  \left(\prod_{i=1}^n f_T(t_i | \Theta)\right) \left(\prod_{j=1}^m S(x_j|\Theta)\right)\]

&lt;p&gt;Now, we can simply maximize this likelihood (or equivalently, its logarithm) with respect to the parameters, calculate the fitted distribution’s hazard rate and use it in equation (1) to find the
optimal wait threshold.&lt;/p&gt;
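&lt;p&gt;Here is a hedged sketch of this censored likelihood, assuming an exponential model (chosen because its survival function and maximum likelihood estimate have simple closed forms we can cross-check against the optimizer):&lt;/p&gt;

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated "bus" data: exponential arrivals, censored at a 7-minute wait.
rng = np.random.default_rng(0)
true_rate = 0.2                        # bus every five minutes on average
t_full = rng.exponential(1 / true_rate, size=500)
mask = np.less_equal(t_full, 7.0)      # days on which we saw the bus arrive
observed = t_full[mask]                # exact arrival times t_i
censored = np.full(np.count_nonzero(~mask), 7.0)   # we only know t exceeded 7

def neg_log_lik(rate):
    # Density terms f(t_i) for the observed points; survival terms
    # S(x_j) = exp(-rate * x_j) for the censored ones, as above.
    ll = np.sum(np.log(rate) - rate * observed)
    ll += np.sum(-rate * censored)
    return -ll

fit = minimize_scalar(neg_log_lik, bounds=(1e-6, 5.0), method="bounded")
closed_form = len(observed) / (observed.sum() + censored.sum())
print(fit.x, closed_form)   # numerical and closed-form MLEs agree
```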

&lt;h2 id=&quot;3-alternate-approaches&quot;&gt;3. Alternate approaches&lt;/h2&gt;
&lt;p&gt;Apart from fitting the parametric distributions to the data and using equation (1), we can frame the problem in slightly different ways. In this section,
we explore some of those ways.&lt;/p&gt;

&lt;h3 id=&quot;31-markov-chain-based-approach&quot;&gt;3.1. Markov chain based approach&lt;/h3&gt;

&lt;p&gt;We can equivalently look at the process described above as a Markov chain. The two things we need for a Markov chain are states and the process that describes motion 
between the states. For example, we might define the states - “waiting for bus”, “walking to work” and “working”. We could then think of how we would move between
these states.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/opt_thresholds/bus_states.png&quot; alt=&quot;No censoring&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the Markov assumption, the probability of the state we will go to next only depends on the state we are in now. So, we can write these probabilities 
in the form of a matrix (source states along the rows and destination states along the columns).&lt;/p&gt;

\[\left( \begin{array}{ccc}
		0 &amp;amp; P(T&amp;gt;\tau) &amp;amp; P(T&amp;lt;\tau)  \\
		0 &amp;amp; 0 &amp;amp; 1 \\
		1 &amp;amp; 0 &amp;amp; 0
		\end{array} \right)\]

&lt;p&gt;We can also consider other aspects of the Markov chain. For example, instead of looking at the average time it takes to get to a new state, we can ask 
what proportion of time overall is spent on the road (walking or waiting for the bus).&lt;/p&gt;
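&lt;p&gt;This “proportion of time spent working” objective is easy to estimate by simulation. Below is a sketch with made-up parameters (Lomax bus arrivals, a ten-minute walk, and a fixed 480-minute working block each day) - none of these numbers come from the post’s own experiments:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, lam, y, work = 1.5, 10.0, 10.0, 480.0   # made-up parameters (minutes)

def working_fraction(tau, days=200_000):
    # numpy's pareto draws a Lomax(alpha) with unit scale; multiply by lam.
    t = lam * rng.pareto(alpha, size=days)
    # Each day: either the bus came within tau, or we waited tau and walked.
    downtime = np.where(np.less_equal(t, tau), t, tau + y)
    return days * work / (days * work + downtime.sum())

taus = np.linspace(0.5, 20.0, 40)
best_tau = taus[np.argmax([working_fraction(t) for t in taus])]
print(best_tau)   # typically near the hazard-rate answer alpha*y - lam = 5
```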

&lt;p&gt;Experiments show that for this simple Markov chain, the optimal threshold doesn’t change. We can see in the figure below
that the threshold that minimizes the wait time to get from waiting to working (blue curve) is the same as the one that maximizes
the proportion of time spent in a state where we can work (red curve).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/opt_thresholds/opt_tau_two_approaches.png&quot; alt=&quot;No censoring&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, in more complex Markov chains with multiple thresholds, 
we might want to optimize this objective function.&lt;/p&gt;

&lt;h3 id=&quot;32-non-parametric-approach&quot;&gt;3.2. Non-parametric approach&lt;/h3&gt;
&lt;p&gt;Let’s say we only wanted to evaluate thresholds that were less than the minimum censored value 
(say our current threshold is 10 minutes and so, all data points are censored at ten minutes and we’re sure we want to decrease it - 
so evaluate thresholds less than 10 minutes). 
Since we have complete information for all our data at the thresholds lower than 10 minutes that we wish to evaluate, 
we know exactly what would have happened if they were enforced. The figure below demonstrates this. Let’s say we want to move from
the current threshold (in purple) to some lower threshold (in yellow). For the data points represented by the green dots, this is a good
thing since they caused us to start walking anyway. We might as well have started walking earlier. For the orange points, it is less clear:
under the current threshold we would have caught the bus, while under the lower threshold we give up and start walking shortly before it arrives.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/opt_thresholds/non_parametric.png&quot; alt=&quot;No censoring&quot; /&gt;&lt;/p&gt;
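&lt;p&gt;Scoring those candidate thresholds is then simple arithmetic over the recorded times. A small sketch with invented data (a value of 10.0 marks a day censored at the current ten-minute threshold):&lt;/p&gt;

```python
import numpy as np

y = 10.0   # walking time in minutes
# Invented observations; 10.0 means the bus hadn't come when we gave up at 10.
times = np.array([1.2, 3.5, 4.1, 6.0, 8.8, 10.0, 10.0, 10.0])

def avg_downtime(tau):
    # Below the censor level we know each day's outcome exactly:
    # either the bus came by tau (cost t), or we walked (cost tau + y).
    return np.where(np.less_equal(times, tau), times, tau + y).mean()

candidates = np.arange(0.5, 10.0, 0.5)   # only thresholds below ten minutes
scores = [avg_downtime(tau) for tau in candidates]
best = candidates[int(np.argmin(scores))]
print(best, min(scores))   # 6.0 minutes wins on this toy data
```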

&lt;h3 id=&quot;33-hybrid-approach&quot;&gt;3.3. Hybrid approach&lt;/h3&gt;

&lt;h3 id=&quot;34-piecewise-hazard-rates&quot;&gt;3.4. Piecewise hazard rates&lt;/h3&gt;
&lt;p&gt;Since equation (1) equates the hazard rates of two strategies, it is also possible to directly model them as a function of some covariates using the 
piecewise exponential model described in [1] and [2].&lt;/p&gt;

&lt;h2 id=&quot;4-experiments&quot;&gt;4. Experiments&lt;/h2&gt;

&lt;p&gt;In this section, we conduct a series of experiments where we start with a process such that we know the answer for the optimal threshold. 
We can then see how various models and approaches perform given just some data generated by this process. Each experiment is a set of answers
to the following questions:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;What process was used to generate the data? We have the following options:
    &lt;ol&gt;
      &lt;li&gt;Lomax distribution.&lt;/li&gt;
      &lt;li&gt;Weibull distribution.&lt;/li&gt;
      &lt;li&gt;LogLogistic distribution.&lt;/li&gt;
      &lt;li&gt;LogNormal distribution.&lt;/li&gt;
      &lt;li&gt;Distribution given by an arbitrary hazard rate profile. As shown in equation (1), the hazard rate is a crucial element
in defining where the optimal parameter will be. Also, the hazard rate can be used to get the PDF and CDF of the distribution.
So, in theory we should be able to generate data from a distribution corresponding to an arbitrary function defining the hazard rate.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;What is the process used to censor the data? We only consider censoring schemes where we don’t see data larger than the optimal threshold.
    &lt;ol&gt;
      &lt;li&gt;None: No censoring at all.&lt;/li&gt;
      &lt;li&gt;Deterministic censoring: All samples below the censor level are censored.&lt;/li&gt;
      &lt;li&gt;Stochastic censoring: We randomly toss a coin and with probability p, censor the data point if it is eligible for censoring.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;What model/methodology is being used to estimate the optimal wait threshold?
    &lt;ol&gt;
      &lt;li&gt;We could use one of the distributions defined in question 1 and fit the data to them.&lt;/li&gt;
      &lt;li&gt;We could use one of the non-parametric approaches.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we know exactly which distribution generated the data, we have shown that we know the correct answer for the optimal waiting time 
via equation (1).&lt;/p&gt;

&lt;p&gt;So, we can see which of our approaches comes the closest to this right answer.&lt;/p&gt;

&lt;h3 id=&quot;41-lomax-data-generation&quot;&gt;4.1. Lomax data generation&lt;/h3&gt;
&lt;p&gt;In this section, we generate data from a Lomax distribution and see which models perform best at predicting the optimal threshold.&lt;/p&gt;

&lt;h4 id=&quot;411-deterministic-censoring-at-value-less-than-the-optimal-threshold&quot;&gt;4.1.1. Deterministic censoring at value less than the optimal threshold.&lt;/h4&gt;

&lt;p&gt;First, let’s see how the Lomax distribution itself does.
&lt;img src=&quot;/Downloads/opt_thresholds/no_censor_lomax.png&quot; alt=&quot;No censoring&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Let’s start with the simple case where we have no censoring. Let’s generate some data from a Lomax distribution. 
Now, what happens when we apply the non-parametric methods we discussed to this data?&lt;/p&gt;

&lt;p&gt;The results are best encapsulated in the figure below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/opt_thresholds/no_censor_non_parmtr.png&quot; alt=&quot;No censoring&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For the Weibull distribution, we get much higher optimal thresholds. For the LogLogistic distribution, we get slightly lower ones.&lt;/p&gt;

&lt;p&gt;We should be careful to use a model that is capable of modelling the distribution we are after. 
If our assumption is incorrect, we can end up way off in terms of the optimal wait threshold.&lt;/p&gt;

&lt;h4 id=&quot;412-complete-censoring&quot;&gt;4.1.2. Complete censoring&lt;/h4&gt;
&lt;p&gt;Here, we consider the case where we completely censor the data at some censor level. This process describes the scenario where we always 
wait exactly 10 minutes (say) for the bus and then start walking. How do our non-parametric approaches do in this scenario?&lt;/p&gt;

&lt;h4 id=&quot;413-random-censoring&quot;&gt;4.1.3. Random censoring&lt;/h4&gt;

&lt;h3 id=&quot;42-non-standard-processes&quot;&gt;4.2. Non-standard processes&lt;/h3&gt;

&lt;p&gt;Our data might be generated by a process that doesn’t conform to any of the standard distributions.&lt;/p&gt;

&lt;h2 id=&quot;5-appendices&quot;&gt;5. Appendices&lt;/h2&gt;
&lt;p&gt;Now,
\(E[T|T &amp;lt; \tau] = \frac{\int_0^\tau tf_T(t)dt}{P(T &amp;lt; \tau)}\)&lt;/p&gt;

&lt;p&gt;\begin{equation}\Rightarrow E[DT] = \int_0^\tau t f_T(t) dt + [1-P(T\leq\tau)] \times [\tau + y] \end{equation}&lt;/p&gt;

&lt;p&gt;To find the value of \(\tau\) that minimizes the expression in equation (2), we take the derivative with respect to it and set it to zero.&lt;/p&gt;

\[\frac{dE[DT]}{d\tau} = \tau f_T(\tau) - f_T(\tau) \times [\tau + y] + [1-P(T\leq\tau)] = 0\]

\[\Rightarrow 1 - P(T\leq \tau) = y f_T(\tau)\]

&lt;p&gt;\begin{equation} \Rightarrow \frac{f_T(\tau)}{P(T\geq \tau)} =  \frac{1}{y}\end{equation}&lt;/p&gt;

&lt;h2 id=&quot;7-references&quot;&gt;6. References&lt;/h2&gt;
&lt;p&gt;[1] http://data.princeton.edu/wws509/notes/c7.pdf&lt;/p&gt;

&lt;p&gt;[2] https://projecteuclid.org/download/pdf_1/euclid.aos/1176345693&lt;/p&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">1. Optimum waiting threshold problem Say it takes ten minutes for you to walk to work. However, there is a bus that also takes you right from your house to work. As an added bonus, the bus has internet, so you can start working while on it. The catch is that you don’t know how long it will take for the bus to arrive.</summary></entry><entry><title type="html">Visualizing the Newton Raphson method</title><link href="/jekyll/update/2018/03/25/newton_raphson.html" rel="alternate" type="text/html" title="Visualizing the Newton Raphson method" /><published>2018-03-25T05:10:40+00:00</published><updated>2018-03-25T05:10:40+00:00</updated><id>/jekyll/update/2018/03/25/newton_raphson</id><content type="html" xml:base="/jekyll/update/2018/03/25/newton_raphson.html">&lt;h2 id=&quot;newton-and-raphson&quot;&gt;Newton and Raphson&lt;/h2&gt;
&lt;p&gt;From the name of this method, you might picture Newton and Raphson working together as a team, coming up with this method. But in reality, they discovered it independently. 
This seems to be a common theme with Newton. Remember how he also discovered calculus independently of Leibniz? How come this guy was having so many ideas at exactly the same time other people were
having them too? It’s actually quite plausible, since for the multiple people at the cutting edge of research in a field, the next steps are often unambiguous. It’s the same reason
that multiple independent teams discovered the security vulnerabilities in Intel chips (Spectre and Meltdown).&lt;/p&gt;

&lt;h2 id=&quot;what-is-x&quot;&gt;What is x?&lt;/h2&gt;

&lt;p&gt;The essence of algebra is solving equations for unknown quantities. For example, find \(x\) where&lt;/p&gt;

\[2x+5=7\]

&lt;p&gt;It’s easy to see that the solution to the equation above is \(x=1\). Another way to look at this is to take everything to one side and call the resulting expression \(y=2x-2\). 
Then, we can try to find the \(x\) for which \(y=0\).&lt;/p&gt;

&lt;p&gt;But what happens when the expression of \(y\) in terms of \(x\) becomes more and more complex? Can we find all (or at least one of) the values of \(x\) that satisfy \(y=0\)?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/NewtonRaphson/functions.gif&quot; alt=&quot;VariousFunction&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If linear functions are the simplest, then perhaps the next in rank are quadratics (which involve \(x^2\)). Now, if we have a quadratic equation, how would we go about solving it?
Well, we could simply use the quadratic formula. But let’s for a second suppose that we don’t know this formula. We only know how to solve linear equations like the one above.
Can we use this knowledge of solving a linear equation for solving a non-linear (in this case, quadratic) equation? Take, for example:&lt;/p&gt;

\[y = x^2\]

&lt;p&gt;Well we can, but only if we’re very persistent. Let’s start with any random point, \(x\). Now, calculate the value of our quadratic function at this \(x\) and call it \(y\). 
Now, our tool is a linear equation solver but we have a quadratic equation instead. So, let’s convert the quadratic equation into a linear one. To do this, just
approximate the quadratic equation with a linear equation at the current point. Now, when you solve this linear equation, you will get an “answer” for \(x\). But we can’t expect this answer to be
“right” since we “cheated”. Since the definition of insanity is doing the same thing and expecting a different result, we keep repeating this process, the only difference being 
that each time we use the previous solution to the linear equation as our starting point. And eventually, this process leads us to the solution of the quadratic equation, as 
you can see below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/NewtonRaphson/nr_iterations.gif&quot; alt=&quot;VariousFunction&quot; /&gt;&lt;/p&gt;
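&lt;p&gt;This persistent linearize-and-solve loop takes only a few lines of code. Here is a sketch for an arbitrary quadratic of my choosing, \(f(x) = x^2 - 2\) (not the function animated above):&lt;/p&gt;

```python
def newton(f, f_prime, x, iters=10):
    for _ in range(iters):
        # Approximate f by the line f(x) + f'(x)(x_new - x) and solve
        # that linear equation for zero instead of the original one.
        x = x - f(x) / f_prime(x)
    return x

# Solving x^2 - 2 = 0 starting from the (arbitrary) point x = 3.
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x=3.0)
print(root)   # converges to sqrt(2) ≈ 1.414213...
```

&lt;p&gt;The update \(x - f(x)/f'(x)\) is exactly the point where the tangent line crosses zero - the same construction the animation traces out.&lt;/p&gt;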

&lt;h2 id=&quot;cranking-up-the-dimensions&quot;&gt;Cranking up the dimensions&lt;/h2&gt;
&lt;p&gt;Now, it is possible that you might have seen something very similar to the visualization above before. But how does this extend to multiple dimensions? For example, instead of
the one variable \(x\), let’s say we now have two variables, \(x\) and \(y\). The most natural way to extend the quadratic above to two dimensions is:&lt;/p&gt;

\[z=x^2+y^2\]

&lt;p&gt;And here is what that plot looks like:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/NewtonRaphson/imParabola.png&quot; alt=&quot;VariousFunction&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Like before, we want to solve for \(z=0\). This happens on the green circle in the figure above, where our surface intersects the x-y plane.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/NewtonRaphson/circle_intersection.gif&quot; alt=&quot;VariousFunction&quot; /&gt;&lt;/p&gt;

&lt;p&gt;But unlike the previous case, there are an infinite number of solutions. And this is quite natural since we increased the number of variables by one, keeping the number
of equations the same. In order to get a finite number of solutions, we need the same number of equations as variables, which is two.&lt;/p&gt;

&lt;p&gt;To keep things simple, let’s just replicate our existing equation and move it by a finite amount. And just like that, we have two equations now.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/NewtonRaphson/TwoEquations.gif&quot; alt=&quot;VariousFunction&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The second equation also intersects the x-y plane in a circle, and the two circles intersect at two distinct points, which are the solutions to the 
system of equations. These are the yellow points in the figure below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/NewtonRaphson/imTwoEquations.png&quot; alt=&quot;VariousFunction&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now, how do we use our method from before to get to one of these solutions?&lt;/p&gt;

&lt;p&gt;We’ll start with any random point on the x-y plane (the pink point in the figure below). This can be projected to the green point on the green parabola and the
yellow point on the yellow parabola. We can then draw the best linear approximations of the green and yellow parabolas at the green and yellow points. Since linear
equations are planes, this gives us the light green and light yellow planes. These planes will intersect in the purple line which will intersect the x-y plane in
the white point. This white point is the solution of the system of two linear equations (approximations of the two parabolas).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/NewtonRaphson/imApproximateLinear.png&quot; alt=&quot;VariousFunction&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now, we can simply repeat this process with the white point instead of the original pink point and keep going. Ultimately, we get to one of the two solutions. This process
is shown below.&lt;/p&gt;
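&lt;p&gt;The full two-dimensional process can also be sketched in code. Linearizing both equations at the current point and intersecting the resulting planes with the x-y plane amounts to solving a two-by-two linear system built from the Jacobian. The circles here (radius 2, centred at the origin and at \((2,0)\)) are my own stand-ins, not the exact surfaces in the figures:&lt;/p&gt;

```python
import numpy as np

def F(p):
    # The two equations at z = 0: a circle and a shifted copy of it.
    x, y = p
    return np.array([x**2 + y**2 - 4.0, (x - 2.0)**2 + y**2 - 4.0])

def J(p):
    # Jacobian: each row is the gradient of one equation.
    x, y = p
    return np.array([[2 * x, 2 * y], [2 * (x - 2.0), 2 * y]])

p = np.array([3.0, 1.0])   # the arbitrary starting ("pink") point
for _ in range(20):
    # Solve the linearized system (the two tangent planes) and jump there.
    p = p - np.linalg.solve(J(p), F(p))

print(p)   # one of the two intersection points: (1, sqrt(3))
```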

&lt;p&gt;&lt;img src=&quot;/Downloads/NewtonRaphson/iterations.gif&quot; alt=&quot;VariousFunction&quot; /&gt;&lt;/p&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">Newton and Raphson From the name of this method, you might picture Newton and Raphson working together as a team, coming up with this method. But in reality, they discovered it independently. This seems to be a common theme with Newton. Remember how he also discovered calculus with Leibnitz? How come this guy was having so many ideas at the exact time other people were having them too? It’s actually quite possible since for multiple people at the cutting edge of research in a field, the next steps are often unambiguous. It’s the same reason that multiple independent teams discovered the security vulnerabilities in Intel chips (spectere and meltdown).</summary></entry><entry><title type="html">Why is the gradient the direction of maximum ascent</title><link href="/jekyll/update/2017/09/29/gradient_max_ascent.html" rel="alternate" type="text/html" title="Why is the gradient the direction of maximum ascent" /><published>2017-09-29T00:34:40+00:00</published><updated>2017-09-29T00:34:40+00:00</updated><id>/jekyll/update/2017/09/29/gradient_max_ascent</id><content type="html" xml:base="/jekyll/update/2017/09/29/gradient_max_ascent.html">&lt;p&gt;This blog is based on a &lt;a href=&quot;http://www.youtube.com/v/OV7c6S32IDU?version=3&quot;&gt;YouTube video&lt;/a&gt; explaining why the gradient is the direction of steepest ascent. Check it out.&lt;/p&gt;

&lt;h2 id=&quot;i-why-bother&quot;&gt;I) Why bother?&lt;/h2&gt;
&lt;p&gt;First, let’s get what might be the elephant in the room for some out of the way. Why should you read this blog? First, because it has awesome animations like figure 1 below. Don’t you want to know what’s going on in this picture?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/GradientAscent/plane_grad_rotation_2-min.gif&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: A plane with its gradient&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And second, because optimization is really, really important. I don’t care who you are, that should be true. I mean, it’s the science of finding the best. It lets you choose your definition of “best” and then, whatever that might be, tells you what you can do to achieve it. That simple.&lt;/p&gt;

&lt;p&gt;Also, though optimization - being a whole science and all - has a lot of depth to it, it has this basic first order technique called “gradient descent” which is really simple to understand. And it turns out, this technique actually is the most widely used in practice. In machine learning at least, as the models get more and more complex, using complex optimization algorithms becomes harder and harder. So people just use gradient descent. In other words, learn gradient descent and you learn the simplest, but also most widely used technique in optimization. So, let’s understand this technique very intuitively.&lt;/p&gt;

&lt;h2 id=&quot;ii-optimization-brass-tacks&quot;&gt;II) Optimization brass tacks&lt;/h2&gt;
&lt;p&gt;As I mentioned in the last section, optimization is awesome. It involves taking a single number - for example, the amount of money in your bank account, or the number of bed bugs in your bed - and showing you how to make it the best it can be (if you’re like most people, high in the first case and low in the second). Let’s call this thing we’re trying to optimize \(z\).&lt;/p&gt;

&lt;p&gt;And the implicit assumption here of course is that we can control the thing we want to optimize in some way. Let’s say it depends on some variable (say \(\vec{x}\)) which is in our control. So, at every value of \(\vec{x}\), there is some value of \(z\) (and we want to find the \(\vec{x}\) that makes \(z\) the best).&lt;/p&gt;

&lt;p&gt;There is probably some equation that describes this graph. Say \(f(x,z) = 0\). But in the context of optimization, we need to express it in the form: \(z = f(x)\) (assuming the original equation is conducive to separating \(z\) from \(x\) in this way). Then, we can ask - “what value of \(x\) corresponds to the best \(z\)?”. If we have a nice continuous function, then one thing we can say for sure is that at this special \(x\), the derivative of \(z = f(x)\) (generally denoted by \(f'(x)\)) will be zero. Now if you don’t know what a derivative is (it’s the ratio of the amount \(z\) gets perturbed to the perturbation in \(x\), when we purposely perturb \(x\)) and why it should be zero when we achieve the best \(z\), I’d recommend checking out &lt;a href=&quot;http://www.youtube.com/v/llonP6K0YHQ&quot;&gt;this video&lt;/a&gt; that covers this in detail.&lt;/p&gt;

&lt;h2 id=&quot;iii-what-is-a-gradient&quot;&gt;III) What is a gradient&lt;/h2&gt;

&lt;p&gt;When the thing we are optimizing depends on more than one variable, the concept of the derivative extends to a gradient. So, if \(z\) from above depends on \(x\) and \(y\), we can collect them into a single vector \(\vec{u} = [x, y]\). So, with \(z = f(x,y) = f(\vec{u})\), the gradient of \(z\) becomes \(\Delta f(\vec{u}) = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} ]\). And just like with the derivative, we can be sure that the value of \(\vec{u}\) that will optimize \(z\) will have both components of the gradient equaling zero.&lt;/p&gt;

&lt;p&gt;As a side note, the gradient plays a star role in the &lt;a href=&quot;http://fourier.eng.hmc.edu/e176/lectures/NM/node45.html&quot;&gt;Taylor series expansion&lt;/a&gt; of any smooth, differentiable function:&lt;/p&gt;

&lt;p&gt;\begin{equation} f(\vec{u}) = f(\vec{a}) + (\vec{u}-\vec{a})^T \Delta f(\vec{a}) + \dots\end{equation}&lt;/p&gt;

&lt;p&gt;As you can see, the first two terms on the right side only involve \(\vec{u}\) linearly - no squares, cubes or higher powers (those come up in the subsequent terms). Those first two terms also happen to be the best linear approximation of the function around \(\vec{u}=\vec{a}\). We show below what this linear approximation looks like for a simple paraboloid (\(z=x^2+y^2 = \vec{u}^T \vec{u}\)) at various points.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/GradientAscent/plane_to_paraboloid.gif&quot; alt=&quot;My helpful screenshot&quot; /&gt;
&lt;em&gt;Figure 2: The best approximation of a paraboloid (pink) by a plane (purple) at various points&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;iv-linear-functions&quot;&gt;IV) Linear functions&lt;/h2&gt;
&lt;p&gt;We saw in the previous section that, around any given point, smooth functions are quite adequately represented by linear functions. So, we will restrict the discussion to linear functions going forward. For the equation of a linear function, we only need to know where it intersects the axes.&lt;/p&gt;

&lt;p&gt;If we have just one dimension (the x-axis) and the intersection happens at \(x=a\), we can describe this by&lt;/p&gt;

&lt;p&gt;\begin{equation}\frac{x}{a} = 1\end{equation}.&lt;/p&gt;

&lt;p&gt;If we have two dimensions (x-axis and y-axis) and the line intersects the x-axis at \(x=a\) and the y-axis at \(y=b\), the equation becomes&lt;/p&gt;

&lt;p&gt;\begin{equation}\frac{x}{a} + \frac{y}{b} = 1\end{equation}.&lt;/p&gt;

&lt;p&gt;When \(y=0\), we get \(\frac{x}{a}=1\) which is the same as the equation above.&lt;/p&gt;

&lt;p&gt;What if we have three dimensions? I think you know where this is going&lt;/p&gt;

&lt;p&gt;\begin{equation}\frac{x}{a}+\frac{y}{b}+\frac{z}{c} = 1 \end{equation}&lt;/p&gt;

&lt;p&gt;and so on (this by the way, is the red plane you saw in figure 1 above).&lt;/p&gt;

&lt;p&gt;Now, we can see that all of the equations above are symmetric in \(x\), \(y\), \(z\) and so on. However, in the context of optimization, one of them has special status. And that is the variable we are seeking to optimize. Let’s say this special variable is \(z\). If we want to express this as an optimization problem, we need to express the equation as \(z=f(x,y)\). If we do this to equation (4), what we get is&lt;/p&gt;

&lt;p&gt;\begin{equation} z = c (1 - \frac{x}{a} - \frac{y}{b})\end{equation}&lt;/p&gt;

&lt;h2 id=&quot;v-i-want-to-increase-my-linear-function-where-should-i-go&quot;&gt;V) I want to increase my linear function. Where should I go?&lt;/h2&gt;

&lt;p&gt;This is the central question for this blog. You have your linear function described by equation (5) above, with \(x\) and \(y\) in your control. You find yourself at a certain value of \(x=x_0\) and \(y=y_0\). Let’s say for the sake of simplicity that \(z=0\) at this current point. You can take a step of 1 unit along any direction. The question becomes: in which direction should you take this step? This conundrum is expressed in figure 3 below, showing the infinite directions you can possibly walk along. Each direction changes the objective function, \(z\), by a different amount. So, one of them will increase \(z\) the most while another will decrease \(z\) the most (and we pick one depending on whether we want to maximize or minimize it).&lt;/p&gt;

&lt;p&gt;Note that if we had just one variable in our control (say \(x\)), this would have been a lot easier. There would have been only two directions to choose from (increase \(x\) or decrease \(x\)). As soon as we get to two or more free variables however, the number of choices jumps from two to \(\infty\).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/GradientAscent/which_direction.gif&quot; alt=&quot;My helpful screenshot&quot; /&gt;
&lt;em&gt;Figure 3: The infinite directions we can move along. Which one should we choose?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, we want to find the directions along which \(z\) changes the most. So let’s do the opposite (say, because we’re a little crazy?). Let’s look for the direction where \(z\) doesn’t change at all. If you look at the figure above carefully, you’ll see that this happens when the green arrow aligns with the orange line. And then if you continue staring, you might notice that \(z\) changes the most when the green arrow is perpendicular to the orange line.&lt;/p&gt;

&lt;p&gt;So, it seems like that orange line can provide some insight into this problem. What is the orange line then? Well, it’s clearly where our plane intersects the green grid representing the x-y plane (the grid along which we can move). And what would the equation of the x-y plane be? It would be \(z=0\). In other words, \(z\) does not change on it. So, since the orange line lies completely on the grid, it must also have \(z=0\) everywhere on it. No wonder \(z\) refuses to change when our green arrow causes us to simply move along the orange line.&lt;/p&gt;

&lt;p&gt;As for the equation of the orange line, it is where the equations of the plane -&lt;/p&gt;

\[\frac{x}{a} + \frac{y}{b} + \frac{z}{c} = 1\]

&lt;p&gt;and x-y grid; \(z=0\) are satisfied simultaneously. This gives us&lt;/p&gt;

\[\frac{x}{a} + \frac{y}{b} = 1\]

&lt;p&gt;Now, it is clear from the equation of the orange line above that when \(y=0\), \(x=a\). So, the position vector of the point where it cuts the x-axis is \(\vec{o_x} : [a,0]\) (\(o\) for orange). Similarly, the point where it cuts the y-axis is \(\vec{o_y} : [0,b]\). Now that we have the position vectors of two points on the line, we subtract them to get a vector along the line (\(\vec{o}\)).&lt;/p&gt;

&lt;p&gt;\begin{equation}\vec{o} = \vec{o_x} - \vec{o_y} = [a,0] - [0,b] = [a, -b]\end{equation}&lt;/p&gt;

&lt;p&gt;Now, if we can show that the gradient is perpendicular to this vector, we are done. That will give us some intuition around why the gradient changes \(z\) the most.&lt;/p&gt;

&lt;h2 id=&quot;vi-gradient-of-the-plane&quot;&gt;VI) Gradient of the plane&lt;/h2&gt;
&lt;p&gt;Applying the definition of the gradient from section III to the equation of the plane (equation (5) above, solved for \(z\) as \(z = c \left( 1 - \frac{x}{a} - \frac{y}{b} \right)\)) we get -&lt;/p&gt;

&lt;p&gt;\begin{equation}
\frac{\partial z}{\partial x} = -\frac{c}{a} \ , 
\frac{\partial z}{\partial y} = -\frac{c}{b}
\end{equation}&lt;/p&gt;

&lt;p&gt;This makes the gradient:&lt;/p&gt;

&lt;p&gt;\begin{equation} \nabla z : [-\frac{c}{a}, -\frac{c}{b}]\end{equation}&lt;/p&gt;

&lt;p&gt;Now, we know that for two vectors to be orthogonal, their dot product must be zero. Taking the dot product of the gradient of the plane (just computed above) and the vector along the orange line (from equation (6)) we get,&lt;/p&gt;

&lt;p&gt;\begin{equation}
(\nabla z)^T \vec{o} = [-\frac{c}{a}, -\frac{c}{b}] [a, -b]^T = -a.\frac{c}{a} + b.\frac{c}{b} = -c + c = 0
\end{equation}&lt;/p&gt;

&lt;p&gt;And there we have it, the gradient is aligned with the direction perpendicular to the orange line and so, it changes \(z\) the most. It turns out that going along the gradient increases \(z\) the most while going in the opposite direction to it (note that both these directions are orthogonal to the orange line) decreases \(z\) the most.&lt;/p&gt;
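&lt;p&gt;If you want to convince yourself beyond the algebra, here is a quick numerical spot-check (a toy sketch of my own) that the dot product vanishes for many randomly chosen planes:&lt;/p&gt;

```python
import math, random

# For random planes x/a + y/b + z/c = 1, the gradient of
# z = c*(1 - x/a - y/b) is (-c/a, -c/b); its dot product with the orange
# direction (a, -b) is (-c/a)*a + (-c/b)*(-b) = -c + c = 0.
# (Flipping the overall sign of the gradient clearly changes nothing here.)
random.seed(0)
for _ in range(1000):
    a = random.uniform(0.5, 5.0)
    b = random.uniform(0.5, 5.0)
    c = random.uniform(0.5, 5.0)
    grad = (-c / a, -c / b)
    orange = (a, -b)
    dot = grad[0] * orange[0] + grad[1] * orange[1]
    assert math.isclose(dot, 0.0, abs_tol=1e-9)
print("gradient perpendicular to the orange line for 1000 random planes")
```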

&lt;p&gt;I’ll leave you with this visualization demonstrating how as we change the plane, the gradient continues to stubbornly point in the direction that changes it the most.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/GradientAscent/grad_as_plane_changes.gif&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4: As we change the plane, the gradient always aligns itself with the direction that changes it the most&lt;/em&gt;&lt;/p&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">This blog is based on a YouTube video explaining why the gradient is the direction of steepest ascent. Check it out.</summary></entry><entry><title type="html">Can we see four dimensional space with our own two (dimensional) eyes?</title><link href="/jekyll/update/2017/03/26/hypercubes.html" rel="alternate" type="text/html" title="Can we see four dimensional space with our own two (dimensional) eyes?" /><published>2017-03-26T00:34:40+00:00</published><updated>2017-03-26T00:34:40+00:00</updated><id>/jekyll/update/2017/03/26/hypercubes</id><content type="html" xml:base="/jekyll/update/2017/03/26/hypercubes.html">&lt;p&gt;This blog is based on &lt;a href=&quot;https://www.youtube.com/watch?v=57g6nQGBFcY&quot;&gt;a video&lt;/a&gt; about slicing a four dimensional cube or Tesseract. Check it out.&lt;/p&gt;

&lt;h2 id=&quot;building-up-to-seeing-in-four-dimensions&quot;&gt;Building up to seeing in four dimensions&lt;/h2&gt;
&lt;p&gt;If we want to visualize a four dimensional phenomenon, we can’t simply conjure it up directly. We first need to understand the phenomenon well and see what it looks like in three dimensions. Then, once we know what to expect in four dimensions, we will be ready to truly see it in all its glory. So, let’s begin the first step of this journey.&lt;/p&gt;

&lt;h2 id=&quot;slicing-a-cube&quot;&gt;Slicing a cube&lt;/h2&gt;
&lt;p&gt;The simplest shape in space of any dimensionality is a cube. When we draw a cube in a space, it is partitioned into the region outside the cube and the region inside the cube. The region outside the cube is of course, infinite. But the region inside provides a means to measure the space the cube lives in. Whenever someone draws a closed boundary and asks how much space is within the boundary, we can try and count the number of cubes we would be able to fit into it.&lt;/p&gt;

&lt;p&gt;So, since cubes are such a fundamental property of space, let’s try and observe a phenomenon within cubes. It involves slicing the cube. Pick any vertex of the cube and move to its three nearest neighbors (there will be three since we’re in 3 dimensional space). Joining these three vertices will result in an equilateral triangle. Then, we move from those to their nearest neighbors. There turn out to be three of those as well and they form another equilateral triangle parallel to the first one. This is best visualized in the figure below -&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/Hypercubes/cubeNeibors2.gif&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: Start at any vertex. Its nearest neighbors form an equilateral triangle. And if we then go to their nearest neighbors, another equilateral triangle.&lt;/em&gt;&lt;/p&gt;
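&lt;p&gt;Figure 1 can also be verified with a few lines of code. This is a standalone sketch (not part of the animation code): enumerate the vertices of the unit cube, start at a vertex, and check that both rings of neighbors form equilateral triangles.&lt;/p&gt;

```python
import itertools, math

# Vertices of the unit cube: all 0/1 triples.
verts = list(itertools.product((0, 1), repeat=3))

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

start = (0, 0, 0)
# Nearest neighbors sit at distance 1 (one coordinate flipped).
ring1 = [v for v in verts if math.isclose(dist(start, v), 1.0)]
assert len(ring1) == 3
# All pairwise distances equal sqrt(2): an equilateral triangle.
assert all(math.isclose(dist(u, v), math.sqrt(2))
           for u, v in itertools.combinations(ring1, 2))

# Their nearest neighbors (other than the start) sit at distance sqrt(2)
# from the start, and also form an equilateral triangle.
ring2 = [v for v in verts if math.isclose(dist(start, v), math.sqrt(2))]
assert len(ring2) == 3
assert all(math.isclose(dist(u, v), math.sqrt(2))
           for u, v in itertools.combinations(ring2, 2))
print("both triangles are equilateral with side", math.sqrt(2))
```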

&lt;p&gt;Let’s inspect this cube with its two equilateral triangles more closely -&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/Hypercubes/bodyDiag4.gif&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2: A three dimensional look at the equilateral triangles described above.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And now, we can go ahead and cut the cube along these two equilateral triangles.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/Hypercubes/cutCube.gif&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3: Slicing the cube along the two separating planes.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;shapes-of-the-slices&quot;&gt;Shapes of the slices&lt;/h2&gt;

&lt;p&gt;The red and blue shapes in figure 3 above look a lot like &lt;a href=&quot;https://en.wikipedia.org/wiki/Tetrahedron&quot;&gt;Tetrahedra&lt;/a&gt; while the green shape in the middle looks like an &lt;a href=&quot;https://en.wikipedia.org/wiki/Octahedron&quot;&gt;Octahedron&lt;/a&gt;, which are the simplest of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Platonic_solid&quot;&gt;Platonic solids&lt;/a&gt; (three dimensional objects that have regular polygons like equilateral triangles as their faces). There is, however, a small thorn in the side here. If we look at the front faces of the blue and green objects, they are right angled triangles (being part of a cube). Platonic solids like Tetrahedra and Octahedra are composed of only equilateral triangles, however. So, these seem to be imperfect versions of the Platonic solids.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/Hypercubes/Imperfect.jpg&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4: Since the two front faces are right angled, not equilateral, these shapes seem to be distorted versions of the Platonic solids.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But, here’s another observation - the cutting planes along which we sliced our cube were certainly equilateral. However, when we project the cube to a two dimensional square as in figure 5 below, they look right angled. Could something similar be happening when we project a four dimensional object to this three dimensional cube?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/Hypercubes/EquilateralToRight.gif&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 5: Equilateral triangles look like right angled when the cube is projected to a square.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;slicing-a-teserract&quot;&gt;Slicing a Tesseract&lt;/h2&gt;

&lt;p&gt;Turns out, this is indeed the case. In figure 6 below, we see a four dimensional cube, also called a Tesseract. It has 16 vertices (2^4). And since this is four dimensional space, the corresponding cutting planes formed by the nearest neighbors will be three dimensional (like they were two dimensional when we were in the three dimensional space of the cube). We start with the bottom left point. Since this is a cube in 4d space, the point will have four nearest neighbors. These four points form the blue tetrahedron (which is perfectly Platonic - all its faces are equilateral). Then, those four vertices are connected to six other vertices. These form the green (again, Platonic) Octahedron. And finally, we get the red tetrahedron from the nearest neighbors of the green Octahedron. These are the three shapes we saw our 3d cube split into in the previous section. However, the projection to a lower space was causing them to look imperfect (right angled faces).&lt;/p&gt;
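&lt;p&gt;The same check can be run in four dimensions. Below is a standalone sketch (mine, not from the animation code) confirming that the first ring of neighbors is a regular tetrahedron and the second ring is a regular octahedron, exactly as described above.&lt;/p&gt;

```python
import itertools, math

# Vertices of the unit tesseract: all 0/1 quadruples.
verts = list(itertools.product((0, 1), repeat=4))
assert len(verts) == 16   # 2^4 vertices

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# The 4 nearest neighbors of (0,0,0,0): all pairwise distances are sqrt(2),
# so they form a perfectly regular (Platonic) tetrahedron.
tetra = [v for v in verts if sum(v) == 1]
assert len(tetra) == 4
assert all(math.isclose(dist(u, v), math.sqrt(2))
           for u, v in itertools.combinations(tetra, 2))

# The next ring of 6 vertices: from each one, four others lie at edge length
# sqrt(2) and one opposite vertex at distance 2 - a regular octahedron.
octa = [v for v in verts if sum(v) == 2]
assert len(octa) == 6
for u in octa:
    ds = sorted(round(dist(u, v), 9) for v in octa if v != u)
    assert ds == [round(math.sqrt(2), 9)] * 4 + [2.0]
print("neighbor rings in 4d: regular tetrahedron, then regular octahedron")
```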

&lt;p&gt;&lt;img src=&quot;/Downloads/Hypercubes/sliceTeserract.gif&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 6: Slicing a Tesseract. There are three cutting planes - a tetrahedron, octahedron and tetrahedron.&lt;/em&gt;&lt;/p&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">This blog is based on a video about slicing a four dimensional cube or Tesseract. Check it out.</summary></entry><entry><title type="html">Solve a system of polynomial equations - Buchberger’s algorithm for computing Groebner basis</title><link href="/jekyll/update/2017/01/23/buchbergers_algo_groebner_basis.html" rel="alternate" type="text/html" title="Solve a system of polynomial equations - Buchberger’s algorithm for computing Groebner basis" /><published>2017-01-23T03:34:40+00:00</published><updated>2017-01-23T03:34:40+00:00</updated><id>/jekyll/update/2017/01/23/buchbergers_algo_groebner_basis</id><content type="html" xml:base="/jekyll/update/2017/01/23/buchbergers_algo_groebner_basis.html">&lt;h2 id=&quot;what-are-polynomials&quot;&gt;What are polynomials?&lt;/h2&gt;
&lt;p&gt;First, let’s quickly define some basic terms.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Monomials are functions in one or more variables where we are only allowed multiplication and non-negative integer exponents. For example, \(x^2 y\) is a monomial in the variables \(x\) and \(y\).&lt;/li&gt;
  &lt;li&gt;Polynomials are functions in one or more variables that involve the operations of addition, subtraction, multiplication and non-negative integer exponents. This means, they are basically sums of monomial terms. Here is an example of a polynomial in three variables, \(x, y, z\).&lt;/li&gt;
&lt;/ul&gt;

\[5x^2 y + y^2 - z - 3\]
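&lt;p&gt;To make the definitions concrete, here is a toy representation (my own sketch, not any standard library): a polynomial stored as a map from exponent tuples (the monomials) to coefficients, together with an evaluator.&lt;/p&gt;

```python
# Toy representation: a polynomial as a dict mapping exponent tuples to
# coefficients over variables (x, y, z). The monomial x^2*y is the key
# (2, 1, 0), so the dict below is 5x^2y + y^2 - z - 3.
poly = {(2, 1, 0): 5, (0, 2, 0): 1, (0, 0, 1): -1, (0, 0, 0): -3}

def evaluate(p, point):
    # Sum coefficient * x^i * y^j * z^k over all monomial terms.
    total = 0
    for exps, coef in p.items():
        term = coef
        for base, e in zip(point, exps):
            term *= base ** e
        total += term
    return total

# At (x, y, z) = (1, 2, 3): 5*1*2 + 4 - 3 - 3 = 8
assert evaluate(poly, (1, 2, 3)) == 8
print(evaluate(poly, (1, 2, 3)))
```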

&lt;h2 id=&quot;linear-ideals&quot;&gt;Linear Ideals&lt;/h2&gt;
&lt;p&gt;Before getting into polynomials, let’s consider the simpler special case of linear equations (polynomials where the powers of the variables in each monomial sum to no more than one). Let’s say we are given a system of linear equations.&lt;/p&gt;

&lt;p&gt;\begin{equation} a_{1,1} x + a_{1,2} y - b_1 = 0 \end{equation}&lt;/p&gt;

&lt;p&gt;\begin{equation} a_{2,1} x + a_{2,2} y - b_2 = 0 \end{equation}&lt;/p&gt;

&lt;p&gt;Now, we can apply any single variable functions, \(f_1(u)\), \(f_2(u)\) such that \(f_1(0) = 0\) and \(f_2(0) = 0\) to the two left hand sides and the sum would still be an expression that equals \(0\).&lt;/p&gt;

&lt;p&gt;\begin{equation} f_1(a_{1,1} x + a_{1,2} y - b_1 ) + f_2(a_{2,1} x + a_{2,2} y - b_2) = 0 \end{equation}&lt;/p&gt;

&lt;p&gt;However, equations (1) and (2) are linear equations while this is not necessarily true for equation (3). The only functions \(f_1(u)\) and \(f_2(u)\) for which equation (3) is also a linear equation are of the form -&lt;/p&gt;

&lt;p&gt;\begin{equation} f_i(u) = c_i u \end{equation}&lt;/p&gt;

&lt;p&gt;Here, the \(c_i\) are constants, so these \(f_i\) are just linear equations that pass through the origin.
The operations described above would be the famous &lt;a href=&quot;http://www.purplemath.com/modules/mtrxrows.htm&quot;&gt;row operations&lt;/a&gt; if the equations were written in matrix form.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/GaussianElimination.gif&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now, the set of all linear equations obtained through this process is infinite, but in general smaller than the set of all possible linear equations that exist. We call this smaller set the “Ideal” of linear equations. Given the two equations (1) and (2), we can generate every other linear equation in this Ideal. So, the two equations form a basis of this ideal. However, we could choose two other equations from the ideal and those would form a valid basis as well (say, the first one obtained by adding equations (1) and (2) and the second by subtracting (2) from (1)). The natural question then becomes: which of these bases is the “best”? There is in fact a basis which looks a lot more natural and elegant than all the others. And this is the &lt;a href=&quot;https://en.wikipedia.org/wiki/Row_echelon_form&quot;&gt;Echelon form&lt;/a&gt;, where each equation involves as few variables as possible (ideally just one). This makes the region where all of these equations are solved simultaneously very obvious to see (if each equation involves just one variable, solving for that variable is trivial). We can achieve this special basis through &lt;a href=&quot;http://mathworld.wolfram.com/GaussianElimination.html&quot;&gt;Gaussian elimination&lt;/a&gt;, where we systematically perform row operations until each equation involves as few variables as possible.&lt;/p&gt;
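&lt;p&gt;To make the row operations concrete, here is a minimal sketch of Gaussian elimination for a two-equation system (my own illustration; it assumes the pivots are nonzero and does no row swapping):&lt;/p&gt;

```python
# Reduce the augmented matrix [A | b] of a 2x2 system to reduced row echelon
# form with the row operations described above. Assumes nonzero pivots; a
# real implementation would also swap rows (partial pivoting).
def rref_2x2(m):
    m = [row[:] for row in m]
    m[0] = [v / m[0][0] for v in m[0]]              # scale row 0: leading 1
    f = m[1][0]
    m[1] = [v - f * w for v, w in zip(m[1], m[0])]  # eliminate x from row 1
    m[1] = [v / m[1][1] for v in m[1]]              # scale row 1: leading 1
    f = m[0][1]
    m[0] = [v - f * w for v, w in zip(m[0], m[1])]  # eliminate y from row 0
    return m

# x + 2y = 5 and 3x + 4y = 11 reduce to x = 1 and y = 2.
m = rref_2x2([[1.0, 2.0, 5.0], [3.0, 4.0, 11.0]])
print(m[0][2], m[1][2])
```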

&lt;h2 id=&quot;polynomial-ideals&quot;&gt;Polynomial Ideals&lt;/h2&gt;

&lt;p&gt;Polynomial ideals are a generalization of the linear ideals above from systems of linear equations to polynomial equations. Let’s again consider a system of (two) quadratic equations in two variables.&lt;/p&gt;

&lt;p&gt;\begin{equation}g_a(x, y) = a_{0,0} + a_{1,0} x + a_{0,1} y + a_{2,0} x^2 + a_{0,2} y^2 + a_{1, 1} xy = 0 \end{equation}&lt;/p&gt;

&lt;p&gt;\begin{equation}g_b(x, y) = b_{0,0} + b_{1,0} x + b_{0,1} y + b_{2,0} x^2 + b_{0,2} y^2 + b_{1, 1} xy = 0 \end{equation}&lt;/p&gt;

&lt;p&gt;Again, applying functions \(f_1\) and \(f_2\) to these equations still results in a valid expression, but that expression is not necessarily a polynomial equation. So, what should the functions be to ensure that the resulting equation is also a polynomial?&lt;/p&gt;

&lt;p&gt;Unlike linear equations (where multiplying two of them produces a quadratic, not necessarily linear equation), if we multiply two polynomial equations, we just get another polynomial equation. So, we can multiply both equations (5) and (6) by any other polynomials in the same variables (\(x\) and \(y\)) and still end up with polynomials. We can then add these new polynomials together and get yet another polynomial. In other words, we can consider \(f_i(u) = c_i u\) where now, \(c_i\) is a polynomial in all the variables that \(g_a\) and \(g_b\) are polynomials in (here \(x\) and \(y\)). Then, we get -&lt;/p&gt;

&lt;p&gt;\begin{equation} g_a(x,y) c_a(x,y) + g_b(x,y) c_b(x,y) = 0\end{equation}&lt;/p&gt;

&lt;p&gt;For different polynomial functions, \(c_a\) and \(c_b\), we will end up with different polynomial equations. These will again (just like with linear equations) form an infinite set which is in general smaller than the set of all possible polynomial equations in \(x\) and \(y\). And since all we need to generate this set are \(g_a\) and \(g_b\), we can consider them a basis of this set. Just like with linear equations, the set of these polynomial equations is called the “Ideal” of polynomial equations. And again, we can think of alternate bases for this ideal (other than (\(g_a\), \(g_b\))). Like the Echelon form for linear equations, can we find a collection of polynomial equations where each of them involves as few variables as possible? This would again make the task of finding points that satisfy them simultaneously very easy. For polynomial ideals, this special basis is called a “Groebner basis”. It is a fairly recent invention (1965).&lt;/p&gt;
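&lt;p&gt;The closure property described above is easy to see in code. In this toy sketch (my own representation, with polynomials as maps from exponent pairs to coefficients), multiplying the generators by polynomial coefficients and adding always yields another polynomial, and that polynomial still vanishes at any common root of \(g_a\) and \(g_b\):&lt;/p&gt;

```python
from collections import defaultdict

# Polynomials in (x, y) as dicts from exponent pairs to coefficients.
def multiply(p, q):
    out = defaultdict(int)
    for (i, j), a in p.items():
        for (k, l), b in q.items():
            out[(i + k, j + l)] += a * b
    return dict(out)

def add(p, q):
    out = defaultdict(int)
    for d in (p, q):
        for e, a in d.items():
            out[e] += a
    return {e: a for e, a in out.items() if a != 0}

g_a = {(2, 0): 1, (0, 0): -1}  # x^2 - 1
g_b = {(1, 1): 1, (0, 1): -1}  # xy - y
c_a = {(0, 1): 1}              # c_a = y
c_b = {(0, 0): 1}              # c_b = 1

# g_a * c_a + g_b * c_b is again a polynomial: x^2*y + xy - 2y.
combo = add(multiply(g_a, c_a), multiply(g_b, c_b))
assert combo == {(2, 1): 1, (1, 1): 1, (0, 1): -2}

# (1, 1) is a common root of g_a and g_b, so the combination vanishes there
# too (evaluating at x = y = 1 just sums the coefficients).
assert sum(combo.values()) == 0
print(combo)
```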

&lt;h2 id=&quot;applications-of-groebner-basis&quot;&gt;Applications of Groebner Basis&lt;/h2&gt;
&lt;p&gt;It is beyond the scope of this blog to go into the details of how Groebner bases are calculated and what their properties are. For that, you can read &lt;a href=&quot;http://www.dm.unipi.it/~caboara/Misc/Cox,%20Little,%20O'Shea%20-%20Ideals,%20varieties%20and%20algorithms.pdf&quot;&gt;the book on Ideals, Varieties and Algorithms by Cox, Little and O’Shea&lt;/a&gt; (henceforth, “CLO”), which I highly encourage you to do (I had personally completed the first two chapters at the time of writing). We can however, treat them like a magic black box that produces a special type of basis for polynomial ideals. A basis that makes all problems go away. And by doing so, we can solve a surprising array of very practical problems.&lt;/p&gt;

&lt;p&gt;As you can imagine, systems of polynomial equations occur in all kinds of applications. And whenever they do, Buchberger’s algorithm and Groebner bases are there to foil them.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.hrpub.org/download/201306/ujcmj.2013.010102.pdf&quot;&gt;This paper&lt;/a&gt; is an excellent resource on this. Here are some examples -&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/Downloads/graph.svg.png&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Optimization problems where the objective function and constraints are polynomial equations. Minimizing sum of squared errors immediately comes to mind.&lt;/li&gt;
  &lt;li&gt;Robotic motion planning - let’s imagine a robotic arm consisting of various limbs. We can imagine that the length of each limb will be fixed, so the range of motion it can perform in isolation will be restricted to a circle at the hinge. When we consider many such limbs and want to find the best way to reach for an object, we get a system of polynomial equations.&lt;/li&gt;
  &lt;li&gt;Finding the minimum number of colors needed to paint a graph such that no adjacent vertices have the same color. This is called the chromatic number of the graph.&lt;/li&gt;
&lt;/ul&gt;
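&lt;p&gt;For the graph coloring application, the standard encoding from the literature assigns each vertex a variable whose value must be a \(k\)-th root of unity (one root per color), with one polynomial per vertex and one per edge. The sketch below (my own, for \(k=3\) and a triangle graph) checks that a valid coloring makes all the encoding polynomials vanish:&lt;/p&gt;

```python
import cmath, math

# Standard encoding of graph 3-coloring as polynomials (colors = cube roots
# of unity): each vertex i satisfies x_i^3 - 1 = 0, and each edge (i, j)
# satisfies x_i^2 + x_i*x_j + x_j^2 = 0, which forces x_i != x_j.
# Example graph: a triangle, which needs all three colors.
roots = [cmath.exp(2j * math.pi * k / 3) for k in range(3)]
edges = [(0, 1), (1, 2), (0, 2)]
coloring = {0: roots[0], 1: roots[1], 2: roots[2]}

for i, x in coloring.items():
    assert cmath.isclose(x ** 3, 1, abs_tol=1e-9)  # vertex polynomial vanishes
for i, j in edges:
    x, y = coloring[i], coloring[j]
    # The edge polynomial vanishes exactly when the two colors differ.
    assert cmath.isclose(x * x + x * y + y * y, 0, abs_tol=1e-9)
print("a proper 3-coloring of the triangle satisfies every encoding polynomial")
```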

&lt;h2 id=&quot;implementation-in-c&quot;&gt;Implementation in C#&lt;/h2&gt;
&lt;p&gt;The algorithms provided in the CLO book were implemented in &lt;a href=&quot;https://github.com/ryu577/GroebnerBasis&quot;&gt;this git repo&lt;/a&gt;. While there are other implementations of Buchberger’s algorithm on Github, I felt a new one was needed since -&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;None of them were in C#.&lt;/li&gt;
  &lt;li&gt;None of them were well commented and designed to be easily understandable. So, I took care to comment this code really well and since it follows &lt;a href=&quot;http://www.dm.unipi.it/~caboara/Misc/Cox,%20Little,%20O'Shea%20-%20Ideals,%20varieties%20and%20algorithms.pdf&quot;&gt;the most popular book on this topic&lt;/a&gt; very closely, it is a very good companion for someone reading through it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the &lt;a href=&quot;https://github.com/ryu577/GroebnerBasis&quot;&gt;code&lt;/a&gt;, you will find C# classes for most of the algebraic objects discussed in the previous sections and methods implementing the algorithms from the book. The various classes that are a part of this solution form a hierarchical structure:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;PolynomialBasis
│
└───Polynomial
    │
    └───Monomial
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What this means is that a collection of monomials form a polynomial and a collection of polynomials form a polynomial basis. Since a polynomial basis is simply a set of polynomials, I used a HashSet to store the polynomials. However, when we think of monomials in a polynomial, the order matters. Also, we need to store not just the monomials but also their coefficients. So, I used a SortedDictionary of monomials to back up the polynomial object.&lt;/p&gt;
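&lt;p&gt;The ordering idea carries over to Python almost for free (this is my own sketch, not the C# code): if monomials are exponent tuples, then lexicographic monomial order is just native tuple comparison, so sorting a polynomial’s terms surfaces the leading term.&lt;/p&gt;

```python
# Monomials as exponent tuples over (x, y, z); lexicographic monomial order
# is then Python's built-in tuple comparison. Terms of 5x^2y + y^2 - z - 3:
poly = {(2, 1, 0): 5, (0, 2, 0): 1, (0, 0, 1): -1, (0, 0, 0): -3}
terms = sorted(poly.items(), reverse=True)  # descending lex order
print(terms[0])  # the leading term
```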

&lt;p&gt;The code starts with the building blocks like LCM (Least Common Multiple), polynomial division, etc. and builds up to Groebner bases. Some applications of Groebner bases are also provided. You can execute the code via Program.cs. In the comments and documentation for the code, “CLO” refers to &lt;a href=&quot;http://www.dm.unipi.it/~caboara/Misc/Cox,%20Little,%20O'Shea%20-%20Ideals,%20varieties%20and%20algorithms.pdf&quot;&gt;the CLO book&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s see an example of how to use this code to solve systems of polynomial equations.&lt;/p&gt;

&lt;p&gt;A new Polynomial basis object can be instantiated through strings which contain polynomial equations in human readable form -&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ryu577/6d248504f855196f4e862eb525ed8921.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;And then, the Buchbergers algorithm can be called to simplify the basis to a more natural Groebner basis. Then, we use the pretty print function to show this new basis.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ryu577/7edeabee77b66f793972dcbb37856265.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;This will produce output in the following format -&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Polynomial - 0
 + 0.67-0.78 z - 0.22 z^2 - 0.11 z^3 +  x

Polynomial - 1
-2.67 - 0.22 z + 0.22 z^2 + 0.11 z^3 +  y

Polynomial - 2
-18 - 12z + 13z^2 + 5z^3 +  z^4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice that the last polynomial involves only \(z\). We can use it to find possible solutions for this variable. The second to last one then, contains only \(z\) and \(y\). So, the values of \(z\) calculated from the previous equation can be substituted to get possible values of \(y\). Similarly, the first equation can be used to calculate \(x\).&lt;/p&gt;
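&lt;p&gt;Here is a sketch of that back-substitution (my own, using the rounded coefficients printed above, so the recovered \(x\) and \(y\) are only approximate): bisection finds a root of the \(z\)-only quartic between 1 and 2, and the other two polynomials then hand us \(y\) and \(x\) directly.&lt;/p&gt;

```python
# Solve the z-only quartic from "Polynomial - 2" by bisection, then back-
# substitute into the other two (rounded) basis polynomials for y and x.
def p2(z):
    return -18 - 12 * z + 13 * z ** 2 + 5 * z ** 3 + z ** 4

lo, hi = 1.0, 2.0        # p2(1) = -11 and p2(2) = 66, so a root lies between
while hi - lo > 1e-12:
    mid = (lo + hi) / 2
    if p2(mid) > 0:
        hi = mid
    else:
        lo = mid
z = (lo + hi) / 2

x = -0.67 + 0.78 * z + 0.22 * z ** 2 + 0.11 * z ** 3   # from Polynomial - 0
y = 2.67 + 0.22 * z - 0.22 * z ** 2 - 0.11 * z ** 3    # from Polynomial - 1
print(z, x, y)
```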

&lt;p&gt;Now, let’s peer at the machinery behind this. Here is the simplest (and most inefficient) version of Buchberger’s algorithm, as provided in section 2.7, Theorem 2, for computing the Groebner basis (\(g_1, g_2, \dots , g_t\)) of a polynomial ideal, starting with an arbitrary basis (\(f_1, f_2, \dots , f_s\)). Don’t peer at it too closely here before reading the background in the book. This is meant as an illustration of how the algorithms in the book translate to the C# code. Note that the more efficient version of Buchberger’s algorithm given in theorem 11 in section 2.9 is also implemented.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;INPUT: \(F = (f_1,f_2, \dots , f_s)\)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;OUTPUT: A Groebner basis \(G = (g_1, \dots , g_t )\) for \(I\), with \(F \subset G\)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;\(G := F\)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;REPEAT&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;   \(G' = G\)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;   FOR EACH PAIR \(\{p ,q\}\), \(p \neq q\) in \(G'\) DO&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;     \(S := \overline{S(p, q)} ^ {G'}\)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;     IF \(S \neq 0\) THEN \(G := G \cup {S}\)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;UNTIL \(G = G'\)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
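&lt;p&gt;Before looking at the C#, here is a toy Python transcription of the same pseudocode (my own sketch, deliberately unoptimized and not the repo’s implementation): polynomials are maps from exponent tuples to fractions, and lex order is plain tuple comparison.&lt;/p&gt;

```python
from fractions import Fraction
from itertools import combinations

# Polynomials are dicts mapping exponent tuples to Fraction coefficients.
# Lex monomial order is just Python tuple comparison.

def term_mul(p, exp, coef):
    # Multiply polynomial p by the single term coef * x^exp.
    return {tuple(a + b for a, b in zip(e, exp)): c * coef for e, c in p.items()}

def add(p, q):
    out = dict(p)
    for e, c in q.items():
        out[e] = out.get(e, Fraction(0)) + c
    return {e: c for e, c in out.items() if c != 0}

def lt(p):
    # Leading (exponent, coefficient) under lex order.
    e = max(p)
    return e, p[e]

def divides(e, f):
    return all(b >= a for a, b in zip(e, f))

def remainder(p, G):
    # Multivariate division of p by the set G; returns the remainder.
    r = {}
    while p:
        e, c = lt(p)
        g = next((g for g in G if divides(lt(g)[0], e)), None)
        if g is None:
            r[e] = c
            p = {f: d for f, d in p.items() if f != e}
        else:
            ge, gc = lt(g)
            quot = tuple(a - b for a, b in zip(e, ge))
            p = add(p, term_mul(g, quot, -c / gc))
    return r

def s_poly(p, q):
    # S(p, q): cancel the leading terms against their least common multiple.
    (pe, pc), (qe, qc) = lt(p), lt(q)
    gamma = tuple(max(a, b) for a, b in zip(pe, qe))
    tp = term_mul(p, tuple(a - b for a, b in zip(gamma, pe)), Fraction(1) / pc)
    tq = term_mul(q, tuple(a - b for a, b in zip(gamma, qe)), Fraction(1) / qc)
    return add(tp, {e: -c for e, c in tq.items()})

def buchberger(F):
    # Naive Buchberger loop, exactly as in the pseudocode above.
    G = list(F)
    while True:
        new = []
        for p, q in combinations(G, 2):
            s = remainder(s_poly(p, q), G)
            if s and s not in G + new:
                new.append(s)
        if not new:
            return G
        G = G + new

# Example: f1 = x^2 - y, f2 = x^3 - x over variables (x, y), lex order.
f1 = {(2, 0): Fraction(1), (0, 1): Fraction(-1)}
f2 = {(3, 0): Fraction(1), (1, 0): Fraction(-1)}
G = buchberger([f1, f2])
print(len(G))
```

&lt;p&gt;On the example \(x^2 - y\), \(x^3 - x\), the loop adds two new elements before every S-polynomial finally reduces to zero.&lt;/p&gt;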

&lt;p&gt;And here it is implemented in C# (from the PolynomialBasis.cs file from the &lt;a href=&quot;https://github.com/ryu577/GroebnerBasis&quot;&gt;repo&lt;/a&gt;):&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ryu577/f6969ce0dfe43515e2849df1b881cd71.js&quot;&gt;&lt;/script&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">What are polynomials? First, let’s define very quickly some basic terms.</summary></entry></feed>