Python on adventures in optimization https://ryanjoneil.dev/tags/python/

👔 Hierarchical Optimization with HiGHS https://ryanjoneil.dev/posts/2024-11-11-hierarchical-optimization-with-highs/ Mon, 11 Nov 2024 00:00:00 +0000 Managing trade-offs between different objectives with HiGHS. In the last post, we used Gurobi’s hierarchical optimization features to compute the Pareto front for primary and secondary objectives in an assignment problem. This relied on Gurobi’s setObjectiveN method and its internal code for managing hierarchical problems.

Some practitioners may need to do this without access to a commercial license. This post adapts the previous example to use HiGHS and its native Python interface, highspy. Walking through the procedure by hand is also a good way to understand it better. This isn’t exactly what I’d call hard, but it is easy to mess up.1

Code

The mathematical models are available in the last post, so I won’t restate them here. We start in roughly the same manner as before2: create a binary variable for each worker-patient pair, add assignment problem constraints, and state the primary objective.

from itertools import product
import highspy

n = len(data["cost"])
workers = range(n)
patients = range(n)
workers_patients = list(product(workers, patients))

h = highspy.Highs()

# x[w,p] = 1 if worker w is assigned to patient p.
x = {(w, p): h.addBinary(obj=data["cost"][w][p]) for w, p in workers_patients}

# Each worker is assigned to one patient.
h.addConstrs(sum(x[w, p] for p in patients) == 1 for w in workers)

# Each patient is assigned one worker.
h.addConstrs(sum(x[w, p] for w in workers) == 1 for p in patients)

# Primary objective: minimize cost.
h.setMinimize()
h.solve()
cost = h.getObjectiveValue()

Note that if the costs and affinities were lists instead of matrices, we could have used h.addBinaries instead of h.addBinary.
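
To see what that flat layout looks like without involving the solver, here is a stdlib-only sketch: flattening a cost matrix with `itertools.product` yields coefficients in the same worker-major order as `workers_patients` above (the toy numbers are my own).

```python
from itertools import product

cost = [[10, 20], [30, 40]]                # toy 2x2 cost matrix
n = len(cost)
pairs = list(product(range(n), range(n)))  # same order as workers_patients
flat = [cost[w][p] for w, p in pairs]      # row-major flattening
assert flat == [10, 20, 30, 40]
```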

From here we’ll be solving the model twice for every value of alpha. These expressions for total cost and affinity will make the code a little cleaner.

cost_expr = sum(data["cost"][w][p] * x[w, p] for w, p in workers_patients)
affinity_expr = sum(data["affinity"][w][p] * x[w, p] for w, p in workers_patients)

Now comes the hierarchical optimization logic. For every value of alpha, we find the best affinity possible while keeping cost within alpha of its best possible value.

  • Update the objective function to maximize affinity (see the calls to h.changeColCost and h.setMaximize).
  • Constrain the cost to be within alpha of the original optimal cost (see cost_cons).
  • Re-optimize and save the maximal affinity.

Now we constrain the affinity and re-optimize cost.3

  • Update the objective function to minimize cost again.
  • Constrain the affinity.

Once that’s done, we remove the additional constraints and repeat for a new value of alpha.
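
The loop below iterates over an `alphas` sequence that isn’t defined in the snippets shown here. Based on the results table, it is presumably the values from 0 to 1 in steps of 0.05, e.g.:

```python
# 0.0, 0.05, ..., 1.0 -- the 21 values appearing in the results table
alphas = [i * 0.05 for i in range(21)]
assert len(alphas) == 21
assert abs(alphas[-1] - 1.0) < 1e-9
```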

for alpha in alphas:
    # Secondary objective: maximize affinity.
    for (w, p), x_wp in x.items():
        h.changeColCost(x_wp.index, data["affinity"][w][p])

    # Constrain cost to be within alpha of its optimal value.
    cost_cons = h.addConstr(cost_expr <= (1 + alpha) * cost)

    h.setMaximize()
    h.solve()
    affinity = h.getObjectiveValue()

    # Re-optimize with original cost objective, constraining affinity.
    for (w, p), x_wp in x.items():
        h.changeColCost(x_wp.index, data["cost"][w][p])
    affinity_cons = h.addConstr(affinity_expr >= affinity)

    h.setMinimize()
    h.solve()

    yield alpha, h.getObjectiveValue(), affinity

    # Remove the cost and affinity constraints for the next iteration.
    h.removeConstr(cost_cons)
    h.removeConstr(affinity_cons)

Encouragingly, running this using the model.py linked below gives the same values as the Gurobi model, albeit not as quickly. Floating point values are rounded for readability.

| alpha | cost     | affinity |
| ----- | -------- | -------- |
| 0.0   | 11212.0  | 53816.0  |
| 0.05  | 11761.0  | 74001.0  |
| 0.1   | 12332.0  | 79981.0  |
| 0.15  | 12886.0  | 83103.0  |
| 0.2   | 13454.0  | 85394.0  |
| 0.25  | 13996.0  | 87136.0  |
| 0.3   | 14557.0  | 88546.0  |
| 0.35  | 15125.0  | 89751.0  |
| 0.4   | 15670.0  | 90664.0  |
| 0.45  | 16255.0  | 91345.0  |
| 0.5   | 16816.0  | 91997.0  |
| 0.55  | 17370.0  | 92537.0  |
| 0.6   | 17924.0  | 93012.0  |
| 0.65  | 18495.0  | 93491.0  |
| 0.7   | 19055.0  | 93829.0  |
| 0.75  | 19591.0  | 94228.0  |
| 0.8   | 20167.0  | 94530.0  |
| 0.85  | 20737.0  | 94833.0  |
| 0.9   | 21295.0  | 95114.0  |
| 0.95  | 21812.0  | 95361.0  |
| 1.0   | 22402.0  | 95613.0  |

Resources

  • model.py hierarchical objectives HiGHS model

  1. It gets even easier to mess up with more than two objectives. ↩︎

  2. Isn’t it nice that MIP modeling is similar across different APIs? ↩︎

  3. Exercise for the reader: why do we need to re-optimize cost? ↩︎

👔 Hierarchical Optimization with Gurobi https://ryanjoneil.dev/posts/2024-11-08-hierarchical-optimization-with-gurobi/ Fri, 08 Nov 2024 00:00:00 +0000 Managing trade-offs between different objectives with Gurobi. One of the first technology choices to make when setting up an optimization stack is which modeling interface to use. Even if we restrict our choices to Python interfaces for MIP modeling, there are lots of options to consider.

If you use a specific solver, you can opt for its native Python interface. Examples include libraries like gurobipy, Fusion, highspy, or PySCIPOpt. This approach provides access to important solver-specific features such as lazy constraints, heuristics, and various solver settings. However, it can also lock you into a solver before you’re ready for that.

You can also choose a modeling API that targets multiple solvers. In the Python ecosystem, these are libraries like amplpy, Pyomo, PyOptInterface, and linopy. These interfaces target multiple solver backends (both open source and commercial) and provide a subset of the functionality of each. Since they make it easy to switch between solvers, this is usually where I start.1

Hierarchical assignment

However, there are plenty of times when solver-specific APIs are useful, or even critical. One example is hierarchical optimization. This is a simple technique for managing trade-offs between multiple objectives in a problem. Let’s look at an example.

Imagine we are assigning in-home health care workers ($w \in W$) to patients ($p \in P$). For simplicity, let’s say we have $n$ workers and $n$ patients, and we are assigning them one-to-one. Each worker has a given cost ($c_{wp}$) of assignment to each patient, which may reflect something like the travel time to get to them. We want to assign each worker to exactly one patient while minimizing the overall cost.

Model

So far, what we have is a simple linear sum assignment problem.

$$ \begin{align*} & \text{min} && z = \sum_{wp} c_{wp} x_{wp} \\ & \text{s.t.} && \sum_w x_{wp} = 1 && \forall \quad p \in P \\ & && \sum_p x_{wp} = 1 && \forall \quad w \in W \\ & && x \in \{0,1\}^{|W \times P|} \end{align*} $$

Solving this model gives us the minimum cost assignment. That’s all well and good, but now say we have a secondary objective of maximizing affinity of workers to patients ($a_{wp}$). That is, we want to prefer assignments that increase overall affinity while still minimizing cost. This is actually a common goal in health care scheduling: if possible, send the same worker to a given patient that you usually send.

Hierarchical optimization gives us a simple way to solve this problem. First, we optimize the model as stated above. This gives us an optimal objective value $z^*$. Then we re-solve the same optimization model with the secondary objective function, constraining the cost to be at most $z^*$. This says to the optimizer, “improve the affinity as much as you can, but keep the cost optimal.”

$$ \begin{align*} & \text{max} && w = \sum_{wp} a_{wp} x_{wp} \\ & \text{s.t.} && \sum_{wp} c_{wp} x_{wp} \le z^* \\ & && \sum_w x_{wp} = 1 && \forall \quad p \in P \\ & && \sum_p x_{wp} = 1 && \forall \quad w \in W \\ & && x \in \{0,1\}^{|W \times P|} \end{align*} $$

From here, the natural question becomes: what if we trade off some cost for affinity? If we’re willing to increase cost by some percentage, how much more affinity do we get? We can do this by setting a constant $\alpha \ge 0$ and solving the model a number of times.2

$$ \begin{align*} & \text{max} && w = \sum_{wp} a_{wp} x_{wp} \\ & \text{s.t.} && \sum_{wp} c_{wp} x_{wp} \le (1 + \alpha) z^* \\ & && \sum_w x_{wp} = 1 && \forall \quad p \in P \\ & && \sum_p x_{wp} = 1 && \forall \quad w \in W \\ & && x \in \{0,1\}^{|W \times P|} \end{align*} $$

For example, if $\alpha = 0.05$, then we’re willing to accept a 5% increase in overall cost to improve affinity. Setting different values of $\alpha$ lets us explore the space of that trade-off and its impact on cost and affinity.

Once we solve this and get the optimal affinity ($w^*$), we should re-optimize for the primary objective again while constraining the secondary one.

$$ \begin{align*} & \text{min} && \sum_{wp} c_{wp} x_{wp} \\ & \text{s.t.} && \sum_{wp} a_{wp} x_{wp} \ge w^* \\ & && \sum_w x_{wp} = 1 && \forall \quad p \in P \\ & && \sum_p x_{wp} = 1 && \forall \quad w \in W \\ & && x \in \{0,1\}^{|W \times P|} \end{align*} $$

Code

So the math looks reasonable. How do we implement it? If we have a Gurobi license, we can use its built-in facilities for multiobjective optimization. This means that, instead of solving a model multiple times and adding constraints to keep cost within $\alpha$ of its optimal value, we can create a single model that does all of this for us.

Assume we have input data which looks like this.

{
    "cost": [
        [10, 20, ...],
        [30, 40, ...],
        ...
    ],
    "affinity": [
        [25, 15, ...],
        [35, 25, ...],
        ...
    ]
}
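
The posts don’t show how this instance is built. One plausible way to generate random data of this shape (the size `n` and the value ranges here are made up for illustration):

```python
import random

n = 25  # hypothetical problem size
data = {
    "cost": [[random.randint(10, 1000) for _ in range(n)] for _ in range(n)],
    "affinity": [[random.randint(0, 200) for _ in range(n)] for _ in range(n)],
}

# Both matrices are square and the same size, as the model requires.
assert len(data["cost"]) == len(data["affinity"]) == n
assert all(len(row) == n for row in data["cost"] + data["affinity"])
```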

We start with a simple assignment problem formulation.

import gurobipy as gp

n = len(data["cost"])
workers = range(n)
patients = range(n)

m = gp.Model()
m.ModelSense = gp.GRB.MINIMIZE

# x[w,p] = 1 if worker w is assigned to patient p.
x = m.addVars(n, n, vtype=gp.GRB.BINARY)

for i in range(n):
    # Each worker is assigned to one patient.
    m.addConstr(gp.quicksum(x[i, p] for p in patients) == 1)

    # Each patient is assigned one worker.
    m.addConstr(gp.quicksum(x[w, i] for w in workers) == 1)

We add primary and secondary objectives, and call optimize. The objectives are solved in descending order of the priority flag for Model.setObjectiveN. reltol allows us to degrade the primary objective by some amount (e.g. 5%) to improve the secondary objective.

One catch is that the model only has one objective sense. Since we are minimizing the primary objective, we give the secondary objective a weight of -1 in order to maximize it.

from itertools import product

# Primary objective: minimize cost.
z = (data["cost"][w][p] * x[w, p] for w, p in product(workers, patients))
m.setObjectiveN(expr=gp.quicksum(z), index=0, name="cost", priority=1, reltol=alpha)

# Secondary objective: maximize affinity. Since the model sense is minimize,
# we negate the secondary objective in order to maximize it.
w = (data["affinity"][w][p] * x[w, p] for w, p in product(workers, patients))
m.setObjectiveN(
    expr=gp.quicksum(w), index=1, name="affinity", priority=0, weight=-1
)

m.optimize()

Then we use this magic syntax to pull out the optimal cost and affinity.

m.params.ObjNumber = 0
cost = m.ObjNVal

m.params.ObjNumber = 1
affinity = m.ObjNVal

Results

If we solve this in a loop with alpha values from 0 to 1 in increments of 0.05, we can plot the trade-off between cost and affinity. Going from $\alpha = 0$ to $\alpha = 0.05$ or $\alpha = 0.1$ gives a pretty sizable improvement in affinity. After that, the return starts to gradually level off. This allows us to make a more informed choice about these two objectives.

Pareto front - cost vs affinity

  1. While commercial libraries like AMPL have always focussed on modeling performance, some of the open source options targeting multiple solvers come with significant performance penalties during formulation and model handoff to the solver. Newer options like linopy (benchmarks) and PyOptInterface (benchmarks) don’t have that issue. ↩︎

  2. This gives us a Pareto front, which explores the trade-offs between different objectives. ↩︎

👾 Detecting Polygon Intersections https://ryanjoneil.dev/posts/2015-09-27-detecting-polygon-intersections/ Sun, 27 Sep 2015 00:00:00 +0000 Detecting when two polygons touch or overlap. Note: This post has been updated to work with HiGHS.

A fun geometry problem to think about is: given two polygons, do they intersect? That is, do they touch on the border or overlap? Does one reside entirely within the other? While this question has obvious applications in computer graphics (see: arcade games of the 1980s), it’s also important in areas such as cutting and packing problems.

There are a number of ways to answer this. In computer graphics, the problem is often approached using a clipping algorithm. This post examines a couple of simpler techniques using linear inequalities and properties of convexity. To simplify the presentation, we assume we’re only interested in convex polygons in two dimensions. We also assume that rotation is not an issue. That is, if one of the polygons is rotated, we can simply re-test to see if they overlap.

Don’t get clobbered by an asteroid!

Problem

Let’s say we have two objects: a right triangle and a square. We can place them anywhere inside a larger rectangle. The triangle has vertices:

$$\{\left(x_t, y_t\right), \left(x_t, y_t + a\right), \left(x_t + a, y_t\right)\}$$

The square has vertices:

$$\{\left(x_s, y_s\right), \left(x_s, y_s + a\right), \left(x_s + a, y_s + a\right), \left(x_s + a, y_s\right)\}$$

We will be given $\left(x_t, y_t\right)$, $\left(x_s, y_s\right)$, and $a$, but we do not know them a priori. We would like to know, for any set of values these can take, whether or not the triangle and square they define intersect.

$\left(x_t, y_t\right)$ and $\left(x_s, y_s\right)$ are the offsets of the triangle and square with respect to the bottom left corner of the rectangle. If they are far enough apart in any direction, the two objects do not intersect. The figure below shows such a case, with small gray circles representing $\left(x_t, y_t\right)$ and $\left(x_s, y_s\right)$.

Polygons that do not intersect

However, if they are too close in some manner, the objects will either touch or overlap, as shown below.

Polygons that intersect

The two polygons can intersect in a few different ways. They may touch on their borders, in which case they will share a single point or line segment. They may overlap such that their intersecting region has nonzero relative interior but each polygon contains points outside the other. Or one of them might live entirely within the other, so that the former is a subset of the latter. Our goal is to determine if any of these cases are true given any $\left(x_t, y_t\right)$, $\left(x_s, y_s\right)$, and $a$.

Method 1. Define the intersecting polygon with linear inequalities

The first method we use to detect intersection is based on the fact that our polygons themselves are the intersections of finite numbers of linear inequalities. Instead of defining them based on their vertices, we can equivalently represent them as the set of $\left(x, y\right)$ that satisfy a known inequality for each edge.

Let $S_t$ be the set of points in our triangle. It can be defined as follows. $x$ must be greater than or equal to $x_t$. $y$ must be greater than or equal to $y_t$. And $\left(x, y\right)$ must lie on or below the triangle’s hypotenuse, so $x + y$ is at most $x_t + y_t + a$. There are three sides on the triangle, so we have three inequalities.

$$ \begin{array}{rcl} S_t = \{\,\left(x, y\right) & | & x \ge x_t,\\ & & y \ge y_t,\\ & & x + y \le x_t + y_t + a \,\} \end{array} $$

Similarly, let $S_s$ be the set of points in our square. This set is defined using four inequalities, which are shown in a slightly compacted form.

$$ \begin{array}{rcl} S_s = \{\,\left(x, y\right) & | & x_s \le x \le x_s + a,\\ & & y_s \le y \le y_s + a \,\} \end{array} $$

Finally, let $S_i = S_t \cap S_s$ be the set of points that satisfy all seven inequalities.

$$ \begin{array}{rcl} S_i = \{\,\left(x, y\right) & | & x \ge x_t,\\ & & y \ge y_t,\\ & & x + y \le x_t + y_t + a,\\ & & x_s \le x \le x_s + a,\\ & & y_s \le y \le y_s + a \,\} \end{array} $$

If $S_i \ne \emptyset$, then there must exist some point that satisfies the inequalities of both the triangle and the square. This point resides in both of them, therefore they intersect. If $S_i = \emptyset$, then there is no such point and they do not intersect.
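
For these particular shapes the seven inequalities are simple enough to check without a solver: a common point exists exactly when the $x$ intervals overlap, the $y$ intervals overlap, and the smallest feasible $x + y$ clears the hypotenuse. A plain-Python version of that check (my own derivation from the inequalities above, not code from the original post):

```python
def intersects(xy_t, xy_s, a):
    """Return True iff S_i (the seven inequalities above) is nonempty."""
    x_t, y_t = xy_t
    x_s, y_s = xy_s
    # Smallest x and y satisfying all the one-sided bounds.
    x_min, y_min = max(x_t, x_s), max(y_t, y_s)
    return (
        x_min <= x_s + a                    # x intervals overlap
        and y_min <= y_s + a                # y intervals overlap
        and x_min + y_min <= x_t + y_t + a  # hypotenuse constraint
    )

assert intersects((0, 0), (0.5, 0.5), 1)   # touch on the hypotenuse
assert not intersects((0, 0), (5, 5), 1)   # far apart diagonally
assert not intersects((0, 0), (-5, 0), 1)  # square entirely left of triangle
```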

Method 2. Use convex combinations of the polygon vertices

Both of our polygons are convex. That is, they contain every convex combination of their vertices. So every point in the triangle, regardless of where it is located, can be represented as a linear combination of $\{\left(x_t, y_t\right), \left(x_t + a, y_t\right), \left(x_t, y_t + a\right)\}$ where $\lambda_1, \lambda_2, \lambda_3 \ge 0$ and $\lambda_1 + \lambda_2 + \lambda_3 = 1$.

We can define the set $S_t$ equivalently using this concept.

$$ S_t = \{\, \lambda_1 \left(\begin{array}{c} x_t \\ y_t \end{array}\right) + \lambda_2 \left(\begin{array}{c} x_t + a \\ y_t \end{array}\right) + \lambda_3 \left(\begin{array}{c} x_t \\ y_t + a \end{array}\right) \, | \\ \lambda_1 + \lambda_2 + \lambda_3 = 1, \\ \lambda_i \ge 0, \, i = 1, \ldots, 3 \, \} $$

Similarly, the square is defined as the convex combination of its vertices.

$$ S_s = \{\, \lambda_4 \left(\begin{array}{c} x_s \\ y_s \end{array}\right) + \lambda_5 \left(\begin{array}{c} x_s + a \\ y_s \end{array}\right) + \lambda_6 \left(\begin{array}{c} x_s \\ y_s + a \end{array}\right) + \lambda_7 \left(\begin{array}{c} x_s + a \\ y_s + a \end{array}\right) \, | \\ \lambda_4 + \lambda_5 + \lambda_6 + \lambda_7 = 1, \\ \lambda_i \ge 0, \, i = 4, \ldots, 7 \, \} $$

If there exists a point inside both the triangle and the square, then it must satisfy both convex combinations. Thus we can define our intersecting set $S_i$ as follows. (This is a little loose with the notation, but I think it makes the point a bit better.)

$$ \begin{array}{rl} S_i = \{\, & \\ & \lambda_1 \left(\begin{array}{c} x_t \\ y_t \end{array}\right) + \lambda_2 \left(\begin{array}{c} x_t + a \\ y_t \end{array}\right) + \lambda_3 \left(\begin{array}{c} x_t \\ y_t + a \end{array}\right) =\\ & \lambda_4 \left(\begin{array}{c} x_s \\ y_s \end{array}\right) + \lambda_5 \left(\begin{array}{c} x_s + a \\ y_s \end{array}\right) + \lambda_6 \left(\begin{array}{c} x_s \\ y_s + a \end{array}\right) + \lambda_7 \left(\begin{array}{c} x_s + a \\ y_s + a \end{array}\right),\\ & \lambda_1 + \lambda_2 + \lambda_3 = 1,\\ & \lambda_4 + \lambda_5 + \lambda_6 + \lambda_7 = 1,\\ & \lambda_i \ge 0, \, i = 1, \ldots, 7\\ \,\} & \end{array} $$

Just as before, if $S_i \ne \emptyset$, our polygons intersect.
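
For the triangle alone, the convex-combination view also gives a closed-form membership test: solving for the multipliers of the three vertices yields barycentric coordinates. A small sketch of that idea (again my own illustration, not from the original post):

```python
def in_triangle(px, py, x_t, y_t, a):
    """Point is in the triangle iff its three multipliers are all >= 0."""
    lam2 = (px - x_t) / a   # weight on vertex (x_t + a, y_t)
    lam3 = (py - y_t) / a   # weight on vertex (x_t, y_t + a)
    lam1 = 1 - lam2 - lam3  # weight on vertex (x_t, y_t)
    return lam1 >= 0 and lam2 >= 0 and lam3 >= 0

assert in_triangle(0.5, 0.5, 0, 0, 2)  # interior point
assert in_triangle(1, 1, 0, 0, 2)      # on the hypotenuse
assert not in_triangle(2, 2, 0, 0, 2)  # outside
```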

Code

Both models are pretty easy to implement using an LP solver. But they look very different. That’s because in the first method we’re thinking about the problem in terms of inequalities and in the second we’re thinking about it in terms of vertices. The code below generates a thousand random instances of the problem and tests that each method produces the same result.

import highspy
import random


def method1(xy_t, xy_s, a):
    x_t, y_t = xy_t
    x_s, y_s = xy_s

    h = highspy.Highs()
    h.silent()

    x = h.addVariable()
    y = h.addVariable()
    h.addConstrs(
        x_t <= x <= x_t + a,
        x_s <= x <= x_s + a,
        y_t <= y <= y_t + a,
        y_s <= y <= y_s + a,
        x + y <= x_t + y_t + a,
    )

    return h


def method2(xy_t, xy_s, a):
    x_t, y_t = xy_t
    x_s, y_s = xy_s

    h = highspy.Highs()
    h.silent()

    lm = [h.addVariable(lb=0, ub=1) for _ in range(7)]

    conv_xt = lm[0] * x_t + lm[1] * (x_t + a) + lm[2] * x_t
    conv_xs = lm[3] * x_s + lm[4] * (x_s + a) + lm[5] * x_s + lm[6] * (x_s + a)

    conv_yt = lm[0] * y_t + lm[1] * y_t + lm[2] * (y_t + a)
    conv_ys = lm[3] * y_s + lm[4] * y_s + lm[5] * (y_s + a) + lm[6] * (y_s + a)

    h.addConstrs(
        conv_xt == conv_xs,
        conv_yt == conv_ys,
        sum(lm[:3]) == 1,
        sum(lm[3:]) == 1,
    )

    return h


if __name__ == "__main__":
    problems1 = []
    problems2 = []

    for _ in range(1000):
        a = random.random() * 2.5 + 1
        x_t = random.random() * 10
        y_t = random.random() * 10
        x_s = random.random() * 10
        y_s = random.random() * 10

        problems1.append(method1([x_t, y_t], [x_s, y_s], a))
        problems2.append(method2([x_t, y_t], [x_s, y_s], a))

    overlap1 = []
    for h in problems1:
        h.solve()
        overlap1.append(h.getModelStatus())

    overlap2 = []
    for h in problems2:
        h.solve()
        overlap2.append(h.getModelStatus())

    assert overlap1 == overlap2

These aren’t necessarily the best ways to solve this particular problem, but they are quick and flexible. And they leverage existing solver technology. One downside is that they aren’t easy to adapt to certain decision making contexts. That is, we can use them to determine whether objects overlap, but not to force objects not to overlap. In the next post, we’ll go over another tool from computational geometry that allows us to embed decisions about the relative locations of objects in our models.

Exercises

  • We assumed convex polygons in this presentation. How might one extend the model to work on non-convex polygons? What problems does this introduce?
  • The two methods shown above are equivalent. How can this be proven?
  • This post only answers the question of whether two convex polygons intersect. Devise models for determining if they only touch, or if one is a subset of the other.
🏖️ Lagrangian Relaxation with Gurobi https://ryanjoneil.dev/posts/2012-09-22-lagrangian-relaxation-with-gurobi/ Sat, 22 Sep 2012 00:00:00 +0000 Solving integer programs with Lagrangian relaxation and Gurobi. Note: This post was updated to work with Python 3 and the 2nd edition of “Integer Programming” by Laurence Wolsey.

We’ve been studying Lagrangian Relaxation (LR) in the Advanced Topics in Combinatorial Optimization course I’m taking this term, and I had some difficulty finding a simple example covering its application. In case anyone else finds it useful, I’m posting a Python version for solving the Generalized Assignment Problem (GAP). This won’t discuss the theory of LR at all, just give example code using Gurobi.

Generalized assignment

The GAP as defined by Wolsey consists of a maximization problem subject to a set of set packing constraints followed by a set of knapsack constraints.

$$ \begin{align*} & \text{max} && \sum_i \sum_j c_{ij} x_{ij} \\ & \text{s.t.} && \sum_j x_{ij} \leq 1 && \forall i \\ & && \sum_i a_{ij} x_{ij} \leq b_j && \forall j \\ & && x_{ij} \in \{0, 1\} \end{align*} $$

Naive model

A naive version of this model using Gurobi might look like the following.

#!/usr/bin/env python

# This is the GAP per Wolsey, pg 208.
from gurobipy import Model, GRB, quicksum as qsum

m = Model("GAP per Wolsey")
m.modelSense = GRB.MAXIMIZE
m.setParam("OutputFlag", False)  # turns off solver chatter

b = [15, 15, 15]
c = [
    [6, 10, 1],
    [12, 12, 5],
    [15, 4, 3],
    [10, 3, 9],
    [8, 9, 5],
]
a = [
    [5, 7, 2],
    [14, 8, 7],
    [10, 6, 12],
    [8, 4, 15],
    [6, 12, 5],
]

# x[i][j] = 1 if i is assigned to j
x = [[m.addVar(vtype=GRB.BINARY) for _ in row] for row in c]

# sum j: x_ij <= 1 for all i
for x_i in x:
    m.addConstr(sum(x_i) <= 1)

# sum i: a_ij * x_ij <= b[j] for all j
for j, b_j in enumerate(b):
    m.addConstr(qsum(a[i][j] * x_i[j] for i, x_i in enumerate(x)) <= b_j)

# max sum i,j: c_ij * x_ij
m.setObjective(
    qsum(qsum(c_ij * x_ij for c_ij, x_ij in zip(c_i, x_i)) for c_i, x_i in zip(c, x))
)
m.optimize()

# Pull solution out of m.
print(f"z = {m.objVal}")
print("x = [")
for x_i in x:
    print(f"  {[1 if x_ij.x >= 0.5 else 0 for x_ij in x_i]}")
print("]")

The solver quickly finds the following optimal solution of this toy problem.

z = 46.0
x = [
  [0, 1, 0]
  [0, 1, 0]
  [1, 0, 0]
  [0, 0, 1]
  [0, 0, 0]
]

Lagrangian model

There are two sets of constraints we can dualize. It can be beneficial to apply Lagrangian Relaxation against problems composed of knapsack constraints, so we will dualize the set packing ones.

# sum j: x_ij <= 1 for all i
for x_i in x:
    model.addConstr(sum(x_i) <= 1)

We replace these with a new set of variables, penalties, which take the values of the slacks on the set packing constraints. We then modify the objective function, adding Lagrangian multipliers times these penalties.

Instead of optimizing once, we do so iteratively. An important consideration is that we may get nothing more than a dual bound from this process. An integer solution is not guaranteed to be primal feasible unless it satisfies complementary slackness conditions – for each dualized constraint, either its multiplier or its penalty must be zero.

We then set the initial multiplier values to 2 and use sub-gradient optimization with a step size of 1 / (iteration #) to adjust them.
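
In isolation, the multiplier update is `u ← max(u − s · penalty, 0)` with `s = 1/k`. Running it by hand with a penalty stuck at 1 (as happens for the fifth constraint in the output below) reproduces the sequence 2, 1, 0.5, 1/6, 0:

```python
u, trace = 2.0, [2.0]
for k in range(1, 5):
    s = 1.0 / k                # step size 1 / iteration number
    u = max(u - s * 1.0, 0.0)  # penalty fixed at 1.0 for this illustration
    trace.append(u)

assert trace[:3] == [2.0, 1.0, 0.5]
assert abs(trace[3] - 1 / 6) < 1e-9
assert trace[4] == 0.0
```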

#!/usr/bin/env python

# This is the GAP per Wolsey, pg 208, using Lagrangian Relaxation.
from gurobipy import Model, GRB, quicksum as qsum

m = Model("GAP per Wolsey with Lagrangian Relaxation")
m.modelSense = GRB.MAXIMIZE
m.setParam("OutputFlag", False)  # turns off solver chatter

b = [15, 15, 15]
c = [
    [6, 10, 1],
    [12, 12, 5],
    [15, 4, 3],
    [10, 3, 9],
    [8, 9, 5],
]
a = [
    [5, 7, 2],
    [14, 8, 7],
    [10, 6, 12],
    [8, 4, 15],
    [6, 12, 5],
]

# x[i][j] = 1 if i is assigned to j
x = [[m.addVar(vtype=GRB.BINARY) for _ in row] for row in c]

# As stated, the GAP has these following constraints. We dualize these into
# penalties instead, using variables so we can easily extract their values.
penalties = [m.addVar() for _ in x]

# Dualized constraints: sum j: x_ij <= 1 for all i
for p, x_i in zip(penalties, x):
    m.addConstr(p == 1 - sum(x_i))

# sum i: a_ij * x_ij <= b[j] for all j
for j, b_j in enumerate(b):
    m.addConstr(qsum(a[i][j] * x_i[j] for i, x_i in enumerate(x)) <= b_j)

# u[i] = Lagrangian Multiplier for the set packing constraint i
u = [2.0] * len(x)

# Re-optimize until either we have run a certain number of iterations
# or complementary slackness conditions apply.
for k in range(1, 101):
    # max sum i,j: c_ij * x_ij
    m.setObjective(
        qsum(
            # Original objective function
            sum(c_ij * x_ij for c_ij, x_ij in zip(c_i, x_i))
            for c_i, x_i in zip(c, x)
        )
        + qsum(
            # Penalties for dualized constraints
            u_j * p_j
            for u_j, p_j in zip(u, penalties)
        )
    )
    m.optimize()

    print(
        f"iteration {k}: z = {m.objVal}, u = {u}, penalties = {[p.x for p in penalties]}"
    )

    # Test for complementary slackness
    stop = True
    eps = 10e-6
    for u_i, p_i in zip(u, penalties):
        if abs(u_i) > eps and abs(p_i.x) > eps:
            stop = False
            break

    if stop:
        print("primal feasible & optimal")
        break

    else:
        s = 1.0 / k
        for i in range(len(x)):
            u[i] = max(u[i] - s * (penalties[i].x), 0.0)

# Pull solution out of m.
print(f"z = {m.objVal}")
print("x = [")
for x_i in x:
    print(f"  {[1 if x_ij.x >= 0.5 else 0 for x_ij in x_i]}")
print("]")

Again, the example converges very quickly to an optimal solution.

iteration 1: z = 48.0, u = [2.0, 2.0, 2.0, 2.0, 2.000], penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 2: z = 47.0, u = [2.0, 2.0, 2.0, 2.0, 1.000], penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 3: z = 46.5, u = [2.0, 2.0, 2.0, 2.0, 0.500], penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 4: z = 46.2, u = [2.0, 2.0, 2.0, 2.0, 0.167], penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 5: z = 46.0, u = [2.0, 2.0, 2.0, 2.0, 0.000], penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
primal feasible & optimal
z = 46.0
x = [
  [0, 1, 0]
  [0, 1, 0]
  [1, 0, 0]
  [0, 0, 1]
  [0, 0, 0]
]

Exercise for the reader: change the script to dualize the knapsack constraints instead of the set packing constraints. What is the result of this change in terms of convergence?

🔲 Normal Magic Squares https://ryanjoneil.dev/posts/2012-01-13-normal-magic-squares/ Fri, 13 Jan 2012 00:00:00 +0000 An integer programming formulation of the normal magic squares problem. Note: This post was updated to work with Python 3 and PySCIPOpt. The original version used Python 2 and python-zibopt. It has also been edited for clarity.

As a followup to the last post, I created another SCIP example for finding Normal Magic Squares. This is similar to solving a Sudoku problem, except that here the number of binary variables depends on the square size. In the case of Sudoku, each cell has 9 binary variables – one for each potential value it might take. For a normal magic square, there are $n^2$ possible values for each cell, $n^2$ cells, and one variable representing the row, column, and diagonal sums. This makes a total of $n^4$ binary variables and one continuous variable in the model.

However, there are no big-Ms.

I think the neat part of this code is in this section:

# Construct an expression for each cell that is the sum of
# its binary variables with their associated coefficients.
sums = []
for row in matrix:
    sums_row = []
    for cell in row:
        sums_row.append(sum((i + 1) * x for i, x in enumerate(cell)))
    sums.append(sums_row)

It creates sums of the $n^2$ variables for each cell with their appropriate coefficients ($1$ to $n^2$) and stores those expressions to make the subsequent constraint creation simpler.
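
To see what one of these expressions evaluates to, consider a single cell with numbers in place of variables: if the binary at index 4 is the one switched on, the cell’s expression equals 5 (illustrative values, with n = 3):

```python
n = 3
cell = [0] * (n * n)  # one binary per potential value 1..n^2
cell[4] = 1           # the cell takes value 4 + 1 = 5
value = sum((i + 1) * x for i, x in enumerate(cell))
assert value == 5
```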

Another interesting exercise for the reader: Change the code to minimize the sum of each column. How does that impact the solution time?

🔲 Magic Squares and Big-Ms https://ryanjoneil.dev/posts/2012-01-12-magic-squares-and-big-ms/ Thu, 12 Jan 2012 00:00:00 +0000 An integer programming formulation of the magic squares problem. Note: This post was updated to work with Python 3 and PySCIPOpt. The original version used Python 2 and python-zibopt. It has also been edited for clarity.

Back in October of 2011, I started toying with a model for finding magic squares using SCIP. This is a fun modeling exercise and a challenging problem. First one constructs a square matrix of integer-valued variables.

from pyscipopt import Model

# [...snip...]

m = Model()

matrix = []
for i in range(size):
    row = [m.addVar(vtype="I", lb=1) for _ in range(size)]
    for x in row:
        m.addCons(x <= M)
    matrix.append(row)

Then one adds the following constraints:

  • All variables ≥ 1.
  • All rows, columns, and the diagonal sum to the same value.
  • All variables take different values.

The first two constraints are trivial to implement, and relatively easy for the solver. What I do is add a single extra variable and then set it equal to the sum of each row, each column, and the diagonal.

sum_val = m.addVar(vtype="M")
for i in range(size):
    m.addCons(sum(matrix[i]) == sum_val)
    m.addCons(sum(matrix[j][i] for j in range(size)) == sum_val)

m.addCons(sum(matrix[i][i] for i in range(size)) == sum_val)

It’s the third that messes things up. You can think of this as saying, for every possible pair of integer-valued variables $x$ and $y$:

$$ x \ge y + 1 \quad \text{or} \quad x \le y - 1 $$

Why is this hard? Because we can’t add both constraints to the model. That would make it infeasible. Instead, we write them in such a way that exactly one will be active for any given solution. This requires, for each pair of variables, an additional binary variable $z$ and a (possibly big) constant $M$. Thus we reformulate the above as:

$$ x \ge (y + 1) - M z \\ x \le (y - 1) + M (1-z) \\ z \in \{0,1\} $$

In code this looks like:

from itertools import chain

all_vars = list(chain(*matrix))
for i, x in enumerate(all_vars):
    for y in all_vars[i+1:]:
        z = m.addVar(vtype="B")
        m.addCons(x >= y + 1 - M*z)
        m.addCons(x <= y - 1 + M*(1-z))
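
To see how the reformulation enforces exactly one side of the disjunction, here is a standalone numeric check (plain Python, not solver code):

```python
# With M large enough, z = 0 activates x >= y + 1 and relaxes the other
# constraint; z = 1 activates x <= y - 1 instead.
def feasible(x, y, z, M):
    return x >= y + 1 - M * z and x <= y - 1 + M * (1 - z)

M = 9  # big enough for values in 1..9
assert feasible(5, 3, 0, M)       # z = 0 requires x > y
assert not feasible(3, 3, 0, M)   # ... so x == y is rejected
assert feasible(2, 4, 1, M)       # z = 1 requires x < y
assert not feasible(4, 4, 1, M)   # ... so x == y is rejected again
```

Whichever value $z$ takes, one constraint becomes binding and the other becomes vacuous, so $x = y$ is infeasible either way.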

However, here be dragons. We may not know how big (or small) to make $M$. Generally we want it as small as possible to make the LP relaxation of our integer programming model tighter. Different values of $M$ have unpredictable effects on solution time.

Which brings us to an interesting idea:

SCIP now supports bilinear constraints. This means that I can make $M$ a variable in the above model.

import sys

try:
    M = int(sys.argv[2])
except IndexError:
    M = m.addVar(vtype="M", lb=size * size)
else:
    assert M >= size * size

The magic square model linked to in this post provides both options. The first command line argument it requires is the matrix size. The second one, $M$, is optional. If not given, it leaves $M$ up to the solver.

An interesting exercise for the reader: Change the code to search for a minimal magic square, which minimizes either the value of $M$ or the sums of the columns, rows, and diagonal.

]]>
⏳️ Know Your Time Complexities - Part 2 https://ryanjoneil.dev/posts/2011-11-25-know-your-time-complexities-part-2/ Fri, 25 Nov 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-11-25-know-your-time-complexities-part-2/ More on the importance of time complexity in basic programming In response to this post, Ben Bitdiddle inquires:

I understand the concept of using a companion set to remove duplicates from a list while preserving the order of its elements. But what should I do if these elements are composed of smaller pieces? For instance, say I am generating combinations of numbers in which order is unimportant. How do I make a set recognize that [1,2,3] is the same as [3,2,1] in this case?

There are a couple of points that should help here.

While lists are unhashable and therefore cannot be put into sets, tuples are perfectly hashable. So I cannot do this:

s = set()
s.add([1,2,3])
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

But this works just fine (extra space added for emphasis of tuple parentheses).

s.add( (1,2,3) )

(3,2,1) and (1,2,3) may not hash to the same thing, but tuples are easily sortable. If I sort them before adding them to a set, they look the same.

tuple(sorted( (3,2,1) ))
(1, 2, 3)
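
Putting those two points together answers the original question directly. A sketch (the candidate lists here are made up for illustration):

```python
# Deduplicate order-insensitive groupings by normalizing each one to a
# sorted tuple before adding it to the companion set.
candidates = [[1, 2, 3], [3, 2, 1], [2, 1, 3], [4, 5, 6]]

seen = set()
unique = []
for c in candidates:
    key = tuple(sorted(c))  # [3, 2, 1] and [1, 2, 3] share a key
    if key not in seen:
        seen.add(key)
        unique.append(c)

print(unique)  # [[1, 2, 3], [4, 5, 6]]
```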

If I want to be a little fancier, I can use itertools.combinations. The following generates all unique three-element combinations of the integers from 1 to 4:

from itertools import combinations
list(combinations(range(1,5), 3))
[(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]

Now say I want to find only those that match some condition. I can add a filter to return, say, only those three-element combinations of integers from 1 to 6 whose product is divisible by 10:

list(filter(
    lambda x: not (x[0]*x[1]*x[2]) % 10,
    combinations(range(1, 7), 3)
))
[(1, 2, 5),
 (1, 4, 5),
 (1, 5, 6),
 (2, 3, 5),
 (2, 4, 5),
 (2, 5, 6),
 (3, 4, 5),
 (3, 5, 6),
 (4, 5, 6)]
]]>
⏳️ Know Your Time Complexities https://ryanjoneil.dev/posts/2011-10-25-know-your-time-complexities/ Tue, 25 Oct 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-10-25-know-your-time-complexities/ The importance of time complexity in basic programming This is based on a lightning talk I gave at the LA PyLadies October Hackathon.

I’m actually not going to go into anything much resembling algorithmic complexity here. What I’d like to do is present a common performance anti-pattern that I see from novice programmers about once every year or so. If I can prevent one person from committing this error, this post will have achieved its goal. I’d also like to show how an intuitive understanding of time required by operations in relation to the size of data they operate on can be helpful.

Say you have a Big List of Things. It doesn’t particularly matter what these things are. Often they might be objects or dictionaries of denormalized data. In this example we’ll use numbers. Let’s generate a list of 1 million integers, each randomly chosen from the first 100 thousand natural numbers:

import random

choices = range(100000)
x = [random.choice(choices) for i in range(1000000)]

Now say you want to remove (or aggregate, or structure) duplicate data while keeping them in order of appearance. Intuitively, this seems simple enough. A first solution might involve creating a new empty list, iterating over x, and only appending those items that are not already in the new list.

The Bad Way

order = []
for i in x:
    if i not in order:
        order.append(i)

Try running this. What’s wrong with it?

The issue is the membership test if i not in order. In the worst case, it could look at every item in the order list for each item in x. If the list is big, as it is in our example, that wastes a lot of cycles. We can improve the performance of our code by replacing this test with something faster.

The Good Way

Given that sets have near constant time for membership tests, one solution is to create a companion data structure, which we’ll call seen. Being a set, it doesn’t care about the order of the items, but it will allow us to test for membership quickly.

order = []
seen = set()
for i in x:
    if i not in seen:
        seen.add(i)
        order.append(i)

Now try running this. Better?

That’s not to say this is the best way to perform this particular action. If you aren’t familiar with it, take a look at the groupby function from itertools, which is what I sometimes reach for in a case like this.
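
For the curious, a sketch of the groupby approach (the data here is made up). Note that groupby only merges adjacent equal items, so the input must be sorted first, which gives up the order of first appearance:

```python
from itertools import groupby

x = [3, 1, 3, 2, 1, 1]

# groupby collapses runs of equal items; sorting first makes the runs
# global, at the cost of losing the original order of appearance.
deduped = [k for k, _ in groupby(sorted(x))]
print(deduped)  # [1, 2, 3]
```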

]]>
🎰 Deterministic vs. Stochastic Simulation https://ryanjoneil.dev/posts/2011-06-11-deterministic-vs-stochastic-simulation/ Sat, 11 Jun 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-06-11-deterministic-vs-stochastic-simulation/ The importance of randomness in simulations I find I have to build simulations with increasing frequency in my work and life. Usually this indicates I’m faced with one of the following situations:

  • The need for a quick estimate regarding the quantitative behavior of some situation.
  • The desire to verify the result of a computation or assumption.
  • A situation which is too complex or random to effectively model or understand.

Anyone familiar at all with simulation will recognize the last item as the motivating force of the entire field. Simulation models tend to take over when systems become so complex that understanding them is prohibitive in cost and time or entirely infeasible. In a simulation, the modeler can focus on individual interactions between entities while still hoping for useful output in the form of descriptive statistics.

As such, simulations are nearly always stochastic. The output of a simulation, whether it be the mean time to service upon entering a queue or the number of fish alive in a pond, is determined by a number of random inputs. It is estimated by looking at a sample of the entire, often infinite, problem space and therefore must be described in terms of mean and variance.

For me, simulation building usually follows a process roughly like this:

  • Work with a domain expert to understand the process under study.
  • Convert this process into a deterministic simulation (no randomness).
  • Verify the output of the deterministic simulation.
  • Analyze the inputs of the simulation to determine their probability distributions.
  • Convert the deterministic simulation to a stochastic simulation.

The reason for creating a simulation without randomness first is that it can be difficult or impossible to verify its correctness otherwise. Thus one may focus on the simulation logic first before analyzing and adding sources of randomness.

Where the procedure breaks down is after the third step. Domain experts are often happy to share their knowledge about systems to aid in designing simulations, and typically can understand the resulting abstractions. They are also invaluable in verifying simulation output. However, they are unlikely to understand why it is necessary to add randomness to a system that they already perceive as functional. Further, doing so can be just as difficult and time consuming as the initial model development and therefore requires justification.

This can be a quandary for the model builder. How does one communicate the need to incorporate randomness to decision makers who lack understanding of probability? It is trivially easy to construct simulations that use the same input parameters but yield drastically different outputs. Consider the code below, which simulates two events occurring and counts the number of times event b happens before event a.

import random

def sim_stochastic(event_a_lambda, event_b_lambda):
    # Returns 0 if event A arrives first, 1 if event B arrives first

    # Calculate next arrival time for each event randomly.
    event_a_arrival = random.expovariate(event_a_lambda)
    event_b_arrival = random.expovariate(event_b_lambda)

    return 0.0 if event_a_arrival <= event_b_arrival else 1.0

def sim_deterministic(event_a_lambda, event_b_lambda):
    # Returns 0 if event A arrives first, 1 if event B arrives first

    # Calculate next arrival time for each event deterministically.
    event_a_arrival = 1.0 / event_a_lambda
    event_b_arrival = 1.0 / event_b_lambda

    return 0.0 if event_a_arrival <= event_b_arrival else 1.0

if __name__ == '__main__':
    event_a_lambda = 0.3
    event_b_lambda = 0.5

    repetitions = 10000

    for sim in (sim_stochastic, sim_deterministic):
        output = [
            sim(event_a_lambda, event_b_lambda)
            for _ in range(repetitions)
        ]
        event_b_first = 100.0 * (sum(output) / len(output))
        print('event b is first %0.1f%% of the time' % event_b_first)

Both simulations use the same input parameter, but the second one is essentially wrong as b will always happen first. In the stochastic version, we use exponential distributions for the inputs and obtain an output that verifies our basic understanding of these distributions.

event b is first 63.0% of the time
event b is first 100.0% of the time
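
The stochastic result is easy to sanity-check analytically: for two independent exponential arrivals, the probability that B comes first is $\lambda_b / (\lambda_a + \lambda_b)$. With the parameters above:

```python
# P(B arrives before A) for independent exponentials with rates 0.3 and 0.5.
event_a_lambda = 0.3
event_b_lambda = 0.5
p_b_first = event_b_lambda / (event_a_lambda + event_b_lambda)
print('%0.1f%%' % (100 * p_b_first))  # 62.5%, close to the simulated 63.0%
```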

How about you? How do you discuss the need to model a random world with decision makers?

]]>
🔮 NetworkX and Python Futures https://ryanjoneil.dev/posts/2011-05-19-networkx-and-python-futures/ Thu, 19 May 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-05-19-networkx-and-python-futures/ Solve graph problems on multiple cores Note: This post was updated to work with NetworkX and for clarity.

It’s possible this will turn out like the day when Python 2.5 introduced coroutines. At the time I was very excited. I spent several hours trying to convince my coworkers we should immediately abandon all our existing Java infrastructure and port it to finite state machines implemented using Python coroutines. After a day of hand waving over a proof of concept, we put that idea aside and went about our lives.

Soon after, I left for a Python shop, but in the next half decade I still never found a good place to use this interesting feature.

But it doesn’t feel like that.

As I come to terms more with switching to Python 3.2, the futures module seems similarly exciting. I wish I’d had it years ago, and it’s almost reason in itself to upgrade from Python 2.7. Who cares if none of your libraries have been ported yet?

This library lets you take any function and distribute it over a process pool. To test that out, we’ll generate a bunch of random graphs and iterate over all their cliques.

Code

First, let’s generate some test data using the dense_gnm_random_graph function. Our data includes 1000 random graphs, each with 100 nodes and 100 * 100 edges.

import networkx as nx

n = 100
graphs = [nx.dense_gnm_random_graph(n, n*n) for _ in range(1000)]

Now we write a function to iterate over all cliques in a given graph. NetworkX provides a find_cliques function which returns a generator. Iterating over it ensures we run through the entire process of finding all cliques for a graph.

def iterate_cliques(g):
    for _ in nx.find_cliques(g):
        pass

Now we just define two functions, one for running in serial and one for running in parallel using futures.

from concurrent import futures

def serial_test(graphs):
    for g in graphs:
        iterate_cliques(g)

def parallel_test(graphs, max_workers):
    with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        executor.map(iterate_cliques, graphs)

Our __main__ simply generates the random graphs, samples from them, times both functions, and writes CSV data to standard output.

from csv import writer
import random
import sys
import time

if __name__ == '__main__':
    out = writer(sys.stdout)
    out.writerow(['num graphs', 'serial time', 'parallel time'])

    n = 100
    graphs = [nx.dense_gnm_random_graph(n, n*n) for _ in range(1000)]

    # Run with a number of different randomly generated graphs
    for num_graphs in range(50, 1001, 50):
        sample = random.choices(graphs, k = num_graphs)

        start = time.time()
        serial_test(sample)
        serial_time = time.time() - start

        start = time.time()
        parallel_test(sample, 16)
        parallel_time = time.time() - start

        out.writerow([num_graphs, serial_time, parallel_time])

The output of this script shows that we get a fairly linear speedup to this code with little effort.

Speedup

I ran this on a machine with 8 cores and hyperthreading. Eyeballing the chart, it looks like the speedup is roughly 5x. My system monitor shows spikes on CPU usage across cores whenever the parallel test runs.

CPU usage


]]>
🐪 Reformed JAPHs: Transpiler https://ryanjoneil.dev/posts/2011-04-18-reformed-japhs-transpiler/ Wed, 20 Apr 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-04-18-reformed-japhs-transpiler/ Scheme to Python transpiler Note: This post was edited for clarity.

For the final JAPH in this series, I implemented a simple transpiler that converts a small subset of Scheme programs to equivalent Python programs. It starts with a Scheme program that prints 'just another scheme hacker'.

(define (output x)
    (if (null? x)
        ""
        (begin (display (car x))
                (if (null? (cdr x))
                    (display "\n")
                    (begin (display " ")
                            (output (cdr x)))))))
(output (list "just" "another" "scheme" "hacker"))

The program then tokenizes that Scheme source, parses the token stream, and converts that into Python 3.

def output(x):
    if not x:
        ""
    else:
        print(x[0], end='')
        if not x[1:]:
            print("\n", end='')
        else:
            print(" ", end='')
            output(x[1:])

output(["just", "another", "python", "hacker"])

Finally it executes the resulting Python string using exec. Obfuscation is left as an exercise for the reader.

import re

def tokenize(input):
    '''Tokenizes an input stream into a list of recognizable tokens'''
    token_res = (
        r'\(',      # open paren -> starts expression
        r'\)',      # close paren -> ends expression
        r'"[^"]*"', # quoted string (don't support \" yet)
        r'[\w?]+'   # atom
    )
    return re.findall(r'(' + '|'.join(token_res) + ')', input)

def parse(stream):
    '''Parses a token stream into a syntax tree'''
    if not stream:
        return []

    else:
        # Build a list of arguments (possibly expressions) at this level
        args = []
        while True:
            # Get the next token
            try:
                x = stream.pop(0)
            except IndexError:
                return args

            # ( and ) control the level of the tree we're at
            if x == '(':
                args.append(parse(stream))
            elif x == ')':
                return args
            else:
                args.append(x)

def compile(tree):
    '''Compiles a Scheme Abstract Syntax Tree into near-Python'''
    def compile_expr(indent, expr):
        indent += 1

        lines = [] # these will have [(indent, statement), ...] structure
        while expr:
            # Two options: expr is a string like "'" or it is a list
            if isinstance(expr, str):
                return [(
                    indent,
                    expr.replace('scheme', 'python').replace('\n', '\\n')
                )]

            else:
                start = expr.pop(0)

                if start == 'define':
                    signature = expr.pop(0)
                    lines.append((indent,
                        'def %s(%s):' % (
                            signature[0],
                            ', '.join(signature[1:])
                        )
                    ))
                    while expr:
                        lines.extend(compile_expr(indent, expr.pop(0)))

                elif start == 'if':
                    # We don't support multi-clause conditionals yet
                    clause = compile_expr(indent, expr.pop(0))[0][1]
                    lines.append((indent, 'if %s:' % clause))

                    if_true_lines = compile_expr(indent, expr.pop(0))
                    if_false_lines = compile_expr(indent, expr.pop(0))

                    lines.extend(if_true_lines)
                    lines.append((indent, 'else:'))
                    lines.extend(if_false_lines)

                elif start == 'null?':
                    # Only supports conditionals of the form (null? foo)
                    if isinstance(expr[0], str):
                        condition = expr.pop(0)
                    else:
                        condition = compile_expr(indent, expr.pop(0))[0][1]
                    return [(indent, 'not %s' % condition)]

                elif start == 'begin':
                    # This is just a series of statements, so don't indent
                    while expr:
                        lines.extend(compile_expr(indent-1, expr.pop(0)))

                elif start == 'display':
                    arguments = []
                    while expr:
                        arguments.append(
                            compile_expr(indent, expr.pop(0))[0][1]
                        )
                    lines.append((
                        indent,
                        "print(%s, end='')" % (', '.join(arguments))
                    ))

                elif start == 'car':
                    lines.append((indent, '%s[0]' % expr.pop(0)))

                elif start == 'cdr':
                    lines.append((indent, '%s[1:]' % expr.pop(0)))

                elif start == 'list':
                    arguments = []
                    while expr:
                        arguments.append(
                            compile_expr(indent, expr.pop(0))[0][1]
                        )
                    lines.append((indent, '[%s]' % ', '.join(arguments)))

                else:
                    # Assume this is a function call
                    arguments = []
                    while expr:
                        arguments.append(
                            compile_expr(indent, expr.pop(0))[0][1]
                        )
                    lines.append((
                        indent,
                        "%s(%s)" % (start, ', '.join(arguments))
                    ))

        return lines

    return [compile_expr(-1, expr) for expr in tree]

if __name__ == '__main__':
    scheme = '''
        (define (output x)
            (if (null? x)
                ""
                (begin (display (car x))
                       (if (null? (cdr x))
                           (display "\n")
                           (begin (display " ")
                                  (output (cdr x)))))))
        (output (list "just" "another" "scheme" "hacker"))
    '''
    python = ''
    for expr in compile(parse(tokenize(scheme))):
        python += '\n'.join([(' ' * 4 * x[0]) + x[1] for x in expr]) + '\n\n'
    exec(python)
]]>
🐪 Reformed JAPHs: Turing Machine https://ryanjoneil.dev/posts/2011-04-18-reformed-japhs-turing-machine/ Mon, 18 Apr 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-04-18-reformed-japhs-turing-machine/ Python obfuscation with a Turing machine Note: This post was edited for clarity.

This JAPH uses a Turing machine. The machine accepts any string that ends in '\n' and allows side effects. This lets us print the value of the tape as it encounters each character. While the idea of using lambda functions as side effects in a Turing machine is a little bizarre on many levels, we work with what we have. And Python is multi-paradigmatic, so what the heck.

import re

def turing(tape, transitions):
    # The tape input comes in as a string.  We approximate an infinite
    # length tape via a hash, so we need to convert this to {index: value}
    tape_hash = {i: x for i, x in enumerate(tape)}

    # Start at 0 using our transition matrix
    index = 0
    state = 0
    while True:
        value = tape_hash.get(index, '')

        # This is a modified Turing machine: it uses regexen
        # and has side effects.  Oh well, I needed IO.
        for rule in transitions[state]:
            regex, next, direction, new_value, side_effect = rule
            if re.match(regex, value):
                # Terminal states
                if new_value in ('YES', 'NO'):
                    return new_value

                tape_hash[index] = new_value
                side_effect(value)
                index += direction
                state = next
                break

assert 'YES' == turing('just another python hacker\n', [
    # This Turing machine recognizes the language of strings that end in \n.

    # Regex rule, next state, left/right = -1/+1, new value, side effect.
    [ # State 0:
        [r'^[a-z ]$', 0, +1, '', lambda x: print(x, end='')],
        [r'^\n$', 1, +1, '', lambda x: print(x, end='')],
        [r'^.*$', 0, +1, 'NO', None],
    ],
    [ # State 1:
        [r'^$', 1, -1, 'YES', None]
    ]
])

Obfuscation again consists of converting the above code into lambda functions using Y combinators. This is a nice programming exercise, so I’ve left it out of this post in case anyone wants to try. The Turing machine has to return 'YES' to indicate that it accepts the string, thus the assertion. Our final obfuscated JAPH is a single expression.

assert'''YES'''==(lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:g(
lambda arg: f(f)(arg))))(lambda f: lambda q:[(lambda g:(lambda f:g(lambda
arg:f(f)(arg)))(lambda f: g(lambda arg:f(f)(arg))))(lambda f: lambda x:(x
[0][0]if x[0] and __import__('re').match(x[0][0][0],x[1])else f([x[0][1:]
,x[1]]))) ([q[3][q[1]],q[2].get(q[0],'')])[4](q[2].get(q[0],'')), (lambda
g:(lambda f:g(lambda arg:f(f)(arg))) (lambda f:g(lambda arg:f(f)(arg))))(
lambda f:lambda x:(x[0][0]if x[0] and __import__('re').match(x[0][0][0],x
[1])else f([x[0][1:],x[1]])))([q[3][q[1]],q[2].get(q[0],'')])[3]if(lambda
g:(lambda f:g(lambda arg:f(f)(arg))) (lambda f:g(lambda arg:f(f)(arg))))(
lambda f:lambda x:(x[0][0]if x[0]and __import__('re').match(x[0][0][0],x[
1]) else f([x[0][1:],x[1]])))([q[3][q[1]],q[2].get(q[0],'')])[3]in('YES',
'NO')else f([q[0]+(lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:g
(lambda arg:f(f)(arg))))(lambda f:lambda x:(x[0][0]if x[0]and __import__(
're').match(x[0][0][0],x[1])else f([x[0][1:], x[1]])))([q[3][q[1]], q[2].
get(q[0],'')])[2],(lambda g:(lambda f:g(lambda arg: f(f)(arg)))(lambda f:
g(lambda arg:f(f)(arg))))(lambda f:lambda x:(x[0][0]if x[0]and __import__
('re').match(x[0][0][0],x[1])else f([x[0][1:], x[1]])))([q[3][q[1]],q[2].
get(q[0],'')])[1],q[2],q[3]])][1])([0,0,{i:x for i,x in enumerate('just '
'another python hacker\n')}, [[[r'^[a-z ]$',0,+1,'',lambda x:print(x,end=
'')], [r'^\n$',1,+1,'',lambda x:print(x, end='')],[r'^.*$',0,+1,'''NO''',
lambda x:None]], [[r'''^$''',+1,-1,'''YES''', lambda x: None or None]]]])
]]>
🐪 Reformed JAPHs: Huffman Coding https://ryanjoneil.dev/posts/2011-04-14-reformed-japhs-huffman-coding/ Thu, 14 Apr 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-04-14-reformed-japhs-huffman-coding/ Python obfuscation and Huffman coding Note: This post was edited for clarity.

At this point, tricking python into printing strings via indirect means got a little boring. So I switched to obfuscating fundamental computer science algorithms. Here’s a JAPH that takes in a Huffman coded version of 'just another python hacker', decodes, and prints it.

# Build coding tree
def build_tree(scheme):
    if scheme.startswith('*'):
        left, scheme = build_tree(scheme[1:])
        right, scheme = build_tree(scheme)
        return (left, right), scheme
    else:
        return scheme[0], scheme[1:]

def decode(tree, encoded):
    ret = ''
    node = tree
    for direction in encoded:
        if direction == '0':
            node = node[0]
        else:
            node = node[1]
        if isinstance(node, str):
            ret += node
            node = tree
    return ret

tree = build_tree('*****ju*sp*er***yct* h**ka*no')[0]
print(
    decode(tree, bin(10627344201836243859174935587).lstrip('0b').zfill(103))
)

The decoding tree is like a LISP-style sequence of pairs. '*' represents a branch in the tree while other characters are leaf nodes. This looks like the following.

(
    (
        (
            (
                ('j', 'u'), 
                ('s', 'p')
            ), 
            ('e', 'r')
        ), 
        (
            (
                ('y', 'c'), 
                't'
            ), 
            (' ', 'h')
        )
    ), 
    (
        ('k', 'a'), 
        ('n', 'o')
    )
)

The actual Huffman coded version of our favorite string gets about 50% smaller represented in base-2.

0000000001000100101011010111011101010111001000110110000110100001010111111110011001111010100110000100011
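
The arithmetic behind that claim: the original string is 26 characters, or 208 bits at 8 bits per character, while the encoded form is 103 bits:

```python
# Compare the 8-bit-per-character representation to the Huffman coded one.
original_bits = len('just another python hacker') * 8  # 26 * 8 = 208
encoded_bits = 103
savings = 100 * (1 - encoded_bits / original_bits)
print('%0.1f%% smaller' % savings)  # 50.5% smaller
```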

There’s a catch here, which is that this is hard to obfuscate unless we turn it into a single expression. This means that we have to convert build_tree and decode into lambda functions. Unfortunately, they are recursive, and lambda functions can’t recurse naturally. Fortunately, we can use Y combinators to get around the problem. These are worth some study since they will pop up again in future JAPHs.

Y = lambda g: (
    lambda f: g(lambda arg: f(f)(arg))) (lambda f: g(lambda arg: f(f)(arg))
)

build_tree = Y(
    lambda f: lambda scheme: (
        (f(scheme[1:])[0], f(f(scheme[1:])[1])[0]),
        f(f(scheme[1:])[1])[1]
    ) if scheme.startswith('*') else (scheme[0], scheme[1:])
)

decode = Y(lambda f: lambda x: x[3]+x[1] if not x[2] else (
    f([x[0], x[0], x[2], x[3]+x[1]]) if isinstance(x[1], str) else (
        f([x[0], x[1][0], x[2][1:], x[3]]) if x[2][0] == '0' else (
            f([x[0], x[1][1], x[2][1:], x[3]])
        )
    )
))

tree = build_tree('*****ju*sp*er***yct* h**ka*no')[0]
print(
    decode([
        tree,
        tree,
        bin(10627344201836243859174935587).lstrip('0b').zfill(103), ''
    ])
)

The final version is a condensed (and expanded, oddly) version of the above.

print((lambda t,e,s:(lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:
g(lambda arg: f(f)(arg))))(lambda f:lambda x: x[3]+x[1]if not x[2]else f([
x[0],x[0],x[2],x[3]+x[1]])if isinstance(x[1],str)else f([x[0],x[1][0],x[2]
[1:],x[3]])if x[2][0]=='0'else f([x[0],x[1][1],x[2][1:],x[3]]))([t,t,e,s])
)((lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:g(lambda arg:f(f)(
arg))))(lambda f:lambda p:((f(p[1:])[0],f(f(p[1:])[1])[0]),f(f(p[1:])[1])[
1])if p.startswith('*')else(p[0],p[1:]))('*****ju*sp*er***yct* h**ka*no')[
0],bin(10627344201836243859179756385-4820798).lstrip('0b').zfill(103),''))
]]>
🐪 Reformed JAPHs: Rolling Effect https://ryanjoneil.dev/posts/2011-04-11-reformed-japhs-rolling-effect/ Mon, 11 Apr 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-04-11-reformed-japhs-rolling-effect/ Python obfuscation with a cute visual effect Note: This post was updated to work with Python 3.12. It may not work with different versions.

Here’s a JAPH composed solely for effect. For each letter in 'just another python hacker' it loops over the characters in ' abcdefghijklmnopqrstuvwxyz', printing each one. Between characters it pauses for 0.05 seconds, backing up and moving on to the next if it hasn’t reached the desired letter yet. This achieves a sort of rolling effect by which the final string appears on our screen over time.

import string
import sys
import time

letters = ' ' + string.ascii_lowercase
for l in 'just another python hacker':
    for x in letters:
        print(x, end='')
        sys.stdout.flush()
        time.sleep(0.05)

        if x == l:
            break
        else:
            print('\b', end='')

print()

We locate and print each letter in the string with a list comprehension. At the end we have an extra line of code (the eval statement) that gives us our newline.

[[(lambda x,l:str(print(x,end=''))+str(__import__(print.
__doc__[print.__doc__.index('stdout') - 4:print.__doc__.
index('stdout')-1]).stdout.flush()) + str(__import__(''.
join(reversed('emit'))).sleep(0o5*1.01/0x64))+str(print(
'\b',end='\x09'.strip())if x!=l else'*&#'))(x1,l1)for x1
in('\x20'+getattr(__import__(type('phear').__name__+'in'
'g'),dir(__import__(type('snarf').__name__+'ing'))[15]))
[:('\x20'+getattr(__import__(type('smear').__name__+'in'
'g'),dir(__import__(type('slurp').__name__+'ing'))[15]))
.index(l1)+1]]for l1 in'''just another python hacker''']
eval('''\x20\x09eval("\x20\x09eval('\x20 print()')")''')
]]>
🐪 Reformed JAPHs: ROT13 https://ryanjoneil.dev/posts/2011-04-06-reformed-japhs-rot13/ Wed, 06 Apr 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-04-06-reformed-japhs-rot13/ Python obfuscation using ROT13 encoding Note: This post was updated to work with Python 3.12. It may not work with different versions.

No series of JAPHs would be complete without ROT13. This is the example through which aspiring Perl programmers learn to use tr and its synonym y. In Perl the basic ROT13 JAPH starts as:

$foo = 'whfg nabgure crey unpxre';
$foo =~ y/a-z/n-za-m/;
print $foo;

Python has nothing quite so elegant in its default namespace. However, this does give us the opportunity to explore a little-used aspect of strings: the translate method. If we construct a dictionary of ordinals we can accomplish the same thing with a touch more effort.

import string

table = {
    ord(x): ord(y) for x, y in zip(
        string.ascii_lowercase,
        string.ascii_lowercase[13:] + string.ascii_lowercase
    )
}

print('whfg nabgure clguba unpxre'.translate(table))
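
As an aside (not part of the original JAPH), str.maketrans builds the same table more directly:

```python
import string

# str.maketrans pairs characters positionally, much like Perl's tr///.
lc = string.ascii_lowercase
table = str.maketrans(lc, lc[13:] + lc[:13])
print('whfg nabgure clguba unpxre'.translate(table))
# just another python hacker
```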

We obfuscate the construction of this translation dictionary and, for added measure, use getattr to find the print function off of __builtins__. This will likely only work in Python 3.2, since the order of attributes on __builtins__ matters.

getattr(vars()[list(filter(lambda _:'\x5f\x62'in _,dir
()))[0]], dir(vars()[list(filter(lambda _:'\x5f\x62'in
_, dir()))[0]])[list(filter(lambda _:_ [1].startswith(
'\x70\x72'),enumerate(dir(vars()[list(filter(lambda _:
'\x5f\x62'in _,dir()))[0]]))))[0][0]])(getattr('whfg '
+'''nabgure clguba unpxre''', dir('0o52')[0o116])({ _:
(_-0o124) %0o32 +0o141 for _ in range(0o141, 0o173)}))
]]>
🐪 Reformed JAPHs: Ridiculous Anagram https://ryanjoneil.dev/posts/2011-04-03-reformed-japhs-ridiculous-anagram/ Sun, 03 Apr 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-04-03-reformed-japhs-ridiculous-anagram/ Python obfuscation using anagrams Here’s the second in my reformed JAPH series. It takes an anagram of 'just another python hacker' and converts it prior to printing. It sorts the anagram by the indices of another string, in order of their associated characters. This is sort of like a pre-digested Schwartzian transform.

x = 'upjohn tehran hectors katy'
y = '1D0HG6JFO9P5ICKAM87B24NL3E'

print(''.join(x[i] for i in sorted(range(len(x)), key=lambda p: y[p])))

Obfuscation consists mostly of using silly machinations to construct the string we use to sort the anagram.

print(''.join('''upjohn tehran hectors katy'''[_]for _ in sorted(range
(26),key=lambda p:(hex(29)[2:].upper()+str(3*3*3*3-3**4)+'HG'+str(sum(
range(4)))+'JFO'+str((1+2)**(1+1))+'P'+str(35/7)[:1]+'i.c.k.'.replace(
'.','').upper()+'AM'+str(3**2*sum(range(5))-3)+hex(0o5444)[2:].replace
(*'\x62|\x42'.split('|'))+'NL'+hex(0o076).split('x')[1].upper())[p])))
🐪 Reformed JAPHs: Alphabetic Indexing https://ryanjoneil.dev/posts/2011-04-01-reformed-japhs-alphabetic-indexing/ Fri, 01 Apr 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-04-01-reformed-japhs-alphabetic-indexing/ Python obfuscation Note: This post was edited for clarity.

Many years ago, I was a Perl programmer. Then one day I became disillusioned at the progress of Perl 6 and decided to import this.

This seems to be a fairly common story for Perl to Python converts. While I haven’t looked back much, there are a number of things I really miss about perl (lower case intentional). I miss having value types in a dynamic language, magical and ill-advised use of cryptocontext, and sometimes even pseudohashes because they were inexcusably weird. A language that supports so many ideas out of the box enables an extended learning curve that lasts for many years. “Perl itself is the game.”

Most of all I think I miss writing Perl poetry and JAPHs. Sadly, I didn’t keep any of those I wrote, and I’m not competent enough with the language anymore to write interesting ones. At the time I was intentionally distancing myself from a model that was largely implicit and based on archaic systems internals and moving to one that was (supposedly) explicit and simple.

After switching to Python as my primary language, I used the following email signature in a nod to this change in orientation (intended for Python 2):

print 'just another python hacker'

Recently I’ve been experimenting with writing JAPHs in Python. I think of these as “reformed JAPHs.” They accomplish the same purpose as programming exercises but in a more restricted context. In some ways they are more challenging. Creativity can be difficult in a narrowly defined landscape.

I have written a small series of reformed JAPHs which increase monotonically in complexity. Here is the first one, written in plain understandable Python 3.

import string

letters = string.ascii_lowercase + ' '
indices = [
     9, 20, 18, 19, 26,  0, 13, 14, 19, 7,  4, 17, 26,
    15, 24, 19,  7, 14, 13, 26,  7,  0, 2, 10,  4, 17
]

print(''.join(letters[i] for i in indices))

This is fairly simple. Instead of explicitly embedding the string 'just another python hacker' in the program, we assemble it using the index of its letters in the string 'abcdefghijklmnopqrstuvwxyz '. We then obfuscate through a series of minor measures:

  • Instead of calling the print function, we import sys and make a call to sys.stdout.write.
  • We assemble string.ascii_lowercase + ' ' by joining together the character versions of its respective ordinal values (97 through 122, plus 32).
  • We join together the integer indices using 'l' and split that into a list.
  • We apply ''' liberally and rely on the fact that Python concatenates adjacent string literals.
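The middle two steps can be verified in isolation:

```python
# Rebuild the letter list from ordinals and recover the integer indices
# from an 'l'-joined string, as described in the bullets above.
letters = [chr(o) for o in range(97, 123)] + [chr(32)]
indices = ('9l20l18l19l26l0l13l14l19l7l4l17l26l15'
           'l24l19l7l14l13l26l7l0l2l10l4l17').split('l')
print(''.join(letters[int(i)] for i in indices))
# just another python hacker
```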

Here’s the obfuscated version:

eval("__import__('''\x73''''''\x79''''''\x73''').sTdOuT".lower()
).write(''.join(map(lambda _:(list(map(chr,range(97,123)))+[chr(
32)])[int(_)],('''9l20l18l19''''''l26l0l13l14l19l7l4l17l26l15'''
'''l24l19l7l14l1''''''3l26l7l0l2l10l4l17''').split('l')))+'\n',)

We could certainly do more, but that’s where I left this one. Stay tuned for the next JAPH.

🧐 Data Fitting 2 - Very, Very Simple Linear Regression in Python https://ryanjoneil.dev/posts/2011-02-15-data-fitting-2-very-very-simple-linear-regression-in-python/ Tue, 15 Feb 2011 00:00:00 +0000 https://ryanjoneil.dev/posts/2011-02-15-data-fitting-2-very-very-simple-linear-regression-in-python/ Predict how much people like cats and dogs based on their ice cream preferences. Also, Python and numpy. This post is based on a memo I sent to some former colleagues at the Post. I’ve edited it for use here since it fits well as the second in a series on simple data fitting techniques. If you’re among the many enlightened individuals already using regression analysis, then this post is probably not for you. If you aren’t, then hopefully this provides everything you need to develop rudimentary predictive models that yield surprising levels of accuracy.

Data

For purposes of a simple working example, we have collected six records of input data over three dimensions with the goal of predicting two outputs. The input data are:

$$ \begin{align*} x_1 &= \text{How much a respondent likes vanilla [0-10]}\\ x_2 &= \text{How much a respondent likes strawberry [0-10]}\\ x_3 &= \text{How much a respondent likes chocolate [0-10]} \end{align*} $$

Output data consist of:

$$ \begin{align*} b_1 &= \text{How much a respondent likes dogs [0-10]}\\ b_2 &= \text{How much a respondent likes cats [0-10]} \end{align*} $$

Below are anonymous data collected from a random sample of people.

respondent vanilla ❤️ strawberry ❤️ chocolate ❤️ dog ❤️ cat ❤️
Alyssa P Hacker 9 4 9 9 8
Ben Bitdiddle 8 6 4 10 4
Cy D. Fect 9 4 8 2 6
Eva Lu Ator 3 7 9 4 6
Lem E. Tweakit 6 8 5 2 5
Louis Reasoner 4 5 3 10 3

Our input is in three dimensions. Each output requires its own model, so we’ll have one for dogs and one for cats. We’re looking for functions, dog(x) and cat(x), that can predict $b_1$ and $b_2$ based on given values of $x_1$, $x_2$, and $x_3$.

Model 1

For both models we want to find parameters that minimize their squared residuals (read: errors). There are a number of names for this. Optimization folks like to think of it as unconstrained quadratic optimization, but it’s more common to call it least squares or linear regression. It’s not necessary to fully understand why for our purposes, but the coefficient vector that minimizes these errors is:

$$\beta = ({A^t}A)^{-1}{A^t}b$$

This is implemented for you in the numpy.linalg module, which we’ll use for examples. Much more information than you probably want can be found in the numpy documentation.
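To convince yourself that this formula and numpy’s least squares routine agree, here is a small sanity check on arbitrary made-up data:

```python
import numpy

# Fit y ~ c0 + c1*x through three points two ways: the normal equations
# from the formula above, and numpy's lstsq routine.
A = numpy.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = numpy.array([1.0, 2.0, 2.0])

beta_normal = numpy.linalg.inv(A.T @ A) @ A.T @ b
beta_lstsq = numpy.linalg.lstsq(A, b, rcond=None)[0]
print(numpy.allclose(beta_normal, beta_lstsq))  # True
```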

Below is a first stab at a Python version. It runs least squares against our input and output data exactly as they are. You can see the matrix $A$ and outputs $b_1$ and $b_2$ (dog and cat love, respectively) are represented just as they are in the table.

# Version 1: No offset, no squared inputs

import numpy

A = numpy.vstack([
    [9, 4, 9],
    [8, 6, 4],
    [9, 4, 8],
    [3, 7, 9],
    [6, 8, 5],
    [4, 5, 3]
])

b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])

print('dog ❤️:', numpy.linalg.lstsq(A, b1, rcond=None)[0])
print('cat ❤️:', numpy.linalg.lstsq(A, b2, rcond=None)[0])

# Output:
# dog ❤️: [0.72548294      0.53045642     -0.29952361]
# cat ❤️: [2.36110929e-01  2.61934385e-05  6.26892476e-01]

The resulting model is:

dog(x) = 0.72548294 * x1 + 0.53045642 * x2 - 0.29952361 * x3
cat(x) = 2.36110929e-01 * x1 + 2.61934385e-05 * x2 + 6.26892476e-01 * x3

The coefficients before our variables correspond to beta in the formula above. Errors between observed and predicted data, shown below, are calculated and summed. For these six records, dog(x) has a total error of 20.76 and cat(x) has 3.74. Not great.

respondent predicted b1 b1 error predicted b2 b2 error
Alyssa P Hacker 5.96 3.04 7.77 1.23
Ben Bitdiddle 7.79 2.21 4.40 0.40
Cy D. Fect 6.25 4.25 7.14 1.14
Eva Lu Ator 3.19 0.81 6.35 0.35
Lem E. Tweakit 7.10 5.10 4.55 0.45
Louis Reasoner 4.66 5.34 2.83 0.17
Total error: 20.76 3.74
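For the curious, the dog column of this table can be reproduced in a few lines (a quick check, not part of the original memo):

```python
import numpy

A = numpy.vstack([
    [9, 4, 9],
    [8, 6, 4],
    [9, 4, 8],
    [3, 7, 9],
    [6, 8, 5],
    [4, 5, 3]
])
b1 = numpy.array([9, 10, 2, 4, 2, 10])

coef = numpy.linalg.lstsq(A, b1, rcond=None)[0]
predicted = A @ coef
total_error = numpy.abs(predicted - b1).sum()
print(round(total_error, 2))  # about 20.76, matching the table
```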

Model 2

One problem with this model is that dog(x) and cat(x) are forced to pass through the origin. (Why is that?) We can improve it somewhat if we add an offset. This amounts to prepending 1 to every row in $A$ and adding a constant to the resulting functions. You can see the very slight difference between the code for this model and that of the previous:

# Version 2: Offset, no squared inputs

import numpy

A = numpy.vstack([
    [1, 9, 4, 9],
    [1, 8, 6, 4],
    [1, 9, 4, 8],
    [1, 3, 7, 9],
    [1, 6, 8, 5],
    [1, 4, 5, 3]
])

b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])

print('dog ❤️:', numpy.linalg.lstsq(A, b1, rcond=None)[0])
print('cat ❤️:', numpy.linalg.lstsq(A, b2, rcond=None)[0])

# Output:
# dog ❤️: [20.92975427  -0.27831197  -1.43135684  -0.76469017]
# cat ❤️: [-0.31744124   0.25133547   0.02978098   0.63394765]

This yields the second version of our models:

dog(x) = 20.92975427 - 0.27831197 * x1 - 1.43135684 * x2 - 0.76469017 * x3
cat(x) = -0.31744124 + 0.25133547 * x1 + 0.02978098 * x2 + 0.63394765 * x3

These models provide errors of 13.87 and 3.79. A little better on the dog side, but still not quite usable.

respondent predicted b1 b1 error predicted b2 b2 error
Alyssa P Hacker 5.82 3.18 7.77 1.23
Ben Bitdiddle 7.06 2.94 4.41 0.41
Cy D. Fect 6.58 4.58 7.14 1.14
Eva Lu Ator 3.19 0.81 6.35 0.35
Lem E. Tweakit 3.99 1.99 4.60 0.40
Louis Reasoner 10.37 0.37 2.74 0.26
Total error: 13.87 3.79

Model 3

The problem is that dog(x) and cat(x) are linear functions. Most observed data don’t conform to straight lines. Take a moment and draw the line $f(x) = x$ and the curve $f(x) = x^2$. The former makes a poor approximation of the latter.

Most of the time, people simply use squares of the input data to add curvature to their models. We do this in our next version of the code by adding squares of the input row values to our $A$ matrix. Everything else is the same. (In reality, you can add any function of the input data you feel best models the data, if you understand it well enough.)

# Version 3: Offset with squared inputs

import numpy

A = numpy.vstack([
    [1, 9, 9**2, 4, 4**2, 9, 9**2],
    [1, 8, 8**2, 6, 6**2, 4, 4**2],
    [1, 9, 9**2, 4, 4**2, 8, 8**2],
    [1, 3, 3**2, 7, 7**2, 9, 9**2],
    [1, 6, 6**2, 8, 8**2, 5, 5**2],
    [1, 4, 4**2, 5, 5**2, 3, 3**2]
])

b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])

print('dog ❤️:', numpy.linalg.lstsq(A, b1, rcond=None)[0])
print('cat ❤️:', numpy.linalg.lstsq(A, b2, rcond=None)[0])

# dog ❤️: [1.29368307  7.03633306  -0.44795498  9.98093332
#  -0.75689575  -19.00757486  1.52985734]
# cat ❤️: [0.47945896  5.30866067  -0.39644128 -1.28704188
#   0.12634295   -4.32392606  0.43081918]

This gives us our final version of the model:

dog(x) = 1.29368307 + 7.03633306 * x1 - 0.44795498 * x1**2 + 9.98093332 * x2 - 0.75689575 * x2**2 - 19.00757486 * x3 + 1.52985734 * x3**2
cat(x) = 0.47945896 + 5.30866067 * x1 - 0.39644128 * x1**2 - 1.28704188 * x2 + 0.12634295 * x2**2 - 4.32392606 * x3 + 0.43081918 * x3**2

Adding curvature to our model eliminates all perceived error, at least within 1e-16. This may seem unbelievable, but it isn’t: each model now has seven coefficients to fit only six records, so an exact fit is available.

respondent predicted b1 b1 error predicted b2 b2 error
Alyssa P Hacker 9 0 9 0
Ben Bitdiddle 10 0 4 0
Cy D. Fect 2 0 6 0
Eva Lu Ator 4 0 6 0
Lem E. Tweakit 2 0 5 0
Louis Reasoner 10 0 3 0
Total error: 0 0
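As a quick check (my addition, not in the original memo), lstsq confirms that the widened matrix reproduces the dog observations to machine precision:

```python
import numpy

A = numpy.vstack([
    [1, 9, 9**2, 4, 4**2, 9, 9**2],
    [1, 8, 8**2, 6, 6**2, 4, 4**2],
    [1, 9, 9**2, 4, 4**2, 8, 8**2],
    [1, 3, 3**2, 7, 7**2, 9, 9**2],
    [1, 6, 6**2, 8, 8**2, 5, 5**2],
    [1, 4, 4**2, 5, 5**2, 3, 3**2]
])
b1 = numpy.array([9, 10, 2, 4, 2, 10])

coef = numpy.linalg.lstsq(A, b1, rcond=None)[0]
print(numpy.allclose(A @ coef, b1))  # True: the fit is exact
```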

It should be fairly obvious how one can take this and extrapolate to much larger models. I hope this is useful and that least squares becomes an important part of your lives.

🗳 Off the Cuff Voter Fraud Detection https://ryanjoneil.dev/posts/2010-11-30-off-the-cuff-voter-fraud-detection/ Tue, 30 Nov 2010 00:00:00 +0000 https://ryanjoneil.dev/posts/2010-11-30-off-the-cuff-voter-fraud-detection/ Using the exponential distribution to interpret votes in a web survey Consider this scenario: You run a contest that accepts votes from the general Internet population. In order to encourage user engagement, you record any and all votes into a database over several days, storing nothing more than the competitor voted for, when each vote is cast, and a cookie set on the voter’s computer along with their apparent IP addresses. If a voter already has a recorded cookie set they are denied subsequent votes. This way you can avoid requiring site registration, a huge turnoff for your users. Simple enough.

Unfortunately, some of the competitors are wily and attached to the idea of winning. They go so far as programming or hiring bots to cast thousands of votes for them. Your manager wants to know which votes are real and which ones are fake Right Now. Given very limited time, and ignoring actions that you could have taken to avoid the problem, how can you tell apart sets of good votes from those that shouldn’t be counted?

One quick-and-dirty option involves comparing histograms of interarrival times for sets of votes. Say you’re concerned that all the votes during a particular period of time or from a given IP address might be fraudulent. Put all the vote times you’re concerned about into a list, sort them, and compute their differences:

# times is a list of datetime instances from vote records
times.sort()
interarrivals = [(y - x).total_seconds() for x, y in zip(times, times[1:])]

Now use matplotlib to display a histogram of these. Votes that occur naturally are likely to resemble an exponential distribution in their interarrival times. For instance, here are interarrival times for all votes received in a contest:

Interarrival times for all submissions

This subset of votes is clearly fraudulent, due to the near determinism of their interarrival times. This is most likely caused by the voting bot not taking random sleep intervals during voting. It casts a vote, receives a response, clears its cookies, and repeats:

Interarrival times for clearly fraudulent votes

These votes, on the other hand, are most likely legitimate. They exhibit a nice Erlang shape and appear to have natural interarrival times that one would expect:

Proper-looking interarrival times

Of course this method is woefully inadequate for rigorous detection of voting fraud. Ideally one would find a method to compute the probability that a set of votes is generated by a bot. This is enough to inform quick, ad hoc decisions though.
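If matplotlib isn’t handy, a crude text histogram of simulated exponential gaps shows the decaying shape legitimate votes tend to exhibit. This is an illustration with made-up data, not from the original post:

```python
import random
from collections import Counter

# Simulated interarrival times (seconds), exponential with mean 30.
random.seed(0)
gaps = [random.expovariate(1 / 30) for _ in range(1000)]

# Bucket into 10-second bins and print a crude text histogram.
bins = Counter(int(g // 10) for g in gaps)
for b in range(10):
    print('%3d-%3ds | %s' % (b * 10, b * 10 + 10, '#' * (bins[b] // 10)))
```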

🧐 Data Fitting 1 - Linear Data Fitting https://ryanjoneil.dev/posts/2010-11-23-data-fitting-1-linear-data-fitting/ Tue, 23 Nov 2010 00:00:00 +0000 https://ryanjoneil.dev/posts/2010-11-23-data-fitting-1-linear-data-fitting/ An introduction to data fitting and classification using linear optimization in Python Note: This post was updated to work with Python 3 and PySCIPOpt. The original version used Python 2 and python-zibopt.

Data fitting is one of those tasks that everyone should have at least some exposure to. Certainly developers and analysts will benefit from a working knowledge of its fundamentals and their implementations. However, in my own reading I’ve found it difficult to locate good examples that are simple enough to pick up quickly and come with accompanying source code.

This article commences an ongoing series introducing basic data fitting techniques. With any luck they won’t be overly complex, while still being useful enough to get the point across with a real example and real data. We’ll start with a binary classification problem: presented with a series of records, each containing a set number of input values describing it, determine whether or not each record exhibits some property.

Model

We’ll use the cancer1.dt data from the proben1 set of test cases, which you can download here. Each record starts with 9 data points containing physical characteristics of a tumor. The second to last data point contains 1 if a tumor is benign and 0 if it is malignant. We seek to find a linear function we can run on an arbitrary record that will return a value greater than zero if that record’s tumor is predicted to be benign and less than zero if it is predicted to be malignant. We will train our linear model on the first 350 records, and test it for accuracy on the remaining rows.

This is similar to the data fitting problem found in Chvatal. Our inputs consist of a matrix of observed data, $A$, and a vector of classifications, $b$. In order to classify a record, we require another vector $x$ such that the dot product of $x$ and that record will be either greater or less than zero depending on its predicted classification.

A couple points to note before we start:

  • Most observed data are noisy. This means it may be impossible to locate a hyperplane that cleanly separates given records of one type from another. In this case, we must resort to finding a function that minimizes our predictive error. For the purposes of this example, we’ll minimize the sum of the absolute differences between the observed and predicted values. That is, we seek $x$ such that we find $\min \sum_i{|a_i^\intercal x-b_i|}$.

  • The slope-intercept form of a line, $f(x)=m^\intercal x+b$, contains an offset. It should be obvious that this is necessary in our model so that our function isn’t required to pass through the origin. Thus, we’ll prepend an extra data point with value 1 to each record; its coefficient $x_0$ serves as the offset.

  • In order to model this, we use two linear constraints for each absolute value. We minimize the sum of these. Our Linear Programming model thus looks like:

$$ \begin{align*} \min\quad & z = \sum_i{v_i}\\ \text{s.t.}\quad& v_i \geq x_0 + a_i^\intercal x - 1 &\quad\forall&\quad\text{benign tumors}\\ & v_i \geq 1 - x_0 - a_i^\intercal x &\quad\forall&\quad\text{benign tumors}\\ & v_i \geq x_0 + a_i^\intercal x - (-1) &\quad\forall&\quad\text{malignant tumors}\\ & v_i \geq -1 - x_0 - a_i^\intercal x &\quad\forall&\quad\text{malignant tumors} \end{align*} $$

Code

In order to do this in Python, we use SCIP and SoPlex. We start by setting constants for benign and malignant outputs and providing a function to read in the training and testing data sets.

# Preferred output values for tumor categories
BENIGN = 1
MALIGNANT = -1

def read_proben1_cancer_data(filename, train_size):
    '''Loads a proben1 cancer file into train & test sets'''
    # Number of input data points per record
    DATA_POINTS = 9

    train_data = []
    test_data = []

    with open(filename) as infile:
        # Read in the first train_size lines to a training data list, and the
        # others to testing data. This allows us to test how general our model
        # is on something other than the input data.
        for line in infile.readlines()[7:]: # skip header
            line = line.split()

            # Input values for this record (the offset x0 is prepended later)
            input = [float(x) for x in line[:DATA_POINTS]]
            output = BENIGN if line[-2] == '1' else MALIGNANT
            record = {'input': input, 'output': output}

            # Determine what data set to put this in
            if len(train_data) >= train_size:
                test_data.append(record)
            else:
                train_data.append(record)

    return train_data, test_data

The next function implements the LP model described above using SoPlex and SCIP. It minimizes the sum of residuals over the training records, which amounts to summing the absolute differences between predicted and observed outputs. Given input and observed output data, it returns a list of coefficients. The resulting model consists of taking the dot product of an input record and these coefficients: if the result is greater than or equal to zero, the record is predicted to be a benign tumor; otherwise it is predicted to be malignant.

from pyscipopt import Model

def train_linear_model(train_data):
    '''
    Accepts a set of input training data with known output
    values.  Returns a list of coefficients to apply to
    arbitrary records for purposes of binary categorization.
    '''
    # Make sure we have at least one training record.
    assert len(train_data) > 0
    num_variables = len(train_data[0]['input'])

    # Variables are coefficients in front of the data points. It is important
    # that these be unrestricted in sign so they can take negative values.
    m = Model()
    x = [m.addVar(f'x{i}', lb=None) for i in range(num_variables)]

    # Residual for each data row
    residuals = [m.addVar(lb=None, ub=None) for _ in train_data]
    for r, d in zip(residuals, train_data):
        # r will be the absolute value of the difference between observed and
        # predicted values. We can model absolute values such as r >= |foo| as:
        #
        #   r >=  foo
        #   r >= -foo
        m.addCons(sum(xi * ai for xi, ai in zip(x, d['input'])) + r >= d['output'])
        m.addCons(sum(xi * ai for xi, ai in zip(x, d['input'])) - r <= d['output'])

    # Find and return coefficients that min sum of residuals.
    m.setObjective(sum(residuals))
    m.setMinimize()
    m.optimize()

    solution = m.getBestSol()
    return [solution[xi] for xi in x]

We also provide a convenience function for counting the number of correct predictions by our resulting model against either the test or training data sets.

def count_correct(data_set, coefficients):
    '''Returns the number of correct predictions.'''
    correct = 0
    for d in data_set:
        result = sum(x*y for x, y in zip(coefficients, d['input']))

        # Do we predict the same as the output?
        if (result >= 0) == (d['output'] >= 0):
            correct += 1

    return correct

Finally we write a main method to read in the data, build our linear model, and test its efficacy.

from pprint import pprint

if __name__ == '__main__':
    # Specs for this input file
    INPUT_FILE_NAME = 'cancer1.dt'
    TRAIN_SIZE = 350

    train_data, test_data = read_proben1_cancer_data(
        INPUT_FILE_NAME,
        TRAIN_SIZE
    )

    # Add the offset variable to each of our data records
    for data_set in [train_data, test_data]:
        for row in data_set:
            row['input'] = [1] + row['input']

    coefficients = train_linear_model(train_data)
    print('coefficients:')
    pprint(coefficients)

    # Print % of correct predictions for each data set
    correct = count_correct(train_data, coefficients)
    print(
        '%s / %s = %.02f%% correct on training set' % (
            correct, len(train_data),
            100 * float(correct) / len(train_data)
        )
    )

    correct = count_correct(test_data, coefficients)
    print(
        '%s / %s = %.02f%% correct on testing set' % (
            correct, len(test_data),
            100 * float(correct) / len(test_data)
        )
    )

Results

The result of running this model against the cancer1.dt data set is:

coefficients:
[1.4072882449702786,
 -0.14014055927954652,
 -0.6239513714263405,
 -0.26727681774258882,
 0.067107753841131157,
 -0.28300216102808429,
 -1.0355594670918404,
 -0.22774451038152174,
 -0.69871243677663608,
 -0.072575089848659444]
328 / 350 = 93.71% correct on training set
336 / 349 = 96.28% correct on testing set

The accuracy is pretty good here against both the training and testing sets, so this particular model generalizes well. This is about the simplest model we can implement for data fitting, and we’ll get to more complicated ones later, but it’s nice to see we can do so well so quickly. The coefficients correspond to using a function of this form, rounding off to three decimal places:

$$ \begin{align*} f(x) =\ &1.407 - 0.140 x_1 - 0.624 x_2 - 0.267 x_3 + 0.067 x_4 - \\ &0.283 x_5 - 1.037 x_6 - 0.228 x_7 - 0.699 x_8 - 0.073 x_9 \end{align*} $$

🐍 Monte Carlo Simulation in Python https://ryanjoneil.dev/posts/2009-10-08-monte-carlo-simulation-in-python/ Thu, 08 Oct 2009 00:00:00 +0000 https://ryanjoneil.dev/posts/2009-10-08-monte-carlo-simulation-in-python/ A quick introduction to writing and interpreting Monte Carlo simulations in Python Note: This post was updated to work with Python 3.

One of the most useful tools one learns in an Operations Research curriculum is Monte Carlo Simulation. Its utility lies in its simplicity: one can learn vital information about nearly any process, be it deterministic or stochastic, without wading through the grunt work of finding an analytical solution. It can be used for off-the-cuff estimates or as a proper scientific tool. All one needs to know is how to simulate a given process and its appropriate probability distributions and parameters if that process is stochastic.

Here’s how it works:

  • Construct a simulation that, given input values, returns a value of interest. This could be a pure quantity, like time spent waiting for a bus, or a boolean indicating whether or not a particular event occurs.
  • Run the simulation a (usually large) number of times, each time with randomly generated input values, and record the output values.
  • Compute sample mean and variance of the output values.

In the case of time spent waiting for a bus, the sample mean and variance are estimators of the mean and variance of one’s wait time. In the boolean case, they estimate the probability that the given event will occur.

One can think of Monte Carlo Simulation as throwing darts. Say you want to find the area under a curve without integrating. All you must do is draw the curve on a wall and throw darts at it randomly. After you’ve thrown enough darts, the area under the curve can be approximated using the percentage of darts that end up under the curve times the total area.
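The dart-throwing picture is easy to make concrete. Here’s a minimal sketch estimating the area under f(x) = x² on [0, 1], whose true value is 1/3:

```python
import random

random.seed(42)
DARTS = 100_000

# Throw darts uniformly at the unit square and count those landing
# under the curve y = x**2.
under = 0
for _ in range(DARTS):
    x, y = random.random(), random.random()
    if y < x ** 2:
        under += 1

# Fraction under the curve times the square's area (1) estimates 1/3.
print('area ~= %.3f' % (under / DARTS))
```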

This technique is often performed using a spreadsheet, but that can be a bit clunky and may make more complex simulations difficult. I’d like to spend a minute showing how it can be done in Python. Consider the following scenario:

Passengers for a train arrive according to a Poisson process with a mean of 100 per hour. The time until the next train arrives is exponentially distributed with a rate of 5 per hour. How many passengers will be aboard the train?

We can simulate this using the fact that a Poisson process can be represented as a string of events occurring with exponential inter-arrival times. We use the sim() function below to generate the number of passengers for random instances of the problem. We then compute sample mean and variance for these values.

import random

PASSENGERS = 100.0
TRAINS     =   5.0
ITERATIONS = 10000

def sim():
    passengers = 0.0

    # Determine when the train arrives
    train = random.expovariate(TRAINS)

    # Count the number of passenger arrivals before the train
    now = 0.0
    while True:
        now += random.expovariate(PASSENGERS)
        if now >= train:
            break
        passengers += 1.0

    return passengers

if __name__ == '__main__':
    output = [sim() for _ in range(ITERATIONS)]

    total = sum(output)
    mean = total / len(output)

    sum_sqrs = sum(x*x for x in output)
    variance = (sum_sqrs - total * mean) / (len(output) - 1)

    print('E[X] = %.02f' % mean)
    print('Var(X) = %.02f' % variance)
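As a sanity check on the simulation (my own derivation, not from the original post), conditioning on the train’s arrival time $T \sim \text{Exp}(5)$ gives closed forms for both moments:

$$ \begin{align*} E[X] &= E[E[X \mid T]] = 100 \, E[T] = 100/5 = 20\\ \text{Var}(X) &= E[\text{Var}(X \mid T)] + \text{Var}(E[X \mid T]) = 100 \, E[T] + 100^2 \, \text{Var}(T) = 20 + 400 = 420 \end{align*} $$

The printed E[X] and Var(X) should land near 20 and 420.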