<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://azev77.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://azev77.github.io/" rel="alternate" type="text/html" /><updated>2024-12-25T22:13:45-08:00</updated><id>https://azev77.github.io/feed.xml</id><title type="html">Home</title><subtitle>Assistant Professor of Finance, Fowler College of Business</subtitle><author><name>Albert Alex Zevelev</name></author><entry><title type="html">Future Post Machine Learning for Causal Inference: Synthetic Control and Double Machine Learning</title><link href="https://azev77.github.io/posts/2021/04/ML-Causality/" rel="alternate" type="text/html" title="Future Post Machine Learning for Causal Inference: Synthetic Control and Double Machine Learning" /><published>2021-08-11T00:00:00-07:00</published><updated>2021-08-11T00:00:00-07:00</updated><id>https://azev77.github.io/posts/2021/04/blog-post-4</id><content type="html" xml:base="https://azev77.github.io/posts/2021/04/ML-Causality/"><![CDATA[<p>This post is inspired by Frank Diebold’s</p>

<ul>
  <li><a href="https://fxdiebold.blogspot.com/2017/01/all-of-machine-learning-in-one.html">All of Machine Learning in One Expression</a></li>
  <li><a href="https://fxdiebold.blogspot.com/2016/10/machine-learning-vs-econometrics-i.html">ML vs E 1</a></li>
  <li><a href="https://fxdiebold.blogspot.com/2016/10/machine-learning-vs-econometrics-ii.html">ML vs E 2</a></li>
  <li><a href="https://fxdiebold.blogspot.com/2016/10/machine-learning-vs-econometrics-iii.html">ML vs E 3</a></li>
  <li><a href="https://fxdiebold.blogspot.com/2016/10/machine-learning-vs-econometrics-iv.html">ML vs E 4</a></li>
  <li><a href="https://fxdiebold.blogspot.com/2017/02/machine-learning-and-econometrics-v.html">ML vs E 5</a></li>
  <li><a href="https://fxdiebold.blogspot.com/2017/03/machine-learning-and-econometrics-vi.html">ML vs E 6</a></li>
  <li><a href="https://fxdiebold.blogspot.com/2017/03/ml-and-metrics-vii-cross-section-non.html">ML vs E 7</a></li>
</ul>]]></content><author><name>Albert Alex Zevelev</name></author><category term="Causality" /><category term="Econometrics" /><category term="Machine Learning" /><category term="Statistics" /><summary type="html"><![CDATA[This post is inspired by Frank Diebold’s]]></summary></entry><entry><title type="html">Future Post Machine Learning in Julia using MLJ.jl</title><link href="https://azev77.github.io/posts/2021/04/blog-post-3/" rel="alternate" type="text/html" title="Future Post Machine Learning in Julia using MLJ.jl" /><published>2021-07-11T00:00:00-07:00</published><updated>2021-07-11T00:00:00-07:00</updated><id>https://azev77.github.io/posts/2021/04/blog-post-3</id><content type="html" xml:base="https://azev77.github.io/posts/2021/04/blog-post-3/"><![CDATA[]]></content><author><name>Albert Alex Zevelev</name></author><category term="cool posts" /><category term="category1" /><category term="category2" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Future Post Random Variables in Julia compared to MATLAB/R/STATA/Mathematica/Python</title><link href="https://azev77.github.io/posts/2021/04/blog-post-2/" rel="alternate" type="text/html" title="Future Post Random Variables in Julia compared to MATLAB/R/STATA/Mathematica/Python" /><published>2021-06-11T00:00:00-07:00</published><updated>2021-06-11T00:00:00-07:00</updated><id>https://azev77.github.io/posts/2021/04/blog-post-2</id><content type="html" xml:base="https://azev77.github.io/posts/2021/04/blog-post-2/"><![CDATA[<p>This post compares the way random variables are handled in Julia/MATLAB/R/STATA/Mathematica/Python.
It was inspired by Bruce Hansen’s 
<a href="https://www.ssc.wisc.edu/~bhansen/probability/">recent textbook</a>
which compares statistical commands in Matlab/R/STATA on 
page <a href="https://www.ssc.wisc.edu/~bhansen/probability/Intro2Metrics.pdf#page=114">114</a>. 
This post will focus on the main methods for working with random variables in a language: 
e.g. 
<a href="https://github.com/JuliaStats/Distributions.jl">Distributions.jl</a> is the flagship Julia package for random variables, 
<a href="https://www.mathworks.com/help/stats/probability-distributions-1.html">MATLAB’s</a> internal distributions, 
<a href="https://cran.r-project.org/web/views/Distributions.html">Base R</a>,
<a href="https://www.stata.com/manuals/fnstatisticalfunctions.pdf">Base STATA</a>,
<a href="https://reference.wolfram.com/language/guide/RandomVariables.html">Mathematica</a>,
and
Python’s <a href="https://docs.scipy.org/doc/scipy/reference/stats.html">SciPy</a>.</p>

<h1 id="1-tables-comparing-syntax">1: Tables Comparing Syntax</h1>

<p>CDF:</p>

<table>
  <thead>
    <tr>
      <th>RV</th>
      <th>Julia</th>
      <th>MATLAB</th>
      <th>Base R</th>
      <th>STATA</th>
      <th>Mathematica</th>
      <th>Python <a href="https://docs.scipy.org/doc/scipy/reference/stats.html">SciPy</a></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>$N(0,1)$</td>
      <td>cdf(Normal(0,1),x)</td>
      <td>normcdf(x)</td>
      <td>pnorm(x)</td>
      <td>normal(x)</td>
      <td>CDF[NormalDistribution[0, 1],x]</td>
      <td>norm.cdf(x)</td>
    </tr>
    <tr>
      <td>$\chi^2_{r}$</td>
      <td>cdf(Chisq(r),x)</td>
      <td>chi2cdf(x,r)</td>
      <td>pchisq(x,r)</td>
      <td>chi2(r,x)</td>
      <td>CDF[ChiSquareDistribution[r],x]</td>
      <td>chi2.cdf(x, r)</td>
    </tr>
    <tr>
      <td>$t_r$</td>
      <td>cdf(TDist(r),x)</td>
      <td>tcdf(x,r)</td>
      <td>pt(x,r)</td>
      <td>1-ttail(r,x)</td>
      <td>CDF[StudentTDistribution[r],x]</td>
      <td>t.cdf(x, r)</td>
    </tr>
    <tr>
      <td>$F_{r,k}$</td>
      <td>cdf(FDist(r,k),x)</td>
      <td>fcdf(x,r,k)</td>
      <td>pf(x,r,k)</td>
      <td>F(r,k,x)</td>
      <td>CDF[FRatioDistribution[r,k],x]</td>
      <td>f.cdf(x, r, k)</td>
    </tr>
    <tr>
      <td>$D(\theta)$</td>
      <td>cdf(D(θ),x)</td>
      <td>Dcdf(x,θ)</td>
      <td>pD(x,θ)</td>
      <td>?</td>
      <td>CDF[D[θ],x]</td>
      <td>D.cdf(x,θ)</td>
    </tr>
  </tbody>
</table>
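<p>A quick sanity check of the SciPy column (a sketch; assumes SciPy is installed):</p>

```python
# Spot-check the SciPy calls from the CDF table above.
from scipy.stats import norm, chi2, t, f

x, r, k = 1.96, 5, 3
print(norm.cdf(x))     # N(0,1) CDF; norm.cdf(1.96) is roughly 0.975
print(chi2.cdf(x, r))  # chi-square with r degrees of freedom
print(t.cdf(x, r))     # Student t with r degrees of freedom
print(f.cdf(x, r, k))  # F with (r, k) degrees of freedom
```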

<p>Inverse Probabilities (quantiles):</p>

<table>
  <thead>
    <tr>
      <th>RV</th>
      <th>Julia</th>
      <th>MATLAB</th>
      <th>Base R</th>
      <th>STATA</th>
      <th>Mathematica</th>
      <th>Python <a href="https://docs.scipy.org/doc/scipy/reference/stats.html">SciPy</a></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>$N(0,1)$</td>
      <td>quantile(Normal(0,1),p)</td>
      <td>norminv(p)</td>
      <td>qnorm(p)</td>
      <td>invnormal(p)</td>
      <td>Quantile[NormalDistribution[],p]</td>
      <td>norm.ppf(p)</td>
    </tr>
    <tr>
      <td>$\chi^2_{r}$</td>
      <td>quantile(Chisq(r),p)</td>
      <td>chi2inv(p,r)</td>
      <td>qchisq(p,r)</td>
      <td>invchi2(r,p)</td>
      <td>Quantile[ChiSquareDistribution[r],p]</td>
      <td>chi2.ppf(p, r)</td>
    </tr>
    <tr>
      <td>$t_r$</td>
      <td>quantile(TDist(r),p)</td>
      <td>tinv(p,r)</td>
      <td>qt(p,r)</td>
      <td>invttail(r,1-p)</td>
      <td>Quantile[StudentTDistribution[r],p]</td>
      <td>t.ppf(p, r)</td>
    </tr>
    <tr>
      <td>$F_{r,k}$</td>
      <td>quantile(FDist(r,k),p)</td>
      <td>finv(p,r,k)</td>
      <td>qf(p,r,k)</td>
      <td>invF(r,k,p)</td>
      <td>Quantile[FRatioDistribution[r,k],p]</td>
      <td>f.ppf(p, r, k)</td>
    </tr>
    <tr>
      <td>$D(\theta)$</td>
      <td>quantile(D(θ),p)</td>
      <td>Dinv(p,θ)</td>
      <td>qD(p,θ)</td>
      <td>invD(p,θ)</td>
      <td>Quantile[D[θ],p]</td>
      <td>D.ppf(p,θ)</td>
    </tr>
  </tbody>
</table>
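<p>SciPy calls the quantile function <code class="language-plaintext highlighter-rouge">ppf</code> (“percent point function”); it is the inverse of the CDF (a sketch):</p>

```python
# ppf (quantile) inverts cdf: cdf(ppf(p)) == p.
from scipy.stats import norm, chi2

p, r = 0.95, 5
z = norm.ppf(p)      # roughly 1.645, the usual one-sided 5% critical value
x = chi2.ppf(p, r)   # 95th percentile of chi-square with 5 dof
print(z, x)
```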

<p>Other Properties:</p>

<table>
  <thead>
    <tr>
      <th>Property</th>
      <th>Julia</th>
      <th>MATLAB</th>
      <th>Base R</th>
      <th>STATA</th>
      <th>Mathematica</th>
      <th>Python <a href="https://docs.scipy.org/doc/scipy/reference/stats.html">SciPy</a></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>cdf</td>
      <td>cdf(D(θ),x)</td>
      <td>Dcdf(x,θ)</td>
      <td>pD(x,θ)</td>
      <td>?</td>
      <td>CDF[D[θ],x]</td>
      <td>D.cdf(x,θ)</td>
    </tr>
    <tr>
      <td>pdf/pmf</td>
      <td>pdf(D(θ),x)</td>
      <td>Dpdf(x,θ)</td>
      <td>dD(x,θ)</td>
      <td>?</td>
      <td>PDF[D[θ],x]</td>
      <td>D.pdf(x,θ)</td>
    </tr>
    <tr>
      <td>quantile</td>
      <td>quantile(D(θ),p)</td>
      <td>Dinv(p,θ)</td>
      <td>qD(p,θ)</td>
      <td>invD(p,θ)</td>
      <td>Quantile[D[θ],p]</td>
      <td>D.ppf(p,θ)</td>
    </tr>
    <tr>
      <td>random</td>
      <td>rand(D(θ),N)</td>
      <td>Drnd(θ,N)</td>
      <td>rD(N,θ)</td>
      <td>rD(θ)</td>
      <td>RandomVariate[D[θ],N]</td>
      <td>D.rvs(θ,size=N)</td>
    </tr>
    <tr>
      <td>mean</td>
      <td>mean(D(θ))</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>Mean[D[θ]]</td>
      <td>D.mean(θ)</td>
    </tr>
    <tr>
      <td>entropy</td>
      <td>entropy(D(θ))</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>D.entropy(θ)</td>
    </tr>
    <tr>
      <td>fit</td>
      <td>fit(D, data)</td>
      <td>fitdist(data,'D')</td>
      <td>-</td>
      <td>-</td>
      <td>FindDistributionParameters[data,D]</td>
      <td>D.fit(data)</td>
    </tr>
  </tbody>
</table>
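<p>On the SciPy side, a “frozen” distribution object collects the properties above in one place, and each family has an MLE <code class="language-plaintext highlighter-rouge">fit</code> method (a sketch):</p>

```python
# Frozen SciPy distribution: one object, many properties.
from scipy.stats import norm

d = norm(0, 1)                  # frozen N(0,1)
print(d.cdf(1.0), d.pdf(1.0))   # cdf / pdf
print(d.ppf(0.5), d.mean())     # quantile / mean
print(d.entropy())              # differential entropy of N(0,1)
sample = norm.rvs(size=500, random_state=0)  # random draws
mu, sigma = norm.fit(sample)    # MLE returns (loc, scale) estimates
```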

<h1 id="2-random-variables-as-types">2: Random Variables as Types</h1>
<p>A key distinction between the way the packages above handle random variables
is that in 
Julia and Mathematica
<a href="https://computationalthinking.mit.edu/Spring21/random_variables_as_types/">a random variable is itself a type</a>. 
In R, by contrast, you cannot refer to the underlying random variable itself; you can only compute its properties 
with functions such as <code class="language-plaintext highlighter-rouge">pchisq(x,r)</code>.</p>

<p>General syntax in Julia:
<br />
Distributions.jl distinguishes between a Random Variable’s parameters and property variables. 
A random variable is a <code class="language-plaintext highlighter-rouge">type</code> such as <code class="language-plaintext highlighter-rouge">Chisq(r)</code> or <code class="language-plaintext highlighter-rouge">D(θ)</code>. 
A property of a random variable, such as its CDF or mean, is (typically) a function
which takes the random variable as its argument, along with any necessary property-specific variables.
<br />
Note: some properties don’t have any arguments such as <code class="language-plaintext highlighter-rouge">mean(D(θ))</code>.
<br />
Note: the <code class="language-plaintext highlighter-rouge">fit(D, data)</code> function requires a distribution type without parameters <code class="language-plaintext highlighter-rouge">D</code> as opposed to <code class="language-plaintext highlighter-rouge">D(θ)</code>.</p>

<h1 id="3-random-variables-in-distributionsjl">3: Random Variables in Distributions.jl</h1>
<p>In general, a random-variables package does three things:</p>
<ul>
  <li><strong>Creates</strong> random variables: built-in/fit/transform</li>
  <li><strong>Samples</strong> random variables</li>
  <li><strong>Computes</strong> properties: probabilities/moments/cumulants/entropies, etc.</li>
</ul>

<p>Here is an overview of current features:</p>
<ol>
  <li>Creating Random Variables:
    <ul>
      <li>Built in random variables: <code class="language-plaintext highlighter-rouge">D(θ)</code>, <code class="language-plaintext highlighter-rouge">Chisq(r)</code>, <code class="language-plaintext highlighter-rouge">FDist(r,k)</code> etc</li>
      <li>Combining and transforming random variables:</li>
    </ul>
    <ul>
      <li><strong>Mixture</strong> models: <code class="language-plaintext highlighter-rouge">MixtureModel([Normal(0,1),Cauchy(0,1)], [0.5,0.5])</code></li>
      <li><strong>Truncated</strong> random variables: <code class="language-plaintext highlighter-rouge">Truncated(Cauchy(0,1), 0.25, 1.8)</code></li>
      <li><strong>Convolution</strong> of random variables: <code class="language-plaintext highlighter-rouge">convolve(Cauchy(0,1), Cauchy(5,2))</code></li>
      <li><strong>Cartesian product</strong> of random variables: <code class="language-plaintext highlighter-rouge">product_distribution([Normal(),Cauchy()])</code></li>
      <li>Other packages for creating random variables: <a href="https://github.com/mmikhasenko/AlgebraPDF.jl">AlgebraPDF.jl</a> etc</li>
    </ul>
  </li>
  <li>Sampling: <code class="language-plaintext highlighter-rouge">rand(D(θ),N)</code>, <code class="language-plaintext highlighter-rouge">rand(Cauchy(0,1), 100)</code></li>
  <li>Fitting:
    <ul>
      <li>parametric: <code class="language-plaintext highlighter-rouge">fit(D, data)</code></li>
      <li>non-parametric: <code class="language-plaintext highlighter-rouge">fit(Histogram, data)</code></li>
    </ul>
  </li>
  <li>Other properties: <code class="language-plaintext highlighter-rouge">property(D(θ))</code> or <code class="language-plaintext highlighter-rouge">property(D(θ),x)</code> 
where θ is the vector of distribution parameters and x is the vector of <code class="language-plaintext highlighter-rouge">property</code> variables.
    <ul>
      <li>example: <code class="language-plaintext highlighter-rouge">d=LogNormal()</code></li>
      <li><code class="language-plaintext highlighter-rouge">mean(d), median(d), mode(d), var(d), std(d)</code></li>
      <li><code class="language-plaintext highlighter-rouge">skewness(d), kurtosis(d), entropy(d)</code></li>
      <li><code class="language-plaintext highlighter-rouge">pdf(d, 2), cdf(d, 2), quantile(d, .9), gradlogpdf(d, 2)</code></li>
      <li>Most properties above are implemented in closed form. 
There are proof-of-concept <a href="https://github.com/JuliaStats/Distributions.jl/blob/master/src/functionals.jl">tools</a> for numerical expectations, etc.<br />
<code class="language-plaintext highlighter-rouge">Distributions.expectation(LogNormal(), cos)</code> computes $E[\cos(X)]$ where $X \sim \text{LogNormal}(0,1)$.</li>
    </ul>
  </li>
</ol>
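<p>The numerical-expectation tool has a direct quadrature analogue. Here is a hypothetical SciPy sketch of $E[\cos(X)]$ for $X \sim \text{LogNormal}(0,1)$ (in SciPy’s parameterization the shape <code class="language-plaintext highlighter-rouge">s</code> is the σ of the underlying normal):</p>

```python
# E[cos(X)] for X ~ LogNormal(0,1), by integrating cos(x) * pdf(x).
# Mirrors Distributions.expectation(LogNormal(), cos) in Julia.
from math import cos
from scipy.integrate import quad
from scipy.stats import lognorm

val, err = quad(lambda x: cos(x) * lognorm.pdf(x, s=1.0), 0, 50, limit=200)
print(val)  # the tail beyond x = 50 is negligible for this density
```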

<h1 id="4-future-and-other-and-general-tranformations-of-random-variables">4: Future Work: General Transformations of Random Variables</h1>
<p>Numerical vs Symbolic:<br />
<img src="https://user-images.githubusercontent.com/7883904/114791686-e5042580-9d54-11eb-863b-3a6430e93d9b.png" alt="image" />
<br />
<img src="https://user-images.githubusercontent.com/7883904/114791715-f3524180-9d54-11eb-8be3-6b55ca9ebcf3.png" alt="image" />
<br /> <br />
I discussed the following examples on <a href="https://discourse.julialang.org/t/define-a-distribution-from-a-given-distribution/48220/10?u=albert_zevelev">Discourse</a>.
<br />
Distributions.jl currently does not support transformations of random variables.
Mathematica can handle a transformation of a distribution when it can solve the problem symbolically.
<br />
<img src="https://user-images.githubusercontent.com/7883904/114792182-d79b6b00-9d55-11eb-8d3d-313ac9ca9d90.png" alt="image" />
<br />
Now consider the same distribution with symbolic parameters <code class="language-plaintext highlighter-rouge">BetaDistribution[α,β]</code>
<br />
<img src="https://user-images.githubusercontent.com/7883904/114792411-555f7680-9d56-11eb-8b66-376563f7e3ba.png" alt="image" /></p>

<p>From the Distributions.jl <a href="https://arxiv.org/pdf/1907.08611.pdf">paper</a>, the type hierarchy covers:
<br /></p>
<ol>
  <li>Sampling interface
<br /></li>
  <li>Distribution interface and types
<br />
R equivalent <code class="language-plaintext highlighter-rouge">p-d-q-r</code> in Julia:
<img src="https://user-images.githubusercontent.com/7883904/114790686-298ec180-9d53-11eb-8016-ca515a33d921.png" alt="image" /></li>
  <li>Distribution fitting and estimation
<br />
parametric: <code class="language-plaintext highlighter-rouge">fit(D, data)</code>
<br />
non-parametric: <code class="language-plaintext highlighter-rouge">fit(Histogram, data)</code>
<br /></li>
  <li>Modeling mixtures of distributions
<br /></li>
</ol>

<p>The tables above add Julia, Mathematica, and Python to Hansen’s Matlab/R/STATA comparison. Related resources:</p>

<p>Python: <a href="https://github.com/QuantEcon/rvlib">https://github.com/QuantEcon/rvlib</a>
<br />
R: <a href="https://github.com/alan-turing-institute/distr6">https://github.com/alan-turing-institute/distr6</a>
<br />
Compare syntax: https://hyperpolyglot.org/scripting</p>]]></content><author><name>Albert Alex Zevelev</name></author><summary type="html"><![CDATA[This post compares the way random variables are handled in Julia/MATLAB/R/STATA/Mathematica/Python. It was inspired by Bruce Hansen’s recent textbook which compares statistical commands in Matlab/R/STATA on page 114. This post will focus on the main methods for working with random variables in a language: e.g. Distributions.jl is the flagship Julia package for random variables, MATLAB’s internal distributions, Base R, Base STATA, Mathematica, and Python’s SciPy.]]></summary></entry><entry><title type="html">Simpson’s Paradox is a Special Case of Omitted Variable Bias</title><link href="https://azev77.github.io/posts/2021/04/Simpson-OVB/" rel="alternate" type="text/html" title="Simpson’s Paradox is a Special Case of Omitted Variable Bias" /><published>2021-04-11T00:00:00-07:00</published><updated>2021-04-11T00:00:00-07:00</updated><id>https://azev77.github.io/posts/2021/04/blog-post-1</id><content type="html" xml:base="https://azev77.github.io/posts/2021/04/Simpson-OVB/"><![CDATA[<p>The goal of this post is to illustrate a point made in a 
recent <a href="https://twitter.com/AmitEcon/status/1368990015536119813?s=20">tweet</a> 
by Amit Gandhi 
that <a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox">Simpson’s Paradox</a> 
is a special case of 
<a href="https://en.wikipedia.org/wiki/Omitted-variable_bias">omitted variable bias</a>.</p>

<p>Let’s start with some definitions:
<br />
<strong>Simpson’s Paradox</strong>: a statistical phenomenon in which an association between two variables in a population emerges, disappears, or reverses when the population is divided into subpopulations.
<br />
<strong>Omitted Variable Bias (OVB)</strong>: when a statistical model leaves out one or more variables that are correlated with both the treatment and the outcome. 
<br />
<b>Case Fatality Rate (CFR)</b>: 
the proportion of people who die from a specified disease among all individuals diagnosed with the disease over a certain period of time.</p>

<h1 id="example-covid-19-in-china-versus-italy">Example: COVID-19 in China versus Italy</h1>
<p>Let’s use an example from 
<a href="https://www.youtube.com/watch?v=t-Ci3FosqZs">How Simpson’s paradox explains weird COVID19 statistics</a>. 
(This example is for illustrative purposes only. This post is about interpreting statistics, not COVID-19). 
The video compares those diagnosed with COVID-19 in China and Italy between March and May 2020. 
<br />
<u>CFR by country</u>: people infected with COVID-19 were more likely to die in Italy than China. 
<br />
<u>CFR by country-age group</u>: at each age bracket, people infected with COVID-19 were more likely to die in China than Italy.</p>

<h2 id="simulate-data">Simulate Data</h2>
<p>Let’s illustrate this with a simulation in <a href="https://julialang.org/">the Julia Language</a>.
<br />
The variables are defined in the table below:
<br /></p>

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mtable columnalign="left left left" columnspacing="1em" rowspacing="4pt" columnlines="solid solid" rowlines="solid none" frame="solid">
    <mtr>
      <mtd>
        <mtext>Outcome:&#xA0;</mtext>
        <msub>
          <mi>Y</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
      </mtd>
      <mtd>
        <mtext>Treatment:&#xA0;</mtext>
        <msub>
          <mi>X</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
      </mtd>
      <mtd>
        <mtext>Confounder:&#xA0;</mtext>
        <msub>
          <mi>Z</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
      </mtd>
    </mtr>
    <mtr>
      <mtd>
        <msub>
          <mi>Y</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
        <mo>&#x2261;</mo>
        <mn>0</mn>
        <mtext>&#xA0;if person i survives</mtext>
      </mtd>
      <mtd>
        <msub>
          <mi>X</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
        <mo>&#x2261;</mo>
        <mn>0</mn>
        <mtext>&#xA0;if person i is in China</mtext>
      </mtd>
      <mtd>
        <msub>
          <mi>Z</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
        <mo>&#x2261;</mo>
        <mn>0</mn>
        <mtext>&#xA0;if person i's age&#xA0;</mtext>
        <mo>&#x2264;</mo>
        <mn>59</mn>
      </mtd>
    </mtr>
    <mtr>
      <mtd>
        <msub>
          <mi>Y</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
        <mo>&#x2261;</mo>
        <mn>1</mn>
        <mtext>&#xA0;if person i dies</mtext>
      </mtd>
      <mtd>
        <msub>
          <mi>X</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
        <mo>&#x2261;</mo>
        <mn>1</mn>
        <mtext>&#xA0;if person i is in Italy</mtext>
      </mtd>
      <mtd>
        <msub>
          <mi>Z</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
        <mo>&#x2261;</mo>
        <mn>1</mn>
        <mtext>&#xA0;if person i's age&#xA0;</mtext>
        <mo>&gt;</mo>
        <mn>59</mn>
      </mtd>
    </mtr>
  </mtable>
</math>

<p>Let’s assume the true data generating process (DGP) is: 
$Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \varepsilon_{i}$
<br />
Under the true DGP, $\text{CFR}\left(X_i, Z_i \right) = P\left(Y_i =1 \mid X_i, Z_i \right) = E\left[Y_i \mid X_i, Z_i \right]$. 
<br /><br />
Assume $\beta_{0}=10, \beta_{xy} = -5, \beta_{zy} = 10$ (coefficients are in %). 
<br />
China-Young: $\text{CFR}\left(0, 0\right) = P\left(Y_i =1 | X_i=0, Z_i=0 \right) = \beta_{0} = 10\%$
<br />
China-Old: $\text{CFR}\left(0, 1\right) = P\left(Y_i =1 | X_i=0, Z_i=1 \right) = \beta_{0} + \beta_{zy} = 20\%$
<br />
Italy-Young: $\text{CFR}\left(1, 0\right) = P\left(Y_i =1 | X_i=1, Z_i=0 \right) = \beta_{0} + \beta_{xy} = 5\%$
<br />
Italy-Old: $\text{CFR}\left(1, 1\right) = P\left(Y_i =1 | X_i=1, Z_i=1 \right) = \beta_{0} + \beta_{xy} + \beta_{zy} = 15\%$
<br /><br />
Let’s generate artificial data consistent with the DGP. 
<br />
Suppose we have N=200 observations (half China, half Italy). 
<br />
Suppose 80% of China’s population is young $Z_{i} =0$ and 20% is old $Z_{i} = 1$.
<br />
Suppose 20% of Italy’s population is young $Z_{i} =0$ and 80% is old $Z_{i} = 1$.</p>
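<p>The country-level CFRs implied by these age shares are simple weighted averages, which already previews the reversal (a quick check, here in Python):</p>

```python
# Country CFR = age-share-weighted average of the age-group CFRs (in %).
cfr_china = 0.8 * 10 + 0.2 * 20  # young share * 10% + old share * 20%
cfr_italy = 0.2 * 5 + 0.8 * 15   # young share *  5% + old share * 15%
print(cfr_china, cfr_italy)      # 12.0 13.0
```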

<figure class="highlight"><pre><code class="language-julia" data-lang="julia">  <span class="k">using</span> <span class="n">DataFrames</span><span class="x">,</span> <span class="n">Plots</span><span class="x">,</span> <span class="n">Statistics</span>
  <span class="n">N</span> <span class="o">=</span> <span class="mi">200</span><span class="x">;</span> <span class="c">#200 obs = 100 in China + 100 in Italy.</span>
  <span class="n">β_0</span> <span class="o">=</span> <span class="mf">10.0</span><span class="x">;</span> <span class="n">β_Italy</span> <span class="o">=</span> <span class="o">-</span><span class="mf">5.0</span><span class="x">;</span> <span class="n">β_Age</span> <span class="o">=</span> <span class="mf">10.0</span><span class="x">;</span>
  <span class="c">#</span>
  <span class="n">df</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="x">(</span>
      <span class="n">Y</span>        <span class="o">=</span> <span class="x">[</span>
                  <span class="n">ones</span><span class="x">(</span><span class="mi">8</span><span class="x">);</span><span class="n">zeros</span><span class="x">(</span><span class="mi">80</span><span class="o">-</span><span class="mi">8</span><span class="x">);</span>   <span class="c">#China-Young: 8/80 die</span>
                  <span class="n">ones</span><span class="x">(</span><span class="mi">4</span><span class="x">);</span><span class="n">zeros</span><span class="x">(</span><span class="mi">20</span><span class="o">-</span><span class="mi">4</span><span class="x">);</span>   <span class="c">#China-Old:  4/20 die</span>
                  <span class="n">ones</span><span class="x">(</span><span class="mi">1</span><span class="x">);</span><span class="n">zeros</span><span class="x">(</span><span class="mi">20</span><span class="o">-</span><span class="mi">1</span><span class="x">);</span>   <span class="c">#Italy-Young: 1/20 die</span>
                  <span class="n">ones</span><span class="x">(</span><span class="mi">12</span><span class="x">);</span><span class="n">zeros</span><span class="x">(</span><span class="mi">80</span><span class="o">-</span><span class="mi">12</span><span class="x">);</span> <span class="c">#Italy-Old: 12/80 die</span>
                  <span class="x">],</span> 
      <span class="n">Intercept</span> <span class="o">=</span> <span class="n">ones</span><span class="x">(</span><span class="n">N</span><span class="x">),</span> 
      <span class="n">Italy</span>     <span class="o">=</span> <span class="x">[</span><span class="n">zeros</span><span class="x">(</span><span class="mi">100</span><span class="x">);</span> <span class="n">ones</span><span class="x">(</span><span class="mi">100</span><span class="x">)],</span> 
      <span class="n">Age</span>       <span class="o">=</span> <span class="x">[</span><span class="n">zeros</span><span class="x">(</span><span class="mi">80</span><span class="x">);</span><span class="n">ones</span><span class="x">(</span><span class="mi">100</span><span class="o">-</span><span class="mi">80</span><span class="x">);</span> 
                   <span class="n">zeros</span><span class="x">(</span><span class="mi">20</span><span class="x">);</span><span class="n">ones</span><span class="x">(</span><span class="mi">100</span><span class="o">-</span><span class="mi">20</span><span class="x">);],</span>
      <span class="x">)</span>
  <span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">Y</span><span class="x">;</span>    </code></pre></figure>

<h2 id="estimate-cfr-conditional-on-nothingcountrycountry--age">Estimate CFR conditional on: nothing/country/country &amp; age</h2>
<p>1) Let’s estimate the <b>unconditional</b> probability of death from COVID-19 in this data: 
<br /> 
\(Y_{i} = \beta_{0} + \varepsilon_{i}\)</p>

<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">X</span> <span class="o">=</span> <span class="n">hcat</span><span class="x">(</span><span class="n">df</span><span class="o">.</span><span class="n">Intercept</span><span class="x">);</span>
<span class="n">β</span> <span class="o">=</span> <span class="n">X</span> <span class="o">\</span> <span class="n">y</span>   <span class="c"># 12.5%</span>
<span class="n">mean</span><span class="x">(</span><span class="n">y</span><span class="x">)</span>     <span class="c"># 12.5% </span></code></pre></figure>

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mtable columnalign="left left left" columnspacing="1em" rowspacing="4pt" columnlines="solid solid" rowlines="" frame="solid">
    <mtr>
      <mtd>
        <mi>P</mi>
        <mrow data-mjx-texclass="INNER">
          <mo data-mjx-texclass="OPEN">(</mo>
          <mtext>Death from COVID-19</mtext>
          <mo data-mjx-texclass="CLOSE">)</mo>
        </mrow>
        <mo>=</mo>
        <mn>12.5</mn>
        <mi mathvariant="normal">%</mi>
      </mtd>
    </mtr>
  </mtable>
</math>

<p>2) Let’s estimate the probability of death from COVID-19 <b>conditional only on country</b>: 
<br /> 
\(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \varepsilon_{i}\)</p>

<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">X</span> <span class="o">=</span> <span class="n">hcat</span><span class="x">(</span><span class="n">df</span><span class="o">.</span><span class="n">Intercept</span><span class="x">,</span> <span class="n">df</span><span class="o">.</span><span class="n">Italy</span><span class="x">);</span>
<span class="n">β</span> <span class="o">=</span> <span class="n">X</span> <span class="o">\</span> <span class="n">y</span>   
<span class="n">β</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span>         <span class="c"># 12% = CFR in China</span>
<span class="n">β</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span> <span class="o">+</span> <span class="n">β</span><span class="x">[</span><span class="mi">2</span><span class="x">]</span>  <span class="c"># 13% = CFR in Italy</span></code></pre></figure>

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mtable columnalign="left left left" columnspacing="1em" rowspacing="4pt" columnlines="solid solid" rowlines="solid" frame="solid">
    <mtr>
      <mtd>
        <mi>P</mi>
        <mrow data-mjx-texclass="INNER">
          <mo data-mjx-texclass="OPEN">(</mo>
          <mtext>Death from COVID-19&#xA0;</mtext>
          <mrow>
            <mo stretchy="false">|</mo>
          </mrow>
          <mtext>&#xA0;China</mtext>
          <mo data-mjx-texclass="CLOSE">)</mo>
        </mrow>
        <mo>=</mo>
        <mn>12</mn>
        <mi mathvariant="normal">%</mi>
      </mtd>
    </mtr>
    <mtr>
      <mtd>
        <mi>P</mi>
        <mrow data-mjx-texclass="INNER">
          <mo data-mjx-texclass="OPEN">(</mo>
          <mtext>Death from COVID-19&#xA0;</mtext>
          <mrow>
            <mo stretchy="false">|</mo>
          </mrow>
          <mtext>&#xA0;Italy</mtext>
          <mo data-mjx-texclass="CLOSE">)</mo>
        </mrow>
        <mo>=</mo>
        <mn>13</mn>
        <mi mathvariant="normal">%</mi>
      </mtd>
    </mtr>
  </mtable>
</math>
<p><br /></p>

<p>3) Let’s estimate the probability of death from COVID-19 <b>conditional on country and age</b>: 
<br /> 
\(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \varepsilon_{i}\)</p>

<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="n">X</span> <span class="o">=</span> <span class="n">hcat</span><span class="x">(</span><span class="n">df</span><span class="o">.</span><span class="n">Intercept</span><span class="x">,</span> <span class="n">df</span><span class="o">.</span><span class="n">Italy</span><span class="x">,</span> <span class="n">df</span><span class="o">.</span><span class="n">Age</span><span class="x">);</span>
<span class="n">β</span> <span class="o">=</span> <span class="n">X</span> <span class="o">\</span> <span class="n">y</span>   
<span class="n">β</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span>                <span class="c"># 10% = CFR for China-Young</span>
<span class="n">β</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span> <span class="o">+</span> <span class="n">β</span><span class="x">[</span><span class="mi">3</span><span class="x">]</span>         <span class="c"># 20% = CFR for China-Old</span>
<span class="n">β</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span> <span class="o">+</span> <span class="n">β</span><span class="x">[</span><span class="mi">2</span><span class="x">]</span>         <span class="c">#  5% = CFR for Italy-Young</span>
<span class="n">β</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span> <span class="o">+</span> <span class="n">β</span><span class="x">[</span><span class="mi">2</span><span class="x">]</span> <span class="o">+</span> <span class="n">β</span><span class="x">[</span><span class="mi">3</span><span class="x">]</span>  <span class="c"># 15% = CFR for Italy-Old</span></code></pre></figure>
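<p>The gap between regressions (2) and (3) is exactly omitted variable bias: the short-regression slope equals the long-regression slope plus $\beta_{zy}$ times the slope from regressing the omitted $Z_i$ on $X_i$, i.e. $-5\% + 10\% \times 0.6 = +1\%$. A numerical check (a hypothetical Python/NumPy translation of the Julia snippets above):</p>

```python
import numpy as np

# Rebuild the simulated data: first 100 obs China (Italy=0), last 100 Italy.
y = np.r_[np.ones(8), np.zeros(72),   # China-Young: 8/80 die
          np.ones(4), np.zeros(16),   # China-Old:   4/20 die
          np.ones(1), np.zeros(19),   # Italy-Young: 1/20 die
          np.ones(12), np.zeros(68)]  # Italy-Old:  12/80 die
italy = np.r_[np.zeros(100), np.ones(100)]
age = np.r_[np.zeros(80), np.ones(20), np.zeros(20), np.ones(80)]
ones = np.ones(200)

b_long, *_ = np.linalg.lstsq(np.c_[ones, italy, age], y, rcond=None)
b_short, *_ = np.linalg.lstsq(np.c_[ones, italy], y, rcond=None)
delta, *_ = np.linalg.lstsq(np.c_[ones, italy], age, rcond=None)  # Z on X

# OVB identity: short slope = long slope + beta_zy * (slope of Z on X)
# -0.05 + 0.10 * 0.6 = +0.01, i.e. 13% vs 12%.
print(b_short[1], b_long[1] + b_long[2] * delta[1])
```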

<p>Summarize $P\left( \text{Death from COVID-19 } | \text{ Country, Age} \right)$ in the following table:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mtable columnalign="left left left" columnspacing="1em" rowspacing="4pt" columnlines="solid solid" rowlines="solid solid" frame="solid">
    <mtr>
      <mtd></mtd>
      <mtd>
        <mtext>Young&#xA0;</mtext>
        <mo stretchy="false">(</mo>
        <msub>
          <mi>Z</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
        <mo>=</mo>
        <mn>0</mn>
        <mo stretchy="false">)</mo>
      </mtd>
      <mtd>
        <mtext>Old&#xA0;</mtext>
        <mo stretchy="false">(</mo>
        <msub>
          <mi>Z</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
        <mo>=</mo>
        <mn>1</mn>
        <mo stretchy="false">)</mo>
      </mtd>
    </mtr>
    <mtr>
      <mtd>
        <mtext>China&#xA0;</mtext>
        <mo stretchy="false">(</mo>
        <msub>
          <mi>X</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
        <mo>=</mo>
        <mn>0</mn>
        <mo stretchy="false">)</mo>
      </mtd>
      <mtd>
        <mn>10</mn>
        <mi mathvariant="normal">%</mi>
      </mtd>
      <mtd>
        <mn>20</mn>
        <mi mathvariant="normal">%</mi>
      </mtd>
    </mtr>
    <mtr>
      <mtd>
        <mtext>Italy&#xA0;</mtext>
        <mo stretchy="false">(</mo>
        <msub>
          <mi>X</mi>
          <mrow>
            <mi>i</mi>
          </mrow>
        </msub>
        <mo>=</mo>
        <mn>1</mn>
        <mo stretchy="false">)</mo>
      </mtd>
      <mtd>
        <mn>5</mn>
        <mi mathvariant="normal">%</mi>
      </mtd>
      <mtd>
        <mn>15</mn>
        <mi mathvariant="normal">%</mi>
      </mtd>
    </mtr>
  </mtable>
</math>
<p><br /> 
Without conditioning on age, patients in Italy have a <b>1% higher</b> probability of death than in China (13% vs 12%). 
<br /> 
Conditioning on age, patients in Italy have a <b>5% lower</b> probability of death than in China within both age brackets.</p>
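<p>The reversal can be reproduced in a few lines of Julia. The cell CFRs come from the table above; the 80/20 age splits are assumptions chosen to be consistent with the aggregate CFRs (12% for China, 13% for Italy):</p>

```julia
# Rebuild the example at the individual level.
# CFRs per cell are from the table; cell sizes (80/20 splits) are
# assumptions consistent with the aggregate CFRs of 12% and 13%.
cells = [
    (x = 0, z = 0, cfr = 0.10, n = 80),  # China-Young
    (x = 0, z = 1, cfr = 0.20, n = 20),  # China-Old
    (x = 1, z = 0, cfr = 0.05, n = 20),  # Italy-Young
    (x = 1, z = 1, cfr = 0.15, n = 80),  # Italy-Old
]
x = reduce(vcat, [fill(c.x,   c.n) for c in cells])
z = reduce(vcat, [fill(c.z,   c.n) for c in cells])
y = reduce(vcat, [fill(c.cfr, c.n) for c in cells])  # expected death rate per patient

O = fill(1.0, length(x))
β_short = hcat(O, x) \ y       # omits age
β_long  = hcat(O, x, z) \ y    # conditions on age
β_short[2]    # ≈ +0.01: Italy looks deadlier
β_long[2]     # ≈ -0.05: Italy is safer within each age bracket
```

<p>The short regression recovers the misleading +1%, while controlling for age recovers the -5% effect from the table.</p>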

<h1 id="ovb">OVB</h1>
<p>Next we will show how Simpson’s paradox is a special case of OVB. 
<br />
Suppose the true model is:
\(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \nu_{i}\)
<br />
Suppose you omit $Z_{i}$ and instead estimate:
\(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + u_{i} \Rightarrow u_{i} = \beta_{zy} Z_{i} + \nu_{i}\)
<br />
Suppose $X_{i}$ predicts $Z_{i}$:
\(Z_{i} = \delta_{xz} X_{i} + w_{i} \Rightarrow \delta_{xz} = \frac{\sigma_{xz}}{\sigma_{x}^2} = \rho_{xz}\times \frac{\sigma_{z}}{\sigma_{x}}\)
<br />
Denote the OLS estimate (from the equation that omits age) $\hat{\beta}_{xy}$. 
<br />
We have: 
$E\left[ \hat{\beta}_{xy} | X_{i} \right] = \beta_{xy} + \underbrace{\delta_{xz} \beta_{zy}}_{\text{Bias}}$  (derivation below<sup id="fnref:bignote" role="doc-noteref"><a href="#fn:bignote" class="footnote" rel="footnote">1</a></sup>)
<br />
$
\text{Bias} = \delta_{xz} \beta_{zy} = \left( \rho_{xz}\times \frac{\sigma_{z}}{\sigma_{x}} \right)
\times 
\left( \rho_{zy}\times \frac{\sigma_{y}}{\sigma_{z}} \right) = \rho_{xz}\times \rho_{zy} \times \frac{\sigma_{y}}{\sigma_{x}} 
$
<br />
The bias is the product of 
(1) the impact of the treatment on the OV, $\delta_{xz}$, 
and
(2) the impact of the OV on the outcome, $\beta_{zy}$.
<br />
The estimate will be unbiased if either (1) the treatment is uncorrelated w/ the OV ($\delta_{xz}=0$), 
or 
(2) the OV has no effect on the outcome ($\beta_{zy}=0$).</p>
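<p>A quick simulation confirms the formula. The coefficients below are made up for illustration (they echo the signs of the example in this post: a negative true effect and a positive bias); the short-regression slope converges to $\beta_{xy} + \delta_{xz}\beta_{zy}$, not $\beta_{xy}$:</p>

```julia
using Random
Random.seed!(42)

# Assumed coefficients for illustration only.
n = 200_000
βxy, βzy, δxz = -5.0, 10.0, 0.6

x = randn(n)
z = δxz .* x .+ randn(n)                 # the treatment predicts the OV
y = 1.0 .+ βxy .* x .+ βzy .* z .+ randn(n)

O = fill(1.0, n)
b_short = (hcat(O, x) \ y)[2]            # short regression: omits z
δ_hat   = (hcat(O, x) \ z)[2]            # regression of z on x
# OVB formula: E[b_short] = βxy + δxz * βzy = -5 + 0.6 * 10 = 1
```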

<p>Simpson’s reversal occurs 
when the sign of the estimated coefficient switches after including the confounder
(when the bias is large enough and opposite in sign to the true effect): 
$\text{sign}\left( \hat{\beta}_{xy} \right) \neq \text{sign}\left( \beta_{xy} \right)$
$\Leftrightarrow$
$\text{sign}\left( \beta_{xy} + \delta_{xz} \beta_{zy}  \right) \neq \text{sign}\left( \beta_{xy} \right)$.</p>

<p>In our case the true effect is $\beta_{xy} = -5\%$ and the bias is $\delta_{xz} \beta_{zy}=6\%$, which is large enough to cause a reversal: 
<br />
    \(\begin{align*}
    \hat{\beta}_{xy}                         &amp;=  1\%                    &amp; \text{Non-causal effect, estimated when excluding Z}
    \\
    \beta_{xy}                               &amp;=  -5\%                   &amp; \text{Causal effect, estimated when including Z}
    \\
    \delta_{xz} \beta_{zy}                   &amp;= 60\% \times 10\% =6\%   &amp;= \text{Bias}
    \\
    \beta_{xy} + \delta_{xz} \beta_{zy}      &amp;= -5\% + 60\% \times 10\% &amp;= 1\%                     
    \\
    \end{align*}\)</p>
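<p>These numbers can be checked with a bit of arithmetic. The age shares below (20% old in China, 80% old in Italy) are the shares consistent with the aggregate CFRs above:</p>

```julia
# Age shares consistent with the numbers in this post:
# 20% of Chinese patients and 80% of Italian patients are old.
δ_xz = 0.8 - 0.2          # E[Z | Italy] - E[Z | China] = 60%
β_xy = -0.05              # causal effect of Italy on the CFR
β_zy = 0.10               # effect of old age on the CFR
bias  = δ_xz * β_zy       # 6%
β_hat = β_xy + bias       # 1%: the sign flips, Simpson's reversal
```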

<h1 id="levels-of-interpretation">Levels of Interpretation</h1>
<p>Suppose we estimate: \(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \varepsilon_{i}\)
<br /><br />
<strong>Non-causal interpretation</strong> $\hat{\beta}_{xy} = 1\%$: the probability of a diagnosed patient 
dying from COVID-19 is 1% higher in Italy than in China.
<br />
<strong>Assumption 1</strong>: the Chinese and Italian data was correctly measured and reported.
<br />
Note: the assumption required for the non-causal interpretation is relatively mild. 
<br /> <br />
<strong>Causal interpretation</strong> $\hat{\beta}_{xy} = 1\%$: 
if we <strong><em>intervene</em></strong> and move a diagnosed patient from China to Italy,
the probability of the patient dying from COVID-19 will be 1% higher in Italy.
<br />
<strong>Assumption 1</strong>: the Chinese and Italian data was correctly measured and reported.
<br />
<strong>Assumption 2</strong>: the “treatment” (China vs Italy) is uncorrelated with unobserved determinants of survival. 
This is the famous conditional mean independence (CMI) assumption: $E\left[ \varepsilon | X \right] = 0$.
<br />
Note: the identifying assumption (CMI) required for a causal interpretation is very strong. 
In general, treatments are correlated with variables that are also correlated with the outcome. 
In this case the confounder is age: Italy’s population is older than China’s. 
<br /> <br />
Ultimately, each reader can decide how convinced they are by the identifying assumption, and thus how to interpret an estimate. 
Importantly, non-causal estimates are often still very useful in contexts where the goal is to make predictions.</p>

<h1 id="additional-practice">Additional Practice</h1>
<p>The true DGP above assumed the treatment effect was the same across age bins:
\(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \nu_{i}\)
<br /> 
Thus the CFR was $\beta_{xy} = -5\%$ lower for both young and old patients in Italy. 
<br /> 
Suppose there was treatment effect heterogeneity and the true DGP was: 
<br />
\(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \beta_{xzy} X_{i} Z_{i} + \nu_{i}\)
<br />
In this case, estimating the model omitting the interaction effect (omitted non-linearity) 
would also cause OVB.</p>
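<p>To see this, here is a small sketch (all numbers assumed, with a balanced 2x2 design for simplicity): when the true Italy effect differs by age, the additive model’s single coefficient matches neither group’s effect:</p>

```julia
# Omitted non-linearity as OVB. All numbers are assumed.
X = [0, 0, 1, 1]                  # Italy dummy
Z = [0, 1, 0, 1]                  # Old dummy
# Heterogeneous DGP: the Italy effect is -5% for the young
# but only -1% for the old (interaction β_xzy = +4%).
y = 0.10 .- 0.05 .* X .+ 0.10 .* Z .+ 0.04 .* X .* Z

O = fill(1.0, 4)
β_full = hcat(O, X, Z, X .* Z) \ y   # recovers (0.10, -0.05, 0.10, 0.04)
β_add  = hcat(O, X, Z) \ y           # omits the interaction
β_add[2]   # -0.03: matches neither group, a blend of -5% and -1%
```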

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:bignote" role="doc-endnote">
<p>To derive the bias, 
slightly abuse notation by stacking a column of ones and $X_{i}$ into a matrix “X”, 
stacking $\beta_{0}$ and $\beta_{xy}$ into $\beta$, and stacking the $\nu_{i}$ into a vector $\nu$:     <br />
\(\begin{align*}
\hat{\beta} &amp;= (X'X)^{-1} X'Y                                                      \\
            &amp;= (X'X)^{-1} X'(X\beta + Z\beta_{zy} + \nu)                           \\
            &amp;= \beta + (X'X)^{-1} X'Z \beta_{zy} + (X'X)^{-1} X'\nu                \\
\delta_{xz} &amp;\equiv  (X'X)^{-1} X'Z                                                \\
\hat{\beta} &amp;= \beta + \delta_{xz} \beta_{zy} + (X'X)^{-1} X'\nu                   \\
E\left( \hat{\beta} | X \right) &amp;= \beta + \delta_{xz} \beta_{zy} \quad \text{since } E\left[ \nu | X \right] = 0
\end{align*}\) 
<br /> <a href="#fnref:bignote" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Albert Alex Zevelev</name></author><category term="Causality" /><category term="Econometrics" /><category term="Statistics" /><summary type="html"><![CDATA[The goal of this post is to illustrate a point made in a recent tweet by Amit Ghandi that Simpson’s Paradox is a special case of omitted variable bias.]]></summary></entry></feed>