Box-Cox Transformations, or 'How Do I Make This Skewed Distribution More Normal?'

Certain analyses require normally distributed data. But what happens if your data isn't normally distributed?

In [1]:
# the usual imports.
import numpy as np
from scipy.stats import skew, probplot

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
In [2]:
skew_norm_dist = np.random.gamma(4, 2, size=200)
f, ax = plt.subplots(1, 1)
ax.set_title("Skewed Plot")
p = ax.hist(skew_norm_dist)
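
Before trying to fix the skew, it's worth putting a number on it. Here's a quick sketch using the `skew` function we imported above (the exact value will vary from sample to sample); sample skewness is roughly 0 for symmetric data and positive for a right tail.

In [ ]:
# Sample skewness: ~0 for symmetric data, > 0 for a right tail.
# For a gamma(4, 2) sample we expect something near 2/sqrt(4) = 1.
print(skew(skew_norm_dist))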

This skew is not too bad, but we'd like to make it better. One way to transform our data is to raise the values to a power; in other words, $x \mapsto x^{\lambda}$. This is traditionally called the Tukey Transformation (Tukey's ladder of powers). Let's see how that works.

In [3]:
def tukey_transform(data, L):
    """Raise each value in data to the power L."""
    return pow(data, L)
In [4]:
skew_norm_dist = np.random.gamma(4, 2, size=200)
f, axes = plt.subplots(2, 2, figsize=(10,10))

axes[0,0].set_title("Lambda = 0.1")
axes[0,1].set_title("Lambda = 0.3")
axes[1,0].set_title("Lambda = 0.5")
axes[1,1].set_title("Lambda = 0.7")

axes[0,0].hist(tukey_transform(skew_norm_dist, 0.1))
axes[0,1].hist(tukey_transform(skew_norm_dist, 0.3))
axes[1,0].hist(tukey_transform(skew_norm_dist, 0.5))
p = axes[1,1].hist(tukey_transform(skew_norm_dist, 0.7))

Those look a lot more normal! A similar transform (which seems to be more widely known) is the Box-Cox Transform. On the surface, the difference between the Tukey Transform and the Box-Cox Transform is just a matter of shifting and rescaling: it's defined as $x\mapsto \frac{x^{\lambda} - 1}{\lambda}$. One interesting property of the Box-Cox transform is that as $\lambda\rightarrow 0$ it becomes $\log(x)$.

That is, as $\lambda\rightarrow 0$ the Tukey Transform sends every value to $x^0 = 1$ (with the usual convention that $0^0 = 1$), while the Box-Cox transform sends each value to $\log(x)$: expanding $x^{\lambda} = e^{\lambda\log(x)} \approx 1 + \lambda\log(x)$ for small $\lambda$ gives $\frac{x^{\lambda} - 1}{\lambda} \approx \log(x)$.
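
We can sanity-check that limit numerically (a quick sketch; the grid of test values and the tiny $\lambda$ are arbitrary choices):

In [ ]:
# For small lambda, (x**lam - 1)/lam should be numerically
# indistinguishable from log(x).
x = np.linspace(0.5, 20, 5)
lam = 1e-8
print(np.allclose((x**lam - 1) / lam, np.log(x)))  # True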

In [5]:
def box_cox_transform(data, L):
    """Apply the Box-Cox transform (data**L - 1) / L."""
    return (pow(data, L) - 1) / L
In [6]:
skew_norm_dist = np.random.gamma(4, 2, size=200)
f, axes = plt.subplots(2, 2, figsize=(10,10))

axes[0,0].set_title("Lambda = 0.01")
axes[0,1].set_title("Lambda = 0.1")
axes[1,0].set_title("Lambda = 1")
axes[1,1].set_title("Lambda = 10")

axes[0,0].hist(box_cox_transform(skew_norm_dist, 0.01))
axes[0,1].hist(box_cox_transform(skew_norm_dist, 0.1))
axes[1,0].hist(box_cox_transform(skew_norm_dist, 1))
p = axes[1,1].hist(box_cox_transform(skew_norm_dist, 10))

Some of these are better than others; let's look at probability plots to see which is closest to normal.
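
A quick note on `probplot`'s output, since we'll index into it below: with the default `fit=True`, scipy returns the theoretical and ordered sample quantiles as one pair, and the fitted line's slope, intercept, and $r$ value as another.

In [ ]:
# probplot returns ((osm, osr), (slope, intercept, r)): the theoretical
# quantiles, the ordered sample values, and the least-squares line fit
# through the plot.
(osm, osr), (slope, intercept, r) = probplot(skew_norm_dist)
print(slope, intercept, r)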

In [10]:
# Re-sample and loop over the lambdas, drawing each probability plot
# on its own axis.

skew_norm_dist = np.random.gamma(4, 2, size=200)
f, axes = plt.subplots(2, 2, figsize=(10,10))
lambdas = [0.01, 0.1, 1, 10]

for ax, lam in zip(axes.flat, lambdas):
    (osm, osr), (slope, intercept, r) = probplot(box_cox_transform(skew_norm_dist, lam))
    ax.set_title("L = {}, r = {}".format(lam, round(r, 3)))
    ax.scatter(osm, osr)

One way to find the best $\lambda$ is to pick the probability plot above which is most linear; we can do this by fitting a linear regression to each plot and checking its $r$ value (the closer to 1, the more linear the plot). Let's make a grid of $\lambda$ values and pick out the best one.

In [12]:
skew_norm_dist = np.random.gamma(4, 2, size=200)

# For each candidate lambda, record the r value of its probability plot.
gridvals = np.arange(0.01, 2, 0.01)
r_vals = np.array([probplot(box_cox_transform(skew_norm_dist, lam))[1][2]
                   for lam in gridvals])

best_val = gridvals[np.argmax(r_vals)]

f, axes = plt.subplots(1, 3, figsize=(10,5))
axes[0].set_title("Original Dist")
axes[1].set_title("Optimal Transform")
axes[2].set_title("L = {} Prob Plot".format(round(best_val, 2)))

q0 = axes[0].hist(skew_norm_dist)
q1 = axes[1].hist(box_cox_transform(skew_norm_dist, best_val))
q2 = axes[2].scatter(*probplot(box_cox_transform(skew_norm_dist, best_val))[0])
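
As a cross-check, scipy has this search built in: `scipy.stats.boxcox_normmax` with `method='pearsonr'` maximizes the probability-plot correlation (essentially our grid search, but with a proper optimizer), and `scipy.stats.boxcox` with no $\lambda$ given returns the transformed data along with the $\lambda$ that maximizes the log-likelihood. The two criteria usually land near each other, though not necessarily on the same value.

In [ ]:
from scipy import stats

# Lambda maximizing the probability-plot correlation
# (the same criterion our grid search used).
lam_ppcc = stats.boxcox_normmax(skew_norm_dist, brack=(-2, 2), method='pearsonr')

# Lambda maximizing the Box-Cox log-likelihood, plus the
# transformed data itself.
transformed, lam_mle = stats.boxcox(skew_norm_dist)

print(best_val, lam_ppcc, lam_mle)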