A practical exponentially-weighted mean/variance algorithm

Matteo Santamaria · September 3, 2022

Back in August, I started contributing to polars, an exciting new dataframe library that is modern, memory-efficient, and lightning FAST (check out the benchmarks here). It is under active development, so bugs are identified (and resolved) at an incredible pace. One morning, I noticed a user reporting an error with the exponentially-weighted variance computation – the polars result differed significantly from what pandas produced. Intrigued, I began looking for the source of the disparity.

The diagnosis

I first replicated the user’s issue on my own machine:

>>> pl.Series([1., 2., 3.]).ewm_var(alpha=0.5, adjust=False)
shape: (3,)
Series: '' [f64]
[
	0.0
	0.25
	0.6875
]

>>> pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=False).var()
0    NaN
1    0.5
2    1.1
dtype: float64

Looking closer at the polars and pandas function signatures, I noticed that a bias parameter was notably absent from the polars implementation (ignore the adjust parameter for now; we will get to it shortly). And indeed, this was the problem!

>>> pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=False).var(bias=False)
0    NaN
1    0.5
2    1.1
dtype: float64

>>> pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=False).var(bias=True)
0    0.0000
1    0.2500
2    0.6875
dtype: float64

With adjust=False, polars apparently computed a biased variance by default, whereas pandas returned an unbiased estimator. But the adjust parameter proved to be significant too. When adjust=True, the polars results were entirely incorrect.

>>> pl.Series([1., 2., 3.]).ewm_var(alpha=0.5, adjust=True)
shape: (3,)
Series: '' [f64]
[
	0.0
	0.25
	0.569444
]

>>> pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=True).var(bias=False)
0         NaN
1    0.500000
2    0.928571
dtype: float64

>>> pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=True).var(bias=True)
0    0.000000
1    0.222222
2    0.530612
dtype: float64

Clearly this needed to be fixed, but implementing a correct variance algorithm would first require a foray into the mathematics of exponential weights.

Exponentially-weighted mean

Given data points \(x_i\) and weights \(w_i\) for \(i \in \{0, 1, \dots, n\}\), the general formula for a weighted mean \(\mu\) is given by

\[\mu := \frac{\sum_{i=0}^n w_i x_i}{\sum_{i=0}^n w_i}\]

If the data points are ordered (in time, perhaps) we might want to compute the weighted mean over an expanding window. With some additional notation, we can identify a \(\mu_j\) at each observation \(x_j : j \in \{0, 1, \dots, n\}\), where \(\mu_j\) is the weighted mean of all observations up to and including \(x_j\). As the window expands and more data points are considered, the weight corresponding to a particular \(x_i\) will vary, so it is wise to index the weights by \((j)\) as well.

\[\mu_j := \frac{\sum_{i=0}^j w^{(j)}_i x_i}{\sum_{i=0}^j w^{(j)}_i} = \frac{\sum_{i=0}^j w^{(j)}_i x_i}{W_j}\]

where I’ve defined \(W_j := \sum_{i=0}^j w_i^{(j)}\).
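Written out as code, this definition is a doubly-nested computation. Here is a naive sketch (my own illustration, not library code), where the `weight(i, j)` callable stands in for an arbitrary weighting scheme \(w_i^{(j)}\):

from typing import Callable

def naive_weighted_means(xs: list[float], weight: Callable[[int, int], float]) -> list[float]:
    """Expanding-window weighted means, straight from the definition: O(n^2)."""
    means = []
    for j in range(len(xs)):
        wgts = [weight(i, j) for i in range(j + 1)]  # w_i^{(j)} for i = 0..j
        means.append(sum(w * x for w, x in zip(wgts, xs[: j + 1])) / sum(wgts))
    return means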

But naively computing a \(j\)-term sum at each step is far too costly. Instead, I will derive a recurrence relation that allows us to compute \(\mu_j\) in an online fashion. From the definition of \(\mu_j\), it is readily apparent that

\[W_{j + 1} \mu_{j + 1} = \sum_{i=0}^{j + 1} w_i^{(j + 1)} x_i = \sum_{i=0}^{j} w_i^{(j + 1)} x_i + w_{j + 1}^{(j + 1)} x_{j + 1}\] \[\implies \mu_{j + 1} = \frac{1}{W_{j + 1}} \left[ \sum_{i=0}^{j} w_i^{(j + 1)} x_i + w_{j + 1}^{(j + 1)} x_{j + 1} \right]\]

At this point, though, we are missing a crucial ingredient: the relationship between \(w_i^{(j + 1)}\) and \(w_i^{(j)}\), which must be established before we can proceed.

It is important to note that until now I have discussed an arbitrary weighting scheme, i.e. I have not yet used any properties of the weights themselves. From here on, I will restrict my analysis to an exponential weighting scheme with parameter \(\alpha \in (0, 1)\).

The unadjusted case

The standard exponentially-weighted mean formula is given by the pair of equations

\[\mu_0 = x_0 \\ \mu_j = (1 - \alpha) \mu_{j - 1} + \alpha x_j\]

which, when expanded out, yield

\[\mu_0 = x_0 \\ \mu_j = (1 - \alpha)^j x_0 + \alpha \sum_{i = 1}^j (1 - \alpha)^{j - i} x_i\]

for finite \(j\). This formula has a remarkable property: the weights sum to \(1\) for every \(j\).

\[W_j = (1 - \alpha)^j + \alpha \sum_{i = 1}^j (1 - \alpha)^{j - i} = 1\]

This is what we will call the “unadjusted” exponential weighting scheme.
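As a quick sanity check, take \(\alpha = 0.5\) and \(j = 2\): the weights on \(x_0, x_1, x_2\) are \(0.25, 0.25, 0.5\), and indeed

\[W_2 = (1 - 0.5)^2 + 0.5 (1 - 0.5) + 0.5 = 0.25 + 0.25 + 0.5 = 1\]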

The adjusted case

There is another way to specify the weights in an exponentially-weighted mean. Starting with the definition

\[\mu_j = (1 - \alpha) \mu_{j - 1} + \alpha x_j\]

we can write

\[\mu_j = \alpha x_j + \alpha (1 - \alpha) x_{j-1} + \alpha (1-\alpha)^2x_{j - 2} + \dots\]

Assuming an infinite number of data points, the general form of the weights is

\[w_i^{(j)} = \alpha (1 - \alpha)^{j - i}\]

However, note that

\[W_j = \sum_{i=0}^j w_i^{(j)} = \alpha \sum_{i=0}^j (1 - \alpha)^{j - i} < 1\]

Since the weights no longer sum to \(1\), we must normalize by dividing through by \(W_j\)

\[\mu_j = \frac{\alpha}{W_j} \sum_{i=0}^j (1 - \alpha)^{j-i} x_i = \frac{x_j + (1 - \alpha) x_{j-1} + \dots + (1 - \alpha)^{j} x_0}{1 + (1 - \alpha) + \dots + (1 - \alpha)^{j}}\]

When computed this way, \(\mu_j\) is called the “adjusted” exponentially-weighted mean.
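For instance, with \(\alpha = 0.5\) and the series \((1, 2, 3)\) from the diagnosis above, the adjusted mean after the third observation is

\[\mu_2 = \frac{3 + 0.5 \cdot 2 + 0.25 \cdot 1}{1 + 0.5 + 0.25} = \frac{4.25}{1.75} \approx 2.43\]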

An incremental update

With these definitions in hand, we can again consider the equation

\[\mu_{j + 1} = \frac{1}{W_{j + 1}} \left[ \sum_{i=0}^{j} w_i^{(j + 1)} x_i + w_{j + 1}^{(j + 1)} x_{j + 1} \right]\]

From the weight definitions above, it follows that for both the adjusted and unadjusted variants

\[w_{i}^{(j + 1)} = (1 - \alpha) w_{i}^{(j)} : i \leq j\]

and for \(i = j + 1\),

\[w_{j+1}^{(j+1)} = \begin{cases} \alpha & \leftarrow\mathrm{unadjusted} \\ 1 & \leftarrow \mathrm{adjusted} \\ \end{cases}\]

\(W_{j + 1}\) is easily computed by the recurrence

\[W_{j + 1} = \sum_{i=0}^j w_i^{(j+1)} + w_{j + 1}^{(j + 1)} = (1 - \alpha) W_{j} + w_{j + 1}^{(j + 1)}\]

(although it is worth reminding the reader that in the unadjusted case this computation is strictly unnecessary as the weights always sum to \(1\)). Lastly, I note an important identity

\[\sum_{i = 0}^j w_i^{(j + 1)} x_i =\left( W_{j + 1} - w_{j + 1}^{(j + 1)} \right) \mu_j\]

Putting the various pieces together and working through the algebra, I am now able to write down the one-step recurrence relation

\[\mu_{j + 1} = \frac{w_{j + 1}^{(j+1)}}{W_{j+1}} \left( x_{j + 1} - \mu_j \right) + \mu_j\]
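Translated into code, this recurrence gives a one-pass mean. The sketch below is my own illustration (not the polars implementation), and its initialization mirrors the full algorithm presented at the end of this post:

def ewm_mean(xs: list[float], alpha: float, adjust: bool) -> list[float]:
    """One-pass exponentially-weighted mean via the recurrence above."""
    if adjust:
        wgt, wgt_sum = 1.0, 0.0   # each new observation enters with weight 1
    else:
        wgt, wgt_sum = alpha, 1.0  # initialized so the running weight sum stays at 1
    mean = xs[0]
    means = []
    for x in xs:
        wgt_sum = (1 - alpha) * wgt_sum + wgt  # W_{j+1} = (1 - alpha) W_j + w_{j+1}^{(j+1)}
        mean += (wgt / wgt_sum) * (x - mean)   # mu_{j+1} = mu_j + (w/W)(x_{j+1} - mu_j)
        means.append(mean)
    return means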

Exponentially-weighted variance

With a firm understanding of the mathematics underlying the exponentially-weighted mean, we are now ready to analyze the exponentially-weighted variance.

The general formula for a weighted variance \(\sigma^2\) is given by

\[\sigma^2 := \frac{1}{W_n} \sum_{i=0}^n w_i \left( x_i - \mu \right)^2\]

As we did with the mean, we can compute a weighted variance at each \(j \in \{0, 1, \dots, n\}\) by

\[\sigma_j^2 := \frac{1}{W_j} \sum_{i=0}^j w_i^{(j)} \left( x_i - \mu_j \right)^2\]

An incremental update

To design an efficient algorithm for computing \(\sigma_j^2\), I start with some useful, easily derived identities

\[x_i - \mu_j = ( x_i - \mu_{j - 1} ) - \frac{w_j^{(j)}}{W_j} \left( x_j - \mu_{j-1} \right)\]

which for \(i = j\) can be written

\[x_j - \mu_j = \frac{W_j - w_j^{(j)}}{W_j} \left( x_j - \mu_{j-1} \right)\]

We are thus interested in solving

\[\sigma_j^2 = \frac{1}{W_j} \underbrace{\sum_{i=0}^{j-1} w_i^{(j)} \left[ ( x_i - \mu_{j - 1} ) - \frac{w_j^{(j)}}{W_j} \left( x_j - \mu_{j-1} \right) \right]^2}_{X} + \frac{ w_j^{(j)} }{W_j} \left[ \frac{W_j - w_j^{(j)}}{W_j} \left( x_j - \mu_{j-1} \right) \right]^2\]

The section I’ve identified as \(X\) can be expressed as

\[X = \sum_{i=0}^{j-1}w_i^{(j)} \left[ \left( x_i - \mu_{j-1} \right)^2 - 2 \frac{w_j^{(j)}}{W_j} \left( x_i - \mu_{j-1} \right) \left( x_j - \mu_{j - 1} \right) + \left( \frac{w_j^{(j)}}{W_j} \right)^2 \left( x_j - \mu_{j-1} \right)^2 \right]\]

If you consider each of these terms separately, you will find

\[\sum_{i=0}^{j-1} w_i^{(j)} \left( x_i - \mu_{j - 1} \right)^2 = \left( W_j - w_j^{(j)} \right) \sigma_{j-1}^2\] \[\sum_{i=0}^{j-1} w_i^{(j)} \frac{w_j^{(j)}}{W_j} \left( x_i - \mu_{j-1} \right) \left( x_j - \mu_{j-1} \right) = 0\] \[\sum_{i=0}^{j-1} w_i^{(j)} \left( \frac{w_j^{(j)}}{W_j} \right)^2 \left( x_j - \mu_{j-1} \right)^2 = \left( \frac{w_j^{(j)}}{W_j} \right)^2 \left( x_j - \mu_{j-1} \right)^2 \left( W_j - w_j^{(j)} \right)\]

The cross term vanishes because \(w_i^{(j)} = (1 - \alpha) w_i^{(j-1)}\) for \(i < j\) and \(\mu_{j-1}\) is, by definition, the weighted mean under the weights \(w^{(j-1)}\), so \(\sum_{i=0}^{j-1} w_i^{(j)} \left( x_i - \mu_{j-1} \right) = 0\). After some simplification, these identities imply

\[\sigma_j^2 = \left( 1 - \frac{w_j^{(j)}}{W_j} \right) \left[ \sigma_{j-1}^2 + \frac{w_j^{(j)}}{W_j} \left( x_j - \mu_{j-1} \right)^2 \right]\]

This is hugely important – it means at each step \(j\) we don’t have to traverse all prior observations to compute \((x_i - \mu_j)^2\). It therefore suggests an \(O(n)\) algorithm is possible.
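If you want to convince yourself of the recurrence without redoing the algebra, a brute-force check (a throwaway script of my own, not part of polars) can compare it against the \(O(j)\) definition at every step. Here I use the unadjusted weights, \(w_0^{(j)} = (1-\alpha)^j\) and \(w_i^{(j)} = \alpha (1-\alpha)^{j-i}\) for \(i \geq 1\), so that \(W_j = 1\):

def check_unadjusted_var(xs: list[float], alpha: float) -> None:
    """Compare the one-step variance recurrence against the O(j) definition."""
    mean, var = xs[0], 0.0
    for j, x in enumerate(xs):
        # recurrence: sigma_j^2 = (1 - alpha) * (sigma_{j-1}^2 + alpha * (x_j - mu_{j-1})^2)
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff ** 2)

        # direct definition with the unadjusted weights (they sum to 1)
        wgts = [(1 - alpha) ** j] + [alpha * (1 - alpha) ** (j - i) for i in range(1, j + 1)]
        mu = sum(w * xi for w, xi in zip(wgts, xs[: j + 1]))
        direct = sum(w * (xi - mu) ** 2 for w, xi in zip(wgts, xs[: j + 1]))
        assert abs(var - direct) < 1e-9, (j, var, direct)

check_unadjusted_var([1., 2., 3., 5., 8.], alpha=0.5)  # passes silently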

Biased-ness

If we pause to realize that \(\sigma_j^2\) is only a sample estimate of the population variance \(\sigma^2\), we can formalize the relationship between \(\sigma_j^2\) and \(\sigma^2\), assuming the \(x_i\) are independent draws with common mean \(\mu\) and variance \(\sigma^2\). If we define

\[\theta_j = \sum_{i=0}^{j} \left[ w_i^{(j)} \right]^2\]

we can then compute

\[\mathbf{E} [ \sigma_j^2 ] = \frac{1}{W_j} \sum_{i=0}^j w_i^{(j)} \left( \mathbf{E} \left[ \left( x_i - \mu_j \right)^2 \right] \right) = \left( 1 - \frac{\theta_j}{W_j^2} \right) \sigma^2\]

To derive this, it is useful to note that

\[\mathbf{E}[ x_i^2 ] = \mu^2 + \sigma^2 \\ \mathbf{E}[ x_i \mu_j ] = \mu^2 + \frac{\sigma^2}{W_j} w_i^{(j)} \\ \mathbf{E}[ \mu_j^2 ] = \mu^2 + \frac{\sigma^2 \theta_j}{W_j^2}\] \[\implies \mathbf{E} \left[ \left( x_i - \mu_j \right)^2 \right] = \sigma^2 - 2 \frac{w_i^{(j)}}{W_j} \sigma^2 + \frac{\theta_j}{W_j^2} \sigma^2\]

In order to produce an unbiased estimate of \(\sigma^2\), we must therefore multiply our estimate by a bias correction term. In particular, we must compute

\[\left( 1 - \frac{\theta_j}{W_j^2} \right)^{-1} \sigma_j^2\]
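This correction is exactly what separates the two pandas results from the diagnosis. With \(\alpha = 0.5\), adjust=False, and \(j = 2\), the weights are \((0.25, 0.25, 0.5)\), so \(W_2 = 1\) and \(\theta_2 = 0.25^2 + 0.25^2 + 0.5^2 = 0.375\), giving

\[\left( 1 - \frac{0.375}{1^2} \right)^{-1} \times 0.6875 = \frac{0.6875}{0.625} = 1.1\]

so the biased value \(0.6875\) becomes the unbiased \(1.1\).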

A practical mean/variance algorithm

I now present the algorithm in its entirety:

def exponentially_weighted(
    xs: list[float], alpha: float, adjust: bool, bias: bool
) -> dict[str, list[float]]:
    """Compute and return the exponentially-weighted (mean, variance)."""
    assert 0 < alpha < 1
    
    if adjust:
        wgt, wgt_sum, wgt_sum_sqr = 1, 0, 0
    else:
        # NOTE: we must ensure that `wgt_sum` and `wgt_sum_sqr` are
        # equal to one in the first iteration
        wgt, wgt_sum, wgt_sum_sqr = alpha, 1, (1 + alpha) / (1 - alpha)
    
    result = {"mean": [], "var": []}
    
    mean = xs[0]
    var = 0
        
    for i, x in enumerate(xs):
        wgt_sum = (1 - alpha) * wgt_sum + wgt
        wgt_sum_sqr = (1 - alpha) ** 2 * wgt_sum_sqr + wgt ** 2
        
        diff = x - mean
        ratio = wgt / wgt_sum
        
        mean = mean + diff * ratio
        var = (1 - ratio) * (var + ratio * diff ** 2)
        
        # NOTE: the `i == 0` condition prevents a ZeroDivisionError
        if bias or i == 0:
            bias_correction = 1
        else:
            bias_correction = 1 - wgt_sum_sqr / (wgt_sum ** 2)
        
        result["mean"].append(mean)
        result["var"].append(var / bias_correction)
        
    return result
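As a quick check, running it on the series from the diagnosis reproduces the pandas variances (pandas emits NaN rather than 0.0 for the first observation):

>>> exponentially_weighted([1., 2., 3.], alpha=0.5, adjust=False, bias=False)["var"]
[0.0, 0.5, 1.1]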

Note that both polars and pandas provide options for dealing with missing values; this simple implementation puts that issue aside. For a more robust Rust implementation, see my merge request here.
