Back in August, I started contributing to polars, an exciting new dataframe library that is modern, memory-efficient, and lightning FAST (check out the benchmarks here). It is under active development so bugs are identified (and resolved) at an incredible pace. One morning, I noticed a user reporting an error with the exponentially-weighted variance computation – the polars result differed significantly from what pandas produced. Intrigued, I began looking for the source of the disparity.
The diagnosis
I first replicated the user’s issue on my own machine:
>>> pl.Series([1., 2., 3.]).ewm_var(alpha=0.5, adjust=False)
shape: (3,)
Series: '' [f64]
[
0.0
0.25
0.6875
]
>>> pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=False).var()
0 NaN
1 0.5
2 1.1
dtype: float64
Looking closer at the polars and pandas function signatures, I noticed a bias parameter was notably absent from the polars implementation (ignore the adjust parameter for now, we will get to that shortly). And indeed, this was the problem!
>>> pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=False).var(bias=False)
0 NaN
1 0.5
2 1.1
dtype: float64
>>> pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=False).var(bias=True)
0 0.0000
1 0.2500
2 0.6875
dtype: float64
With adjust=False, polars apparently computed a biased variance by default, whereas pandas returned an unbiased estimator. But the adjust parameter proved to be significant too. When adjust=True, the polars results were entirely incorrect.
>>> pl.Series([1., 2., 3.]).ewm_var(alpha=0.5, adjust=True)
shape: (3,)
Series: '' [f64]
[
0.0
0.25
0.569444
]
>>> pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=True).var(bias=False)
0 NaN
1 0.500000
2 0.928571
dtype: float64
>>> pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=True).var(bias=True)
0 0.000000
1 0.222222
2 0.530612
dtype: float64
Clearly this needed to be fixed, but implementing a correct variance algorithm would first require a foray into the mathematics of exponential weights.
Exponentially-weighted mean
Given data points \(x_i\) and weights \(w_i\) for \(i \in \{0, 1, \dots, n\}\), the general formula for a weighted mean \(\mu\) is given by
\[\mu := \frac{\sum_{i=0}^n w_i x_i}{\sum_{i=0}^n w_i}\]If the data points are ordered (in time, perhaps) we might want to compute the weighted mean over an expanding window. With some additional notation, we can identify a \(\mu_j\) at each observation \(x_j : j \in \{0, 1, \dots, n\}\), where \(\mu_j\) is the weighted mean of all observations \(x_i\) with \(i \leq j\). As our window expands, and more data points are considered, the weight corresponding to a particular \(x_i\) will vary, so it is wise to index the weights by \((j)\) as well.
\[\mu_j := \frac{\sum_{i=0}^j w^{(j)}_i x_i}{\sum_{i=0}^j w^{(j)}_i} = \frac{\sum_{i=0}^j w^{(j)}_i x_i}{W_j}\]where I’ve defined \(W_j := \sum_i w_i^{(j)}\).
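To make the definition concrete, here is a deliberately naive sketch that recomputes \(\mu_j\) from scratch at every step; the weights_at helper is a hypothetical stand-in for whatever scheme supplies the \(w_i^{(j)}\).
from typing import Callable

def naive_expanding_weighted_mean(
    xs: list[float], weights_at: Callable[[int], list[float]]
) -> list[float]:
    """Recompute the weighted mean from its definition at every step j.

    `weights_at(j)` is a hypothetical helper returning [w_0^(j), ..., w_j^(j)].
    """
    means = []
    for j in range(len(xs)):
        ws = weights_at(j)
        means.append(sum(w * x for w, x in zip(ws, xs)) / sum(ws))
    return means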
But naively computing a \(j\)-term sum at each step is far too costly. Instead, I will derive a recurrence relation that allows us to compute \(\mu_j\) in an online fashion. From the definition of \(\mu_j\), it is readily apparent that
\[W_{j + 1} \mu_{j + 1} = \sum_{i=0}^{j + 1} w_i^{(j + 1)} x_i = \sum_{i=0}^{j} w_i^{(j + 1)} x_i + w_{j + 1}^{(j + 1)} x_{j + 1}\] \[\implies \mu_{j + 1} = \frac{1}{W_{j + 1}} \left[ \sum_{i=0}^{j} w_i^{(j + 1)} x_i + w_{j + 1}^{(j + 1)} x_{j + 1} \right]\]At this point, though, we still need a crucial relationship between \(w_i^{(j + 1)}\) and \(w_i^{(j)}\) before we can proceed.
It is important to note that until now, I’ve discussed an arbitrary weighting scheme, i.e., I have not yet used any properties of the weights themselves. From here on, I will restrict my analysis to an exponential weighting scheme with parameter \(\alpha \in (0, 1)\).
The unadjusted case
The standard exponentially-weighted mean formula is given by the pair of equations
\[\mu_0 = x_0 \\ \mu_j = (1 - \alpha) \mu_{j - 1} + \alpha x_j\]which, when expanded out, yield
\[\mu_0 = x_0 \\ \mu_j = (1 - \alpha)^j x_0 + \alpha \sum_{i = 1}^j (1 - \alpha)^{j - i} x_i\]for finite \(j\). This equation has a remarkable property, namely the weights sum to \(1\) for all \(j\).
\[W_j = (1 - \alpha)^j + \alpha \sum_{i = 1}^j (1 - \alpha)^{j - i} = 1\]This weighting scheme is what we will call the “unadjusted” exponential weighting.
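As a quick sanity check, the recurrence above can be compared against pandas in a few lines (unadjusted_ewm_mean is just a throwaway name for the sketch):
import pandas as pd

def unadjusted_ewm_mean(xs: list[float], alpha: float) -> list[float]:
    """Apply mu_0 = x_0 and mu_j = (1 - alpha) * mu_{j-1} + alpha * x_j."""
    mean = xs[0]
    means = [mean]
    for x in xs[1:]:
        mean = (1 - alpha) * mean + alpha * x
        means.append(mean)
    return means

# Both print [1.0, 1.5, 2.25]
print(unadjusted_ewm_mean([1., 2., 3.], alpha=0.5))
print(pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=False).mean().tolist())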
The adjusted case
There is another way to specify the weights in an exponentially-weighted mean. Starting with the definition
\[\mu_j = (1 - \alpha) \mu_{j - 1} + \alpha x_j\]we can write
\[\mu_j = \alpha x_j + \alpha (1 - \alpha) x_{j-1} + \alpha (1-\alpha)^2x_{j - 2} + \dots\]Assuming an infinite number of data points, the general form of the weights is
\[w_i^{(j)} = \alpha (1 - \alpha)^{j - i}\]However, note that
\[W_j = \sum_{i=0}^j w_i^{(j)} = \alpha \sum_{i=0}^j (1 - \alpha)^{j - i} < 1\]Since our weights no longer sum to \(1\), we must normalize the weights and divide by \(W_j\)
\[\mu_j = \frac{\alpha}{W_j} \sum_{i=0}^j (1 - \alpha)^{j-i} x_i = \frac{x_j + (1 - \alpha) x_{j-1} + \dots + (1 - \alpha)^{j} x_0}{1 + (1 - \alpha) + \dots + (1 - \alpha)^{j}}\]When computed this way, \(\mu_j\) is called the “adjusted” exponentially-weighted mean.
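The same kind of check works for the adjusted mean, computed here directly from the normalized weights \((1 - \alpha)^{j - i}\) (again, adjusted_ewm_mean is merely a name for the sketch):
import pandas as pd

def adjusted_ewm_mean(xs: list[float], alpha: float) -> list[float]:
    """Compute mu_j = sum_i (1 - alpha)^(j - i) x_i / sum_i (1 - alpha)^(j - i)."""
    means = []
    for j in range(len(xs)):
        ws = [(1 - alpha) ** (j - i) for i in range(j + 1)]
        means.append(sum(w * x for w, x in zip(ws, xs)) / sum(ws))
    return means

# Both give [1.0, 1.666..., 2.428...] (up to floating-point rounding)
print(adjusted_ewm_mean([1., 2., 3.], alpha=0.5))
print(pd.Series([1., 2., 3.]).ewm(alpha=0.5, adjust=True).mean().tolist())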
An incremental update
With these definitions in hand, we can again consider the equation
\[\mu_{j + 1} = \frac{1}{W_{j + 1}} \left[ \sum_{i=0}^{j} w_i^{(j + 1)} x_i + w_{j + 1}^{(j + 1)} x_{j + 1} \right]\]I have shown that for both adjusted and unadjusted variants
\[w_{i}^{(j + 1)} = (1 - \alpha) w_{i}^{(j)} : i \leq j\]and for \(i = j + 1\),
\[w_{j+1}^{(j+1)} = \begin{cases} \alpha & \leftarrow\mathrm{unadjusted} \\ 1 & \leftarrow \mathrm{adjusted} \\ \end{cases}\]\(W_{j + 1}\) is easily computed by the recurrence
\[W_{j + 1} = \sum_{i=0}^j w_i^{(j+1)} + w_{j + 1}^{(j + 1)} = (1 - \alpha) W_{j} + w_{j + 1}^{(j + 1)}\](although it is worth reminding the reader that in the unadjusted case this computation is strictly unnecessary as the weights always sum to \(1\)). Lastly, I note an important identity
\[\sum_{i = 0}^j w_i^{(j + 1)} x_i =\left( W_{j + 1} - w_{j + 1}^{(j + 1)} \right) \mu_j\]Putting the various pieces together and working through the algebra, I am now able to write down the one-step recurrence relation
\[\mu_{j + 1} = \frac{w_{j + 1}^{(j+1)}}{W_{j+1}} \left( x_{j + 1} - \mu_j \right) + \mu_j\]
Exponentially-weighted variance
With a firm understanding of the mathematics underlying the exponentially-weighted mean, we are now ready to analyze the exponentially-weighted variance.
The general formula for a weighted variance \(\sigma^2\) is given by
\[\sigma^2 := \frac{1}{W_n} \sum_{i=0}^n w_i \left( x_i - \mu \right)^2\]As we did with the mean, we can compute a weighted variance at each \(j \in \{0, 1, \dots, n\}\) by
\[\sigma_j^2 := \frac{1}{W_j} \sum_{i=0}^j w_i^{(j)} \left( x_i - \mu_j \right)^2\]
An incremental update
To design an efficient algorithm for computing \(\sigma_j^2\), I start with some useful, easily derived, identities
\[x_i - \mu_j = ( x_i - \mu_{j - 1} ) - \frac{w_j^{(j)}}{W_j} \left( x_j - \mu_{j-1} \right)\]which for \(i = j\) can be written
\[x_j - \mu_j = \frac{W_j - w_j^{(j)}}{W_j} \left( x_j - \mu_{j-1} \right)\]We are thus interested in solving
\[\sigma_j^2 = \frac{1}{W_j} \underbrace{\sum_{i=0}^{j-1} w_i^{(j)} \left[ ( x_i - \mu_{j - 1} ) - \frac{w_j^{(j)}}{W_j} \left( x_j - \mu_{j-1} \right) \right]^2}_{X} + \frac{ w_j^{(j)} }{W_j} \left[ \frac{W_j - w_j^{(j)}}{W_j} \left( x_j - \mu_{j-1} \right) \right]^2\]The section I’ve identified as \(X\) can be expressed as
\[X = \sum_{i=0}^{j-1}w_i^{(j)} \left[ \left( x_i - \mu_{j-1} \right)^2 - 2 \frac{w_j^{(j)}}{W_j} \left( x_i - \mu_{j-1} \right) \left( x_j - \mu_{j - 1} \right) + \left( \frac{w_j^{(j)}}{W_j} \right)^2 \left( x_j - \mu_{j-1} \right)^2 \right]\]If you consider each of these terms separately, you will find
\[\sum_{i=0}^{j-1} w_i^{(j)} \left( x_i - \mu_{j - 1} \right)^2 = \left( W_j - w_j^{(j)} \right) \sigma_{j-1}^2\] \[\sum_{i=0}^{j-1} w_i^{(j)} \frac{w_j^{(j)}}{W_j} \left( x_i - \mu_{j-1} \right) \left( x_j - \mu_{j-1} \right) = 0\] \[\sum_{i=0}^{j-1} w_i^{(j)} \left( \frac{w_j^{(j)}}{W_j} \right)^2 \left( x_j - \mu_{j-1} \right)^2 = \left( \frac{w_j^{(j)}}{W_j} \right)^2 \left( x_j - \mu_{j-1} \right)^2 \left( W_j - w_j^{(j)} \right)\]which, after some simplification, implies
\[\sigma_j^2 = \left( 1 - \frac{w_j^{(j)}}{W_j} \right) \left[ \sigma_{j-1}^2 + \frac{w_j^{(j)}}{W_j} \left( x_j - \mu_{j-1} \right)^2 \right]\]This is hugely important – it means at each step \(j\) we don’t have to traverse all prior observations to compute \((x_i - \mu_j)^2\). It therefore suggests an \(O(n)\) algorithm is possible.
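Before worrying about bias, it is worth confirming the recurrence numerically against the definition of \(\sigma_j^2\). Here is a small brute-force sketch, using the adjusted weights \(w_i^{(j)} = (1 - \alpha)^{j - i}\) and function names invented for the occasion:
def direct_ewm_var(xs: list[float], alpha: float) -> list[float]:
    """Biased sigma_j^2 straight from the definition, with adjusted weights."""
    out = []
    for j in range(len(xs)):
        ws = [(1 - alpha) ** (j - i) for i in range(j + 1)]
        w_sum = sum(ws)
        mu = sum(w * x for w, x in zip(ws, xs)) / w_sum
        out.append(sum(w * (x - mu) ** 2 for w, x in zip(ws, xs)) / w_sum)
    return out

def recursive_ewm_var(xs: list[float], alpha: float) -> list[float]:
    """Biased sigma_j^2 via the one-step recurrence (adjusted weights, w_j^(j) = 1)."""
    mean, var, w_sum = xs[0], 0.0, 1.0
    out = [0.0]
    for x in xs[1:]:
        w_sum = (1 - alpha) * w_sum + 1.0  # W_j = (1 - alpha) * W_{j-1} + 1
        ratio = 1.0 / w_sum                # w_j^(j) / W_j
        diff = x - mean
        mean += ratio * diff
        var = (1 - ratio) * (var + ratio * diff ** 2)
        out.append(var)
    return out

xs, alpha = [1., 2., 3.], 0.5
assert all(
    abs(a - b) < 1e-12
    for a, b in zip(direct_ewm_var(xs, alpha), recursive_ewm_var(xs, alpha))
)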
Biased-ness
If we pause to realize that \(\sigma_j^2\) is only a sample estimate for the population parameter \(\sigma^2\), we can formalize the relationship between \(\sigma_j^2\) and \(\sigma^2\). If we define
\[\theta_j = \sum_{i=0}^{j} \left[ w_i^{(j)} \right]^2\]we can then compute
\[\mathbf{E} [ \sigma_j^2 ] = \frac{1}{W_j} \sum_{i=0}^j w_i^{(j)} \left( \mathbf{E} \left[ \left( x_i - \mu_j \right)^2 \right] \right) = \left( 1 - \frac{\theta_j}{W_j^2} \right) \sigma^2\]To derive this (treating the \(x_i\) as independent draws with mean \(\mu\) and variance \(\sigma^2\)), it is useful to note that
\[\mathbf{E}[ x_i^2 ] = \mu^2 + \sigma^2 \\ \mathbf{E}[ x_i \mu_j ] = \mu^2 + \frac{\sigma^2}{W_j} w_i^{(j)} \\ \mathbf{E}[ \mu_j^2 ] = \mu^2 + \frac{\sigma^2 \theta_j}{W_j^2}\] \[\implies \mathbf{E} \left[ \left( x_i - \mu_j \right)^2 \right] = \sigma^2 - 2 \frac{w_i^{(j)}}{W_j} \sigma^2 + \frac{\theta_j}{W_j^2} \sigma^2\]In order to produce an unbiased estimate of \(\sigma^2\), we must therefore multiply our estimate by a bias correction term. In particular, we must compute
\[\left( 1 - \frac{\theta_j}{W_j^2} \right)^{-1} \sigma_j^2\]
A practical mean/variance algorithm
I now present the algorithm in its entirety:
def exponentially_weighted(
xs: list[float], alpha: float, adjust: bool, bias: bool
) -> dict[str, list[float]]:
"""Compute and return the exponentially-weighted (mean, variance)."""
assert 0 < alpha < 1
if adjust:
wgt, wgt_sum, wgt_sum_sqr = 1, 0, 0
else:
# NOTE: we must ensure that `wgt_sum` and `wgt_sum_sqr` are
# equal to one in the first iteration
wgt, wgt_sum, wgt_sum_sqr = alpha, 1, (1 + alpha) / (1 - alpha)
result = {"mean": [], "var": []}
mean = xs[0]
var = 0
for i, x in enumerate(xs):
wgt_sum = (1 - alpha) * wgt_sum + wgt
wgt_sum_sqr = (1 - alpha) ** 2 * wgt_sum_sqr + wgt ** 2
diff = x - mean
ratio = wgt / wgt_sum
mean = mean + diff * ratio
var = (1 - ratio) * (var + ratio * diff ** 2)
# NOTE: the `i == 0` condition prevents a ZeroDivisionError
if bias or i == 0:
bias_correction = 1
else:
bias_correction = 1 - wgt_sum_sqr / (wgt_sum ** 2)
result["mean"].append(mean)
result["var"].append(var / bias_correction)
return result
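As a quick usage check against the pandas numbers from the diagnosis above (the only difference being that the first unbiased value comes out as 0.0 rather than NaN):
>>> out = exponentially_weighted([1., 2., 3.], alpha=0.5, adjust=False, bias=False)
>>> [round(v, 6) for v in out["var"]]
[0.0, 0.5, 1.1]
>>> out = exponentially_weighted([1., 2., 3.], alpha=0.5, adjust=True, bias=False)
>>> [round(v, 6) for v in out["var"]]
[0.0, 0.5, 0.928571]
>>> out = exponentially_weighted([1., 2., 3.], alpha=0.5, adjust=True, bias=True)
>>> [round(v, 6) for v in out["var"]]
[0.0, 0.222222, 0.530612]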
Note that both polars and pandas provide options for dealing with missing values; this simple implementation puts that issue aside. For a more robust Rust implementation, see my merge request here.