The book Asset Allocation: From Theory to Practice and Beyond by Kinlaw et al. (2021) is one of my favorites because it gets right a few myths about mean-variance optimization that are constantly parroted, even in academic papers, to motivate some fancy new method as the solution. It also provides useful answers to a portfolio manager’s or allocator’s practical questions, such as when to rebalance.
In chapter 13 on forecasting, the authors introduce a method called partial sample regression, or relevance-based prediction, which provides an interesting perspective on a linear regression’s prediction and generalizes it in a direction akin to the Nadaraya-Watson kernel regression.
Linear Regression
Say we are in a time-series setting with past observations $(x_t, y_t)$, $t = 1, \dots, T$, of input variables $x_t \in \mathbb{R}^k$ and a corresponding output variable $y_t$. Now, at $T+1$, we get realizations $x_{T+1}$ of the input variables and are asked to predict $y_{T+1}$.
What’s the prediction of a linear regression? We start with the general form and rewrite it in terms of de-meaned quantities. In the following, $\hat{b}$ is the full coefficient vector including the intercept $\hat{a}$ and $\hat{\beta}$ is the vector of slope coefficients only.

$$
\begin{aligned}
\hat{y}_{T+1} &= \hat{b}^\top \begin{pmatrix} 1 \\ x_{T+1} \end{pmatrix} = \hat{a} + \hat{\beta}^\top x_{T+1} && (1)\\
&= \bar{y} - \hat{\beta}^\top \bar{x} + \hat{\beta}^\top x_{T+1} && (2)\\
&= \bar{y} + \hat{\beta}^\top \big(x_{T+1} - \bar{x}\big) && (3)\\
&= \bar{y} + \hat{\beta}^\top \tilde{x}_{T+1} && (4)
\end{aligned}
$$

In (1) we separate the intercept from the slope coefficients, noting that $\hat{b} = (\hat{a}, \hat{\beta}^\top)^\top$. In (2) we substitute the least squares expression for the intercept, $\hat{a} = \bar{y} - \hat{\beta}^\top \bar{x}$. In (3) we collect the terms that multiply $\hat{\beta}$. In (4) we define $\tilde{x}_{T+1} = x_{T+1} - \bar{x}$ as the de-meaned current observation. The sample means are $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$ and $\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t$.
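As a quick sanity check, here is a small numpy sketch (the data and variable names such as `x_new` are mine, not from the book) that fits an ordinary regression and confirms that the de-meaned form (4) gives the same prediction as the intercept-plus-slopes form (1):

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 200, 3
X = rng.normal(size=(T, k))                      # past inputs x_1, ..., x_T
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=T)
x_new = rng.normal(size=k)                       # current observation x_{T+1}

# standard OLS fit with intercept: b = (a, beta)
Z = np.column_stack([np.ones(T), X])
b, *_ = np.linalg.lstsq(Z, y, rcond=None)
a, beta = b[0], b[1:]

pred_standard = a + beta @ x_new                       # form (1)
pred_demeaned = y.mean() + beta @ (x_new - X.mean(0))  # form (4)
assert np.isclose(pred_standard, pred_demeaned)
```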
Tedious Algebra
The reason I reformulated the linear regression’s prediction a bit is to set the stage for further reformulations. Equation (4) shows that the prediction is the sample mean of the outcome plus a correction term. To work with the correction term in isolation, we subtract $\bar{y}$ from both sides of (4):

$$
\hat{y}_{T+1} - \bar{y} = \hat{\beta}^\top \tilde{x}_{T+1} \qquad (5)
$$
Now we substitute the least squares estimator $\hat{\beta} = \big(\sum_{t=1}^{T} \tilde{x}_t \tilde{x}_t^\top\big)^{-1} \sum_{t=1}^{T} \tilde{x}_t \tilde{y}_t$, written in terms of the de-meaned observations $\tilde{x}_t = x_t - \bar{x}$ and $\tilde{y}_t = y_t - \bar{y}$, into (5) and rearrange:

$$
\begin{aligned}
\hat{y}_{T+1} - \bar{y} &= \tilde{x}_{T+1}^\top \Big(\sum_{t=1}^{T} \tilde{x}_t \tilde{x}_t^\top\Big)^{-1} \sum_{t=1}^{T} \tilde{x}_t \tilde{y}_t && (6)\\
&= \tilde{x}_{T+1}^\top \Big(\tfrac{1}{T}\sum_{t=1}^{T} \tilde{x}_t \tilde{x}_t^\top\Big)^{-1} \tfrac{1}{T}\sum_{t=1}^{T} \tilde{x}_t \tilde{y}_t && (7)\\
&= \tilde{x}_{T+1}^\top \Big(\tfrac{1}{T-1}\sum_{t=1}^{T} \tilde{x}_t \tilde{x}_t^\top\Big)^{-1} \tfrac{1}{T-1}\sum_{t=1}^{T} \tilde{x}_t \tilde{y}_t && (8)\\
&= \tilde{x}_{T+1}^\top \Omega^{-1} \,\tfrac{1}{T-1}\sum_{t=1}^{T} \tilde{x}_t \tilde{y}_t && (9)\\
&= \tfrac{1}{T-1}\sum_{t=1}^{T} \big(\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}\big)\, \tilde{y}_t && (10)
\end{aligned}
$$

In (7) we divide both the matrix in the inverse and the vector on the right by $T$. This is permissible because $T$ cancels between numerator and denominator. In (8) we replace $\frac{1}{T}$ with $\frac{1}{T-1}$ in both factors, which again cancels, but is the conventional normalisation for sample covariance matrices. In (9) we recognise the sample covariance matrix $\Omega = \frac{1}{T-1}\sum_{t=1}^{T} \tilde{x}_t \tilde{x}_t^\top$. In (10) we pull $\tilde{x}_{T+1}^\top \Omega^{-1}$ inside the sum, since it does not depend on the summation index $t$, and write each weight as the scalar $\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}$.
Equation (10) is already revealing: the de-meaned prediction is a weighted sum of de-meaned outcomes, where the weight on observation $t$ is the inner product $\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}$. But we can go further by decomposing this inner product into geometrically interpretable pieces.
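Continuing the numpy sketch from above, (10) is easy to confirm: the inner-product weights applied to the de-meaned outcomes reproduce the regression prediction.

```python
Xc = X - X.mean(0)                  # de-meaned inputs
yc = y - y.mean()                   # de-meaned outcomes
omega = Xc.T @ Xc / (T - 1)         # sample covariance matrix

# weight on observation t: (x_t - x_bar)' omega^{-1} (x_new - x_bar)
w = Xc @ np.linalg.solve(omega, x_new - X.mean(0))
pred_eq10 = y.mean() + (w @ yc) / (T - 1)
assert np.isclose(pred_eq10, pred_standard)
```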
Because the outcome variable is de-meaned, its sum vanishes: $\sum_{t=1}^{T} \tilde{y}_t = \sum_{t=1}^{T} (y_t - \bar{y}) = 0$. Any constant multiplied by this sum is therefore zero. In particular, $\frac{1}{2}\tilde{x}_{T+1}^\top \Omega^{-1} \tilde{x}_{T+1}$ is a scalar that does not depend on $t$. Subtracting it inside the sum does not change the result:

$$
\begin{aligned}
\hat{y}_{T+1} - \bar{y} &= \tfrac{1}{T-1}\sum_{t=1}^{T} \big(\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}\big)\, \tilde{y}_t && (11)\\
&= \tfrac{1}{T-1}\sum_{t=1}^{T} \big(\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}\big)\, \tilde{y}_t \;-\; \tfrac{1}{2}\tilde{x}_{T+1}^\top \Omega^{-1} \tilde{x}_{T+1} \cdot \tfrac{1}{T-1}\sum_{t=1}^{T} \tilde{y}_t && (12)\\
&= \tfrac{1}{T-1}\sum_{t=1}^{T} \Big(\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1} - \tfrac{1}{2}\tilde{x}_{T+1}^\top \Omega^{-1} \tilde{x}_{T+1}\Big)\, \tilde{y}_t && (13)
\end{aligned}
$$

(11) repeats (10). In (12) we subtract a term that equals zero. In (13) we absorb the constant into the sum.
We now want to replace the bilinear form inside (13) with something more interpretable. Since $\Omega^{-1}$ is symmetric and positive definite, $\langle u, v \rangle_{\Omega^{-1}} = u^\top \Omega^{-1} v$ defines an inner product, and $\|u\|_{\Omega^{-1}}^2 = u^\top \Omega^{-1} u$ its induced (squared) norm. Every inner product satisfies the polarisation identity, which we verify by direct expansion:

$$
\begin{aligned}
\tfrac{1}{2}\big(\|u\|_{\Omega^{-1}}^2 + \|v\|_{\Omega^{-1}}^2 - \|u - v\|_{\Omega^{-1}}^2\big)
&= \tfrac{1}{2}\Big(\|u\|_{\Omega^{-1}}^2 + \|v\|_{\Omega^{-1}}^2 - \big(\|u\|_{\Omega^{-1}}^2 - u^\top \Omega^{-1} v - v^\top \Omega^{-1} u + \|v\|_{\Omega^{-1}}^2\big)\Big) && (14)\\
&= \tfrac{1}{2}\|u\|_{\Omega^{-1}}^2 + \tfrac{1}{2}\|v\|_{\Omega^{-1}}^2 - \tfrac{1}{2}\|u\|_{\Omega^{-1}}^2 + \tfrac{1}{2}u^\top \Omega^{-1} v + \tfrac{1}{2}v^\top \Omega^{-1} u - \tfrac{1}{2}\|v\|_{\Omega^{-1}}^2 && (15)\\
&= \tfrac{1}{2}u^\top \Omega^{-1} v + \tfrac{1}{2}v^\top \Omega^{-1} u && (16)\\
&= u^\top \Omega^{-1} v && (17)
\end{aligned}
$$

In (14) we expand the quadratic form $\|u - v\|_{\Omega^{-1}}^2 = (u - v)^\top \Omega^{-1} (u - v)$. In (15) we distribute the $\frac{1}{2}$ and the minus sign. In (16) the squared-norm terms cancel pairwise: $\frac{1}{2}\|u\|_{\Omega^{-1}}^2$ with $-\frac{1}{2}\|u\|_{\Omega^{-1}}^2$, and $\frac{1}{2}\|v\|_{\Omega^{-1}}^2$ with $-\frac{1}{2}\|v\|_{\Omega^{-1}}^2$. In (17) we use the symmetry of $\Omega^{-1}$, which implies $v^\top \Omega^{-1} u = u^\top \Omega^{-1} v$.
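For the skeptical reader, a quick numerical check of the identity verified in (14)–(17), using an arbitrary symmetric positive definite matrix in place of $\Omega^{-1}$ (the matrix `M` below is synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
M = A @ A.T + 4 * np.eye(4)         # symmetric positive definite stand-in for omega^{-1}
u, v = rng.normal(size=4), rng.normal(size=4)

lhs = u @ M @ v
rhs = 0.5 * (u @ M @ u + v @ M @ v - (u - v) @ M @ (u - v))
assert np.isclose(lhs, rhs)
```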
Reading (14)–(17) from right to left gives us the decomposition we need:

$$
u^\top \Omega^{-1} v = \tfrac{1}{2}\|u\|_{\Omega^{-1}}^2 + \tfrac{1}{2}\|v\|_{\Omega^{-1}}^2 - \tfrac{1}{2}\|u - v\|_{\Omega^{-1}}^2 \qquad (18)
$$
This is the matrix-weighted analogue of the scalar identity $ab = \frac{1}{2}\big(a^2 + b^2 - (a - b)^2\big)$. We substitute (18), with $u = \tilde{x}_t$ and $v = \tilde{x}_{T+1}$, into (13):

$$
\begin{aligned}
\hat{y}_{T+1} - \bar{y} &= \tfrac{1}{T-1}\sum_{t=1}^{T} \Big(\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1} - \tfrac{1}{2}\|\tilde{x}_{T+1}\|_{\Omega^{-1}}^2\Big)\, \tilde{y}_t && (19)\\
&= \tfrac{1}{T-1}\sum_{t=1}^{T} \Big(\tfrac{1}{2}\|\tilde{x}_t\|_{\Omega^{-1}}^2 + \tfrac{1}{2}\|\tilde{x}_{T+1}\|_{\Omega^{-1}}^2 - \tfrac{1}{2}\|\tilde{x}_t - \tilde{x}_{T+1}\|_{\Omega^{-1}}^2 - \tfrac{1}{2}\|\tilde{x}_{T+1}\|_{\Omega^{-1}}^2\Big)\, \tilde{y}_t && (20)\\
&= \tfrac{1}{T-1}\sum_{t=1}^{T} \Big(\tfrac{1}{2}\|\tilde{x}_t\|_{\Omega^{-1}}^2 - \tfrac{1}{2}\|\tilde{x}_t - \tilde{x}_{T+1}\|_{\Omega^{-1}}^2\Big)\, \tilde{y}_t && (21)
\end{aligned}
$$

(19) repeats (13). In (20) we replace the inner product $\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}$ using (18). In (21) the term $+\frac{1}{2}\|\tilde{x}_{T+1}\|_{\Omega^{-1}}^2$ contributed by (18) and the term $-\frac{1}{2}\|\tilde{x}_{T+1}\|_{\Omega^{-1}}^2$ carried over from (13) cancel.
We add $\bar{y}$ to both sides of (21) to recover the actual prediction. Since $\tilde{x}_t = x_t - \bar{x}$ and $\tilde{x}_{T+1} = x_{T+1} - \bar{x}$, we also restore the original (un-centered) notation inside the quadratic forms, noting that $\|\tilde{x}_t\|_{\Omega^{-1}}^2 = (x_t - \bar{x})^\top \Omega^{-1} (x_t - \bar{x})$ and $\tilde{x}_t - \tilde{x}_{T+1} = x_t - x_{T+1}$:

$$
\hat{y}_{T+1} = \bar{y} + \frac{1}{T-1}\sum_{t=1}^{T} \Big(\tfrac{1}{2}(x_t - \bar{x})^\top \Omega^{-1} (x_t - \bar{x}) - \tfrac{1}{2}(x_t - x_{T+1})^\top \Omega^{-1} (x_t - x_{T+1})\Big)\, (y_t - \bar{y}) \qquad (22)
$$
Predictor Space Geometry
Let’s inspect the two quadratic forms in (22). Both are instances of the Mahalanobis distance, a distance measure that accounts for the variance of and covariance between the input variables through the inverse covariance matrix $\Omega^{-1}$. Where the ordinary Euclidean distance treats every direction in predictor space equally, the Mahalanobis distance stretches directions of low variance and compresses directions of high variance, effectively standardizing the measurement. In the special case $\Omega = I$ (uncorrelated variables with unit variance), the Mahalanobis distance reduces to the Euclidean distance.
The first term, $\frac{1}{2}(x_t - \bar{x})^\top \Omega^{-1} (x_t - \bar{x})$, is (half) the squared Mahalanobis distance of observation $x_t$ from the historical average $\bar{x}$. It measures how unusual this observation was relative to typical conditions. Kinlaw et al. call this quantity Informativeness, $\operatorname{info}(x_t) = \frac{1}{2}(x_t - \bar{x})^\top \Omega^{-1} (x_t - \bar{x})$: observations far from the mean are more likely to carry real event-driven information, while near-average observations are more likely to be noise.
The second term, $\frac{1}{2}(x_t - x_{T+1})^\top \Omega^{-1} (x_t - x_{T+1})$, is (half) the squared Mahalanobis distance between observation $x_t$ and the current conditions $x_{T+1}$. It enters (22) with a minus sign, and its negative is what Kinlaw et al. call Similarity, $\operatorname{sim}(x_t, x_{T+1}) = -\frac{1}{2}(x_t - x_{T+1})^\top \Omega^{-1} (x_t - x_{T+1})$: observations close to the current situation are similar, and the negative Mahalanobis distance is largest (closest to zero) for observations that resemble the present.
With these definitions in hand, we reorder the terms in (22) and label them:

$$
\hat{y}_{T+1} = \bar{y} + \frac{1}{T-1}\sum_{t=1}^{T} \Big(\underbrace{-\tfrac{1}{2}(x_t - x_{T+1})^\top \Omega^{-1} (x_t - x_{T+1})}_{\operatorname{sim}(x_t,\, x_{T+1})} + \underbrace{\tfrac{1}{2}(x_t - \bar{x})^\top \Omega^{-1} (x_t - \bar{x})}_{\operatorname{info}(x_t)}\Big)\, (y_t - \bar{y}) \qquad (23)
$$

Kinlaw et al. (2021) combine these two components into a single metric called Relevance:

$$
r_t = \operatorname{sim}(x_t, x_{T+1}) + \operatorname{info}(x_t) \qquad (24)
$$

Substituting (24) into (23):

$$
\hat{y}_{T+1} = \bar{y} + \frac{1}{T-1}\sum_{t=1}^{T} r_t\, (y_t - \bar{y}) \qquad (25)
$$
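The relevance form (25) is easy to check numerically. Continuing the numpy sketch from above (names like `info`, `sim`, and `r` are mine, chosen to mirror the notation):

```python
omega_inv = np.linalg.inv(omega)
xc_new = x_new - X.mean(0)

# informativeness of x_t: half squared Mahalanobis distance from the mean
info = 0.5 * np.einsum('ti,ij,tj->t', Xc, omega_inv, Xc)
# similarity to x_{T+1}: negative half squared Mahalanobis distance to the current conditions
diff = X - x_new
sim = -0.5 * np.einsum('ti,ij,tj->t', diff, omega_inv, diff)
r = sim + info                      # relevance of each past observation

pred_relevance = y.mean() + (r @ yc) / (T - 1)
assert np.isclose(pred_relevance, pred_standard)
```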
Relation to the Hat Matrix
The relevance decomposition is closely related to the classical hat matrix from regression theory. The hat matrix $H$ is defined as the matrix that maps the vector of observed outcomes $y = (y_1, \dots, y_T)^\top$ to the vector of fitted values $\hat{y} = (\hat{y}_1, \dots, \hat{y}_T)^\top$:

$$
\hat{y} = H y, \qquad H = Z\,(Z^\top Z)^{-1} Z^\top \qquad (26)
$$

Here $Z$ is the $T \times (k+1)$ design matrix whose rows are $z_t^\top = (1, x_t^\top)$ (augmented with a leading 1 for the intercept). The element $H_{st}$ determines how much the outcome of observation $t$ contributes to the fitted value at observation $s$. For an out-of-sample prediction at $x_{T+1}$, the same logic applies: $\hat{y}_{T+1} = z_{T+1}^\top (Z^\top Z)^{-1} Z^\top y = \sum_{t=1}^{T} h_t\, y_t$, where $h_t$ is the weight that observation $t$ receives. The name “hat matrix” comes from the fact that it puts the hat on $y$.
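A short check, continuing the sketch above, that the out-of-sample hat weights $h_t = z_{T+1}^\top (Z^\top Z)^{-1} z_t$ indeed map the raw outcomes directly to the same prediction:

```python
z_new = np.concatenate([[1.0], x_new])          # augmented current observation
h = z_new @ np.linalg.solve(Z.T @ Z, Z.T)       # h_t = z_{T+1}' (Z'Z)^{-1} z_t
assert np.isclose(h @ y, pred_standard)
```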
We can derive the hat matrix weights directly from our existing equations. Starting from (10), brought back into prediction form $\hat{y}_{T+1} = \bar{y} + \frac{1}{T-1}\sum_t \big(\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}\big)(y_t - \bar{y})$, we substitute $\bar{y} = \frac{1}{T}\sum_t y_t$ and expand:

$$
\begin{aligned}
\hat{y}_{T+1} &= \tfrac{1}{T}\sum_{t=1}^{T} y_t + \tfrac{1}{T-1}\sum_{t=1}^{T} \big(\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}\big)\,(y_t - \bar{y}) && (27)\\
&= \tfrac{1}{T}\sum_{t=1}^{T} y_t + \tfrac{1}{T-1}\sum_{t=1}^{T} \big(\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}\big)\, y_t - \tfrac{\bar{y}}{T-1}\sum_{t=1}^{T} \tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1} && (28)\\
&= \tfrac{1}{T}\sum_{t=1}^{T} y_t + \tfrac{1}{T-1}\sum_{t=1}^{T} \big(\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}\big)\, y_t && (29)\\
&= \sum_{t=1}^{T} \Big(\tfrac{1}{T} + \tfrac{1}{T-1}\,\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}\Big)\, y_t && (30)
\end{aligned}
$$

In (27) we write $\bar{y} = \frac{1}{T}\sum_t y_t$ and substitute it for the leading sample mean. In (28) we distribute the product over $(y_t - \bar{y})$ and pull $\bar{y}$ out of the second sum. In (29) we use $\sum_{t=1}^{T} \tilde{x}_t = 0$, which eliminates the last term. In (30) we collect both sums into a single sum over $t$.
Equation (30) reveals the structural similarity to the Nadaraya-Watson kernel regression. In both cases, the prediction is simply a weighted sum of past outcomes. The difference lies in how the weights are constructed. Reading off the coefficient of $y_t$ in (30), the hat matrix weight is:

$$
h_t = \frac{1}{T} + \frac{1}{T-1}\,(x_t - \bar{x})^\top \Omega^{-1} (x_{T+1} - \bar{x}) \qquad (31)
$$

The first part, $\frac{1}{T}$, is a uniform contribution: every observation participates equally in the sample mean. The second part is the inner product $(x_t - \bar{x})^\top \Omega^{-1} (x_{T+1} - \bar{x})$, scaled by $\frac{1}{T-1}$, which tilts the weight toward or away from observation $t$ depending on the geometric relationship between $x_t$ and $x_{T+1}$.
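Continuing the sketch, (31) can be confirmed against the hat weights computed from the design matrix above, without ever forming $Z$:

```python
# hat weights via (31): uniform part plus scaled inner product
h_eq31 = 1.0 / T + (Xc @ omega_inv @ xc_new) / (T - 1)
assert np.allclose(h_eq31, h)
```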
How does the relevance $r_t$ relate to the inner product $\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1}$? Let us expand (24) explicitly:

$$
\begin{aligned}
r_t &= \tfrac{1}{2}\tilde{x}_t^\top \Omega^{-1} \tilde{x}_t - \tfrac{1}{2}(\tilde{x}_t - \tilde{x}_{T+1})^\top \Omega^{-1} (\tilde{x}_t - \tilde{x}_{T+1}) && (32)\\
&= \tfrac{1}{2}\tilde{x}_t^\top \Omega^{-1} \tilde{x}_t - \tfrac{1}{2}\tilde{x}_t^\top \Omega^{-1} \tilde{x}_t + \tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1} - \tfrac{1}{2}\tilde{x}_{T+1}^\top \Omega^{-1} \tilde{x}_{T+1} && (33)\\
&= \tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1} - \tfrac{1}{2}\tilde{x}_{T+1}^\top \Omega^{-1} \tilde{x}_{T+1} && (34)
\end{aligned}
$$

In (33) we expand the quadratic form of the similarity term and use the symmetry of $\Omega^{-1}$ to merge the two cross terms. In (34) the terms $\pm\frac{1}{2}\tilde{x}_t^\top \Omega^{-1} \tilde{x}_t$ cancel.
Relevance and the inner product differ by the residual term $\frac{1}{2}\tilde{x}_{T+1}^\top \Omega^{-1} \tilde{x}_{T+1}$. This scalar depends on the current conditions but not on the summation index $t$. In the prediction (25), it is multiplied by $(y_t - \bar{y})$, whose sum is zero, and therefore vanishes. The hat matrix weight and relevance are thus related by:

$$
\begin{aligned}
h_t &= \frac{1}{T} + \frac{1}{T-1}\Big(r_t + \tfrac{1}{2}\tilde{x}_{T+1}^\top \Omega^{-1} \tilde{x}_{T+1}\Big) && (35)\\
&= \frac{r_t}{T-1} + \underbrace{\frac{1}{T} + \frac{\tilde{x}_{T+1}^\top \Omega^{-1} \tilde{x}_{T+1}}{2\,(T-1)}}_{c} && (36)
\end{aligned}
$$

In (35) we substitute (34) solved for the inner product, $\tilde{x}_t^\top \Omega^{-1} \tilde{x}_{T+1} = r_t + \frac{1}{2}\tilde{x}_{T+1}^\top \Omega^{-1} \tilde{x}_{T+1}$, into (31). In (36) we collect the terms that do not depend on $t$ into the constant $c$.
The term $c$ depends only on the current conditions $x_{T+1}$ and acts as a uniform baseline for every observation. It combines the $\frac{1}{T}$ from the sample mean with a correction that accounts for how far $x_{T+1}$ lies from the historical centre. The observation-specific part of $h_t$ is proportional to relevance alone.
This means the hat matrix and the relevance representation agree on the prediction (both yield the same $\hat{y}_{T+1}$), and the per-observation variation in the hat matrix weights is governed entirely by $r_t$. The relevance decomposition thus provides a geometric interpretation of what the hat matrix encodes: similarity to the current conditions and informativeness of the historical observation.
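One last check of (36), again continuing the numpy sketch: shifting the scaled relevances by the constant $c$ recovers the hat weights exactly.

```python
# constant baseline c from (36): uniform 1/T plus the distance of x_{T+1} from the centre
c = 1.0 / T + (xc_new @ omega_inv @ xc_new) / (2 * (T - 1))
assert np.allclose(r / (T - 1) + c, h)
```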
What Equation (25) Reveals
Equation (25) is an interesting rewriting of the linear regression prediction (1)–(4). It produces the same number for any dataset and any $x_{T+1}$. But it reveals the mechanics that the coefficient-based formula conceals.
The prediction is anchored at the historical average $\bar{y}$ and then adjusted by a weighted sum of past outcome deviations $(y_t - \bar{y})$. The weight on each historical observation is its relevance $r_t$, which combines two Mahalanobis-distance-based quantities: how similar $x_t$ is to the current conditions $x_{T+1}$, and how informative $x_t$ was to begin with. As (36) shows, relevance is the observation-varying component of the hat matrix weight; the hat matrix and the relevance representation describe the same linear combination of outcomes from two complementary perspectives.
Observations with positive relevance pull the prediction toward their outcomes in the natural way. Observations with negative relevance contribute too, but in reverse: the regression takes the outcome of a dissimilar, uninformative period and bets that the opposite will occur now. It treats this inversion as equally valuable as the direct extrapolation from relevant periods. When the data contains regime shifts, however, the outcomes of an irrelevant period carry no valuable information about the current one, and inverting them can be expected to worsen the forecast.
Equation (25) therefore sets the stage for a natural generalization. If the use of negatively relevant observations is problematic, the remedy is to simply exclude them: restrict the complete set of observations $\{1, \dots, T\}$ to the partial subset of observations with $r_t > r^\ast$, where $r^\ast$ is a relevance threshold, for instance $r^\ast = 0$ to drop exactly the negatively relevant periods. This is what Kinlaw et al. (2021) call Partial Sample Regression.
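To make the exclusion step concrete, here is a minimal sketch continuing the numpy example above. It keeps the structure of (25) and simply drops the terms below the threshold; the estimator in Kinlaw et al. (2021) may normalise the retained subset differently, so treat this as an illustration of the idea rather than their exact implementation.

```python
# partial-sample idea: keep only sufficiently relevant observations
r_star = 0.0                                   # illustrative relevance threshold
keep = r > r_star
pred_partial = y.mean() + (r[keep] @ yc[keep]) / (T - 1)
```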
References
Kinlaw, W., Kritzman, M., and Turkington, D. (2021): Asset Allocation: From Theory to Practice and Beyond. Wiley.