Multivariate Break-Even Correlation Thresholds

As we all know, backtesting is not a research tool, but the very end of your research pipeline. If you want to evaluate whether a given set of signals is predictive of returns, you can do this more clearly and directly by regressing returns on the signals or measuring their correlations. But “how strong” do those correlations need to be for the signals to be “good enough”? A popular heuristic by Macrocephalopod provides a practical way of thinking about this question.

This note builds on that idea, replacing all approximations with a fully generalised multivariate derivation for $n$ signals.

Linear Model

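Say we have $n$ signals collected in a vector $x \in \mathbb{R}^n$ and a scalar return $r$ we try to predict. We model the return as a linear function of the signals:

$$r = \alpha + \beta^\top x + \varepsilon$$

$\alpha$ is the regression’s intercept (the expected return when all signals are zero, which equals the unconditional expected return when the signals have zero mean), $\beta \in \mathbb{R}^n$ is the vector of slope coefficients, and the residual $\varepsilon$ is a mean-zero random variable with variance $\sigma_\varepsilon^2$. We can decompose $\varepsilon$ into a scaled unit-variance residual term $\sigma_\varepsilon z$, where $z$ has unit variance, yielding $r = \alpha + \beta^\top x + \sigma_\varepsilon z$. We assume $\operatorname{Cov}(x, \varepsilon) = 0$, which we use throughout.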

From the Model to Correlations

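The marginal correlation between $r$ and a signal $x_i$ is $\rho_i$, defined as $\rho_i = \operatorname{Cov}(x_i, r) / (\sigma_i \sigma_r)$, where $\sigma_i$ is the standard deviation of the $i$-th signal and $\sigma_r$ is the standard deviation of the returns $r$. We collect the marginal standard deviations in the diagonal matrix $D = \operatorname{diag}(\sigma_1, \dots, \sigma_n)$ and the correlations in the vector $\rho = (\rho_1, \dots, \rho_n)^\top$. Dividing each component of $\operatorname{Cov}(x, r)$ by the corresponding $\sigma_i$ term amounts to pre-multiplying by $D^{-1}$. Substituting the linear model $r = \alpha + \beta^\top x + \varepsilon$ into the correlation definition and carrying through:

$$
\begin{aligned}
\rho &= \frac{1}{\sigma_r} D^{-1} \operatorname{Cov}(x, r) && \text{(a)} \\
&= \frac{1}{\sigma_r} D^{-1} \operatorname{Cov}(x, \alpha + \beta^\top x + \varepsilon) && \text{(b)} \\
&= \frac{1}{\sigma_r} D^{-1} \left[ \operatorname{Cov}(x, \alpha) + \operatorname{Cov}(x, \beta^\top x) + \operatorname{Cov}(x, \varepsilon) \right] && \text{(c)} \\
&= \frac{1}{\sigma_r} D^{-1} \left[ 0 + \Sigma \beta + 0 \right] && \text{(d)} \\
&= \frac{1}{\sigma_r} D^{-1} \Sigma \beta && \text{(e)} \\
&= \frac{1}{\sigma_r} R D \beta && \text{(f)}
\end{aligned}
$$

In (a) we write the vector form of the correlation definition, and in (b) we substitute the linear model for $r$. In (c) we use the linearity of covariance. In (d) we use three facts: $\operatorname{Cov}(x, \alpha) = 0$ because $\alpha$ is a constant; $\operatorname{Cov}(x, \beta^\top x) = \Sigma \beta$ where $\Sigma$ is the covariance matrix of the signals; and $\operatorname{Cov}(x, \varepsilon) = 0$ by our assumption. In (e) we collect terms, and in (f) we simplify by recalling how the covariance matrix decomposes into correlations and standard deviations: the signal correlation matrix $R$ is defined by $\Sigma = D R D$, so pre-multiplying by $D^{-1}$ gives $D^{-1} \Sigma = R D$.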
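As a sanity check, the identity $\rho = \frac{1}{\sigma_r} R D \beta$ can be compared against sample correlations on simulated data. The sketch below is illustrative only: the parameter values, the Gaussian distributions, and all variable names are assumptions made for this example. It also uses $\sigma_r^2 = \beta^\top \Sigma \beta + \sigma_\varepsilon^2$, which follows from the linear model when signals and residual are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 200_000

# Hypothetical population parameters (illustrative values only).
alpha = 0.001
beta = np.array([0.02, -0.01, 0.015])
sigma_eps = 0.05
R = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, -0.2],
              [0.1, -0.2, 1.0]])   # signal correlation matrix
D = np.diag([1.0, 2.0, 0.5])       # marginal signal standard deviations
Sigma = D @ R @ D                  # signal covariance matrix

# Simulate the linear model r = alpha + beta' x + eps with Cov(x, eps) = 0.
x = rng.multivariate_normal(mean=np.zeros(n), cov=Sigma, size=T)
eps = sigma_eps * rng.standard_normal(T)
r = alpha + x @ beta + eps

# Population return volatility and the correlations implied by the identity.
sigma_r = np.sqrt(beta @ Sigma @ beta + sigma_eps**2)
rho_theory = (R @ D @ beta) / sigma_r

# Sample correlations between each signal and the return.
rho_sample = np.array([np.corrcoef(x[:, i], r)[0, 1] for i in range(n)])
max_abs_err = np.max(np.abs(rho_sample - rho_theory))
print(rho_theory, rho_sample, max_abs_err)
```

With this many draws the sample correlations should sit close to the theoretical ones, differing only by sampling noise.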

From Correlations to Betas

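For stating a signal evaluation criterion in the next step, we need $\beta$ to be expressed in terms of $\rho$. We obtain this by inverting the relation $\rho = \frac{1}{\sigma_r} R D \beta$, multiplying both sides on the left by $\sigma_r D^{-1} R^{-1}$:

$$
\begin{aligned}
\sigma_r D^{-1} R^{-1} \rho &= D^{-1} R^{-1} R D \beta && \text{(a)} \\
&= D^{-1} D \beta && \text{(b)} \\
&= \beta && \text{(c)}
\end{aligned}
$$

In (a) we multiply both sides of $\rho = \frac{1}{\sigma_r} R D \beta$ by $\sigma_r D^{-1} R^{-1}$, where the $\sigma_r$ on the right cancels with $\frac{1}{\sigma_r}$. In (b) the product $R^{-1} R$ reduces to an identity, and in (c) $D^{-1} D$ reduces to an identity as well. Reading from right to left:

$$\beta = \sigma_r D^{-1} R^{-1} \rho$$

This is the multivariate generalisation of the well-known univariate identity $\beta = \rho \, \sigma_r / \sigma_x$ linking the regression slope to the correlation coefficient, where the inverse correlation matrix $R^{-1}$ adjusts for cross-correlations among the signals by isolating each signal’s unique contribution conditional on the others.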
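A minimal sketch of the conversion $\beta = \sigma_r D^{-1} R^{-1} \rho$ in NumPy follows. The function name `beta_from_correlations` and all numbers are hypothetical, chosen only to show a round trip from betas to correlations and back, plus the univariate special case.

```python
import numpy as np

def beta_from_correlations(rho, R, sigmas, sigma_r):
    """Return beta = sigma_r * D^{-1} R^{-1} rho, where D = diag(sigmas)."""
    rho = np.asarray(rho, dtype=float)
    R = np.asarray(R, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    y = np.linalg.solve(R, rho)   # R^{-1} rho without forming the inverse
    return sigma_r * y / sigmas   # applying D^{-1} divides entry i by sigma_i

# Round trip on made-up numbers: beta -> rho -> beta.
beta_true = np.array([0.02, -0.01])
sigmas = np.array([1.0, 2.0])
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])
sigma_eps = 0.05
D = np.diag(sigmas)
sigma_r = np.sqrt(beta_true @ D @ R @ D @ beta_true + sigma_eps**2)
rho = R @ D @ beta_true / sigma_r
beta_back = beta_from_correlations(rho, R, sigmas, sigma_r)

# Univariate special case: R = [[1]] gives beta = rho * sigma_r / sigma_x.
beta_uni = beta_from_correlations([0.1], [[1.0]], [2.0], 0.04)
print(beta_back, beta_uni)
```

Solving the linear system instead of inverting $R$ explicitly is a numerical design choice, which matters most when the signals are strongly correlated and $R$ is close to singular.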

Signal Evaluation Criterion

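Finally, we state what it means for the correlations of a set of signals to be “good enough”. We require that, at a signal level $k$ standard deviations from the mean (componentwise, with $k \in \mathbb{R}^n$), i.e. at $x^* = \mu + D k$ where $\mu$ is the signal mean vector, the corresponding absolute expected return exceeds a trading cost threshold $c$:

$$
\begin{aligned}
\left| \mathbb{E}[r \mid x = x^*] \right| &> c && \text{(a)} \\
\left| \alpha + \beta^\top (\mu + D k) \right| &> c && \text{(b)} \\
\left| \alpha + (\sigma_r D^{-1} R^{-1} \rho)^\top (\mu + D k) \right| &> c && \text{(c)} \\
\left| \alpha + \sigma_r \rho^\top R^{-1} D^{-1} (\mu + D k) \right| &> c && \text{(d)} \\
\left| \alpha + \sigma_r \rho^\top R^{-1} (D^{-1} \mu + k) \right| &> c && \text{(e)} \\
\left| \alpha + \sigma_r \rho^\top R^{-1} (m + k) \right| &> c && \text{(f)}
\end{aligned}
$$

In (a) we state the criterion in general terms: the conditional expected return, evaluated at a signal realisation $k$ standard deviations from the mean, must exceed the threshold in absolute value. In (b) we substitute from the linear model $r = \alpha + \beta^\top x + \varepsilon$, and in (c) we replace $\beta$ using $\beta = \sigma_r D^{-1} R^{-1} \rho$. In (d) we transpose the product, using the fact that $D^{-1}$ and $R^{-1}$ are both symmetric, so $(D^{-1} R^{-1} \rho)^\top = \rho^\top R^{-1} D^{-1}$. In (e) we distribute $D^{-1}$ over the sum, noting that $D^{-1} D k = k$. In (f) we define $m = D^{-1} \mu$, the vector of standardised signal means whose $i$-th component is $\mu_i / \sigma_i$. The absolute value reflects that the signals can be profitable in either direction (long or short).

Since the term $\rho^\top R^{-1} (m + k)$ appears repeatedly throughout the rest of the article, we name it $S$:

$$S = \rho^\top R^{-1} (m + k)$$

This quantity collapses the entire vector of correlations $\rho$, the inter-signal dependence structure $R$, and the evaluation point $m + k$ into a single number, so that the criterion simplifies and reads:

$$\left| \alpha + \sigma_r S \right| > c$$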
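The combined signal strength $S = \rho^\top R^{-1} (m + k)$ and the criterion $|\alpha + \sigma_r S| > c$ can be sketched in a few lines. The helper names `signal_strength` and `clears_threshold` are hypothetical, and the numbers (1% return volatility, 5 bps cost, two signals evaluated one standard deviation above a zero mean) are assumptions chosen to show how signal cross-correlation alone can flip the verdict.

```python
import numpy as np

def signal_strength(rho, R, m, k):
    """S = rho^T R^{-1} (m + k)."""
    rho, m, k = (np.asarray(v, dtype=float) for v in (rho, m, k))
    R = np.asarray(R, dtype=float)
    return float(rho @ np.linalg.solve(R, m + k))

def clears_threshold(alpha, sigma_r, S, c):
    """Check the criterion |alpha + sigma_r * S| > c."""
    return abs(alpha + sigma_r * S) > c

# Illustrative inputs (assumptions, not taken from the article).
rho = [0.05, 0.03]
m = [0.0, 0.0]
k = [1.0, 1.0]
alpha, sigma_r, c = 0.0, 0.01, 0.0005

# Uncorrelated signals: S is simply the sum of the correlations.
S_indep = signal_strength(rho, np.eye(2), m, k)
ok_indep = clears_threshold(alpha, sigma_r, S_indep, c)

# Same marginal correlations, but the two signals are 0.8 correlated.
S_corr = signal_strength(rho, [[1.0, 0.8], [0.8, 1.0]], m, k)
ok_corr = clears_threshold(alpha, sigma_r, S_corr, c)

print(S_indep, ok_indep)   # S = 0.08, expected return 8 bps, clears 5 bps
print(S_corr, ok_corr)     # S = 0.08 / 1.8, about 4.4 bps, does not clear
```

The second case fails even though each marginal correlation is unchanged, because $R^{-1}$ discounts the information the two signals share.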

The parameter $k$

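The vector $k$ controls how strict the criterion is and has a direct probabilistic interpretation. Since $\mathbb{E}[r \mid x]$ is linear in $x$, the criterion treats all realisations of $x$ closer to the mean in the Mahalanobis sense, i.e. where $(x - \mu)^\top \Sigma^{-1} (x - \mu) \le d^2$, as failing as well if $x^*$ fails. By the multivariate Chebyshev’s inequality, for any random vector $x$ with mean $\mu$ and covariance matrix $\Sigma$,

$$P\left( (x - \mu)^\top \Sigma^{-1} (x - \mu) < d^2 \right) \ge 1 - \frac{n}{d^2}$$

where $d^2 = (x^* - \mu)^\top \Sigma^{-1} (x^* - \mu)$ is the squared Mahalanobis norm evaluated at the fixed point $x^* = \mu + D k$. To express $d^2$ in terms of $k$, we substitute into the quadratic form:

$$
\begin{aligned}
d^2 &= (D k)^\top \Sigma^{-1} (D k) = k^\top D \, \Sigma^{-1} D k && \text{(a)} \\
&= k^\top D \, (D R D)^{-1} D k && \text{(b)} \\
&= k^\top D \, D^{-1} R^{-1} D^{-1} D k && \text{(c)} \\
&= k^\top R^{-1} k && \text{(d)}
\end{aligned}
$$

In (a) we substitute $x^* - \mu = D k$. In (b) we substitute $\Sigma = D R D$, and in (c) we invert the triple product: $(D R D)^{-1} = D^{-1} R^{-1} D^{-1}$. In (d) the adjacent $D D^{-1}$ and $D^{-1} D$ pairs each reduce to the identity.

So at least a fraction $1 - n / (k^\top R^{-1} k)$ of all signal realisations fall within the ellipsoid of radius $\sqrt{k^\top R^{-1} k}$. If $|\alpha + \sigma_r S|$ fails to exceed the threshold $c$ at the boundary controlled by $k$, the signals may be economically non-viable for the majority of realisations and should be discarded altogether. A smaller $k$ raises the bar on $\rho$ because a lower fraction of unprofitable realisations is accepted, whereas a larger $k$ lowers the bar because a higher fraction is accepted. When $k^\top R^{-1} k \le n$, the bound becomes non-positive and ceases to be informative: since any probability is non-negative, the inequality holds trivially and no longer constrains the distribution of signal realisations in a meaningful way.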
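The coverage bound $1 - n / (k^\top R^{-1} k)$ is easy to evaluate numerically. The sketch below uses a hypothetical helper `chebyshev_coverage` and compares the bound with the empirical coverage of simulated standardised Gaussian signals. The Gaussian assumption is only for the simulation; the bound itself is distribution-free and therefore usually conservative. The helper assumes a non-zero $k$.

```python
import numpy as np

def chebyshev_coverage(k, R):
    """Return (d2, bound) with d2 = k^T R^{-1} k and bound = 1 - n / d2."""
    k = np.asarray(k, dtype=float)
    R = np.asarray(R, dtype=float)
    d2 = float(k @ np.linalg.solve(R, k))
    bound = 1.0 - k.size / d2
    return d2, bound

R = np.array([[1.0, 0.5],
              [0.5, 1.0]])

# Evaluation point two standard deviations out on both signals.
d2, bound = chebyshev_coverage([2.0, 2.0], R)

# Empirical coverage for standardised Gaussian signals (mean 0, covariance R).
rng = np.random.default_rng(1)
z = rng.multivariate_normal(np.zeros(2), R, size=100_000)
maha2 = np.einsum("ij,ij->i", z, np.linalg.solve(R, z.T).T)
empirical = float(np.mean(maha2 < d2))

# A point too close to the mean: d2 <= n, so the bound is non-positive.
d2_small, bound_small = chebyshev_coverage([1.0, 1.0], R)

print(d2, bound, empirical)    # d2 = 16/3, bound = 0.625
print(d2_small, bound_small)   # d2 = 4/3, bound = -0.5 (uninformative)
```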

Case Distinction

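The absolute value in the criterion $|\alpha + \sigma_r S| > c$ splits into two cases, depending on whether the expression inside is strictly positive or strictly negative:

$$
\begin{aligned}
\text{Case 1:} \quad \alpha + \sigma_r S &> c \\
\text{Case 2:} \quad \alpha + \sigma_r S &< -c
\end{aligned}
$$

Case 1 corresponds to the signals pushing expected returns above the positive threshold $c$ (profitable for a long position), while Case 2 corresponds to pushing expected returns below $-c$ (profitable for a short position). The two cases are checked independently. For $c > 0$ they cannot both hold at the same evaluation point, so a given $k$ satisfies one or neither; a set of signals may still satisfy Case 1 at one evaluation point and Case 2 at another.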

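Case 1: Long Profitability

We rearrange Case 1, $\alpha + \sigma_r S > c$, by moving $\alpha$ to the right-hand side, dividing by $\sigma_r$, and expanding $S$ from its definition:

$$
\begin{aligned}
\sigma_r S &> c - \alpha && \text{(a)} \\
S &> \frac{c - \alpha}{\sigma_r} && \text{(b)} \\
\rho^\top R^{-1} (m + k) &> \frac{c - \alpha}{\sigma_r} && \text{(c)}
\end{aligned}
$$

In (a) we move $\alpha$ to the right-hand side. In (b) we divide by $\sigma_r > 0$, which preserves the inequality direction, and in (c) we expand $S$ using $S = \rho^\top R^{-1} (m + k)$. This is a linear constraint on the correlation vector $\rho$: the set of admissible $\rho$ is a half-space in $\mathbb{R}^n$ with normal direction $R^{-1} (m + k)$ and offset $(c - \alpha) / \sigma_r$. Notably, if $\alpha > c$, the right-hand side becomes negative, and profitability does not require $S > 0$ since the intercept $\alpha$ already exceeds the cost threshold $c$.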
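To turn the half-space $\rho^\top R^{-1} (m + k) > (c - \alpha) / \sigma_r$ into a single break-even number, one option is to assume, purely for illustration, that every signal has the same marginal correlation $\bar\rho$. Then $\rho^\top R^{-1} (m + k) = \bar\rho \, \mathbf{1}^\top R^{-1} (m + k)$ and the long-side threshold becomes $\bar\rho > (c - \alpha) / (\sigma_r \, \mathbf{1}^\top R^{-1} (m + k))$, provided the denominator is positive. The equal-correlation assumption, the helper name `breakeven_common_correlation`, and the numbers below are not part of the derivation above.

```python
import numpy as np

def breakeven_common_correlation(alpha, R, sigma_r, m, k, c):
    """Smallest common correlation rho_bar clearing the long-side criterion.

    Assumes rho = rho_bar * ones(n), which is an illustrative simplification.
    """
    m = np.asarray(m, dtype=float)
    k = np.asarray(k, dtype=float)
    R = np.asarray(R, dtype=float)
    w = np.linalg.solve(R, m + k)   # normal direction R^{-1} (m + k)
    scale = sigma_r * w.sum()
    if scale <= 0:
        raise ValueError("sigma_r * 1' R^{-1} (m + k) must be positive")
    return (c - alpha) / scale

alpha, sigma_r, c = 0.0, 0.01, 0.0005   # 1% volatility, 5 bps cost

# One signal at one standard deviation: reduces to c / (sigma_r * k).
rho_uni = breakeven_common_correlation(alpha, [[1.0]], sigma_r, [0.0], [1.0], c)

# Two uncorrelated signals share the burden equally.
rho_two = breakeven_common_correlation(alpha, np.eye(2), sigma_r, [0.0, 0.0], [1.0, 1.0], c)

# Two 0.8-correlated signals barely help compared with a single signal.
rho_two_corr = breakeven_common_correlation(
    alpha, [[1.0, 0.8], [0.8, 1.0]], sigma_r, [0.0, 0.0], [1.0, 1.0], c
)
print(rho_uni, rho_two, rho_two_corr)   # 0.05, 0.025, 0.045
```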

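Case 2: Short Profitability

We rearrange Case 2, $\alpha + \sigma_r S < -c$, analogously, moving $\alpha$ to the right-hand side, dividing by $\sigma_r$, and expanding $S$ from its definition:

$$
\begin{aligned}
\sigma_r S &< -c - \alpha && \text{(a)} \\
S &< -\frac{c + \alpha}{\sigma_r} && \text{(b)} \\
\rho^\top R^{-1} (m + k) &< -\frac{c + \alpha}{\sigma_r} && \text{(c)}
\end{aligned}
$$

In (a) we move $\alpha$ to the right-hand side. In (b) we divide by $\sigma_r > 0$, which preserves the inequality direction, and in (c) we expand $S$ using $S = \rho^\top R^{-1} (m + k)$. This is again a linear constraint on $\rho$, now with the reversed inequality. Analogously, if $\alpha < -c$, the right-hand side becomes positive, and profitability does not require $S < 0$ since the intercept $\alpha$ already lies below the cost threshold $-c$.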

Application

Given a concrete set of $n$ signals with intercept $\alpha$, marginal correlation vector $\rho$, signal correlation matrix $R$, return volatility $\sigma_r$, standardised signal mean vector $m$, a cost threshold $c$, and an evaluation point $k$, the procedure is as follows.

First, choose $k$ according to how selective you wish to be, noting that the Mahalanobis norm $\sqrt{k^\top R^{-1} k}$ determines the fraction of signal realisations that are covered via the Chebyshev bound $1 - n / (k^\top R^{-1} k)$. Second, compute the combined signal strength $S = \rho^\top R^{-1} (m + k)$. Third, determine whether the signals clear the profitability threshold for long positions (Case 1, $S > (c - \alpha) / \sigma_r$) or short positions (Case 2, $S < -(c + \alpha) / \sigma_r$), keeping in mind that the two cases are checked independently and that, for $c > 0$, at most one of them can hold at a given $k$.
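The three steps can be sketched as a single function. This is a minimal illustration under the notation used here ($S = \rho^\top R^{-1} (m + k)$, long case $S > (c - \alpha) / \sigma_r$, short case $S < -(c + \alpha) / \sigma_r$); the function name `evaluate_signals`, the dictionary keys, and the input values are hypothetical.

```python
import numpy as np

def evaluate_signals(alpha, rho, R, sigma_r, m, k, c):
    """Run the three-step check for one evaluation point k."""
    rho, m, k = (np.asarray(v, dtype=float) for v in (rho, m, k))
    R = np.asarray(R, dtype=float)
    n = k.size

    # Step 1: coverage implied by k via the multivariate Chebyshev bound.
    d2 = float(k @ np.linalg.solve(R, k))
    coverage = 1.0 - n / d2 if d2 > n else None   # None: bound is uninformative

    # Step 2: combined signal strength.
    S = float(rho @ np.linalg.solve(R, m + k))

    # Step 3: long and short cases, checked independently.
    return {
        "coverage_lower_bound": coverage,
        "S": S,
        "long": bool(S > (c - alpha) / sigma_r),
        "short": bool(S < -(c + alpha) / sigma_r),
    }

# Illustrative inputs (assumptions for this example only).
inputs = dict(
    alpha=0.0001,
    rho=[0.04, 0.03],
    R=[[1.0, 0.5], [0.5, 1.0]],
    sigma_r=0.01,
    m=[0.0, 0.0],
    c=0.0005,
)
result_up = evaluate_signals(k=[2.0, 2.0], **inputs)      # long side
result_down = evaluate_signals(k=[-2.0, -2.0], **inputs)  # mirrored point
print(result_up)
print(result_down)
```

In this example the long case holds at $k = (2, 2)$ and the short case holds at the mirrored point $k = (-2, -2)$, which shows how one signal set can be viable in both directions at different evaluation points.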

Remarks

The multivariate generalisation outlined in this note underlines a key structural insight: the profitability criterion does not just depend on the sum of individual correlations in isolation, but on their interaction encoded through $R^{-1}$. A marginally weak signal may contribute meaningfully to $S$ if it is uncorrelated with the other signals (providing orthogonal information), while a marginally strong signal may contribute little if it is highly correlated with other signals already in the set. Furthermore, the Chebyshev bound $1 - n / (k^\top R^{-1} k)$ reveals a cost of dimensionality: to maintain the same covered fraction of signal realisations as $n$ grows, $k^\top R^{-1} k$ must scale proportionally with $n$, meaning that $\lVert k \rVert$ must be increased. As a consequence, the criterion becomes harder to satisfy simply because the signal space is higher-dimensional.
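The point about orthogonal versus redundant signals can be made concrete with a small numerical comparison. The sketch assumes standardised signals with zero mean ($m = 0$), an evaluation point one standard deviation out on every signal, and made-up correlations; the helper `signal_strength` is hypothetical. Note that adding a signal also changes the evaluation point, so the comparison is illustrative rather than exact.

```python
import numpy as np

def signal_strength(rho, R, k):
    """S = rho^T R^{-1} k for standardised, zero-mean signals (m = 0)."""
    rho = np.asarray(rho, dtype=float)
    k = np.asarray(k, dtype=float)
    R = np.asarray(R, dtype=float)
    return float(rho @ np.linalg.solve(R, k))

# Baseline: two uncorrelated signals, each with correlation 0.05 to returns.
S_base = signal_strength([0.05, 0.05], np.eye(2), [1.0, 1.0])

# Add a weak third signal (0.02) that is orthogonal to the other two.
S_orth = signal_strength([0.05, 0.05, 0.02], np.eye(3), [1.0, 1.0, 1.0])

# Add a strong third signal (0.05) that is 0.9 correlated with the first.
R_corr = [[1.0, 0.0, 0.9],
          [0.0, 1.0, 0.0],
          [0.9, 0.0, 1.0]]
S_corr = signal_strength([0.05, 0.05, 0.05], R_corr, [1.0, 1.0, 1.0])

gain_orth = S_orth - S_base   # 0.02
gain_corr = S_corr - S_base   # 0.1 / 1.9 - 0.05, roughly 0.0026
print(gain_orth, gain_corr)
```

Under these assumptions the weak orthogonal signal adds several times more to $S$ than the marginally stronger but largely redundant one.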

The univariate derivation of this note can be found here.
