Mendelian randomization with highly correlated genetic variants ("cis-MR")

Written by: Steve Burgess; Published: 07 May 2021

When performing a Mendelian randomization analysis based on a single gene region, it is possible to include only a single variant in the analysis. However, if multiple variants explain independent variance in the exposure, then the analysis will be more efficient using all those variants - even if they are partially correlated. With summarized data, this can be achieved using the inverse-variance weighted (IVW) method, by incorporating a correlation matrix in the analysis.
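
As a minimal sketch of what "incorporating a correlation matrix" means in practice, the fixed-effect IVW estimate with correlated variants can be written as a weighted regression in which the covariance of the outcome associations is Omega[j, k] = byse[j] * byse[k] * rho[j, k]. The function name ivw_correlated and all numbers below are made up for illustration; they are not taken from a real analysis.

```r
# Correlated-variant IVW (fixed-effect) from summarized data.
# bx: variant-exposure associations; by: variant-outcome associations;
# byse: standard errors of by; rho: signed correlation matrix (r, not r^2).
ivw_correlated <- function(bx, by, byse, rho) {
  Omega <- (byse %o% byse) * rho   # Omega[j, k] = byse[j] * byse[k] * rho[j, k]
  w     <- solve(Omega, bx)        # Omega^{-1} bx
  denom <- sum(bx * w)             # bx' Omega^{-1} bx
  list(estimate = sum(by * w) / denom,
       se       = sqrt(1 / denom))
}

# Made-up numbers, for illustration only
bx   <- c(0.12, -0.10, -0.08)
by   <- c(0.030, -0.022, -0.025)
byse <- c(0.010, 0.011, 0.012)
rho  <- matrix(c( 1.0, -0.6, -0.5,
                 -0.6,  1.0,  0.2,
                 -0.5,  0.2,  1.0), nrow = 3, byrow = TRUE)

fit <- ivw_correlated(bx, by, byse, rho)
fit$estimate
fit$se
```

With rho set to the identity matrix, this reduces to the usual IVW estimate for uncorrelated variants.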

For this, it's important that the correlations are in the right direction: what is needed are signed correlations, not squared correlations (r, not r^2). The signs of the correlations must correspond to the same effect alleles as the genetic associations. It's best to estimate such a matrix in a large sample with similar ethnic background to the dataset from which the summarized data are taken (in particular, the genetic associations with the outcome). If this isn't possible, then the ld_matrix function in the TwoSampleMR package can help. For example:

library(TwoSampleMR)
# Signed correlations (r, not r^2) between the three variants
rho = ld_matrix(c("rs7529229", "rs4845371", "rs12740969"))
rho
> rs7529229_C_T rs4845371_T_C rs12740969_T_G
> rs7529229_C_T 1.000000 -0.687196 -0.571108
> rs4845371_T_C -0.687196 1.000000 0.155994
> rs12740969_T_G -0.571108 0.155994 1.000000

We see that the effect alleles are T for rs7529229, C for rs4845371, and G for rs12740969. If our summarized data instead use the T allele for rs4845371, we can flip the signs of the elements in the second row and column of rho (the diagonal element is multiplied by -1 twice, so it stays at 1):

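# -1 for each variant whose effect allele differs from the summarized data, 1 otherwise
flip = c(1, -1, 1)
# element [j, k] is multiplied by flip[j] * flip[k]
rho.new = rho * (flip %o% flip)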
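
With more than a handful of variants, writing the flip vector by hand gets error-prone. Here is a rough sketch of deriving it from the row names, assuming (as in the reading above) that the names have the form rsid_other_effect, so the last allele is the effect allele. The vector my_effect_allele is hypothetical and stands in for the effect alleles in your own summarized data; the matrix is typed in from the output above so that the example runs without calling ld_matrix.

```r
nm  <- c("rs7529229_C_T", "rs4845371_T_C", "rs12740969_T_G")
rho <- matrix(c( 1.000000, -0.687196, -0.571108,
                -0.687196,  1.000000,  0.155994,
                -0.571108,  0.155994,  1.000000),
              nrow = 3, byrow = TRUE, dimnames = list(nm, nm))

# Hypothetical effect alleles used in our own summarized data
my_effect_allele <- c(rs7529229 = "T", rs4845371 = "T", rs12740969 = "G")

# Split "rsid_other_effect" into its three parts
parts     <- strsplit(rownames(rho), "_")
snp       <- sapply(parts, `[`, 1)
ld_other  <- sapply(parts, `[`, 2)
ld_effect <- sapply(parts, `[`, 3)

# Effect allele in our data for each row of rho
ea <- unname(my_effect_allele[snp])

# Stop if an allele matches neither (e.g. strand issues need handling first)
stopifnot(all(ea == ld_effect | ea == ld_other))

flip    <- ifelse(ea == ld_effect, 1, -1)
rho.new <- rho * (flip %o% flip)
rho.new
```

This does not attempt to resolve strand ambiguity; it only checks that each allele matches one of the two alleles in the row name.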

One question is this: suppose there are multiple genetic variants in a given gene region that are all potential instruments. How should we decide how many variants to include in a Mendelian randomization analysis? Including too few variants may result in inefficiency. But including too many variants in the analysis may result in unstable estimates. The reason is that including highly correlated variants can result in numerical instabilities when inverting the correlation matrix: small changes in the correlation matrix can lead to large changes in the MR estimate. The same happens if you include 3 or more variants where no pair is close to 100% correlated, but some variants can be almost perfectly predicted from a combination of the others (the technical term is "linear dependence" or "multicollinearity").
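
One rough way to see the problem is to look at the condition number of the correlation matrix (the ratio of its largest to smallest eigenvalue): the larger it is, the more the inverse magnifies small errors. This diagnostic is my choice for illustration rather than a formal rule, and the three matrices below are made up.

```r
# Condition number of a symmetric correlation matrix
cond_number <- function(rho) {
  ev <- eigen(rho, symmetric = TRUE, only.values = TRUE)$values
  max(ev) / min(ev)
}

# Two modestly correlated variants (r = 0.3)
rho_modest <- matrix(c(1, 0.3,
                       0.3, 1), nrow = 2, byrow = TRUE)

# Two highly correlated variants (r = 0.99)
rho_high <- matrix(c(1, 0.99,
                     0.99, 1), nrow = 2, byrow = TRUE)

# Three variants: no pairwise r above 0.7, but the third is
# nearly a linear combination of the first two
rho_dep <- matrix(c(1,   0,   0.7,
                    0,   1,   0.7,
                    0.7, 0.7, 1), nrow = 3, byrow = TRUE)

cond_number(rho_modest)
cond_number(rho_high)
cond_number(rho_dep)
```

For two variants with correlation r, the condition number is (1 + r)/(1 - r), so it grows without bound as r approaches 1. In the third matrix no pairwise r^2 exceeds 0.49, yet the matrix is close to singular.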

We have developed methods for highly correlated variants: for example, in "Mendelian randomization with fine-mapped genetic data: Choosing from large numbers of correlated instrumental variables" (nih.gov), we suggest using principal components analysis (PCA) to summarize the genetic variants. However, these methods are not foolproof when the correlation matrix is imprecisely estimated or does not fit the data well (perhaps it is estimated in a slightly different population). This can result in an overly precise estimate. A similar method (but a little more robust to weak instruments) can be found in "Inference with many correlated weak instruments and summary statistics" (arXiv:2005.01765).
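
A minimal sketch of the PCA idea, as I read it from the first paper above: form a weighted version of the correlation matrix, take its leading principal components, project the genetic associations onto them, and run the correlated IVW method on the projected data. The function name ivw_pca, the 99% variance threshold, the optional K argument, and all numbers are assumptions made for this sketch and should be checked against the paper before use.

```r
ivw_pca <- function(bx, by, byse, rho, var_explained = 0.99, K = NULL) {
  # Weighted correlation matrix to be summarized by principal components
  Phi <- ((bx / byse) %o% (bx / byse)) * rho
  pca <- prcomp(Phi, scale. = FALSE)

  # Number of components: smallest K whose cumulative variance exceeds the threshold
  if (is.null(K)) {
    cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    K <- which(cum_var > var_explained)[1]
  }
  R <- pca$rotation[, seq_len(K), drop = FALSE]

  # Project the associations and their covariance onto the components
  bx0     <- as.numeric(bx %*% R)
  by0     <- as.numeric(by %*% R)
  Omega   <- (byse %o% byse) * rho
  pcOmega <- t(R) %*% Omega %*% R

  # Correlated IVW on the transformed data
  w     <- solve(pcOmega, bx0)
  denom <- sum(bx0 * w)
  list(estimate = sum(by0 * w) / denom, se = sqrt(1 / denom), K = K)
}

# Made-up data: four variants with correlation 0.9^|j - k|
rho  <- 0.9^abs(outer(1:4, 1:4, "-"))
bx   <- c(0.10, 0.11, 0.09, 0.12)
by   <- c(0.021, 0.020, 0.019, 0.026)
byse <- rep(0.01, 4)

fit_pca <- ivw_pca(bx, by, byse, rho)
fit_pca$K
fit_pca$estimate
fit_pca$se
```

If all components are retained, the projection is just a rotation and the estimate matches the correlated IVW estimate using every variant; the stabilizing effect comes from dropping the trailing components.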

Practical advice is this: first, start with aggressive pruning (I would suggest a threshold of r^2 < 0.1), and account for the remaining correlations in the analysis. By increasing the pruning threshold (to r^2 < 0.2, r^2 < 0.3, etc.) and including additional variants, you can potentially get an estimate that is slightly more precise than this. But if the standard error reduces sharply (say it decreases by a factor of 3), then I wouldn't trust the estimate, and would instead suggest reporting the estimate with more aggressive pruning. Similarly, when using the dimension reduction methods: if the MR estimate gets a bit more precise, then it's probably reliable, but if it's substantially more precise, then I'd be concerned.
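
The check described above can be sketched as a small loop: prune at each threshold, rerun the correlated IVW method, and compare the standard error with the one at r^2 < 0.1. The greedy pruning rule, the helper names, the cut-off of 3 for flagging, and the data are all illustrative assumptions; in practice you would prune in order of strength of association with the exposure.

```r
# Correlated-variant IVW (fixed-effect) from summarized data
ivw_correlated <- function(bx, by, byse, rho) {
  Omega <- (byse %o% byse) * rho
  w     <- solve(Omega, bx)
  denom <- sum(bx * w)
  list(estimate = sum(by * w) / denom, se = sqrt(1 / denom))
}

# Greedy pruning: go through variants in priority order and keep a variant
# only if its r^2 with every variant already kept is below the threshold
prune_by_r2 <- function(rho, threshold, priority = seq_len(nrow(rho))) {
  keep <- integer(0)
  for (j in priority) {
    if (all(rho[j, keep]^2 < threshold)) keep <- c(keep, j)
  }
  sort(keep)
}

# Made-up data: four variants with correlation 0.6^|j - k|
rho  <- 0.6^abs(outer(1:4, 1:4, "-"))
bx   <- c(0.10, 0.09, 0.11, 0.10)
by   <- c(0.020, 0.021, 0.019, 0.022)
byse <- rep(0.01, 4)

thresholds <- c(0.1, 0.2, 0.3, 0.4)
results <- do.call(rbind, lapply(thresholds, function(thr) {
  keep <- prune_by_r2(rho, thr)
  fit  <- ivw_correlated(bx[keep], by[keep], byse[keep],
                         rho[keep, keep, drop = FALSE])
  data.frame(threshold = thr, n_variants = length(keep),
             estimate = fit$estimate, se = fit$se)
}))

# How much smaller is the standard error than at the most aggressive pruning?
results$se_ratio   <- results$se[1] / results$se
results$suspicious <- results$se_ratio > 3
results
```

A row flagged as suspicious is a prompt to fall back to the more aggressively pruned estimate, not a formal test.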

From experience, pruning at r^2 < 0.3 is generally safe, and r^2 < 0.4 is usually okay, but I've seen problems at this level in a couple of examples. If at all possible, I'd recommend trying to get correlation estimates in as large a sample size as possible (but still relevant to the dataset under analysis!) - and be suspicious if adding further highly correlated variants to an analysis substantially reduces standard errors!