Abstract
Inferring causality from observational evidence can be problematic, because associations between an exposure and an outcome in observational studies are frequently affected by confounding factors or reverse causation. Thus, in observational studies, the association between a risk factor and a disease of interest may not be causal. A randomized controlled trial (RCT) is considered the gold standard, because it offers the best chance of establishing a causal relationship between a risk factor and an outcome. However, RCTs cannot always be performed, because they can be costly, impractical, or even unethical. One alternative is Mendelian randomization (MR), which resembles an RCT in terms of study design. The MR technique uses genetic variants related to modifiable traits/exposures as instruments to detect causal associations with outcomes. By overcoming some of the limitations of observational studies, MR can provide more credible estimates of the causal effect of a risk factor on an outcome. Therefore, MR can make a substantial contribution to our understanding of complex disease etiology. MR approaches are increasingly being used to evaluate the causality of associations with risk factors, because well-performed MR studies can be a powerful method for exploring causality in complex diseases. However, MR analyses have limitations, and an awareness of these limitations is essential for interpreting the results. The validity of results from MR studies depends on three assumptions that should be carefully checked and interpreted in the context of prior biological information.
Understanding the causal role of risk factors in disease is important for uncovering pathogenesis and planning treatment strategies. Inferring causality from observational data, however, is difficult, because observational studies conducted to identify correlations between modifiable exposures and disease outcomes are often affected by confounding factors or reverse causation, which lead to biased and misinterpreted findings [1]. Consequently, in observational studies, a link between a modifiable risk factor or exposure and an outcome may not be causal. A randomized controlled trial (RCT) is widely recognized as the gold standard, because it offers the best chance of establishing a causal relationship between a risk factor and an outcome. In many cases, associations found in observational studies are not replicated in RCTs [2,3]. In addition, RCTs cannot always be performed, because they can be costly, impractical, or even unethical [4].
To deal with confounders in observational studies and identify causal relationships, an alternative approach is needed. One such approach is Mendelian randomization (MR), which is based on Mendel’s law of independent assortment: each trait is passed to the next generation independently of other traits. In MR, the random segregation of alleles allocates individuals to groups defined by genotype, analogous to the exposed and control groups of a trial, so that unmeasured confounding factors are also expected to be distributed equally between the groups [5]. Thus, an MR study resembles an RCT [6] (Figure 1). The MR approach uses genetic variants linked to modifiable traits/exposures to identify causal associations with disease. The increasing size and scope of genome-wide association studies (GWASs) of risk factors (exposures) and diseases (outcomes) have led to increased use of MR studies. However, MR studies depend on three assumptions, and it is important to assess the plausibility of these assumptions to validate the results from MR studies.
MR uses a genetic variant as a proxy for a risk factor, and therefore, choosing a genetic instrumental variable (IV) is important for a successful MR study. The IV assumptions for a genetic variant are crucial to the validity of MR. For a genetic variant to be a valid IV for causal inference in an MR analysis, three key assumptions must be met [7] (Figure 2): (1) the genetic variant is associated with the exposure; (2) the genetic variant is not associated with confounders of the exposure–outcome relationship; and (3) the genetic variant has no effect on the outcome other than through the exposure.
IV assumption 1: The genetic variant must be related to the exposure. Typically, genome-wide-significant single nucleotide polymorphisms (SNPs) (p<5×10⁻⁸) are used as instruments in MR studies. Moreover, many SNPs are often combined into a single instrument, because a poor instrument based on a single SNP will bias MR estimates [6].
IV assumption 2: The genetic variant must not be correlated with confounders of the exposure–outcome relationship. Although it is impossible to prove that this assumption holds in an MR study, the correlation between the variant and known confounding factors of the exposure–outcome relationship can be tested and ruled out [6].
IV assumption 3: The genetic variant affects the outcome only through the risk factor. This is commonly called the “no pleiotropy” rule. If a SNP is associated with several traits independently of the exposure of interest, the third IV assumption may be violated. While it is not possible to prove that this assumption holds in an MR study, extensions of the MR design such as MR-Egger regression or the weighted median estimator can be used to detect the presence of pleiotropy and/or estimate the causal effect of the exposure, even in the presence of violations of assumption 2 and/or 3 [6].
MR can estimate causal effects when exposure and outcome data come from different samples. Two-sample MR uses two different study samples for the risk factor and the outcome to estimate the causal effect of the risk factor on the outcome [8]. It is helpful when it is difficult to measure both the risk factor and the outcome in the same sample, as two-sample MR offers the chance to incorporate data from different sources [8]. When the genotype–exposure and genotype–outcome data come from different sources, the two sets of association estimates can be combined, using meta-analysis methods, in a two-sample MR analysis to estimate the size of the exposure (risk factor)–outcome (disease) association [9]. Individually, single genetic variants usually explain very little of the variation in a phenotype, which can lead to “weak instruments,” particularly in modest sample sizes [10]. To overcome this, multiple genetic variants that collectively explain more of the variation in the risk factor are used, because together they have greater statistical power than a single variant [11].
The Wald ratio is the simplest way to estimate the causal effect of an exposure on an outcome when a single genetic variant is available. It is the ratio of the variant–outcome association to the variant–exposure association and can be interpreted as the change in the outcome per unit change in the exposure [9].
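The calculation can be illustrated with a minimal Python sketch. The function name `wald_ratio`, the first-order (delta-method) standard error that ignores uncertainty in the variant–exposure estimate, and the summary statistics are all assumptions made for illustration; they are not taken from any study cited here.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Wald ratio estimate for a single genetic variant.

    beta_exposure: variant-exposure association (must be non-zero)
    beta_outcome:  variant-outcome association
    se_outcome:    standard error of the variant-outcome association
    """
    estimate = beta_outcome / beta_exposure
    # First-order approximation: treats beta_exposure as known without error.
    se = se_outcome / abs(beta_exposure)
    return estimate, se


# Hypothetical summary statistics for one SNP (illustrative values only).
estimate, se = wald_ratio(beta_exposure=0.10, beta_outcome=0.02, se_outcome=0.005)
z = estimate / se
# Two-sided p-value from a normal approximation.
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"Wald ratio = {estimate:.3f}, SE = {se:.3f}, p = {p_value:.2g}")
```

With these made-up inputs, the estimate says that a one-unit increase in the exposure changes the outcome by roughly 0.2 units.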
When several genetic variants are associated with a specific exposure, the Wald ratio approach can be generalized through meta-analysis. In particular, the causal effect estimates from the individual genetic variants are combined in an inverse-variance weighted (IVW) meta-analysis framework. The IVW estimate, therefore, is a weighted average of the causal effect estimates of the genetic variants [11].
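As a rough sketch of this weighted average, the Python snippet below computes a fixed-effect IVW estimate from summary statistics. It assumes uncorrelated variants and first-order weights, (variant–exposure estimate / SE of the variant–outcome estimate)², and the function name `ivw_estimate` and the three variants are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted (IVW) estimate.

    Each variant contributes its Wald ratio, weighted by the inverse of the
    (first-order) variance of that ratio. Variants are assumed uncorrelated.
    """
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    weights = [(bx / se) ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    total_weight = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return estimate, se


# Hypothetical summary statistics for three SNPs (illustrative values only).
beta_exposure = [0.10, 0.20, 0.15]
beta_outcome = [0.02, 0.05, 0.03]
se_outcome = [0.01, 0.01, 0.01]

ivw_beta, ivw_se = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
print(f"IVW estimate = {ivw_beta:.4f}, SE = {ivw_se:.4f}")
```

Variants with a stronger association with the exposure, or a more precisely estimated association with the outcome, receive more weight in the pooled estimate.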
A genetic variant may act on several biological pathways. When such pleiotropic effects influence the outcome through a pathway other than the exposure of interest, this is known as horizontal pleiotropy. Horizontal pleiotropy violates the IV assumptions, because the effect of the genetic variant on the outcome is not solely attributable to the risk factor, which is problematic for MR studies. Several methods have been developed to detect and correct for violations of the assumptions when some of the selected variants are invalid instruments.
The MR-Egger method permits a pleiotropic effect of one or more genetic variants, as long as the size of such pleiotropic effects is independent of the size of the effect of the genetic variant on the risk factor [12]. In the MR-Egger method, a weighted linear regression of the gene–outcome coefficients on the gene–exposure coefficients is conducted, with an intercept term that accounts for unmeasured horizontal pleiotropy [12]. The slope of this regression is the estimate of the causal effect, and the intercept can be interpreted as an estimate of the average (horizontal) pleiotropic effect of the genetic variants [13]. MR-Egger regression replaces the second and third IV assumptions with the Instrument Strength Independent of Direct Effect (InSIDE) assumption that the exposure effects of individual SNPs are independent of their pleiotropic effects on the outcome [13].
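The regression described above can be sketched in a few lines of Python. This is a simplified illustration that returns point estimates only (no standard errors), weights each variant by the inverse variance of its gene–outcome coefficient, and first orients every variant so that its gene–exposure coefficient is positive, which is a common convention assumed here. The function name `mr_egger` and the data are hypothetical.

```python
def mr_egger(beta_exposure, beta_outcome, se_outcome):
    """Point estimates (slope, intercept) from an MR-Egger style regression.

    Weighted least squares of gene-outcome on gene-exposure coefficients,
    with an intercept, using weights 1 / se_outcome**2.
    """
    # Orient each variant so that its exposure association is positive.
    xs, ys = [], []
    for bx, by in zip(beta_exposure, beta_outcome):
        sign = -1.0 if bx < 0 else 1.0
        xs.append(sign * bx)
        ys.append(sign * by)

    weights = [1.0 / se ** 2 for se in se_outcome]
    total = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / total
    y_mean = sum(w * y for w, y in zip(weights, ys)) / total

    sxy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))

    slope = sxy / sxx                      # causal effect estimate
    intercept = y_mean - slope * x_mean    # average pleiotropic effect
    return slope, intercept


# Hypothetical variants constructed so that, after orientation,
# beta_outcome = 0.01 + 0.3 * beta_exposure (illustrative values only).
beta_exposure = [0.05, -0.10, 0.15, 0.20]
beta_outcome = [0.025, -0.040, 0.055, 0.070]
se_outcome = [0.010, 0.020, 0.010, 0.015]

slope, intercept = mr_egger(beta_exposure, beta_outcome, se_outcome)
print(f"MR-Egger slope = {slope:.3f}, intercept = {intercept:.3f}")
```

In this constructed example the non-zero intercept reflects the directional pleiotropy that was built into the data, while the slope recovers the underlying effect.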
The weighted median estimator can provide a valid causal estimate even if up to 50% of the information comes from invalid genetic variants. The causal effect is estimated as the weighted median of the ratio estimates, using the reciprocal of the variance of each ratio estimate as its weight [14]. The InSIDE assumption is not needed, and violations of the second and third IV assumptions are permitted [14]. In contrast to the MR-Egger method, the weighted median estimator has the benefit of preserving greater precision in the estimates.
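A minimal Python sketch of a weighted median of ratio estimates follows. It assumes first-order inverse-variance weights and linear interpolation between the two ordered estimates that straddle the 50% point of the cumulative weight, and it returns a point estimate only. The function name `weighted_median` and the three variants are hypothetical.

```python
def weighted_median(beta_exposure, beta_outcome, se_outcome):
    """Weighted median of per-variant ratio estimates (point estimate only)."""
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    weights = [(bx / se) ** 2 for bx, se in zip(beta_exposure, se_outcome)]

    # Sort the ratio estimates and carry their normalized weights along.
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    total = sum(weights)
    r = [ratios[i] for i in order]
    w = [weights[i] / total for i in order]

    # Percentile position of each estimate: cumulative weight minus half
    # of the estimate's own weight.
    positions = []
    running = 0.0
    for wi in w:
        running += wi
        positions.append(running - 0.5 * wi)

    below = [i for i, p in enumerate(positions) if p < 0.5]
    if not below:                 # e.g. a single variant
        return r[0]
    j = below[-1]
    if j == len(r) - 1:
        return r[-1]
    # Interpolate linearly between the neighbours of the 50% point.
    frac = (0.5 - positions[j]) / (positions[j + 1] - positions[j])
    return r[j] + frac * (r[j + 1] - r[j])


# Hypothetical data: two variants suggest an effect near 0.1-0.2, while one
# outlying (possibly pleiotropic) variant suggests 0.9.
beta_exposure = [0.10, 0.10, 0.10]
beta_outcome = [0.09, 0.01, 0.02]
se_outcome = [0.01, 0.01, 0.01]

wm = weighted_median(beta_exposure, beta_outcome, se_outcome)
print(f"Weighted median estimate = {wm:.3f}")
```

With equal weights, the outlying ratio pulls a simple weighted average up to about 0.4, whereas the weighted median stays at the middle estimate.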
Violations of the IV assumptions lead to MR limitations [15]. Owing to the limited availability of population-specific information on genetic associations, genetic instruments tend to show poor statistical power. Because genetic polymorphisms usually explain only a small fraction of the total variance in a trait, it is not advisable to rely on a simple IV analysis alone for causal inference [16], and multiple variants are often combined to increase statistical power. MR studies, therefore, need large sample sizes to ensure adequate statistical power.
A genetic variant is considered a “weak instrument” if there is no adequate statistical evidence that the genetic variant is associated with the exposure. A strong association between the genetic variant and the exposure is therefore important [10].
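One common way to screen for weak instruments, which is not described in the text above and is shown here only as an assumed convention, is to approximate each variant's F-statistic as (effect estimate / standard error)² for the variant–exposure association and to flag values below a rule-of-thumb threshold of 10. The function names and SNP labels in this Python sketch are hypothetical.

```python
def approx_f_statistic(beta_exposure, se_exposure):
    """Approximate per-variant F-statistic for the variant-exposure association."""
    return (beta_exposure / se_exposure) ** 2


def flag_weak_instruments(variants, threshold=10.0):
    """Return the names of variants whose approximate F is below the threshold.

    variants: dict mapping a variant name to (beta_exposure, se_exposure).
    threshold: rule-of-thumb cut-off (assumed here to be 10).
    """
    return [
        name
        for name, (beta, se) in variants.items()
        if approx_f_statistic(beta, se) < threshold
    ]


# Hypothetical variant-exposure summary statistics (illustrative values only).
variants = {
    "snp_a": (0.10, 0.02),   # F is about 25
    "snp_b": (0.03, 0.02),   # F is about 2.25
}
weak = flag_weak_instruments(variants)
print("Possible weak instruments:", weak)
```

A variant flagged in this way would usually be examined further or excluded rather than relied on as the sole instrument.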
A genetic variant or multiple genetic variants may be associated with other exposures, a phenomenon known as “pleiotropy” [17]. The use of a genetic variant with pleiotropic effects can bias MR estimates. MR rests on three principal assumptions, and the analysis may be undermined by violations of these IV assumptions. The first IV assumption can be directly assessed by checking whether the genetic variant is associated with the risk factor (exposure) [18]. However, the second and third IV assumptions cannot be proven empirically and require investigator judgment and the performance of several sensitivity analyses [18].
Other limitations of MR analysis include the lack of appropriate genetic variants, the existence of linkage disequilibrium, genetic heterogeneity, population stratification, developmental canalization, and a lack of understanding of the confounding factors [6].
Previous studies have shown an association between a high level of education and a reduced risk of rheumatoid arthritis (RA), but the results from observational studies were inconsistent [19-21]. The goal of this study was to investigate through MR analysis whether years of education are causally linked to the development of RA. The MR-Base database (http://www.mrbase.org/), which holds a large collection of summary data from many GWASs, was searched. Summary statistics from the UK Biobank GWASs of years of education (n=293,723) were used as the exposure dataset [22]. For the outcome dataset, a meta-analysis of GWASs of RA comprising autoantibody-positive cases (n=5,539) and European controls (n=20,169) was used [23]. SNPs associated with years of education were selected, and each SNP was checked for an association with the occurrence of RA. Finally, MR analysis combined these findings to assess the causal link between years of education and the development of RA. Two-sample MR analysis was performed to estimate the causal effect of years of education on RA development using 49 SNPs as IVs [24]. The IVW method was used for the main MR analysis, and MR-Egger regression and the weighted median method, which detect and adjust for pleiotropy, were used as sensitivity analyses [24]. The IVW method identified an inverse causal relationship between years of education and RA (β=−0.039, standard error [SE]=0.283, p=0.008) (Table 1). The MR-Egger regression intercept indicated that directional pleiotropy was unlikely to bias the MR results (intercept=0.028, p=0.358), and the MR-Egger estimate showed no causal relationship between RA and years of education (β=−2.320, SE=1.709, p=0.181) (Table 1, Figure 3). However, the weighted median approach showed a significant causal relationship between RA and years of education (β=−0.950, SE=0.355, p=0.008) (Table 1, Figures 3 and 4). In conclusion, the MR analysis showed a potential inverse causal relationship between years of education and the development of RA.
The current findings may provide an opportunity to identify the mechanisms by which years of education contribute to the development of RA.
MR studies are similar to RCTs when the three main IV assumptions are met. Although MR does not serve as a substitute for an RCT, it is useful in situations where an RCT is unethical or unavailable. Properly applied, MR is less prone to confounding than traditional analyses of observational data. By overcoming these limitations, MR can provide more accurate estimates of the causal effect of a risk factor on an outcome than those obtained from observational studies. MR approaches are increasingly being used to evaluate the causality of associations between risk factors and disease, and MR could contribute enormously to our understanding of the etiology of complex diseases. The validity of results from MR studies relies on the correctness of several assumptions, which should be closely checked and interpreted in the context of prior biological knowledge. MR analyses may have a number of limitations, and awareness of these is essential for interpreting their results. If properly applied, MR can be highly effective in uncovering and strengthening evidence for causality between modifiable exposures and a wide range of complex disease-related outcomes. Well-conducted MR studies can be a valuable method for exploring causal relationships in complex diseases.
REFERENCES
1. Greenland S, Robins JM. 1985; Confounding and misclassification. Am J Epidemiol. 122:495–506. DOI: 10.1093/oxfordjournals.aje.a114131. PMID: 4025298.
2. Greenwald P. 2003; Beta-carotene and lung cancer: a lesson for future chemoprevention investigations? J Natl Cancer Inst. 95:E1. DOI: 10.1093/jnci/95.10.e4. PMID: 12759404.
3. Lawlor DA, Davey Smith G, Kundu D, Bruckdorfer KR, Ebrahim S. 2004; Those confounded vitamins: what can we learn from the differences between observational versus randomised trial evidence? Lancet. 363:1724–7. DOI: 10.1016/S0140-6736(04)16260-0. PMID: 15158637.
4. Black N. 1996; Why we need observational studies to evaluate the effectiveness of health care. BMJ. 312:1215–8. DOI: 10.1136/bmj.312.7040.1215. PMID: 8634569. PMCID: PMC2350940.
5. Wehby GL, Ohsfeldt RL, Murray JC. 2008; 'Mendelian randomization' equals instrumental variable analysis with genetic instruments. Stat Med. 27:2745–9. DOI: 10.1002/sim.3308. PMID: 18509868.
6. Zheng J, Baird D, Borges MC, Bowden J, Hemani G, Haycock P, et al. 2017; Recent developments in Mendelian randomization studies. Curr Epidemiol Rep. 4:330–45. DOI: 10.1007/s40471-017-0128-6. PMID: 29226067. PMCID: PMC5711966.
7. Angrist JD, Imbens GW, Rubin DB. 1996; Identification of causal effects using instrumental variables. J Am Stat Assoc. 91:444–55. DOI: 10.1080/01621459.1996.10476902.
8. Lawlor DA. 2016; Commentary: two-sample Mendelian randomization: opportunities and challenges. Int J Epidemiol. 45:908–15. DOI: 10.1093/ije/dyw127. PMID: 27427429. PMCID: PMC5005949.
9. Lawlor DA, Harbord RM, Sterne JA, Timpson N, Davey Smith G. 2008; Mendelian randomization: using genes as instruments for making causal inferences in epidemiology. Stat Med. 27:1133–63. DOI: 10.1002/sim.3213. PMID: 18203119.
10. Martens EP, Pestman WR, de Boer A, Belitser SV, Klungel OH. 2006; Instrumental variables: application and limitations. Epidemiology. 17:260–7. DOI: 10.1097/01.ede.0000215160.88317.cb. PMID: 16617274.
11. Burgess S, Butterworth A, Thompson SG. 2013; Mendelian randomization analysis with multiple genetic variants using summarized data. Genet Epidemiol. 37:658–65. DOI: 10.1002/gepi.21758. PMID: 24114802. PMCID: PMC4377079.
12. Bowden J, Davey Smith G, Burgess S. 2015; Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int J Epidemiol. 44:512–25. DOI: 10.1093/ije/dyv080. PMID: 26050253. PMCID: PMC4469799.
13. Burgess S, Thompson SG. 2017; Interpreting findings from Mendelian randomization using the MR-Egger method. Eur J Epidemiol. 32:377–89. DOI: 10.1007/s10654-017-0276-5. PMID: 28664250. PMCID: PMC6828068.
14. Bowden J, Davey Smith G, Haycock PC, Burgess S. 2016; Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genet Epidemiol. 40:304–14. DOI: 10.1002/gepi.21965. PMID: 27061298. PMCID: PMC4849733.
15. Davey Smith G, Hemani G. 2014; Mendelian randomization: genetic anchors for causal inference in epidemiological studies. Hum Mol Genet. 23:R89–98. DOI: 10.1093/hmg/ddu328. PMID: 25064373. PMCID: PMC4170722.
16. Bennett DA, Holmes MV. 2017; Mendelian randomisation in cardiovascular research: an introduction for clinicians. Heart. 103:1400–7. DOI: 10.1136/heartjnl-2016-310605. PMID: 28596306. PMCID: PMC5574403.
17. Swerdlow DI, Kuchenbaecker KB, Shah S, Sofat R, Holmes MV, White J, et al. 2016; Selecting instruments for Mendelian randomization in the wake of genome-wide association studies. Int J Epidemiol. 45:1600–16. DOI: 10.1093/ije/dyw088. PMID: 27342221. PMCID: PMC5100611.
18. Sekula P, Del Greco MF, Pattaro C, Köttgen A. 2016; Mendelian randomization as an approach to assess causality using observational data. J Am Soc Nephrol. 27:3253–65. DOI: 10.1681/ASN.2016010098. PMID: 27486138. PMCID: PMC5084898.
19. Pincus T, Callahan LF. 1985; Formal education as a marker for increased mortality and morbidity in rheumatoid arthritis. J Chronic Dis. 38:973–84. DOI: 10.1016/0021-9681(85)90095-5. PMID: 4066893.
20. Kwon JM, Rhee J, Ku H, Lee EK. 2012; Socioeconomic and employment status of patients with rheumatoid arthritis in Korea. Epidemiol Health. 34:e2012003. DOI: 10.4178/epih/e2012003. PMID: 22611518. PMCID: PMC3350820.
21. Uhlig T, Hagen KB, Kvien TK. 1999; Current tobacco smoking, formal education, and the risk of rheumatoid arthritis. J Rheumatol. 26:47–54. PMID: 9918239.
22. Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, et al. 2016; Genome-wide association study identifies 74 loci associated with educational attainment. Nature. 533:539–42. DOI: 10.1038/nature17671.
23. Stahl EA, Raychaudhuri S, Remmers EF, Xie G, Eyre S, Thomson BP, et al. 2010; Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat Genet. 42:508–14. DOI: 10.1038/ng.582. PMID: 20453842. PMCID: PMC4243840.
24. Bae SC, Lee YH. 2019; Causal relationship between years of education and the occurrence of rheumatoid arthritis. Postgrad Med J. 95:378–81. DOI: 10.1136/postgradmedj-2018-136374. PMID: 31127051.