
Kim: Statistical notes for clinical researchers: logistic regression
Logistic regression is a regression model in which the dependent variable is categorical and the independent variables can be categorical or continuous. This article covers the case of a binary dependent variable, such as the occurrence of an event, coded 1 = ‘event’ and 0 = ‘no event’. Common outcomes include pass/fail, win/lose, and disease/no disease. The logistic regression model estimates the probability that the event occurs versus the probability that it does not occur.

An example: score and pass data

Let's say that an institution performed an assessment procedure to determine pass or fail of the participants, considering exam scores, interview results, and reputation among colleagues. Table 1 shows data with 2 variables, exam score and pass status (1 = pass, 0 = fail). We can notice a trend: persons with lower scores are more likely to fail, while persons with higher scores tend to pass. When we plot the data as in Figure 1A, we can see that persons with value 1 (pass) have scores shifted to the right side, while persons with value 0 (fail) have scores shifted to the left side. Persons with the same score may not have the same outcome (e.g., the cases with score = 799) because the assessment procedure comprises other factors. At least we can postulate that the probability of pass may be higher when the score is higher. What is the best-fit line for these data? A straight regression line ranging from minus infinity to infinity does not make sense in this case. Instead of ordinary linear regression, logistic regression can fit the probability more adequately. Figure 1B presents the probability estimated by logistic regression. The estimated probability (red dot and line) seems reasonable because it reflects the observed reality: the probability of pass decreases toward zero for very low scores, while it increases toward one for very high scores.
Table 1

Scores and pass status of applicants in the final assessment

Score 755 755 763 781 783 788 792 793 798 799 799 802 813 824 845
Pass 0 0 0 0 1 1 0 1 0 0 1 1 1 1 1
Figure 1
Scatterplot of pass (1 = pass, 0 = fail) and score: (A) pass and score, and (B) estimated probability (P) of pass added.

Review of probability, odds, and odds ratio

In the previous sections on risk, odds, and odds ratio, these quantities were defined by the following formulas:
$$\text{Probability (or risk)}\ p = \frac{\text{number of events}}{\text{number of all observations}}$$
$$\text{Odds} = \frac{p(\text{event})}{p(\text{non-event})} = \frac{p}{1-p}$$
$$\text{Odds ratio} = \frac{\text{odds}_1}{\text{odds}_0} = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}$$
Let's consider an example of flipping of fair coins vs. loaded coins.
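The original worked figure is not reproduced here; the following is a minimal Python sketch of the same idea, assuming a fair coin with p = 0.5 and a hypothetical loaded coin with p = 0.75 (the loaded-coin probability is an illustrative assumption, not taken from the original figure).

```python
# Probability, odds, and odds ratio: fair coin vs. a hypothetical loaded coin.

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

p_fair = 0.5      # fair coin: P(heads) = 0.5
p_loaded = 0.75   # loaded coin: assumed P(heads) = 0.75

odds_fair = odds(p_fair)              # 0.5 / 0.5 = 1.0
odds_loaded = odds(p_loaded)          # 0.75 / 0.25 = 3.0
odds_ratio = odds_loaded / odds_fair  # 3.0

print(f"odds (fair)   = {odds_fair:.2f}")
print(f"odds (loaded) = {odds_loaded:.2f}")
print(f"odds ratio    = {odds_ratio:.2f}")
```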
The odds ratio is important in interpreting logistic regression because it represents how much the odds change with a 1 unit increase in a predictor variable while all other variables are kept constant.

Logistic regression

1. Logit link function

Logistic regression uses the logit link function to estimate the unknown probability of the outcome (p) from a linear combination of predictor variables. The original probability, ranging from zero to one, cannot be matched directly to a linear combination of predictor variables, which ranges from minus infinity to infinity [1].
$$\text{logit}(p) = \ln(\text{odds}) = \log_e\frac{p}{1-p}$$
where $\log_e x = \ln(x)$ and $e$ is Euler's number, 2.71828….
The logit link function accommodates p ranging from zero to one and reconciles this incongruity by transforming the dependent variable, p, onto a scale from minus infinity to infinity. As seen in Table 2, the resulting logit (p) values range from negative to positive values.
Table 2

Logit transformation from probability (p)

P 0.01 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.99
1−p 0.99 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.01
odds 0.01 0.11 0.25 0.43 0.67 1.00 1.50 2.33 4.00 9.00 99.00
Logit (p) = ln (odds) −4.60 −2.20 −1.39 −0.85 −0.41 0.00 0.41 0.85 1.39 2.20 4.60
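As a quick check of Table 2, the logit transformation can be computed directly. This is a minimal Python sketch (the probability grid is taken from Table 2; it is not part of the article's SPSS workflow):

```python
import math

# Reproduce Table 2: odds and logit(p) = ln(p / (1 - p)) for a grid of probabilities.
probabilities = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99]

print(f"{'p':>5} {'odds':>8} {'logit(p)':>9}")
for p in probabilities:
    odds = p / (1 - p)
    logit = math.log(odds)   # natural log, ln
    print(f"{p:>5.2f} {odds:>8.2f} {logit:>9.2f}")
```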

2. Property of logit and inverse logit

As shown in Figure 2A, the logit function has an S-shaped curve. Logit (p) is undefined at p = 0 and p = 1. As p approaches zero, logit (p) goes toward minus infinity, and as p approaches one, it goes toward infinity. Note that logit (p) equals zero at p = 0.5.
Figure 2B shows the inverse logit graph. The inverse logit returns the probability of the event, ranging from zero to one. Figure 1B and Figure 2B show a similar shape because both represent estimated probabilities. The derived inverse logit formula is as follows:
$$\text{logit}^{-1}(\alpha) = \frac{1}{1+e^{-\alpha}} = \frac{e^{\alpha}}{1+e^{\alpha}} = p$$
where α = logit (p), which can be any real number.
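A minimal Python sketch showing that the inverse logit maps any real α back to a probability between zero and one, and that it undoes the logit transformation (pure arithmetic, independent of the article's SPSS output):

```python
import math

def logit(p):
    """logit(p) = ln(p / (1 - p)); defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(alpha):
    """Inverse logit: maps any real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-alpha))

p = 0.3
alpha = logit(p)              # -0.847...
print(inverse_logit(alpha))   # 0.3 (recovers the original probability)
print(inverse_logit(-4.6), inverse_logit(4.6))  # ~0.01 and ~0.99 (cf. Table 2)
```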
Figure 2
Probability and logit transformation: (A) natural log of odds (logit [p]), (B) inverse logit (p).

3. Estimation of logistic regression equation

Simple logistic regression expresses logit (p) as a linear combination of the predictor variable, as below.
$$\text{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x$$
Using fictitious data based on the example above, logistic regression was performed, and the output is provided in Appendix 1. The observations (n = 15) were multiplied by 100 to artificially provide enough power to obtain significant estimates. The dependent variable was the binary variable pass, and score was the predictor variable. The SPSS (IBM Corp., Armonk, NY, USA) output (panel (e) in Appendix 1) provides the estimated coefficients.
The estimated logistic equation is:
$$\text{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x = -73.578140 + 0.093115 \times \text{Score}$$
where p = probability of ‘pass’.
Here, exp (β1) represents the odds ratio, which is the amount of change in the odds with a 1 unit increase in the predictor variable. The odds ratio, exp (β1) = e^0.093115 = 1.097588. Therefore, as the score increases by 1 point, the odds of pass are estimated to increase by 9.8%. The 95% confidence interval of the odds ratio was [1.086, 1.109], which does not include the value one. An odds ratio of one means that a 1 unit increase in the predictor variable makes no difference in the odds. Therefore, to establish statistical significance, it is important to confirm that the 95% confidence interval of the odds ratio does not include one.
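The article fits this model in SPSS; as a rough illustration (not the article's SPSS procedure), an approximately equivalent fit can be obtained in Python with statsmodels, using the Table 1 data and replicating each observation 100 times as described above. The resulting coefficients and odds ratio should be close to the values reported from the SPSS output, though rounding may differ.

```python
import numpy as np
import statsmodels.api as sm

# Table 1 data (n = 15), each observation replicated 100 times as in the article.
score = np.array([755, 755, 763, 781, 783, 788, 792, 793,
                  798, 799, 799, 802, 813, 824, 845])
passed = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1])

score_rep = np.repeat(score, 100).astype(float)
passed_rep = np.repeat(passed, 100)

X = sm.add_constant(score_rep)            # adds the intercept column
model = sm.Logit(passed_rep, X).fit(disp=0)

b0, b1 = model.params
print(f"intercept = {b0:.6f}, slope = {b1:.6f}")
print(f"odds ratio per 1-point increase = {np.exp(b1):.6f}")
print(model.conf_int())                   # 95% CIs for the coefficients
```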

1) Estimated probability

After some algebra, the inverse logit gives the estimated probability as a function of the predictor variable, as follows:
$$\text{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x$$
$$\frac{p}{1-p} = e^{\beta_0+\beta_1 x}$$
$$p = e^{\beta_0+\beta_1 x}(1-p)$$
$$p\left(1+e^{\beta_0+\beta_1 x}\right) = e^{\beta_0+\beta_1 x}$$
$$\hat{p} = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}$$
To get the probability of pass at score 781, we can use the estimated probability function. If the score increases by 1 point to 782, the estimated probability can be calculated as shown in Table 3. According to these results, the estimated probability of pass at score 781 is 0.30, or 30%. The odds ratio is obtained as 1.098, the same value as exp (β1) from the SPSS output, representing a 9.8% increase in odds for a 1 point increase in the score.
Table 3

Estimated probability and odds ratio based on logistic regression model

Score = 781:
$$\hat{p} = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}} = \frac{e^{-73.578140+0.093115\times 781}}{1+e^{-73.578140+0.093115\times 781}} = \frac{e^{-0.85533}}{1+e^{-0.85533}} = \frac{0.425145}{1+0.425145} = 0.298317$$
$$\text{odds} = \frac{p}{1-p} = \frac{0.298317}{1-0.298317} = 0.425145$$
Score = 782:
$$\hat{p} = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}} = \frac{e^{-73.578140+0.093115\times 782}}{1+e^{-73.578140+0.093115\times 782}} = \frac{e^{-0.76221}}{1+e^{-0.76221}} = \frac{0.466634}{1+0.466634} = 0.318167$$
$$\text{odds} = \frac{p}{1-p} = \frac{0.318167}{1-0.318167} = 0.466634$$
Odds ratio for a 1 point increase in score:
$$\frac{\text{odds at }782}{\text{odds at }781} = \frac{0.466634}{0.425145} = 1.097588$$
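The arithmetic in Table 3 can be reproduced directly from the fitted coefficients. A minimal Python sketch, using the intercept and slope reported above:

```python
import math

b0, b1 = -73.578140, 0.093115   # coefficients from the fitted model above

def p_hat(score):
    """Estimated probability of pass: inverse logit of b0 + b1 * score."""
    z = b0 + b1 * score
    return math.exp(z) / (1 + math.exp(z))

p781, p782 = p_hat(781), p_hat(782)
odds781 = p781 / (1 - p781)     # ~0.425
odds782 = p782 / (1 - p782)     # ~0.467
print(p781, p782)               # ~0.298 and ~0.318
print(odds782 / odds781)        # ~1.0976 = exp(b1)
```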
Estimated probabilities for other score values are shown in the SPSS output (panel (f) in Appendix 1) under ‘PRE_1’. Using these, we can calculate the odds and the odds ratio between 2 specific scores. For example, suppose my present score is 781 and I'd like to know how much the odds increase if I raise my score by 11 points to 792. The odds ratio can be obtained easily; the calculation shows an increase of 179% in the odds when the score is raised by 11 points (Table 4).
$$\frac{\text{odds at }792}{\text{odds at }781} = \frac{1.18}{0.43} \approx 2.79$$
Table 4

Scores, estimated probabilities, and odds ratios based on logistic regression model

Score 755 755 763 781 783 788 792 793 798 799 799 802 813 824 845
P 0.04 0.04 0.07 0.30 0.34 0.45 0.54 0.57 0.67 0.69 0.69 0.75 0.89 0.96 0.99
Odds 0.04 0.04 0.08 0.43 0.51 0.82 1.18 1.30 2.07 2.27 2.27 3.01 8.37 23.31 164.84
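The same logic extends to any score difference: the odds ratio for a k-point increase in the predictor is exp (k × β1), which for the 11-point jump from 781 to 792 gives about 2.79. A minimal Python sketch of this shortcut:

```python
import math

b1 = 0.093115        # slope from the fitted model
k = 792 - 781        # 11-point increase in score

odds_ratio_k = math.exp(k * b1)
print(odds_ratio_k)  # ~2.785, i.e., about a 179% increase in odds
```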

Appendix

Appendix 1

Procedure of logistic regression using IBM SPSS.

The procedure of logistic regression using IBM SPSS Statistics for Windows Version 23.0 (IBM Corp.) is as follows.
[Step-by-step SPSS screenshots of the procedure and output, including the coefficient table (e) and the saved predicted probabilities (f) referenced above]
*In this fictitious dataset, the ‘freq’ variable was used to multiply the number of observations in order to obtain sufficient power.

References

1. Allison PD. Logistic regression using SAS: theory and application. 2nd ed. Cary (NC): SAS Institute Inc.; 2012. p. 19-26.