Kim HY

In the previous sections, simple linear regression (SLR) 1 and 2, we developed an SLR model and evaluated its predictability. To obtain the best fitted line, the intercept and slope were calculated using the least squares method. The predictability of the model was assessed by the proportion of the variability of the response variable explained by the model. In this section, we will discuss four basic assumptions of regression models that justify the estimated regression model, and residual analysis to check them.

Let's recall the bivariate relationship between two variables, X and Y, as depicted by the SLR model Y = β_{0}+β_{1}X+ε. The regression model is divided into two parts: the fitted regression line, ‘β_{0}+β_{1}X’, and the random error, ‘ε.’ The need for the error term is justified by the gap between the line and the observed points, because the regression line does not pass through all the observed values.

The first part, the fitted regression line, is made by connecting the expected means of Y corresponding to values of X such as χ_{1} to χ_{5}. For each value of X, the Y values are distributed in the vertical direction; the distribution is displayed as a bell-shaped normal distribution with a mean, μ_{y|χ1}, at its center. The symbol μ_{y|χ1} means the expected population mean of Y when the X variable has the value χ_{1}. In accordance with the previous sections on regression, the expected mean of Y can be symbolized as Ŷ, which is equal to the fitted line, ‘β_{0}+β_{1}X’.

The expected population mean of Y changes from μ_{y|χ1} to μ_{y|χ5} as X changes from χ_{1} to χ_{5}. The conceptual model suggests that the expected mean of Y can be depicted as the straight line ‘β_{0}+β_{1}X’ by connecting the expected means of Y for the subgroups of X. We call this straight line the ‘mean function’ because the expected mean of Y is expressed as the function ‘β_{0}+β_{1}X’. Note that there are numerous means for the numerous subgroups of a continuous X, and they are linearly connected to form the linear mean function, ‘β_{0}+β_{1}X’. Therefore, we should be able to reasonably assume that the mean function of Y has the form of the fitted regression line when we apply the SLR model.

Now let's discuss the second part, the random error ‘ε.’ The conceptual form of the random error is depicted as bell-shaped distributions centered at the conditional means μ_{y|χ1}, ..., μ_{y|χ5}. What would remain after removing the conditional means? Only the random errors, scattered around zero.

Generally, it is reasonable to assume that the random errors are normally distributed, because small errors occur frequently while large errors are found rarely. Therefore, the error terms are traditionally assumed to follow the distribution ε ~ N(0, σ^{2}). This means that the distribution of errors follows the normal distribution with a mean of zero and a constant variance, σ^{2}. What does the constant variance tell us? It shows that the error distributions for all the subgroups have the same variance. In other words, the shape of the error distribution is the same for all the subgroups.

Wrapping up the discussion above, the assumptions on the fitted line and the random error can be collectively summarized into the distribution of Y. Since Y=β_{0}+β_{1}X+ε, the distribution of Y for each subgroup is Y ~ N(β_{0}+β_{1}X, σ^{2}) for a given X. According to this definition, the distribution of the subgroup for the given value χ_{1} is a normal distribution with a mean of β_{0}+β_{1}χ_{1} and a constant variance of σ^{2}.
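This conditional distribution can be simulated directly. Below is a minimal sketch in Python (the article itself works in SPSS); the parameter values β_{0}=2, β_{1}=3, σ=1 are illustrative assumptions, not taken from the article. For each fixed x, the subgroup mean tracks the mean function β_{0}+β_{1}x while the subgroup variance stays constant at σ^{2}.

```python
import numpy as np

# Illustrative parameters (assumptions, not from the article):
# Y ~ N(b0 + b1*x, sigma^2) for each fixed x
b0, b1, sigma = 2.0, 3.0, 1.0
rng = np.random.default_rng(0)

for x in (1, 2, 3, 4, 5):
    # draw many Y values for the subgroup at this x
    y = rng.normal(b0 + b1 * x, sigma, size=100_000)
    # subgroup mean follows the mean function; variance stays ~ sigma^2
    print(x, round(y.mean(), 2), round(y.var(), 2))
```

The printed subgroup means increase linearly with x, while the printed variances stay close to 1, which is exactly what Y ~ N(β_{0}+β_{1}X, σ^{2}) asserts.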

How is the error term expressed in the actual data? In the observed data, the error is estimated by the residual, e_{i}, defined for the i^{th} observation as the difference between the observed value y_{i} and the fitted value Ŷ_{i}:

e_{i} = y_{i} − Ŷ_{i}

The sum and mean of the observed residuals always equal zero. Suppose the mean of the observed residuals were a non-zero value. Then, in the calculation procedure for the coefficients, that nonzero value would immediately be absorbed into the intercept of the fitted line. Similarly, if the residuals showed a nonzero slope against X, that relationship would be absorbed into the slope of the regression line. Eventually, the mean of the observed residuals must be zero, and the line going through the center of the residual distribution must be flat with a slope of zero.

Four basic assumptions of linear regression are linearity, independence, normality, and equality of variance.

Linearity means that the means of the subpopulations of Y all lie on the same straight line. The scatter plot of X and Y should show a linear tendency. The fitted line of SLR reflects the linear trend in the form of a linear equation, and the error part is expressed as points scattered around the fitted line.

If the scatter plot shows a curved trend, the linearity assumption is violated; such a relationship would instead be described by a model with a quadratic term, for example Y = β_{0}+β_{1}X+β_{2}X^{2}+ε = −1,200+120X−2X^{2}+ε.

The independence assumption means that there is no dependency among the observations in the data. In other words, the outcome of an arbitrarily selected subject does not affect any other outcome. In traditional statistics, independence among observations is basically assumed, because a simple random sampling procedure guarantees the independence of observations. When we collect data at only one time point, as in cross-sectional data, we generally do not worry about independence. If the sample was chosen by a random sampling method and the data are cross-sectional, we may simply assume independence and not check it further. However, if the sample was selected using a cluster sampling method and there are clusters of subjects in the data, we need to consider the dependence in the analysis procedure.

Violation of the independence assumption frequently occurs in longitudinal data, which are collected repeatedly from the same subjects over time. The repeated measurements tend to correlate with each other. If subjects were observed in time order, we need to plot the residuals against time order and check for systematic patterns.
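A common numeric companion to the time-order plot (not discussed in the article, so presented as a supplementary sketch) is the Durbin-Watson statistic, DW = Σ(e_{t}−e_{t−1})^{2} / Σe_{t}^{2}. Values near 2 suggest no first-order autocorrelation; values near 0 suggest positive autocorrelation:

```python
import numpy as np

def durbin_watson(e):
    """DW = sum of squared successive differences / sum of squares.
    Near 2: little first-order autocorrelation.
    Near 0: strong positive autocorrelation."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(2)
independent = rng.normal(0, 1, 500)       # residuals with no time dependence

# correlated residuals: each carries over 90% of the previous value
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.9 * correlated[t - 1] + rng.normal(0, 1)

print(durbin_watson(independent))   # close to 2
print(durbin_watson(correlated))    # well below 2
```

For an AR(1)-type dependence with coefficient r, DW is approximately 2(1−r), which is why the correlated series scores far below 2.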

The normality assumption is that the distribution of each subpopulation of Y is conceptually normal, as described above.

We generally use either a histogram of the residuals or a normal quantile-quantile (Q-Q) plot to check the normality of the distribution. A normal percentile-percentile (P-P) plot, a variant of the Q-Q plot, may also be used. The histogram of residuals should appear approximately normal. In a Q-Q plot, the observed points (dots) should lie close to the diagonal line.
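The points of a Q-Q plot can be computed by hand, which makes the idea concrete: sort the standardized residuals and pair each with the normal quantile at the same plotting position. The sketch below is illustrative (the plotting-position formula (i−0.5)/n is one common convention, an assumption here, not the article's specification):

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Pair each sorted standardized residual with the standard
    normal quantile at the same plotting position (i - 0.5)/n,
    as in a normal Q-Q plot."""
    e = np.sort((residuals - residuals.mean()) / residuals.std())
    n = len(e)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, e

rng = np.random.default_rng(3)
theo, sample = qq_points(rng.normal(0, 1, 1000))

# for normal residuals the points hug the diagonal, so the
# correlation between theoretical and sample quantiles is near 1
print(np.corrcoef(theo, sample)[0, 1])
```

Plotting `theo` against `sample` reproduces the Q-Q plot; deviations from the diagonal at the ends signal heavy tails or skewness.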


Conceptually, the error terms of all the subgroups are assumed to have the same distributional shape: the normal distribution with a mean of zero and a constant variance, σ^{2}, as discussed above.

The equal variance assumption means that the degree of spreading or variability of residuals is equal across the subpopulations with different fitted values. We call this feature ‘homoscedasticity,’ which means having the same variance.

In contrast, if the spread of the residuals changes as the fitted values change, for example showing a funnel shape, the equal variance assumption is violated; this condition is called ‘heteroscedasticity.’
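One rough numeric check (a Goldfeld-Quandt-style comparison, offered here as an illustrative sketch rather than the article's method) is to split the residuals at the median fitted value and compare the spread of the two halves; a ratio far from 1 hints at heteroscedasticity:

```python
import numpy as np

def variance_ratio(residuals, fitted):
    """Compare residual variance above vs below the median fitted
    value. A ratio far from 1 suggests unequal variance
    (a rough check, not a formal test)."""
    order = np.argsort(fitted)
    e = np.asarray(residuals, dtype=float)[order]
    half = len(e) // 2
    return e[half:].var() / e[:half].var()

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 400)
equal = rng.normal(0, 1, 400)        # homoscedastic: constant spread
funnel = rng.normal(0, 1, 400) * x   # heteroscedastic: spread grows with x

print(variance_ratio(equal, x))      # near 1
print(variance_ratio(funnel, x))     # much larger than 1
```

The funnel-shaped residuals produce a ratio several times larger than 1, mirroring the widening scatter one would see in the residual plot.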

In summary, we can use residual plots to check three of the basic assumptions of linear regression, as follows:

- Linearity: (standardized) Residuals against X variable.

- Normality: Histogram of (standardized) residuals, Normal Q-Q (or P-P) plot of (standardized) residuals.

- Equal variance: (standardized) Residuals against Ŷ (or X only in SLR) variable.

The residual analysis of SLR using IBM SPSS Statistics for Windows Version 23.0 (IBM Corp., Armonk, NY, USA) is performed during or after the regression procedure described in the previous section.

- D-1: During the regression procedure in SLR.

- D-2: To save the predicted values and residual values (original and standardized). Standardized residual values larger than 2 in absolute value can be used to identify outliers.

- H-1 and H-2: To request a scatter plot of (standardized) residual against X to check the linearity assumption.

^{a}To request the scatter plot (F), the histogram, and the normal P-P plot; ^{b}scatter plot of standardized residuals against predicted values to check the equal variance assumption; ^{c}to save predicted values (PRE_1), residuals (RES_1), and standardized residuals (ZRE_1); ^{d}scatter plot to check the linearity assumption.
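The outlier rule mentioned above (|standardized residual| > 2, saved by SPSS as ZRE_1) can be sketched outside SPSS as well. The example below uses a simple standardization, residual divided by the residual standard deviation, which is an illustrative approximation of SPSS's standardized residuals; the data and the planted outlier are invented:

```python
import numpy as np

# Illustrative data with one planted outlier at index 10
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)
y[10] += 8.0                         # plant a large deviation

b1, b0 = np.polyfit(x, y, deg=1)     # least-squares fit
resid = y - (b0 + b1 * x)
z = resid / resid.std()              # simple standardized residuals

# flag observations whose standardized residual exceeds 2 in absolute value
outliers = np.where(np.abs(z) > 2)[0]
print(outliers)                      # includes the planted index 10
```

A handful of ordinary points may also exceed the cutoff by chance (about 5% of normal residuals fall outside ±2), so flagged cases are candidates for inspection, not automatic exclusions.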