J Korean Med Sci. v.39(3)

Habibzadeh: Data Distribution: Normal or Abnormal?

Abstract

Determining whether the frequency distribution of a given data set follows a normal distribution is among the first steps of data analysis. Visual examination of the data, commonly with a Q-Q plot, although acceptable to many scientists, is considered subjective by other researchers. The one-sample Kolmogorov-Smirnov test with Lilliefors correction (for a sample size ≥ 50) and the Shapiro-Wilk test (for a sample size < 50) are common statistical tests for checking the normality of a data set quantitatively. As parametric tests, which assume that the data distribution is normal (Gaussian, bell-shaped), are statistically more powerful than their non-parametric counterparts, we commonly use transformations (e.g., log-transformation, Box-Cox transformation) to bring the frequency distribution of non-normally distributed data close to a normal distribution. Herein, I wish to present how to work with these statistical methods in practice through the examination of real data sets.

INTRODUCTION

Appropriate data analysis is one of the cornerstones of every research study. An important part of data analysis is taking the distribution of the data into account. The distribution of a data set affects both its reporting and its analysis. For instance, normally distributed data should be reported as the mean and standard deviation (SD) and analyzed with parametric tests (e.g., Student’s t test for independent samples, Pearson’s correlation, linear regression, and one-way analysis of variance). Data that do not follow a normal distribution should be presented as the median and interquartile range (IQR) and analyzed with non-parametric tests (e.g., Mann-Whitney U test, Spearman’s correlation, and Kruskal-Wallis test).1,2,3,4,5 As parametric tests are generally more powerful than their non-parametric counterparts, we prefer to use parametric tests for data analysis whenever possible. Nonetheless, parametric tests assume that the data to be analyzed follow a normal distribution. Violation of this assumption may result in unreliable results and biased estimates, with consequences ranging from trivial to critical.6,7,8,9 Lack of awareness of such assumptions (e.g., the normality of the data distribution) contributes to the inappropriate use of statistical methods and reporting of results in scientific articles.10,11,12 In fact, all researchers are expected to plan and report the assessment of the underlying assumptions in their study protocols and manuscripts.10 Of the many assumptions involved, this paper focuses on the one made by all parametric statistical tests: the normality assumption.
One of the first steps in data analysis, after data cleaning, is to check whether the distribution of the data follows a normal (Gaussian, bell-shaped) distribution and, if it does not, to try to transform the data into a set whose distribution is closer to normal. Herein, I reflect on the common ways to determine whether a data set is normally distributed and describe two frequently used transformations that make the distribution of non-normally distributed data closer to a normal distribution. The discussion is mainly based on the analysis of subsets of real data sets taken from my previous studies. With journal editors and researchers who have limited knowledge of statistics in mind, I will emphasize the pragmatic issues of data analysis throughout the article and avoid statistical details as much as possible.

NORMAL/GAUSSIAN DISTRIBUTION

A normal distribution, also called a Gaussian distribution, has a symmetrical shape with the highest frequency at the center of the distribution. It has certain characteristics that help researchers make predictions based on only the mean and the SD of the data. For example, about 95% of the data values fall within the interval mean ± 2 × SD. As an example, Fig. 1 shows the frequency distribution of hepatitis B surface antigen (HBs Ag) measured in 150 study participants (a subset of data from one of our previous studies).13 Visually, the data have a normal distribution: the data distribution (gray curve) has an “acceptable” overlap with the hypothetical normal distribution having the same mean and SD (0.38 and 0.09, respectively). Another commonly used graphical method for determining whether a distribution is normal is the Q-Q plot, which stands for “quantile-quantile plot.” The ordinate of the graph represents the quantiles of the sample data; the abscissa, the quantiles of a hypothetical data set, should the data follow a normal distribution. As a rule of thumb, if the points are “close enough” to a straight line, the data distribution can be construed as normal (Fig. 1B); otherwise, it cannot (Fig. 2B). These techniques, although easy to apply, are subjective.14 For example, the word “acceptable” is quite ambiguous and unscientific, and one might ask how close the points should be to the line to be considered “close enough.” Such judgments become more reliable with experience, but there are also quantitative methods for checking the normality of a data distribution. One of the most commonly used is the one-sample Kolmogorov-Smirnov (K-S) test, a non-parametric test of the null hypothesis that the data are drawn from a normal distribution.
A significant P value (P < 0.05) implies that we should reject the null hypothesis and conclude that the data distribution is not normal; with a non-significant P value (P ≥ 0.05), the null hypothesis is retained, and the data distribution can be assumed normal. Technically, the one-sample K-S test should only be used when the parameters of the distribution of interest (the mean and SD of the normal distribution) are known in advance; when they are instead estimated from the sample, the test becomes overly conservative and seldom rejects normality. We therefore need to correct the results when we examine a data sample; the Lilliefors correction is what is generally used for this purpose.15 Furthermore, the one-sample K-S test is usually recommended when the sample size is 50 or more. When the sample size is less than 50, another test for normality, the Shapiro-Wilk (S-W) test, is preferred.16
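To make the mechanics of the K-S test concrete, its statistic is simply the largest vertical distance between the empirical cumulative distribution function (CDF) of the sample and the CDF of a normal distribution fitted to the sample. The following is a minimal Python sketch for illustration only; it computes the statistic, not the Lilliefors-corrected P value, which requires dedicated tables or simulation:

```python
import statistics

def ks_statistic(data):
    """Largest distance between the empirical CDF of the sample and the
    CDF of a normal distribution fitted to the sample (mean and SD
    estimated from the data)."""
    x = sorted(data)
    n = len(x)
    fitted = statistics.NormalDist(statistics.mean(x), statistics.stdev(x))
    d = 0.0
    for i, v in enumerate(x):
        cdf = fitted.cdf(v)
        # The empirical CDF jumps at each data point, so compare the
        # fitted CDF with its values just before and just after the jump.
        d = max(d, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return d

# A roughly symmetric, bell-like sample yields a small distance.
print(ks_statistic([4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.0, 4.95, 5.05]))
```

In practice, of course, one would rely on statistical software (as described below) rather than such a hand-rolled computation; the sketch is only meant to show what the test actually measures.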
Fig. 1

Frequency distribution and Q-Q plot. (A) Frequency distribution of HBs Ag measured in 150 study participants taken from a previous study13 (bell-shaped gray curve) along with the fitted normal distribution (having the same mean and the standard deviation). (B) The Q-Q plot of the data implies that the distribution can be assumed to be normal.

HBs Ag = hepatitis B surface antigen.
Fig. 2

Frequency distribution and Q-Q plot. (A) The frequency distribution of the PSA measured in 150 study participants taken from a previous study20 (the highly positively skewed gray curve) along with the fitted normal distribution (having the same mean and the standard deviation). (B) The Q-Q plot of the data also implies that the data does not have a normal distribution.

PSA = prostate-specific antigen.

HOW TO CONSTRUCT THE GRAPHS AND PERFORM THE TESTS

There are numerous ways to examine the normality of the frequency distribution of a data set. The Statistical Package for the Social Sciences (SPSS®) is commonly used software for data analysis. In SPSS®, you can obtain both the Q-Q plot and the results of the one-sample K-S test with Lilliefors correction and the S-W test for a variable named HBS by running the following commands in the Syntax Editor:
EXAMINE VARIABLES=HBS
/PLOT NPPLOT
/MISSING LISTWISE
/NOTOTAL.
Alternatively, use the following path from the menu bar: Analyze → Descriptive Statistics → Explore (under Plots, tick the box for “Normality plots with tests”). The one-sample K-S test with Lilliefors correction can also be run from the following path: Analyze → Nonparametric Tests → One Sample. The result of the one-sample K-S test was not significant (P = 0.200; Fig. 3). Therefore, we could retain the assumption that the HBs Ag level had a normal distribution and predict that 95% of the 150 values (143 participants) would fall between 0.20 and 0.56; in fact, 141 (94%) did. We can therefore also use parametric tests to analyze the data.
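The prediction interval quoted above is just the mean ± 2 × SD rule applied to the sample estimates for the HBs Ag data (mean 0.38, SD 0.09), as a quick check confirms:

```python
mean, sd = 0.38, 0.09  # sample estimates for the HBs Ag data
lower, upper = mean - 2 * sd, mean + 2 * sd
print(f"~95% of values expected in [{lower:.2f}, {upper:.2f}]")  # [0.20, 0.56]
```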
Fig. 3

Output of the one-sample Kolmogorov-Smirnov test with Lilliefors correction and the Shapiro-Wilk test from IBM® SPSS® Statistics ver. 26. Because the sample size was 150, the result of the first test is used.

df = degrees of freedom.
Another common way to perform the one-sample K-S test with Lilliefors correction is the lillie.test function available from the R package nortest.17 The shapiro.test function, available in base R (the stats package), can be used to perform the S-W test. A Q-Q plot can also easily be drawn with geom_qq and stat_qq_line from the R package ggplot2.18
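For readers working in Python rather than R, comparable tests are available in SciPy (assuming SciPy is installed): scipy.stats.shapiro performs the S-W test, and scipy.stats.kstest performs the uncorrected one-sample K-S test; a Lilliefors-corrected K-S test is provided separately by the statsmodels package. A minimal sketch, with an illustrative skewed data set:

```python
import statistics
from scipy import stats

# An illustrative, strongly right-skewed sample (n = 13)
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 1.4, 2.8, 5.5, 11.0, 22.0]

# Shapiro-Wilk test (recommended for n < 50)
w, p_sw = stats.shapiro(data)

# One-sample K-S test against a normal distribution with the sample's
# own mean and SD. NOTE: no Lilliefors correction is applied here, so
# this P value is overly conservative.
mu, sd = statistics.mean(data), statistics.stdev(data)
d, p_ks = stats.kstest(data, "norm", args=(mu, sd))

print(f"S-W: W={w:.3f}, P={p_sw:.4f}; K-S: D={d:.3f}, P={p_ks:.4f}")
```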

LOG-NORMAL DISTRIBUTION

Many researchers mistakenly believe that most biological variables have a normal distribution. In fact, non-normal, skewed distributions are common in biomedicine. For example, the length of the latent period of many infectious diseases has a non-normal, positively skewed distribution. This is because the period cannot be negative, the mean is usually short, and the SD is comparatively large, usually more than half of the mean value.2,5 The frequency distribution of a variable is said to be log-normal if the distribution of the logarithm of the variable follows a normal distribution.19
As an example, Fig. 2 shows the frequency distribution of the prostate-specific antigen (PSA) measured in 150 study participants (a subset of data from one of our previous studies).20 The data distribution (gray curve, Fig. 2A) is highly positively skewed and clearly does not fit the hypothetical normal distribution (red curve, Fig. 2A) having the same mean and SD (1.50 and 1.28 ng/mL, respectively). Nor do the points in the Q-Q plot lie on the straight line (Fig. 2B). The one-sample K-S test with Lilliefors correction also resulted in a significant P value (P < 0.001), consistent with our visual methods. All these results imply that the frequency distribution of the PSA data set is not normal. That could have been predicted from the mean and SD of the data set: the SD (1.28 ng/mL) exceeded half of the mean of 1.50 ng/mL.2,4,5 The logarithm of PSA, on the other hand, has a normal distribution (Fig. 4; one-sample K-S test with Lilliefors correction, P = 0.089).
Fig. 4

The same graphs as those in Fig. 2 after log-transformation of the PSA, i.e., when log(PSA) is used instead of PSA. (A) The frequency distribution (gray curve) is now much closer to a normal distribution, and (B) the points in the Q-Q plot lie close enough to the straight line to retain the assumption that the data distribution is normal.

PSA = prostate-specific antigen.
Given that the logarithm of PSA follows a normal distribution, we may construe that PSA follows a log-normal distribution. We may work with log(PSA) throughout the analysis and use parametric tests, but it is important to bear in mind that the final results should be reported in the original units (not as log-transformed values); we can easily back-transform the log-transformed values by exponentiating (the inverse of the logarithm) the results.
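One practical consequence of this back-transformation is worth noting: exponentiating the mean of the log-transformed values does not recover the arithmetic mean of the original data but their geometric mean, which is the natural summary for log-normally distributed data. A small illustration with made-up values:

```python
import math
import statistics

values = [1.0, 10.0, 100.0]  # illustrative, strongly skewed values

# Analyze on the log scale, then back-transform the mean
log_values = [math.log(v) for v in values]
back_transformed = math.exp(statistics.mean(log_values))

print(back_transformed)         # ~10 (the geometric mean)
print(statistics.mean(values))  # 37.0 (the arithmetic mean)
```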
Log-transformation is a commonly used transformation to make the distribution of positively skewed distributions (e.g., the length of the incubation period of many infectious diseases, serum cholesterol level, and the distribution of minerals in the Earth’s crust) closer to a normal distribution.19 However, it does not always work; other transformations may work better.

BOX-COX TRANSFORMATION

Fig. 5 shows the frequency distribution of immunoglobulin (Ig) G against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) measured in 40 study participants (a subset of data from one of our previous studies).21 The SARS-CoV-2 IgG level had a mean of 0.25 (SD, 0.10). The frequency distribution of the IgG level does not closely fit a normal distribution having the same mean and SD (Fig. 5A). Because the sample size was less than 50, the S-W test was used to test the normality of the distribution; the result was significant (P < 0.001), implying that the distribution was not normal. Log-transformation made the distribution closer to normal, but not enough to render the S-W test non-significant (P = 0.012). The data were therefore transformed using the Box-Cox transformation.
Fig. 5

Frequency distributions before and after transformation. (A) The frequency distribution of SARS-CoV-2 IgG level measured in 40 study participants taken from a previous study21 (gray curve). (B) Frequency distribution of the same data set after a Box-Cox transformation given a λ = −1 (Eq. 2).

SARS-CoV-2 = severe acute respiratory syndrome coronavirus 2, IgG = immunoglobulin G.
The Box-Cox transformation (transforming the value x to the new value y, given a parameter commonly designated by λ) is defined as follows22:
(Eq. 1)
y = (x^λ − 1) / λ, if λ ≠ 0
y = log(x), if λ = 0
The transformation acts differently depending on the value of λ. For certain values of λ, the transformation is equivalent to other well-known transformations (Table 1); for instance, if λ = 0, it reduces to a log-transformation. But which value of λ makes the distribution of the transformed data closest to a normal distribution? To determine the most appropriate value of λ, we may use the boxcox function from the R package EnvStats.23 For our data set, the most appropriate value of λ was −1. The transformation is then:
(Eq. 2)
y = 1 − 1/x
Table 1

Equivalent transformations for certain values of λ in a Box-Cox transformation (Eq. 1)

λ	Equivalent transformation
−2	1/x²
−1	1/x
−0.5	1/√x
0	log(x) (log-transformation)
0.5	√x
1	x
2	x²
where x represents the SARS-CoV-2 IgG level and y the transformed value. Note that the transformation y = 1/x would work as well (Table 1), but for the time being, let us use Eq. 2. The frequency distribution of y (the transformed SARS-CoV-2 IgG level) is close enough to a normal distribution (Fig. 5B), and the S-W test is non-significant (P = 0.193). Note that, here again, the transformed variable y should be used in the statistical analyses, but the final values should be reported after they are back-transformed to the original scale using the following equation (solving Eq. 2 for x):
(Eq. 3)
x = 1 / (1 − y)
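The transformation and its inverse can be written directly from the equations above. A minimal Python sketch of the general Box-Cox transform (Eq. 1) and the λ = −1 back-transformation (Eq. 3), using an illustrative value on the original IgG scale:

```python
import math

def box_cox(x, lam):
    """Box-Cox transformation (Eq. 1); x must be positive."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

def inverse_lambda_minus_1(y):
    """Back-transformation for lambda = -1 (Eq. 3): x = 1 / (1 - y)."""
    return 1 / (1 - y)

x = 0.25  # an illustrative IgG value on the original scale
y = box_cox(x, -1)                # Eq. 2: y = 1 - 1/x
print(y)                          # -3.0
print(inverse_lambda_minus_1(y))  # 0.25, the original value recovered
```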

LIMITATIONS OF DATA TRANSFORMATION

Although the above-mentioned transformations may make the frequency distribution of a data set close to a normal distribution, which is favorable from the statistical point of view, the results obtained may not be meaningful or interpretable without back-transformation.24 Sometimes, finding an appropriate transformation is challenging; at times, despite all efforts, no suitable transformation can be found. Under such circumstances, it is better to use non-parametric statistical tests, although they are less powerful than their parametric counterparts.

CONCLUSION

Examining the frequency distribution of a given data set is important in determining how to report and analyze the data. As parametric statistical tests, which assume a normal distribution of the data, are more powerful than their non-parametric counterparts, transforming a non-normally distributed data set into one whose distribution is close to normal can be very helpful. Graphical methods (e.g., the Q-Q plot) are subjective and may not be reliable or replicable. Statistical tests (e.g., the one-sample K-S and S-W tests) also have their own limitations. For example, in large samples, the tests may indicate that the data distribution is not normal even when the departure from normality is trivial and inconsequential; conversely, in small samples, serious departures from normality may go undetected.14 A combination of visual assessment and statistical testing may thus be necessary to reach a reasonable conclusion. Correct interpretation of the results and, when necessary, choosing an appropriate transformation are skills that come with experience.

Notes

Disclosure: The author has no potential conflicts of interest to disclose.

References

1. Lang TA, Altman DG. Basic statistical reporting for articles published in biomedical journals: the SAMPL Guidelines. Smart P, Maisonneuve H, Polderman A, editors. Science Editors’ Handbook. Exeter, UK: European Association of Science Editors;2013.
2. Habibzadeh F. Statistical data editing in scientific articles. J Korean Med Sci. 2017; 32(7):1072–1076. PMID: 28581261.
3. Misra DP, Zimba O, Gasparyan AY. Statistical data presentation: a primer for rheumatology researchers. Rheumatol Int. 2021; 41(1):43–55. PMID: 33201265.
4. Habibzadeh F. Common statistical mistakes in manuscripts submitted to biomedical journals. Eur Sci Ed. 2013; 39(4):92–94.
5. Habibzadeh F. How to report the results of public health research. J Public Health Emerg. 2017; 1:90.
6. Altman DG, Bland JM. Statistics notes: the normal distribution. BMJ. 1995; 310(6975):298. PMID: 7866172.
7. Shatz I. Assumption-checking rather than (just) testing: the importance of visualization and effect size in statistical diagnostics. Behav Res Methods. Forthcoming. 2023; DOI: 10.3758/s13428-023-02072-x.
8. Barker LE, Shaw KM. Best (but oft-forgotten) practices: checking assumptions concerning regression residuals. Am J Clin Nutr. 2015; 102(3):533–539. PMID: 26201816.
9. Casson RJ, Farmer LD. Understanding and checking the assumptions of linear regression: a primer for medical researchers. Clin Exp Ophthalmol. 2014; 42(6):590–596. PMID: 24801277.
10. Nielsen EE, Nørskov AK, Lange T, Thabane L, Wetterslev J, Beyersmann J, et al. Assessing assumptions for statistical analyses in randomised clinical trials. BMJ Evid Based Med. 2019; 24(5):185–189.
11. Hu Y, Plonsky L. Statistical assumptions in L2 research: a systematic review. Second Lang Res. 2019; 37(1):171–184.
12. Hoekstra R, Kiers HA, Johnson A. Are assumptions of well-known statistical techniques checked, and why (not)? Front Psychol. 2012; 3:137. PMID: 22593746.
13. Habibzadeh F, Roozbehi H. No need for a gold-standard test: on the mining of diagnostic test performance indices merely based on the distribution of the test value. BMC Med Res Methodol. 2023; 23(1):30. PMID: 36717791.
14. Sawilowsky SS. Misconceptions leading to choosing the t test over the Wilcoxon Mann-Whitney test for shift in location parameter. J Mod Appl Stat Methods. 2005; 4(2):598–600.
15. Lilliefors HW. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J Am Stat Assoc. 1967; 62(318):399–402.
16. Mishra P, Pandey CM, Singh U, Gupta A, Sahu C, Keshri A. Descriptive statistics and normality tests for statistical data. Ann Card Anaesth. 2019; 22(1):67–72. PMID: 30648682.
17. Gross J, Ligges U. nortest: Tests for Normality. R package version 1.0-4. The Comprehensive R Archive Network. 2015.
18. Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York, NY, USA: Springer-Verlag;2016.
19. Limpert E, Stahel WA, Abbt M. Log-normal distributions across the sciences: keys and clues. Bioscience. 2001; 51(5):341–352.
20. Habibzadeh F, Habibzadeh P, Yadollahie M, Roozbehi H. On the information hidden in a classifier distribution. Sci Rep. 2021; 11(1):917. PMID: 33441644.
21. Habibzadeh F, Habibzadeh P, Yadollahie M, Sajadi MM. Determining the SARS-CoV-2 serological immunoassay test performance indices based on the test results frequency distribution. Biochem Med (Zagreb). 2022; 32(2):020705. PMID: 35799990.
22. Box GE, Cox DR. An analysis of transformations. J R Stat Soc B. 1964; 26(2):211–252.
23. Millard SP. EnvStats: An R Package for Environmental Statistics. New York, NY, USA: Springer;2013.
24. Lee DK. Data transformation: a focus on the interpretation. Korean J Anesthesiol. 2020; 73(6):503–508. PMID: 33271009.