Sample size calculation errors and ethics
Appropriate calculation of the sample size is essential for the cost-effective, effort-effective, and ethical conduct of a study, because it ensures a sufficient opportunity to observe the expected effects. Sample size calculation is directly related to research ethics. Enrolling more participants than necessary, without proper calculation, can expose subjects to unjustified risks. Too small a sample size is also unethical in the sense that the statistical power of the study is decreased, thus limiting the scientific value of the research. Consequently, patients can be harmed by incorrect clinical decision-making based on incorrect study results. Statistical power is determined by the sample size, the magnitude of the type I error (α), and the effect size, and these values are interrelated. As the significance level (type I error) increases, that is, as reliability worsens, the power increases. As the standard deviation increases, the power decreases. A smaller difference between two populations decreases the power, while a larger sample size increases it. Of these, the effect size is the most critical factor with regard to statistical power [11]. It is recommended to use a "clinically meaningful empirical effect size" as an effect size. For example, the correlation between two variables, the regression coefficient in a regression analysis, the difference in mean values, or a risk measure such as the ratio of heart attack survivors to heart attack casualties can be used as an effect size. If no information about the effect size is available, an effect size reported in previous observational studies may be used [11]. Establishing a clinically meaningless or implausibly large effect size merely to reduce the required sample size is obviously erroneous and unethical.
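The interplay among power, α, and effect size can be made concrete with the standard normal-approximation formula for a two-sided, two-sample comparison. The following is a minimal sketch; the function name and the numbers are illustrative, not taken from the cited work:

```python
from math import ceil
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided, two-sample comparison,
    normal approximation: n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# A medium standardized effect (Cohen's d = 0.5) needs ~63 subjects per group;
# halving the effect size roughly quadruples the required sample size.
print(n_per_group(0.5))   # 63
print(n_per_group(0.25))  # 252
```

Note how sensitive the required sample size is to the assumed effect size, which is why an implausibly large assumed effect is such a consequential error.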
The sample size should be calculated using the primary endpoint. When there are multiple primary endpoints, the type I error inflated by multiple testing should be adjusted when estimating the sample size, as the number of hypotheses to be tested increases with the number of endpoints. The Bonferroni and Šidák corrections are commonly used. If a study has an important secondary endpoint, the sample size should also be sufficiently large for the analysis of that variable; ideally, the sample size may be calculated for each endpoint considered important. Multiple comparison that is not appropriately corrected has been reported as one of the most commonly discovered statistical errors (multiple comparison error) [6]. Below is one example of a multiple comparison error. Assume that an experiment is planned to compare the effects of two drugs regulating blood sugar levels. The researcher measured the blood sugar level in three patient groups: a control group, a drug A group, and a drug B group. In such a case, three t-tests are usually performed: control versus drug A, control versus drug B, and drug A versus drug B. However, special attention is necessary with this type of analysis, as it is a typical source of false positives through multiple comparison error. Because three hypotheses were tested at a significance level of 5% in one experiment, the significance level needs to be corrected. Given that no correction was performed in this case, the overall (familywise) error rate was inflated to nearly 15% (approximately 5% × 3). If the Bonferroni correction is applied, a significance level of 5% / 3 ≈ 1.7% should be used for each test. Increasing the statistical power after an experiment has been performed is very difficult. Therefore, it is important to consult a statistical expert to confirm that the study plan has sufficient power before starting to collect data. Another problem is that researchers tend to use a smaller sample size than planned, for reasons such as upcoming article publications and/or conference presentations.
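The inflation of the familywise error rate in the three-comparison example, and the corresponding Bonferroni and Šidák per-test levels, can be verified with a few lines of arithmetic:

```python
alpha = 0.05
k = 3  # control vs. A, control vs. B, A vs. B

familywise = 1 - (1 - alpha) ** k         # error rate if no correction is applied
alpha_bonferroni = alpha / k              # Bonferroni per-test significance level
alpha_sidak = 1 - (1 - alpha) ** (1 / k)  # Šidák per-test significance level

print(round(familywise, 3))        # 0.143, close to the 15% (5% x 3) upper bound
print(round(alpha_bonferroni, 3))  # 0.017
print(round(alpha_sidak, 3))       # 0.017
```

The Bonferroni bound (5% × 3 = 15%) slightly overstates the exact familywise rate of about 14.3%, which is why the two corrections give nearly, but not exactly, the same per-test level.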
Errors associated with study methods and the applied analytical method
A study article should be described in sufficient detail to ensure verifiability and reproducibility, so that readers and reviewers may easily understand it. In particular, when a complicated statistical method, rather than a common one (e.g., a t-test), is applied, an additional explanation should be given along with references providing information about the application of the method. Readers will understand more easily if a brief explanation is provided of why a specific statistical method, rather than a general one, was applied.
It is important to apply an analytical method appropriate for the type of data constituting the measured variables. Generally, there are three types of research data: discrete, ordinal, and continuous. Discrete-type data represent a quality or group, not a quantity, and are also classified as nominal or qualitative data. This type of data represents, for example, gender (male and female), the anesthetic method (general, regional, and local anesthesia), or the location (Seoul, Busan, and Daejeon), and is often used for grouping. The second data type is ordinal, comprising data that represent an order or rank. Examples include test scores expressed as ranks (first, second, and third), height ranks, and weight ranks. Raw data can also represent ranks; for example, grades (e.g., A, B, and C) are ordinal-type data. An error may be made by treating ordinal-type data as continuous-type data, or by confusing the figures representing the orders with quantities. Figures representing ranks may not undergo arithmetic manipulation. Finally, there are continuous-type data, which have a quantitative meaning; test scores, weights, and heights are included in this type. Continuous-type data are the data type most suitable for statistical analysis, but in some studies they are dichotomized (i.e., divided into two or more separate domains) to simplify the analysis. For example, in a study related to obesity, the weights of patients are measured, but instead of being used as continuous-type data, they may be divided into two groups labeled 'normal weight' and 'overweight.' Such a conversion of continuous-type data into dichotomized data may enable a comparison of two groups with simple statistics such as a t-test instead of a complex regression analysis.
However, the problem in such a case is that the measurement precision of the original data is decreased, as is the variability of the data, resulting in a loss of information and a reduction of the statistical power of the study. Moreover, most researchers do not apply common boundaries or cut points when dividing the data. Therefore, to dichotomize continuous-type data, a researcher should explain why the data need to be dichotomized despite the sacrifice of precision, as well as how the cut points were established.
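The loss of power caused by dichotomization can be illustrated with a small simulation. All values below (the weights, the 72 kg cut point, and the group sizes) are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sim, alpha = 50, 2000, 0.05
rej_continuous = rej_dichotomized = 0

for _ in range(n_sim):
    a = rng.normal(70.0, 10.0, n)  # hypothetical body weights, group 1
    b = rng.normal(75.0, 10.0, n)  # group 2, true mean difference of 5 kg
    # t-test on the original continuous measurements
    rej_continuous += stats.ttest_ind(a, b).pvalue < alpha
    # the same data dichotomized at an arbitrary 72 kg cut point
    table = [[(a >= 72).sum(), (a < 72).sum()],
             [(b >= 72).sum(), (b < 72).sum()]]
    _, p, _, _ = stats.chi2_contingency(table)
    rej_dichotomized += p < alpha

# Dichotomizing discards information, so fewer true differences are detected
print(rej_continuous / n_sim, rej_dichotomized / n_sim)
```

Under these assumptions, the t-test on the continuous data detects the true difference noticeably more often than the test on the dichotomized data, illustrating the loss of power described above.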
It is ironic that one of the causes of errors made by researchers originates from statistical software programs, which are meant to help with statistical analyses. Errors from statistical software typically occur when researchers use the software without consulting a statistics expert or without sufficient statistical knowledge. Some researchers use convenient methods to analyze data and calculate P values without sufficiently considering the data characteristics or the underlying statistical assumptions. Once a significant P value is obtained, they believe that their results are valid. However, it is important to bear in mind that statistical software always produces a P value regardless of the sample size, the data type and scale, or the statistical method used. The various analytical methods in statistics are based on fundamental statistical assumptions; if an analysis is performed without satisfying these assumptions, incorrect conclusions may be drawn from erroneous analytical results.
One common error is that a nonparametric method is not applied in cases where the data are severely skewed and do not follow a normal distribution. When analyzing continuous-type data, a normality test should be performed on the data to be analyzed, and the method and result of the test should be described. The t-test, which is generally used with continuous-type data, is a parametric method that can be applied only when normality, equal variance, and independence are tested and satisfied. If these statistical assumptions are satisfied, the author may state the following:
"The data were approximately normally distributed and thus did not violate the assumptions of the t-test."
If the data did not satisfy these assumptions and were analyzed by a nonparametric method, the author may state the following:
"The number of subjects was small, and the normality assumption was rejected by the normality test. Therefore, a nonparametric analysis was performed using the Wilcoxon rank-sum test."
For appropriate understanding and application, parametric and nonparametric tests were discussed in detail in two articles of the KJA Statistical Round [12,13]. Another common error with t-tests is that Student's t-test, rather than a paired t-test, is performed to analyze paired (dependent) samples.
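A defensive workflow, checking normality first, choosing between the parametric and nonparametric test accordingly, and using the paired t-test for paired data, might look like the following sketch (all data are simulated and hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(50, 5, 100)    # roughly normal data
group2 = rng.exponential(10, 100)  # clearly skewed data

# Test normality in each group before choosing the comparison test
if all(stats.shapiro(g).pvalue > 0.05 for g in (group1, group2)):
    result = stats.ttest_ind(group1, group2)     # parametric t-test
else:
    result = stats.mannwhitneyu(group1, group2)  # Wilcoxon rank-sum test
print(result.pvalue)

# For paired (e.g., pre/post) measurements, use the paired t-test,
# not Student's t-test for independent samples
before = rng.normal(120, 10, 25)
after = before - rng.normal(5, 3, 25)
paired = stats.ttest_rel(before, after)
print(paired.pvalue)
```

Reporting both the normality test and the resulting choice of method, as in the example statements above, makes the analysis verifiable for readers and reviewers.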
In an analysis of categorical-type data, Fisher's exact test or an asymptotic method with appropriate adjustments should be used if the event is rare and the sample size is small. A standard chi-squared test or a difference-in-proportions test may be performed, provided that the numbers of samples and events are sufficiently large. Data for which both rows and columns are dichotomous, an extreme type of discrete data, follow a complex distribution consisting of a product of two conditional (binomial) probabilities, which approximately follows a chi-square distribution if the sample is sufficiently large. Because such data are fundamentally discrete, continuity correction is necessary in the approximation to the continuous chi-square distribution. Although this remains controversial among statisticians, an approach with a direct probability calculation, such as Fisher's exact test, is more feasible when the results of Pearson's chi-square test and Yates' correction differ. In addition, if the expected frequency of at least one of the four cells is less than 5, Fisher's exact test should be used.
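This decision rule can be sketched as follows; the 2 × 2 counts are hypothetical, chosen so that at least one expected frequency falls below 5:

```python
from scipy import stats

# Hypothetical 2x2 table: rows = treatment A / B, columns = event / no event
table = [[1, 9],
         [6, 4]]

chi2, p, dof, expected = stats.chi2_contingency(table)  # Yates' correction applied for 2x2
if (expected < 5).any():
    # Expected frequency below 5 in at least one cell: use Fisher's exact test
    odds_ratio, p = stats.fisher_exact(table)
print(round(p, 4))
```

Note that the rule is stated in terms of the expected frequencies under the null hypothesis (returned by the chi-square routine), not the observed cell counts.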
A correlation analysis is a method of analyzing the linear relationship between two variables, and the calculated correlation coefficient measures the degree of linearity between them. If the relationship between two variables is curved rather than linear, the correlation coefficient may be very small. Conversely, when some observations lie far from the rest of the data, the correlation coefficient may be spuriously large. Neither case represents a proper analysis. Hence, it is necessary to visually examine the data distribution with a scatter plot before performing a correlation analysis. A correlation coefficient merely represents the degree of correlation between two variables; it does not explain a causal relationship. 'Correlation' does not necessarily mean that the two variables are in a cause-and-effect relationship; rather, it is simply one of the conditions of a cause-and-effect relationship. Nevertheless, researchers often make the "post hoc, ergo propter hoc" mistake, in which a temporal relationship between two independent variables is taken as a causal relationship, leading to the erroneous conclusion that "B occurred after A; therefore, B occurred due to A." For example, suppose a researcher observed the yearly trends of Coke sales and of drowning casualties and found a strikingly high correlation between the two variables. Can the researcher conclude that "the number of drowning casualties increased because of Coke"? Before believing a research result, researchers should first check whether the result accords with common sense. In this example, the real cause of the increase in both variables lies not between the two variables but in a third cause, the summer season. Generally, when there is a correlation between A and B, a few more interpretations are possible besides the third cause mentioned above.
For example, "B may be the cause of A" (reverse causation), "A is the cause of B and B is also the cause of A at the same time" (interaction), or "they occurred at the same time coincidentally, without any causal relationship." A correlation itself does not imply a causal relationship; it is simply one of the necessary conditions of a causal relationship. A correlation analysis is therefore better used as a method of generating a hypothesis rather than testing one, and should be taken as a proposal for a follow-up study to identify a causal relationship. Such a causal relationship between variables should be tested through a well-planned experiment, that is, a randomized controlled trial. Equivalence of the experimental groups is required to prove the existence of a causal relationship statistically: subjects are randomly sampled and then allocated to a study group and a placebo or control group, making the two groups as homogeneous as possible. If the effect of the treatment is greater than the effect of the placebo (greater than a predetermined effect size), it may be concluded that the treatment has a causal effect.
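Two of the pitfalls above, a curved relationship and an influential outlier, can be demonstrated directly with simulated, hypothetical data:

```python
import numpy as np
from scipy import stats

# A perfect but curved relationship: Pearson's r is near zero
x = np.linspace(-3, 3, 101)
y = x ** 2
r_curved, _ = stats.pearsonr(x, y)
print(round(r_curved, 3))   # ~0 despite a perfect deterministic relationship

# A single outlier can create a spuriously large correlation
rng = np.random.default_rng(2)
a = rng.normal(0, 1, 50)
b = rng.normal(0, 1, 50)    # generated independently of a
a[-1], b[-1] = 10.0, 10.0   # one extreme observation
r_outlier, _ = stats.pearsonr(a, b)
print(round(r_outlier, 2))  # large r driven by a single point
```

A scatter plot would immediately reveal both problems, which is why visual inspection should precede any correlation analysis.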
Regression analysis is an analytical method used to derive a mathematical relationship expressing the association between an independent variable and a dependent variable. Regression analysis may explain the relationship between two variables and make statistical predictions through an established model. While correlation analysis identifies the association between two variables, regression analysis can also quantify the contributions of multiple independent variables to a single dependent variable (multiple regression analysis). One error commonly found in medical research papers is that regression analysis is used without clearly showing that the necessary statistical assumptions hold. Simple linear regression analysis, a typical form of regression analysis, requires the basic assumptions that the dependent variable and the independent variable have a linear relationship and that the error terms are mutually independent, have a mean value of 0 (zero), have equal variance, and are normally distributed. In multiple regression, the absence of multicollinearity among the independent variables should also be assumed. The linear relationship between two variables may be visually assessed with a scatter plot, and violations of the basic assumptions of a linear regression equation may be detected from a residual scatter plot.
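As a minimal sketch of checking these assumptions in practice (simulated, hypothetical data; the residual-versus-fitted plot itself is left to a plotting library):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 80)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 80)  # hypothetical linear data

# Ordinary least squares fit: y = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

print(round(residuals.mean(), 6))       # ~0 for an OLS fit with an intercept
print(stats.shapiro(residuals).pvalue)  # normality test on the residuals
# Plotting the residuals against the fitted values (e.g., with matplotlib)
# reveals nonlinearity or unequal variance that summary statistics can miss.
```

Reporting such checks, even briefly, documents that the model's assumptions were examined rather than taken for granted.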
Satisfaction of the statistical assumptions is a prerequisite of a statistical analysis. Analyzing data without satisfying these assumptions can raise questions about the reliability of the results and severely damage the reproducibility of the research. Repeated-measures analysis of variance, which is often used in articles submitted to the KJA, requires various statistical assumptions to be satisfied before the analysis, as in the regression analysis mentioned above, but most articles omit an explanation of the necessary assumptions and simply provide only the analytical results [14,15]. Finally, detailed information about the computer software used for the statistical analysis should be provided in the article. Providing the raw data will help readers or reviewers who want to reproduce the results. It should be noted that the same statistical model may give different results depending on the statistical software program used.