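Statistical tests commonly assume that the data are normally distributed, that the groups being compared have similar variance, and that the observations are independent.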
If your data does not meet these assumptions, you might still be able to use a nonparametric statistical test, which has fewer requirements but also makes weaker inferences.
A test statistic is a number calculated by a statistical test. It describes how far your observed data is from the null hypothesis of no relationship between variables or no difference among sample groups.
The test statistic tells you how different two or more groups are from the overall population mean, or how different a linear slope is from the slope predicted by a null hypothesis. Different test statistics are used in different statistical tests.
Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test. Significance is usually denoted by a p-value, or probability value.
Statistical significance is arbitrary: it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that data at least as extreme as those observed would be expected to occur less than 5% of the time under the null hypothesis.
When the p-value falls below the chosen alpha value, then we say the result of the test is statistically significant.
Your choice of t-test depends on whether you are studying one group or two groups, and whether you care about the direction of the difference in group means.
If you are studying one group, use a paired t-test to compare the group mean over time or after an intervention, or use a one-sample t-test to compare the group mean to a standard value. If you are studying two groups, use a two-sample t-test.
If you want to know only whether a difference exists, use a two-tailed test. If you want to know if one group mean is greater or less than the other, use a left-tailed or right-tailed one-tailed test.
A t-test calculates a number (the t-value) illustrating the magnitude of the difference between the two group means being compared, and estimates the likelihood that this difference arose purely by chance (the p-value).
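As a rough illustration of how a t-value is formed, the sketch below computes a two-sample t-value with a pooled standard error. It assumes equal variances in both groups, the function name and the sample values are hypothetical, and it stops short of the p-value, which would require the t-distribution.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group_a, group_b):
    """Two-sample t-value, assuming both groups share the same variance."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: the two sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    # Difference in group means, expressed in standard-error units
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical measurements for two groups
t_value = pooled_t_statistic([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(round(t_value, 2))
```

The further the t-value is from zero, the larger the difference between the group means relative to the variability within the groups.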
A one-sample t-test is used to compare a single population to a standard value (for example, to determine whether the average lifespan of a specific town is different from the country average).
A paired t-test is used to compare a single population before and after some experimental intervention or at two different points in time (for example, measuring student performance on a test before and after being taught the material).
A t-test should not be used to measure differences among more than two groups, because the error structure for a t-test will underestimate the actual error when many groups are being compared.
A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).
A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.
Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative.
For example, the relationship between temperature and the expansion of mercury in a thermometer can be modeled using a straight line: as temperature increases, the mercury expands. This linear relationship is so certain that we can use mercury thermometers to measure temperature.
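Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by measuring the distance of each observed y-value from the predicted y-value, squaring each of these distances, and taking the mean of the squared distances.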
Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.
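The idea of choosing the line with the smallest MSE can be sketched in a few lines of Python. This is a minimal illustration for simple linear regression using the standard least-squares formulas; the function names and the data are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for one independent variable."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return slope, y_bar - slope * x_bar

def mse(x, y, slope, intercept):
    """Mean of the squared distances between observed and predicted y-values."""
    return mean([(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)])

# Hypothetical observations
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
best_mse = mse(x, y, slope, intercept)
# Any other line, such as one with a steeper slope, gives a larger MSE
worse_mse = mse(x, y, slope + 0.5, intercept)
print(slope, intercept, best_mse, worse_mse)
```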
Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.
The only difference between one-way and two-way ANOVA is the number of independent variables. A one-way ANOVA has one independent variable, while a two-way ANOVA has two.
All ANOVAs are designed to test for differences among three or more groups. If you are only testing for a difference between two groups, use a t-test instead.
In ANOVA, the null hypothesis is that there is no difference among group means. If any group differs significantly from the overall group mean, then the ANOVA will report a statistically significant result.
Significant differences among group means are calculated using the F statistic, which is the ratio of the mean sum of squares (the variance explained by the independent variable) to the mean square error (the variance left over).
If the F statistic is higher than the critical value (the value of F that corresponds with your alpha value, usually 0.05), then the difference among groups is deemed statistically significant.
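To make the ratio concrete, here is a minimal sketch of the F statistic for a one-way ANOVA. The groups are made-up numbers and the function name is an assumption. The critical value itself comes from the F distribution and is not computed here.

```python
from statistics import mean

def f_statistic(groups):
    """F = mean square between groups / mean square error (within groups)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation explained by group membership
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation left over inside each group
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Three hypothetical groups of three observations each
f_value = f_statistic([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_value, 2))
```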
Some examples of factorial ANOVAs include:
The Akaike information criterion is a mathematical test used to evaluate how well a model fits the data it is meant to describe. It penalizes models which use more independent variables (parameters) as a way to avoid over-fitting.
AIC is most often used to compare the relative goodness-of-fit among different models under consideration and to then choose the model that best fits the data.
The Akaike information criterion is calculated from the maximum log-likelihood of the model and the number of parameters (K) used to reach that likelihood. The AIC function is 2K – 2(log-likelihood).
Lower AIC values indicate a better-fit model. As a rule of thumb, a model whose AIC is more than 2 units lower than that of the model it is being compared to (a delta-AIC greater than 2) is considered significantly better.
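The AIC formula is simple enough to check by hand. The sketch below applies 2K – 2(log-likelihood) to two hypothetical models; the log-likelihoods and parameter counts are invented for illustration only.

```python
def aic(log_likelihood, k):
    """Akaike information criterion: 2K - 2 * (maximum log-likelihood)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical fitted models: (maximum log-likelihood, number of parameters K)
models = {"simple": (-120.0, 2), "complex": (-118.5, 5)}

scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
best = min(scores, key=scores.get)            # the lowest AIC wins
delta_aic = max(scores.values()) - scores[best]
print(scores, best, delta_aic)
```

Here the more complex model fits slightly better, but not by enough to offset the penalty for its three extra parameters.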
In statistics, a model is the collection of one or more independent variables and their predicted interactions that researchers use to try to explain variation in their dependent variable.
In statistics, model selection is a process researchers use to compare the relative value of different statistical models and determine which one is the best fit for the observed data.
The Akaike information criterion is one of the most common methods of model selection. AIC weights the ability of the model to predict the observed data against the number of parameters the model requires to reach that level of precision.
AIC model selection can help researchers find a model that explains the observed variation in their data while avoiding overfitting.
The 3 main types of descriptive statistics concern the frequency distribution, central tendency, and variability of a dataset.
The test statistic will change based on the number of observations in your data, how variable your observations are, and how strong the underlying patterns in the data are.
For example, if one data set has higher variability while another has lower variability, the first data set will produce a test statistic closer to the null hypothesis, even if the true correlation between two variables is the same in either data set.
The test statistic you use will be determined by the statistical test.
You can choose the right statistical test by looking at what type of data you have collected and what type of relationship you want to test.
P-values are calculated from the null distribution of the test statistic. They tell you how often a test statistic is expected to occur under the null hypothesis of the statistical test, based on where it falls in the null distribution.
If the test statistic is far from the mean of the null distribution, then the p-value will be small, showing that the test statistic is not likely to have occurred under the null hypothesis.
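As a small illustration, the sketch below turns a test statistic into a two-sided p-value. It assumes the null distribution is the standard normal distribution (a z-test); other tests use other null distributions. The function name is hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a statistic at least as far from 0 as z, under a standard normal null."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_near = two_sided_p_value(0.5)   # close to the mean of the null distribution
p_far = two_sided_p_value(2.5)    # far from the mean of the null distribution
print(round(p_near, 4), round(p_far, 4))
```

The statistic that lies further from the mean of the null distribution produces the smaller p-value.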
The alpha value, or the threshold for statistical significance, is arbitrary – which value you use depends on your field of study.
In most cases, researchers use an alpha of 0.05, which means that a result is called significant only if data at least this extreme would occur less than 5% of the time under the null hypothesis.
If the p-value is below your threshold of significance (typically p < 0.05), then you can reject the null hypothesis, but this does not necessarily mean that your alternative hypothesis is true.
Levels of measurement tell you how precisely variables are recorded. There are 4 levels of measurement, which can be ranked from low to high: nominal, ordinal, interval, and ratio.
Some variables have fixed levels. For example, gender and ethnicity are always nominal level data because they cannot be ranked.
However, for other variables, you can choose the level of measurement. For example, income is a variable that can be recorded on an ordinal or a ratio scale:
If you have a choice, the ratio level is always preferable because you can analyze data in more ways. The higher the level of measurement, the more precise your data is.
Measures of central tendency help you find the middle, or the average, of a data set.
The 3 most common measures of central tendency are the mean, median and mode.
For data from skewed distributions, the median is better than the mean because it isn’t influenced by extremely large values.
Nominal data is data that can be labelled or classified into mutually exclusive categories within a variable. These categories cannot be ordered in a meaningful way.
For example, for the nominal variable of preferred mode of transportation, you may have the categories of car, bus, train, tram or bicycle.
The confidence level is the percentage of times you expect to get close to the same estimate if you run your experiment again or resample the population in the same way.
The confidence interval consists of the upper and lower bounds of the estimate you expect to find at a given level of confidence.
For example, if you are estimating a 95% confidence interval around the mean proportion of female babies born every year based on a random sample of babies, you might find an upper bound of 0.56 and a lower bound of 0.48. These are the upper and lower bounds of the confidence interval. The confidence level is 95%.
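To calculate the confidence interval, you need to know the point estimate you are constructing the interval for, the critical value of the test statistic, the standard deviation of the sample, and the sample size.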
Then you can plug these components into the confidence interval formula that corresponds to your data. The formula depends on the type of estimate (e.g. a mean or a proportion) and on the distribution of your data.
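For instance, a confidence interval around a mean could be sketched as follows. This version assumes the normal (z) approximation is acceptable; for small samples you would typically use a critical value from the t-distribution instead. The function name and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Lower and upper bounds around the sample mean, using a z critical value."""
    # Critical value that leaves (1 - confidence) / 2 in each tail
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * stdev(sample) / math.sqrt(len(sample))
    centre = mean(sample)
    return centre - margin, centre + margin

# Hypothetical measurements
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]
lower, upper = mean_confidence_interval(sample)
print(round(lower, 3), round(upper, 3))
```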
Z-scores and t-scores are used in statistical tests to show how far from the mean of the predicted distribution your statistical estimate is. If your test produces a z-score of 2.5, this means that your estimate is 2.5 standard deviations from the predicted mean.
The predicted mean and distribution of your estimate are generated by the null hypothesis of the statistical test you are using. The more standard deviations away from the predicted mean your estimate is, the less likely it is that the estimate could have occurred under the null hypothesis.
Any normal distribution can be converted into the standard normal distribution by turning the individual values into z-scores. In a z-distribution, z-scores tell you how many standard deviations away from the mean each value lies.
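Converting values into z-scores only takes the mean and the standard deviation. A minimal sketch, using made-up values and the population standard deviation:

```python
from statistics import mean, pstdev

# Hypothetical observations
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(values)        # mean of the values
sigma = pstdev(values)   # population standard deviation

# Each z-score is the distance from the mean in standard deviations
z_scores = [(v - mu) / sigma for v in values]
print(z_scores)
```

In this example the largest value, 9, lies 2 standard deviations above the mean.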
If your confidence interval for a difference between groups includes zero, that means that if you run your experiment again you have a good chance of finding no difference between groups.
If your confidence interval for a correlation or regression includes zero, that means that if you run your experiment again there is a good chance of finding no correlation in your data.
In both of these cases, you will also find a high p-value when you run your statistical test, meaning that your results could have occurred under the null hypothesis of no relationship between variables or no difference between groups.
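Ordinal data has two characteristics: the data can be classified into different categories within a variable, and the categories have a natural ranked order.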
However, unlike with interval data, the distances between the categories are uneven or unknown.
The t-distribution is a way of describing a set of observations where most observations fall close to the mean, and the rest of the observations make up the tails on either side. It resembles the normal distribution but has heavier tails, and it is used for smaller sample sizes, where the variance in the data is unknown.
The t-distribution forms a bell curve when plotted on a graph. Its exact shape is determined by the degrees of freedom, which depend on the sample size.
A t-score (a.k.a. a t-value) is equivalent to the number of standard deviations away from the mean of the t-distribution.
Because the t-distribution places more probability in the tails than the standard normal distribution does, it is more conservative: to reach the same level of confidence or statistical significance, you will need to include a wider range of the data.
A critical value is the value of the test statistic which defines the upper and lower bounds of a confidence interval, or which defines the threshold of statistical significance in a statistical test. It describes how far from the mean of the distribution you have to go to cover a certain amount of the total variation in the data (i.e. 90%, 95%, 99%).
If you are constructing a 95% confidence interval and are using a threshold of statistical significance of p = 0.05, then your critical value will be identical in both cases.
For example, temperature in Celsius or Fahrenheit is at an interval scale because zero is not the lowest possible temperature. In the Kelvin scale, a ratio scale, zero represents a total lack of thermal energy.
Variability tells you how far apart points lie from each other and from the center of a distribution or a data set.
Variability is also referred to as spread, scatter or dispersion.
No. Because the range formula subtracts the lowest number from the highest number, the range is always zero or a positive number.
In normal distributions, a high standard deviation means that values are generally far from the mean, while a low standard deviation indicates that values are clustered close to the mean.
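The empirical rule, or the 68-95-99.7 rule, tells you where most of the values lie in a normal distribution: around 68% of values are within 1 standard deviation of the mean, around 95% are within 2 standard deviations, and around 99.7% are within 3 standard deviations.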
The empirical rule is a quick way to get an overview of your data and check for any outliers or extreme values that don’t follow this pattern.
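The percentages in the rule can be reproduced from the standard normal distribution. A short sketch:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Share of values within k standard deviations of the mean
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.1%}")
```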
Although the units of variance are harder to intuitively understand, variance is important in statistical tests.
Homoscedasticity, or homogeneity of variances, is an assumption of equal or similar variances in different groups being compared.
This is an important assumption of parametric statistical tests because they are sensitive to any dissimilarities. Uneven variances in samples result in biased and skewed test results.
The two most common methods for calculating interquartile range are the exclusive and inclusive methods.
The exclusive method excludes the median when identifying Q1 and Q3, while the inclusive method includes the median as a value in the data set in identifying the quartiles.
For each of these methods, you’ll need different procedures for finding the median, Q1 and Q3 depending on whether your sample size is even- or odd-numbered. The exclusive method works best for even-numbered sample sizes, while the inclusive method is often used with odd-numbered sample sizes.
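The difference between the two methods only shows up for odd-numbered sample sizes, as the sketch below illustrates. It splits the sorted data at the median and takes the median of each half; the function name and the data are hypothetical, and statistical software often uses other interpolation rules.

```python
from statistics import median

def quartiles(data, method="exclusive"):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 0 or method == "exclusive":
        # Even n: the halves never contain the median. Odd n: drop the median.
        lower, upper = values[:half], values[n - half:]
    else:
        # Inclusive method with odd n: the median belongs to both halves
        lower, upper = values[:half + 1], values[half:]
    return median(lower), median(upper)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # odd-numbered sample size

q1_ex, q3_ex = quartiles(data, "exclusive")
q1_in, q3_in = quartiles(data, "inclusive")
print("exclusive IQR:", q3_ex - q1_ex)
print("inclusive IQR:", q3_in - q1_in)
```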
To find the mode:
Then you simply need to identify the most frequently occurring value.
A data set can often have no mode, one mode or more than one mode – it all depends on how many different values repeat most frequently.
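A minimal sketch of this counting approach is shown below. The function name is hypothetical, it assumes a non-empty data set, and it treats a data set in which no value repeats as having no mode.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s), or an empty list if no value repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode
    return [value for value, count in counts.items() if count == highest]

print(modes([1, 2, 3]))         # no mode
print(modes([1, 2, 2, 3]))      # one mode
print(modes([1, 1, 2, 2, 3]))   # more than one mode
```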
Your data can be:
The median is the most informative measure of central tendency for skewed distributions or distributions with outliers. For example, the median is often used as a measure of central tendency for income distributions, which are generally highly skewed.
The arithmetic mean is the most commonly used mean. It’s often simply called the mean or the average. But there are some other types of means you can calculate depending on your research purposes:
To figure out whether a given number is a parameter or a statistic, ask yourself the following:
If the answer is yes to both questions, the number is likely to be a parameter. For small populations, data can be collected from the whole population and summarized in parameters.
If the answer is no to either of the questions, then the number is more likely to be a statistic.
Both point estimates and interval estimates are important for gathering a clear idea of where a parameter is likely to lie.
Effect size tells you how meaningful the relationship between variables or the difference between groups is.
A large effect size means that a research finding has practical significance, while a small effect size indicates limited practical applications.
In statistics, power refers to the likelihood of a hypothesis test detecting a true effect if there is one. A statistically powerful test is less likely to produce a false negative (a Type II error).
If you don’t ensure enough power in your study, you may not be able to detect a statistically significant result even when it has practical significance. Your study might not have the ability to answer your research question.
The risk of making a Type I error is the significance level (or alpha) that you choose. That's a threshold that you set at the beginning of your study and later compare against the probability of obtaining your results (p value).
The significance level is usually set at 0.05 or 5%. This means that your results only have a 5% chance of occurring, or less, if the null hypothesis is actually true.
To reduce the Type I error probability, you can set a lower significance level.
A power analysis is a calculation that helps you determine a minimum sample size for your study. It's made up of four main components: statistical power, sample size, significance level (alpha), and expected effect size. If you know or have estimates for any three of these, you can calculate the fourth component.
There are various ways to improve power:
A correlation coefficient is a single number that describes the strength and direction of the relationship between your variables.
Different types of correlation coefficients might be appropriate for your data based on their levels of measurement and distributions. The Pearson product-moment correlation coefficient (Pearson’s r) is commonly used to assess a linear relationship between two quantitative variables.
These are the assumptions your data must meet if you want to use Pearson’s r:
Correlation coefficients always range between -1 and 1.
The sign of the coefficient tells you the direction of the relationship: a positive value means the variables change together in the same direction, while a negative value means they change together in opposite directions.
The absolute value of a number is equal to the number without its sign. The absolute value of a correlation coefficient tells you the magnitude of the correlation: the greater the absolute value, the stronger the correlation.
No, the steepness or slope of the line isn’t related to the correlation coefficient value. The correlation coefficient only tells you how closely your data fit on a line, so two datasets with the same correlation coefficient can have very different slopes.
To find the slope of the line, you’ll need to perform a regression analysis.
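The sketch below illustrates the point with two made-up data sets that fall perfectly on a line. Both have the same correlation coefficient, but their regression slopes differ by a factor of 20. The function names are hypothetical.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def regression_slope(x, y):
    """Least-squares slope of y regressed on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return sxy / sum((a - x_bar) ** 2 for a in x)

x = [1, 2, 3, 4, 5]
shallow = [0.5 * v for v in x]   # gentle line
steep = [10 * v for v in x]      # steep line

print(pearson_r(x, shallow), regression_slope(x, shallow))
print(pearson_r(x, steep), regression_slope(x, steep))
```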
You can choose from four main ways to detect outliers:
It’s best to remove outliers only when you have a sound reason for doing so.
Some outliers represent natural variations in the population, and they should be left as is in your dataset. These are called true outliers.
Other outliers are problematic and should be removed because they represent measurement errors, data entry or processing errors, or poor sampling.
Outliers are extreme values that differ from most values in the dataset. You find outliers at the extreme ends of your dataset.
The geometric mean is an average that multiplies all values and finds a root of the number. For a dataset with n numbers, you find the nth root of their product.
The arithmetic mean is the most commonly used type of mean and is often referred to simply as “the mean.” While the arithmetic mean is based on adding and dividing values, the geometric mean multiplies and finds the root of values.
Even though the geometric mean is a less common measure of central tendency, it’s more accurate than the arithmetic mean for percentage change and positively skewed data. The geometric mean is often reported for financial indices and population growth rates.
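As a quick illustration, the geometric mean can be computed as the nth root of the product. The sketch assumes all values are positive, and the growth factors are hypothetical.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

# Hypothetical yearly growth factors: +10%, +20%, -10%
growth_factors = [1.10, 1.20, 0.90]

average_growth = geometric_mean(growth_factors)
arithmetic = sum(growth_factors) / len(growth_factors)
print(round(average_growth, 4), round(arithmetic, 4))
```

The geometric mean is the constant yearly factor that would produce the same overall growth, which is why it comes out slightly lower than the arithmetic mean here.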
There are two steps to calculating the geometric mean:
Before calculating the geometric mean, note that:
To tidy up your missing data, your options usually include accepting, removing, or recreating the missing data.
There are three main types of missing data.
Missing completely at random (MCAR) data are randomly distributed across the variable and unrelated to other variables.
Missing at random (MAR) data are not randomly distributed but they are accounted for by other observed variables.
Missing not at random (MNAR) data systematically differ from the observed values.
You can use the summary() function to view the R² of a linear model in R. You will see the “R-squared” near the bottom of the output.
You can use the qt() function to find the critical value of t in R. The function gives the critical value of t for the one-tailed test. If you want the critical value of t for a two-tailed test, divide the significance level by two.
You can use the T.INV() function to find the critical value of t for one-tailed tests in Excel, and you can use the T.INV.2T() function for two-tailed tests.
To test a hypothesis using the critical value of t, follow these four steps:
The t distribution was first described by statistician William Sealy Gosset under the pseudonym “Student.”
The null hypothesis is often abbreviated as H0. When the null hypothesis is written using mathematical symbols, it always includes an equality symbol (usually =, but sometimes ≥ or ≤).
The alternative hypothesis is often abbreviated as Ha or H1. When the alternative hypothesis is written using mathematical symbols, it always includes an inequality symbol (usually ≠, but sometimes < or >).
A research hypothesis is your proposed answer to your research question. The research hypothesis usually includes an explanation (“x affects y because …”).
A statistical hypothesis, on the other hand, is a mathematical statement about a population parameter. Statistical hypotheses always come in pairs: the null and alternative hypotheses. In a well-designed study, the statistical hypotheses correspond logically to the research hypothesis.
The three types of skewness are right skew (also called positive skew), left skew (also called negative skew), and zero skew.
The e in the Poisson distribution formula stands for the number 2.718 (rounded to three decimal places). This number is called Euler's number. You can simply substitute e with 2.718 when you're calculating a Poisson probability. Euler's number is a very useful constant and is especially important in calculus.
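In code you rarely need to type 2.718 yourself. The sketch below uses the standard Poisson probability mass function, P(X = k) = λ^k e^(−λ) / k!, with Python's built-in exponential function; the function name and the value of λ are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 2  # hypothetical mean number of events per interval
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4))
```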
|Characteristic||Normal distribution||Poisson distribution|
|Continuous or discrete||Continuous||Discrete|
|Parameter||Mean (µ) and standard deviation (σ)||Lambda (λ)|
|Shape||Bell-shaped||Depends on λ|
|Symmetry||Symmetrical||Asymmetrical (right-skewed). As λ increases, the asymmetry decreases.|
|Range||−∞ to ∞||0 to ∞|
When the mean of a Poisson distribution is large (>10), it can be approximated by a normal distribution.
The Pearson correlation coefficient (r) is the most common way of measuring a linear correlation. It is a number between –1 and 1 that measures the strength and direction of the relationship between two variables.
You should use the Pearson correlation coefficient when (1) the relationship is linear and (2) both variables are quantitative and (3) normally distributed and (4) have no outliers.
You can use the cor() function to calculate the Pearson correlation coefficient in R. To test the significance of the correlation, you can use the cor.test() function.
You can use the PEARSON() function to calculate the Pearson correlation coefficient in Excel. If your variables are in columns A and B, then click any blank cell and type “=PEARSON(A:A,B:B)”.
There is no function to directly test the significance of the correlation.
You can use the QUARTILE() function to find quartiles in Excel. If your data is in column A, then click any blank cell and type “=QUARTILE(A:A,1)” for the first quartile, “=QUARTILE(A:A,2)” for the second quartile, and “=QUARTILE(A:A,3)” for the third quartile.
You can use the quantile() function to find quartiles in R. If your data is called “data”, then “quantile(data, probs=c(.25,.5,.75), type=1)” will return the three quartiles.
To find the quartiles of a probability distribution, you can use the distribution’s quantile function.
You can use the chisq.test() function to perform a chi-square goodness of fit test in R. Give the observed values in the “x” argument, give the expected values in the “p” argument, and set “rescale.p” to true. For example:
chisq.test(x = c(22,30,23), p = c(25,25,25), rescale.p = TRUE)
Chi-square goodness of fit tests are often used in genetics. One common application is to check if two genes are linked (i.e., if the assortment is independent). When genes are linked, the allele inherited for one gene affects the allele inherited for another gene.
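Suppose that you want to know if the genes for pea texture (R = round, r = wrinkled) and color (Y = yellow, y = green) are linked. You perform a dihybrid cross between two heterozygous (RY / ry) pea plants. The hypotheses you’re testing with your experiment are the null hypothesis (H0) that the offspring have an equal probability of inheriting all possible genotypic combinations, and the alternative hypothesis (Ha) that they do not.
You observe 100 peas: 78 round and yellow, 6 round and green, 4 wrinkled and yellow, and 12 wrinkled and green.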
To calculate the expected values, you can make a Punnett square. If the two genes are unlinked, the probability of each genotypic combination is equal.
The expected phenotypic ratios are therefore 9 round and yellow: 3 round and green: 3 wrinkled and yellow: 1 wrinkled and green.
From this, you can calculate the expected phenotypic frequencies for 100 peas:
|Phenotype||Observed||Expected|
|Round and yellow||78||100 * (9/16) = 56.25|
|Round and green||6||100 * (3/16) = 18.75|
|Wrinkled and yellow||4||100 * (3/16) = 18.75|
|Wrinkled and green||12||100 * (1/16) = 6.25|
|Phenotype||Observed||Expected||O − E||(O − E)²||(O − E)² / E|
|Round and yellow||78||56.25||21.75||473.06||8.41|
|Round and green||6||18.75||−12.75||162.56||8.67|
|Wrinkled and yellow||4||18.75||−14.75||217.56||11.60|
|Wrinkled and green||12||6.25||5.75||33.06||5.29|
Χ² = 8.41 + 8.67 + 11.60 + 5.29 = 33.97
Since there are four groups (round and yellow, round and green, wrinkled and yellow, wrinkled and green), there are three degrees of freedom.
For a test of significance at α = .05 and df = 3, the Χ² critical value is 7.82.
Χ² = 33.97
Critical value = 7.82
The Χ² value is greater than the critical value.
The Χ² value is greater than the critical value, so we reject the null hypothesis that the population of offspring have an equal probability of inheriting all possible genotypic combinations. There is a significant difference between the observed and expected genotypic frequencies (p < .05).
The data supports the alternative hypothesis that the offspring do not have an equal probability of inheriting all possible genotypic combinations, which suggests that the genes are linked.
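The whole calculation for the pea example can be retraced in a short script. This is only a sketch of the arithmetic: the variable names are hypothetical, the critical value of 7.82 is taken from the text rather than computed, and with the exact expected count of 100 * (1/16) = 6.25 the statistic comes to about 33.97.

```python
# Observed counts and the expected 9:3:3:1 phenotypic ratio from the example
observed = {"round yellow": 78, "round green": 6, "wrinkled yellow": 4, "wrinkled green": 12}
ratio = {"round yellow": 9, "round green": 3, "wrinkled yellow": 3, "wrinkled green": 1}

total = sum(observed.values())
expected = {name: total * part / sum(ratio.values()) for name, part in ratio.items()}

# Chi-square statistic: sum of (O - E)^2 / E over all groups
chi_square = sum((observed[name] - expected[name]) ** 2 / expected[name] for name in observed)

critical_value = 7.82  # alpha = .05, df = 3 (value quoted in the text)
reject_null = chi_square > critical_value
print(round(chi_square, 2), reject_null)
```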
You can use the CHISQ.TEST() function to perform a chi-square test of independence in Excel. It takes two arguments, CHISQ.TEST(observed_range, expected_range), and returns the p value.
You can use the chisq.test() function to perform a chi-square test of independence in R. Give the contingency table as a matrix for the “x” argument. For example:
m = matrix(data = c(89, 84, 86, 9, 8, 24), nrow = 3, ncol = 2)
chisq.test(x = m)
You can use the qchisq() function to find a chi-square critical value in R.
For example, to calculate the chi-square critical value for a test with df = 22 and α = .05:
qchisq(p = .05, df = 22, lower.tail = FALSE)
You can use the CHISQ.INV.RT() function to find a chi-square critical value in Excel.
For example, to calculate the chi-square critical value for a test with df = 22 and α = .05, click any blank cell and type “=CHISQ.INV.RT(0.05,22)”.
Plot a histogram and look at the shape of the bars. If the bars roughly follow a symmetrical bell or hill shape, then the distribution is approximately normal.
Probability is the relative frequency over an infinite number of trials.
For example, the probability of a coin landing on heads is .5, meaning that if you flip the coin an infinite number of times, it will land on heads half the time.
Since doing something an infinite number of times is impossible, relative frequency is often used as an estimate of probability. If you flip a coin 1000 times and get 507 heads, the relative frequency, .507, is a good estimate of the probability.
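You can watch the relative frequency settle toward the probability with a small simulation. The sketch below flips a simulated fair coin; the function name is hypothetical, and a fixed seed is used only so that the run is repeatable.

```python
import random

def relative_frequency_of_heads(n_flips, seed=0):
    """Share of heads in n_flips simulated flips of a fair coin."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With more flips, the relative frequency tends to land closer to .5, although any single run can still deviate.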
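The three categories of kurtosis are mesokurtosis (medium tails, as in the normal distribution), platykurtosis (thin tails), and leptokurtosis (fat tails).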